
Memory leak in CW agent prometheus 1.247348.0b251302 #264

Open
ashevtsov-wawa opened this issue Sep 2, 2021 · 12 comments
Labels
aws/eks Amazon Elastic Kubernetes Service component/prometheus Prometheus
Comments

@ashevtsov-wawa

ashevtsov-wawa commented Sep 2, 2021

After upgrading CW agent Prometheus from 1.247347.5b250583 to 1.247348.0b251302 the pod started getting killed by Kubernetes (OOMKilled).
Memory limit is set to 2000m. Tried increasing the limit up to 8000m to no avail.
Downgrading to 1.247347.5b250583 fixes the issue (with 2000m limit).
We run the agent in EKS 1.19.

We are experiencing this in a couple environments, each running over 120 pods (including those of daemonsets). Environments where this is not an issue have ~30-50 pods running.
The last messages in the container logs of the killed pods aren't consistent. One instance:

...
2021-09-02T15:49:55Z D! [outputs.cloudwatchlogs] Wrote batch of 179 metrics in 1.201294006s
2021-09-02T15:49:55Z D! [outputs.cloudwatchlogs] Buffer fullness: 0 / 10000 metrics
2021-09-02T15:49:55Z D! [outputs.cloudwatchlogs] Pusher published 8 log events to group: /aws/containerinsights/redacted/prometheus stream: redacted with size 4 KB in 110.502908ms.
2021-09-02T15:49:56Z D! [outputs.cloudwatchlogs] Buffer fullness: 0 / 10000 metrics

another instance (same cluster):

...
2021-09-02T15:57:29Z D! Drop metric with NaN or Inf value: &{map[app:redacted app_kubernetes_io_instance:redacted app_kubernetes_io_name:redacted application:redacted client_id:producer-1 instance:10.1.2.3:15020 istio_io_rev:default job:kubernetes-pods kafka_version:2.6.0 kubernetes_namespace:redacted kubernetes_pod_name:redacted-6dfd7cd497-qr2hb node_id:node--1 pod_template_hash:6dfd7cd497 prom_metric_type:gauge security_istio_io_tlsMode:istio service_istio_io_canonical_name:redacted service_istio_io_canonical_revision:0.1.0 spring_id:redacted.items.producer.producer-1 version:0.1.0] kafka_producer_node_response_rate kafka_producer_node_response_rate kubernetes-pods 10.1.2.3 NaN gauge 1630598180331}
2021-09-02T15:57:29Z D! Drop metric with NaN or Inf value: &{map[app:redacted app_kubernetes_io_instance:redacted app_kubernetes_io_name:redacted application:redacted client_id:producer-2 instance:10.1.2.3:15020 istio_io_rev:default job:kubernetes-pods kafka_version:2.6.0 kubernetes_namespace:redacted kubernetes_pod_name:redacted-6dfd7cd497-qr2hb node_id:node--1 pod_template_hash:6dfd7cd497 prom_metric_type:gauge security_istio_io_tlsMode:istio service_istio_io_canonical_name:redacted service_istio_io_canonical_revision:0.1.0 spring_id:redacted.items.producer.producer-2 version:0.1.0] kafka_producer_node_response_rate kafka_producer_node_response_rate kubernetes-pods 10.1.2.3 NaN gauge 1630598180331}
2021-09-02T15:57:29Z D! Drop metric with NaN or Inf value: &{map[app:redacted app_kubernetes_io_instance:redacted app_kubernetes_io_name:redacted application:redacted client_id:consumer-redacted-group-1 instance:10.1.2.3:15020 istio_io_rev:default job:kubernetes-pods kafka_version:2.6.0 kubernetes_namespace:redacted kubernetes_pod_name:redacted-6dfd7cd497-qr2hb pod_template_hash:6dfd7cd497 prom_metric_type:gauge security_istio_io_tlsMode:istio service_istio_io_canonical_name:redacted service_istio_io_canonical_revision:0.1.0 spring_id:redacted.items.consumer.consumer-redacted-group-1 version:0.1.0] kafka_consumer_last_poll_seconds_ago kafka_consumer_last_poll_seconds_ago kubernetes-pods 10.1.2.3 NaN gauge 1630598180331}
2021-09-02T15:57:29Z D! Drop metric with NaN or Inf value: &{map[app:redacted app_kubernetes_io_instance:redacted app_kubernetes_io_name:redacted application:redacted client_id:consumer-redacted-group-3 instance:10.1.2.3:15020 istio_io_rev:default job:kubernetes-pods kafka_version:2.6.0 kubernetes_namespace:redacted kubernetes_pod_name:redacted-6dfd7cd497-qr2hb pod_template_hash:6dfd7cd497 prom_metric_type:gauge security_istio_io_tlsMode:istio service_istio_io_canonical_name:redacted service_istio_io_canonical_revision:0.1.0 spring_id:redacted.items.consumer.consumer-redacted-group-3 version:0.1.0] kafka_consumer_last_poll_seconds_ago kafka_consumer_last_poll_seconds_ago kubernetes-pods 10.1.2.3 NaN gauge 1630598180331}

Let me know if you need any other information/logs that will help in troubleshooting.

@jhnlsn jhnlsn added aws/eks Amazon Elastic Kubernetes Service component/prometheus Prometheus labels Sep 29, 2021
@github-actions
Contributor

This issue was marked stale due to lack of activity.

@github-actions github-actions bot added the Stale label Dec 29, 2021
@ashevtsov-wawa
Author

Still happening with 1.247349.0b251399

@github-actions github-actions bot removed the Stale label Jan 27, 2022
@jhnlsn jhnlsn added this to the 1.247350.0 milestone Jan 28, 2022
@jhnlsn
Contributor

jhnlsn commented Jan 28, 2022

Hey Andrey, we believe we have a fix for this issue in our latest release. It was related to unpaginated data being returned from the k8s API server (an illustrative sketch of paginated listing follows this comment).

Please keep an eye out for the 1.247350.0 release, which should be coming in mid-February.
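
For context, the sketch below is illustrative only and is not the agent's actual code: it shows what paginated listing against the Kubernetes API looks like with client-go, where `Limit` and `Continue` bound how many objects are held in memory per request instead of pulling the whole cluster's pod list in one response. The page size of 500 is an arbitrary example value.

```go
// Illustrative sketch only -- not the CloudWatch agent's code.
// Paginated pod listing with client-go so a large cluster is never
// materialized in a single API response.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes the code runs inside a cluster with permission to list pods.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	continueToken := ""
	total := 0
	for {
		// Fetch at most 500 pods per request; Continue resumes where the
		// previous page ended, so only one page is held in memory at a time.
		list, err := client.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(), metav1.ListOptions{
			Limit:    500,
			Continue: continueToken,
		})
		if err != nil {
			panic(err)
		}
		total += len(list.Items)
		// ...process list.Items here, then let the page be garbage collected...
		if list.Continue == "" {
			break
		}
		continueToken = list.Continue
	}
	fmt.Printf("saw %d pods\n", total)
}
```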

@CraigHead

This is happening with the non-EKS CW agent as well, specifically the Windows agent.

@jhnlsn
Contributor

jhnlsn commented Feb 1, 2022

@CraigHead could you describe your issue specifically, including any related errors you are seeing? The issue in this ticket was related to containers being OOMKilled in EKS.

@jhnlsn
Contributor

jhnlsn commented Mar 1, 2022

This should be resolved with the latest version of the agent

@jhnlsn jhnlsn closed this as completed Mar 1, 2022
@ashevtsov-wawa
Author

ashevtsov-wawa commented Apr 14, 2022

Still seeing the pod being OOMKilled with a 2500Mi memory limit when using the public.ecr.aws/cloudwatch-agent/cloudwatch-agent:1.247350.0b251780 image.
@jhnlsn can you re-open this issue or should I create a new one?

@ashevtsov-wawa
Author

ashevtsov-wawa commented Apr 14, 2022

I removed memory limits to see how much memory it would use. I stopped this experiment after it consumed 20GB.

@khanhntd khanhntd reopened this May 16, 2022
@khanhntd
Contributor

khanhntd commented Jul 7, 2022

Hey @ashevtsov-wawa,
After setting up the EKS Prometheus CloudWatch agent with different versions (e.g. 1.247348, 1.247352) by following this documentation, I was not able to reproduce the memory leak in EKS based on the memory and CPU usage I observed:

[screenshot: memory and CPU usage graphs]

Therefore, as a next step, could you share the following information:

  • How did you set up your environment, and does your setup differ from the public documentation I followed?
  • Instead of v1.247348, would you be able to try the image public.ecr.aws/i7a4z2v8/cwagent-prometheus-metrics:353 (the current image for v353)? There is a trade-off between memory and CPU usage with it, but I have not seen OOM with this image so far.

@nmamn

nmamn commented Nov 2, 2023

Hi,

Not sure if it is the same issue, but we faced a memory leak when the scrape endpoint was unreachable. It seems the CW agent accumulated connections, or did not clean everything up, and it would eventually get OOMKilled after some time.

Fixing the network issue resolved our problem, but I believe it could also be handled in the code, so that an unreachable endpoint does not end in an OOM (a general sketch of that pattern follows after this comment).

thanks,

Nicolas
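
As a general illustration of the pattern Nicolas describes (this is not the agent's code, and the `scrape` function, URL, timeout, and size cap below are hypothetical example values), a scrape loop in Go can bound each attempt with a client timeout, always close the response body, and cap how much it reads, so that an unreachable or misbehaving endpoint cannot pile up connections or buffered data indefinitely:

```go
// Illustrative sketch only -- a general Go pattern, not the agent's code.
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// A client with a hard timeout gives up instead of hanging on an
// unreachable endpoint.
var scrapeClient = &http.Client{Timeout: 10 * time.Second}

// scrape fetches one metrics page; the name and URL are illustrative.
func scrape(url string) ([]byte, error) {
	resp, err := scrapeClient.Get(url)
	if err != nil {
		return nil, fmt.Errorf("scrape %s: %w", url, err)
	}
	// Always close the body so the connection is released rather than
	// accumulating across failed or abandoned scrapes.
	defer resp.Body.Close()

	// Cap the payload (10 MiB here) so one scrape cannot buffer an
	// unbounded amount of data.
	body, err := io.ReadAll(io.LimitReader(resp.Body, 10<<20))
	if err != nil {
		return nil, fmt.Errorf("read %s: %w", url, err)
	}
	return body, nil
}

func main() {
	if _, err := scrape("http://example.invalid:9090/metrics"); err != nil {
		fmt.Println("scrape failed:", err)
	}
}
```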

@jefchien
Contributor

jefchien commented Nov 2, 2023

@nmamn Can you provide some additional context into the issue you're seeing? Which version of the agent were you seeing this in? Were there any logs indicating that the agent was failing to reach the endpoint? It would help us debug the issue.

@nar-git

nar-git commented Oct 2, 2024

We are facing a similar issue and have reported it here. Our agent is consuming more than 50Gi (its memory limit) and getting OOMKilled.
