
Memory leak in CW agent prometheus 1.247348.0b251302 #264

Open
ashevtsov-wawa opened this issue Sep 2, 2021 · 12 comments
Labels
aws/eks Amazon Elastic Kubernetes Service component/prometheus Prometheus
Comments

@ashevtsov-wawa

ashevtsov-wawa commented Sep 2, 2021

After upgrading CW agent Prometheus from 1.247347.5b250583 to 1.247348.0b251302 the pod started getting killed by Kubernetes (OOMKilled).
Memory limit is set to 2000m. Tried increasing the limit up to 8000m to no avail.
Downgrading to 1.247347.5b250583 fixes the issue (with 2000m limit).
We run the agent in EKS 1.19.

We are experiencing this in a couple environments, each running over 120 pods (including those of daemonsets). Environments where this is not an issue have ~30-50 pods running.
The last messages in the container logs of the killed pods aren't consistent. One instance:

...
2021-09-02T15:49:55Z D! [outputs.cloudwatchlogs] Wrote batch of 179 metrics in 1.201294006s
2021-09-02T15:49:55Z D! [outputs.cloudwatchlogs] Buffer fullness: 0 / 10000 metrics
2021-09-02T15:49:55Z D! [outputs.cloudwatchlogs] Pusher published 8 log events to group: /aws/containerinsights/redacted/prometheus stream: redacted with size 4 KB in 110.502908ms.
2021-09-02T15:49:56Z D! [outputs.cloudwatchlogs] Buffer fullness: 0 / 10000 metrics

another instance (same cluster):

...
2021-09-02T15:57:29Z D! Drop metric with NaN or Inf value: &{map[app:redacted app_kubernetes_io_instance:redacted app_kubernetes_io_name:redacted application:redacted client_id:producer-1 instance:10.1.2.3:15020 istio_io_rev:default job:kubernetes-pods kafka_version:2.6.0 kubernetes_namespace:redacted kubernetes_pod_name:redacted-6dfd7cd497-qr2hb node_id:node--1 pod_template_hash:6dfd7cd497 prom_metric_type:gauge security_istio_io_tlsMode:istio service_istio_io_canonical_name:redacted service_istio_io_canonical_revision:0.1.0 spring_id:redacted.items.producer.producer-1 version:0.1.0] kafka_producer_node_response_rate kafka_producer_node_response_rate kubernetes-pods 10.1.2.3 NaN gauge 1630598180331}
2021-09-02T15:57:29Z D! Drop metric with NaN or Inf value: &{map[app:redacted app_kubernetes_io_instance:redacted app_kubernetes_io_name:redacted application:redacted client_id:producer-2 instance:10.1.2.3:15020 istio_io_rev:default job:kubernetes-pods kafka_version:2.6.0 kubernetes_namespace:redacted kubernetes_pod_name:redacted-6dfd7cd497-qr2hb node_id:node--1 pod_template_hash:6dfd7cd497 prom_metric_type:gauge security_istio_io_tlsMode:istio service_istio_io_canonical_name:redacted service_istio_io_canonical_revision:0.1.0 spring_id:redacted.items.producer.producer-2 version:0.1.0] kafka_producer_node_response_rate kafka_producer_node_response_rate kubernetes-pods 10.1.2.3 NaN gauge 1630598180331}
2021-09-02T15:57:29Z D! Drop metric with NaN or Inf value: &{map[app:redacted app_kubernetes_io_instance:redacted app_kubernetes_io_name:redacted application:redacted client_id:consumer-redacted-group-1 instance:10.1.2.3:15020 istio_io_rev:default job:kubernetes-pods kafka_version:2.6.0 kubernetes_namespace:redacted kubernetes_pod_name:redacted-6dfd7cd497-qr2hb pod_template_hash:6dfd7cd497 prom_metric_type:gauge security_istio_io_tlsMode:istio service_istio_io_canonical_name:redacted service_istio_io_canonical_revision:0.1.0 spring_id:redacted.items.consumer.consumer-redacted-group-1 version:0.1.0] kafka_consumer_last_poll_seconds_ago kafka_consumer_last_poll_seconds_ago kubernetes-pods 10.1.2.3 NaN gauge 1630598180331}
2021-09-02T15:57:29Z D! Drop metric with NaN or Inf value: &{map[app:redacted app_kubernetes_io_instance:redacted app_kubernetes_io_name:redacted application:redacted client_id:consumer-redacted-group-3 instance:10.1.2.3:15020 istio_io_rev:default job:kubernetes-pods kafka_version:2.6.0 kubernetes_namespace:redacted kubernetes_pod_name:redacted-6dfd7cd497-qr2hb pod_template_hash:6dfd7cd497 prom_metric_type:gauge security_istio_io_tlsMode:istio service_istio_io_canonical_name:redacted service_istio_io_canonical_revision:0.1.0 spring_id:redacted.items.consumer.consumer-redacted-group-3 version:0.1.0] kafka_consumer_last_poll_seconds_ago kafka_consumer_last_poll_seconds_ago kubernetes-pods 10.1.2.3 NaN gauge 1630598180331}

Let me know if you need any other information/logs that will help in troubleshooting.

@jhnlsn jhnlsn added aws/eks Amazon Elastic Kubernetes Service component/prometheus Prometheus labels Sep 29, 2021
@github-actions
Contributor

This issue was marked stale due to lack of activity.

@github-actions github-actions bot added the Stale label Dec 29, 2021
@ashevtsov-wawa
Author

Still happening with 1.247349.0b251399

@github-actions github-actions bot removed the Stale label Jan 27, 2022
@jhnlsn jhnlsn added this to the 1.247350.0 milestone Jan 28, 2022
@jhnlsn
Contributor

jhnlsn commented Jan 28, 2022

Hey Andrey, we believe we have a fix for this issue in our latest release. It was related to unpaginated data being returned from the k8s API server (an illustrative sketch of paginated listing follows this comment).

Please keep an eye out for the 1.247350.0 release, which should be coming in mid-February.
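
For context, the sketch below is illustrative only and is not the agent's actual code: it shows what paginated listing against the Kubernetes API looks like with client-go, where `Limit` and `Continue` bound how many objects are held in memory per request instead of pulling the whole cluster's pod list in one response. The page size of 500 is an arbitrary example value.

```go
// Illustrative sketch only -- not the CloudWatch agent's code.
// Paginated pod listing with client-go so a large cluster is never
// materialized in a single API response.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes the code runs inside a cluster with permission to list pods.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	continueToken := ""
	total := 0
	for {
		// Fetch at most 500 pods per request; Continue resumes where the
		// previous page ended, so only one page is held in memory at a time.
		list, err := client.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(), metav1.ListOptions{
			Limit:    500,
			Continue: continueToken,
		})
		if err != nil {
			panic(err)
		}
		total += len(list.Items)
		// ...process list.Items here, then let the page be garbage collected...
		if list.Continue == "" {
			break
		}
		continueToken = list.Continue
	}
	fmt.Printf("saw %d pods\n", total)
}
```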

@CraigHead

This is happening with the non-EKS CW agent as well, specifically the Windows agent.

@jhnlsn
Contributor

jhnlsn commented Feb 1, 2022

@CraigHead could you describe your issue specifically, including any related errors you are seeing? The issue in this ticket was related to containers being OOMKilled in EKS.

@jhnlsn
Contributor

jhnlsn commented Mar 1, 2022

This should be resolved with the latest version of the agent

@jhnlsn jhnlsn closed this as completed Mar 1, 2022
@ashevtsov-wawa
Author

ashevtsov-wawa commented Apr 14, 2022

Still seeing the pod being OOMKilled with a 2500Mi memory limit when using the public.ecr.aws/cloudwatch-agent/cloudwatch-agent:1.247350.0b251780 image.
@jhnlsn can you re-open this issue or should I create a new one?

@ashevtsov-wawa
Author

ashevtsov-wawa commented Apr 14, 2022

I removed memory limits to see how much memory it would use. I stopped this experiment after it consumed 20GB.

@khanhntd khanhntd reopened this May 16, 2022
@khanhntd
Contributor

khanhntd commented Jul 7, 2022

Hey @ashevtsov-wawa,
After setting up the EKS Prometheus CloudWatch agent with different versions (e.g. 1.247348, 1.247352) by following this documentation, I was not able to reproduce the memory leak in EKS based on the memory and CPU usage I observed:

[screenshot: memory and CPU usage graphs]

Therefore, as a next step, could you share the following information:

  • How did you set up your environment, and does your setup differ from the public documentation I followed?
  • Instead of v1.247348, would you be able to try the image public.ecr.aws/i7a4z2v8/cwagent-prometheus-metrics:353 (the current image for v353)? There is a trade-off between memory and CPU usage with it, but I have not seen OOM with this image so far.

@nmamn

nmamn commented Nov 2, 2023

Hi,

Not sure if it is the same issue, but we faced a memory leak when the scrape endpoint was unreachable. It seems the CW agent accumulated connections, or did not clean everything up, and it would eventually get OOMKilled after some time.

Fixing the network issue resolved our problem, but I believe it could also be handled in the code, so that an unreachable endpoint does not end in an OOM (a general sketch of that pattern follows after this comment).

thanks,

Nicolas
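
As a general illustration of the pattern Nicolas describes (this is not the agent's code, and the `scrape` function, URL, timeout, and size cap below are hypothetical example values), a scrape loop in Go can bound each attempt with a client timeout, always close the response body, and cap how much it reads, so that an unreachable or misbehaving endpoint cannot pile up connections or buffered data indefinitely:

```go
// Illustrative sketch only -- a general Go pattern, not the agent's code.
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// A client with a hard timeout gives up instead of hanging on an
// unreachable endpoint.
var scrapeClient = &http.Client{Timeout: 10 * time.Second}

// scrape fetches one metrics page; the name and URL are illustrative.
func scrape(url string) ([]byte, error) {
	resp, err := scrapeClient.Get(url)
	if err != nil {
		return nil, fmt.Errorf("scrape %s: %w", url, err)
	}
	// Always close the body so the connection is released rather than
	// accumulating across failed or abandoned scrapes.
	defer resp.Body.Close()

	// Cap the payload (10 MiB here) so one scrape cannot buffer an
	// unbounded amount of data.
	body, err := io.ReadAll(io.LimitReader(resp.Body, 10<<20))
	if err != nil {
		return nil, fmt.Errorf("read %s: %w", url, err)
	}
	return body, nil
}

func main() {
	if _, err := scrape("http://example.invalid:9090/metrics"); err != nil {
		fmt.Println("scrape failed:", err)
	}
}
```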

@jefchien
Contributor

jefchien commented Nov 2, 2023

@nmamn Can you provide some additional context into the issue you're seeing? Which version of the agent were you seeing this in? Were there any logs indicating that the agent was failing to reach the endpoint? It would help us debug the issue.

@nar-git

nar-git commented Oct 2, 2024

We are facing a similar issue and have reported it here. Our agent is consuming more than 50Gi (its memory limit) and getting OOMKilled.
