error: cannot start pipelines: cannot get pod from kubelet #1417
We are seeing a similar issue with image
Control-Plane: v1.28. After some retries the cloudwatch-agent pod is able to start normally.
Could some retry logic be added to amazon-cloudwatch-agent/plugins/processors/k8sdecorator/stores/podstore.go (lines 75 to 78 in 3f78406)?
On our side the issue is related to the kubelet's CSR certificates not yet being available/approved when the cloudwatch-agent starts, as we see the following errors in the logs before the cloudwatch-agent restarts successfully:
Given that, we decided to add an initContainer that keeps checking whether requests to the kubelet /pods endpoint succeed (they will return Unauthorized, but that is fine) before starting the cloudwatch-agent container:
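The actual initContainer manifest isn't reproduced above. As a rough illustration of the check it performs, here is a minimal sketch in Go (an initContainer would more likely use a shell loop with curl; the NODE_IP variable and port 10250 are assumptions, not taken from this issue):

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	// NODE_IP and port 10250 are assumptions; adjust to however the kubelet
	// is reached in your cluster (e.g. a downward-API env var for the node IP).
	url := fmt.Sprintf("https://%s:10250/pods", os.Getenv("NODE_IP"))
	client := &http.Client{
		Timeout: 5 * time.Second,
		// Verification is skipped on purpose: the probe only cares whether the
		// kubelet can complete a TLS handshake, i.e. its serving cert exists.
		Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}},
	}
	for {
		resp, err := client.Get(url)
		if err == nil {
			resp.Body.Close()
			// Any HTTP response, including 401 Unauthorized, means the kubelet
			// is serving, so the main container can be started.
			fmt.Printf("kubelet responded with %q, proceeding\n", resp.Status)
			return
		}
		fmt.Printf("kubelet not ready yet: %v\n", err)
		time.Sleep(5 * time.Second)
	}
}
```

The loop only exits once the kubelet completes a TLS handshake and returns any HTTP response, which is exactly the condition that fails with tls: internal error while the serving certificate is still pending approval.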
The workaround is working and we are no longer seeing cloudwatch-agent pods crashing and then restarting. However, it would be better to have retries added to the cloudwatch-agent itself.
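For reference, a minimal sketch of what such a retry could look like, as a standalone helper rather than the agent's real kubelet client API (the package name, generic helper, and backoff values are all illustrative assumptions, not the agent's code):

```go
package kubeletutil // package name is illustrative only

import (
	"fmt"
	"time"
)

// retryWithBackoff calls fn up to maxAttempts times, doubling the wait
// between attempts, and returns the last error if every attempt fails.
func retryWithBackoff[T any](maxAttempts int, initialWait time.Duration, fn func() (T, error)) (T, error) {
	var zero T
	var lastErr error
	wait := initialWait
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		result, err := fn()
		if err == nil {
			return result, nil
		}
		lastErr = err
		if attempt < maxAttempts {
			time.Sleep(wait)
			wait *= 2
		}
	}
	return zero, fmt.Errorf("giving up after %d attempts: %w", maxAttempts, lastErr)
}
```

Wrapping the agent's existing kubelet /pods request in a helper along these lines would turn a temporarily unready kubelet into a delayed start instead of a crash loop.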
Describe the bug
After upgrading EKS nodes to version v1.29 (v1.29.8-20241024) and deploying CloudWatch Agent v1.3xxx, the following error is encountered (see amazon-cloudwatch-agent/internal/k8sCommon/kubeletutil/kubeletclient.go, line 35 in 6b25891):
Note that the EKS Control-Plane was upgraded to v1.29 before proceeding with the node upgrade.
Steps to reproduce
First, I upgraded the EKS cluster from version v1.28 to v1.29.
Then, I upgraded the node version from v1.27 to v1.29.
The reason for skipping one version is that I alternate between Blue and Green nodes.
After upgrading the node version to v1.29, the CloudWatch Agent started producing the aforementioned error.
What did you expect to see?
After the cluster upgrade, the CloudWatch Agent is expected to run without errors. Specifically, when the CloudWatch Agent sends a request to the /pods endpoint on a running instance to retrieve pod data, the TLS error (tls: internal error) should not occur.
What version did you use?
What config did you use?
We are using the container image public.ecr.aws/cloudwatch-agent/cloudwatch-agent:1.300028.1b210.
Environment
Note: IMDSv2 is optional (= disabled).
Additional comment
A similar issue has been observed, but it remains unresolved. This error seems to occur even when IMDSv2 is enabled.