
error: cannot start pipelines: cannot get pod from kubelet #1417

Open
GotoRen opened this issue Nov 11, 2024 · 3 comments

Comments

@GotoRen

GotoRen commented Nov 11, 2024

Describe the bug

After upgrading the EKS nodes to v1.29 (v1.29.8-20241024) and deploying CloudWatch Agent v1.3xxx, the following error occurs:

2024-11-06T06:49:35Z I! Starting AmazonCloudWatchAgent CWAgent/1.300028.1b210 (go1.20.7; linux; amd64)
2024-11-06T06:49:35Z I! AWS SDK log level not set
2024-11-06T06:49:35.353Z	info	service/telemetry.go:96	Skipping telemetry setup.	{"address": "", "level": "None"}
2024-11-06T06:49:35.356Z	info	service/service.go:131	Starting CWAgent...	{"Version": "1.300028.1b210", "NumCPU": 4}
2024-11-06T06:49:35.356Z	info	extensions/extensions.go:30	Starting extensions...
2024-11-06T06:49:35.374Z	info	host/ec2metadata.go:89	Fetch instance id and type from ec2 metadata	{"kind": "receiver", "name": "awscontainerinsightreceiver", "data_type": "metrics"}
2024-11-06T06:49:35.383Z	info	service/service.go:157	Starting shutdown...
2024-11-06T06:49:35.383Z	info	extensions/extensions.go:44	Stopping extensions...
2024-11-06T06:49:35.383Z	info	service/service.go:171	Shutdown complete.
Error: cannot start pipelines: cannot get pod from kubelet, err: call to /pods endpoint failed: Get "https://<host_ip>:10250/pods": remote error: tls: internal error
2024-11-06T06:49:35Z E! [telegraf] Error running agent: cannot start pipelines: cannot get pod from kubelet, err: call to /pods endpoint failed: Get "https://<host_ip>:10250/pods": remote error: tls: internal error

The failing request URL is built as:

url := fmt.Sprintf("https://%s:%s/pods", k.KubeIP, k.Port)

Note that the EKS Control-Plane was upgraded to v1.29 before proceeding with the node upgrade.
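
For anyone debugging this, below is a minimal standalone probe that reproduces the request the agent makes to the kubelet, so the TLS failure can be observed outside the agent. It is a sketch, not the agent's actual client; the HOST_IP environment variable and the in-cluster service-account token path are assumptions about how the probe pod is configured.

// kubelet_probe.go: one-shot request to the kubelet /pods endpoint,
// mimicking the call that fails in the agent logs above.
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	hostIP := os.Getenv("HOST_IP") // node IP, assumed to be injected via the Downward API
	token, _ := os.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/token")

	client := &http.Client{
		Timeout: 5 * time.Second,
		// Skip server-certificate verification so only the handshake itself is exercised;
		// a kubelet that cannot present a serving certificate still surfaces the same
		// "remote error: tls: internal error" seen above.
		Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}},
	}

	req, _ := http.NewRequest("GET", fmt.Sprintf("https://%s:10250/pods", hostIP), nil)
	req.Header.Set("Authorization", "Bearer "+string(token))

	resp, err := client.Do(req)
	if err != nil {
		fmt.Println("request failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()
	fmt.Println("kubelet answered with status:", resp.Status)
}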

Steps to reproduce

First, I upgraded the EKS cluster from v1.28 to v1.29.
Then, I upgraded the nodes from v1.27 to v1.29.

The nodes skip a version because I alternate between blue and green node groups.

After the nodes were upgraded to v1.29, the CloudWatch Agent started producing the error above.

What did you expect to see?

After the cluster upgrade, the CloudWatch Agent should start without errors. Specifically, when the agent requests pod data from the kubelet's /pods endpoint on the node it is running on, the TLS error (tls: internal error) should not occur.

What version did you use?

  • Control-Plane: v1.29
  • Data-Plane (EKS node): v1.29.8-20241024
    • kubelet: v1.29.8-eks-a737599
  • CloudWatch Agent: v1.300028.1b210

What config did you use?

We are using the container image public.ecr.aws/cloudwatch-agent/cloudwatch-agent:1.300028.1b210

Environment

  • AMI: AL2_x86_64
  • Instance type: c6i.2xlarge
  • OS architecture: linux (amd64)
  • OS image: Amazon Linux 2

※ IMDSv2 is set to optional (i.e., not enforced).

Additional comment

A similar issue has been observed, but it remains unresolved. This error seems to occur even when IMDSv2 is enabled.

@marianafranco

marianafranco commented Dec 17, 2024

We are seeing a similar issue with image 1.247360.0b252689:

2024-12-17T01:08:12Z I! Starting AmazonCloudWatchAgent CWAgent/1.247360.0b252689 (go1.20.5; linux; amd64)
2024-12-17T01:08:12Z I! AWS SDK log level not set
2024-12-17T01:08:12Z I! Loaded inputs: cadvisor ethtool k8sapiserver
2024-12-17T01:08:12Z I! Loaded aggregators:
2024-12-17T01:08:12Z I! Loaded processors: ec2tagger (2x) k8sdecorator
2024-12-17T01:08:12Z I! Loaded outputs: cloudwatch cloudwatchlogs
2024-12-17T01:08:12Z I! Tags enabled:
2024-12-17T01:08:12Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"xxxxxx", Flush Interval:1s
2024-12-17T01:08:12Z I! [processors.ec2tagger] ec2tagger: Check EC2 Metadata.
2024-12-17T01:08:12Z I! [logagent] starting
2024-12-17T01:08:12Z I! [logagent] found plugin cloudwatchlogs is a log backend
2024-12-17T01:08:12Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started initialization.
2024-12-17T01:08:12Z I! [processors.ec2tagger] ec2tagger: Check EC2 Metadata.
2024-12-17T01:08:12Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started initialization.
2024-12-17T01:08:12Z I! [processors.ec2tagger] ec2tagger: Check EC2 Metadata.
2024-12-17T01:08:12Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started initialization.
2024-12-17T01:08:12Z I! [processors.ec2tagger] ec2tagger: Check EC2 Metadata.
2024-12-17T01:08:12Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started initialization.
2024-12-17T01:08:12Z I! [processors.ec2tagger] ec2tagger: Initial retrieval of tags succeeded
2024-12-17T01:08:12Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started, finished initial retrieval of tags and Volumes
2024-12-17T01:08:12Z I! cloudwatch: get unique roll up list [[InstanceId InstanceType AutoScalingGroupName] [InstanceType] [AutoScalingGroupName]]
2024-12-17T01:08:12Z I! cloudwatch: publish with ForceFlushInterval: 1m0s, Publish Jitter: 4.98331802s
I1217 01:08:12.742445       1 leaderelection.go:248] attempting to acquire leader lease logging/cwagent-clusterleader...
2024-12-17T01:08:12Z I! k8sapiserver Switch New Leader: xxxxxx
W1217 01:08:12.757366       1 manager.go:291] Could not configure a source for OOM detection, disabling OOM events: open /dev/kmsg: no such file or directory
2024-12-17T01:08:12Z I! [processors.ec2tagger] ec2tagger: Initial retrieval of tags succeeded
2024-12-17T01:08:12Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started, finished initial retrieval of tags and Volumes
2024-12-17T01:08:12Z I! [processors.ec2tagger] ec2tagger: Initial retrieval of tags succeeded
2024-12-17T01:08:12Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started, finished initial retrieval of tags and Volumes
2024-12-17T01:08:12Z I! [processors.ec2tagger] ec2tagger: Initial retrieval of tags succeeded
2024-12-17T01:08:12Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started, finished initial retrieval of tags and Volumes
2024-12-17T01:08:12Z E! error making HTTP request to https://<host-ip>:10250/pods: remote error: tls: internal error
2024-12-17T01:08:12Z I! Cannot get pod from kubelet, err: KubeClinet Access Failure
panic: Cannot get pod from kubelet, err: KubeClinet Access Failure

goroutine 231 [running]:
log.Panicf({0x36ad74b?, 0x7fb16aa9c268?}, {0xc000e33d58?, 0x60?, 0xc001396000?})
	log/log.go:391 +0x67
github.com/aws/amazon-cloudwatch-agent/plugins/processors/k8sdecorator/stores.NewPodStore({0xc00069ddf1, 0xb}, 0x0)
	github.com/aws/amazon-cloudwatch-agent/plugins/processors/k8sdecorator/stores/podstore.go:76 +0x26e
github.com/aws/amazon-cloudwatch-agent/plugins/processors/k8sdecorator.(*K8sDecorator).start(0xc000127340)
	github.com/aws/amazon-cloudwatch-agent/plugins/processors/k8sdecorator/k8sdecorator.go:76 +0x6b
github.com/aws/amazon-cloudwatch-agent/plugins/processors/k8sdecorator.(*K8sDecorator).Apply(0xc000127340, {0xc000cfa2d0, 0x1, 0xc000087ee8?})
	github.com/aws/amazon-cloudwatch-agent/plugins/processors/k8sdecorator/k8sdecorator.go:39 +0x50
github.com/influxdata/telegraf/plugins/processors.(*streamingProcessor).Add(0xc000998840, {0x3c71d30?, 0xc000f02fc0}, {0x3c5faa0, 0xc0008c6000})
	github.com/influxdata/[email protected]/plugins/processors/streamingprocessor.go:37 +0x90
github.com/influxdata/telegraf/models.(*RunningProcessor).Add(0xc000998a50, {0x3c71d30, 0xc000f02fc0}, {0x3c5faa0, 0xc0008c6000})
	github.com/influxdata/[email protected]/models/running_processor.go:95 +0xcb
github.com/influxdata/telegraf/agent.(*Agent).runProcessors.func1(0xc0005aadc8)
	github.com/influxdata/[email protected]/agent/agent.go:562 +0x145
created by github.com/influxdata/telegraf/agent.(*Agent).runProcessors
	github.com/influxdata/[email protected]/agent/agent.go:557 +0x3c

  • Control-Plane: v1.28
  • Data-Plane (EKS node): 1.28.15-20241121
    • kubelet: v1.28.15-eks-94953ac
  • CloudWatch Agent: 1.247360.0b252689

After a few restarts, the cloudwatch-agent pod is able to start normally.

@marianafranco

Could some retry logic be added to NewPodStore instead of panicking immediately when the request to the kubelet fails?

// Try to detect kubelet permission issue here
if _, err := podStore.kubeClient.ListPods(); err != nil {
	log.Panicf("Cannot get pod from kubelet, err: %v", err)
}
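
For illustration, here is a minimal sketch of what such a retry could look like; it is not the agent's actual code, and the function name, retry budget, and backoff values are assumptions:

// retry_sketch.go: illustrative only — retry the initial kubelet check with
// backoff instead of panicking on the first failure. listPods stands in for
// the agent's podStore.kubeClient.ListPods call (an assumed signature).
package main

import (
	"fmt"
	"log"
	"time"
)

func waitForKubelet(listPods func() error, maxAttempts int, baseDelay time.Duration) error {
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if lastErr = listPods(); lastErr == nil {
			return nil
		}
		log.Printf("W! cannot get pod from kubelet (attempt %d/%d): %v", attempt, maxAttempts, lastErr)
		time.Sleep(time.Duration(attempt) * baseDelay) // simple linear backoff
	}
	return fmt.Errorf("cannot get pod from kubelet after %d attempts: %w", maxAttempts, lastErr)
}

func main() {
	// Example wiring: NewPodStore could call waitForKubelet and return the error
	// to its caller instead of calling log.Panicf.
	err := waitForKubelet(func() error {
		return fmt.Errorf("remote error: tls: internal error") // stand-in failure
	}, 5, 200*time.Millisecond)
	log.Println(err)
}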

@marianafranco

marianafranco commented Dec 19, 2024

On our side, the issue is caused by the kubelet's serving certificate (CSR) not yet being available/approved when the cloudwatch-agent starts, as we see the following error in the logs before the cloudwatch-agent restarts successfully:

http: TLS handshake error from <host-ip>:46396: no serving certificate available for the kubelet

Given that, we decided to add an initContainer that keeps checking whether requests to the kubelet /pods endpoint succeed (the request returns Unauthorized, but that is fine) before the cloudwatch-agent container starts:

      initContainers:
      - args:
        - | 
          for i in {1..30}
          do
            curl -v --insecure --connect-timeout 5 --retry 5 --retry-connrefused https://$HOST_IP:10250/pods && break || echo "retrying kubelet request" && sleep 2
          done
        command:
        - sh
        - -c
        image: <some-linux-based-container-image>
        name: init

The workaround is working and we are no longer seeing cloudwatch-agent pods crash and restart. However, it would be better to have retries added to the cloudwatch-agent itself.
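
In case the init image does not have curl, roughly the same gate could be built as a tiny Go binary. This is only a sketch under the same assumptions as the manifest above (HOST_IP injected into the container, kubelet serving on port 10250), where any HTTP response, including 401 Unauthorized, counts as the kubelet being ready:

// wait_for_kubelet.go: sketch of an init-container check equivalent to the
// curl loop above — exits 0 once the kubelet answers /pods over TLS.
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	url := fmt.Sprintf("https://%s:10250/pods", os.Getenv("HOST_IP"))
	client := &http.Client{
		Timeout:   5 * time.Second,
		Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}},
	}

	for i := 1; i <= 30; i++ {
		resp, err := client.Get(url)
		if err == nil {
			// Any status (even 401 Unauthorized) means the kubelet completed the
			// TLS handshake, i.e. its serving certificate has been issued.
			resp.Body.Close()
			fmt.Println("kubelet is up:", resp.Status)
			return
		}
		fmt.Printf("attempt %d/30: %v; retrying\n", i, err)
		time.Sleep(2 * time.Second)
	}
	os.Exit(1) // kubelet never became reachable; fail the init container
}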
