
error: cannot start pipelines: cannot get pod from kubelet #1417

Open
GotoRen opened this issue Nov 11, 2024 · 3 comments

Comments

@GotoRen

GotoRen commented Nov 11, 2024

Describe the bug

After upgrading the EKS nodes to v1.29 (v1.29.8-20241024) and deploying CloudWatch Agent v1.3xxx, the following error occurs:

2024-11-06T06:49:35Z I! Starting AmazonCloudWatchAgent CWAgent/1.300028.1b210 (go1.20.7; linux; amd64)
2024-11-06T06:49:35Z I! AWS SDK log level not set
2024-11-06T06:49:35.353Z	info	service/telemetry.go:96	Skipping telemetry setup.	{"address": "", "level": "None"}
2024-11-06T06:49:35.356Z	info	service/service.go:131	Starting CWAgent...	{"Version": "1.300028.1b210", "NumCPU": 4}
2024-11-06T06:49:35.356Z	info	extensions/extensions.go:30	Starting extensions...
2024-11-06T06:49:35.374Z	info	host/ec2metadata.go:89	Fetch instance id and type from ec2 metadata	{"kind": "receiver", "name": "awscontainerinsightreceiver", "data_type": "metrics"}
2024-11-06T06:49:35.383Z	info	service/service.go:157	Starting shutdown...
2024-11-06T06:49:35.383Z	info	extensions/extensions.go:44	Stopping extensions...
2024-11-06T06:49:35.383Z	info	service/service.go:171	Shutdown complete.
Error: cannot start pipelines: cannot get pod from kubelet, err: call to /pods endpoint failed: Get "https://<host_ip>:10250/pods": remote error: tls: internal error
2024-11-06T06:49:35Z E! [telegraf] Error running agent: cannot start pipelines: cannot get pod from kubelet, err: call to /pods endpoint failed: Get "https://<host_ip>:10250/pods": remote error: tls: internal error

The failing request URL is built as:

url := fmt.Sprintf("https://%s:%s/pods", k.KubeIP, k.Port)

Note that the EKS Control-Plane was upgraded to v1.29 before proceeding with the node upgrade.
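
For anyone debugging this, below is a minimal standalone probe that reproduces the request the agent makes to the kubelet, so the TLS failure can be observed outside the agent. It is a sketch, not the agent's actual client; the HOST_IP environment variable and the in-cluster service-account token path are assumptions about how the probe pod is configured.

// kubelet_probe.go: one-shot request to the kubelet /pods endpoint,
// mimicking the call that fails in the agent logs above.
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	hostIP := os.Getenv("HOST_IP") // node IP, assumed to be injected via the Downward API
	token, _ := os.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/token")

	client := &http.Client{
		Timeout: 5 * time.Second,
		// Skip server-certificate verification so only the handshake itself is exercised;
		// a kubelet that cannot present a serving certificate still surfaces the same
		// "remote error: tls: internal error" seen above.
		Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}},
	}

	req, _ := http.NewRequest("GET", fmt.Sprintf("https://%s:10250/pods", hostIP), nil)
	req.Header.Set("Authorization", "Bearer "+string(token))

	resp, err := client.Do(req)
	if err != nil {
		fmt.Println("request failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()
	fmt.Println("kubelet answered with status:", resp.Status)
}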

Steps to reproduce

First, I upgraded the EKS cluster from v1.28 to v1.29.
Then, I upgraded the nodes from v1.27 to v1.29.

The nodes skip a version because I alternate between blue and green node groups.

After the nodes were upgraded to v1.29, the CloudWatch Agent started producing the error above.

What did you expect to see?

After the cluster upgrade, the CloudWatch Agent should start without errors. Specifically, when the agent requests pod data from the kubelet's /pods endpoint on the node it is running on, the TLS error (tls: internal error) should not occur.

What version did you use?

  • Control-Plane: v1.29
  • Data-Plane (EKS node): v1.29.8-20241024
    • kubelet: v1.29.8-eks-a737599
  • CloudWatch Agent: v1.300028.1b210

What config did you use?

We are using the container image public.ecr.aws/cloudwatch-agent/cloudwatch-agent:1.300028.1b210

Environment

  • AMI: AL2_x86_64
  • Instance type: c6i.2xlarge
  • OS architecture: linux (amd64)
  • OS image: Amazon Linux 2

※ IMDSv2 is set to optional (i.e., not enforced).

Additional comment

A similar issue has been observed, but it remains unresolved. This error seems to occur even when IMDSv2 is enabled.

@marianafranco

marianafranco commented Dec 17, 2024

We are seeing a similar issue with image 1.247360.0b252689:

2024-12-17T01:08:12Z I! Starting AmazonCloudWatchAgent CWAgent/1.247360.0b252689 (go1.20.5; linux; amd64)
2024-12-17T01:08:12Z I! AWS SDK log level not set
2024-12-17T01:08:12Z I! Loaded inputs: cadvisor ethtool k8sapiserver
2024-12-17T01:08:12Z I! Loaded aggregators:
2024-12-17T01:08:12Z I! Loaded processors: ec2tagger (2x) k8sdecorator
2024-12-17T01:08:12Z I! Loaded outputs: cloudwatch cloudwatchlogs
2024-12-17T01:08:12Z I! Tags enabled:
2024-12-17T01:08:12Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"xxxxxx", Flush Interval:1s
2024-12-17T01:08:12Z I! [processors.ec2tagger] ec2tagger: Check EC2 Metadata.
2024-12-17T01:08:12Z I! [logagent] starting
2024-12-17T01:08:12Z I! [logagent] found plugin cloudwatchlogs is a log backend
2024-12-17T01:08:12Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started initialization.
2024-12-17T01:08:12Z I! [processors.ec2tagger] ec2tagger: Check EC2 Metadata.
2024-12-17T01:08:12Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started initialization.
2024-12-17T01:08:12Z I! [processors.ec2tagger] ec2tagger: Check EC2 Metadata.
2024-12-17T01:08:12Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started initialization.
2024-12-17T01:08:12Z I! [processors.ec2tagger] ec2tagger: Check EC2 Metadata.
2024-12-17T01:08:12Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started initialization.
2024-12-17T01:08:12Z I! [processors.ec2tagger] ec2tagger: Initial retrieval of tags succeeded
2024-12-17T01:08:12Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started, finished initial retrieval of tags and Volumes
2024-12-17T01:08:12Z I! cloudwatch: get unique roll up list [[InstanceId InstanceType AutoScalingGroupName] [InstanceType] [AutoScalingGroupName]]
2024-12-17T01:08:12Z I! cloudwatch: publish with ForceFlushInterval: 1m0s, Publish Jitter: 4.98331802s
I1217 01:08:12.742445       1 leaderelection.go:248] attempting to acquire leader lease logging/cwagent-clusterleader...
2024-12-17T01:08:12Z I! k8sapiserver Switch New Leader: xxxxxx
W1217 01:08:12.757366       1 manager.go:291] Could not configure a source for OOM detection, disabling OOM events: open /dev/kmsg: no such file or directory
2024-12-17T01:08:12Z I! [processors.ec2tagger] ec2tagger: Initial retrieval of tags succeeded
2024-12-17T01:08:12Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started, finished initial retrieval of tags and Volumes
2024-12-17T01:08:12Z I! [processors.ec2tagger] ec2tagger: Initial retrieval of tags succeeded
2024-12-17T01:08:12Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started, finished initial retrieval of tags and Volumes
2024-12-17T01:08:12Z I! [processors.ec2tagger] ec2tagger: Initial retrieval of tags succeeded
2024-12-17T01:08:12Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started, finished initial retrieval of tags and Volumes
2024-12-17T01:08:12Z E! error making HTTP request to https://<host-ip>:10250/pods: remote error: tls: internal error
2024-12-17T01:08:12Z I! Cannot get pod from kubelet, err: KubeClinet Access Failure
panic: Cannot get pod from kubelet, err: KubeClinet Access Failure

goroutine 231 [running]:
log.Panicf({0x36ad74b?, 0x7fb16aa9c268?}, {0xc000e33d58?, 0x60?, 0xc001396000?})
	log/log.go:391 +0x67
github.com/aws/amazon-cloudwatch-agent/plugins/processors/k8sdecorator/stores.NewPodStore({0xc00069ddf1, 0xb}, 0x0)
	github.com/aws/amazon-cloudwatch-agent/plugins/processors/k8sdecorator/stores/podstore.go:76 +0x26e
github.com/aws/amazon-cloudwatch-agent/plugins/processors/k8sdecorator.(*K8sDecorator).start(0xc000127340)
	github.com/aws/amazon-cloudwatch-agent/plugins/processors/k8sdecorator/k8sdecorator.go:76 +0x6b
github.com/aws/amazon-cloudwatch-agent/plugins/processors/k8sdecorator.(*K8sDecorator).Apply(0xc000127340, {0xc000cfa2d0, 0x1, 0xc000087ee8?})
	github.com/aws/amazon-cloudwatch-agent/plugins/processors/k8sdecorator/k8sdecorator.go:39 +0x50
github.com/influxdata/telegraf/plugins/processors.(*streamingProcessor).Add(0xc000998840, {0x3c71d30?, 0xc000f02fc0}, {0x3c5faa0, 0xc0008c6000})
	github.com/influxdata/[email protected]/plugins/processors/streamingprocessor.go:37 +0x90
github.com/influxdata/telegraf/models.(*RunningProcessor).Add(0xc000998a50, {0x3c71d30, 0xc000f02fc0}, {0x3c5faa0, 0xc0008c6000})
	github.com/influxdata/[email protected]/models/running_processor.go:95 +0xcb
github.com/influxdata/telegraf/agent.(*Agent).runProcessors.func1(0xc0005aadc8)
	github.com/influxdata/[email protected]/agent/agent.go:562 +0x145
created by github.com/influxdata/telegraf/agent.(*Agent).runProcessors
	github.com/influxdata/[email protected]/agent/agent.go:557 +0x3c

  • Control-Plane: v1.28
  • Data-Plane (EKS node): 1.28.15-20241121
    • kubelet: v1.28.15-eks-94953ac
  • CloudWatch Agent: 1.247360.0b252689

After a few restarts, the cloudwatch-agent pod is able to start normally.

@marianafranco

Could some retry logic be added to NewPodStore instead of panicking immediately when the request to the kubelet fails?

// Try to detect kubelet permission issue here
if _, err := podStore.kubeClient.ListPods(); err != nil {
	log.Panicf("Cannot get pod from kubelet, err: %v", err)
}
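
For illustration, here is a minimal sketch of what such a retry could look like; it is not the agent's actual code, and the function name, retry budget, and backoff values are assumptions:

// retry_sketch.go: illustrative only — retry the initial kubelet check with
// backoff instead of panicking on the first failure. listPods stands in for
// the agent's podStore.kubeClient.ListPods call (an assumed signature).
package main

import (
	"fmt"
	"log"
	"time"
)

func waitForKubelet(listPods func() error, maxAttempts int, baseDelay time.Duration) error {
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if lastErr = listPods(); lastErr == nil {
			return nil
		}
		log.Printf("W! cannot get pod from kubelet (attempt %d/%d): %v", attempt, maxAttempts, lastErr)
		time.Sleep(time.Duration(attempt) * baseDelay) // simple linear backoff
	}
	return fmt.Errorf("cannot get pod from kubelet after %d attempts: %w", maxAttempts, lastErr)
}

func main() {
	// Example wiring: NewPodStore could call waitForKubelet and return the error
	// to its caller instead of calling log.Panicf.
	err := waitForKubelet(func() error {
		return fmt.Errorf("remote error: tls: internal error") // stand-in failure
	}, 5, 200*time.Millisecond)
	log.Println(err)
}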

@marianafranco

marianafranco commented Dec 19, 2024

On our side, the issue is caused by the kubelet's serving certificate (CSR) not yet being available/approved when the cloudwatch-agent starts, as we see the following error in the logs before the cloudwatch-agent restarts successfully:

http: TLS handshake error from <host-ip>:46396: no serving certificate available for the kubelet

Given that, we decided to add an initContainer that keeps checking whether requests to the kubelet /pods endpoint succeed (the request returns Unauthorized, but that is fine) before the cloudwatch-agent container starts:

      initContainers:
      - args:
        - | 
          for i in {1..30}
          do
            curl -v --insecure --connect-timeout 5 --retry 5 --retry-connrefused https://$HOST_IP:10250/pods && break || echo "retrying kubelet request" && sleep 2
          done
        command:
        - sh
        - -c
        image: <some-linux-based-container-image>
        name: init

The workaround is working and we are no longer seeing cloudwatch-agent pods crash and restart. However, it would be better to have retries added to the cloudwatch-agent itself.
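
In case the init image does not have curl, roughly the same gate could be built as a tiny Go binary. This is only a sketch under the same assumptions as the manifest above (HOST_IP injected into the container, kubelet serving on port 10250), where any HTTP response, including 401 Unauthorized, counts as the kubelet being ready:

// wait_for_kubelet.go: sketch of an init-container check equivalent to the
// curl loop above — exits 0 once the kubelet answers /pods over TLS.
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	url := fmt.Sprintf("https://%s:10250/pods", os.Getenv("HOST_IP"))
	client := &http.Client{
		Timeout:   5 * time.Second,
		Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}},
	}

	for i := 1; i <= 30; i++ {
		resp, err := client.Get(url)
		if err == nil {
			// Any status (even 401 Unauthorized) means the kubelet completed the
			// TLS handshake, i.e. its serving certificate has been issued.
			resp.Body.Close()
			fmt.Println("kubelet is up:", resp.Status)
			return
		}
		fmt.Printf("attempt %d/30: %v; retrying\n", i, err)
		time.Sleep(2 * time.Second)
	}
	os.Exit(1) // kubelet never became reachable; fail the init container
}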
