Pod stuck in ContainerCreating after upgrading cluster to 1.29 #2980
Comments
Hey @zendesk-yumingdeng, I noticed the log message `It looks like all IPs are exhausted`. Could you let me know how many pods are running on the new node that was brought up? Also, what kind of node is it in terms of capacity? TIA!
Hi, we are facing the same issue. In a specific environment we have 3 nodes running c6g/c7g medium instances. The VPC CNI is used with security groups per pod. If we have 22 pods running, the 23rd cannot start with the same error:
Logs from ipam:
According to this document, each of these instances should accommodate 8 pods per node. However, another page uses a different formula. This issue started when we migrated the cluster from K8s 1.28 to 1.30 and bumped the CNI version from 1.14.1 to 1.18.2.
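For reference, the commonly documented per-node pod limit derived from ENI limits is `maxPods = ENIs * (IPs per ENI - 1) + 2`. Below is a minimal sketch in Go, assuming the c6g/c7g.medium limits of 2 ENIs and 4 IPv4 addresses per ENI (verify against current AWS documentation for your instance type); note that security groups per pod uses branch ENIs, whose limits this formula does not capture.

```go
package main

import "fmt"

// maxPods implements the commonly documented VPC CNI formula:
//
//	maxPods = ENIs * (IPv4 addresses per ENI - 1) + 2
//
// One IPv4 address per ENI is reserved as the ENI's primary address, and the
// +2 accounts for host-network pods such as aws-node and kube-proxy.
func maxPods(enis, ipsPerENI int) int {
	return enis*(ipsPerENI-1) + 2
}

func main() {
	// Assumed limits for c6g/c7g.medium: 2 ENIs with 4 IPv4 addresses each.
	// Check the AWS instance-type limits before relying on these numbers.
	fmt.Println(maxPods(2, 4)) // 8
}
```

The result matches the 8 pods per node quoted above for these instance sizes.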
We are having the same issue: pods get stuck because all of the network interfaces already have the maximum number of IPs assigned. The CNI plugin should not allow pods to be scheduled on nodes that don't have any IP capacity (k8s 1.30, CNI 1.18.2).
Same issue on 1.29 and 1.18.2. It would be nice if the plugin did not try to place pods on nodes with little or no IP capacity.
Seeing the same thing. Does anyone know what causes it or has a fix?
I fixed it by setting:
Same issue on 1.29 and 1.18.3.
The original issue here was:
It means that not enough IPs were available on the node.
For others who are experiencing this, does it occur after any upgrade?
Between VPC CNI 1.14.x and later versions, there have been changes to reduce the number of EC2 API calls (#2640) that sometimes inadvertently interfered with the previous behavior. Using the proper values for
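The warm-pool settings usually tuned here are the documented VPC CNI environment variables such as WARM_IP_TARGET and MINIMUM_IP_TARGET. Below is a simplified Go sketch, not the actual ipamd implementation, of how such targets can translate into an IP deficit that ipamd then tries to fill from EC2; the function name and example numbers are illustrative assumptions.

```go
package main

import "fmt"

// warmPoolDeficit is a simplified sketch (not the actual ipamd code) of how
// warm-pool targets can decide whether additional IPs should be requested
// from EC2. total is the number of IPs attached to the node's ENIs and used
// is the number currently assigned to pods.
func warmPoolDeficit(total, used, warmIPTarget, minimumIPTarget int) int {
	available := total - used
	deficit := warmIPTarget - available
	// A MINIMUM_IP_TARGET-style floor keeps the pool above a fixed size
	// regardless of how many IPs are in use.
	if short := minimumIPTarget - total; short > deficit {
		deficit = short
	}
	if deficit < 0 {
		return 0
	}
	return deficit
}

func main() {
	// Assumed example values: 8 IPs attached, 7 in use,
	// WARM_IP_TARGET=2, MINIMUM_IP_TARGET=8.
	fmt.Println(warmPoolDeficit(8, 7, 2, 8)) // 1: one more IP would be requested
}
```

Per the VPC CNI documentation, very low WARM_IP_TARGET values can increase EC2 API calls on busy nodes, while leaving the targets unset generally keeps the default warm-ENI behavior.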
@emcay I think aws/containers-roadmap#2189 is related to what you are describing.
We had this issue come back after upgrading recently. We're now on k8s 1.31, CNI 1.19.0 (previously k8s 1.30, CNI 1.18.2). The only non-default configuration we run is
We're keeping an eye out and will do.
What happened:
We are experiencing something similar to #2970, after upgrading our in-house clusters to 1.29.
After a new node is brought up (note this does not happen to every node), some pods that were scheduled to the node are stuck in the `ContainerCreating` status with the below event message:

`aws-cni` pod logs on the node:

`/var/log/aws-routed-eni/plugin.log`:

`/var/log/aws-routed-eni/ipamd.log`:
Environment:
- Kubernetes version (`kubectl version`): v1.29.6
- OS (`cat /etc/os-release`): Ubuntu 22.04.4 LTS
- Kernel (`uname -a`): Linux 6.5.0-1022-aws #22~22.04.1-Ubuntu SMP Fri Jun 14 19:23:09 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux