
CNI ipamd reconciliation of in-use addresses #3109

Open
hbhasker opened this issue Nov 8, 2024 · 14 comments · May be fixed by #3113

Comments

@hbhasker

hbhasker commented Nov 8, 2024

What happened:
We noticed that on some hosts the CNI believed 64 IPs were in use by pods that had long since terminated. When we checked, the node had only 59 pods (including ones using host networking), but the CNI clearly thought 64 pods were running and failed to allocate new IPs to pods because all IPs were in use (we set max pods to 64 for the node). We spent some time trying to figure out how that happens; I guess it can happen if the CNI somehow misses the delete event or fails to process it.

I was trying to read the code to see if there is some race where a delete and a create can race, causing the CNI to incorrectly reject the delete and then proceed to add the IP to ipamd as allocated. In that case the IP remains in use even though the pod is gone. (It's possible I am misunderstanding what kubelet/CRI-O do when a pod is terminated and the CNI fails the DelNetwork request with an error.)

Mostly looking to understand whether this is a known issue. It looks like the CNI reconciles its datastore on restart, but maybe it needs to reconcile periodically to prevent this?

Environment:

  • Kubernetes version (use kubectl version): 1.26
  • CNI Version: 1.18.2
  • OS (e.g: cat /etc/os-release): Amazon Linux 2023
  • Kernel (e.g. uname -a): 6.1
hbhasker pushed a commit to hbhasker/amazon-vpc-cni-k8s that referenced this issue Nov 13, 2024
The CNI today only reconciles its datastore with existing pods at
startup but never again. It's possible for IPAMD to go out of sync
with the kubelet's view of the pods running on the node if it fails
or is temporarily unreachable by the CNI plugin handling the
DelNetwork call from the container runtime.

In such cases the CNI continues to consider the pod's IP allocated
and will never free it, as it will never see another DelNetwork.
This results in the CNI failing to assign IPs to new pods.

This change adds a reconcile loop which periodically (once a minute)
reconciles the allocated IPs against the existence of the pods' veth
devices. If the veth device is not found, it frees the corresponding
allocation, making the IP available for reuse.

Fixes aws#3109
@hbhasker hbhasker linked a pull request Nov 13, 2024 that will close this issue
@orsenthil
Member

Hi @hbhasker, thank you for this report. Could you confirm this by looking at the ipamd.log? If you share the logs at [email protected], we can look over them too.

@hbhasker
Author

I did confirm by looking at the JSON for the datastore as well. It clearly had pods in there that had already terminated on the node. I will see if I run into another occurrence of the same and capture more information.

hbhasker pushed a commit to hbhasker/amazon-vpc-cni-k8s that referenced this issue Nov 13, 2024
@jayanthvn
Contributor

@orsenthil - We will have to check the plugin logs to see if the delete request landed on the CNI, since kubelet is the source of truth. I don't think we should add more reconcilers; rather, we should check why the event was missed or not received.

@hbhasker
Author

Any update on this?

@hbhasker
Author

hbhasker commented Dec 2, 2024

We just hit this on another node where IPAMD thinks there are 64 pods running when in fact there are fewer than 50. It assumes all 64 IPs are allocated and refuses to allocate any, failing pod startup on the node.

@orsenthil
Member

@hbhasker
Author

hbhasker commented Dec 3, 2024

@orsenthil I have opened an AWS support case and shared the relevant information via that channel, as I prefer not to send logs from our system via email. I have asked our AWS partners to coordinate with you so that the relevant information gets to you.

@orsenthil
Member

Thank you, @hbhasker. I will review the logs.

@hbhasker
Author

hbhasker commented Dec 4, 2024

Something we observed today: it might also be a bug in prefix delegation. We noticed that the CNI wasn't attaching new prefixes after four of them, even when all IPs were in use. I will write some tests to check the logic that decides whether the datastore is low on IPs.

hbhasker pushed a commit to hbhasker/amazon-vpc-cni-k8s that referenced this issue Dec 4, 2024
@orsenthil
Member

When we checked the node had only 59 pods (including ones that were using host networking) but CNI clearly thought there were 64 pods running and failed allocating new IPs to pods because all IPs were in use (we set the max pods to 64 for the node).

I looked at the initial logs that you uploaded. I see that the assigned IPs did go to 64 in the logs; I noticed the IP assignment going from 55 to 58, reallocating IPs, and then increasing. In general, I couldn't see an error with the DelNetwork calls either. Something else might be happening here. I have requested log bundles (from a working and a non-working node) through the support case.

@hbhasker
Author

hbhasker commented Dec 6, 2024

So I was looking through the code and I realized getPrefixesNeeded as written is buggy. If WARM_IP_TARGET is defined, it returns the number of IPs required instead of the number of prefixes.

See:

toAllocate = max(toAllocate, short)
That should return toAllocate/16 + 1 (a prefix count, not an IP count). That said, this may or may not be the issue; I am still trying to understand the code flow.

@jayanthvn
Contributor

@hbhasker - You can check these lines -

// datastoreTargetState already has complex math so adding Prefix target will make it even more complex.
short, _, warmIPTargetsDefined := c.datastoreTargetState(nil)
shortPrefixes, warmPrefixTargetDefined := c.datastorePrefixTargetState()

In datastoreTargetState we compute the needed prefixes:

// Over will have number of IPs more than needed but with PD we would have allocated in chunks of /28
// Say assigned = 1, warm ip target = 16, this will need 2 prefixes. But over will return 15.
// Hence we need to check if 'over' number of IPs are needed to maintain the warm targets
prefixNeededForWarmIP := datastore.DivCeil(stats.AssignedIPs+c.warmIPTarget, numIPsPerPrefix)
prefixNeededForMinIP := datastore.DivCeil(c.minimumIPTarget, numIPsPerPrefix)
// over will be number of prefixes over than needed but could be spread across used prefixes,
// say, after couple of pod churns, 3 prefixes are allocated with 1 IP each assigned and warm ip target is 15
// (J : is this needed? since we have to walk thru the loop of prefixes)
freePrefixes := c.dataStore.GetFreePrefixes()
overPrefix := max(min(freePrefixes, stats.TotalPrefixes-prefixNeededForWarmIP), 0)
overPrefix = max(min(overPrefix, stats.TotalPrefixes-prefixNeededForMinIP), 0)
return shortPrefix, overPrefix, true

@hbhasker
Author

hbhasker commented Dec 6, 2024

Ah, thanks for that. Maybe a cleanup would be to not overload the function to handle both the prefix and non-prefix cases, but to split it into two functions and be explicit about which one is called. It would make the code a lot easier to reason about.

@hbhasker
Author

hbhasker commented Dec 6, 2024

So I was looking again at what might have happened and why prefix delegation wasn't attaching new prefixes, and I think the reason is that we set maxPods to 62, and this check causes it to return without attaching new prefixes:

if stats.TotalIPs >= c.maxPods {

So there is no bug in prefix delegation, just an inconsistency between the ipamd datastore and the container runtime's view of the pods running on the node.
