[Nvidia variant] Slow shutdown after upgrade #4347
Comments
Thanks for this report! Do you think you could check the EC2 console logs for any suspicious messages? I'll attempt to reproduce, but do you have any pointers about your environment that might be influencing the result? Any additional devices (volumes, network mounts) or network configurations that you think might be relevant could be useful.
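For reference, here is a minimal sketch (not from the original thread) of pulling the EC2 console output with boto3; the instance ID and region are placeholders, and credentials are assumed to be configured:

```python
import base64
import boto3

# Region and instance ID below are placeholders for the affected node.
ec2 = boto3.client("ec2", region_name="us-east-1")

resp = ec2.get_console_output(
    InstanceId="i-0123456789abcdef0",  # replace with the slow-terminating node's instance ID
    Latest=True,                       # request the most recent output where the instance type supports it
)

# boto3 returns the console output base64-encoded; decode before printing.
output = resp.get("Output", "")
if output:
    print(base64.b64decode(output).decode("utf-8", errors="replace"))
else:
    print("<no console output available>")
```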
Hi @cbgbt, thanks for the prompt response!
I have also checked the console log, showing it here from the first noted error; all previous entries were in status OK.
During the investigation, I removed all pods that are used specifically by the nvidia machines, which left approximately the same config as on our other NodeClasses/NodePools (except that those are non-GPU instances). So it's possible that the slow shutdown was there even before, but was hidden by how Karpenter used to handle nodes: it would remove the node from the EKS cluster and leave the EC2 shutdown running in the background, as opposed to the new v1.x behavior where it marks the node NotReady and waits for the EC2 machine to be completely terminated, which is what triggered our alerts. If you need any more info, I'll be happy to provide it.
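One rough way to quantify that delay outside of Karpenter (a sketch, not something from this thread; the instance ID and region are placeholders) is to time the EC2 state transitions directly while a node is being scaled down:

```python
import time
import boto3

# Assumptions: boto3 credentials are configured; region and instance ID are placeholders.
ec2 = boto3.client("ec2", region_name="us-east-1")
instance_id = "i-0123456789abcdef0"  # the node currently being scaled down

start = time.monotonic()
while True:
    resp = ec2.describe_instances(InstanceIds=[instance_id])
    state = resp["Reservations"][0]["Instances"][0]["State"]["Name"]
    # Prints elapsed seconds and the current state, e.g. "shutting-down" -> "terminated".
    print(f"{time.monotonic() - start:6.1f}s  state={state}")
    if state == "terminated":
        break
    time.sleep(15)
```

Comparing the elapsed time between an nvidia-variant node and a non-nvidia one should show whether the extra 5+ minutes is spent entirely inside the EC2 shutdown.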
Context
We are running relatively short-lived (a few minutes to 2 hours) GPU jobs on our EKS cluster, scaling intensively.
We upgraded from
bottlerocket-aws-k8s-1.28-x86_64-v1.20.2-536d69d0
to
bottlerocket-aws-k8s-1.30-x86_64-v1.26.2-360b7a38
We started receiving alerts from Prometheus Alertmanager during node scale-downs, like:
This led us to the Karpenter logs:
The issue is that during the shutdown procedure our kubelet metrics scraping starts failing and the EC2 machine has already disconnected, but Karpenter does not proceed with removing the node, as it waits for the EC2 NodeTerminated state. Since this does not happen for the next 5-9 minutes (I've seen different numbers, never less than 5 minutes), health check/metrics queries time out and we get hit with multiple alerts on each nvidia variant scale-down.
Image I'm using:
Bottlerocket OS 1.26.2 (aws-k8s-1.30-nvidia)
What I expected to happen:
The EC2 machine should go into the terminated state much faster, presumably in about the time it takes for the non-nvidia variant.
What actually happened:
Node shutdown takes 5+ minutes, with no errors reported by Karpenter; below are the journald logs that I managed to catch before being disconnected:
How to reproduce the problem:
Use the bottlerocket-aws-k8s-1.30-nvidia-x86_64-v1.26.2-360b7a38 AMI (this also happens with the newest build for the EKS 1.30 variant).