[Nvidia variant] Slow shutdown after upgrade #4347

Open
sanjinp opened this issue Dec 24, 2024 · 2 comments
Labels: status/needs-triage (Pending triage or re-evaluation), type/bug (Something isn't working)

Comments


sanjinp commented Dec 24, 2024

Context

We are running relatively short-lived (a few minutes to 2 hours) GPU jobs on our EKS cluster, scaling up and down intensively.

After upgrading EKS and the Bottlerocket AMIs from bottlerocket-aws-k8s-1.28-x86_64-v1.20.2-536d69d0 to bottlerocket-aws-k8s-1.30-x86_64-v1.26.2-360b7a38, we started receiving alerts during node scale-downs from Prometheus Alertmanager, like:

Alert: One or more targets are unreachable.
 Description: 33.33% of the kubelet/kube-prometheus-stack-kubelet targets in kube-system namespace are down.

This led us to the Karpenter logs:

{"level":"INFO","time":"2024-12-24T**09:09:30**.804Z","logger":"controller","message":"disrupting nodeclaim(s) via delete, terminating 1 nodes (0 pods) ip-xx-xx-xx-xx.us-east-2.compute.internal/g4dn.xlarge/on-demand","commit":"a2875e3","controller":"disruption","namespace":"","name":"","reconcileID":"aca176ce-e29f-4898-901d-18787e3647b6","command-id":"63b57d39-fdf1-4363-9007-6d0cd069633a","reason":"empty"}
...
{"level":"INFO","time":"2024-12-24T**09:15:24**.829Z","logger":"controller","message":"deleted node","commit":"a2875e3","controller":"node.termination","controllerGroup":"","controllerKind":"Node","Node":{"name":"ip-xx-xx-xx-xx.us-east-2.compute.internal"},"namespace":"","name":" ip-xx-xx-xx-xx.us-east-2.compute.internal","reconcileID":"b9244d6a-767a-4e10-86d6-92d18247b573"}

The issue is that during the shutdown procedure our kubelet metrics scraping starts failing and the EC2 machine has already disconnected, but Karpenter does not proceed with removing the node, as it waits for the EC2 instance to reach the Terminated state. Since that does not happen for the next 5-9 minutes (we have seen different numbers, never less than 5 minutes), the health-check/metrics queries time out and we get hit with multiple alerts on every nvidia-variant scale-down.
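
For anyone wanting to observe the same gap, a rough way to measure how long the instance sits in shutting-down is to poll its state with the AWS CLI while Karpenter terminates the node (the instance ID below is a placeholder):

# Poll the instance state every 10s; the time between "shutting-down" first
# appearing and "terminated" is the shutdown duration Karpenter waits out.
while true; do
  aws ec2 describe-instances \
    --instance-ids i-0123456789abcdef0 \
    --query 'Reservations[].Instances[].State.Name' \
    --output text
  sleep 10
done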

Image I'm using:
Bottlerocket OS 1.26.2 (aws-k8s-1.30-nvidia)

What I expected to happen:
The EC2 instance should go into the terminated state much faster, presumably in about the time it takes for the non-nvidia variant.

What actually happened:
Node shutdown takes 5+ minutes, with no errors reported by Karpenter. These are the journald logs that I managed to catch before being disconnected (one way to capture them is sketched after the excerpt):

Dec 24 09:45:25 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:45:25.365283234Z" level=info msg="StopContainer for \"a363fe13a05cfebe4be8677ace6469fbddf10878a690eb77e8ce72b0de6773b1\" with timeout 2 (s)"
Dec 24 09:45:25 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:45:25.365658983Z" level=info msg="Stop container \"a363fe13a05cfebe4be8677ace6469fbddf10878a690eb77e8ce72b0de6773b1\" with signal terminated"
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:45:27.376846500Z" level=info msg="Kill container \"a363fe13a05cfebe4be8677ace6469fbddf10878a690eb77e8ce72b0de6773b1\""
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal systemd[1]: cri-containerd-a363fe13a05cfebe4be8677ace6469fbddf10878a690eb77e8ce72b0de6773b1.scope: Deactivated successfully.
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal systemd[1]: run-containerd-io.containerd.runtime.v2.task-k8s.io-a363fe13a05cfebe4be8677ace6469fbddf10878a690eb77e8ce72b0de6773b1-rootfs.mount: Deactivated successfully.
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:45:27.552717188Z" level=info msg="shim disconnected" id=a363fe13a05cfebe4be8677ace6469fbddf10878a690eb77e8ce72b0de6773b1 namespace=k8s.io
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:45:27.552775266Z" level=warning msg="cleaning up after shim disconnected" id=a363fe13a05cfebe4be8677ace6469fbddf10878a690eb77e8ce72b0de6773b1 namespace=k8s.io
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:45:27.552789317Z" level=info msg="cleaning up dead shim" namespace=k8s.io
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:45:27.570484159Z" level=info msg="StopContainer for \"a363fe13a05cfebe4be8677ace6469fbddf10878a690eb77e8ce72b0de6773b1\" returns successfully"
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:45:27.571063746Z" level=info msg="StopPodSandbox for \"c00e66d8003921dff6c2f408a3b34c9df494bf694021b3455a49635d385bf611\""
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:45:27.571098765Z" level=info msg="Container to stop \"a363fe13a05cfebe4be8677ace6469fbddf10878a690eb77e8ce72b0de6773b1\" must be in running or unknown state, current state \"CONTAINER_EXITED\""
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal systemd[1]: run-containerd-io.containerd.grpc.v1.cri-sandboxes-c00e66d8003921dff6c2f408a3b34c9df494bf694021b3455a49635d385bf611-shm.mount: Deactivated successfully.
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal systemd[1]: cri-containerd-c00e66d8003921dff6c2f408a3b34c9df494bf694021b3455a49635d385bf611.scope: Deactivated successfully.
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:45:27.644048686Z" level=info msg="shim disconnected" id=c00e66d8003921dff6c2f408a3b34c9df494bf694021b3455a49635d385bf611 namespace=k8s.io
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:45:27.644125709Z" level=warning msg="cleaning up after shim disconnected" id=c00e66d8003921dff6c2f408a3b34c9df494bf694021b3455a49635d385bf611 namespace=k8s.io
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:45:27.644144951Z" level=info msg="cleaning up dead shim" namespace=k8s.io
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal systemd[1]: run-containerd-io.containerd.runtime.v2.task-k8s.io-c00e66d8003921dff6c2f408a3b34c9df494bf694021b3455a49635d385bf611-rootfs.mount: Deactivated successfully.
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: 2024-12-24 09:45:27.717744754 +0000 UTC m=+0.004813718 write error: can't make directories for new logfile: mkdir /host: read-only file system
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: 2024-12-24 09:45:27.717818589 +0000 UTC m=+0.004887546 write error: can't make directories for new logfile: mkdir /host: read-only file system
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:45:27.723519664Z" level=info msg="TearDown network for sandbox \"c00e66d8003921dff6c2f408a3b34c9df494bf694021b3455a49635d385bf611\" successfully"
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:45:27.723549925Z" level=info msg="StopPodSandbox for \"c00e66d8003921dff6c2f408a3b34c9df494bf694021b3455a49635d385bf611\" returns successfully"
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal systemd[1]: run-netns-cni\x2dbc681b81\x2d8da7\x2d2106\x2dea0e\x2db24b7c8e9d86.mount: Deactivated successfully.
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal systemd-networkd[1197]: eni2b36f300ae6: Link DOWN
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal systemd-networkd[1197]: eni2b36f300ae6: Lost carrier
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal kubelet[1544]: I1224 09:45:27.788451    1544 reconciler_common.go:161] "operationExecutor.UnmountVolume started for volume \"core-path\" (UniqueName: \"kubernetes.io/host-path/1e37ebad-a19a-4852-9e84-e806b1219859-core-path\") pod \"1e37ebad-a19a-4852-9e84-e806b1219859\" (UID: \"1e37ebad-a19a-4852-9e84-e806b1219859\") "
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal kubelet[1544]: I1224 09:45:27.788494    1544 reconciler_common.go:161] "operationExecutor.UnmountVolume started for volume \"aws-iam-token\" (UniqueName: \"kubernetes.io/projected/1e37ebad-a19a-4852-9e84-e806b1219859-aws-iam-token\") pod \"1e37ebad-a19a-4852-9e84-e806b1219859\" (UID: \"1e37ebad-a19a-4852-9e84-e806b1219859\") "
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal kubelet[1544]: I1224 09:45:27.788521    1544 reconciler_common.go:161] "operationExecutor.UnmountVolume started for volume \"kube-api-access-gcdwp\" (UniqueName: \"kubernetes.io/projected/1e37ebad-a19a-4852-9e84-e806b1219859-kube-api-access-gcdwp\") pod \"1e37ebad-a19a-4852-9e84-e806b1219859\" (UID: \"1e37ebad-a19a-4852-9e84-e806b1219859\") "
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal kubelet[1544]: I1224 09:45:27.788526    1544 operation_generator.go:887] UnmountVolume.TearDown succeeded for volume "kubernetes.io/host-path/1e37ebad-a19a-4852-9e84-e806b1219859-core-path" (OuterVolumeSpecName: "core-path") pod "1e37ebad-a19a-4852-9e84-e806b1219859" (UID: "1e37ebad-a19a-4852-9e84-e806b1219859"). InnerVolumeSpecName "core-path". PluginName "kubernetes.io/host-path", VolumeGidValue ""
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal kubelet[1544]: I1224 09:45:27.788552    1544 reconciler_common.go:161] "operationExecutor.UnmountVolume started for volume \"tmp\" (UniqueName: \"kubernetes.io/empty-dir/1e37ebad-a19a-4852-9e84-e806b1219859-tmp\") pod \"1e37ebad-a19a-4852-9e84-e806b1219859\" (UID: \"1e37ebad-a19a-4852-9e84-e806b1219859\") "
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal kubelet[1544]: I1224 09:45:27.788620    1544 reconciler_common.go:289] "Volume detached for volume \"core-path\" (UniqueName: \"kubernetes.io/host-path/1e37ebad-a19a-4852-9e84-e806b1219859-core-path\") on node \"ip-xx-xx-xx-xx.us-east-2.compute.internal\" DevicePath \"\""
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal kubelet[1544]: I1224 09:45:27.788812    1544 operation_generator.go:887] UnmountVolume.TearDown succeeded for volume "kubernetes.io/empty-dir/1e37ebad-a19a-4852-9e84-e806b1219859-tmp" (OuterVolumeSpecName: "tmp") pod "1e37ebad-a19a-4852-9e84-e806b1219859" (UID: "1e37ebad-a19a-4852-9e84-e806b1219859"). InnerVolumeSpecName "tmp". PluginName "kubernetes.io/empty-dir", VolumeGidValue ""
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal kubelet[1544]: I1224 09:45:27.790690    1544 operation_generator.go:887] UnmountVolume.TearDown succeeded for volume "kubernetes.io/projected/1e37ebad-a19a-4852-9e84-e806b1219859-kube-api-access-gcdwp" (OuterVolumeSpecName: "kube-api-access-gcdwp") pod "1e37ebad-a19a-4852-9e84-e806b1219859" (UID: "1e37ebad-a19a-4852-9e84-e806b1219859"). InnerVolumeSpecName "kube-api-access-gcdwp". PluginName "kubernetes.io/projected", VolumeGidValue ""
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal kubelet[1544]: I1224 09:45:27.790738    1544 operation_generator.go:887] UnmountVolume.TearDown succeeded for volume "kubernetes.io/projected/1e37ebad-a19a-4852-9e84-e806b1219859-aws-iam-token" (OuterVolumeSpecName: "aws-iam-token") pod "1e37ebad-a19a-4852-9e84-e806b1219859" (UID: "1e37ebad-a19a-4852-9e84-e806b1219859"). InnerVolumeSpecName "aws-iam-token". PluginName "kubernetes.io/projected", VolumeGidValue ""
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal systemd[1]: var-lib-kubelet-pods-1e37ebad\x2da19a\x2d4852\x2d9e84\x2de806b1219859-volumes-kubernetes.io\x7eprojected-kube\x2dapi\x2daccess\x2dgcdwp.mount: Deactivated successfully.
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal systemd[1]: var-lib-kubelet-pods-1e37ebad\x2da19a\x2d4852\x2d9e84\x2de806b1219859-volumes-kubernetes.io\x7eprojected-aws\x2diam\x2dtoken.mount: Deactivated successfully.
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal kubelet[1544]: I1224 09:45:27.888794    1544 reconciler_common.go:289] "Volume detached for volume \"aws-iam-token\" (UniqueName: \"kubernetes.io/projected/1e37ebad-a19a-4852-9e84-e806b1219859-aws-iam-token\") on node \"ip-xx-xx-xx-xx.us-east-2.compute.internal\" DevicePath \"\""
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal kubelet[1544]: I1224 09:45:27.888833    1544 reconciler_common.go:289] "Volume detached for volume \"kube-api-access-gcdwp\" (UniqueName: \"kubernetes.io/projected/1e37ebad-a19a-4852-9e84-e806b1219859-kube-api-access-gcdwp\") on node \"ip-xx-xx-xx-xx.us-east-2.compute.internal\" DevicePath \"\""
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal kubelet[1544]: I1224 09:45:27.888846    1544 reconciler_common.go:289] "Volume detached for volume \"tmp\" (UniqueName: \"kubernetes.io/empty-dir/1e37ebad-a19a-4852-9e84-e806b1219859-tmp\") on node \"ip-xx-xx-xx-xx.us-east-2.compute.internal\" DevicePath \"\""
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal kubelet[1544]: I1224 09:45:27.921626    1544 kubelet_volumes.go:163] "Cleaned up orphaned pod volumes dir" podUID="1e37ebad-a19a-4852-9e84-e806b1219859" path="/var/lib/kubelet/pods/1e37ebad-a19a-4852-9e84-e806b1219859/volumes"
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal systemd[1]: Removed slice libcontainer container kubepods-burstable-pod1e37ebad_a19a_4852_9e84_e806b1219859.slice.
Dec 24 09:45:28 ip-xx-xx-xx-xx.us-east-2.compute.internal kubelet[1544]: I1224 09:45:28.329359    1544 scope.go:117] "RemoveContainer" containerID="a363fe13a05cfebe4be8677ace6469fbddf10878a690eb77e8ce72b0de6773b1"
Dec 24 09:45:28 ip-xx-xx-xx-xx.us-east-2.compute.internal systemd[1]: kubepods-burstable-pod1e37ebad_a19a_4852_9e84_e806b1219859.slice: Failed to open /run/systemd/transient/kubepods-burstable-pod1e37ebad_a19a_4852_9e84_e806b1219859.slice: No such file or directory
Dec 24 09:45:28 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:45:28.330830934Z" level=info msg="RemoveContainer for \"a363fe13a05cfebe4be8677ace6469fbddf10878a690eb77e8ce72b0de6773b1\""
Dec 24 09:45:28 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:45:28.333926033Z" level=info msg="RemoveContainer for \"a363fe13a05cfebe4be8677ace6469fbddf10878a690eb77e8ce72b0de6773b1\" returns successfully"
Dec 24 09:45:28 ip-xx-xx-xx-xx.us-east-2.compute.internal kubelet[1544]: I1224 09:45:28.334130    1544 scope.go:117] "RemoveContainer" containerID="a363fe13a05cfebe4be8677ace6469fbddf10878a690eb77e8ce72b0de6773b1"
Dec 24 09:45:28 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:45:28.334322404Z" level=error msg="ContainerStatus for \"a363fe13a05cfebe4be8677ace6469fbddf10878a690eb77e8ce72b0de6773b1\" failed" error="rpc error: code = NotFound desc = an error occurred when try to find container \"a363fe13a05cfebe4be8677ace6469fbddf10878a690eb77e8ce72b0de6773b1\": not found"
Dec 24 09:45:28 ip-xx-xx-xx-xx.us-east-2.compute.internal kubelet[1544]: E1224 09:45:28.334468    1544 remote_runtime.go:432] "ContainerStatus from runtime service failed" err="rpc error: code = NotFound desc = an error occurred when try to find container \"a363fe13a05cfebe4be8677ace6469fbddf10878a690eb77e8ce72b0de6773b1\": not found" containerID="a363fe13a05cfebe4be8677ace6469fbddf10878a690eb77e8ce72b0de6773b1"
Dec 24 09:45:28 ip-xx-xx-xx-xx.us-east-2.compute.internal kubelet[1544]: I1224 09:45:28.334505    1544 pod_container_deletor.go:53] "DeleteContainer returned error" containerID={"Type":"containerd","ID":"a363fe13a05cfebe4be8677ace6469fbddf10878a690eb77e8ce72b0de6773b1"} err="failed to get container status \"a363fe13a05cfebe4be8677ace6469fbddf10878a690eb77e8ce72b0de6773b1\": rpc error: code = NotFound desc = an error occurred when try to find container \"a363fe13a05cfebe4be8677ace6469fbddf10878a690eb77e8ce72b0de6773b1\": not found"
Dec 24 09:46:15 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:46:15.880259964Z" level=info msg="StopPodSandbox for \"c00e66d8003921dff6c2f408a3b34c9df494bf694021b3455a49635d385bf611\""
Dec 24 09:46:15 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: 2024-12-24 09:46:15.898322951 +0000 UTC m=+0.004926890 write error: can't make directories for new logfile: mkdir /host: read-only file system
Dec 24 09:46:15 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: 2024-12-24 09:46:15.898404069 +0000 UTC m=+0.005007982 write error: can't make directories for new logfile: mkdir /host: read-only file system
Dec 24 09:46:15 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:46:15.902816735Z" level=info msg="TearDown network for sandbox \"c00e66d8003921dff6c2f408a3b34c9df494bf694021b3455a49635d385bf611\" successfully"
Dec 24 09:46:15 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:46:15.902845703Z" level=info msg="StopPodSandbox for \"c00e66d8003921dff6c2f408a3b34c9df494bf694021b3455a49635d385bf611\" returns successfully"
Dec 24 09:46:15 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:46:15.903346648Z" level=info msg="RemovePodSandbox for \"c00e66d8003921dff6c2f408a3b34c9df494bf694021b3455a49635d385bf611\""
Dec 24 09:46:15 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:46:15.903378313Z" level=info msg="Forcibly stopping sandbox \"c00e66d8003921dff6c2f408a3b34c9df494bf694021b3455a49635d385bf611\""
Dec 24 09:46:15 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: 2024-12-24 09:46:15.92104592 +0000 UTC m=+0.005342552 write error: can't make directories for new logfile: mkdir /host: read-only file system
Dec 24 09:46:15 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: 2024-12-24 09:46:15.921147761 +0000 UTC m=+0.005444388 write error: can't make directories for new logfile: mkdir /host: read-only file system
Dec 24 09:46:15 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:46:15.925076206Z" level=info msg="TearDown network for sandbox \"c00e66d8003921dff6c2f408a3b34c9df494bf694021b3455a49635d385bf611\" successfully"
Dec 24 09:46:15 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:46:15.928024195Z" level=warning msg="Failed to get podSandbox status for container event for sandboxID \"c00e66d8003921dff6c2f408a3b34c9df494bf694021b3455a49635d385bf611\": an error occurred when try to find sandbox: not found. Sending the event with nil podSandboxStatus."
Dec 24 09:46:15 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:46:15.928079760Z" level=info msg="RemovePodSandbox \"c00e66d8003921dff6c2f408a3b34c9df494bf694021b3455a49635d385bf611\" returns successfully"
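
As a sketch of how such logs can be captured live before the node disappears, assuming SSM access and the Bottlerocket admin container enabled (this is the standard Bottlerocket workflow, not anything specific to this issue):

# From an SSM session into the control container:
enter-admin-container
# Inside the admin container, drop to a root shell on the host:
sudo sheltie
# Follow containerd and kubelet logs until the connection drops:
journalctl -f -u containerd.service -u kubelet.service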

How to reproduce the problem:

  1. Use the bottlerocket-aws-k8s-1.30-nvidia-x86_64-v1.26.2-360b7a38 AMI (this also happens with the newest build of the EKS 1.30 variant).
  2. Scale down a deployment so Karpenter disrupts the empty node, or simply drain the node; any way of triggering termination works (see the example drain command after this list).
  3. The EC2 instance then stays in the shutting-down state for a few minutes.
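
For step 2, a minimal way to trigger termination by hand (the node name is a placeholder matching the masked names above):

# Cordon and drain the node so the kubelet starts tearing down pods and
# Karpenter proceeds with termination.
kubectl drain ip-xx-xx-xx-xx.us-east-2.compute.internal \
  --ignore-daemonsets \
  --delete-emptydir-data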
sanjinp added the status/needs-triage and type/bug labels on Dec 24, 2024
cbgbt (Contributor) commented Dec 24, 2024

Thanks for this report! Do you think you could check the EC2 console logs for any suspicious messages?
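
A minimal way to pull those, assuming the instance ID of a recently terminated node is still known (the ID below is a placeholder):

# Fetch the most recent serial console output for the instance.
aws ec2 get-console-output \
  --instance-id i-0123456789abcdef0 \
  --latest \
  --output text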

I'll attempt to reproduce, but do you have any pointers about your environment that might be influencing the result? Details about any additional devices (volumes, network mounts) or network configuration that you think might be relevant would be useful.

sanjinp (Author) commented Dec 25, 2024

Hi @cbgbt, thanks for the prompt response!
Our setup seems ordinary, but I'm sending our template just in case:

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-object-detection
spec:
  instanceProfile: <instance_profile>
  subnetSelectorTerms:
    - tags:
        Name: "Service VPC Private*"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: secondary
  amiFamily: "Bottlerocket"
  amiSelectorTerms:
    - name: "bottlerocket-aws-k8s-1.30-x86_64-v1.26.2-360b7a38"
  userData: |
    [settings.boot.kernel-parameters]
    "crashkernel" = [
      "2G-:256M",
    ]
    [settings.kernel.sysctl]
    "net.core.rmem_max" = "16777216"
    "net.core.wmem_max" = "16777216"
    "kernel.core_pattern" = "/core-dumps/core-%p.%h.%t"
    [settings.ntp]
    "time-servers" = [
      "169.254.169.123",
      "time.google.com"
    ]
    [settings.kubernetes]
    "image-gc-low-threshold-percent" = "85"
    "image-gc-high-threshold-percent" = "90"
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 4Gi
        volumeType: gp3
        encrypted: true
    - deviceName: /dev/xvdb
      ebs:
        volumeSize: 40Gi
        volumeType: gp3
        encrypted: true

  tags:
    ManagedBy: "Karpenter"
    karpenter.sh/discovery: secondary
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-object-detection
spec:
  template:
    metadata:
      labels:
        node.kubernetes/role: "gpu"
        node.kubernetes/type: "internal"
        dedicated: object-detection
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-object-detection
      requirements:
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: ["on-demand"]
        - key: "node.kubernetes.io/instance-type"
          operator: In
          values:
            - "g4dn.xlarge"
            - "g4dn.2xlarge"
            - "g4dn.4xlarge"
            - "g5.xlarge"
            - "g5.2xlarge"
            - "g5.4xlarge"
      taints:
        - key: dedicated
          value: object-detection
          effect: NoSchedule
  limits:
    nvidia.com/gpu: 50
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30s

I have also checked the console log. The excerpt below starts from the first error noted; all previous lines reported OK, e.g.:

...
[FAILED] Failed unmounting Kernel Modules (Read-Write).
[  OK  ] Stopped Flush Journal to Persistent Storage.
[  OK  ] Stopped target Swaps.
         Unmounting /var/lib/kubelet...
[  OK  ] Stopped Prepare Opt Directory (/opt).
         Unmounting Opt Directory (/opt)...
[  OK  ] Stopped Prepare Var Directory (/var).
[  OK  ] Unmounted Opt Directory (/opt).
[  OK  ] Unmounted /var/lib/kubelet.
[  OK  ] Stopped target Preparation for Local File Systems.
         Unmounting Var Directory (/var)...
[  OK  ] Stopped Create Static Device Nodes in /dev.
[  OK  ] Stopped Create System Users.
[  OK  ] Stopped Remount Root and Kernel File Systems.
[  OK  ] Reached target System Shutdown.
[  OK  ] Unmounted Var Directory (/var).
         Unmounting Local Directory (/local)...
[  OK  ] Unmounted Local Directory (/local).
[  OK  ] Reached target Unmount All Filesystems.
[  OK  ] Reached target Late Shutdown Services.
[  OK  ] Finished System Power Off.
[  OK  ] Reached target System Power Off.
[  194.586797] (direxec)[5943]: Failed to execute /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system-shutdown/mdadm.shutdown: No such file or directory
[  194.649214] (sd-executor)[5940]: /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system-shutdown/mdadm.shutdown failed with exit status 1.
[  194.712196] systemd-shutdown[1]: Failed to finalize DM devices, ignoring.
[  195.564778] reboot: Power down

During the investigation I removed all pods that are used specifically by the nvidia machines, so the configuration was approximately the same as on our other node classes/pools (except that those run non-GPU instances).
Non-GPU instances shut down much faster, within a minute or so.
I have also tried reverting to the older AMI, but interestingly enough the instance shutdown still took 5-10 minutes.

So it's possible the slow shutdown was there even before, but was hidden by how Karpenter used to handle nodes (remove the node from the EKS cluster and leave the EC2 shutdown running in the background, as opposed to the new v1+ versions, where it marks the node NotReady and waits for the EC2 instance to be completely terminated), which is what triggered our alerts.

If you need any more info, I'll be happy to provide it.
