Karpenter 1.1.0 not scaling down empty nodes #7466

Open
bianchi2 opened this issue Dec 2, 2024 · 4 comments
Labels
bug Something isn't working lifecycle/stale triage/needs-information Marks that the issue still needs more information to properly triage

Comments

bianchi2 commented Dec 2, 2024

Description

After upgrading to 1.1.0, I noticed that Karpenter does not delete NodeClaims or terminate nodes. My NodePool disruption settings:

spec:
  disruption:
    budgets:
    - nodes: 10%
    consolidateAfter: 5m
    consolidationPolicy: WhenEmpty

When there are no pods on the node (excluding DaemonSet pods), Karpenter marks the node for deletion:

{"level":"INFO","time":"2024-12-01T22:55:33.696Z","logger":"controller","message":"disrupting nodeclaim(s) via delete, terminating 1 nodes (0 pods) ip-10-227-188-70.eu-west-1.compute.internal/t3a.xlarge/spot","commit":"a2875e3","controller":"disruption","namespace":"","name":"","reconcileID":"2f85dc51-06d5-44a2-9715-58a307a44f80","command-id":"f978a60a-173c-429c-b331-14925fd92687","reason":"empty"}
{"level":"INFO","time":"2024-12-01T22:55:34.066Z","logger":"controller","message":"tainted node","commit":"a2875e3","controller":"node.termination","controllerGroup":"","controllerKind":"Node","Node":{"name":"ip-10-227-188-70.eu-west-1.compute.internal"},"namespace":"","name":"ip-10-227-188-70.eu-west-1.compute.internal","reconcileID":"6918b3ab-4a5c-4baf-b00f-f3a170a6339b","taint.Key":"karpenter.sh/disrupted","taint.Value":"","taint.Effect":"NoSchedule"}

And indeed, the node is tainted. However, nothing happens after that. Nodeclaim is never deleted, and as a result empty nodes are just hanging around unschedulable (because of the taint).

I enabled debug logs but found nothing there: no errors or warnings. I am 100% certain there are no non-DaemonSet pods on the dangling nodes. Scale-down does happen occasionally, maybe 1 time out of 10. In every test I scheduled 2 pods so that Karpenter created 2 nodes. There are no other Karpenter-managed nodes in the cluster.
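For anyone comparing their own logs against the two entries quoted above, here is a minimal Python sketch that pulls out every entry mentioning a given node and orders them by time. The log lines are the ones from this report, trimmed to the fields the script reads. The helper names (`parse_time`, `events_for_node`) are made up for illustration and are not part of Karpenter.

```python
import json
from datetime import datetime

# The two log entries from the report, trimmed to the fields used below.
LOG_LINES = [
    '{"level":"INFO","time":"2024-12-01T22:55:33.696Z","controller":"disruption",'
    '"message":"disrupting nodeclaim(s) via delete, terminating 1 nodes (0 pods) '
    'ip-10-227-188-70.eu-west-1.compute.internal/t3a.xlarge/spot","reason":"empty"}',
    '{"level":"INFO","time":"2024-12-01T22:55:34.066Z","controller":"node.termination",'
    '"message":"tainted node","Node":{"name":"ip-10-227-188-70.eu-west-1.compute.internal"},'
    '"taint.Key":"karpenter.sh/disrupted","taint.Effect":"NoSchedule"}',
]


def parse_time(value):
    # Timestamps in these logs are RFC 3339 with milliseconds and a trailing "Z".
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%f%z")


def events_for_node(lines, node_name):
    """Return (time, controller, message) tuples for entries that mention node_name."""
    events = []
    for line in lines:
        entry = json.loads(line)
        in_message = node_name in entry.get("message", "")
        in_node_field = entry.get("Node", {}).get("name") == node_name
        if in_message or in_node_field:
            events.append((parse_time(entry["time"]), entry["controller"], entry["message"]))
    return sorted(events)


NODE = "ip-10-227-188-70.eu-west-1.compute.internal"
events = events_for_node(LOG_LINES, NODE)
controllers = [controller for _, controller, _ in events]
for when, controller, message in events:
    print(when.isoformat(), controller, message[:40])
```

Run against a full log dump, the same filter would show whether anything at all is logged for the node after the taint is applied. In this report the trail ends with the `node.termination` entry.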

@bianchi2 bianchi2 added bug Something isn't working needs-triage Issues that need to be triaged labels Dec 2, 2024
@rknightion

What budgets do you have set on the nodepool?

bianchi2 commented Dec 2, 2024

That's my nodepool spec:

  disruption:
    budgets:
    - nodes: 10%
    consolidateAfter: 5m
    consolidationPolicy: WhenEmpty
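Some context on why the budget matters with only 2 nodes. The sketch below is based on one reading of the Karpenter disruption budget documentation, which describes a percentage budget as rounded up and then reduced by the number of nodes already deleting or NotReady. Treat that formula as an assumption and verify it against the Karpenter version in use. The function name and parameters are hypothetical.

```python
def allowed_disruptions(total_nodes, budget, deleting=0, not_ready=0):
    """Nodes a single budget would allow to be disrupted right now.

    ASSUMPTION: allowed = roundup(total * percentage) - deleting - not_ready,
    as the Karpenter docs appear to describe it. Not taken from Karpenter's code.
    """
    if budget.endswith("%"):
        percent = int(budget[:-1])
        # Integer ceiling division avoids floating-point rounding surprises.
        limit = -(-total_nodes * percent // 100)
    else:
        limit = int(budget)
    return max(0, limit - deleting - not_ready)


# Two nodes and a 10% budget, as in this report.
print(allowed_disruptions(2, "10%"))              # one node may be disrupted
print(allowed_disruptions(2, "10%", deleting=1))  # budget exhausted while one is deleting
```

If that reading is right, a single node stuck in deletion would use up the whole budget for a 2-node NodePool, so the second empty node would not be disrupted until the first one finishes terminating.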

@jonathan-innis
Contributor

And indeed, the node is tainted. However, nothing happens after that. Nodeclaim is never deleted, and as a result empty nodes are just hanging around unschedulable

Can you share the NodeClaim and Node manifests (the -o yaml output)? If you have any apiserver audit logs that show what Karpenter acted on, those would also be helpful. Consider grabbing a full dump of the Karpenter logs as well, along with the describe output for both the NodeClaim and the Node.
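A small Python sketch of how the requested output could be collected. It only builds and prints the `kubectl` commands and does not run them. The NodeClaim name, the `kube-system` namespace, and the `karpenter` deployment name are assumptions that depend on the installation; only the node name comes from the logs above.

```python
import shlex


def diagnostic_commands(nodeclaim, node, namespace="kube-system", deployment="karpenter"):
    """Build (but do not run) kubectl invocations for the requested diagnostics.

    namespace and deployment are ASSUMED defaults; adjust them to your install.
    """
    return [
        ["kubectl", "get", "nodeclaim", nodeclaim, "-o", "yaml"],
        ["kubectl", "get", "node", node, "-o", "yaml"],
        ["kubectl", "describe", "nodeclaim", nodeclaim],
        ["kubectl", "describe", "node", node],
        # Note: targeting a deployment reads logs from one of its pods only.
        ["kubectl", "logs", "-n", namespace, f"deployment/{deployment}", "--all-containers"],
    ]


commands = diagnostic_commands(
    nodeclaim="example-nodeclaim",  # hypothetical name, not given in the report
    node="ip-10-227-188-70.eu-west-1.compute.internal",
)
for argv in commands:
    print(shlex.join(argv))
```

Redirecting each printed command to its own file gives a bundle that can be attached to the issue. The apiserver audit logs are not covered here because how to reach them depends on the cluster provider.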

@jonathan-innis jonathan-innis added triage/needs-information Marks that the issue still needs more information to properly triage and removed needs-triage Issues that need to be triaged labels Dec 10, 2024

This issue has been inactive for 14 days. StaleBot will close this stale issue after 14 more days of inactivity.


3 participants