
unwanted second node scheduled after ~150 seconds during initial node provisioning #7515

lefterisALEX opened this issue Dec 11, 2024 · 4 comments

@lefterisALEX

Description

We are trying to set up Karpenter and currently have only one NodePool, which launches GPU nodes when needed.
The issue described below might not be specific to GPU nodes, but rather to any node whose bootstrap takes more time.

Observed Behavior:
When Karpenter requests a new node, it is launched, and after around 150 seconds we noticed that a second node (ip-10-28-112-201.eu-west-1.compute.internal) was also launched and joined the cluster.

❯  kubectl get nodes
NAME                                          STATUS   ROLES    AGE     VERSION
ip-10-28-103-105.eu-west-1.compute.internal   Ready    <none>   3h14m   v1.29.8-eks-a737599
ip-10-28-106-68.eu-west-1.compute.internal    Ready    <none>   3h14m   v1.29.8-eks-a737599
ip-10-28-109-44.eu-west-1.compute.internal    Ready    <none>   3h14m   v1.29.8-eks-a737599
ip-10-28-109-80.eu-west-1.compute.internal    Ready    <none>   3h14m   v1.29.8-eks-a737599
ip-10-28-112-201.eu-west-1.compute.internal   Ready    <none>   36s     v1.29.8-eks-a737599
ip-10-28-118-20.eu-west-1.compute.internal    Ready    <none>   3h14m   v1.29.8-eks-a737599
ip-10-28-119-29.eu-west-1.compute.internal    Ready    <none>   3h14m   v1.29.8-eks-a737599
ip-10-28-126-122.eu-west-1.compute.internal   Ready    <none>   3h14m   v1.29.8-eks-a737599
ip-10-28-126-186.eu-west-1.compute.internal   Ready    <none>   3h14m   v1.29.8-eks-a737599
ip-10-28-82-193.eu-west-1.compute.internal    Ready    <none>   3h14m   v1.29.8-eks-a737599
ip-10-28-88-152.eu-west-1.compute.internal    Ready    <none>   3h14m   v1.29.8-eks-a737599
ip-10-28-91-37.eu-west-1.compute.internal     Ready    <none>   3h14m   v1.29.8-eks-a737599
ip-10-28-93-241.eu-west-1.compute.internal    Ready    <none>   3h14m   v1.29.8-eks-a737599
ip-10-28-99-95.eu-west-1.compute.internal     Ready    <none>   2m58s   v1.29.8-eks-a737599

We could also see a second NodeClaim:

❯  kubectl get nodeclaim
NAME        TYPE        CAPACITY    ZONE         NODE                                          READY     AGE
gpu-2mz56   g5.xlarge   on-demand   eu-west-1c   ip-10-28-99-95.eu-west-1.compute.internal     True      4m52s
gpu-z5gcl   g5.xlarge   on-demand   eu-west-1a   ip-10-28-112-201.eu-west-1.compute.internal   Unknown   2m30s
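
One way to check why the second NodeClaim is stuck with READY Unknown is to inspect its status conditions (standard kubectl, shown only as an illustrative sketch; this is not output from the attached logs):

❯  kubectl describe nodeclaim gpu-z5gcl
❯  kubectl get nodeclaim gpu-z5gcl -o jsonpath='{.status.conditions}'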

Attached files and logs
nodeclaims.yaml.zip
karpenter.log

Expected Behavior:

Perhaps this is behavior as designed? If that's the case, it would be nice if we could configure how long Karpenter waits before considering a node unhealthy and requesting a new NodeClaim.

Reproduction Steps (Please include YAML):

NodePool:

❯  k get nodepool -oyaml
apiVersion: v1
items:
- apiVersion: karpenter.sh/v1
  kind: NodePool
  metadata:
    annotations:
      karpenter.sh/nodepool-hash: "11244393610233646919"
      karpenter.sh/nodepool-hash-version: v3
      karpenter.sh/stored-version-migrated: "true"
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"karpenter.sh/v1","kind":"NodePool","metadata":{"annotations":{},"name":"gpu"},"spec":{"disruption":{"consolidateAfter":"1m","consolidationPolicy":"WhenEmpty","expireAfter":"720h"},"limits":{"cp
    creationTimestamp: "2024-12-11T11:16:36Z"
    generation: 1
    name: gpu
    resourceVersion: "296449"
    uid: 9b332214-bd85-4f46-933c-58e37b58b6c0
  spec:
    disruption:
      budgets:
      - nodes: 10%
      consolidateAfter: 1m
      consolidationPolicy: WhenEmpty
    limits:
      cpu: 99
      nvidia.com/gpu: 10
    template:
      metadata:
        labels:
          k8s.amazonaws.com/accelerator: nvidia
          nvidia.com/gpu: "true"
      spec:
        expireAfter: 720h
        nodeClassRef:
          group: karpenter.k8s.aws
          kind: EC2NodeClass
          name: gpu-worker
        requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values:
          - on-demand
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values:
          - g
        - key: karpenter.k8s.aws/instance-cpu
          operator: In
          values:
          - "4"
        - key: karpenter.k8s.aws/instance-hypervisor
          operator: In
          values:
          - nitro
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values:
          - "4"
        - key: kubernetes.io/arch
          operator: In
          values:
          - amd64
        startupTaints:
        - effect: NoExecute
          key: node.cilium.io/agent-not-ready
          value: "true"
        taints:
        - effect: NoSchedule
          key: nvidia.com/gpu

The YAML we use to create the pod that needs a GPU:

❯  k get pods -n test-nvidia-ffc57y -oyaml
apiVersion: v1
items:
- apiVersion: v1
  kind: Pod
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"name":"test-cuda-vectoradd-0vgcic","namespace":"test-nvidia-ffc57y"},"spec":{"containers":[{"image":"nvidia/samples:vectoradd-cuda11.2.1","name":"
    creationTimestamp: "2024-12-11T15:16:07Z"
    name: test-cuda-vectoradd-0vgcic
    namespace: test-nvidia-ffc57y
    resourceVersion: "294977"
    uid: 88a2fcd4-39fc-45f9-b8b1-bab8877f39a5
  spec:
    containers:
    - image: nvidia/samples:vectoradd-cuda11.2.1
      imagePullPolicy: IfNotPresent
      name: test-cuda-vectoradd-0vgcic
      resources:
        limits:
          nvidia.com/gpu: "1"
        requests:
          nvidia.com/gpu: "1"
      securityContext:
        runAsUser: 1000
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
      - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
        name: kube-api-access-9tpx4
        readOnly: true
    dnsPolicy: ClusterFirst
    enableServiceLinks: true
    preemptionPolicy: PreemptLowerPriority
    priority: 0
    restartPolicy: OnFailure
    schedulerName: default-scheduler
    securityContext: {}
    serviceAccount: default
    serviceAccountName: default
    terminationGracePeriodSeconds: 30
    tolerations:
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
      tolerationSeconds: 300
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 300
    - effect: NoSchedule
      key: nvidia.com/gpu
      operator: Exists
    volumes:
    - name: kube-api-access-9tpx4
      projected:
        defaultMode: 420
        sources:
        - serviceAccountToken:
            expirationSeconds: 3607
            path: token
        - configMap:
            items:
            - key: ca.crt
              path: ca.crt
            name: kube-root-ca.crt
        - downwardAPI:
            items:
            - fieldRef:
                apiVersion: v1
                fieldPath: metadata.namespace
              path: namespace
  status:
    conditions:
    - lastProbeTime: null
      lastTransitionTime: "2024-12-11T15:16:07Z"
      message: '0/12 nodes are available: 3 node(s) had untolerated taint {node.kubernetes.io/ingress:
        }, 3 node(s) had untolerated taint {node.kubernetes.io/system: }, 6 Insufficient
        nvidia.com/gpu. preemption: 0/12 nodes are available: 6 No preemption victims
        found for incoming pod, 6 Preemption is not helpful for scheduling.'
      reason: Unschedulable
      status: "False"
      type: PodScheduled
    phase: Pending
    qosClass: BestEffort
kind: List
metadata:
  resourceVersion: ""

Versions:

  • Chart Version: 1.0.7
  • Kubernetes Version (kubectl version): 1.29
@jigisha620
Contributor

Hi @lefterisALEX,
From the logs you have shared, it doesn't look like the node became unhealthy. It was disrupted because it was empty. This is something that can be configured through the NodePool's disruption block. You can increase this time or set it to Never if you don't want consolidation to occur.

consolidateAfter: 1m | Never

You can read more about consolidation here - https://karpenter.sh/v1.0/concepts/disruption/#consolidation
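
For example, a minimal sketch of the relevant NodePool fields (values are illustrative only):

  spec:
    disruption:
      consolidationPolicy: WhenEmpty
      consolidateAfter: Never   # or a longer duration such as 30m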

@lefterisALEX
Author

Thanks for the quick reply. I haven't tried setting Never, but we see the same issue if we set consolidateAfter: 30m, for example.
A second GPU node is still launched within 3 minutes.

@lefterisALEX
Author

OK, I think I figured out something more. After the node is bootstrapped, we apply some changes to /etc/systemd/system/kubelet.service and stop/start kubelet. The node becomes NotReady for a few seconds, and that triggers Karpenter to consider the node unhealthy, create a new NodeClaim, and launch a new node. Is the time Karpenter waits before considering a node unhealthy something we can configure?
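
Roughly what happens on the node after bootstrap (a sketch of our tooling; the exact change to kubelet.service is not relevant here, the brief restart is):

  # applied by our configuration tooling once the node has already joined
  sudo systemctl daemon-reload
  sudo systemctl stop kubelet
  sudo systemctl start kubelet   # node reports NotReady for a few seconds here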

@jigisha620
Contributor

jigisha620 commented Dec 13, 2024

Karpenter should ideally schedule the pod to the existing NodeClaim and not create a new one. You could be running into an issue similar to the one discussed here - #6355
