We are trying to set up Karpenter and currently have only one node pool, which launches GPU nodes when needed.
The issue we describe below might not be related to the GPU nodes specifically, but to nodes whose bootstrap takes more time in general.
Observed Behavior:
When Karpenter requests a new node, it is launched, and after around 150 seconds we noticed a second node (ip-10-28-112-201.eu-west-1.compute.internal) is also launched and joins the cluster.
❯ kubectl get nodeclaim
NAME        TYPE        CAPACITY    ZONE         NODE                                          READY     AGE
gpu-2mz56   g5.xlarge   on-demand   eu-west-1c   ip-10-28-99-95.eu-west-1.compute.internal     True      4m52s
gpu-z5gcl   g5.xlarge   on-demand   eu-west-1a   ip-10-28-112-201.eu-west-1.compute.internal   Unknown   2m30s
Perhaps this is behavior as designed? If that's the case, it would be nice if we could configure how long Karpenter waits before considering a node unhealthy and requesting a new node claim.
Hi @lefterisALEX ,
From the logs that you have shared, it doesn't look like the node became unhealthy. It was disrupted because it was empty. This is something that can be configured through the NodePool's disruption block. You can increase this time or set it to Never if you don't want consolidation to occur.
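For reference, the disruption block lives under spec.disruption in the NodePool. A minimal sketch (the NodePool name and the GPU requirement are illustrative, not taken from the attached manifests):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu          # hypothetical name
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
  disruption:
    # Only consider nodes for consolidation once they are empty.
    consolidationPolicy: WhenEmpty
    # How long a node must stay empty before Karpenter disrupts it.
    # Set to "Never" to disable consolidation entirely.
    consolidateAfter: 30m
```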
Thanks for the quick reply. I haven't tried setting Never, but we still see the issue if we set consolidateAfter: 30m, for example.
A second GPU node is still launched within 3 minutes.
OK, I think I figured out something more. After the node is bootstrapped, we apply some changes to /etc/systemd/system/kubelet.service and stop/start kubelet. The node becomes NotReady for a few seconds, which triggers Karpenter to consider that node unhealthy and create a new nodeclaim, so a new node is created. Is the time Karpenter waits before considering a node unhealthy something we can configure?
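One way to avoid the NotReady flap entirely is to apply the kubelet changes before kubelet first starts, via userData on the EC2NodeClass, instead of patching the unit and restarting after the node has joined. A rough sketch under stated assumptions: the AMI runs user-supplied bash before kubelet starts, and the drop-in shown (name, setting, and the gpu EC2NodeClass name) is hypothetical — substitute whatever you currently patch into kubelet.service:

```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu          # hypothetical name
spec:
  amiSelectorTerms:
    - alias: al2@latest
  userData: |
    #!/bin/bash
    # Write a systemd drop-in before kubelet's first start, so no
    # restart (and no NotReady window) is needed after joining.
    mkdir -p /etc/systemd/system/kubelet.service.d
    cat <<'EOF' > /etc/systemd/system/kubelet.service.d/10-custom.conf
    [Service]
    # Hypothetical override; replace with your actual changes.
    Environment="KUBELET_EXTRA_ARGS=--max-pods=58"
    EOF
    systemctl daemon-reload
```

A drop-in under kubelet.service.d also survives AMI updates better than editing the unit file in place.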
Karpenter should ideally schedule pods to the existing NodeClaim and not create a new one. You could be running into a similar issue discussed here - #6355
Description
As described above; in addition to the second node, we could also see a second nodeclaim.
Attached files and logs
nodeclaims.yaml.zip
karpenter.log
Expected Behavior:
Perhaps this is behavior as designed? If so, it would be nice to be able to configure how long Karpenter waits before considering a node unhealthy and requesting a new node claim.
Reproduction Steps (Please include YAML):
NodePool
The YAML file we use to create the pod which needs a GPU
Versions:
Karpenter: 1.0.7
Kubernetes (kubectl version): 1.29