if nodepool is configured to run both on spot and ondemand, add waitAndRetry period before it falls back to ondemand in case spot is not available. #7490

lasnoles · 2024-12-06T04:37:31Z

Description

When a Karpenter NodePool is configured to run on both on-demand and spot, Karpenter checks only once whether spot capacity is available. If it is not, Karpenter immediately falls back to on-demand.

This has led to a cost increase of about 20% for the business. The business is okay with waiting a few minutes before falling back to on-demand.

Can we please consider adding this feature in?

jonathan-innis · 2024-12-10T23:12:58Z

> The business is okay to wait for few minutes before it falls back to ondemand

What's been your experience with consolidation? Consolidation should basically completely mitigate this problem. Even if we pick the on-demand instance, we should go back to spot when it becomes available.

I don't think we are really going to consider waiting to launch capacity since there are really niche use-cases (if any) that require that you wait.

lasnoles · 2024-12-11T04:21:15Z

We have enabled consolidation. However, consolidation has the following problems:

1. It adds to the interruption rate, which causes some instability.
2. It doesn't seem to work as effectively as you described. When we compare CAS and Karpenter, Karpenter used more on-demand instances overall, especially when we configure the preferred (not mandatory) AZ load-balanced affinity.

The above observations are based on one month of data comparing CAS and Karpenter. We noticed that CAS gets more spot instances in general.

dom-raven · 2024-12-12T13:28:24Z

> The business is okay to wait for few minutes before it falls back to ondemand
>
> What's been your experience with consolidation? Consolidation should basically completely mitigate this problem. Even if we pick the on-demand instance, we should go back to spot when it becomes available.
>
> I don't think we are really going to consider waiting to launch capacity since there are really niche use-cases (if any) that require that you wait.

I would like to add a question to this. We actually have two node pools, one for on-demand and one for spot. The spot node pool is weighted at 50 and the on-demand one at 0.

What is the difference in having two node pools weighted vs having the one node pool with both capacity types in the array?
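For reference, the two layouts being compared can be sketched as NodePool-shaped Python dicts. This is only an illustration: the field names follow my understanding of the `karpenter.sh/v1` NodePool schema, while the pool names (`spot`, `on-demand`, `mixed`), the `node_pool` helper, and the `default` EC2NodeClass are hypothetical. The on-demand pool leaves `weight` unset, which I understand to be equivalent to a weight of 0.

```python
import json


def node_pool(name, capacity_types, weight=None):
    """Build a NodePool-shaped dict limited to the given capacity types.

    `name` and the nodeClassRef name are hypothetical placeholders.
    """
    spec = {
        "template": {
            "spec": {
                # Assumed EC2NodeClass called "default"; replace with a real one.
                "nodeClassRef": {
                    "group": "karpenter.k8s.aws",
                    "kind": "EC2NodeClass",
                    "name": "default",
                },
                "requirements": [
                    {
                        "key": "karpenter.sh/capacity-type",
                        "operator": "In",
                        "values": list(capacity_types),
                    }
                ],
            }
        }
    }
    if weight is not None:
        # Only emit a weight when one is given; unset is assumed to mean 0.
        spec["weight"] = weight
    return {
        "apiVersion": "karpenter.sh/v1",
        "kind": "NodePool",
        "metadata": {"name": name},
        "spec": spec,
    }


# Layout A: two pools, with the spot pool weighted above the on-demand pool.
weighted_pools = [
    node_pool("spot", ["spot"], weight=50),
    node_pool("on-demand", ["on-demand"]),
]

# Layout B: a single pool that allows both capacity types.
single_pool = node_pool("mixed", ["spot", "on-demand"])

print(json.dumps(weighted_pools + [single_pool], indent=2))
```

The only structural differences between the two layouts are the `weight` field and the `values` list of the `karpenter.sh/capacity-type` requirement.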

lasnoles · 2024-12-13T01:12:26Z

The two-NodePool approach is doable, but it is fixed-ratio in nature and would require us to adjust it manually based on spot availability in AWS.
If there is really no spot capacity, the EKS cluster's capacity cannot fulfill the business SLA, and we have to adjust manually.
With the waitAndRetry approach, we can control how long to wait for spot and fall back to on-demand automatically. This way, we avoid using too much on-demand while still ensuring the SLA is met.
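The requested behavior can be sketched roughly as follows. This is a standalone illustration of the idea, not Karpenter code or an existing Karpenter setting: `choose_capacity_type`, `wait_seconds`, `retry_interval`, and the `spot_available` probe are all hypothetical names. With `wait_seconds=0` the function checks once and falls back immediately, which matches the current behavior described in this issue.

```python
import time
from typing import Callable


def choose_capacity_type(
    spot_available: Callable[[], bool],
    wait_seconds: float,
    retry_interval: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return "spot" if spot shows up within wait_seconds, else "on-demand".

    All parameters are hypothetical; `spot_available` stands in for whatever
    probe reports that spot capacity can be launched.
    """
    if retry_interval <= 0:
        raise ValueError("retry_interval must be positive")
    deadline = clock() + wait_seconds
    while True:
        if spot_available():
            return "spot"
        remaining = deadline - clock()
        if remaining <= 0:
            # Wait budget exhausted: fall back so the SLA is still met.
            return "on-demand"
        # Never sleep past the deadline.
        sleep(min(retry_interval, remaining))


class FakeClock:
    """Deterministic clock so the sketch can be exercised without waiting."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


# Spot appears on the third check, well inside a 5-minute budget.
fake = FakeClock()
answers = iter([False, False, True])
result = choose_capacity_type(
    lambda: next(answers), wait_seconds=300, retry_interval=30,
    clock=fake.time, sleep=fake.sleep,
)
print(result, fake.now)
```

The wait budget bounds how long pods stay pending, so the trade-off between cost and SLA becomes a single tunable value instead of a fixed spot/on-demand ratio.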

lasnoles added the feature (New feature or request) and needs-triage (Issues that need to be triaged) labels on Dec 6, 2024

jonathan-innis added the triage/needs-information label (Marks that the issue still needs more information to properly triage) and removed the needs-triage label (Issues that need to be triaged) on Dec 10, 2024
