If a NodePool is configured to run on both spot and on-demand, add a waitAndRetry period before it falls back to on-demand when spot is not available. #7490

Open
lasnoles opened this issue Dec 6, 2024 · 4 comments
Labels
feature New feature or request triage/needs-information Marks that the issue still needs more information to properly triage

Comments

@lasnoles

lasnoles commented Dec 6, 2024

Description

When a Karpenter NodePool is configured to run on both on-demand and spot, Karpenter checks only once whether spot capacity is available. If it is not, it immediately falls back to on-demand.

This has led to a cost increase of about 20% for the business. The business is okay to wait a few minutes before it falls back to on-demand.

Can we please consider adding this feature in?
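
To make the request concrete, here is a minimal sketch of the behavior being asked for, written in Python only for illustration. It is not Karpenter code and does not use any Karpenter API: `choose_capacity_type`, `LaunchDecision`, `spot_available`, `wait_seconds`, and `retry_interval_seconds` are all hypothetical names. The assumption is that something can answer "is spot available right now?", and that the wait budget is configurable.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LaunchDecision:
    capacity_type: str      # "spot" or "on-demand"
    waited_seconds: float   # total time spent waiting for spot
    attempts: int           # number of spot availability checks made


def choose_capacity_type(
    spot_available: Callable[[], bool],   # hypothetical availability probe
    wait_seconds: float,                  # hypothetical waitAndRetry budget
    retry_interval_seconds: float,        # hypothetical pause between checks
    sleep: Callable[[float], None],       # injected so the sketch is testable
) -> LaunchDecision:
    """Retry spot until the wait budget is spent, then fall back to on-demand."""
    if wait_seconds < 0 or retry_interval_seconds <= 0:
        raise ValueError("wait_seconds must be >= 0 and retry interval > 0")

    waited = 0.0
    attempts = 0
    while True:
        attempts += 1
        if spot_available():
            return LaunchDecision("spot", waited, attempts)
        if waited >= wait_seconds:
            # Budget exhausted: fall back so the SLA is still met.
            return LaunchDecision("on-demand", waited, attempts)
        # Never sleep past the end of the budget.
        pause = min(retry_interval_seconds, wait_seconds - waited)
        sleep(pause)
        waited += pause


if __name__ == "__main__":
    # Simulated probe: spot is unavailable twice, then available.
    answers = iter([False, False, True])
    decision = choose_capacity_type(
        spot_available=lambda: next(answers),
        wait_seconds=180,
        retry_interval_seconds=60,
        sleep=lambda seconds: None,  # no real sleeping in the demo
    )
    print(decision)
```

With `wait_seconds=0` the sketch checks spot once and falls back immediately, which matches the behavior described above. A positive budget bounds how long pods can stay pending before on-demand is used.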

@lasnoles lasnoles added feature New feature or request needs-triage Issues that need to be triaged labels Dec 6, 2024
@jonathan-innis
Contributor

> The business is okay to wait for few minutes before it falls back to ondemand

What's been your experience with consolidation? Consolidation should basically completely mitigate this problem. Even if we pick the on-demand instance, we should go back to spot when it becomes available.

I don't think we are really going to consider waiting to launch capacity since there are really niche use-cases (if any) that require that you wait.

@jonathan-innis jonathan-innis added triage/needs-information Marks that the issue still needs more information to properly triage and removed needs-triage Issues that need to be triaged labels Dec 10, 2024
@lasnoles
Author

We have enabled consolidation. However, consolidation has the following problems:

  1. It adds to the interruption rate, which causes some instability.
  2. It doesn't seem to work as effectively as you described. When we compare CAS (Cluster Autoscaler) and Karpenter, Karpenter used more on-demand instances overall, especially when we configure preferred (not mandatory) AZ load-balanced affinity.
    These observations are based on a one-month data comparison between CAS and Karpenter. We noticed that CAS gets more spot instances in general.

@dom-raven

> > The business is okay to wait for few minutes before it falls back to ondemand
>
> What's been your experience with consolidation? Consolidation should basically completely mitigate this problem. Even if we pick the on-demand instance, we should go back to spot when it becomes available.
>
> I don't think we are really going to consider waiting to launch capacity since there are really niche use-cases (if any) that require that you wait.

I would like to add a question to this. We actually have two node pools, one for on-demand and one for spot; the spot node pool is weighted at 50 and the other at 0.

What is the difference between having two weighted node pools and having one node pool with both capacity types in the array?

@lasnoles
Author

The two-NodePool approach is doable, but it is fixed-ratio in nature and would require us to adjust it manually based on spot availability in AWS.
If there is really no spot capacity, the EKS cluster's capacity cannot fulfill the business SLA and we have to adjust manually.
With the waitAndRetry approach, we can control how long to wait for spot and fall back to on-demand automatically. This both solves the problem of using too much on-demand and ensures the SLA is met.

3 participants