Additional notes about instanceStorePolicy for EC2 Node Class #7543

Open
bclodius opened this issue Dec 19, 2024 · 1 comment
Labels
documentation: Improvements or additions to documentation
triage/accepted: Indicates that the issue has been accepted as a valid issue

Comments

bclodius commented Dec 19, 2024

Background

We have a few disk-sensitive workloads that run on i3en instance types, which include NVMe SSD instance-store volumes. We recently hit disk space exhaustion because we weren't properly setting ephemeral-storage requests for these workloads. After we updated the ephemeral-storage requests to match the disk the workloads actually need, some of our pods would not schedule (they were stuck in Pending), and no new nodes were created to satisfy the ephemeral-storage needs.
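
For context, the kind of request we mean looks roughly like this; a minimal sketch with hypothetical names, image, and sizes:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: disk-heavy                              # hypothetical name
spec:
  containers:
    - name: app
      image: registry.example.com/app:latest    # hypothetical image
      resources:
        requests:
          # Sized to the instance-store capacity the workload needs. Without
          # instanceStorePolicy: RAID0, Karpenter only counts the EBS volumes
          # from blockDeviceMappings toward this request.
          ephemeral-storage: 500Gi
        limits:
          ephemeral-storage: 500Gi
```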

After a lot of debugging, we found the root cause: the ephemeral-storage request was larger than the default volumes in our node classes' blockDeviceMappings. We already RAID0 our instance-store volumes, but Karpenter did not know that, so the instance-store volumes were never considered during scheduling. Once we set instanceStorePolicy: RAID0 in our EC2NodeClass resource, Karpenter started creating new i3en nodes, which unblocked the workloads stuck in Pending.
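
For anyone who finds this later, a minimal sketch of the fix as we applied it; the resource name, IAM role, and discovery tags below are hypothetical, and your selector terms will differ:

```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: dense-storage                        # hypothetical name
spec:
  # Combine all instance-store volumes into a single RAID0 array and expose it
  # as node ephemeral storage, so Karpenter counts it during scheduling.
  instanceStorePolicy: RAID0
  amiSelectorTerms:
    - alias: al2023@latest                   # example AMI selection
  role: KarpenterNodeRole-example            # hypothetical IAM role
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster   # hypothetical discovery tag
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
```

With this in place, Karpenter advertises the RAID0 array's capacity as node ephemeral storage and can provision instance types like i3en to satisfy large ephemeral-storage requests.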

Feedback

Currently the documentation here mentions that Karpenter will ignore instance-store volumes unless the value is set to RAID0. It also says: "If you intend to use these volumes for faster node ephemeral-storage, set instanceStorePolicy to RAID0."

Ask: For future discoverability, it might be helpful to add some notes along these lines: this setting is likely to be useful for workloads that leverage dense-storage instance types or that need the low latency of NVMe-SSD-backed instance stores. Another note to add could be a warning: even if you already RAID0 your volumes, Karpenter won't know this up front during scheduling, which can lead to confusing behavior and deadlocked scheduling for workloads whose ephemeral-storage requests depend on instance-store volume size.

bclodius added the documentation and needs-triage labels on Dec 19, 2024
jmdeal (Contributor) commented Dec 23, 2024

This seems like a reasonable addition to me; we welcome contributions!

jmdeal added the triage/accepted label and removed the needs-triage label on Dec 23, 2024