Additional notes about instanceStorePolicy for EC2 Node Class #7543

Open
bclodius opened this issue Dec 19, 2024 · 1 comment
Labels
documentation: Improvements or additions to documentation
triage/accepted: Indicates that the issue has been accepted as a valid issue

Comments

bclodius commented Dec 19, 2024

Background

We have a few disk-sensitive workloads that run on i3en instance types, which include NVMe SSD instance-store volumes. We recently hit disk space exhaustion because we weren't properly setting ephemeral-storage requests for these workloads. After we updated the ephemeral-storage requests to match the disk the workloads actually need, some of our pods would not schedule (they were stuck in Pending), and no new nodes were created to satisfy the ephemeral-storage needs.
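
For context, the kind of request we mean looks roughly like this; a minimal sketch with hypothetical names, image, and sizes:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: disk-heavy                              # hypothetical name
spec:
  containers:
    - name: app
      image: registry.example.com/app:latest    # hypothetical image
      resources:
        requests:
          # Sized to the instance-store capacity the workload needs. Without
          # instanceStorePolicy: RAID0, Karpenter only counts the EBS volumes
          # from blockDeviceMappings toward this request.
          ephemeral-storage: 500Gi
        limits:
          ephemeral-storage: 500Gi
```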

After a lot of debugging, we found the root cause: the ephemeral-storage request was larger than the default volumes in our node classes' blockDeviceMappings. We already RAID0 our instance-store volumes, but Karpenter did not know that, so the instance-store volumes were never considered during scheduling. Once we set instanceStorePolicy: RAID0 in our EC2NodeClass resource, Karpenter started creating new i3en nodes, which unblocked the workloads stuck in Pending.
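
For anyone who finds this later, a minimal sketch of the fix as we applied it; the resource name, IAM role, and discovery tags below are hypothetical, and your selector terms will differ:

```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: dense-storage                        # hypothetical name
spec:
  # Combine all instance-store volumes into a single RAID0 array and expose it
  # as node ephemeral storage, so Karpenter counts it during scheduling.
  instanceStorePolicy: RAID0
  amiSelectorTerms:
    - alias: al2023@latest                   # example AMI selection
  role: KarpenterNodeRole-example            # hypothetical IAM role
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster   # hypothetical discovery tag
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
```

With this in place, Karpenter advertises the RAID0 array's capacity as node ephemeral storage and can provision instance types like i3en to satisfy large ephemeral-storage requests.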

Feedback

Currently the documentation here mentions that Karpenter will ignore instance-store volumes unless the value is set to RAID0. It also says: "If you intend to use these volumes for faster node ephemeral-storage, set instanceStorePolicy to RAID0."

Ask: For future discoverability, it might be helpful to add some notes along these lines: this setting is likely to be useful for workloads that leverage dense-storage instance types or that need the low latency of NVMe-SSD-backed instance stores. Another note to add could be a warning: even if you already RAID0 your volumes, Karpenter won't know this up front during scheduling, which can lead to confusing behavior and deadlocked scheduling for workloads whose ephemeral-storage requests depend on instance-store volume size.

bclodius added the documentation and needs-triage labels on Dec 19, 2024
jmdeal (Contributor) commented Dec 23, 2024

This seems like a reasonable addition to me; we welcome contributions!

jmdeal added the triage/accepted label and removed the needs-triage label on Dec 23, 2024