Additional notes about instanceStorePolicy for EC2 Node Class #7543
Labels: `documentation` (Improvements or additions to documentation), `triage/accepted` (Indicates that the issue has been accepted as a valid issue)
Background

We have a few disk-sensitive workloads that leverage `i3en` instance types. `i3en` instance types include an NVMe SSD instance-store volume. Recently we faced some disk space exhaustion issues because we weren't properly setting `ephemeral-storage` requests for our workload. After we updated the `ephemeral-storage` request to appropriately match the disk needed by the workload, some of our pods would not schedule (stuck in Pending) and new nodes would not get created to satisfy the `ephemeral-storage` needs.

After a lot of debugging we found the root cause: the `ephemeral-storage` request was larger than the default volumes in our node classes' `blockDeviceMappings`. We already RAID0 our instance-store volumes, but Karpenter did not know that, so instance-store volumes were never considered during scheduling. Once we set `instanceStorePolicy: RAID0` in our `ec2nodeclass` resource, it started creating new `i3en` nodes, which unblocked our workload that was stuck in Pending.

Feedback
Currently the documentation here mentions that Karpenter will ignore instance-store volumes unless the value is set to `RAID0`. It also mentions:

> If you intend to use these volumes for faster node ephemeral-storage, set instanceStorePolicy to RAID0

Ask: For future discoverability it might be helpful if we add some additional notes along the lines of:

> This setting is likely to be useful for workloads that leverage dense-storage instance types or require the low latency of NVMe-SSD-based instance stores.

Another note to add could be:

> Warning: Even if you already RAID0 your volumes, Karpenter won't know this up front during scheduling, which can lead to confusing behavior and deadlocked scheduling for workloads whose ephemeral-storage requests depend on instance-store volume size.
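For readers hitting the same issue, a minimal sketch of the two pieces of configuration involved may help. All names, selector tags, images, and sizes below are illustrative placeholders (not taken from this issue); only `instanceStorePolicy: RAID0` and the `ephemeral-storage` request are the settings under discussion.

```yaml
# Hypothetical EC2NodeClass: instanceStorePolicy: RAID0 tells Karpenter to
# treat the instance-store volumes (assembled into a RAID0 array) as node
# ephemeral-storage capacity during scheduling.
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: dense-storage                        # placeholder name
spec:
  amiSelectorTerms:
    - alias: al2023@latest                   # example AMI selection
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster   # placeholder discovery tag
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster   # placeholder discovery tag
  instanceStorePolicy: RAID0                 # without this, instance-store capacity is ignored
---
# Hypothetical workload requesting ephemeral-storage; without the policy
# above, a request larger than the EBS volumes in blockDeviceMappings
# leaves the pod stuck in Pending.
apiVersion: v1
kind: Pod
metadata:
  name: disk-heavy                           # placeholder name
spec:
  containers:
    - name: app
      image: example/app:latest              # placeholder image
      resources:
        requests:
          ephemeral-storage: "5000Gi"        # illustrative size exceeding the default EBS volumes
```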