Expose metric with volumes stuck detaching #2255
Comments
/feature I will bring this up with my team. I agree that some metric for this 'volume not detaching' case would be useful for operators. Would a metric for "Pods stuck in ContainerCreating due to FailedAttachVolume for more than x seconds" also solve this problem? There are many possible sources of this "Volume stuck detaching" issue, each with different symptoms, yet the main pain point seems to be when a stateful workload migrates nodes and can't start. Or perhaps the 'awaiting detach' metric you proposed could have different labels for different cases:
Sidenote: Are there other cases that you have seen with recent EBS CSI Driver versions that we are not aware of? IIRC my team had believed that most of #1302's reports were either fixed in the driver, or due to autoscalers not waiting for volume unmounts / detaches before terminating instances. We believe we've solved these autoscaler races via a pre-stop hook or PRs on relevant autoscalers (like Karpenter via kubernetes-sigs/karpenter#1294)
Team agreed that this is worth prioritizing, we'll keep you in the design loop, thanks @jsafrane
/priority important-longterm
Is your feature request related to a problem? Please describe.
Sometimes a volume gets stuck in the detaching state, most often because the host it is attached to is unhealthy. A careful, manually initiated force detach and/or node shutdown may be needed in that case.
The CSI driver should expose a metric about what volumes (with PV/PVC as a label?) are waiting for detach longer than X seconds (minutes?), so a cluster admin can set up an alert and manually investigate what is going on.
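To make the request concrete, here is a minimal sketch of the bookkeeping such a metric might need. It is not the driver's code (the driver itself is written in Go), and everything in it is an assumption made for illustration: the `DetachTracker` class, the metric name `ebs_csi_volume_detach_pending_seconds`, and the `volume_id` / `node_id` labels are all hypothetical. It records when a detach was first requested and reports, in Prometheus text format, only the volumes that have been waiting longer than the threshold.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class DetachTracker:
    """Hypothetical helper: remembers when each volume detach was first requested."""

    threshold_seconds: float
    # Injectable clock so the logic can be exercised without real waiting.
    clock: Callable[[], float] = time.monotonic
    _started: Dict[Tuple[str, str], float] = field(default_factory=dict)

    def detach_requested(self, volume_id: str, node_id: str) -> None:
        # Keep the first request time so that retries do not reset the timer.
        self._started.setdefault((volume_id, node_id), self.clock())

    def detach_finished(self, volume_id: str, node_id: str) -> None:
        # A completed detach is no longer pending; unknown keys are ignored.
        self._started.pop((volume_id, node_id), None)

    def stuck(self) -> List[Tuple[str, str, float]]:
        # Volumes whose detach has been pending longer than the threshold.
        now = self.clock()
        return sorted(
            (vol, node, now - start)
            for (vol, node), start in self._started.items()
            if now - start > self.threshold_seconds
        )

    def render(self) -> str:
        # Prometheus text exposition format; the metric name is an assumption.
        name = "ebs_csi_volume_detach_pending_seconds"
        lines = [f"# TYPE {name} gauge"]
        for vol, node, age in self.stuck():
            lines.append(f'{name}{{volume_id="{vol}",node_id="{node}"}} {age:.0f}')
        return "\n".join(lines) + "\n"
```

An alert could then fire whenever this series exists for any volume. Whether PV/PVC names are better labels than volume and node IDs is left open here, since per-volume labels raise cardinality.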
Describe alternatives you've considered
A similar metric in the kube-controller-manager or the external-attacher could be a good alternative; still, we've seen such an issue only with AWS EBS.
Additional context
See #1302 for an example of such an issue.