
Expose metric with volumes stuck detaching #2255

Open
jsafrane opened this issue Dec 9, 2024 · 3 comments
Labels
priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.

Comments

@jsafrane
Contributor

jsafrane commented Dec 9, 2024

Is your feature request related to a problem? Please describe.
Sometimes a volume gets stuck in the detaching state, usually because the host where it is attached is unhealthy. A careful, manually initiated force detach and/or node shutdown may be needed in that case.

The CSI driver should expose a metric listing the volumes (with PV/PVC as a label?) that have been waiting for detach longer than X seconds (minutes?), so a cluster admin can set up an alert and manually investigate what is going on.
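For illustration only, a minimal client_golang sketch of what such a metric could look like; the metric name, labels, threshold, and helper functions are placeholders, not a concrete design:

```go
// Placeholder sketch only: metric name, labels, threshold, and helpers are invented.
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// volumesStuckDetaching reports, per PV/PVC, volumes whose detach has been
// pending longer than detachThreshold.
var volumesStuckDetaching = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "csi_volumes_stuck_detaching", // placeholder name
		Help: "Volumes waiting for detach longer than the configured threshold.",
	},
	[]string{"persistentvolume", "persistentvolumeclaim"},
)

// detachThreshold stands in for the "X seconds (minutes?)" above.
const detachThreshold = 2 * time.Minute

func init() {
	prometheus.MustRegister(volumesStuckDetaching)
}

// markStuckDetach would be called from a reconcile loop once a detach
// operation has been pending longer than detachThreshold.
func markStuckDetach(pvName, pvcName string) {
	volumesStuckDetaching.WithLabelValues(pvName, pvcName).Set(1)
}

// clearStuckDetach removes the series once the volume finally detaches.
func clearStuckDetach(pvName, pvcName string) {
	volumesStuckDetaching.DeleteLabelValues(pvName, pvcName)
}
```

An alert could then fire whenever this gauge stays non-zero for some period.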

Describe alternatives you've considered
A similar metric in the kube-controller-manager or the external-attacher could be a good alternative; still, we've seen this issue only with AWS EBS.

Additional context
See #1302 for an example of such an issue.

@AndrewSirenko
Contributor

AndrewSirenko commented Dec 10, 2024

/feature

I will bring this up with my team. I agree that a metric for this 'volume not detaching' state would be useful for operators.


Would a metric for "Pods stuck in ContainerCreating due to FailedAttachVolume for more than X seconds" also solve this problem?

There are many possible sources of this "Volume stuck detaching" issue, each with different symptoms, yet the main pain-point seems to be when a stateful workload migrates nodes and can't start.

Or perhaps the 'awaiting detach' metric you proposed could have different labels for different cases (a rough sketch follows this list):

  1. Unhealthy Node -> Unhealthy Kubelet -> VolumeManager cannot ensure NodeUnstageVolume succeeded.
  • This results in the volume never being cleared from node.Status.VolumesInUse -> VolumeAttachment never marked for deletion -> ControllerUnpublishVolume never called.
  2. Unhealthy Node does not let the EBS Node service unmount the volume and the 6-minute AD Controller force-detach timer elapses -> the EBS CSI Controller's EC2 DetachVolume call in ControllerUnpublish will fail until the instance is terminated or a force detach is manually issued.
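For illustration, continuing the sketch above (same package and imports), a hypothetical `reason` label whose values are invented here to tell the two cases apart:

```go
// Hypothetical only: the "reason" label and its values are invented for illustration.
var volumesStuckDetachingByReason = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "csi_volumes_stuck_detaching_by_reason", // placeholder name
		Help: "Volumes waiting for detach longer than the threshold, by suspected cause.",
	},
	[]string{"persistentvolume", "reason"},
)

// Case 1: kubelet unhealthy, so NodeUnstageVolume is never confirmed.
func markUnstageNotConfirmed(pvName string) {
	volumesStuckDetachingByReason.WithLabelValues(pvName, "node_unstage_not_confirmed").Set(1)
}

// Case 2: the force-detach timer elapsed and EC2 DetachVolume keeps failing.
func markEC2DetachFailing(pvName string) {
	volumesStuckDetachingByReason.WithLabelValues(pvName, "ec2_detach_volume_failing").Set(1)
}
```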

Sidenote: Are there other cases that you have seen with recent EBS CSI Driver versions that we are not aware of?

IIRC my team believed that most of #1302's reports were either fixed in the driver or due to autoscalers not waiting for volume unmounts/detaches before terminating instances. We believe we've solved these autoscaler races via the pre-stop hook or PRs on the relevant autoscalers (like Karpenter via kubernetes-sigs/karpenter#1294).

@AndrewSirenko
Contributor

AndrewSirenko commented Dec 11, 2024

The team agreed that this is worth prioritizing; we'll keep you in the design loop. Thanks @jsafrane

@kubernetes-sigs kubernetes-sigs deleted a comment from k8s-ci-robot Dec 11, 2024
@AndrewSirenko
Contributor

/priority important-longterm

@k8s-ci-robot k8s-ci-robot added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Dec 11, 2024