
Expose metric with volumes stuck detaching #2255

Open
jsafrane opened this issue Dec 9, 2024 · 3 comments
Labels
priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.

Comments

@jsafrane
Contributor

jsafrane commented Dec 9, 2024

Is your feature request related to a problem? Please describe.
Sometimes a volume gets stuck in the detaching state, usually because the host where it is attached is unhealthy. A careful, manually initiated force detach and/or node shutdown may be needed in that case.

The CSI driver should expose a metric listing the volumes (with PV/PVC as a label?) that have been waiting for detach longer than X seconds (minutes?), so a cluster admin can set up an alert and manually investigate what is going on.
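For illustration only, a minimal client_golang sketch of what such a metric could look like; the metric name, labels, threshold, and helper functions are placeholders, not a concrete design:

```go
// Placeholder sketch only: metric name, labels, threshold, and helpers are invented.
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// volumesStuckDetaching reports, per PV/PVC, volumes whose detach has been
// pending longer than detachThreshold.
var volumesStuckDetaching = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "csi_volumes_stuck_detaching", // placeholder name
		Help: "Volumes waiting for detach longer than the configured threshold.",
	},
	[]string{"persistentvolume", "persistentvolumeclaim"},
)

// detachThreshold stands in for the "X seconds (minutes?)" above.
const detachThreshold = 2 * time.Minute

func init() {
	prometheus.MustRegister(volumesStuckDetaching)
}

// markStuckDetach would be called from a reconcile loop once a detach
// operation has been pending longer than detachThreshold.
func markStuckDetach(pvName, pvcName string) {
	volumesStuckDetaching.WithLabelValues(pvName, pvcName).Set(1)
}

// clearStuckDetach removes the series once the volume finally detaches.
func clearStuckDetach(pvName, pvcName string) {
	volumesStuckDetaching.DeleteLabelValues(pvName, pvcName)
}
```

An alert could then fire whenever this gauge stays non-zero for some period.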

Describe alternatives you've considered
A similar metric in the kube-controller-manager or the external-attacher could be a good alternative; still, we've seen this issue only with AWS EBS.

Additional context
See #1302 for an example of such an issue.

@AndrewSirenko
Contributor

AndrewSirenko commented Dec 10, 2024

/feature

I will bring this up with my team. I agree that a metric for this 'volume not detaching' state would be useful for operators.


Would a metric for "Pods stuck in ContainerCreating due to FailedAttachVolume for more than X seconds" also solve this problem?

There are many possible sources of this "Volume stuck detaching" issue, each with different symptoms, yet the main pain-point seems to be when a stateful workload migrates nodes and can't start.

Or perhaps the 'awaiting detach' metric you proposed could have different labels for different cases (a rough sketch follows this list):

  1. Unhealthy Node -> Unhealthy Kubelet -> VolumeManager cannot ensure NodeUnstageVolume succeeded.
  • This results in the volume never being cleared from node.Status.VolumesInUse -> VolumeAttachment never marked for deletion -> ControllerUnpublishVolume never called.
  2. Unhealthy Node does not let the EBS Node service unmount the volume and the 6-minute AD Controller force-detach timer elapses -> the EBS CSI Controller's EC2 DetachVolume call in ControllerUnpublish will fail until the instance is terminated or a force detach is manually issued.
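For illustration, continuing the sketch above (same package and imports), a hypothetical `reason` label whose values are invented here to tell the two cases apart:

```go
// Hypothetical only: the "reason" label and its values are invented for illustration.
var volumesStuckDetachingByReason = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "csi_volumes_stuck_detaching_by_reason", // placeholder name
		Help: "Volumes waiting for detach longer than the threshold, by suspected cause.",
	},
	[]string{"persistentvolume", "reason"},
)

// Case 1: kubelet unhealthy, so NodeUnstageVolume is never confirmed.
func markUnstageNotConfirmed(pvName string) {
	volumesStuckDetachingByReason.WithLabelValues(pvName, "node_unstage_not_confirmed").Set(1)
}

// Case 2: the force-detach timer elapsed and EC2 DetachVolume keeps failing.
func markEC2DetachFailing(pvName string) {
	volumesStuckDetachingByReason.WithLabelValues(pvName, "ec2_detach_volume_failing").Set(1)
}
```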

Sidenote: Are there other cases that you have seen with recent EBS CSI Driver versions that we are not aware of?

IIRC my team believed that most of #1302's reports were either fixed in the driver or due to autoscalers not waiting for volume unmounts/detaches before terminating instances. We believe we've solved these autoscaler races via the pre-stop hook or PRs on the relevant autoscalers (like Karpenter via kubernetes-sigs/karpenter#1294).

@AndrewSirenko
Contributor

AndrewSirenko commented Dec 11, 2024

The team agreed that this is worth prioritizing; we'll keep you in the design loop. Thanks @jsafrane

@kubernetes-sigs kubernetes-sigs deleted a comment from k8s-ci-robot Dec 11, 2024
@AndrewSirenko
Contributor

/priority important-longterm

@k8s-ci-robot k8s-ci-robot added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Dec 11, 2024