
[Feat] Add pod scheduling to timeout, or cancel stuck jobs after new timeout #11025

Open
ADustyOldMuffin opened this issue May 2, 2023 · 3 comments
Labels
type/feature Feature request

Comments

@ADustyOldMuffin

Summary

When scheduling jobs/templates in a workflow, I'd like to pin them to specific nodes. These jobs are not critical, so if they can't be scheduled I'd like to simply skip them. The problem is that the currently available timeouts don't count time spent pending/unschedulable, so pods that can't be scheduled hang up the entire workflow.

Use Cases

When you have a workflow with steps that might not schedule in Kubernetes, I'd like a way to time them out or stop them after they have sat in a pending state for too long.
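
For illustration, a minimal sketch of how this might look on a template. `nodeSelector`, `timeout`, and `activeDeadlineSeconds` are existing Argo fields (and, per this issue, the timeouts don't cover time spent Pending); `pendingTimeout` is purely a hypothetical name for the requested behaviour:

```yaml
# Illustrative only: `timeout` and `activeDeadlineSeconds` exist in Argo today,
# but (per this issue) they do not cover pods stuck in Pending.
# `pendingTimeout` is a hypothetical field name for the requested behaviour.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: pending-timeout-example-
spec:
  entrypoint: main
  templates:
    - name: main
      nodeSelector:
        example.com/special-hardware: "true"   # may be unsatisfiable
      timeout: 10m                     # existing field
      activeDeadlineSeconds: 600       # existing field
      # pendingTimeout: 5m             # hypothetical: skip/fail this step if no
      #                                # pod can be scheduled within 5 minutes
      container:
        image: alpine:3.19
        command: [echo, "ran on the special node"]
```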


Message from the maintainers:

Love this enhancement proposal? Give it a 👍. We prioritise the proposals with the most 👍.

ADustyOldMuffin added the type/feature Feature request label May 2, 2023
@Gerrit-K

I think this request would fix a similar problem I've encountered a few times. We have CLI scripts that submit workflows, and one of the user-controlled inputs is the container image. If that input is wrong, it can cause an "invalid reference format" error on the pod, which is then effectively unschedulable. Of course, we can (and did) add checks on the client side to prevent this, but it would additionally be nice if Argo were able to detect these cases and recover from them (i.e. fail automatically).
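
For context, a rough sketch of that scenario (all names illustrative): the image comes from a user-supplied workflow parameter, and a malformed value leads to an "invalid reference format" error on the pod, which then never makes progress:

```yaml
# Illustrative sketch: the image is a user-controlled parameter, so a malformed
# value (e.g. a stray space or uppercase repository name) produces an
# "invalid reference format" error and the pod never runs.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: user-image-example-
spec:
  entrypoint: run
  arguments:
    parameters:
      - name: image
        value: "MyRepo/My Image:latest"   # malformed, user-supplied value
  templates:
    - name: run
      container:
        image: "{{workflow.parameters.image}}"
        command: [echo, hello]
```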

@ElQDuck

ElQDuck commented Feb 15, 2024

I have a similar problem with pods in an infinite pending state because, for example, a referenced PVC can't be found. None of the timeouts (activeDeadlineSeconds, timeout) work. I can see a message in Argo describing the problem, but I can't define a timeout for such a use case.

Unschedulable: 0/1 nodes are available: persistentvolumeclaim "non-existent" not found. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
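
A minimal sketch that reproduces this situation (names are illustrative): the claim doesn't exist, so the pod stays Pending with the message above and, per this comment, neither timeout ever fires:

```yaml
# Illustrative repro: the claim name does not exist, so the pod remains Pending
# with a "persistentvolumeclaim ... not found" scheduling message.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: missing-pvc-example-
spec:
  entrypoint: main
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: non-existent          # PVC does not exist in the namespace
  templates:
    - name: main
      activeDeadlineSeconds: 300         # per this comment, never fires while Pending
      container:
        image: alpine:3.19
        command: [ls, /data]
        volumeMounts:
          - name: data
            mountPath: /data
```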

@tooptoop4
Contributor

tooptoop4 commented May 21, 2024

I have a similar issue (aws/amazon-vpc-cni-k8s#2808) where new nodes go from:
1. kubelet Ready
2. kubelet network NotReady
3. back to kubelet Ready

Pods scheduled at point 1 become stuck in Pending. It would be great if Argo had a pending timeout that, when met, would allow retrying with a new pod.
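
A rough sketch of how that could interact with the existing retry mechanism (`retryStrategy`, `limit`, and `retryPolicy` are real Argo fields; `pendingTimeout` is hypothetical, as in the sketch under "Use Cases"):

```yaml
# Hypothetical sketch: combine the requested pending timeout with an existing
# retryStrategy so a stuck pod is given up on and retried on a (hopefully
# healthy) new node. Per this thread, retries alone don't help today because a
# Pending pod never fails, so no retry is ever triggered.
templates:
  - name: flaky-node-step
    # pendingTimeout: 5m        # hypothetical: give up on a pod stuck in Pending
    retryStrategy:
      limit: "3"
      retryPolicy: Always       # existing field
    container:
      image: alpine:3.19
      command: [echo, hello]
```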
