
[Feat] Add pod scheduling to timeout, or cancel stuck jobs after new timeout #11025

Open
ADustyOldMuffin opened this issue May 2, 2023 · 3 comments
Labels
type/feature Feature request

Comments

@ADustyOldMuffin

Summary

When scheduling jobs/templates in a workflow, I'd like to pin them to specific nodes. These jobs are not critical, so if they can't be scheduled I'd like to simply skip them. The problem is that the currently available timeouts don't count time spent pending/unschedulable, so pods that can't be scheduled hang up the entire workflow.

Use Cases

When you have a workflow with steps that might not schedule in Kubernetes, I'd like a way to time them out or stop them after they have sat in a pending state for too long.
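
For illustration, a minimal sketch of how this might look on a template. `nodeSelector`, `timeout`, and `activeDeadlineSeconds` are existing Argo fields (and, per this issue, the timeouts don't cover time spent Pending); `pendingTimeout` is purely a hypothetical name for the requested behaviour:

```yaml
# Illustrative only: `timeout` and `activeDeadlineSeconds` exist in Argo today,
# but (per this issue) they do not cover pods stuck in Pending.
# `pendingTimeout` is a hypothetical field name for the requested behaviour.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: pending-timeout-example-
spec:
  entrypoint: main
  templates:
    - name: main
      nodeSelector:
        example.com/special-hardware: "true"   # may be unsatisfiable
      timeout: 10m                     # existing field
      activeDeadlineSeconds: 600       # existing field
      # pendingTimeout: 5m             # hypothetical: skip/fail this step if no
      #                                # pod can be scheduled within 5 minutes
      container:
        image: alpine:3.19
        command: [echo, "ran on the special node"]
```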


Message from the maintainers:

Love this enhancement proposal? Give it a 👍. We prioritise the proposals with the most 👍.

ADustyOldMuffin added the type/feature Feature request label May 2, 2023
@Gerrit-K

I think this request would fix a similar problem I've encountered a few times. We have CLI scripts that submit workflows, and one of the user-controlled inputs is the container image. If that input is wrong, it can cause an "invalid reference format" error on the pod, which is then effectively unschedulable. Of course, we can (and did) add checks on the client side to prevent this, but it would additionally be nice if Argo were able to detect these cases and recover from them (i.e. fail automatically).
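
For context, a rough sketch of that scenario (all names illustrative): the image comes from a user-supplied workflow parameter, and a malformed value leads to an "invalid reference format" error on the pod, which then never makes progress:

```yaml
# Illustrative sketch: the image is a user-controlled parameter, so a malformed
# value (e.g. a stray space or uppercase repository name) produces an
# "invalid reference format" error and the pod never runs.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: user-image-example-
spec:
  entrypoint: run
  arguments:
    parameters:
      - name: image
        value: "MyRepo/My Image:latest"   # malformed, user-supplied value
  templates:
    - name: run
      container:
        image: "{{workflow.parameters.image}}"
        command: [echo, hello]
```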

@ElQDuck

ElQDuck commented Feb 15, 2024

I have a similar problem with pods in an infinite pending state because, for example, a referenced PVC can't be found. None of the timeouts (activeDeadlineSeconds, timeout) work. I can see a message in Argo describing the problem, but I can't define a timeout for such a use case.

Unschedulable: 0/1 nodes are available: persistentvolumeclaim "non-existent" not found. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
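
A minimal sketch that reproduces this situation (names are illustrative): the claim doesn't exist, so the pod stays Pending with the message above and, per this comment, neither timeout ever fires:

```yaml
# Illustrative repro: the claim name does not exist, so the pod remains Pending
# with a "persistentvolumeclaim ... not found" scheduling message.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: missing-pvc-example-
spec:
  entrypoint: main
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: non-existent          # PVC does not exist in the namespace
  templates:
    - name: main
      activeDeadlineSeconds: 300         # per this comment, never fires while Pending
      container:
        image: alpine:3.19
        command: [ls, /data]
        volumeMounts:
          - name: data
            mountPath: /data
```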

@tooptoop4
Contributor

tooptoop4 commented May 21, 2024

I have a similar issue (aws/amazon-vpc-cni-k8s#2808) where new nodes go from:
1. kubelet Ready
2. kubelet network NotReady
3. back to kubelet Ready

Pods scheduled at point 1 become stuck in Pending. It would be great if Argo had a pending timeout that, when met, would allow retrying with a new pod.
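
A rough sketch of how that could interact with the existing retry mechanism (`retryStrategy`, `limit`, and `retryPolicy` are real Argo fields; `pendingTimeout` is hypothetical, as in the sketch under "Use Cases"):

```yaml
# Hypothetical sketch: combine the requested pending timeout with an existing
# retryStrategy so a stuck pod is given up on and retried on a (hopefully
# healthy) new node. Per this thread, retries alone don't help today because a
# Pending pod never fails, so no retry is ever triggered.
templates:
  - name: flaky-node-step
    # pendingTimeout: 5m        # hypothetical: give up on a pod stuck in Pending
    retryStrategy:
      limit: "3"
      retryPolicy: Always       # existing field
    container:
      image: alpine:3.19
      command: [echo, hello]
```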
