Summary
When scheduling jobs/templates in a workflow, I'd like to pin them to specific nodes. These jobs are not critical, so if they can't be scheduled I'd like to just skip them. The problem is that the currently available timeouts don't count time spent as a pending/unschedulable pod, so a step that can't be scheduled hangs the entire workflow.
Use Cases
When a workflow has steps that might not schedule in Kubernetes, I'd like a way to time them out or stop them after they have sat in a pending state for a set amount of time.
Message from the maintainers:
Love this enhancement proposal? Give it a 👍. We prioritise the proposals with the most 👍.
I think this request would fix a similar problem I've encountered a few times. We have CLI scripts that submit workflows, and one of the user-controlled inputs is the container image. If the input is wrong, it can cause an "invalid reference format" error on the pod, which is then effectively unschedulable. Of course, we can (and did) add checks on the client side to prevent this, but it would also be nice if Argo could detect these cases and recover from them (i.e. fail automatically).
I have a similar problem with pods stuck in a pending state indefinitely because, for example, a referenced PVC can't be found. None of the timeouts (activeDeadlineSeconds, timeout) work. I can see a message in Argo describing the problem, but I can't define a timeout for this use case.
Unschedulable: 0/1 nodes are available: persistentvolumeclaim "non-existent" not found. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
I have a similar issue (aws/amazon-vpc-cni-k8s#2808) where new nodes go through these states:
1. kubelet ready
2. kubelet network notready
3. back to kubelet ready
Pods scheduled at point 1 become stuck in pending. It would be great if Argo had a pending timeout that, once reached, allowed a retry with a new pod.
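Until something like this exists in Argo, the pending timeout described in this thread could be approximated by an external check. The following is only a rough sketch, not anything Argo provides: it assumes pod objects have already been fetched as plain dicts shaped like the Kubernetes Pod API (metadata.creationTimestamp, status.phase, status.conditions), and the function names pending_too_long and unschedulable_message are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def parse_k8s_time(value: str) -> datetime:
    """Parse a Kubernetes-style UTC timestamp such as '2024-01-01T00:00:00Z'."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def pending_too_long(pod: dict, limit: timedelta, now: datetime) -> bool:
    """Return True if the pod is still Pending and older than `limit`.

    Hypothetical helper: `pod` is a dict shaped like a Kubernetes Pod object.
    """
    if pod.get("status", {}).get("phase") != "Pending":
        return False
    created = parse_k8s_time(pod["metadata"]["creationTimestamp"])
    return now - created > limit


def unschedulable_message(pod: dict) -> Optional[str]:
    """Return the scheduler's message if the pod is marked Unschedulable."""
    for cond in pod.get("status", {}).get("conditions", []):
        if (
            cond.get("type") == "PodScheduled"
            and cond.get("status") == "False"
            and cond.get("reason") == "Unschedulable"
        ):
            return cond.get("message", "")
    return None
```

What to do with a pod flagged this way (deleting it, stopping the workflow, alerting) is left to the caller; the sketch only covers detection.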