How to restart the training JOB when one training process fails in cluster environment to recover the training? #2269

Open
kevinsummer219 opened this issue Sep 25, 2024 · 7 comments

@kevinsummer219

What would you like to be added?

I use PyTorchJob.

Why is this needed?

I think that is a valid idea. With such a restart policy, the controller should re-create all of the PyTorchJob's Pods when a single Pod fails.

Love this feature?

Give it a 👍. We prioritize the features with the most 👍.

@kevinsummer219
Author

The current behavior is that when a Pod encounters an exception, that Pod is automatically restarted. For multi-node training, however, all Pods need to be restarted. How do I configure the YAML file for that? Thanks!

@kevinsummer219
Author

The PyTorchJob YAML configuration file I used:
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: train-poc
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  runPolicy:
    backoffLimit: 5
  nprocPerNode: gpu
  pytorchReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        metadata:

@kevinsummer219 kevinsummer219 changed the title How to restart all pods when one pod fails? How to restart the training JOB when one training process fails in cluster environment to recover the training? Sep 25, 2024
@kannon92
Contributor

kannon92 commented Oct 8, 2024

I believe that the v2 Training Operator will solve this by utilizing JobSet Failure Policies.

I'm unsure of the path for v1 (@tenzen-y @andreyvelich?).

@andreyvelich
Member

@kubeflow/wg-training-leads @kuizhiqing can we leverage the RestartPolicy API in V1 to restart all replicas' Pods in case of failure?

Yes, with V2, we can use FailurePolicy within JobSet.
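For reference, here is a minimal sketch of what such a JobSet-level failure policy looks like, assuming the jobset.x-k8s.io/v1alpha2 API; the job name, image, and command are placeholders rather than values taken from this issue:

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: train-poc
spec:
  failurePolicy:
    maxRestarts: 5            # when any child Job fails, all child Jobs are recreated, up to 5 times
  replicatedJobs:
    - name: node              # placeholder name
      replicas: 1
      template:
        spec:
          parallelism: 2      # two workers, mirroring the PyTorchJob above
          completions: 2
          template:
            spec:
              restartPolicy: Never   # a process failure fails the child Job instead of restarting one Pod
              containers:
                - name: pytorch
                  image: docker.io/pytorch/pytorch:latest          # placeholder image
                  command: ["torchrun", "--nnodes=2", "train.py"]  # placeholder command

With restartPolicy: Never on the Pods, a single failed training process fails its child Job, and the JobSet failurePolicy then restarts the whole group instead of only the failed Pod.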

@ltm920716
Copy link

Hi @kevinsummer219,
Have you found a way to restart all Pods? I found issue #2200 (comment) but do not know where to set job.restartpolicy.

Also, how do I use JobSet? Sorry, I cannot find the docs for it. @andreyvelich

@andreyvelich
Member

Hi @ltm920716, we are working on the Kubeflow Training V2 API, where this will be possible: https://github.com/kubeflow/training-operator/tree/master/docs/proposals/2170-kubeflow-training-v2

You can find the torch-distributed Runtime, where you can configure the JobSet spec:
https://github.com/kubeflow/training-operator/blob/master/manifests/v2/base/runtimes/pre-training/torch-distributed.yaml#L12
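As a rough, hypothetical illustration, the snippet below sketches where a whole-group restart policy could sit in that runtime. It assumes the ClusterTrainingRuntime embeds a JobSet spec under spec.template.spec, as in the linked manifest; the apiVersion and field names follow the in-progress V2 proposal and may differ in the released API, and the image and command are placeholders:

apiVersion: kubeflow.org/v2alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torch-distributed
spec:
  template:
    spec:                      # embedded JobSet spec
      failurePolicy:
        maxRestarts: 5         # recreate every child Job when one of them fails
      replicatedJobs:
        - name: node
          template:
            spec:
              template:
                spec:
                  containers:
                    - name: node
                      image: docker.io/pytorch/pytorch:latest   # placeholder image
                      command: ["torchrun", "train.py"]         # placeholder command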

@ltm920716

OK, very kind of you!
