How to restart the training JOB when one training process fails in cluster environment to recover the training? #2269

kevinsummer219 · 2024-09-25T08:40:54Z

What you would like to be added?

I use pytorchJOB.

Why is this needed?

I think, that is valid idea. So in such restart policy our controller should re-create all PyTorchJob's pods in case of single Pod failure.

Love this feature?

Give it a 👍 We prioritize the features with most 👍

kevinsummer219 · 2024-09-25T08:48:12Z

The current situation is that when a Pod encounters an exception, it can be automatically restarted. However, as for multi-node training, all Pods need to be restarted. How do I configure this YAML file? Thanks!

kevinsummer219 · 2024-09-25T10:36:48Z

The pytorchJOB YAML configuration file I used：
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
name: train-poc
namespace: default
labels:
kueue.x-k8s.io/queue-name: user-queue
spec:
runPolicy:
backoffLimit: 5
nprocPerNode: gpu
pytorchReplicaSpecs:
Worker:
replicas: 2
restartPolicy: OnFailure
template:
metadata:

kannon92 · 2024-10-08T13:07:13Z

I believe that the v2 Training Operator will solve this by utilizing JobSet Failure Policies.

I'm unsure of the path for v1 (@tenzen-y @andreyvelich?).

andreyvelich · 2024-10-08T18:55:51Z

@kubeflow/wg-training-leads @kuizhiqing can we leverage the RestartPolicy API in V1 to restart all replica's pods in case of failure ?

Yes, with V2, we can use FailurePolicy within JobSet.

ltm920716 · 2024-12-23T04:55:01Z

hi @kevinsummer219 ，
have you found the way to restart all pods？ I find a issue #2200 (comment) but do not know where to set job.restartpolicy.

and how to use JobSet, I am sorry that I cannot find the doc for this @andreyvelich

andreyvelich · 2024-12-23T17:02:35Z

Hi @ltm920716, we are working on Kubeflow Training V2 API where that would be possible: https://github.com/kubeflow/training-operator/tree/master/docs/proposals/2170-kubeflow-training-v2

You can find the torch-distributed Runtime where you can configure JobSet spec:
https://github.com/kubeflow/training-operator/blob/master/manifests/v2/base/runtimes/pre-training/torch-distributed.yaml#L12

ltm920716 · 2024-12-24T02:33:19Z

Hi @ltm920716, we are working on Kubeflow Training V2 API where that would be possible: https://github.com/kubeflow/training-operator/tree/master/docs/proposals/2170-kubeflow-training-v2

You can find the torch-distributed Runtime where you can configure JobSet spec: https://github.com/kubeflow/training-operator/blob/master/manifests/v2/base/runtimes/pre-training/torch-distributed.yaml#L12

ok，very kind of you！

kevinsummer219 added kind/feature lifecycle/needs-triage labels Sep 25, 2024

kevinsummer219 changed the title ~~How to restart all pods when one pod fails?~~ How to restart the training JOB when one training process fails in cluster environment to recover the training? Sep 25, 2024

Syulin7 mentioned this issue Oct 25, 2024

Pytorch job running with pod exception unable to recover after retry #2300

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to restart the training JOB when one training process fails in cluster environment to recover the training? #2269

How to restart the training JOB when one training process fails in cluster environment to recover the training? #2269

kevinsummer219 commented Sep 25, 2024

kevinsummer219 commented Sep 25, 2024

kevinsummer219 commented Sep 25, 2024

kannon92 commented Oct 8, 2024

andreyvelich commented Oct 8, 2024

ltm920716 commented Dec 23, 2024

andreyvelich commented Dec 23, 2024

ltm920716 commented Dec 24, 2024

How to restart the training JOB when one training process fails in cluster environment to recover the training? #2269

How to restart the training JOB when one training process fails in cluster environment to recover the training? #2269

Comments

kevinsummer219 commented Sep 25, 2024

What you would like to be added?

Why is this needed?

Love this feature?

kevinsummer219 commented Sep 25, 2024

kevinsummer219 commented Sep 25, 2024

kannon92 commented Oct 8, 2024

andreyvelich commented Oct 8, 2024

ltm920716 commented Dec 23, 2024

andreyvelich commented Dec 23, 2024

ltm920716 commented Dec 24, 2024