KEP-2170: Support hundreds and thousands worker nodes for a single training Job #2318

Open
tenzen-y opened this issue Nov 4, 2024 · 2 comments

@tenzen-y
Member

tenzen-y commented Nov 4, 2024

What you would like to be added?

We should support multiple replicas per ReplicatedJob, for example:

[...]
spec:
  replicatedJobs:
  - name:
    replicas: 5
[...]
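
As a rough illustration of what honoring that field could look like, here is a minimal sketch that resolves a configured replica count and falls back to 1 when the field is unset. The `replicatedJob` type, the `effectiveReplicas` helper, and the `initializer` job name are hypothetical stand-ins for this example only; they are not the actual Trainer or JobSet types.

```go
package main

import "fmt"

// replicatedJob is a hypothetical, simplified stand-in for a JobSet
// ReplicatedJob. Only the fields needed for this sketch are modeled.
type replicatedJob struct {
	Name     string
	Replicas int32 // 0 means "not set" in this sketch
}

// effectiveReplicas returns the configured replica count, defaulting to 1
// when the value is unset or not positive.
func effectiveReplicas(rJob replicatedJob) int32 {
	if rJob.Replicas < 1 {
		return 1
	}
	return rJob.Replicas
}

func main() {
	jobs := []replicatedJob{
		{Name: "training-node", Replicas: 5},
		{Name: "initializer"}, // replicas not set, so it resolves to 1
	}
	for _, rJob := range jobs {
		fmt.Printf("%s: replicas=%d\n", rJob.Name, effectiveReplicas(rJob))
	}
}
```

Defaulting to 1 keeps the behavior unchanged for any runtime that does not set `replicas`.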

Why is this needed?

Currently, we force the JobSet ReplicatedJob replicas to 1:

for _, rJob := range jobSetTemplateSpec.Spec.ReplicatedJobs {
    // By default every ReplicatedJob has only 1 replica.
    opts = append(opts, runtime.WithPodSpecReplicas(rJob.Name, 1, rJob.Template.Spec.Template.Spec))
}

However, when the single worker ReplicatedJob contains a batch/v1 Job with hundreds or thousands of completions (.spec.completions), reconciliation is significantly delayed: the job-controller (which runs inside kube-controller-manager) takes much longer to reconcile a Job that owns thousands of Pods, and the Jobs queued behind it get stuck in the workqueue.

spec:
  replicatedJobs:
  - name: training-node
    replicas: 1
    template:
      spec:
        completions: 2000
        parallelism: 2000

As a result, the kube-controller-manager workqueue grows much deeper, which could potentially cause a memory leak.
Eventually, the kube-controller-manager keeps restarting, and every kind of workload (even StatefulSets and Deployments) goes unhandled.
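
To make the idea concrete, here is a minimal sketch of how a large worker count might be spread over several ReplicatedJob replicas so that no single batch/v1 Job owns thousands of Pods. The `splitWorkers` helper and the `maxPodsPerJob` threshold are assumptions made for this example; the issue does not define a threshold or a splitting policy.

```go
package main

import (
	"errors"
	"fmt"
)

// splitWorkers is a hypothetical helper: it spreads totalNodes worker Pods
// over several Jobs so that no single Job owns more than maxPodsPerJob Pods.
// It returns the number of replicas and the completions for each replica.
// When totalNodes is not evenly divisible, replicas*completions can exceed
// totalNodes, so the caller has to decide how to handle the remainder.
func splitWorkers(totalNodes, maxPodsPerJob int) (replicas, completions int, err error) {
	if totalNodes < 1 || maxPodsPerJob < 1 {
		return 0, 0, errors.New("totalNodes and maxPodsPerJob must be positive")
	}
	// Ceiling division: the smallest number of Jobs that respects the cap.
	replicas = (totalNodes + maxPodsPerJob - 1) / maxPodsPerJob
	// Ceiling division again: spread the nodes as evenly as possible.
	completions = (totalNodes + replicas - 1) / replicas
	return replicas, completions, nil
}

func main() {
	// 2000 worker nodes with an assumed cap of 100 Pods per Job.
	replicas, completions, err := splitWorkers(2000, 100)
	if err != nil {
		panic(err)
	}
	fmt.Printf("replicas=%d completions=%d parallelism=%d\n", replicas, completions, completions)
}
```

With these assumed inputs, the 2000-Pod Job from the example above would become 20 replicas of 100 completions each, so each Job sync handles far fewer Pods.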

Love this feature?

Give it a 👍 We prioritize the features with most 👍

@tenzen-y
Member Author

tenzen-y commented Nov 4, 2024

/remove-label lifecycle/needs-triage

@AydanPirani

/assign

I can work on this!

@tenzen-y Quick question - I was looking at runtime.go#L71, and it seems that we already have support for PodSpecReplicas.

Does that mean that the majority of the implementation will be on the parsing side of things? (Adding a parameter for job name)

Of course, this is all assuming that there is already support to run multiple replicas - can you confirm this?

Thanks in advance!
