Manage Kubeflow TrainJobs in a multi-cluster environment #2358

andreyvelich · 2024-12-19T21:46:31Z

What you would like to be added?

More and more organizations run Kubernetes in a multi-cluster setup to manage capacity and workload placement effectively. For instance, there are a few related initiatives:

MultiKueue in the Kueue project: https://kueue.sigs.k8s.io/docs/concepts/multikueue/
SIG Multicluster: https://multicluster.sigs.k8s.io/

Currently, Kubeflow Training doesn't offer any best practices for managing TrainJobs across multiple Kubernetes clusters.
Our Python SDK relies entirely on a kubeconfig or an access token to communicate with the Kubernetes API server.
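
To make the current situation concrete, here is a minimal sketch of what a user has to do today to look at TrainJobs in several clusters: loop over kubeconfig contexts and build one client per cluster. It uses the upstream `kubernetes` Python client rather than our SDK. The TrainJob group, version, and plural below are assumptions and should be checked against the CRD installed in your clusters; the helper names are hypothetical and only for illustration.

```python
from typing import Callable, Dict, Iterable, List

# Assumed coordinates of the TrainJob custom resource; check them against
# the CRD installed in your clusters before relying on this.
TRAINJOB_GROUP = "trainer.kubeflow.org"
TRAINJOB_VERSION = "v1alpha1"
TRAINJOB_PLURAL = "trainjobs"


def list_trainjobs_with_kubeconfig(context: str, namespace: str) -> List[dict]:
    """List TrainJobs in one cluster, selected by kubeconfig context name."""
    # Imported lazily so the aggregation helper below can be used (and
    # tested) without the kubernetes package or a live cluster.
    from kubernetes import client, config

    # Build a dedicated API client for the given kubeconfig context.
    api_client = config.new_client_from_config(context=context)
    api = client.CustomObjectsApi(api_client)
    response = api.list_namespaced_custom_object(
        group=TRAINJOB_GROUP,
        version=TRAINJOB_VERSION,
        namespace=namespace,
        plural=TRAINJOB_PLURAL,
    )
    return response.get("items", [])


def trainjob_names_by_cluster(
    contexts: Iterable[str],
    namespace: str,
    lister: Callable[[str, str], List[dict]] = list_trainjobs_with_kubeconfig,
) -> Dict[str, List[str]]:
    """Map each kubeconfig context to the names of its TrainJobs."""
    return {
        context: [job["metadata"]["name"] for job in lister(context, namespace)]
        for context in contexts
    }
```

For example, `trainjob_names_by_cluster(["cluster-a", "cluster-b"], "default")` would query two contexts named `cluster-a` and `cluster-b` (hypothetical names) from the local kubeconfig. Every user needs credentials for every cluster, and nothing here decides where a new TrainJob should run, which is the gap this discussion is about.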

I would like to initiate this discussion to explore various options for enabling ML Engineers and Data Scientists to interact with Kubeflow TrainJobs in a multi-cluster environment.

cc @kubeflow/wg-training-leads @saileshd1402 @Electronic-Waste @seanlaii @kannon92 @astefanutti @bigsur0 @akshaychitneni @shravan-achar

Love this feature?

Give it a 👍. We prioritize the features with the most 👍.

andreyvelich added kind/feature kind/discussion area/sdk labels Dec 19, 2024
