Manage Kubeflow TrainJobs in a multi-cluster environment #2358

Open
andreyvelich opened this issue Dec 19, 2024 · 0 comments

andreyvelich commented Dec 19, 2024

What you would like to be added?

More and more organizations run Kubernetes in a multi-cluster setup to manage capacity and workload placement effectively. For instance, we can find a few initiatives:

Currently, Kubeflow Training doesn't offer any best practices for managing TrainJobs across multiple Kubernetes clusters.
Our Python SDK relies entirely on a kubeconfig or an access token to communicate with the Kubernetes API server.
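
To make the kubeconfig dependency concrete, here is a minimal sketch of how a client could work out which API server and credentials each kubeconfig context points at, so that one context per cluster selects the target. This is not the Kubeflow SDK API: `ClusterTarget`, `resolve_targets`, `EXAMPLE_KUBECONFIG`, and the cluster names, URLs, and token are hypothetical, and the kubeconfig is assumed to be already parsed from YAML into a dict. Only the `clusters` / `users` / `contexts` layout follows the standard kubeconfig format.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ClusterTarget:
    """Connection details resolved for one kubeconfig context (hypothetical type)."""
    context: str
    server: str
    token: Optional[str]


def resolve_targets(kubeconfig: dict) -> Dict[str, ClusterTarget]:
    """Map each context name to the API server and token it refers to."""
    # Index the named cluster and user entries for lookup by name.
    clusters = {c["name"]: c["cluster"] for c in kubeconfig.get("clusters", [])}
    users = {u["name"]: u.get("user", {}) for u in kubeconfig.get("users", [])}

    targets = {}
    for entry in kubeconfig.get("contexts", []):
        ctx = entry["context"]
        targets[entry["name"]] = ClusterTarget(
            context=entry["name"],
            server=clusters[ctx["cluster"]]["server"],
            # Users that authenticate with certificates or exec plugins
            # carry no static token, so this may be None.
            token=users.get(ctx.get("user"), {}).get("token"),
        )
    return targets


# Hypothetical two-cluster kubeconfig, already parsed into a dict.
EXAMPLE_KUBECONFIG = {
    "current-context": "cluster-a",
    "clusters": [
        {"name": "a", "cluster": {"server": "https://cluster-a.example.com:6443"}},
        {"name": "b", "cluster": {"server": "https://cluster-b.example.com:6443"}},
    ],
    "users": [
        {"name": "alice-a", "user": {"token": "example-token"}},
        {"name": "alice-b", "user": {}},
    ],
    "contexts": [
        {"name": "cluster-a", "context": {"cluster": "a", "user": "alice-a"}},
        {"name": "cluster-b", "context": {"cluster": "b", "user": "alice-b"}},
    ],
}

if __name__ == "__main__":
    for name, target in resolve_targets(EXAMPLE_KUBECONFIG).items():
        print(name, target.server, "token" if target.token else "no static token")
```

With this approach every user needs credentials for every cluster in their local kubeconfig, which is one of the limitations this discussion is meant to address.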

I would like to initiate this discussion to explore various options for enabling ML Engineers and Data Scientists to interact with Kubeflow TrainJobs in a multi-cluster environment.

cc @kubeflow/wg-training-leads @saileshd1402 @Electronic-Waste @seanlaii @kannon92 @astefanutti @bigsur0 @akshaychitneni @shravan-achar

Love this feature?

Give it a 👍. We prioritize the features with the most 👍.
