Part 2. Introduce multi-node SPMD initialization for Neuron #8046

Draft
wants to merge 1 commit into
base: master
from

Conversation

rpsilva-aws
Contributor

This PR adds a new initialization path that supports multi-node SPMD on Neuron. To keep the change small, we retain the xla.init() API but reinitialize PJRT alone once SPMD is enabled. Because SPMD is enabled after the initial Neuron initialization, the environment must be reconfigured at that point if the user did not explicitly set XLA_USE_SPMD (checked via is_spmd(), as is currently recommended). Under the hood, both APIs guarantee that the environment is correctly configured when SPMD is enabled.
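For readers skimming the thread, the ordering described above can be sketched roughly as follows. This is a toy model only, not the code in this PR: `FakeRuntime`, `_configure_pjrt`, `pjrt_mode`, and `pjrt_configs` are hypothetical names, and the environment is a plain dict rather than the real process environment. It assumes that "reinitialization" means re-running the PJRT configuration step once when SPMD is turned on after the initial setup.

```python
class FakeRuntime:
    """Toy stand-in for the runtime; every name here is hypothetical."""

    def __init__(self, env):
        self.env = dict(env)      # copy of the (simulated) environment
        self.pjrt_mode = None     # how PJRT was last configured
        self.pjrt_configs = 0     # how many times PJRT was configured

    def is_spmd(self):
        # Assumption: SPMD is considered enabled when XLA_USE_SPMD is "1".
        return self.env.get("XLA_USE_SPMD") == "1"

    def _configure_pjrt(self):
        # Configure PJRT according to the current SPMD setting.
        self.pjrt_mode = "spmd" if self.is_spmd() else "per-process"
        self.pjrt_configs += 1

    def init(self):
        # Initial initialization; runs before the user enables SPMD.
        self._configure_pjrt()

    def use_spmd(self):
        # Enable SPMD; reconfigure PJRT only if it was not already set up
        # for SPMD (i.e. XLA_USE_SPMD was not set explicitly beforehand).
        already_spmd = self.is_spmd()
        self.env["XLA_USE_SPMD"] = "1"
        if not already_spmd:
            self._configure_pjrt()


# SPMD enabled after init: PJRT is configured twice.
late = FakeRuntime({})
late.init()
late.use_spmd()

# XLA_USE_SPMD set up front: a single configuration is enough.
early = FakeRuntime({"XLA_USE_SPMD": "1"})
early.init()
early.use_spmd()
```

In this model both orderings end with PJRT configured for SPMD; the only difference is whether the second configuration pass is needed.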

rpsilva-aws force-pushed the rpsilva-aws_neuron_multi_node_spmd branch from 92be233 to fd39924 on September 20, 2024 01:34
Collaborator

will-cromar left a comment

Otherwise LGTM

torch_xla/runtime.py (outdated, resolved)
rpsilva-aws force-pushed the rpsilva-aws_neuron_multi_node_spmd branch 3 times, most recently from aa4bbd0 to eaef0b8 on September 30, 2024 21:19
rpsilva-aws force-pushed the rpsilva-aws_neuron_multi_node_spmd branch from eaef0b8 to a5e3c23 on October 2, 2024 17:59
rpsilva-aws changed the title from "Introduce multi-node SPMD initialization for Neuron" to "Part 2. Introduce multi-node SPMD initialization for Neuron" on Oct 2, 2024
2 participants