Describe the bug
In MPI mode, every node reports the same SM_CURRENT_HOST value, namely the master's.
To reproduce
Run a PyTorch estimator in MPI mode on more than one node. The training entrypoint can simply dump all of its environment variables to stdout (which should end up in the CloudWatch logs). The logs show that SM_CURRENT_HOST is set to the same value on all nodes (i.e., the master's), whereas PMIX_HOSTNAME is set correctly on each node.
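A minimal sketch of such an entrypoint, in case it helps with reproduction. The function name `dump_environment` and its parameters are made up for illustration and are not part of any SageMaker API; the script only relies on the Python standard library.

```python
import os
import sys


def dump_environment(env=None, stream=None):
    """Write every environment variable as NAME=value, sorted by name.

    Hypothetical helper for illustration: `env` and `stream` default to the
    process environment and stdout so the output lands in the job logs.
    """
    env = os.environ if env is None else env
    stream = sys.stdout if stream is None else stream
    for name in sorted(env):
        stream.write(f"{name}={env[name]}\n")


if __name__ == "__main__":
    # Each MPI process prints its own view of the environment; compare
    # SM_CURRENT_HOST against PMIX_HOSTNAME across the per-node log streams.
    dump_environment()
```

Comparing the `SM_CURRENT_HOST` and `PMIX_HOSTNAME` lines from each node's output should make the mismatch visible.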
Expected behavior
The master node should not propagate its SM_CURRENT_HOST to the other nodes.
System information
PyTorch DLC 1.11.0-gpu-py38
Additional context
This patch corrected the SM_CURRENT_HOST issue on my training jobs.
    # https://github.com/aws/sagemaker-training-toolkit/blob/3188a9df7803798defb043a332d789f7474219d0/src/sagemaker_training/mpi.py#L353
    for name in self._env_vars:
        if name.startswith("SM_"):  # New addition
            continue  # New addition
        command.extend(["-x", name])
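The effect of the patch can be tried in isolation with the standalone sketch below. It is not the toolkit's code: `build_export_flags` and the `skip_prefix` parameter are hypothetical names, and the variable list is only sample input. The sketch assumes that each `-x NAME` pair asks `mpirun` to export that variable from the launching process to the remote ranks, which is how the master's value would end up on every node.

```python
def build_export_flags(env_var_names, skip_prefix="SM_"):
    """Return mpirun ``-x NAME`` arguments for the given variable names.

    Hypothetical helper mirroring the patched loop: names starting with
    `skip_prefix` are left out, so they are not exported from the master.
    """
    flags = []
    for name in env_var_names:
        if name.startswith(skip_prefix):
            continue  # do not forward SM_* variables to the other nodes
        flags.extend(["-x", name])
    return flags


# Sample input only; the real list comes from the toolkit's environment.
print(build_export_flags(["SM_CURRENT_HOST", "NCCL_DEBUG", "SM_HOSTS", "PATH"]))
# → ['-x', 'NCCL_DEBUG', '-x', 'PATH']
```

Note that this skips every `SM_`-prefixed variable, not just `SM_CURRENT_HOST`. Whether the other `SM_*` values are still needed on the worker nodes is not something the report establishes; a narrower filter on `SM_CURRENT_HOST` alone would be the more conservative variant.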