
[bug] smdistributed not installed in pre-built docker images for PyTorch 2.1 with cuda 12.1 #3627

Closed
4 of 6 tasks
Ridhamz-nd opened this issue Jan 19, 2024 · 6 comments


@Ridhamz-nd

Ridhamz-nd commented Jan 19, 2024

Checklist

Concise Description:
SageMaker Docker images with CUDA 12.x don't contain the smdistributed library. The corresponding Dockerfile doesn't contain the smdistributed installation code.

DLC image/dockerfile:
Image 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.1.0-gpu-py310-cu121-ubuntu20.04-sagemaker

Current behavior:

(base) ubuntu@ip-172-31-5-218:~$ docker run --rm -ti 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.1.0-gpu-py310-cu121-ubuntu20.04-sagemaker bash
root@062cb6750a2b:/# python
Python 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import smdistributed
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'smdistributed'
>>>

Expected behavior:
The import should succeed for cu121. Below is an older image (cu118) in which smdistributed is present.

(base) ubuntu@ip-172-31-5-218:~$ docker run --rm -ti 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.0.1-gpu-py310-cu118-ubuntu20.04-sagemaker bash
root@bb2738e7bc44:/# python
Python 3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:26:04) [GCC 10.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import smdistributed
>>>

Additional context:

@Ridhamz-nd Ridhamz-nd changed the title [bug] [bug] smdistributed import fails for PyTorch 2.1 with cuda 12.1 Jan 19, 2024
@Ridhamz-nd Ridhamz-nd changed the title [bug] smdistributed import fails for PyTorch 2.1 with cuda 12.1 [bug] smdistributed not installed for PyTorch 2.1 with cuda 12.1 Jan 19, 2024
@Ridhamz-nd Ridhamz-nd changed the title [bug] smdistributed not installed for PyTorch 2.1 with cuda 12.1 [bug] smdistributed not installed in pre-built docker images for PyTorch 2.1 with cuda 12.1 Jan 19, 2024
@Ridhamz-nd
Author

Ridhamz-nd commented Jan 19, 2024

@sirutBuasai I see that you updated available_images.md (in #3490) and added the initial Dockerfile (in #3389). Could you please verify that this is intended, or let us know if smdistributed is no longer an option after 2.1?
cc @junpuf

@sirutBuasai
Contributor

sirutBuasai commented Jan 19, 2024

SMDistributed binaries will be added to the PyTorch 2.1 SM DLC once they are ready. But yes, the PT 2.1 SM DLC currently does not support SMDistributed options.

We have decoupled DLC releases from SMDistributed, so you may sometimes see it unavailable.
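In the meantime, a minimal sketch of guarding the import so a training script can run on both the cu118 and cu121 images, falling back to plain NCCL when smdistributed is absent (the fallback logic here is illustrative, not part of the DLC):

# Illustrative sketch: use the SMDDP backend when the package is installed,
# otherwise fall back to NCCL. Assumes the usual rank/world-size environment
# variables are set by the SageMaker/torchrun launcher.
import torch.distributed as dist

try:
    import smdistributed.dataparallel.torch.torch_smddp  # noqa: F401  (registers the "smddp" backend)
    backend = "smddp"
except ImportError:
    backend = "nccl"

dist.init_process_group(backend=backend)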

@Ridhamz-nd
Author

Thank you for the update. Currently I am using the PT 2.0.1 image and installing PT 2.1 on top of it, and it seems to work with the pytorchddp launch setting, which uses SMDDP AllReduce.
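For reference, a minimal sketch of that launch setup with the SageMaker Python SDK (the image URI, role, and entry point below are placeholders; the custom image is assumed to be the PT 2.0.1 DLC with PyTorch 2.1 installed on top):

# Illustrative sketch: launch a training job with the pytorchddp distribution,
# which routes AllReduce through SMDDP. All resource identifiers are placeholders.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",  # placeholder training script
    role="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder IAM role
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/pt21-on-pt201-dlc:latest",  # placeholder custom image
    instance_count=2,
    instance_type="ml.p4d.24xlarge",
    distribution={"pytorchddp": {"enabled": True}},
)
estimator.fit()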

@tejaschumbalkar
Contributor

Closing the issue. Feel free to reopen if you have any additional questions.

@rohit901

rohit901 commented Mar 5, 2024

So the library has not yet been added to the Docker image, right?
I'm using the HuggingFace container, and I'm also getting a module-not-found error:
ModuleNotFoundError: No module named 'smdistributed'

Which versions of the container support smdistributed?

Entire logs:
ErrorMessage "ModuleNotFoundError: No module named 'smdistributed'
 /opt/conda/lib/python3.10/site-packages/diffusers/utils/outputs.py:63: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
 torch.utils._pytree._register_pytree_node(
 Traceback (most recent call last)
 File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
 return _run_code(code, main_globals, None,
 File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
 exec(code, run_globals)
 File "/opt/conda/lib/python3.10/site-packages/mpi4py/__main__.py", line 7, in <module>
 main()
 File "/opt/conda/lib/python3.10/site-packages/mpi4py/run.py", line 230, in main
 run_command_line(args)
 File "/opt/conda/lib/python3.10/site-packages/mpi4py/run.py", line 47, in run_command_line
 run_path(sys.argv[0], run_name='__main__')
 File "/opt/conda/lib/python3.10/runpy.py", line 289, in run_path
 return _run_module_code(code, init_globals, run_name,
 File "/opt/conda/lib/python3.10/runpy.py", line 96, in _run_module_code
 _run_code(code, mod_globals, init_globals,
 File "train_vlcm_distill_lcm_wds.py", line 1416, in <module>
 main(args)
 File "train_vlcm_distill_lcm_wds.py", line 780, in main
 accelerator = Accelerator(
 File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 371, in __init__
 self.state = AcceleratorState(
 File "/opt/conda/lib/python3.10/site-packages/accelerate/state.py", line 758, in __init__
 PartialState(cpu, **kwargs)
 File "/opt/conda/lib/python3.10/site-packages/accelerate/state.py", line 145, in __init__
 import smdistributed.dataparallel.torch.torch_smddp  # noqa
 --------------------------------------------------------------------------
 Primary job  terminated normally, but 1 process returned
 a non-zero exit code. Per user-direction, the job has been aborted.
 mpirun.real detected that one or more processes exited with non-zero status, thus causing
 the job to be terminated. The first process to do so was
 
 Process name: [[41100,1],1]
 Exit code:    1"

I'm using the following arguments/parameters for the HuggingFace container:
{
    'image_uri': None,
    'entry_point': 'train_vlcm_distill_lcm_wds.py',
    'source_dir': 'scripts',
    'role': 'xxxx',
    'transformers_version': '4.36.0',
    'pytorch_version': '2.1.0',
    'py_version': 'py310',
    'base_job_name': 'accelerate-sagemaker-1',
    'instance_count': 1,
    'instance_type': 'ml.p4d.24xlarge',
    'debugger_hook_config': False,
    'distribution': {
        'smdistributed': {
            'dataparallel': {
                'enabled': True
            }
        }
    },
    'environment': {
        'ACCELERATE_USE_SAGEMAKER': 'true',
        'ACCELERATE_MIXED_PRECISION': 'fp16',
        'ACCELERATE_DYNAMO_BACKEND': 'NO',
        'ACCELERATE_DYNAMO_MODE': 'default',
        'ACCELERATE_DYNAMO_USE_FULLGRAPH': 'False',
        'ACCELERATE_DYNAMO_USE_DYNAMIC': 'False',
        'ACCELERATE_SAGEMAKER_DISTRIBUTED_TYPE': 'DATA_PARALLEL'
    },
    'metric_definitions': None
}
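For context, a sketch of how these parameters are consumed: they are passed to the SageMaker HuggingFace estimator, and with image_uri=None the SDK resolves the DLC from the transformers/pytorch/py version triplet (abbreviated below; the environment block and remaining keys are as in the dict above):

# Illustrative sketch: the parameters above are passed as keyword arguments to
# the HuggingFace estimator. Only a subset of the keys is repeated here.
from sagemaker.huggingface import HuggingFace

estimator_kwargs = {
    "entry_point": "train_vlcm_distill_lcm_wds.py",
    "source_dir": "scripts",
    "role": "xxxx",  # placeholder, as in the dict above
    "transformers_version": "4.36.0",
    "pytorch_version": "2.1.0",
    "py_version": "py310",
    "instance_count": 1,
    "instance_type": "ml.p4d.24xlarge",
    "distribution": {"smdistributed": {"dataparallel": {"enabled": True}}},
}

huggingface_estimator = HuggingFace(**estimator_kwargs)
huggingface_estimator.fit()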

@rohit901

rohit901 commented Mar 5, 2024

I tried using different versions of the Hugging Face container and am facing more issues.
Any help is greatly appreciated.

I've described my problem in #3746
