
[bug] smdistributed not installed in pre-built docker images for PyTorch 2.1 with cuda 12.1 #3627

Closed
4 of 6 tasks
Ridhamz-nd opened this issue Jan 19, 2024 · 6 comments


@Ridhamz-nd

Ridhamz-nd commented Jan 19, 2024

Checklist

Concise Description:
SageMaker Docker images with CUDA 12.x don't contain the smdistributed library. The corresponding Dockerfile doesn't contain the smdistributed installation code.

DLC image/dockerfile:
Image 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.1.0-gpu-py310-cu121-ubuntu20.04-sagemaker

Current behavior:

(base) ubuntu@ip-172-31-5-218:~$ docker run --rm -ti 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.1.0-gpu-py310-cu121-ubuntu20.04-sagemaker bash
root@062cb6750a2b:/# python
Python 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import smdistributed
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'smdistributed'
>>>

Expected behavior:
The import should succeed for cu121. Below is an older image (cu118) in which smdistributed is present.

(base) ubuntu@ip-172-31-5-218:~$ docker run --rm -ti 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.0.1-gpu-py310-cu118-ubuntu20.04-sagemaker bash
root@bb2738e7bc44:/# python
Python 3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:26:04) [GCC 10.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import smdistributed
>>>

Additional context:

@Ridhamz-nd Ridhamz-nd changed the title [bug] [bug] smdistributed import fails for PyTorch 2.1 with cuda 12.1 Jan 19, 2024
@Ridhamz-nd Ridhamz-nd changed the title [bug] smdistributed import fails for PyTorch 2.1 with cuda 12.1 [bug] smdistributed not installed for PyTorch 2.1 with cuda 12.1 Jan 19, 2024
@Ridhamz-nd Ridhamz-nd changed the title [bug] smdistributed not installed for PyTorch 2.1 with cuda 12.1 [bug] smdistributed not installed in pre-built docker images for PyTorch 2.1 with cuda 12.1 Jan 19, 2024
@Ridhamz-nd
Author

Ridhamz-nd commented Jan 19, 2024

@sirutBuasai I see that you updated available_images.md (in #3490) and added the initial Dockerfile (in #3389). Could you please verify that this is intended, or let us know if smdistributed is no longer an option after 2.1?
cc @junpuf

@sirutBuasai
Contributor

sirutBuasai commented Jan 19, 2024

SMDistributed binaries will be added to the PyTorch 2.1 SM DLC once they are ready. But yes, the PT 2.1 SM DLC currently does not support SMDistributed options.

We have decoupled DLC releases from SMDistributed, so you may sometimes see it unavailable.
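In the meantime, a minimal sketch of guarding the import so a training script can run on both the cu118 and cu121 images, falling back to plain NCCL when smdistributed is absent (the fallback logic here is illustrative, not part of the DLC):

# Illustrative sketch: use the SMDDP backend when the package is installed,
# otherwise fall back to NCCL. Assumes the usual rank/world-size environment
# variables are set by the SageMaker/torchrun launcher.
import torch.distributed as dist

try:
    import smdistributed.dataparallel.torch.torch_smddp  # noqa: F401  (registers the "smddp" backend)
    backend = "smddp"
except ImportError:
    backend = "nccl"

dist.init_process_group(backend=backend)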

@Ridhamz-nd
Author

Thank you for the update. Currently I am using the PT 2.0.1 image and installing PT 2.1 on top of it, and it seems to work with the pytorchddp launch setting, which uses SMDDP AllReduce.
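For reference, a minimal sketch of that launch setup with the SageMaker Python SDK (the image URI, role, and entry point below are placeholders; the custom image is assumed to be the PT 2.0.1 DLC with PyTorch 2.1 installed on top):

# Illustrative sketch: launch a training job with the pytorchddp distribution,
# which routes AllReduce through SMDDP. All resource identifiers are placeholders.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",  # placeholder training script
    role="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder IAM role
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/pt21-on-pt201-dlc:latest",  # placeholder custom image
    instance_count=2,
    instance_type="ml.p4d.24xlarge",
    distribution={"pytorchddp": {"enabled": True}},
)
estimator.fit()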

@tejaschumbalkar
Contributor

Closing the issue. Feel free to reopen if you have any additional questions.

@rohit901

rohit901 commented Mar 5, 2024

So the library has not yet been added to the Docker image, right?
I'm using the HuggingFace container, and I'm also getting a module-not-found error:
ModuleNotFoundError: No module named 'smdistributed'

Which versions of the container support smdistributed?

Entire logs:
ErrorMessage "ModuleNotFoundError: No module named 'smdistributed'
 /opt/conda/lib/python3.10/site-packages/diffusers/utils/outputs.py:63: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
 torch.utils._pytree._register_pytree_node(
 Traceback (most recent call last)
 File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
 return _run_code(code, main_globals, None,
 File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
 exec(code, run_globals)
 File "/opt/conda/lib/python3.10/site-packages/mpi4py/__main__.py", line 7, in <module>
 main()
 File "/opt/conda/lib/python3.10/site-packages/mpi4py/run.py", line 230, in main
 run_command_line(args)
 File "/opt/conda/lib/python3.10/site-packages/mpi4py/run.py", line 47, in run_command_line
 run_path(sys.argv[0], run_name='__main__')
 File "/opt/conda/lib/python3.10/runpy.py", line 289, in run_path
 return _run_module_code(code, init_globals, run_name,
 File "/opt/conda/lib/python3.10/runpy.py", line 96, in _run_module_code
 _run_code(code, mod_globals, init_globals,
 File "train_vlcm_distill_lcm_wds.py", line 1416, in <module>
 main(args)
 File "train_vlcm_distill_lcm_wds.py", line 780, in main
 accelerator = Accelerator(
 File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 371, in __init__
 self.state = AcceleratorState(
 File "/opt/conda/lib/python3.10/site-packages/accelerate/state.py", line 758, in __init__
 PartialState(cpu, **kwargs)
 File "/opt/conda/lib/python3.10/site-packages/accelerate/state.py", line 145, in __init__
 import smdistributed.dataparallel.torch.torch_smddp  # noqa
 --------------------------------------------------------------------------
 Primary job  terminated normally, but 1 process returned
 a non-zero exit code. Per user-direction, the job has been aborted.
 mpirun.real detected that one or more processes exited with non-zero status, thus causing
 the job to be terminated. The first process to do so was
 
 Process name: [[41100,1],1]
 Exit code:    1"

I'm using the following arguments/parameters for the HuggingFace container:
{
    'image_uri': None,
    'entry_point': 'train_vlcm_distill_lcm_wds.py',
    'source_dir': 'scripts',
    'role': 'xxxx',
    'transformers_version': '4.36.0',
    'pytorch_version': '2.1.0',
    'py_version': 'py310',
    'base_job_name': 'accelerate-sagemaker-1',
    'instance_count': 1,
    'instance_type': 'ml.p4d.24xlarge',
    'debugger_hook_config': False,
    'distribution': {
        'smdistributed': {
            'dataparallel': {
                'enabled': True
            }
        }
    },
    'environment': {
        'ACCELERATE_USE_SAGEMAKER': 'true',
        'ACCELERATE_MIXED_PRECISION': 'fp16',
        'ACCELERATE_DYNAMO_BACKEND': 'NO',
        'ACCELERATE_DYNAMO_MODE': 'default',
        'ACCELERATE_DYNAMO_USE_FULLGRAPH': 'False',
        'ACCELERATE_DYNAMO_USE_DYNAMIC': 'False',
        'ACCELERATE_SAGEMAKER_DISTRIBUTED_TYPE': 'DATA_PARALLEL'
    },
    'metric_definitions': None
}
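For context, a sketch of how these parameters are consumed: they are passed to the SageMaker HuggingFace estimator, and with image_uri=None the SDK resolves the DLC from the transformers/pytorch/py version triplet (abbreviated below; the environment block and remaining keys are as in the dict above):

# Illustrative sketch: the parameters above are passed as keyword arguments to
# the HuggingFace estimator. Only a subset of the keys is repeated here.
from sagemaker.huggingface import HuggingFace

estimator_kwargs = {
    "entry_point": "train_vlcm_distill_lcm_wds.py",
    "source_dir": "scripts",
    "role": "xxxx",  # placeholder, as in the dict above
    "transformers_version": "4.36.0",
    "pytorch_version": "2.1.0",
    "py_version": "py310",
    "instance_count": 1,
    "instance_type": "ml.p4d.24xlarge",
    "distribution": {"smdistributed": {"dataparallel": {"enabled": True}}},
}

huggingface_estimator = HuggingFace(**estimator_kwargs)
huggingface_estimator.fit()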

@rohit901

rohit901 commented Mar 5, 2024

I tried using different versions of the Hugging Face container and am facing more issues.
Any help is greatly appreciated.

I've described my problem in #3746
