[bug] smdistributed not installed in pre-built docker images for PyTorch 2.1 with cuda 12.1 #3627
Comments
@sirutBuasai I see that you updated the
SMDistributed binaries will be added to the PyTorch 2.1 SM DLC once they are ready. For now, though, the PT 2.1 SM DLC does not support the SM Distributed options. We have decoupled DLC releases from SMDistributed, so the binaries may sometimes be unavailable.
Thank you for letting me know. Currently, I am using the PT 2.0.1 image and installing PT 2.1 on top of it, and it seems to be working with
Closing the issue. Feel free to reopen if you have any additional questions.
So the library has not yet been added to the Docker image, right? Which versions of the container support smdistributed?
I'm using the following arguments/parameters for the HuggingFace container:
I tried different versions of the Hugging Face container and ran into more issues. I've described my problem in #3746.
Checklist
Concise Description:
SageMaker Docker images with CUDA > 12.X don't contain the smdistributed library. The specific Dockerfile doesn't contain the smdistributed installation code.
DLC image/dockerfile:
Image
763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.1.0-gpu-py310-cu121-ubuntu20.04-sagemaker
Current behavior:
Expected behavior:
Import should succeed for cu12.1. Below is an old image with smdistributed.
Additional context:
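For anyone who wants to confirm whether a given image ships the library, here is a minimal sketch of a check that could be run with the Python interpreter inside the container. The helper name `has_module` is hypothetical, and this is not the original reproduction from the report. It only asks whether the top-level `smdistributed` package can be located, without importing it.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module or package can be located.

    find_spec() only looks the name up on the import path; it does not
    execute the package, so this is safe to run on a machine without GPUs.
    """
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # "torch" is listed alongside "smdistributed" so that a misconfigured
    # interpreter (wrong environment) is easy to tell apart from a missing library.
    for name in ("torch", "smdistributed"):
        status = "found" if has_module(name) else "missing"
        print(f"{name}: {status}")
```

If `smdistributed` is reported as missing while `torch` is found, the image most likely does not include the SMDistributed binaries.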