Extend documentation on distributed training with your own Docker containers #218

Open
marseller opened this issue Aug 26, 2024 · 1 comment

Comments

marseller commented Aug 26, 2024

What did you find confusing? Please describe.
I was searching for documentation on distributed training with my own Docker containers. The current documentation explains how to create or extend containers so that they support distributed training, including a guide to installing the required modules. However, it does not explain how to configure the Estimator class, or any other launch parameters, to start distributed training, as it does for the PyTorch and TensorFlow estimator classes.

Describe how documentation can be improved
Add text that describes how to launch distributed training after creating or extending the Docker image.
Add it to these sections:
https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-use-api.html#data-parallel-use-python-skd-api (this link also contains a typo that should be fixed: "skd" instead of "sdk")
https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-use-api.html#data-parallel-bring-your-own-container
