Extend documentation regarding distributed training for own Docker containers. #218

marseller · 2024-08-26T11:04:56Z

What did you find confusing? Please describe.
I was searching for documentation on distributed training with my own Docker containers. The current documentation explains how to create or extend containers so that they can use distributed training, including an installation guide for the required modules, but it does not explain how to configure the Estimator class or any other launch parameters to start the distributed training, as it does for the PyTorch and TensorFlow classes.

Describe how documentation can be improved
Add text that describes how to launch distributed training after creating or extending the Docker image.
Add it to these sections:
https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-use-api.html#data-parallel-use-python-skd-api (the anchor in this link contains a typo that should also be fixed: skd instead of sdk)
https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-use-api.html#data-parallel-bring-your-own-container
