[torchtitan][debug] integrated CommDebugMode into TorchTitan #480

Open
wants to merge 7 commits into
base: gh/sinhaanhsul/1/base


@sinhaanshul sinhaanshul commented Jul 24, 2024

Stack from ghstack (oldest at bottom):

Summary
This PR gives TorchTitan developers the option to use CommDebugMode when debugging code that uses DTensors. It is turned on with a command-line argument, and further arguments set the console dump file name, the JSON dump file name, and the noise level. Known issue: the debugger currently fails when used with compiled_rmsnorm. The temporary workaround is to increase torch._dynamo.config.cache_size_limit before using CommDebugMode.
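
The cache-size workaround mentioned in the summary could be sketched roughly as below. This is not the code in the PR: the helper name `raise_cache_size_limit` and the value 64 are assumptions chosen for illustration, and a `SimpleNamespace` stands in for `torch._dynamo.config` so the sketch runs without PyTorch installed.

```python
from types import SimpleNamespace


def raise_cache_size_limit(dynamo_config, minimum: int) -> int:
    """Ensure dynamo_config.cache_size_limit is at least `minimum`.

    Hypothetical helper: it only ever raises the limit, never lowers it,
    and returns the value in effect afterwards.
    """
    if dynamo_config.cache_size_limit < minimum:
        dynamo_config.cache_size_limit = minimum
    return dynamo_config.cache_size_limit


# Stand-in for torch._dynamo.config so this sketch runs without PyTorch.
fake_config = SimpleNamespace(cache_size_limit=8)
raise_cache_size_limit(fake_config, 64)  # 64 is an arbitrary illustrative value
print(fake_config.cache_size_limit)
```

In a real run one would presumably `import torch._dynamo` and pass `torch._dynamo.config` instead of the stand-in, before entering CommDebugMode. Raising the limit rather than overwriting it avoids shrinking a larger value that a user may already have configured.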

Test Plan
CONFIG_FILE=./train_configs/debug_model.toml NGPU=4 LOG_RANK=0,1,2,3 ./run_llama_train.sh --comm_debug.enable_comm_debug_mode --model.norm_type="rmsnorm"
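
For readers unfamiliar with dotted flags such as the one in the command above, here is a minimal sketch of how they can be declared with `argparse`. Only `--comm_debug.enable_comm_debug_mode` comes from the test plan; the other three flag names and all default values are hypothetical placeholders for the dump file, JSON file, and noise level options described in the summary, and may not match what the PR actually adds to `torchtitan/config_manager.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="CommDebugMode options (sketch)")
    # Flag name taken from the test plan.
    parser.add_argument(
        "--comm_debug.enable_comm_debug_mode",
        action="store_true",
        help="Run training under CommDebugMode",
    )
    # The names and defaults below are hypothetical, for illustration only.
    parser.add_argument("--comm_debug.dump_file", type=str, default="comm_mode_log.txt")
    parser.add_argument("--comm_debug.dump_json", type=str, default="comm_mode_log.json")
    parser.add_argument("--comm_debug.noise_level", type=int, default=2)
    return parser


args = build_parser().parse_args(
    ["--comm_debug.enable_comm_debug_mode", "--comm_debug.noise_level", "1"]
)
# argparse keeps the dot in the destination name, so attribute syntax
# (args.comm_debug...) does not work; read values through vars() or getattr().
opts = vars(args)
print(opts["comm_debug.enable_comm_debug_mode"], opts["comm_debug.noise_level"])
```

Because the destination names contain a dot, a config manager would typically split each key on the dot and regroup the values into per-section objects before handing them to the training script.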

sinhaanshul added a commit that referenced this pull request Jul 24, 2024
ghstack-source-id: 7e9de7b83a376eb320a403c416b891a0c5b5321e
Pull Request resolved: #480
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jul 24, 2024
sinhaanshul added a commit that referenced this pull request Jul 24, 2024
ghstack-source-id: 3b531851e5fb12259ab0e28979ea5b94afe936f8
Pull Request resolved: #480
sinhaanshul added a commit that referenced this pull request Jul 24, 2024
ghstack-source-id: 60227a2c491c10eb6cf03f156754be4341481957
Pull Request resolved: #480
sinhaanshul added a commit that referenced this pull request Jul 24, 2024
ghstack-source-id: ca3a9f5983a86dad0b867a8ec92e0e878e7784d5
Pull Request resolved: #480
sinhaanshul added a commit that referenced this pull request Jul 24, 2024
ghstack-source-id: 8985c58902dc4e8b00e7975921df065d04c01911
Pull Request resolved: #480
sinhaanshul added a commit that referenced this pull request Jul 24, 2024
ghstack-source-id: 9d0eae8fe6c7f19ea75f6e5ac8929802f1ae1157
Pull Request resolved: #480
sinhaanshul added a commit that referenced this pull request Jul 24, 2024
ghstack-source-id: fbbc6f0257396b21eea0e40939c832a7afa3490f
Pull Request resolved: #480
@facebook-github-bot

Hi @sinhaanshul!

Thank you for your pull request.

We require contributors to sign our Contributor License Agreement, and yours needs attention.

You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

tianyu-l pushed a commit that referenced this pull request Aug 16, 2024
ghstack-source-id: fbbc6f0257396b21eea0e40939c832a7afa3490f
Pull Request resolved: #480