70B Fine-tuning GPUs Utilization #2142
I am using the configuration above to fine-tune the 70B model on 2 nodes with 8 GPUs each. The job took 75 minutes to compile (is that usual?). I also noticed that one of the 16 GPUs was not used at all; I hope the video helps. I also attached the NCCL log, 70b_nccl.txt. The job was eventually killed. Any suggestions?
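As an aside, not part of the original report: a minimal sketch for confirming that all 16 ranks actually come up and that each one pins and touches its own GPU, assuming a standard torchrun launch across both nodes. The script name and rendezvous flags below are placeholders, not anything from the torchtune recipe.

```python
# check_ranks.py: sanity-check rank/device assignment across both nodes.
# Assumes launch via torchrun on each node, e.g.
#   torchrun --nnodes=2 --nproc-per-node=8 check_ranks.py
# This is not part of the torchtune recipe; it only verifies that every rank
# initializes, claims its own GPU, and can complete a NCCL collective.
import os
import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    rank = dist.get_rank()
    world_size = dist.get_world_size()
    device_name = torch.cuda.get_device_name(local_rank)
    print(
        f"rank {rank}/{world_size} on {os.uname().nodename}, "
        f"local_rank {local_rank}, device {device_name}",
        flush=True,
    )

    # A tiny all_reduce forces every rank to exercise its GPU and NCCL;
    # a missing rank or an idle GPU typically shows up here as a hang.
    t = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(t)
    assert t.item() == world_size

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

If one of the 16 ranks never prints (or the all_reduce hangs), that points at the launch or interconnect for that rank rather than the recipe itself.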
Thanks for the report! Based on your config and the setup you have, I don't immediately see why this would hit your specified memory limit of 768G. Let me get hold of a multi-node setup today and test this out.
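As a hedged side note (not something suggested in the thread): one way to see whether that 768G limit is being approached by host RAM or by GPU memory is to log per-rank peaks every few steps. `log_memory` below is a hypothetical helper, and `psutil` is assumed to be installed; it is not part of the recipe.

```python
# Log per-rank peak CUDA memory and host RSS, to see which side is growing
# toward the configured memory limit. Hypothetical helper, assumes psutil.
import os

import psutil
import torch
import torch.distributed as dist


def log_memory(step: int, every: int = 50) -> None:
    if step % every != 0:
        return
    rank = dist.get_rank() if dist.is_initialized() else 0
    cuda_peak_gib = torch.cuda.max_memory_allocated() / 1024**3
    host_rss_gib = psutil.Process(os.getpid()).memory_info().rss / 1024**3
    print(
        f"[step {step}] rank {rank}: peak CUDA {cuda_peak_gib:.1f} GiB, "
        f"host RSS {host_rss_gib:.1f} GiB",
        flush=True,
    )
```

Calling this from the training loop on every rank makes it easier to tell whether the kill came from host memory (e.g. checkpointing or dataloading) or from GPU memory pressure.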
Hey @fabiogeraci, just updating you on this. I'm waiting on a request for a multi-node server (PyTorch has a limited quantity). If I don't hear back today, I'll just rent one out on Lambda Labs or something.
Thank you.
full_finetune_distributed.py
Originally posted by @fabiogeraci in #2018 (comment)