GPU Middle Class? #2161
Open · EugenHotaj opened this issue on Dec 16, 2024 · 7 comments
Labels: discussion, distributed, triaged

@EugenHotaj (Contributor) commented on Dec 16, 2024

Does torchtune have any plans to support "GPU middle class" users?

We're trying to evaluate using torchtune for post-training, especially since many useful features are already implemented (RLHF, LoRA, etc.). However, one big sticking point is that the system seems heavily geared towards single-node training. Are there plans to support multi-node training (e.g. 16-64 nodes) and things like model parallelism, 128k context training, etc.?

If not, is torchtitan the recommended system to use?

Thanks!

@joecummings added the discussion, distributed, and triaged labels on Dec 16, 2024
@joecummings (Contributor) commented:
Hey @EugenHotaj - glad you're checking out torchtune. Up until now, we've managed to provide a pretty extensive set of offerings, including long-context training, large models up to 405B, and RLHF, all on a single node. This has allowed people with smaller GPU budgets to fine-tune some pretty incredible models, and it has let us develop new features faster because a single node is much easier to debug.

Now, all that said, torchtune technically already supports multi-node for FSDP. And we plan on adding tensor parallel + model parallel very soon. The absolute latest we will have these features in torchtune is end of January, but I would bet on sooner!

Would you need anything beyond these parallelism techniques, e.g. pipeline parallel? Are you running on MAST or something like SLURM?

@EugenHotaj (Contributor, Author) commented on Dec 16, 2024

> And we plan on adding tensor parallel + model parallel very soon. The absolute latest we will have these features in torchtune is end of January, but I would bet on sooner!

Thanks @joecummings that's awesome to hear!

> Would you need anything beyond these parallelism techniques, e.g. pipeline parallel? Are you running on MAST or something like SLURM?

Yes we use SLURM -- I'm currently trying to hack a multi-node run from your suggestions on #2018 and torchtitan, so having some examples in torchtune would be super useful imo. We'd also take all the parallelisms we can get 😃, e.g. model, pipeline, and attention parallelism for longer context.

@tginart commented on Dec 17, 2024

I second SLURM! I have also been trying to hack this into torchtune since the single-node experience is quite good.

@ebsmothers (Contributor) commented:
Thanks, folks, for the interest! We torchtune devs are evidently not in the GPU middle class yet 😅 and I think only @joecummings has access to a multi-node setup as of today. I know he is working on testing this out, but until then, @EugenHotaj, we would love to include any SLURM scripts you're able to put together as part of our documentation.

@EugenHotaj (Contributor, Author) commented:
@ebsmothers the torchtitan SLURM file worked pretty much out of the box for us, since we have a similar cluster setup (p5s on AWS). I was able to run Llama 3.3 70B full fine-tuning on 16 nodes with no issues 😄.

@tginart commented on Dec 19, 2024

@EugenHotaj Thanks for the tip.

Did you use something like https://github.com/pytorch/torchtune/blob/main/recipes/full_finetune_distributed.py as the entry point to replace "./train.py" on line 63?

@EugenHotaj (Contributor, Author) commented:
@tginart right, you have to replace that torchrun line with something like:

```bash
srun torchrun --nnodes 4 --nproc_per_node 8 \
    --rdzv_id $SLURM_JOB_ID --rdzv_backend c10d --rdzv_endpoint "$head_node_ip:29500" \
    recipes/full_finetune_distributed.py --config recipes/configs/llama3_3/70B_full.yaml
```
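
For anyone else wiring this up, a minimal sbatch wrapper around that command might look like the sketch below. It follows the torchtitan-style SLURM pattern referenced earlier in the thread (resolve the head node's IP, then launch one torchrun per node via srun); the job name, SBATCH directives, node count, and config path are placeholders you'd adjust for your own cluster, not something torchtune ships today.

```bash
#!/bin/bash
# Hypothetical sbatch wrapper (adapted from the torchtitan-style SLURM pattern
# discussed above); directives and paths are cluster-specific placeholders.
#SBATCH --job-name=torchtune-70b-full
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8

# Pick the first node in the allocation as the c10d rendezvous host.
nodes=( $(scontrol show hostnames "$SLURM_JOB_NODELIST") )
head_node=${nodes[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

# One torchrun launcher per node; torchrun then spawns 8 workers per node.
srun torchrun \
    --nnodes 4 --nproc_per_node 8 \
    --rdzv_id "$SLURM_JOB_ID" --rdzv_backend c10d --rdzv_endpoint "$head_node_ip:29500" \
    recipes/full_finetune_distributed.py \
    --config recipes/configs/llama3_3/70B_full.yaml
```

Scaling out (e.g. to the 16-node run mentioned above) should just be a matter of bumping `--nodes` and `--nnodes` together.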
