GPU Middle Class? #2161
Open · EugenHotaj opened this issue on Dec 16, 2024 · 7 comments
Labels: discussion, distributed, triaged

@EugenHotaj (Contributor) commented on Dec 16, 2024

Does torchtune have any plans to support "GPU middle class" users?

We're trying to evaluate using torchtune for post-training, especially since many useful features are already implemented (RLHF, LoRA, etc.). However, one big sticking point is that the system seems heavily geared towards single-node training. Are there plans to support multi-node training (e.g. 16-64 nodes) and things like model parallelism, 128k context training, etc.?

If not, is torchtitan the recommended system to use?

Thanks!

@joecummings added the discussion, distributed, and triaged labels on Dec 16, 2024
@joecummings (Contributor) commented:
Hey @EugenHotaj - glad you're checking out torchtune. Up until now, we've managed to provide a pretty extensive set of offerings, including long-context training, large models up to 405B, and RLHF, all on a single node. This has allowed people with smaller GPU budgets to fine-tune some pretty incredible models, and it has let us develop new features faster because a single node is much easier to debug.

Now, all that said, torchtune technically already supports multi-node for FSDP. And we plan on adding tensor parallel + model parallel very soon. The absolute latest we will have these features in torchtune is end of January, but I would bet on sooner!

Would you need anything beyond these parallelism techniques, e.g. pipeline parallel? Are you running on MAST or something like SLURM?

@EugenHotaj (Contributor, Author) commented on Dec 16, 2024

> And we plan on adding tensor parallel + model parallel very soon. The absolute latest we will have these features in torchtune is end of January, but I would bet on sooner!

Thanks @joecummings that's awesome to hear!

> Would you need anything beyond these parallelism techniques, e.g. pipeline parallel? Are you running on MAST or something like SLURM?

Yes we use SLURM -- I'm currently trying to hack a multi-node run from your suggestions on #2018 and torchtitan, so having some examples in torchtune would be super useful imo. We'd also take all the parallelisms we can get 😃, e.g. model, pipeline, and attention parallelism for longer context.

@tginart commented on Dec 17, 2024

I second SLURM! I have also been trying to hack this into torchtune since the single-node experience is quite good.

@ebsmothers (Contributor) commented:
Thanks, folks, for the interest! We torchtune devs are evidently not in the GPU middle class yet 😅 and I think only @joecummings has access to a multi-node setup as of today. I know he is working on testing this out, but until then, @EugenHotaj, we would love to include any SLURM scripts you're able to put together as part of our documentation.

@EugenHotaj (Contributor, Author) commented:
@ebsmothers the torchtitan SLURM file worked pretty much out of the box for us, since we have a similar cluster setup (p5s on AWS). I was able to run Llama 3.3 70B full fine-tuning on 16 nodes with no issues 😄.

@tginart commented on Dec 19, 2024

@EugenHotaj Thanks for the tip.

Did you use something like https://github.com/pytorch/torchtune/blob/main/recipes/full_finetune_distributed.py as the entry point to replace "./train.py" on line 63?

@EugenHotaj (Contributor, Author) commented:
@tginart right, you have to replace that torchrun line with something like:

```bash
srun torchrun --nnodes 4 --nproc_per_node 8 \
    --rdzv_id $SLURM_JOB_ID --rdzv_backend c10d --rdzv_endpoint "$head_node_ip:29500" \
    recipes/full_finetune_distributed.py --config recipes/configs/llama3_3/70B_full.yaml
```
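
For anyone else wiring this up, a minimal sbatch wrapper around that command might look like the sketch below. It follows the torchtitan-style SLURM pattern referenced earlier in the thread (resolve the head node's IP, then launch one torchrun per node via srun); the job name, SBATCH directives, node count, and config path are placeholders you'd adjust for your own cluster, not something torchtune ships today.

```bash
#!/bin/bash
# Hypothetical sbatch wrapper (adapted from the torchtitan-style SLURM pattern
# discussed above); directives and paths are cluster-specific placeholders.
#SBATCH --job-name=torchtune-70b-full
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8

# Pick the first node in the allocation as the c10d rendezvous host.
nodes=( $(scontrol show hostnames "$SLURM_JOB_NODELIST") )
head_node=${nodes[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

# One torchrun launcher per node; torchrun then spawns 8 workers per node.
srun torchrun \
    --nnodes 4 --nproc_per_node 8 \
    --rdzv_id "$SLURM_JOB_ID" --rdzv_backend c10d --rdzv_endpoint "$head_node_ip:29500" \
    recipes/full_finetune_distributed.py \
    --config recipes/configs/llama3_3/70B_full.yaml
```

Scaling out (e.g. to the 16-node run mentioned above) should just be a matter of bumping `--nodes` and `--nnodes` together.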
