Is there any documentation on how these different parallelisms compose?
What are the largest training runs these strategies have been tested on?
Are there benchmarks for how these strategies compare against other distributed training frameworks that expose similar parallelisms?
Particularly interested in how PP + FSDP work together, as it seems DeepSpeed explicitly disallows ZeRO 2/3 + PP (see here specifically, and here for discussion).
Specifically for ZeRO-3 + PP: we haven't published guides or anything yet, but we are working on it. You can compose them; you just have to be careful about scheduling the unshard/reshard well with respect to peak memory. We don't have out-of-the-box support for this yet, but we are planning to offer it.
cc @H-Huang @donglimm about ZeRO-3 + PP and other PP schedule questions.
For benchmarks, we do not have large-scale ones yet due to resource constraints, but we are preparing 64-GPU benchmarks for torchtitan. We are happy to collaborate on larger benchmarks if you have resources to run them and want help with any optimization opportunities.
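To make the "you can compose them, just schedule the unshard/reshard carefully" point concrete, here is a minimal sketch of PP + FSDP2 (fully_shard) using PyTorch-native APIs, roughly along the lines of what torchtitan does. This is not torchtitan's actual code: the toy model, the stage split, and the mesh sizes are illustrative, it assumes PyTorch 2.5+ with torch.distributed.pipelining, and exact constructor arguments may differ slightly across releases.

```python
# Hedged sketch: PP (torch.distributed.pipelining) composed with FSDP2 (fully_shard).
# Launch with: torchrun --nproc_per_node=4 pp_fsdp_sketch.py
# 4 ranks = 2 pipeline stages x 2 FSDP shards. Illustrative only.
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.pipelining import PipelineStage, ScheduleGPipe

try:  # public import path in newer PyTorch releases
    from torch.distributed.fsdp import fully_shard
except ImportError:  # older releases
    from torch.distributed._composable.fsdp import fully_shard


def build_stage_module() -> nn.Module:
    # Each pipeline rank materializes only its own chunk of the model.
    layers = []
    for _ in range(4):
        layers += [nn.Linear(1024, 1024), nn.ReLU()]
    return nn.Sequential(*layers)


def main() -> None:
    local_rank = int(os.environ["LOCAL_RANK"])
    device = torch.device("cuda", local_rank)
    torch.cuda.set_device(device)
    dist.init_process_group("nccl")

    # 2D mesh: outer dim = pipeline stages, inner dim = FSDP (ZeRO-3-style) shards.
    mesh = init_device_mesh("cuda", (2, 2), mesh_dim_names=("pp", "dp"))
    pp_rank = mesh["pp"].get_local_rank()

    stage_mod = build_stage_module().to(device)

    # Shard parameters *within* the stage over the dp mesh dimension.
    # reshard_after_forward controls when unsharded params are freed relative
    # to the PP schedule, which is what drives peak memory for this composition.
    for layer in stage_mod:
        if isinstance(layer, nn.Linear):
            fully_shard(layer, mesh=mesh["dp"], reshard_after_forward=True)
    fully_shard(stage_mod, mesh=mesh["dp"])

    stage = PipelineStage(
        stage_mod,
        stage_index=pp_rank,
        num_stages=2,
        device=device,
        group=mesh["pp"].get_group(),
    )
    schedule = ScheduleGPipe(stage, n_microbatches=4, loss_fn=nn.MSELoss())

    opt = torch.optim.AdamW(stage_mod.parameters(), lr=1e-4)
    # Dummy data; in real training each dp rank would load its own data shard.
    x = torch.randn(8, 1024, device=device)
    y = torch.randn(8, 1024, device=device)

    if pp_rank == 0:
        schedule.step(x)         # first stage feeds inputs
    else:
        schedule.step(target=y)  # last stage computes the loss
    opt.step()
    opt.zero_grad()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

The key interaction the reply describes shows up in where fully_shard is applied (per-layer inside each stage) and in reshard_after_forward: with several microbatches in flight, when parameters are re-gathered and freed relative to the pipeline schedule determines peak memory.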
On PP + FSDP and PP + TP + FSDP:

Particularly interested in how PP + FSDP work together as it seems DeepSpeed explicitly disallows ZeRO 2/3 + PP (see here specifically, and here for discussion).

@wconstab @weifengpy @wanchaol