Is there any documentation on how these different parallelisms compose?
What are the largest training runs these strategies have been tested on?
Are there benchmarks for how these strategies compare against other distributed training frameworks that expose similar parallelisms?
Particularly interested in how PP + FSDP work together, as it seems DeepSpeed explicitly disallows ZeRO 2/3 + PP (see here specifically, and here for discussion).
Specifically for ZeRO-3 + PP: we haven't published guides or anything yet, but we are working on it. You can compose them; you just have to be careful about scheduling the unshard/reshard well with respect to peak memory. We don't have out-of-the-box support for this yet, but we are planning to offer it.
cc @H-Huang @donglimm about ZeRO-3 + PP and other PP schedule questions.
For benchmarks, we do not have large-scale ones yet due to resource constraints, but we are preparing 64-GPU benchmarks for torchtitan. We are happy to collaborate on larger benchmarks if you have resources to run them and want help with any optimization opportunities.
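To make the "you can compose them, just schedule the unshard/reshard carefully" point concrete, here is a minimal sketch of PP + FSDP2 (fully_shard) using PyTorch-native APIs, roughly along the lines of what torchtitan does. This is not torchtitan's actual code: the toy model, the stage split, and the mesh sizes are illustrative, it assumes PyTorch 2.5+ with torch.distributed.pipelining, and exact constructor arguments may differ slightly across releases.

```python
# Hedged sketch: PP (torch.distributed.pipelining) composed with FSDP2 (fully_shard).
# Launch with: torchrun --nproc_per_node=4 pp_fsdp_sketch.py
# 4 ranks = 2 pipeline stages x 2 FSDP shards. Illustrative only.
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.pipelining import PipelineStage, ScheduleGPipe

try:  # public import path in newer PyTorch releases
    from torch.distributed.fsdp import fully_shard
except ImportError:  # older releases
    from torch.distributed._composable.fsdp import fully_shard


def build_stage_module() -> nn.Module:
    # Each pipeline rank materializes only its own chunk of the model.
    layers = []
    for _ in range(4):
        layers += [nn.Linear(1024, 1024), nn.ReLU()]
    return nn.Sequential(*layers)


def main() -> None:
    local_rank = int(os.environ["LOCAL_RANK"])
    device = torch.device("cuda", local_rank)
    torch.cuda.set_device(device)
    dist.init_process_group("nccl")

    # 2D mesh: outer dim = pipeline stages, inner dim = FSDP (ZeRO-3-style) shards.
    mesh = init_device_mesh("cuda", (2, 2), mesh_dim_names=("pp", "dp"))
    pp_rank = mesh["pp"].get_local_rank()

    stage_mod = build_stage_module().to(device)

    # Shard parameters *within* the stage over the dp mesh dimension.
    # reshard_after_forward controls when unsharded params are freed relative
    # to the PP schedule, which is what drives peak memory for this composition.
    for layer in stage_mod:
        if isinstance(layer, nn.Linear):
            fully_shard(layer, mesh=mesh["dp"], reshard_after_forward=True)
    fully_shard(stage_mod, mesh=mesh["dp"])

    stage = PipelineStage(
        stage_mod,
        stage_index=pp_rank,
        num_stages=2,
        device=device,
        group=mesh["pp"].get_group(),
    )
    schedule = ScheduleGPipe(stage, n_microbatches=4, loss_fn=nn.MSELoss())

    opt = torch.optim.AdamW(stage_mod.parameters(), lr=1e-4)
    # Dummy data; in real training each dp rank would load its own data shard.
    x = torch.randn(8, 1024, device=device)
    y = torch.randn(8, 1024, device=device)

    if pp_rank == 0:
        schedule.step(x)         # first stage feeds inputs
    else:
        schedule.step(target=y)  # last stage computes the loss
    opt.step()
    opt.zero_grad()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

The key interaction the reply describes shows up in where fully_shard is applied (per-layer inside each stage) and in reshard_after_forward: with several microbatches in flight, when parameters are re-gathered and freed relative to the pipeline schedule determines peak memory.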
On PP + FSDP and PP + TP + FSDP:

Particularly interested in how PP + FSDP work together as it seems DeepSpeed explicitly disallows ZeRO 2/3 + PP (see here specifically, and here for discussion).

@wconstab @weifengpy @wanchaol