How to do multi-machine SPMD/FSDPv2 training with TPU? #8492

radna0 · 2024-12-13T18:47:39Z

❓ Questions and Help

I saw #6362, but I couldn't find an example training script there. For example, if I have multiple TPU v3-8 VMs, how would I run SPMD/FSDPv2 training across all of them?

I'm currently sending commands to all of the TPU VMs like this:

python3.10 podrun --include-local -- hostname

radna0 · 2024-12-14T09:22:45Z

radna0 · 2024-12-20T09:45:45Z

Can anybody help? I'm still stuck on this.

radna0 changed the title ~~How to do multi-machine spmd training with TPU?~~ How to do multi-machine SPMD/FSDPv2 training with TPU? Dec 13, 2024