First Shard Group Save and Load Checkpoint for HSDP #709

Open
qsh-zh opened this issue Nov 29, 2024 · 4 comments
Labels
question Further information is requested

Comments

qsh-zh commented Nov 29, 2024

Based on my understanding, the current strategy is:
1. All ranks read and load the checkpoint.
2. All ranks save and write the checkpoint.

I have a question regarding the HSDP case:
If different shard groups write data to storage, could this lead to data corruption?
Ideally, should only the first shard group read the data, broadcast it, and handle writing to ensure consistency?
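For concreteness, a minimal sketch of the HSDP setup I have in mind (the mesh sizes are placeholders, and the fully_shard import path depends on the PyTorch version):

```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._composable.fsdp import fully_shard  # import path varies by version

# 2D mesh: ranks along "dp_replicate" hold replicas; ranks along "dp_shard"
# form the shard groups discussed above. (2, 4) is a placeholder for 8 GPUs.
mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp_replicate", "dp_shard"))

model = torch.nn.Linear(16, 16, device="cuda")
fully_shard(model, mesh=mesh)  # a 2D mesh gives hybrid sharded data parallel
```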

qsh-zh commented Dec 2, 2024

Maybe @tianyu-l or @fegin know?

tianyu-l added the question label Dec 2, 2024

fegin commented Dec 2, 2024

If different shard groups write data to storage, could this lead to data corruption?

Why would this lead to data corruption?

Ideally, should only the first shard group read the data, broadcast it, and handle writing to ensure consistency?

DCP will only save one copy of the data if the data is replicated across ranks. It is not necessarily the first rank or the first shard group that saves the replicated data; DCP decides this during the planning phase.
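Roughly, every rank simply participates in the save call and the planner handles deduplication; a sketch (the checkpoint path is arbitrary):

```python
import torch.distributed.checkpoint as dcp

# All ranks call save collectively. During planning, DCP picks which rank
# writes each replicated shard, so only one copy lands in storage, and each
# writing rank writes to its own file.
state_dict = {"model": model.state_dict()}
dcp.save(state_dict, checkpoint_id="checkpoints/step-1000")
```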


qsh-zh commented Dec 2, 2024

@fegin
Thank you for your explanation.

Why would this lead to data corruption?

When multiple processes write to the same file, isn’t it common to encounter data corruption without proper file locks or scheduling mechanisms?

DCP will only save one copy of the data if the data is replicated across ranks.

Interesting—thank you for clarifying. If there’s a planner coordinating the writes, the file system corruption issue should not occur.

In the meantime, I’ve been exploring the DCP implementation and APIs. However, there is no detailed documentation explaining the coordinator or planner components.

I’d like to share what I’ve found so far. Please correct me if I’m mistaken, and hopefully, this will help others as well:
• dcp.save has an argument called process_group.
• The _DistWrapper class accepts the process_group.
• In the save path, central_plan: SavePlan = distW.reduce_scatter("plan", local_step, global_step) seems to coordinate the saving process.
• If we pass process_group=None, the deduplication is handled over the world PG.

Based on this logic, it seems that setting process_group=None might be the best approach. Could you confirm whether this should always be the case? When do we need to pass a non-None argument for process_group?
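To make my reading concrete, a sketch of what I mean (the path is a placeholder):

```python
import torch.distributed.checkpoint as dcp

# My understanding: with process_group=None (the default), DCP plans and
# deduplicates over the world process group, which should cover HSDP.
state_dict = {"model": model.state_dict()}
dcp.save(state_dict, checkpoint_id="checkpoints/step-1000", process_group=None)
```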

Additionally, I have another question:
Does the logic of dcp.load work similarly to dcp.save, or do all ranks operate independently without synchronization? For replicated groups, do they read the same data? It seems there is no deduplication or broadcast step.


fegin commented Dec 3, 2024

When multiple processes write to the same file, isn’t it common to encounter data corruption without proper file locks or scheduling mechanisms?

Yes, but even if DCP's planner decides to save multiple copies, it still won't cause data corruption because different ranks write to different files.

Your understanding is mostly correct. As for the world PG, there may be cases where users want to save among only a subset of ranks. This is not common, but some advanced users have their own infrastructure designs that call for it.
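For example, such an advanced setup might checkpoint from a subgroup only (hypothetical ranks, just a sketch):

```python
import torch.distributed as dist
import torch.distributed.checkpoint as dcp

# Hypothetical: only ranks 0-3 hold the state being checkpointed.
save_ranks = [0, 1, 2, 3]
save_group = dist.new_group(ranks=save_ranks)  # new_group is collective: call on every rank

if dist.get_rank() in save_ranks:
    # DCP then coordinates planning and deduplication within this subgroup only.
    dcp.save(state_dict, checkpoint_id="checkpoints/step-1000", process_group=save_group)
```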

As for loading, there is again a planning phase that coordinates all the ranks so that the data is loaded correctly without loading redundant data. DCP assumes a distributed file system, so that each rank can access the required files; if such a file system does not exist, users need to make sure the required files are accessible.
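A matching load looks roughly like this (sketch; the path is arbitrary, and every rank must be able to read the checkpoint directory, e.g. on a shared filesystem):

```python
import torch.distributed.checkpoint as dcp

# Each rank loads only the shards it needs; the load planner coordinates this,
# so there is no separate broadcast step.
state_dict = {"model": model.state_dict()}
dcp.load(state_dict, checkpoint_id="checkpoints/step-1000")
model.load_state_dict(state_dict["model"])
```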
