I wonder if you changed the number of GPUs, or the data_parallel_degree / tensor_parallel_degree, when resuming from the previous checkpoint. Resharding is currently not supported by the data loader.
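To illustrate why a change in parallelism could leave the data loader state empty, here is a minimal sketch. It assumes the state is saved under a per-rank key; the class `RankKeyedLoaderState`, the key format `dp_rank_<n>`, and the helper `save_checkpoint` are hypothetical names for illustration, not the project's actual implementation.

```python
import pickle


class RankKeyedLoaderState:
    """Hypothetical data loader state, stored under a per-rank key."""

    def __init__(self, dp_rank: int):
        self.dp_rank = dp_rank
        self.samples_seen = 0
        # Assumption: each data-parallel rank saves its state under its own key.
        self._key = f"dp_rank_{dp_rank}"

    def state_dict(self) -> dict:
        return {self._key: pickle.dumps({"samples_seen": self.samples_seen})}

    def load_state_dict(self, state: dict) -> bool:
        # A rank that did not exist at save time finds no entry for its key.
        if self._key not in state:
            print(f"warning: data loader state is empty for {self._key}")
            return False
        self.samples_seen = pickle.loads(state[self._key])["samples_seen"]
        return True


def save_checkpoint(loaders) -> dict:
    """Merge the per-rank states into one checkpoint dictionary."""
    ckpt = {}
    for loader in loaders:
        ckpt.update(loader.state_dict())
    return ckpt


# Save with a data-parallel degree of 2.
old_loaders = [RankKeyedLoaderState(rank) for rank in range(2)]
for loader in old_loaders:
    loader.samples_seen = 100
ckpt = save_checkpoint(old_loaders)

# Resume with a data-parallel degree of 4: ranks 2 and 3 have no saved state.
new_loaders = [RankKeyedLoaderState(rank) for rank in range(4)]
restored = [loader.load_state_dict(ckpt) for loader in new_loaders]
```

In this sketch the model weights could still load correctly, because they are stored independently of the per-rank data loader entries, while the new ranks silently restart from the beginning of their data shard.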
Thanks for your amazing work!
We have been testing the llama3_8b model on the SlimPajama dataset. Training seems to be fine based on the loss curves.
However, upon resuming the model from a previous checkpoint, we see the following warnings:
What could cause the DataLoader state to be empty when resuming from the checkpoint?
Note that the checkpoints themselves are loaded properly.