start_index not getting reset in data loader when moving to new epoch #650

Open
leon-g-xu opened this issue Jul 10, 2024 · 4 comments
Labels
type/bug An issue about a bug

Comments

@leon-g-xu
Contributor

leon-g-xu commented Jul 10, 2024

🐛 Describe the bug

When a training job resumes from a checkpoint, it picks up at the epoch and start_index saved in that checkpoint.
The start_index is set on the data loader.
However, start_index is not reset to 0 when the current epoch finishes and the next epoch starts, so the new epoch still reads data starting from the old start_index instead of from the beginning.

Where start_index is loaded from the checkpoint: https://github.com/allenai/OLMo/blob/main/olmo/train.py#L377
Where start_index is used in the data loader (without being reset): https://github.com/allenai/OLMo/blob/main/olmo/data/iterable_dataset.py#L133-L135
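
To make the reported behavior concrete, here is a minimal sketch of the failure mode. It is not OLMo's code: `ResumableDataset` and `run_epochs` are hypothetical stand-ins, and the only assumption carried over from the report is that the dataset applies `start_index` each time it is iterated.

```python
from typing import Iterator, List


class ResumableDataset:
    """Hypothetical stand-in for an iterable dataset with a resume offset."""

    def __init__(self, size: int, start_index: int = 0) -> None:
        self.size = size
        self.start_index = start_index

    def __iter__(self) -> Iterator[int]:
        # The offset is applied on every pass, not only the resumed one.
        return iter(range(self.start_index, self.size))


def run_epochs(dataset: ResumableDataset, num_epochs: int) -> List[List[int]]:
    """Collect the indices seen in each epoch without touching start_index."""
    return [list(dataset) for _ in range(num_epochs)]


# Simulate resuming mid-epoch at index 4 of a 6-item dataset.
dataset = ResumableDataset(size=6, start_index=4)
epochs_seen = run_epochs(dataset, num_epochs=2)
print(epochs_seen)  # [[4, 5], [4, 5]]
```

In this toy setup the second epoch repeats the truncated range of the resumed epoch, which is the symptom described above: items 0 through 3 are never read again.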

Versions

olmo 0.3.0

leon-g-xu added the type/bug label on Jul 10, 2024
@leon-g-xu
Contributor Author

One solution is to reset start_index to 0 when the next epoch starts. I am not sure whether there is a setting that I missed.
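
A rough sketch of that suggestion, under the same assumptions as the report: the offset should apply only to the epoch being resumed, so the training loop clears it once that epoch is done. `ResumableDataset` and `run_epochs_with_reset` are hypothetical names for illustration, not OLMo's API, and this is not necessarily how the project itself addresses the problem.

```python
from typing import Iterator, List


class ResumableDataset:
    """Hypothetical stand-in for an iterable dataset with a resume offset."""

    def __init__(self, size: int, start_index: int = 0) -> None:
        self.size = size
        self.start_index = start_index

    def __iter__(self) -> Iterator[int]:
        return iter(range(self.start_index, self.size))


def run_epochs_with_reset(dataset: ResumableDataset, num_epochs: int) -> List[List[int]]:
    """Iterate for several epochs, clearing the resume offset after each one."""
    epochs: List[List[int]] = []
    for _ in range(num_epochs):
        epochs.append(list(dataset))
        # The offset only makes sense for the resumed epoch, so drop it here.
        dataset.start_index = 0
    return epochs


# Simulate resuming mid-epoch at index 4 of a 6-item dataset.
dataset = ResumableDataset(size=6, start_index=4)
epochs_fixed = run_epochs_with_reset(dataset, num_epochs=2)
print(epochs_fixed)  # [[4, 5], [0, 1, 2, 3, 4, 5]]
```

With the reset in place, the resumed epoch finishes from index 4 and the following epoch covers the full dataset.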

@AkshitaB
Contributor

@epwalsh I believe you already fixed this. Can you confirm?

@leon-g-xu
Contributor Author

If this is already fixed, can you share the commit/PR that fixes this?

@epwalsh
Member

epwalsh commented Jul 29, 2024

Yeup, fixed here: a3e2ea7


3 participants