
Measure times #439

Closed

miqueljubert wants to merge 1 commit.
Conversation

miqueljubert

Summary:
Add two counters containing the iteration time and the time blocked on data for the last iteration. These are stored on the state; I'm not sure whether the unit would be a better place.

They can be accessed at the start of the next iteration: since we want to capture the time of the whole iteration, including callbacks, the values cannot be read from within the callbacks of the same iteration.
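A minimal sketch of how a callback could consume these counters at the start of the following iteration. The hook follows torchtnt's callback style, but the `last_iteration_time_s` and `last_data_wait_time_s` attributes are hypothetical placeholders for wherever the counters end up living:

```python
# Hypothetical sketch: read the *previous* iteration's timings at the start
# of the next one. The timing attribute names below are placeholders, not
# the actual torchtnt API.
from torchtnt.framework.callback import Callback


class IterationTimeLogger(Callback):
    def on_train_step_start(self, state, unit) -> None:
        iter_time = getattr(state, "last_iteration_time_s", None)
        data_time = getattr(state, "last_data_wait_time_s", None)
        if iter_time is not None and data_time is not None:
            # By now the full previous iteration, callbacks included, is done.
            print(f"prev iter: {iter_time:.4f}s (data wait: {data_time:.4f}s)")
```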

Another possibility would be to keep a list of all the iteration times and data times; I'm not sure whether that would be preferable.

Differential Revision: D46853794

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D46853794

miqueljubert pushed a commit to miqueljubert/tnt that referenced this pull request Jul 17, 2023
Summary:
Pull Request resolved: pytorch#439

Add two counters containing the iteration time and the time blocked on data for the last iteration. These are stored on the state; I'm not sure whether the unit would be a better place.

They can be accessed at the start of the next iteration: since we want to capture the time of the whole iteration, including callbacks, the values cannot be read from within the callbacks of the same iteration.

Another possibility would be to keep a list of all the iteration times and data times; I'm not sure whether that would be preferable.

Reviewed By: daniellepintz

Differential Revision: D46853794

fbshipit-source-id: 439a9ae17bf867f5722cd86f21cd42599966bd9c
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D46853794

miqueljubert pushed a commit to miqueljubert/tnt that referenced this pull request Aug 7, 2023
Summary:
Pull Request resolved: pytorch#439

Add two counters containing the iteration time and the time blocked on data for the last iteration. These are stored on the state; I'm not sure whether the unit would be a better place.

I had to rework the approach slightly, since the state does not seem like the right place. With these changes it is more natural to mirror progress and store the info in a Stateful. This has the additional advantage that all values can be saved and restored with checkpoints.

Also, rather than only the last value, all of them are stored.
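A minimal sketch of what a Stateful container for the timings could look like, mirroring how progress is handled; the class and its attribute names are hypothetical, but implementing `state_dict`/`load_state_dict` is what lets checkpointing save and restore the values:

```python
# Hypothetical sketch of a Stateful timing container, analogous to Progress.
# state_dict/load_state_dict make the recorded values checkpointable.
from typing import Any, Dict, List


class TimingRecorder:
    def __init__(self) -> None:
        self.iteration_times: List[float] = []
        self.data_wait_times: List[float] = []

    def record(self, iteration_time_s: float, data_wait_time_s: float) -> None:
        self.iteration_times.append(iteration_time_s)
        self.data_wait_times.append(data_wait_time_s)

    def state_dict(self) -> Dict[str, Any]:
        return {
            "iteration_times": list(self.iteration_times),
            "data_wait_times": list(self.data_wait_times),
        }

    def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
        self.iteration_times = list(state_dict["iteration_times"])
        self.data_wait_times = list(state_dict["data_wait_times"])
```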

Reviewed By: daniellepintz

Differential Revision: D46853794

fbshipit-source-id: 888c9559c48961eccb1f9a6ecbd976cecdb90100
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D46853794

miqueljubert pushed a commit to miqueljubert/tnt that referenced this pull request Aug 7, 2023
Summary:
Pull Request resolved: pytorch#439

Add two counters containing the iteration time and the time blocked on data for the last iteration. These are stored on the state; I'm not sure whether the unit would be a better place.

I had to rework the approach slightly, since the state does not seem like the right place. With these changes it is more natural to mirror progress and store the info in a Stateful. This has the additional advantage that all values can be saved and restored with checkpoints.

Also, rather than only the last value, all of them are stored.

Differential Revision: D46853794

fbshipit-source-id: 87d0cd58963c6cfd15c2e497801ed993668b6d29
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D46853794

miqueljubert pushed a commit to miqueljubert/tnt that referenced this pull request Aug 8, 2023
Summary:
Pull Request resolved: pytorch#439

Add two counters containing the iteration time and the time blocked on data for the last iteration. These are stored and recorded using a timer that does not synchronise with CUDA and that only records these two values.

- iteration time is recorded in the training loop for all jobs.
- data time is recorded if the training loop does the data fetching; otherwise the user needs to instrument the logic that reads the data from the iterable, as in the sketch below.
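A minimal sketch of that user-side instrumentation, assuming nothing beyond the standard library: `time.perf_counter` is a plain monotonic CPU clock, so no CUDA synchronisation is involved. The wrapper class and its attribute names are illustrative, not part of the tnt API.

```python
# Hypothetical sketch: wrap a data iterable and accumulate the time spent
# blocked on next(). perf_counter is a monotonic CPU clock, so this never
# synchronises with CUDA. Names here are illustrative, not tnt API.
import time


class TimedIterable:
    def __init__(self, iterable):
        self._iterable = iterable
        self.data_wait_times = []  # seconds blocked on data, per batch

    def __iter__(self):
        it = iter(self._iterable)
        while True:
            start = time.perf_counter()
            try:
                batch = next(it)
            except StopIteration:
                return
            self.data_wait_times.append(time.perf_counter() - start)
            yield batch
```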

I had to rework the approach slightly, since the state does not seem like the right place. With these changes it is more natural to mirror progress and store the info in a Stateful. This has the additional advantage that all values can be saved and restored with checkpoints.

Also, rather than only the last value, all of them are stored.

Differential Revision: D46853794

fbshipit-source-id: 64a4f55f8fb0985831c0cd94e93eb7be67a30faa
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D46853794

1 similar comment

miqueljubert pushed a commit to miqueljubert/tnt that referenced this pull request Aug 18, 2023
Summary:
Pull Request resolved: pytorch#439

Add two counters containing the iteration time and the time blocked on data for the last iteration. These are stored and recorded using a timer that does not synchronise with CUDA and that only records these two values.

- iteration time is recorded in the training loop for all jobs.
- data time is recorded if the training loop does the data fetching; otherwise the user needs to instrument the logic that reads the data from the iterable.

I had to rework the approach slightly, since the state does not seem like the right place. With these changes it is more natural to mirror progress and store the info in a Stateful. This has the additional advantage that all values can be saved and restored with checkpoints.

Also, rather than only the last value, the last LOWER_BOUND values are stored. I hardcoded this in the torchtnt state to avoid bloating the parameters, since intuitively the last 1e4 values per timer action should be enough for monitoring purposes.
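A minimal sketch of the bounded storage described above: a `deque` with `maxlen` silently drops the oldest measurement once the cap is reached, so memory stays constant however long training runs. The constant name is made up; only the 1e4 figure comes from the description:

```python
# Sketch: keep only the last N measurements per timer action.
# The constant name is hypothetical; 10_000 mirrors the 1e4 bound above.
from collections import defaultdict, deque

MAX_RECORDED_VALUES = 10_000

recorded_durations = defaultdict(lambda: deque(maxlen=MAX_RECORDED_VALUES))

# Oldest entries are dropped automatically once the cap is hit.
recorded_durations["train_iteration_time"].append(0.0123)
recorded_durations["data_wait_time"].append(0.0041)
```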

Reviewed By: daniellepintz, ananthsub

Differential Revision: D46853794

fbshipit-source-id: 6d5c5d4b2ab3372c606f60a4c12f2a5e0092e867
miqueljubert pushed a commit to miqueljubert/tnt that referenced this pull request Aug 21, 2023
Summary:
Pull Request resolved: pytorch#439

Add two counters containing the iteration time and the time blocked on data for the last iteration. These are stored and recorded using a timer that does not synchronise with CUDA and that only records these two values.

- iteration time is recorded in the training loop for all jobs.
- data time is recorded if the training loop does the data fetching; otherwise the user needs to instrument the logic that reads the data from the iterable.

I had to rework the approach slightly, since the state does not seem like the right place. With these changes it is more natural to mirror progress and store the info in a Stateful. This has the additional advantage that all values can be saved and restored with checkpoints.

Also, rather than only the last value, the last LOWER_BOUND values are stored. I hardcoded this in the torchtnt state to avoid bloating the parameters, since intuitively the last 1e4 values per timer action should be enough for monitoring purposes.

Reviewed By: daniellepintz, ananthsub

Differential Revision: D46853794

fbshipit-source-id: 71178021e60a78c993474ae3bdfeb208aa1e4a88
@codecov bot commented Aug 21, 2023

Codecov Report

Merging #439 (c836b76) into master (a690136) will increase coverage by 0.07%.
The diff coverage is 100.00%.

❗ Current head c836b76 differs from pull request most recent head 1fab468. Consider uploading reports for the commit 1fab468 to get more accurate results

@@            Coverage Diff             @@
##           master     #439      +/-   ##
==========================================
+ Coverage   86.95%   87.02%   +0.07%     
==========================================
  Files         106      106              
  Lines        8407     8455      +48     
==========================================
+ Hits         7310     7358      +48     
  Misses       1097     1097              
Files Changed                      Coverage Δ
tests/framework/test_train.py     100.00% <100.00%> (ø)
tests/utils/test_timer.py          92.64% <100.00%> (+0.64%) ⬆️
torchtnt/framework/auto_unit.py    79.74% <100.00%> (+0.06%) ⬆️
torchtnt/framework/state.py       100.00% <100.00%> (ø)
torchtnt/framework/train.py        97.75% <100.00%> (+0.02%) ⬆️
torchtnt/utils/timer.py            94.11% <100.00%> (+0.61%) ⬆️

... and 26 files with indirect coverage changes


@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D46853794

miqueljubert pushed a commit to miqueljubert/tnt that referenced this pull request Aug 23, 2023
Summary:
Pull Request resolved: pytorch#439

Add two counters containing the iteration time and the time blocked on data for the last iteration. These are stored and recorded using a timer that does not synchronise with CUDA and that only records these two values.

- iteration time is recorded in the training loop for all jobs.
- data time is recorded if the training loop does the data fetching; otherwise the user needs to instrument the logic that reads the data from the iterable.

I had to rework the approach slightly, since the state does not seem like the right place. With these changes it is more natural to mirror progress and store the info in a Stateful. This has the additional advantage that all values can be saved and restored with checkpoints.

Also, rather than only the last value, the last LOWER_BOUND values are stored. I hardcoded this in the torchtnt state to avoid bloating the parameters, since intuitively the last 1e4 values per timer action should be enough for monitoring purposes.

Reviewed By: daniellepintz, ananthsub

Differential Revision: D46853794

fbshipit-source-id: e0c74d7147a8da57f51ef8c9bd7bbb6ec38858c9
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D46853794

miqueljubert pushed a commit to miqueljubert/tnt that referenced this pull request Aug 23, 2023
Summary:
Pull Request resolved: pytorch#439

Add two counters containing the iteration time and the time blocked on data for the last iteration. These are stored and recorded using a timer that does not synchronise with CUDA and that only records these two values.

- iteration time is recorded in the training loop for all jobs.
- data time is recorded if the training loop does the data fetching; otherwise the user needs to instrument the logic that reads the data from the iterable.

I had to rework the approach slightly, since the state does not seem like the right place. With these changes it is more natural to mirror progress and store the info in a Stateful. This has the additional advantage that all values can be saved and restored with checkpoints.

Also, rather than only the last value, the last LOWER_BOUND values are stored. I hardcoded this in the torchtnt state to avoid bloating the parameters, since intuitively the last 1e4 values per timer action should be enough for monitoring purposes.

Reviewed By: daniellepintz, ananthsub

Differential Revision: D46853794

fbshipit-source-id: dafc454fb790d91debfd3ff452c49c0ff6f84135
Summary:
Pull Request resolved: pytorch#439

Add two counters containing the iteration time and the time blocked on data for the last iteration. These are stored and recorded using a timer that does not synchronise with CUDA and that only records these two values.

- iteration time is recorded in the training loop for all jobs.
- data time is recorded if the training loop does the data fetching; otherwise the user needs to instrument the logic that reads the data from the iterable.

I had to rework the approach slightly, since the state does not seem like the right place. With these changes it is more natural to mirror progress and store the info in a Stateful. This has the additional advantage that all values can be saved and restored with checkpoints.

Also, rather than only the last value, the last LOWER_BOUND values are stored. I hardcoded this in the torchtnt state to avoid bloating the parameters, since intuitively the last 1e4 values per timer action should be enough for monitoring purposes.

Reviewed By: daniellepintz, ananthsub

Differential Revision: D46853794

fbshipit-source-id: ac2e9f992f21ac66625f89d1c75d228397740e1f
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D46853794
