
Add a Benchmarking / profiling example #11

Open

lebrice opened this issue Jun 26, 2024 · 3 comments
lebrice commented Jun 26, 2024

Should be completed after mila-iqia/mila-docs#247

Now that we have an example of how to benchmark the throughput and identify bottlenecks in the mila-docs, the research project template should also make this easy to do.

  • Add an example experiment configuration and accompanying notebook that use the PyTorch profiler and do the same kind of profiling as in that example, but using the template (see the profiler sketch after this list)
  • Add an example of a sweep over some parameters, with the training throughput as the metric, and using different kinds of GPUs.
  • Create a wandb report with the throughput comparison between the different GPU types.
      1. Find the best datamodule parameters to maximize the throughput (batches per second) without training (NoOp algo)
      2. Measure the performance on different GPUs using the optimal datamodule params from before (keeping the other parameters the same)
      3. Using the results from before, do a simple sweep over model hyper-parameters to maximize the utilization of the selected GPU (which was selected as a tradeoff between performance and how hard it is to get an allocation). For example, if the RTX8000s are 20% slower than A100s but 5x easier to get an allocation on, use those instead.
  • If done after DRAC support, also include a comparison between the Mila and DRAC clusters. (For example, the optimal num_workers might be larger on DRAC due to the very slow $SCRATCH filesystems; it could be interesting to take a look at that.)
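
For reference, here is a minimal sketch of the kind of profiling the first item refers to, using the plain torch.profiler API. The model, data, loop and output path below are placeholders for illustration, not the template's actual config:

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule

device = "cuda" if torch.cuda.is_available() else "cpu"

# Placeholder model / data: substitute the model and datamodule from your experiment config.
model = torch.nn.Linear(32, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
dataset = torch.utils.data.TensorDataset(torch.randn(1024, 32), torch.randint(0, 10, (1024,)))
dataloader = torch.utils.data.DataLoader(dataset, batch_size=64)

activities = [ProfilerActivity.CPU] + ([ProfilerActivity.CUDA] if device == "cuda" else [])

with profile(
    activities=activities,
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./profiler_logs"),
    profile_memory=True,
) as prof:
    for step, (x, y) in enumerate(dataloader):
        loss = torch.nn.functional.cross_entropy(model(x.to(device)), y.to(device))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        prof.step()  # tell the profiler that one training step has finished
        if step >= 5:
            break

# Show the ops that take the most time; the full trace can be inspected in TensorBoard.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```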

lebrice commented Jul 5, 2024

Also interesting: https://github.com/nschloe/tuna
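
For context, tuna visualizes Python import-time logs and cProfile output. Typical usage looks something like this (train.py is just a placeholder entry point):

```
# Visualize where import time is going:
python -X importtime train.py 2> importtime.log
tuna importtime.log

# Or visualize a cProfile run:
python -m cProfile -o train.prof train.py
tuna train.prof
```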


lebrice commented Sep 10, 2024

More specific breakdown of the example notebook steps:

  1. Instrumenting your code: adding metrics so you can measure the things you care about, e.g. a) training speed (steps or samples per second), b) CPU/GPU utilization, RAM/VRAM utilization, etc.
    • This is achieved by using a callback (MeasureSamplesPerSecondCallback); see the first sketch after this list.
    • An easy way to set this up is with wandb: you get those "for free" in the Systems panel.
  2. Establish a baseline performance: What are the values for the metrics above that we get with our initial configuration?
  3. Check whether dataloading is the bottleneck: using the NoOp algorithm, check whether the throughput (metric a) is much higher than when actually training (see the second sketch after this list).
    • If it is much higher, then we can safely assume that the dataloader isn't the bottleneck, so we can move on to other problems.
  4. Do we even need a GPU? Compare speed using CPU only vs the slowest GPU available, for a low number of steps
    • If the CPU performance is loosely comparable to the GPU's (for instance, only 1.5-2x slower), then it might be worth considering! (LMK if this happens; one option would be to increase the number of CPUs, measure how performance scales, and then ship this kind of job to a DRAC cluster.)
    • In most workflows, using a GPU actually helps a lot.
  5. What performance do you get with each type of GPU? (Based on the VRAM requirements of the job (step 1), try all the GPU types on the cluster that can accommodate this kind of job.)
  6. How well are we using the GPU?
    • Once we've selected the target GPU that we want to use, measure the GPU utilization. Is the GPU utilization high? (>80%?)
    • If it's high (>80%), then we can either stop here, or we can keep going a bit further
    • If it's low, then we can use the PyTorch profiler (or any other tool) to try to figure out what the bottleneck is.
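
A simplified sketch of what such a samples-per-second callback could look like (a stand-in for the MeasureSamplesPerSecondCallback mentioned in step 1, written against the standard Lightning callback hooks; not the template's actual implementation):

```python
import time

from lightning.pytorch.callbacks import Callback


class SamplesPerSecond(Callback):
    """Logs training throughput in samples per second (simplified sketch)."""

    def on_train_batch_start(self, trainer, pl_module, batch, batch_idx):
        self._t0 = time.perf_counter()

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        elapsed = time.perf_counter() - self._t0
        # Assumes the batch is an (inputs, targets) tuple; adjust for your datamodule.
        batch_size = batch[0].shape[0]
        pl_module.log("samples_per_second", batch_size / elapsed, on_step=True)
```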

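And a rough sketch of the dataloader-only throughput check from step 3, i.e. iterating the dataloader without any model, which is roughly what the NoOp algo amounts to (the dataset here is a placeholder):

```python
import time

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset / DataLoader: swap in your real datamodule config.
dataset = TensorDataset(torch.randn(10_000, 32), torch.randint(0, 10, (10_000,)))
dataloader = DataLoader(dataset, batch_size=256, num_workers=4)

start = time.perf_counter()
num_samples = 0
for x, y in dataloader:
    num_samples += x.shape[0]  # no forward/backward pass: dataloading only
elapsed = time.perf_counter() - start

print(f"Dataloader-only throughput: {num_samples / elapsed:.1f} samples/s")
# If this is much higher than the throughput measured while training,
# the dataloader is not the bottleneck.
```
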

lebrice commented Sep 10, 2024

An example of step 7+ would be something like this: https://pytorch.org/blog/accelerating-generative-ai/
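
As a tiny illustration of that kind of follow-up optimization, one of the techniques covered in that post is compiling the model. A minimal sketch:

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(32, 128), torch.nn.ReLU(), torch.nn.Linear(128, 10)
)

# torch.compile (PyTorch >= 2.0) can remove Python overhead and fuse kernels;
# the first few calls are slower while the model is being compiled.
compiled_model = torch.compile(model)

x = torch.randn(64, 32)
out = compiled_model(x)
print(out.shape)
```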

lebrice linked a pull request Sep 18, 2024 that will close this issue
lebrice mentioned this issue Sep 18, 2024