Add Batch Optimization Scripts for Neuron Instances #500
base: main
Conversation
…erfile for the e2e BERT training task
…or MPI, NCCL, and EFA.
…ce to be consistent with the other test images
Is this going to be used to tune our test cases? https://github.com/aws/aws-k8s-tester/tree/main/e2e2/test/cases/neuron I'm not clear on the goal
COPY train_bert_neuron.py /app/train_bert_neuron.py
COPY infer_bert_neuron.py /app/infer_bert_neuron.py
So this image supports inference and training for Neuron? Should we just put it under e2e2's images folder rather than hack?
You could leave these Python scripts in /hack and then just volume mount them into the container.
Correct. Yeah, honestly I struggled with where to put these, and someone recommended hack a couple of weeks ago. The main use case right now is to get the optimal batch size to support upcoming benchmarking efforts for our e2e tests.
I could see it evolving in the future to being run automatically when certain dependencies are updated, or as new instance types become available.
So IIUC we can use the Neuron test for inference tuning, but you need an image for Neuron here that supports training as well? I'm trying to decouple the test image from the optimization suite/framework.
@ndbaker1 Use of a Dockerfile was just to make things more portable across instances as I did testing. Also, while it probably made no difference, there is slight overhead introduced by running in a container versus just a script. The container also brings in additional dependencies (e.g. the Neuron container runtime), which makes the optimization's environment closer to the tests' runtime environment.
@ndbaker1 @cartermckinnon Not sure I have a perfect answer for where these scripts/Dockerfile should go, but here's the full context...
- The training and inference tests that are part of e2e2 currently use suboptimal values for their batch parameter.
- A standard batch value is hardcoded for all of them, leaving many of the instance's GPUs underutilized.
- A major goal moving forward is to benchmark these tests on all instances and to understand what peak performance looks like for each instance type.
- These new optimization scripts target a single GPU on an instance (even on multi-GPU instances) and determine the maximum batch size that a GPU of that type can handle (see the sketch after this list).
- The optimal batch value is then used to determine the total batch size per instance (batch_size * num_gpus), enabling us to run benchmarking for each instance at full GPU utilization (like our customers would).
- Separate training and inference scripts are needed because, depending on the "mode" of a model, more or less memory is utilized.
- Memory utilization by mode differs significantly because training requires large amounts of temporary parameter state to be held in memory (weights/parameters get updated during the training process), while inference does not (parameter values are static).
- The scripts were containerized to more closely mirror the tests' runtime environment of running on Kubernetes.
- A single Dockerfile was used for simplicity.
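For concreteness, here is a minimal sketch of the batch-size search described in the list above, in plain Python. The `run_step` callable is a hypothetical stand-in for a single training or inference step at a given batch size (the real logic lives in `train_bert_neuron.py` / `infer_bert_neuron.py`, which may work differently); it is assumed to raise a `RuntimeError` when the batch does not fit on the device.

```python
def find_max_batch_size(run_step, start=1, ceiling=4096):
    """Largest batch size for which run_step(batch_size) completes.

    run_step is a hypothetical helper that executes one training or
    inference step and raises RuntimeError (e.g. out of memory) when the
    batch does not fit on a single device.
    """
    def fits(bs):
        try:
            run_step(bs)
            return True
        except RuntimeError:  # assumed to signal an out-of-memory failure
            return False

    # Phase 1: grow exponentially until a step fails or we hit the ceiling.
    bs = start
    while bs <= ceiling and fits(bs):
        bs *= 2
    if bs == start:
        return 0  # even the starting batch size does not fit

    # Phase 2: binary search between the last success and the first failure.
    lo, hi = bs // 2, min(bs, ceiling + 1)
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if fits(mid):
            lo = mid
        else:
            hi = mid
    return lo


# Example: the per-instance batch used for benchmarking would then be the
# per-device optimum times the number of devices on the instance.
# max_bs = find_max_batch_size(one_bert_step)  # one_bert_step is hypothetical
# total_batch = max_bs * num_devices           # i.e. batch_size * num_gpus from above
```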
Makes sense. Can we include this script in our existing test images so we don't need a separate pipeline for it? It will be easier to set up a periodic for this as well if it's all the same spec with a different command.
I like this; dependencies should be kept consistent anyway. Can't do this for Neuron yet, though. I'm just now noticing that the PR for Neuron BERT training/inference was closed and never merged. Will need to get that merged in first.
@cartermckinnon Yes, these are used to determine the optimal batch size for Neuron instances for both the training and inference e2e tests. There's one for NVIDIA instances as well - #498
This pull request introduces the training and inference scripts essential for model development. Additionally, a supporting Dockerfile is provided to optimize batch sizes specifically for Neuron instances, ensuring efficient utilization of accelerator resources.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.