Add Batch Optimization Scripts for Neuron Instances #500
Open: mattcjo wants to merge 24 commits into aws:main from mattcjo:batch-optimization-neuron
Commits (24)
af9fda0  Add python training script, requirements.txt (dependencies), and dock…
104fa93  Add github action to build bert-testing image on PR
477f672  Specify directory the BERT training image should be built in for the …
fb7d18f  Add default values and include in docker env for MASTER_ADDR and MAST…
b5aedc7  Slightly change env var value retrieval. Also ran a formatter to pret…
7f9480b  Update bert training dockerfile to include amazon specific packages f…
19613e1  Change Dockerfile.bert-training file name to just Dockerfile
974da50  Update git workflow to use new Dockerfile path since the name was upd…
5b4ae1a  Update Docker image to use Python version 3.10.12 and build from sour…
6bc3ef4  Merge remote-tracking branch 'upstream/main'
fa8d244  Remove extra line
f87ba65  Had been setting MASTER_ADDR and MASTER_PORT env vars twice. Removed …
7af6b13  Set each process to a GPU via local rank instead of overall rank
1a3ad52  Merge remote-tracking branch 'upstream/main'
1f5b1c9  Change comment describing section in dockerfile
b67026c  Merge branch 'aws:main' into main
4a8e0ec  parameterize number of gpus per node in Dockerfile and train.py
60ddc02  Merge remote-tracking branch 'upstream/main'
01d8270  formatting in train.py
21fd336  Merge remote-tracking branch 'upstream/main'
f250ede  Merge branch 'aws:main' into main
f000ec6  Add nvidia batch optimization scripts for both training and inference
21e27a0  Merge branch 'aws:main' into batch-optimization-neuron
7493cfd  Move Neuron scripts into neuron directory
Dockerfile (new file)
@@ -0,0 +1,73 @@
# Use Ubuntu 20.04 as the base image
FROM ubuntu:20.04

# Neuron SDK component versions
ARG NEURONX_FRAMEWORK_VERSION=2.11.0.0
ARG NEURONX_RUNTIME_LIB_VERSION=2.11.7.0
ARG NEURONX_TOOLS_VERSION=2.11.8.0
ARG NEURONX_CC_VERSION=2.11.8.0

# Set environment variables
ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1
ENV PYTHONIOENCODING=UTF-8
ENV LD_LIBRARY_PATH="/opt/aws/neuron/lib:/usr/local/lib"
ENV PATH="/opt/aws/neuron/bin:$PATH"

# Install system dependencies, including libsqlite3-dev and libbz2-dev for the Python build
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    build-essential \
    ca-certificates \
    curl \
    wget \
    zlib1g-dev \
    gnupg2 \
    libssl-dev \
    libffi-dev \
    libsqlite3-dev \
    libbz2-dev \
    libopenblas-dev \
    libomp5 \
    && rm -rf /var/lib/apt/lists/*

# Add the Neuron apt repository and install Neuron SDK components
RUN echo "deb https://apt.repos.neuron.amazonaws.com focal main" > /etc/apt/sources.list.d/neuron.list && \
    wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-NEURON.PUB | apt-key add - && \
    apt-get update && \
    apt-get install -y \
    aws-neuronx-tools=${NEURONX_TOOLS_VERSION} \
    aws-neuronx-runtime-lib=${NEURONX_RUNTIME_LIB_VERSION} \
    && rm -rf /var/lib/apt/lists/*

# Build and install Python 3.10.12 from source with sqlite3 and bz2 support
RUN wget -q https://www.python.org/ftp/python/3.10.12/Python-3.10.12.tgz && \
    tar -xzf Python-3.10.12.tgz && \
    cd Python-3.10.12 && \
    ./configure --enable-shared --enable-optimizations --with-ensurepip=install && \
    make -j $(nproc) && make install && \
    cd .. && rm -rf Python-3.10.12*

# Upgrade pip
RUN python3.10 -m pip install --upgrade pip

# Install Neuron-related Python packages from the Neuron pip repository
RUN python3.10 -m pip install --no-cache-dir \
    --extra-index-url https://pip.repos.neuron.amazonaws.com \
    torch-neuronx==${NEURONX_FRAMEWORK_VERSION} \
    torch-xla==1.13.* \
    torchvision

# Install additional Python packages
RUN python3.10 -m pip install --no-cache-dir \
    transformers==4.29 \
    numpy==1.23 \
    pynvml

# Set the working directory
WORKDIR /app

# Copy training and inference scripts
COPY train_bert_neuron.py /app/train_bert_neuron.py
COPY infer_bert_neuron.py /app/infer_bert_neuron.py
infer_bert_neuron.py (new file)
@@ -0,0 +1,82 @@
import os

# Unset XLA_FLAGS to avoid GPU-specific issues on Neuron
os.environ.pop('XLA_FLAGS', None)

import time
import torch
import torch_neuronx
from transformers import BertTokenizer, BertForPreTraining
from torch.utils.data import DataLoader, TensorDataset

def create_dummy_data(tokenizer, num_samples=1000, max_length=128):
    # Generate a synthetic dataset so no external data download is needed
    sentences = [
        f"This is a dummy sentence number {i}" for i in range(num_samples)
    ]
    tokenized_inputs = tokenizer(
        sentences,
        max_length=max_length,
        padding="max_length",
        truncation=True,
        return_tensors="pt",
    )
    labels = tokenized_inputs.input_ids.detach().clone()
    next_sentence_labels = torch.randint(0, 2, (num_samples,))
    return TensorDataset(
        tokenized_inputs.input_ids,
        tokenized_inputs.attention_mask,
        labels,
        next_sentence_labels,
    )

def infer_bert_neuron(model, tokenizer, batch_sizes):
    dataset = create_dummy_data(tokenizer)
    results = []

    # Try batch sizes from largest to smallest; stop at the first that fits
    for batch_size in batch_sizes:
        try:
            dataloader = DataLoader(dataset, batch_size=batch_size)
            start_time = time.time()
            for batch in dataloader:
                inputs, masks, labels, next_sentence_labels = batch
                # A traced Neuron module takes CPU tensors, passed
                # positionally in the order used at trace time
                outputs = model(inputs, masks)
            end_time = time.time()
            inference_time = end_time - start_time
            throughput = len(dataset) / inference_time

            print(f"Batch Size: {batch_size}")
            print(f"Inference time: {inference_time:.2f} seconds")
            print(f"Throughput: {throughput:.2f} samples/second")

            results.append({
                'batch_size': batch_size,
                'throughput': throughput,
            })
            break  # Exit after the first successful batch size

        except RuntimeError as e:
            if 'out of memory' in str(e).lower():
                print(f"Batch Size {batch_size}: Out of Memory. Trying smaller batch size.")
                torch.cuda.empty_cache()  # no-op on Neuron; retained from the GPU variant
                continue
            else:
                raise e

    print("Optimal Batch Size Found:")
    for res in results:
        print(f"Batch Size: {res['batch_size']}, Throughput: {res['throughput']:.2f} samples/sec")

def main():
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForPreTraining.from_pretrained("bert-base-uncased")

    # torch_neuronx.trace compiles the model for Neuron from CPU example
    # inputs; the traced graph is specialized to these exact shapes, so
    # sweeping batch sizes may require re-tracing with matching shapes
    example_inputs = (
        torch.randint(0, 2000, (1, 128)),
        torch.ones(1, 128, dtype=torch.long),
    )
    model_neuron = torch_neuronx.trace(model, example_inputs)

    batch_sizes = [128, 64, 32, 16, 8]
    infer_bert_neuron(model_neuron, tokenizer, batch_sizes)

if __name__ == "__main__":
    main()
train_bert_neuron.py (new file)
@@ -0,0 +1,103 @@
import os

# Unset XLA_FLAGS to avoid GPU-specific issues on Neuron
os.environ.pop('XLA_FLAGS', None)

import time
import torch
import torch_xla
import torch_xla.core.xla_model as xm
from transformers import BertForPreTraining, BertTokenizer
from torch.utils.data import DataLoader, TensorDataset

def create_dummy_data(tokenizer, num_samples=1000, max_length=128):
    # Generate a synthetic dataset so no external data download is needed
    sentences = [
        f"This is a dummy sentence number {i}" for i in range(num_samples)
    ]
    tokenized_inputs = tokenizer(
        sentences,
        max_length=max_length,
        padding="max_length",
        truncation=True,
        return_tensors="pt",
    )
    labels = tokenized_inputs.input_ids.detach().clone()
    next_sentence_labels = torch.randint(0, 2, (num_samples,))
    return TensorDataset(
        tokenized_inputs.input_ids,
        tokenized_inputs.attention_mask,
        labels,
        next_sentence_labels,
    )

def train_bert_neuron(model, tokenizer, batch_sizes, device):
    model.train()
    model.to(device)

    dataset = create_dummy_data(tokenizer)
    results = []

    # Try batch sizes from largest to smallest; stop at the first that fits
    for batch_size in batch_sizes:
        try:
            train_dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
            optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)

            # Measure training time for throughput calculation
            start_time = time.time()
            for batch in train_dataloader:
                optimizer.zero_grad()
                inputs, masks, labels, next_sentence_labels = batch
                inputs, masks, labels, next_sentence_labels = (
                    inputs.to(device),
                    masks.to(device),
                    labels.to(device),
                    next_sentence_labels.to(device),
                )
                outputs = model(
                    input_ids=inputs,
                    attention_mask=masks,
                    labels=labels,
                    next_sentence_label=next_sentence_labels,
                )
                loss = outputs.loss
                loss.backward()
                optimizer.step()
                # XLA executes lazily; materialize each step's graph so the
                # work actually runs and the timing above is meaningful
                xm.mark_step()
            end_time = time.time()
            training_time = end_time - start_time
            throughput = len(dataset) / training_time

            print(f"Batch Size: {batch_size}")
            print(f"Training time: {training_time:.2f} seconds")
            print(f"Throughput: {throughput:.2f} samples/second")

            results.append({
                'batch_size': batch_size,
                'throughput': throughput,
            })
            break  # Exit after the first successful batch size

        except RuntimeError as e:
            if 'out of memory' in str(e).lower():
                print(f"Batch Size {batch_size}: Out of Memory. Trying smaller batch size.")
                torch.cuda.empty_cache()  # no-op on Neuron; retained from the GPU variant
                continue
            else:
                raise e

    print("Optimal Batch Size Found:")
    for res in results:
        print(f"Batch Size: {res['batch_size']}, Throughput: {res['throughput']:.2f} samples/sec")

def main():
    device = xm.xla_device()

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForPreTraining.from_pretrained("bert-base-uncased")

    batch_sizes = [128, 64, 32, 16, 8]

    train_bert_neuron(model, tokenizer, batch_sizes, device)

if __name__ == "__main__":
    main()
Conversation
Reviewer: So this image supports inference and training for Neuron? Should we just put it under e2e2's images folder rather than hack? These Python scripts you could leave in /hack and then just volume mount them into the container.
mattcjo: Correct. Yeah, honestly I struggled with where to put these, and someone recommended hack a couple of weeks ago. The main use case right now is just to get the optimal batch size to support upcoming benchmarking efforts for our e2e tests. I could see it evolving in the future to being run automatically when certain dependencies are updated, or as new instance types become available.
Reviewer: So IIUC we can use the Neuron test for inference tuning, but you need an image for Neuron here that supports training as well? I'm trying to decouple the test image from the optimization suite/framework.
mattcjo: @ndbaker1 Use of the Dockerfile was just to make things more portable across instances as I did testing. Also, while it probably made no difference, there is slight overhead introduced by running in a container versus just a script. There are additional dependencies as well (e.g. the Neuron container runtime), which makes the optimization's environment closer to the tests' runtime environment.
mattcjo: @ndbaker1 @cartermckinnon Not sure I have a perfect answer for where these scripts/Dockerfile should go, but here's the full context:

- The training and inference tests that are part of e2e2 currently have suboptimal values for their batch parameter: a standard batch value is hardcoded for all of them, leaving many of an instance's GPUs underutilized.
- A major goal moving forward is to benchmark these tests on all instances and to understand what full peak performance looks like for each instance type.
- These new optimization scripts target a single GPU on an instance (even if the instance has multiple GPUs) and determine the maximum batch size that a GPU of that type can handle.
- The optimal batch value will then be used to determine the total batch size per instance (batch_size * num_gpus), enabling us to run benchmarking for each instance at full GPU utilization (like our customers would); see the sketch after this list.
- Separate training and inference scripts are needed because, depending on the "mode" of a model, more or less memory is utilized. The difference is significant: training requires large amounts of temporary values to be held in memory (gradients and optimizer state, as weights get updated during training), while inference does not (parameter values are static).
- The scripts were containerized to more closely mirror the tests' runtime environment of running on Kubernetes.
- A single Dockerfile was used for simplicity.
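A back-of-the-envelope sketch of those last points (a hypothetical illustration in Python; the helper names and numbers are examples, not code from this PR):

# Hypothetical sketch: why training tolerates smaller batches than
# inference, and how the per-instance batch follows from the single-device
# optimum. Activation memory, which grows with batch size, is omitted.

BYTES_PER_PARAM = 4  # fp32

def model_memory_bytes(num_params: int, training: bool) -> int:
    # Inference holds just the weights; training also holds gradients plus
    # two AdamW moments per parameter, roughly 4x the weights alone.
    multiplier = 4 if training else 1
    return num_params * BYTES_PER_PARAM * multiplier

def instance_batch_size(per_device_optimum: int, num_devices: int) -> int:
    # Every device runs the same per-device batch, so the effective
    # instance-level batch is batch_size * num_gpus.
    return per_device_optimum * num_devices

# BERT-base has roughly 110M parameters; suppose a single device tops out
# at a batch of 32 and the instance exposes 8 devices:
print(f"{model_memory_bytes(110_000_000, training=True) / 2**30:.1f} GiB")   # ~1.6 GiB
print(f"{model_memory_bytes(110_000_000, training=False) / 2**30:.1f} GiB")  # ~0.4 GiB
print(instance_batch_size(32, 8))  # 256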
Reviewer: Makes sense. Can we include this script in our existing test images so we don't need a separate pipeline for it? It will be easier to set up a periodic for this as well if it's all the same spec with a different command.
mattcjo: I like this; dependencies should be kept consistent anyway. Can't do this for Neuron yet, though: I'm just now noticing that the PR for Neuron BERT training/inference was closed and never merged. Will need to get that merged in first.