Add Batch Optimization Scripts for Neuron Instances #500

Open · mattcjo wants to merge 24 commits into main from batch-optimization-neuron

Commits (24)
af9fda0
Add python training script, requirements.txt (dependencies), and dock…
mattcjo Jun 26, 2024
104fa93
Add github action to build bert-testing image on PR
mattcjo Jun 26, 2024
477f672
Specify directory the BERT training image should be built in for the …
mattcjo Jun 26, 2024
fb7d18f
Add default values and include in docker env for MASTER_ADDR and MAST…
mattcjo Jun 27, 2024
b5aedc7
Slightly change env var value retrieval. Also ran a formatter to pret…
mattcjo Jun 27, 2024
7f9480b
Update bert training dockerfile to include amazon specific packages f…
mattcjo Jun 28, 2024
19613e1
Change Dockerfile.bert-training file name to just Dockerfile
mattcjo Jul 16, 2024
974da50
Update git workflow to use new Dockerfile path since the name was upd…
mattcjo Jul 16, 2024
5b4ae1a
Update Docker image to use Python version 3.10.12 and build from sour…
mattcjo Jul 16, 2024
6bc3ef4
Merge remote-tracking branch 'upstream/main'
mattcjo Jul 16, 2024
fa8d244
Remove extra line
mattcjo Jul 16, 2024
f87ba65
Had been setting MASTER_ADDR and MASTER_PORT env vars twice. Removed …
mattcjo Jul 18, 2024
7af6b13
Set each process to a GPU via local rank instead of overall rank
mattcjo Jul 18, 2024
1a3ad52
Merge remote-tracking branch 'upstream/main'
mattcjo Jul 18, 2024
1f5b1c9
Change comment describing section in dockerfile
mattcjo Jul 19, 2024
b67026c
Merge branch 'aws:main' into main
mattcjo Jul 23, 2024
4a8e0ec
parameterize number of gpus per node in Dockerfile and train.py
mattcjo Jul 23, 2024
60ddc02
Merge remote-tracking branch 'upstream/main'
mattcjo Jul 31, 2024
01d8270
formatting in train.py
mattcjo Jul 31, 2024
21fd336
Merge remote-tracking branch 'upstream/main'
mattcjo Aug 7, 2024
f250ede
Merge branch 'aws:main' into main
mattcjo Aug 30, 2024
f000ec6
Add nvidia batch optimization scripts for both training and inference
mattcjo Oct 11, 2024
21e27a0
Merge branch 'aws:main' into batch-optimization-neuron
mattcjo Oct 25, 2024
7493cfd
Move Neuron scripts into neuron directory
mattcjo Oct 25, 2024
73 changes: 73 additions & 0 deletions hack/optimize/neuron/Dockerfile
@@ -0,0 +1,73 @@
# Use Ubuntu 20.04 as the base image
FROM ubuntu:20.04

# Neuron SDK components versions
ARG NEURONX_FRAMEWORK_VERSION=2.11.0.0
ARG NEURONX_RUNTIME_LIB_VERSION=2.11.7.0
ARG NEURONX_TOOLS_VERSION=2.11.8.0
ARG NEURONX_CC_VERSION=2.11.8.0

# Set environment variables
ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1
ENV PYTHONIOENCODING=UTF-8
ENV LD_LIBRARY_PATH="/opt/aws/neuron/lib:/usr/local/lib"
ENV PATH="/opt/aws/neuron/bin:$PATH"

# Install system dependencies including libsqlite3-dev and libbz2-dev for Python
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    build-essential \
    ca-certificates \
    curl \
    wget \
    zlib1g-dev \
    gnupg2 \
    libssl-dev \
    libffi-dev \
    libsqlite3-dev \
    libbz2-dev \
    libopenblas-dev \
    libomp5 \
    && rm -rf /var/lib/apt/lists/*

# Add Neuron repository and install Neuron SDK components
RUN echo "deb https://apt.repos.neuron.amazonaws.com focal main" > /etc/apt/sources.list.d/neuron.list && \
wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-NEURON.PUB | apt-key add - && \
apt-get update && \
apt-get install -y \
aws-neuronx-tools=${NEURONX_TOOLS_VERSION} \
aws-neuronx-runtime-lib=${NEURONX_RUNTIME_LIB_VERSION} \
&& rm -rf /var/lib/apt/lists/*

# Install Python 3.10 with sqlite3 and bz2 support
RUN wget -q https://www.python.org/ftp/python/3.10.12/Python-3.10.12.tgz && \
    tar -xzf Python-3.10.12.tgz && \
    cd Python-3.10.12 && \
    ./configure --enable-shared --enable-optimizations --with-ensurepip=install && \
    make -j $(nproc) && make install && \
    cd .. && rm -rf Python-3.10.12*

# Upgrade pip and install required Python packages
RUN python3.10 -m pip install --upgrade pip

# Install Neuron-related Python packages from the Neuron repository
RUN python3.10 -m pip install --no-cache-dir \
    --extra-index-url https://pip.repos.neuron.amazonaws.com \
    torch-neuronx==${NEURONX_FRAMEWORK_VERSION} \
    torch-xla==1.13.* \
    torchvision

# Install additional Python packages
RUN python3.10 -m pip install --no-cache-dir \
    transformers==4.29 \
    numpy==1.23 \
    pynvml

# Set the working directory
WORKDIR /app

# Copy training and inference scripts
COPY train_bert_neuron.py /app/train_bert_neuron.py
COPY infer_bert_neuron.py /app/infer_bert_neuron.py
Comment on lines +71 to +72

ndbaker1 (Contributor) commented on Oct 25, 2024:
So this image supports inference and training for Neuron? Should we just put it under e2e2's images folder rather than hack?

The Python scripts you could leave in /hack and then just volume mount them into the container.

mattcjo (Contributor, Author) replied:

Correct. Yeah, honestly I struggled with where to put these, and someone recommended hack a couple weeks ago. The main use case right now is just to get the optimal batch size to support upcoming benchmarking efforts for our e2e tests.

I could see it evolving in the future to being automatically run when certain dependencies are updated, or as new instance types become available.

ndbaker1 (Contributor) commented on Oct 25, 2024:

So IIUC we can use the Neuron test for inference tuning, but you need an image for Neuron here that supports training as well? I'm trying to decouple the test image from the optimization suite/framework.

mattcjo (Contributor, Author) replied:

@ndbaker1 Use of a Dockerfile was just to make things more portable across instances as I did testing. Also, while it probably made no difference, there is slight overhead introduced by running in a container versus just a script. It brings in additional dependencies (e.g. the Neuron container runtime) as well, which makes the optimization's environment closer to the tests' runtime environment.

mattcjo (Contributor, Author) commented on Oct 25, 2024:

@ndbaker1 @cartermckinnon Not sure I have a perfect answer for where these scripts/Dockerfile should go, but here's the full context...

  • The training and inference tests that are part of e2e2 currently have suboptimal values for their batch parameter.

  • A standard batch value is hardcoded for all of them, leaving many of the instance's GPUs underutilized.

  • A major goal moving forward is to be able to benchmark these tests on all instances, and to gain an understanding of what full peak performance looks like for each instance type.

  • These new optimization scripts target a single GPU on an instance (even on multi-GPU instances) and determine the max batch size that a GPU of a certain type can handle.

  • The optimal batch value is then used to determine the total batch size per instance (batch_size * num_gpus), enabling us to run benchmarking for each instance at full GPU utilization (like our customers would); see the sketch after this list.

  • Both a training and an inference script are needed because, depending on the "mode" of a model, more or less memory is utilized.

  • Memory utilization differs significantly by mode because training requires large amounts of temporary values (gradients and optimizer state) to be held in memory as the weights/parameters get updated, while inference does not (parameter values are static).

  • The scripts were containerized to more closely mirror the tests' runtime environment of running on Kubernetes.

  • A single Dockerfile was used for simplicity.
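
To make the intended flow concrete, here is a minimal sketch (illustrative only, not part of this PR; run_one_pass and num_devices are hypothetical placeholders) of the search both scripts perform and of how its result would translate into a per-instance batch size:

# Minimal sketch of the descending batch-size search used by both scripts.
# run_one_pass is a hypothetical callable that runs one full training or
# inference pass at the given batch size on a single device.
def find_max_batch_size(run_one_pass, candidates=(128, 64, 32, 16, 8)):
    for batch_size in candidates:  # largest first
        try:
            run_one_pass(batch_size)
            return batch_size  # the first size that fits is the maximum
        except RuntimeError as e:
            if "out of memory" in str(e).lower():
                continue  # too big for this device; try the next size down
            raise
    return None

# The per-instance benchmarking batch then follows directly (num_devices
# is the hypothetical accelerator count of the instance type):
# total_batch_size = find_max_batch_size(...) * num_devices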

Member commented:

Makes sense. Can we include this script in our existing test images so we don't need a separate pipeline for it? It will be easier to set up a periodic job for this as well if it's all the same spec with a different command.

mattcjo (Contributor, Author) replied:

I like this; dependencies should be kept consistent anyway. I can't do this for Neuron yet, though: I'm just now noticing that the PR for Neuron BERT training/inference was closed and never merged. Will need to get that merged in first.


82 changes: 82 additions & 0 deletions hack/optimize/neuron/infer_bert_neuron.py
@@ -0,0 +1,82 @@
import os

# Unset XLA_FLAGS to avoid GPU-specific issues on Neuron
os.environ.pop('XLA_FLAGS', None)

import time
import torch
import torch_neuronx
from transformers import BertTokenizer, BertForPreTraining
from torch.utils.data import DataLoader, TensorDataset

def create_dummy_data(tokenizer, num_samples=1000, max_length=128):
    sentences = [
        f"This is a dummy sentence number {i}" for i in range(num_samples)
    ]
    tokenized_inputs = tokenizer(
        sentences,
        max_length=max_length,
        padding="max_length",
        truncation=True,
        return_tensors="pt",
    )
    labels = tokenized_inputs.input_ids.detach().clone()
    next_sentence_labels = torch.randint(0, 2, (num_samples,))
    return TensorDataset(
        tokenized_inputs.input_ids,
        tokenized_inputs.attention_mask,
        labels,
        next_sentence_labels,
    )

def infer_bert_neuron(model, tokenizer, batch_sizes, device):
    dataset = create_dummy_data(tokenizer)
    results = []

    for batch_size in batch_sizes:
        try:
            dataloader = DataLoader(dataset, batch_size=batch_size)
            start_time = time.time()
            for batch in dataloader:
                inputs, masks, labels, next_sentence_labels = batch
                inputs, masks = inputs.to(device), masks.to(device)
                outputs = model(input_ids=inputs, attention_mask=masks)
            end_time = time.time()
            inference_time = end_time - start_time
            throughput = len(dataset) / inference_time

            print(f"Batch Size: {batch_size}")
            print(f"Inference time: {inference_time:.2f} seconds")
            print(f"Throughput: {throughput:.2f} samples/second")

            results.append({
                'batch_size': batch_size,
                'throughput': throughput,
            })
            break  # Batch sizes are tried largest-first, so the first success is the max

        except RuntimeError as e:
            if 'out of memory' in str(e).lower():
                print(f"Batch Size {batch_size}: Out of Memory. Trying smaller batch size.")
                # Note: empty_cache() targets the CUDA caching allocator and is
                # effectively a no-op on Neuron/XLA devices.
                torch.cuda.empty_cache()
                continue
            else:
                raise e

    print("Optimal Batch Size Found:")
    for res in results:
        print(f"Batch Size: {res['batch_size']}, Throughput: {res['throughput']:.2f} samples/sec")

def main():
    device = torch.device("xla")
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForPreTraining.from_pretrained("bert-base-uncased")

    # torch_neuronx.trace compiles the model for the shape of example_inputs;
    # inputs with a different batch size may trigger recompilation or fail.
    example_inputs = torch.randint(0, 2000, (1, 128)).to(device)
    model_neuron = torch_neuronx.trace(model, example_inputs)

    batch_sizes = [128, 64, 32, 16, 8]
    infer_bert_neuron(model_neuron, tokenizer, batch_sizes, device)

if __name__ == "__main__":
    main()

103 changes: 103 additions & 0 deletions hack/optimize/neuron/train_bert_neuron.py
@@ -0,0 +1,103 @@
import os

# Unset XLA_FLAGS to avoid GPU-specific issues on Neuron
os.environ.pop('XLA_FLAGS', None)

import time
import torch
import torch_xla
import torch_xla.core.xla_model as xm
from transformers import BertForPreTraining, BertTokenizer
from torch.utils.data import DataLoader, TensorDataset

def create_dummy_data(tokenizer, num_samples=1000, max_length=128):
    sentences = [
        f"This is a dummy sentence number {i}" for i in range(num_samples)
    ]
    tokenized_inputs = tokenizer(
        sentences,
        max_length=max_length,
        padding="max_length",
        truncation=True,
        return_tensors="pt",
    )
    labels = tokenized_inputs.input_ids.detach().clone()
    next_sentence_labels = torch.randint(0, 2, (num_samples,))
    return TensorDataset(
        tokenized_inputs.input_ids,
        tokenized_inputs.attention_mask,
        labels,
        next_sentence_labels,
    )

def train_bert_neuron(model, tokenizer, batch_sizes, device):
    model.train()
    model.to(device)

    dataset = create_dummy_data(tokenizer)
    results = []

    for batch_size in batch_sizes:
        try:
            train_dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
            optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)

            # Measure training time for throughput calculation
            start_time = time.time()
            for batch in train_dataloader:
                optimizer.zero_grad()
                inputs, masks, labels, next_sentence_labels = batch
                inputs, masks, labels, next_sentence_labels = (
                    inputs.to(device),
                    masks.to(device),
                    labels.to(device),
                    next_sentence_labels.to(device),
                )
                outputs = model(
                    input_ids=inputs,
                    attention_mask=masks,
                    labels=labels,
                    next_sentence_label=next_sentence_labels,
                )
                loss = outputs.loss
                loss.backward()
                optimizer.step()
            end_time = time.time()
            training_time = end_time - start_time
            throughput = len(dataset) / training_time

            print(f"Batch Size: {batch_size}")
            print(f"Training time: {training_time:.2f} seconds")
            print(f"Throughput: {throughput:.2f} samples/second")

            results.append({
                'batch_size': batch_size,
                'throughput': throughput,
            })
            break  # Batch sizes are tried largest-first, so the first success is the max

        except RuntimeError as e:
            if 'out of memory' in str(e).lower():
                print(f"Batch Size {batch_size}: Out of Memory. Trying smaller batch size.")
                torch.cuda.empty_cache()
                continue
            else:
                raise e

    print("Optimal Batch Size Found:")
    for res in results:
        print(f"Batch Size: {res['batch_size']}, Throughput: {res['throughput']:.2f} samples/sec")

def main():
    device = xm.xla_device()

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForPreTraining.from_pretrained("bert-base-uncased")

    batch_sizes = [128, 64, 32, 16, 8]

    train_bert_neuron(model, tokenizer, batch_sizes, device)

if __name__ == "__main__":
    main()
