Example Request #4800

juna962 · 2024-12-17T08:42:26Z

Use Case Description:
This example demonstrates how to use TensorBoard with Amazon SageMaker JumpStart to visualize training metrics, such as loss curves, while training a LLaMA3 model (the 8B variant, for testing purposes). TensorBoard is integrated into the training script so that the loss curves can be exported and monitored during training.

Steps to Use TensorBoard with SageMaker JumpStart
Set up your SageMaker environment:
Launch a SageMaker notebook instance with the necessary permissions to interact with SageMaker JumpStart and S3.

Install TensorBoard:
On your SageMaker notebook, install TensorBoard if not already installed:

    pip install tensorboard

Select a Model from SageMaker JumpStart:
Use SageMaker JumpStart to load a pre-trained LLaMA3 model (8B) or start training from scratch.
Example:

    from sagemaker import model_uris
    from sagemaker.jumpstart.notebook_utils import list_jumpstart_models

    # List available JumpStart models; IDs may be spelled "llama-3", so match on "llama"
    models = list_jumpstart_models()
    print([model for model in models if "llama" in model.lower()])

    # Retrieve the model training URI (the model ID is an assumption; use one printed above)
    training_uri = model_uris.retrieve(
        model_id="meta-textgeneration-llama-3-8b", model_version="*",
        model_scope="training", region="us-west-2",
    )

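
If the catalog spells the model family with separators (for example "llama-3" rather than "llama3"), a literal substring match returns nothing. Here is a minimal sketch of a more tolerant filter. `find_model_ids` is a hypothetical helper, not part of the SageMaker SDK, and the sample IDs are made-up placeholders rather than real catalog entries.

```python
import re


def find_model_ids(model_ids, keyword):
    """Return the IDs that contain keyword, ignoring case and separators."""

    def normalize(text):
        # Drop '-', '_', '.' and whitespace so "llama-3" and "LLaMA3" compare equal
        return re.sub(r"[-_.\s]", "", text.lower())

    needle = normalize(keyword)
    return [model_id for model_id in model_ids if needle in normalize(model_id)]


# Placeholder IDs for illustration only
sample_ids = [
    "example-textgeneration-llama-3-8b",
    "example-llama3-70b",
    "example-bert-base",
]
matches = find_model_ids(sample_ids, "LLaMA3")
print(matches)
```

In a notebook you would pass the result of `list_jumpstart_models()` in place of `sample_ids`.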
Prepare the Dataset:
Use your own dataset stored in S3. Ensure the dataset is formatted correctly for the LLaMA3 model. For example:

    dataset_s3_uri = "s3://your-bucket/your-dataset/"

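
A typo in the dataset URI usually only surfaces once the training job has started, so it can help to check the string up front. This is a small sketch using only the standard library. `split_s3_uri` is a hypothetical helper name, and it only checks the shape of the URI, not whether the bucket or prefix exists.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3://bucket/prefix URI into (bucket, prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not a valid S3 URI: {uri!r}")
    # The path keeps a leading slash that is not part of the object prefix
    return parsed.netloc, parsed.path.lstrip("/")


bucket, prefix = split_s3_uri("s3://your-bucket/your-dataset/")
print(bucket, prefix)
```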
Modify the Training Script:
Adapt the SageMaker training script to log metrics compatible with TensorBoard. For example, add TensorBoard logging using torch.utils.tensorboard.SummaryWriter:

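    from torch.utils.tensorboard import SummaryWriter

    # /opt/ml/output/tensorboard is the default container path SageMaker uploads from
    # when the estimator is given a TensorBoardOutputConfig
    writer = SummaryWriter(log_dir="/opt/ml/output/tensorboard")

    for epoch in range(num_epochs):
        for batch_idx, batch in enumerate(train_dataloader):
            # training_step stands in for your own forward pass; it must return a scalar loss tensor
            loss = model.training_step(batch)
            global_step = epoch * len(train_dataloader) + batch_idx
            writer.add_scalar("Loss/train", loss.item(), global_step)

    writer.close()  # Flush pending events before the job exits
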
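
The third argument to `add_scalar` is the x-axis position of the point, so it needs to keep increasing across epochs. A quick worked example of the `epoch * len(train_dataloader) + batch_idx` arithmetic, assuming a hypothetical run with 4 batches per epoch:

```python
def global_step(epoch, batch_idx, batches_per_epoch):
    """Map (epoch, batch index) to one increasing step counter."""
    return epoch * batches_per_epoch + batch_idx


batches_per_epoch = 4  # stands in for len(train_dataloader)
steps = [
    global_step(epoch, batch_idx, batches_per_epoch)
    for epoch in range(2)
    for batch_idx in range(batches_per_epoch)
]
print(steps)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Without the `epoch * batches_per_epoch` term, every epoch would write its points at steps 0 to 3 again and the curves would overlap.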
Launch Training in SageMaker:
Start the training job on SageMaker with TensorBoard configured to log outputs to S3.
Example:

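    from sagemaker.debugger import TensorBoardOutputConfig
    from sagemaker.pytorch import PyTorch

    # Upload whatever the script writes to /opt/ml/output/tensorboard to S3
    tensorboard_output_config = TensorBoardOutputConfig(
        s3_output_path="s3://your-bucket/tensorboard-logs/",
    )

    # Define the estimator
    pytorch_estimator = PyTorch(
        entry_point="train.py",  # Your training script
        source_dir="src",  # Directory containing the training script
        role="SageMakerRole",  # Name or ARN of your SageMaker execution role
        instance_count=1,
        instance_type="ml.p3.16xlarge",  # Adjust based on LLaMA3 size
        framework_version="1.12.1",
        py_version="py38",
        hyperparameters={
            "epochs": 5,
            "batch_size": 16,
        },
        output_path="s3://your-bucket/output/",  # Model artifacts
        tensorboard_output_config=tensorboard_output_config,
    )

    # Start training
    pytorch_estimator.fit({"train": dataset_s3_uri})
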
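
On the script side, SageMaker passes the `hyperparameters` dict to the entry point as command-line arguments and exposes the "train" channel through the `SM_CHANNEL_TRAIN` environment variable. Below is a rough sketch of how `train.py` could read them. The `--train_dir` and `--tensorboard_dir` argument names and the default values are assumptions made for illustration.

```python
import argparse
import os


def parse_args(argv=None):
    """Read hyperparameters and data locations for the training script."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--batch_size", type=int, default=8)
    # SageMaker sets SM_CHANNEL_TRAIN to the local copy of the "train" channel
    parser.add_argument(
        "--train_dir", default=os.environ.get("SM_CHANNEL_TRAIN", "./data")
    )
    parser.add_argument("--tensorboard_dir", default="/opt/ml/output/tensorboard")
    # parse_known_args ignores any extra arguments the script does not declare
    args, _unknown = parser.parse_known_args(argv)
    return args


# Simulates the arguments produced by {"epochs": 5, "batch_size": 16}
args = parse_args(["--epochs", "5", "--batch_size", "16"])
print(args.epochs, args.batch_size, args.tensorboard_dir)
```

Inside the container you would call `parse_args()` with no arguments so it reads `sys.argv`.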
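Access TensorBoard Logs:
After training, download the TensorBoard logs from S3 to your local machine, or open them directly in SageMaker Studio. Then start TensorBoard and point it at the logs directory. Reading an s3:// path directly requires AWS credentials and S3 support in your TensorBoard installation; otherwise pass the local download directory to --logdir instead:

    tensorboard --logdir=s3://your-bucket/tensorboard-logs/
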
Monitor Loss Curves:
Open the TensorBoard web UI (e.g., http://localhost:6006/), and you should see the loss curves and other metrics.
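
Per-batch loss on an LLM is noisy, so the raw curve can be hard to read. The sketch below applies an exponential moving average with bias correction, which is similar in spirit to the smoothing slider in the TensorBoard UI. It is an illustration under that assumption, not TensorBoard's own code, and the loss values are made up.

```python
def smooth(values, weight=0.6):
    """Exponential moving average with bias correction; requires 0 <= weight < 1."""
    smoothed = []
    running = 0.0
    for count, value in enumerate(values, start=1):
        running = running * weight + (1 - weight) * value
        # Divide out the bias from starting the average at zero
        smoothed.append(running / (1 - weight ** count))
    return smoothed


raw_losses = [2.0, 1.0, 1.8, 0.9, 1.5]  # made-up values for illustration
smoothed_losses = smooth(raw_losses)
print([round(x, 3) for x in smoothed_losses])
```

A higher `weight` gives a flatter curve that reacts more slowly to real changes in the loss.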

Involved Services:

SageMaker JumpStart: Model training and deployment.
TensorBoard: Visualization of training metrics.
S3: Storage of datasets and TensorBoard logs.
Dataset:
Use your custom dataset uploaded to an S3 bucket (e.g., s3://your-bucket/your-dataset/).

This approach ensures you can monitor loss curves and other training metrics effectively while using SageMaker JumpStart and TensorBoard. Let me know if you'd like more details on specific steps!
