LambdaLabsML/llm.c-1cc

Run llm.c on Lambda 1-Click Clusters

Welcome to the Lambda Labs 1-Click Clusters setup guide for training Andrej Karpathy's llm.c! 🚀 Building on Andrej's reproduction guide, we have made small adjustments that simplify setup on a 1-Click Cluster's existing hardware (including InfiniBand NICs and shared storage) and software stack.

Step 1: Set Up the Cluster for llm.c

Run the following commands from your local machine:

# Specify your 1cc cluster key and 1cc storage accordingly
export CONFIG_PATH=<path-to-your-1cc-key>
export STORAGE_PATH=<path-to-your-1cc-storage>

# Distribute an SSH key across the cluster
# so that the head node can SSH into all workers without a password
bash ssh_init.sh

# Set up llm.c
bash setup_llm.c.sh
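
Both scripts are run after exporting CONFIG_PATH and STORAGE_PATH, so it can help to confirm the variables are actually set before starting. The snippet below is a minimal sketch of such a pre-flight check. The helper name `require_vars` is hypothetical and not part of this repository, and the assumption that the scripts read these two variables from the environment is inferred from the commands above.

```shell
# Hypothetical pre-flight helper; not part of the repository's scripts.
# Returns 0 only if every named variable is set to a non-empty value.
require_vars() {
    status=0
    for name in "$@"; do
        # Look up the variable whose name is stored in $name (POSIX-compatible).
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "error: $name is not set" >&2
            status=1
        fi
    done
    return "$status"
}

# Usage: run the setup scripts only when both variables are present.
# require_vars CONFIG_PATH STORAGE_PATH && bash ssh_init.sh && bash setup_llm.c.sh
```

The check only verifies that the variables are non-empty. It does not verify that the key or the storage path exists.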

Step 2: Train llm.c

SSH into the head node and run the following commands:

# Don't forget to set up STORAGE_PATH on your head node
export STORAGE_PATH=<path-to-your-1cc-storage>

export LLMC_PATH=$STORAGE_PATH"/llm.c"
export binary_path=$LLMC_PATH"/train_gpt2cu"
export out_dir=$LLMC_PATH"/log_gpt2_124M_multi"
export train_data_path=$LLMC_PATH"/dev/data/fineweb10B/fineweb_train_*.bin"
export val_data_path=$LLMC_PATH"/dev/data/fineweb10B/fineweb_val_*.bin"

export hostfile_path=/home/ubuntu/hostfile_1cc_worker_mpirun
export OMPI_MCA_btl_tcp_if_include=eno1
export UCX_TLS=self,shm,tcp
export NCCL_P2P_LEVEL=NVL
export NCCL_NET_GDR_LEVEL=PIX
export NCCL_IB_HCA='=mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8'
export NCCL_IB_PCI_RELAXED_ORDERING=1
export NCCL_SOCKET_IFNAME=eno1
export NCCL_DEBUG=INFO

mpirun --hostfile $hostfile_path \
    $binary_path \
    -i "$train_data_path" \
    -j "$val_data_path" \
    -o $out_dir \
    -v 250 -s 20000 -g 144 \
    -h 1 \
    -b 64 -t 1024 \
    -d 2097152 \
    -r 0 \
    -z 1 \
    -c 0.1 \
    -l 0.0006 \
    -q 0.1 \
    -u 700 \
    -n 1000 \
    -y 0 \
    -e d12 \
    -pi "mpi"
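
The batch-size flags in the command above are related: `-b` is the per-GPU micro-batch size in sequences, `-t` is the sequence length, and `-d` is the total number of tokens per optimizer step. The sketch below works through that arithmetic, assuming llm.c derives the number of gradient accumulation steps as `d / (b * t * num_gpus)` and requires the division to be exact. The helper name `grad_accum_steps` is hypothetical, and the GPU count of 16 is an assumption taken from the 16xGPU example cluster.

```shell
# Hypothetical helper: prints the gradient accumulation steps implied by
# the batch flags, or fails if -d is not a multiple of b * t * num_gpus.
# Arguments: micro_batch (-b), seq_len (-t), total_batch (-d), num_gpus
grad_accum_steps() {
    tokens_per_pass=$(( $1 * $2 * $4 ))
    if [ $(( $3 % tokens_per_pass )) -ne 0 ]; then
        echo "error: -d ($3) is not a multiple of $tokens_per_pass" >&2
        return 1
    fi
    echo $(( $3 / tokens_per_pass ))
}

# Values from the mpirun command above; 16 GPUs is an assumption.
grad_accum_steps 64 1024 2097152 16
```

With these values each forward/backward pass covers 64 × 1024 × 16 = 1,048,576 tokens, so the call prints 2. On a cluster with a different GPU count, `-b` or `-d` may need to change so that the division stays exact.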

Example output on a 16xGPU 1-Click Cluster

[Image: llm.c on 1cc]
