Add CI for rocm #346

Merged on Sep 15, 2024 (28 commits)
Commits
ce60d5d WIP (Binyang2014, Sep 1, 2024)
7d7ca57 WIP (Binyang2014, Sep 1, 2024)
2dbc610 Update integration-test-rocm.yml for Azure Pipelines (Binyang2014, Sep 1, 2024)
46be2be Update integration-test-rocm.yml for Azure Pipelines (Binyang2014, Sep 1, 2024)
9f566a0 Update integration-test-rocm.yml for Azure Pipelines (Binyang2014, Sep 1, 2024)
6630a81 fix (Binyang2014, Sep 1, 2024)
c18fb95 Update integration-test-rocm.yml for Azure Pipelines (Binyang2014, Sep 1, 2024)
e0a79be fix (Binyang2014, Sep 1, 2024)
d7ea041 update (Binyang2014, Sep 2, 2024)
7996c6d Update integration-test-rocm.yml for Azure Pipelines (Binyang2014, Sep 4, 2024)
3df88d8 Update integration-test-rocm.yml for Azure Pipelines (Binyang2014, Sep 4, 2024)
ff41cb2 Update integration-test-rocm.yml for Azure Pipelines (Binyang2014, Sep 4, 2024)
fdd5922 Update integration-test-rocm.yml for Azure Pipelines (Binyang2014, Sep 4, 2024)
6a8c927 Update integration-test-rocm.yml for Azure Pipelines (Binyang2014, Sep 4, 2024)
4952d95 Update integration-test-rocm.yml for Azure Pipelines (Binyang2014, Sep 4, 2024)
6ccd9fe Update integration-test-rocm.yml for Azure Pipelines (Binyang2014, Sep 4, 2024)
4a9cb69 Update integration-test-rocm.yml for Azure Pipelines (Binyang2014, Sep 4, 2024)
b0d425e Update integration-test-rocm.yml for Azure Pipelines (Binyang2014, Sep 4, 2024)
eb2e352 Update integration-test-rocm.yml for Azure Pipelines (Binyang2014, Sep 4, 2024)
ca1b334 Update integration-test-rocm.yml for Azure Pipelines (Binyang2014, Sep 4, 2024)
7cb77da update (Binyang2014, Sep 6, 2024)
b863ed7 Update integration-test-rocm.yml for Azure Pipelines (Binyang2014, Sep 6, 2024)
97e2c93 update (Binyang2014, Sep 6, 2024)
ab58e6c fix (Binyang2014, Sep 6, 2024)
72218f9 WIP (Binyang2014, Sep 6, 2024)
fd02273 WIP (Binyang2014, Sep 6, 2024)
0a98f02 Merge branch 'main' into binyli/rocm-ci (Binyang2014, Sep 6, 2024)
90f0354 Merge branch 'main' into binyli/rocm-ci (chhwang, Sep 15, 2024)
97 changes: 97 additions & 0 deletions .azure-pipelines/integration-test-rocm.yml
@@ -0,0 +1,97 @@
trigger:
- main

pr:
  branches:
    include:
    - main
  drafts: false

jobs:
- job: IntegrationTestRocm
  displayName: Integration test ROCm
  strategy:
    matrix:
      rocm6.2:
        containerImage: ghcr.io/microsoft/mscclpp/mscclpp:base-dev-rocm6.2

  pool:
    name: mscclpp-rocm
  container:
    image: $[ variables['containerImage'] ]
    options: --privileged --ipc=host --security-opt seccomp=unconfined --group-add video --ulimit memlock=-1:-1

  steps:
  - task: Bash@3
    name: Build
    displayName: Build
    inputs:
      targetType: 'inline'
      script: |
        mkdir build && cd build
        CXX=/opt/rocm/bin/hipcc cmake -DCMAKE_BUILD_TYPE=Release -DBYPASS_GPU_CHECK=ON -DUSE_ROCM=ON ..
        make -j
      workingDirectory: '$(System.DefaultWorkingDirectory)'

  - task: Bash@3
    name: InstallRcclTest
    displayName: Install rccl-test
    inputs:
      targetType: 'inline'
      script: |
        git clone https://github.com/ROCm/rccl-tests.git
        cd rccl-tests
        make MPI=1 MPI_HOME=/usr/local/mpi HIP_HOME=/opt/rocm -j
      workingDirectory: '$(System.DefaultWorkingDirectory)'

  - task: Bash@3
    name: InstallDep
    displayName: Install dependencies
    inputs:
      targetType: 'inline'
      script: |
        set -e
        git clone https://github.com/Azure/msccl-tools.git
        cd msccl-tools
        pip3 install .

  - task: Bash@3
    name: GenerateExectionFiles
    displayName: Generate execution files
    inputs:
      targetType: 'inline'
      script: |
        set -e
        git clone https://$(GIT_USER):$(GIT_PAT)@msazure.visualstudio.com/DefaultCollection/One/_git/azure-mscclpp
        cd azure-mscclpp
        git checkout binyli/ci
        mkdir execution-files
        python3 algos/allreduce_mi300_packet.py 8 8 > execution-files/allreduce_mi300_packet.json
        python3 algos/allreduce_mi300_sm_mscclpp.py 8 8 > execution-files/allreduce_mi300_sm_mscclpp.json

  - task: Bash@3
    name: AllReduceTest
    displayName: Run mscclpp allReduce test
    inputs:
      targetType: 'inline'
      script: |
        set -e
        export PATH=/usr/local/mpi/bin:$PATH
        sudo /usr/local/mpi/bin/mpirun --allow-run-as-root -np 8 --bind-to numa -x MSCCLPP_DEBUG=WARN -x LD_PRELOAD="$(pwd)/build/apps/nccl/libmscclpp_nccl.so" \
          -x ALLREDUCE_SMALL_MSG_BOUNDARY=32K -x ALLREDUCE_LARGE_MSG_BOUNDARY=1M ./rccl-tests/build/all_reduce_perf -b 1K -e 1G -f 2 -d half -G 20 -w 10 -n 100
      workingDirectory: '$(System.DefaultWorkingDirectory)'

  - task: Bash@3
    name: AllReduceWithExecutionFileTest
    displayName: Run mscclpp allReduce with execution file
    inputs:
      targetType: 'inline'
      script: |
        set -e
        export PATH=/usr/local/mpi/bin:$PATH
        sudo /usr/local/mpi/bin/mpirun -np 8 --bind-to numa --allow-run-as-root -x LD_PRELOAD=$(pwd)/build/apps/nccl/libmscclpp_nccl.so -x NCCL_DEBUG=WARN \
          -x ALLREDUCEPKT_IP_JSON_FILE=./azure-mscclpp/execution-files/allreduce_mi300_packet.json \
          -x ALLREDUCE_IP_JSON_FILE=./azure-mscclpp/execution-files/allreduce_mi300_sm_mscclpp.json \
          -x ALLREDUCE_SMALL_MSG_BOUNDARY=32K -x ALLREDUCE_LARGE_MSG_BOUNDARY=1M ./rccl-tests/build/all_reduce_perf \
          -b 1K -e 1G -f 2 -d half -G 20 -w 10 -n 100
      workingDirectory: '$(System.DefaultWorkingDirectory)'
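For readers unfamiliar with the `all_reduce_perf` arguments used in both test steps: assuming rccl-tests follows the nccl-tests convention (`-b` minimum bytes, `-e` maximum bytes, `-f` multiplicative step factor), `-b 1K -e 1G -f 2` sweeps message sizes by doubling from 1 KiB up to 1 GiB. The sketch below lists those sizes. `sweep_sizes` is a hypothetical helper written for illustration only; it is not part of rccl-tests or this PR.

```shell
#!/usr/bin/env bash
# Sketch: enumerate the message sizes implied by `-b 1K -e 1G -f 2`.
# sweep_sizes is a hypothetical helper, not an rccl-tests function.
sweep_sizes() {
    local size=$1 max=$2 factor=$3
    # Multiply by the step factor until the maximum size is exceeded.
    while (( size <= max )); do
        echo "$size"
        size=$(( size * factor ))
    done
}

# 1K = 2^10 bytes, 1G = 2^30 bytes
sweep_sizes $((1 << 10)) $((1 << 30)) 2
```

The sweep crosses both `ALLREDUCE_SMALL_MSG_BOUNDARY=32K` and `ALLREDUCE_LARGE_MSG_BOUNDARY=1M`, so a single run presumably exercises the code paths on each side of those boundaries.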
19 changes: 19 additions & 0 deletions docker/base-x-rocm.dockerfile
@@ -0,0 +1,19 @@
ARG BASE_IMAGE
FROM ${BASE_IMAGE}

LABEL maintainer="MSCCL++"
LABEL org.opencontainers.image.source https://github.com/microsoft/mscclpp

ENV DEBIAN_FRONTEND=noninteractive

ENV RCCL_VERSION=rocm-6.2.0
ARG ARCH=gfx942
ENV ARCH_TARGET=${ARCH}
RUN cd /tmp && \
    git clone --branch ${RCCL_VERSION} --depth 1 https://github.com/ROCm/rccl.git && \
    cd rccl && \
    ./install.sh --prefix=/opt/rocm --amdgpu_targets ${ARCH_TARGET} && \
    cd .. && \
    rm -rf /tmp/rccl

WORKDIR /
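The Dockerfile above takes the GPU architecture through the `ARCH` build argument (default `gfx942`). A minimal sketch of how the corresponding `docker build` command line could be composed for a different architecture follows. `rocm_build_cmd` is a hypothetical helper that only prints the command, and `gfx90a` is used purely as an example value; the `-common` base image is assumed to have been built already.

```shell
#!/usr/bin/env bash
# Sketch: print (do not run) a docker build command for base-x-rocm.dockerfile.
# rocm_build_cmd is a hypothetical helper; the image names mirror docker/build.sh.
GHCR="ghcr.io/microsoft/mscclpp/mscclpp"

rocm_build_cmd() {
    local target=$1
    local arch=${2:-gfx942}   # same default as the Dockerfile's ARG ARCH
    echo "docker build -t ${GHCR}:base-${target}" \
         "-f docker/base-x-rocm.dockerfile" \
         "--build-arg BASE_IMAGE=${GHCR}-common:base-${target}" \
         "--build-arg ARCH=${arch} ."
}

rocm_build_cmd rocm6.2          # default architecture
rocm_build_cmd rocm6.2 gfx90a   # example override
```

Printing the command rather than executing it keeps the sketch safe to run on a machine without Docker.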
1 change: 1 addition & 0 deletions docker/base-x.dockerfile
@@ -5,6 +5,7 @@ LABEL maintainer="MSCCL++"
 LABEL org.opencontainers.image.source https://github.com/microsoft/mscclpp

 ENV DEBIAN_FRONTEND=noninteractive
+USER root

 RUN rm -rf /opt/nvidia

19 changes: 17 additions & 2 deletions docker/build.sh
@@ -8,6 +8,7 @@ baseImageTable=(
     ["cuda12.1"]="nvidia/cuda:12.1.1-devel-ubuntu20.04"
     ["cuda12.2"]="nvidia/cuda:12.2.2-devel-ubuntu20.04"
     ["cuda12.3"]="nvidia/cuda:12.3.2-devel-ubuntu20.04"
+    ["rocm6.2"]="rocm/rocm-terminal:6.2"
 )

 declare -A extraLdPathTable
@@ -16,13 +17,14 @@ extraLdPathTable=(
     ["cuda12.1"]="/usr/local/cuda-12.1/compat:/usr/local/cuda-12.1/lib64"
     ["cuda12.2"]="/usr/local/cuda-12.2/compat:/usr/local/cuda-12.2/lib64"
     ["cuda12.3"]="/usr/local/cuda-12.3/compat:/usr/local/cuda-12.3/lib64"
+    ["rocm6.2"]="/opt/rocm/lib"
 )

 GHCR="ghcr.io/microsoft/mscclpp/mscclpp"
 TARGET=${1}

 print_usage() {
-    echo "Usage: $0 [cuda11.8|cuda12.1|cuda12.2|cuda12.3]"
+    echo "Usage: $0 [cuda11.8|cuda12.1|cuda12.2|cuda12.3|rocm6.2]"
 }

 if [[ ! -v "baseImageTable[${TARGET}]" ]]; then
@@ -36,12 +38,25 @@ SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"

 cd ${SCRIPT_DIR}/..

-docker build -t ${GHCR}:base-${TARGET} \
+docker build -t ${GHCR}-common:base-${TARGET} \
     -f docker/base-x.dockerfile \
     --build-arg BASE_IMAGE=${baseImageTable[${TARGET}]} \
     --build-arg EXTRA_LD_PATH=${extraLdPathTable[${TARGET}]} \
     --build-arg TARGET=${TARGET} .

+if [[ ${TARGET} == rocm* ]]; then
+    echo "Building ROCm base image..."
+    docker build -t ${GHCR}:base-${TARGET} \
+        -f docker/base-x-rocm.dockerfile \
+        --build-arg BASE_IMAGE=${GHCR}-common:base-${TARGET} \
+        --build-arg EXTRA_LD_PATH=${extraLdPathTable[${TARGET}]} \
+        --build-arg TARGET=${TARGET} \
+        --build-arg ARCH="gfx942" .
+else
+    echo "Building CUDA base image..."
+    docker tag ${GHCR}-common:base-${TARGET} ${GHCR}:base-${TARGET}
+fi
+
 docker build -t ${GHCR}:base-dev-${TARGET} \
     -f docker/base-dev-x.dockerfile \
     --build-arg BASE_IMAGE=${GHCR}:base-${TARGET} \
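The build.sh change introduces an intermediate `-common` image: every target first builds `mscclpp-common:base-<target>`, ROCm targets then layer `base-x-rocm.dockerfile` on top of it, and CUDA targets simply retag it. The dry-run sketch below traces that tag chain as read from the diff. `plan_images` is a hypothetical helper that prints the plan and runs no docker commands.

```shell
#!/usr/bin/env bash
# Sketch: trace the image tag chain implied by docker/build.sh after this change.
# plan_images is a hypothetical helper; it only prints, it does not invoke docker.
GHCR="ghcr.io/microsoft/mscclpp/mscclpp"

plan_images() {
    local target=$1
    # Stage 1: common base image, built for every target.
    echo "build ${GHCR}-common:base-${target} (docker/base-x.dockerfile)"
    # Stage 2: ROCm adds an RCCL layer; CUDA reuses the common image under the final tag.
    if [[ ${target} == rocm* ]]; then
        echo "build ${GHCR}:base-${target} (docker/base-x-rocm.dockerfile)"
    else
        echo "tag ${GHCR}-common:base-${target} as ${GHCR}:base-${target}"
    fi
    # Stage 3: dev image on top of the final base tag.
    echo "build ${GHCR}:base-dev-${target} (docker/base-dev-x.dockerfile)"
}

plan_images rocm6.2
plan_images cuda12.3
```

Either way the final `base-<target>` tag exists before the `base-dev-<target>` build starts, which is why the last `docker build` in the script needs no ROCm-specific branch.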
Empty file added python/requirements_rocm6.txt