Using distconv and model-parallel for GAN #2447
Unanswered
jvwilliams23 asked this question in Q&A
I am trying to run a multi-GPU GAN similar to DistConvGAN.py in the ExaGAN example with 2 GPUs (using the Open MPI launcher on a workstation with 2 V100s). From the GPU memory usage statistics, it seems that one GPU holds the model weights plus all of the activations (10 GB, the same as with 1 GPU), whereas the other GPU holds only the model weights (1.7 GB). Am I missing something?

Below I have put the log file for each run, the batch script used to launch it (maybe I am missing an MPI flag?), and a short Python code snippet showing how I try to invoke model parallelism. (Happy to also provide the prototext if it is useful.)
One GPU:
Two GPUs:
Batch file used to run:
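As an illustrative sketch only (not the batch file from this post), a minimal Open MPI launch of a generated LBANN experiment might look like the following; the executable invocation, the `--prototext` flag, and the file name are assumptions:

```bash
#!/bin/bash
# Hypothetical sketch, not the original batch file: launch the experiment
# that the Python front end generated, with one MPI rank per GPU so that
# distconv has a rank on each device. "experiment.prototext" is a placeholder.
mpirun -n 2 lbann --prototext=experiment.prototext
```

Distconv splits individual samples across ranks, so a one-rank-per-GPU launch is what lets the second GPU hold part of the activations.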
I am setting the parallel strategy in the following way:
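As a minimal sketch of the usual pattern in the LBANN Python front end (not the snippet from this post; the group sizes and layer arguments are assumptions for a 2-GPU run, and argument names can differ between LBANN versions):

```python
import lbann

# Hypothetical sketch: attach a distconv parallel strategy to every layer
# that should be split across GPUs. Splitting the spatial height dimension
# over 2 ranks is an assumption for a 2-GPU run.
ps = {'height_groups': 2}

x = lbann.Input(data_field='samples')

# Layers that carry the strategy run in distconv mode; layers without it
# fall back to ordinary data-parallel execution.
y = lbann.Convolution(
    x,
    num_dims=2,
    out_channels=64,
    kernel_size=3,
    stride=1,
    padding=1,
    has_bias=True,
    parallel_strategy=ps,
)
y = lbann.Relu(y, parallel_strategy=ps)
```

Note that distconv also has to be compiled into LBANN itself (see the build note in the reply below).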
Replies: 1 comment 7 replies

FYI, on this system we built the dependencies with spack and then compiled LBANN itself with CMake (there was an issue in building LBANN itself with spack).
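A rough sketch of that flow, assuming typical spack specs for a distconv-enabled build on V100s; the exact packages, variants, and CMake options used on that system are not known:

```bash
# Hypothetical build sketch; variants, versions, paths, and options are
# assumptions, not the exact commands used on that system.

# Dependencies via spack (DiHydrogen carries the distconv support):
spack install dihydrogen +cuda +al +distconv cuda_arch=70
spack install hydrogen +cuda +al cuda_arch=70
spack load dihydrogen hydrogen

# LBANN itself via CMake, with distconv enabled ("lbann" here is the
# checked-out source directory):
cmake -S lbann -B build \
      -DCMAKE_BUILD_TYPE=Release \
      -DLBANN_WITH_DISTCONV=ON
cmake --build build -j
```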