Using distconv and model-parallel for GAN #2447
Unanswered
jvwilliams23 asked this question in Q&A
I am trying to run a multi-GPU GAN similar to DistConvGAN.py in the ExaGAN example with 2 GPUs (using the Open MPI launcher on a workstation with 2 V100s). From the GPU memory usage statistics, it seems that one GPU holds the model weights plus all of the activations (10 GB, the same as with 1 GPU), whereas the other GPU holds only the model weights (1.7 GB). Am I missing something?

Below I have put the log file for each run, the batch script used to launch it (maybe I am missing an MPI flag?), and a short Python code snippet showing how I try to invoke model parallelism. (Happy to also provide the prototext if it is useful.)
One GPU:
Two GPUs:
Batch file used to run:
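As an illustrative sketch only (not the batch file from this post), a minimal Open MPI launch of a generated LBANN experiment might look like the following; the executable invocation, the `--prototext` flag, and the file name are assumptions:

```bash
#!/bin/bash
# Hypothetical sketch, not the original batch file: launch the experiment
# that the Python front end generated, with one MPI rank per GPU so that
# distconv has a rank on each device. "experiment.prototext" is a placeholder.
mpirun -n 2 lbann --prototext=experiment.prototext
```

Distconv splits individual samples across ranks, so a one-rank-per-GPU launch is what lets the second GPU hold part of the activations.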
I am setting the parallel strategy in the following way:
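As a minimal sketch of the usual pattern in the LBANN Python front end (not the snippet from this post; the group sizes and layer arguments are assumptions for a 2-GPU run, and argument names can differ between LBANN versions):

```python
import lbann

# Hypothetical sketch: attach a distconv parallel strategy to every layer
# that should be split across GPUs. Splitting the spatial height dimension
# over 2 ranks is an assumption for a 2-GPU run.
ps = {'height_groups': 2}

x = lbann.Input(data_field='samples')

# Layers that carry the strategy run in distconv mode; layers without it
# fall back to ordinary data-parallel execution.
y = lbann.Convolution(
    x,
    num_dims=2,
    out_channels=64,
    kernel_size=3,
    stride=1,
    padding=1,
    has_bias=True,
    parallel_strategy=ps,
)
y = lbann.Relu(y, parallel_strategy=ps)
```

Note that distconv also has to be compiled into LBANN itself (see the build note in the reply below).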
Replies: 1 comment 7 replies

FYI, on this system we built the dependencies with spack and then compiled LBANN itself with CMake (there was an issue in building LBANN itself with spack).
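A rough sketch of that flow, assuming typical spack specs for a distconv-enabled build on V100s; the exact packages, variants, and CMake options used on that system are not known:

```bash
# Hypothetical build sketch; variants, versions, paths, and options are
# assumptions, not the exact commands used on that system.

# Dependencies via spack (DiHydrogen carries the distconv support):
spack install dihydrogen +cuda +al +distconv cuda_arch=70
spack install hydrogen +cuda +al cuda_arch=70
spack load dihydrogen hydrogen

# LBANN itself via CMake, with distconv enabled ("lbann" here is the
# checked-out source directory):
cmake -S lbann -B build \
      -DCMAKE_BUILD_TYPE=Release \
      -DLBANN_WITH_DISTCONV=ON
cmake --build build -j
```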