70B Fine-tuning GPUs Utilization #2142

Open
fabiogeraci opened this issue Dec 10, 2024 · 4 comments
Labels: discussion (Start a discussion), distributed (Anything related to distributed env (multi-GPU, multi-node))

Comments

@fabiogeraci

OpenMPI launch script (CLI):
mpirun \
    -np $TOTAL_NUM_GPUS \
    -H \$MPI_HOST_STRING \
    -x PATH \
    -bind-to none \
    -map-by slot \
    --mca pml ob1 --mca btl ^openib \
    --display-allocation \
    --display-map \
    python3 src/full_finetune_distributed.py \
    --config config_files/8B_full_distributed.yaml \
    optimizer_in_bwd=False
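
(Note for anyone reproducing this with mpirun rather than torchrun: torch.distributed's env:// initialization expects RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT to be set, and recipes typically read LOCAL_RANK to pick a device; torchrun normally sets these for you. A minimal sketch of bridging OpenMPI's variables, assuming MASTER_ADDR/MASTER_PORT are exported separately, e.g. via mpirun -x:)

# Illustrative only, not part of torchtune: map OpenMPI's rank variables to the
# env:// variables torch.distributed expects when the recipe is launched via mpirun.
import os

os.environ.setdefault("RANK", os.environ["OMPI_COMM_WORLD_RANK"])
os.environ.setdefault("WORLD_SIZE", os.environ["OMPI_COMM_WORLD_SIZE"])
os.environ.setdefault("LOCAL_RANK", os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])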

full_finetune_distributed.py

import os

from torch.distributed.device_mesh import init_device_mesh

num_nodes = int(os.environ.get("NUM_NODES", "1"))
if num_nodes > 1:
    # 2-D device mesh: one dim across nodes, one dim across the GPUs within a node
    mesh_2d = init_device_mesh(
        "cuda",
        mesh_shape=(num_nodes, int(os.environ["WORLD_SIZE"]) // num_nodes),
        mesh_dim_names=("dp", "tp"),
    )
else:
    mesh_2d = None

training.shard_model(
    model=model,
    shard_conditions=fsdp_shard_conditions,
    cpu_offload=fsdp_cpu_offload,
    reshard_after_forward=reshard_after_forward,
    mesh=mesh_2d,
)

_distributed.py

def shard_model(
    model: TransformerDecoder,
    shard_conditions: List[Callable[[str, nn.Module], bool]],
    *,
    cpu_offload: bool,
    reshard_after_forward: bool = True,
    mesh: Optional[DeviceMesh] = None,  # <-- Add this line
) -> None:
    ...  # existing body unchanged, up to where fsdp_kwargs is built
    if mesh is not None:  # <-- Add this line
        fsdp_kwargs["mesh"] = mesh  # <-- Add this line

Originally posted by @fabiogeraci in #2018 (comment)
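
For completeness, a minimal sketch of what the patched shard_model might look like end to end, assuming (as in torchtune's _distributed.py) that the function forwards fsdp_kwargs to FSDP2's fully_shard; the loop below is illustrative rather than the exact torchtune implementation. Note that when fully_shard receives a 2-D mesh it replicates across the first dim and shards across the second (HSDP), so the "dp"/"tp" names above describe inter-node replication and intra-node sharding, not tensor parallelism.

from typing import Callable, List, Optional

import torch.nn as nn
from torch.distributed._composable.fsdp import CPUOffloadPolicy, fully_shard
from torch.distributed.device_mesh import DeviceMesh

def shard_model(
    model: nn.Module,
    shard_conditions: List[Callable[[str, nn.Module], bool]],
    *,
    cpu_offload: bool,
    reshard_after_forward: bool = True,
    mesh: Optional[DeviceMesh] = None,
) -> None:
    fsdp_kwargs = {"reshard_after_forward": reshard_after_forward}
    if cpu_offload:
        fsdp_kwargs["offload_policy"] = CPUOffloadPolicy()
    if mesh is not None:
        # 2-D mesh -> HSDP: replicate across dim 0, shard across dim 1
        fsdp_kwargs["mesh"] = mesh

    # Shard matching submodules (children before parents, thanks to reversed),
    # then wrap the root module.
    for name, module in reversed(list(model.named_modules())):
        if name and any(cond(name, module) for cond in shard_conditions):
            fully_shard(module, **fsdp_kwargs)
    fully_shard(model, **fsdp_kwargs)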

@fabiogeraci (Author) commented Dec 10, 2024

I am using the configuration above to fine-tune a 70B model on 2 nodes with 8 GPUs each. The job took 75 minutes to compile (is that usual?)

I also noticed that one of the 16 GPUs was not used at all. I hope the video helps; I also attached the NCCL log, 70b_nccl.txt.
Screencast from 10-12-24 09:59:54.webm
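
Regarding the idle GPU, a quick per-rank sanity check might help confirm that all 16 ranks start and that each one binds a distinct device; the helper below is illustrative only (not part of the recipe) and assumes the process group is already initialized.

import os
import socket

import torch
import torch.distributed as dist

def log_rank_binding() -> None:
    # Call after init_process_group; prints which GPU each rank ends up on.
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)
    print(
        f"host={socket.gethostname()} "
        f"rank={dist.get_rank()}/{dist.get_world_size()} "
        f"local_rank={local_rank} "
        f"cuda_device={torch.cuda.current_device()}",
        flush=True,
    )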

The job was eventually killed for the reason shown below; any suggestions?

# LSBATCH: User input
#BSUB -J gpu-test
#BSUB -o /nfs/users/nfs_f/fg12/scripts/logs/gpu-test_o.%J
#BSUB -e /nfs/users/nfs_f/fg12/scripts/logs/gpu-test_e.%J
#BSUB -n 128
#BSUB -q gpu-parallel
#BSUB -gpu "num=8:gmem=80000:mode=shared:block=yes"
#BSUB -M 768G
#BSUB -R "select[mem>768G] rusage[mem=768G] span[ptile=64]"

TERM_MEMLIMIT: job killed after reaching LSF memory usage limit.
Exited with signal termination: 9.

Resource usage summary:

    CPU time :                                   69121.00 sec.
    Max Memory :                                 793870 MB
    Average Memory :                             398370.69 MB
    Total Requested Memory :                     1572864.00 MB
    Delta Memory :                               778994.00 MB
    Max Swap :                                   -
    Max Processes :                              559
    Max Threads :                                5356
    Run time :                                   6266 sec.
    Turnaround time :                            6269 sec.

70b_config.txt

@joecummings self-assigned this Dec 10, 2024
@joecummings added the discussion and distributed labels Dec 10, 2024
@joecummings (Contributor)

Thanks for the report! Based on your config and the setup you have, I don't see immediately why this would hit your specified memory limit of 768G. Let me get ahold of a multi-node setup today and test this out.
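
One way to narrow this down would be to log per-rank host RSS alongside peak CUDA memory every few steps and see which ranks drive the growth toward the LSF limit. A rough sketch (illustrative only, assuming psutil is available; not part of the recipe):

import os

import psutil
import torch

def log_memory(step: int, every: int = 50) -> None:
    # Periodically report host RSS and peak CUDA memory for this rank.
    if step % every != 0:
        return
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1e9
    cuda_gb = torch.cuda.max_memory_allocated() / 1e9
    rank = os.environ.get("RANK", "0")
    print(
        f"[rank {rank}] step {step}: host RSS {rss_gb:.1f} GB, "
        f"peak CUDA {cuda_gb:.1f} GB",
        flush=True,
    )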

@joecummings (Contributor)

Hey @fabiogeraci, just updating you on this. I'm waiting on a request for a multi-node server (PyTorch has limited quantity). If I don't hear back today, I'll just rent one out on Lambda Labs or something.

@fabiogeraci (Author)

> Hey @fabiogeraci, just updating you on this. I'm waiting on a request for a multi-node server (PyTorch has limited quantity). If I don't hear back today, I'll just rent one out on Lambda Labs or something.

Thank you.
