[BUG] - Unable to run fsdp example #3171

Open
jdgh000 opened this issue Dec 1, 2024 · 1 comment
Labels: bug, CUDA (Issues relating to CUDA), distributed

Comments

jdgh000 commented Dec 1, 2024

Add Link

https://github.com/pytorch/tutorials/blob/main/intermediate_source/FSDP_tutorial.rst
Running this example on a simple setup with two RTX 2070s, even pasted line by line, fails to run.
Toward the end of the log it reports "no space left on device", but I am seeing more than a terabyte of free space:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2070        Off | 00000000:01:00.0  On |                  N/A |
| 41%   41C    P8              11W / 185W |    315MiB /  8192MiB |      7%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 2070 ...    Off | 00000000:04:00.0 Off |                  N/A |
| 41%   35C    P0              38W / 215W |      1MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
WORLD_SIZE:  2
GG: fsdp_main entered: rank:  1 , world_size:  2
GG: train entered: rank:  1 : world_size:  2 , train_loader:  <torch.utils.data.dataloader.DataLoader object at 0x7f71218ab070> , epoch:  1
GG: fsdp_main entered: rank:  0 , world_size:  2
GG: train entered: rank:  0 : world_size:  2 , train_loader:  <torch.utils.data.dataloader.DataLoader object at 0x7f968512c070> , epoch:  1
W1201 19:55:12.485459 339 torch/multiprocessing/spawn.py:160] Terminating process 349 via signal SIGTERM
Traceback (most recent call last):
  File "/root/extdir/gg/git/codelab/gpu/ml/pytorch/distributed/tutorials/3-fsdp/./ex1-fsdp.py", line 229, in <module>
    mp.spawn(fsdp_main,
  File "/usr/local/lib64/python3.9/site-packages/torch/multiprocessing/spawn.py", line 328, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/usr/local/lib64/python3.9/site-packages/torch/multiprocessing/spawn.py", line 284, in start_processes
    while not context.join():
  File "/usr/local/lib64/python3.9/site-packages/torch/multiprocessing/spawn.py", line 203, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib64/python3.9/site-packages/torch/multiprocessing/spawn.py", line 90, in _wrap
    fn(i, *args)
  File "/root/extdir/gg/git/codelab/gpu/ml/pytorch/distributed/tutorials/3-fsdp/ex1-fsdp.py", line 183, in fsdp_main
    train(args, model, rank, world_size, train_loader, optimizer, epoch, sampler=sampler1)
  File "/root/extdir/gg/git/codelab/gpu/ml/pytorch/distributed/tutorials/3-fsdp/ex1-fsdp.py", line 82, in train
    for batch_idx, (data, target) in enumerate(train_loader):
  File "/usr/local/lib64/python3.9/site-packages/torch/utils/data/dataloader.py", line 701, in __next__
    data = self._next_data()
  File "/usr/local/lib64/python3.9/site-packages/torch/utils/data/dataloader.py", line 1465, in _next_data
    return self._process_data(data)
  File "/usr/local/lib64/python3.9/site-packages/torch/utils/data/dataloader.py", line 1491, in _process_data
    data.reraise()
  File "/usr/local/lib64/python3.9/site-packages/torch/_utils.py", line 715, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib64/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 351, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File "/usr/local/lib64/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch
    return self.collate_fn(data)
  File "/usr/local/lib64/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 398, in default_collate
    return collate(batch, collate_fn_map=default_collate_fn_map)
  File "/usr/local/lib64/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 211, in collate
    return [
  File "/usr/local/lib64/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 212, in <listcomp>
    collate(samples, collate_fn_map=collate_fn_map)
  File "/usr/local/lib64/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 155, in collate
    return collate_fn_map[elem_type](batch, collate_fn_map=collate_fn_map)
  File "/usr/local/lib64/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 270, in collate_tensor_fn
    storage = elem._typed_storage()._new_shared(numel, device=elem.device)
  File "/usr/local/lib64/python3.9/site-packages/torch/storage.py", line 1180, in _new_shared
    untyped_storage = torch.UntypedStorage._new_shared(
  File "/usr/local/lib64/python3.9/site-packages/torch/storage.py", line 402, in _new_shared
    return cls._new_using_fd_cpu(size)
RuntimeError: unable to write to file </torch_382_2625200545_0>: No space left on device (28)


/usr/lib64/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 11 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
[root@localhost 3-fsdp]# df -h
Filesystem           Size  Used Avail Use% Mounted on
overlay              1.4T  173G  1.3T  13% /
tmpfs                 64M     0   64M   0% /dev
shm                   64M     0   64M   0% /dev/shm
/dev/mapper/cs-home  1.4T  173G  1.3T  13% /root/extdir
tmpfs                 32G   12K   32G   1% /proc/driver/nvidia
/dev/mapper/cs-root  400G  373G   27G  94% /usr/bin/nvidia-smi
devtmpfs             4.0M     0  4.0M   0% /dev/nvidia0


Describe the bug

Run the example in https://github.com/pytorch/tutorials/blob/main/intermediate_source/FSDP_tutorial.rst without any modification on a multi-GPU setup (in my case, two RTX 2070s).

Describe your environment

CentOS 9 Stream
CUDA 12.2

cc @wconstab @osalpekar @H-Huang @kwen2501 @malfet
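
A minimal narrowing sketch (not part of the original report): the traceback dies in collate_tensor_fn -> UntypedStorage._new_shared, i.e. while a DataLoader worker stages a batch in shared memory. Running the tutorial's data loading with num_workers=0 builds batches in the main process and skips that allocation, which helps separate a data-loading/shared-memory problem from an FSDP one. The MNIST dataset and transforms follow the linked tutorial; batch_size=64 is an assumption.

# Sketch: isolate the DataLoader from FSDP. With num_workers=0 no worker
# subprocess is spawned, so no shared-memory segment is needed to hand the
# batch back to the main process. If this loop runs, the failure above is in
# worker-to-main tensor sharing, not in FSDP itself.
import torch
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),  # MNIST stats, as in the tutorial
])
dataset1 = datasets.MNIST('../data', train=True, download=True, transform=transform)

# num_workers=0 -> batches are collated in the main process, no /dev/shm staging.
train_loader = torch.utils.data.DataLoader(dataset1, batch_size=64, num_workers=0)

data, target = next(iter(train_loader))
print(data.shape, target.shape)  # expect torch.Size([64, 1, 28, 28]) torch.Size([64])
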

jdgh000 added the bug label Dec 1, 2024
svekars added the distributed and CUDA labels Dec 2, 2024
malfet (Contributor) commented Dec 2, 2024

RuntimeError: unable to write to file </torch_382_2625200545_0>: No space left on device (28)

I guess that refers to shared memory rather than disk space. @jdgh000, do you mind sharing the output of your uname -a command?
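
A quick way to test that hypothesis (a sketch, assuming /dev/shm is the backing store, as the 64M shm line in the df output above suggests): check the size of the shared-memory mount and, as a possible workaround, switch PyTorch's tensor-sharing strategy so DataLoader workers do not depend on /dev/shm-backed file descriptors. Enlarging /dev/shm is the other common fix, e.g. --shm-size on the container, since 64M is the Docker default.

# Sketch, not the maintainer's suggestion: inspect /dev/shm and optionally
# switch the sharing strategy. The failing call (_new_using_fd_cpu) allocates
# shared memory for passing batches from DataLoader workers to the main process.
import shutil
import torch.multiprocessing as mp

total, used, free = shutil.disk_usage('/dev/shm')
print(f"/dev/shm: total={total >> 20} MiB, free={free >> 20} MiB")

print("sharing strategies:", mp.get_all_sharing_strategies())

# Possible workaround: share tensors through files on a regular filesystem
# instead of file descriptors backed by shared memory.
mp.set_sharing_strategy('file_system')
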
