Add Link

https://github.com/pytorch/tutorials/blob/main/intermediate_source/FSDP_tutorial.rst

Running on a simple setup with 2 RTX 2070s, this example fails to run even when pasted line by line. Toward the end of the log it says there is no space left on the device, but I am seeing more than a terabyte of free space:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3      |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC  |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M.  |
|                                         |                      |               MIG M.  |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2070        Off | 00000000:01:00.0  On |                  N/A  |
| 41%   41C    P8              11W / 185W |    315MiB /  8192MiB |      7%      Default  |
|                                         |                      |                  N/A  |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 2070 ...    Off | 00000000:04:00.0 Off |                  N/A  |
| 41%   35C    P0              38W / 215W |      1MiB /  8192MiB |      0%      Default  |
|                                         |                      |                  N/A  |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory  |
|        ID   ID                                                             Usage       |
|=========================================================================================|
+---------------------------------------------------------------------------------------+

WORLD_SIZE: 2
GG: fsdp_main entered: rank: 1 , world_size: 2
GG: train entered: rank: 1 : world_size: 2 , train_loader: <torch.utils.data.dataloader.DataLoader object at 0x7f71218ab070> , epoch: 1
GG: fsdp_main entered: rank: 0 , world_size: 2
GG: train entered: rank: 0 : world_size: 2 , train_loader: <torch.utils.data.dataloader.DataLoader object at 0x7f968512c070> , epoch: 1
W1201 19:55:12.485459 339 torch/multiprocessing/spawn.py:160] Terminating process 349 via signal SIGTERM
Traceback (most recent call last):
  File "/root/extdir/gg/git/codelab/gpu/ml/pytorch/distributed/tutorials/3-fsdp/./ex1-fsdp.py", line 229, in <module>
    mp.spawn(fsdp_main,
  File "/usr/local/lib64/python3.9/site-packages/torch/multiprocessing/spawn.py", line 328, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/usr/local/lib64/python3.9/site-packages/torch/multiprocessing/spawn.py", line 284, in start_processes
    while not context.join():
  File "/usr/local/lib64/python3.9/site-packages/torch/multiprocessing/spawn.py", line 203, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib64/python3.9/site-packages/torch/multiprocessing/spawn.py", line 90, in _wrap
    fn(i, *args)
  File "/root/extdir/gg/git/codelab/gpu/ml/pytorch/distributed/tutorials/3-fsdp/ex1-fsdp.py", line 183, in fsdp_main
    train(args, model, rank, world_size, train_loader, optimizer, epoch, sampler=sampler1)
  File "/root/extdir/gg/git/codelab/gpu/ml/pytorch/distributed/tutorials/3-fsdp/ex1-fsdp.py", line 82, in train
    for batch_idx, (data, target) in enumerate(train_loader):
  File "/usr/local/lib64/python3.9/site-packages/torch/utils/data/dataloader.py", line 701, in __next__
    data = self._next_data()
  File "/usr/local/lib64/python3.9/site-packages/torch/utils/data/dataloader.py", line 1465, in _next_data
    return self._process_data(data)
  File "/usr/local/lib64/python3.9/site-packages/torch/utils/data/dataloader.py", line 1491, in _process_data
    data.reraise()
  File "/usr/local/lib64/python3.9/site-packages/torch/_utils.py", line 715, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib64/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 351, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File "/usr/local/lib64/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch
    return self.collate_fn(data)
  File "/usr/local/lib64/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 398, in default_collate
    return collate(batch, collate_fn_map=default_collate_fn_map)
  File "/usr/local/lib64/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 211, in collate
    return [
  File "/usr/local/lib64/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 212, in <listcomp>
    collate(samples, collate_fn_map=collate_fn_map)
  File "/usr/local/lib64/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 155, in collate
    return collate_fn_map[elem_type](batch, collate_fn_map=collate_fn_map)
  File "/usr/local/lib64/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 270, in collate_tensor_fn
    storage = elem._typed_storage()._new_shared(numel, device=elem.device)
  File "/usr/local/lib64/python3.9/site-packages/torch/storage.py", line 1180, in _new_shared
    untyped_storage = torch.UntypedStorage._new_shared(
  File "/usr/local/lib64/python3.9/site-packages/torch/storage.py", line 402, in _new_shared
    return cls._new_using_fd_cpu(size)
RuntimeError: unable to write to file </torch_382_2625200545_0>: No space left on device (28)

/usr/lib64/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 11 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

[root@localhost 3-fsdp]# df -h
Filesystem           Size  Used Avail Use% Mounted on
overlay              1.4T  173G  1.3T  13% /
tmpfs                 64M     0   64M   0% /dev
shm                   64M     0   64M   0% /dev/shm
/dev/mapper/cs-home  1.4T  173G  1.3T  13% /root/extdir
tmpfs                 32G   12K   32G   1% /proc/driver/nvidia
/dev/mapper/cs-root  400G  373G   27G  94% /usr/bin/nvidia-smi
devtmpfs             4.0M     0  4.0M   0% /dev/nvidia0
Describe the bug

Run the example in https://github.com/pytorch/tutorials/blob/main/intermediate_source/FSDP_tutorial.rst without any modification on a multi-GPU machine (in my case, 2 RTX 2070s).
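For reference, a rough sketch of the launcher that the traceback points at, reconstructed from the log above (names such as fsdp_main and WORLD_SIZE come from that output; the actual tutorial code differs in detail):

import torch
import torch.multiprocessing as mp

def fsdp_main(rank, world_size, args):
    # In the tutorial, each spawned process initializes its process group,
    # builds a DataLoader with a DistributedSampler, wraps the model in FSDP,
    # and calls train(...); the failure above happens inside the DataLoader
    # worker started from that train loop.
    print(f"fsdp_main entered: rank: {rank} , world_size: {world_size}")

if __name__ == "__main__":
    WORLD_SIZE = torch.cuda.device_count()  # 2 on the machine above
    print("WORLD_SIZE:", WORLD_SIZE)
    mp.spawn(fsdp_main,
             args=(WORLD_SIZE, None),  # None stands in for the parsed args object
             nprocs=WORLD_SIZE,
             join=True)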
Describe your environment

CentOS 9 Stream, CUDA 12.2.
cc @wconstab @osalpekar @H-Huang @kwen2501 @malfet
RuntimeError: unable to write to file </torch_382_2625200545_0>: No space left on device (28)
I guess that refers to shared memory space rather than disk space. @jdgh000, do you mind sharing the output of your uname -a command?
uname -a
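If it is indeed shared memory, the df -h output above already shows the likely culprit: shm is only 64M on /dev/shm, which is the default size in many container runtimes, and DataLoader worker processes allocate each collated batch in shared memory rather than on disk. As a rough sketch of possible workarounds (not verified on this exact setup, and the dataset below is a stand-in rather than the tutorial's MNIST pipeline), one can either enlarge the shared memory segment (for Docker, e.g. --shm-size=8g on docker run) or keep data loading in the main process:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset just to make the snippet self-contained; the tutorial
# uses MNIST with a DistributedSampler instead.
dataset = TensorDataset(torch.randn(256, 1, 28, 28),
                        torch.randint(0, 10, (256,)))

# With num_workers > 0, each batch is written into /dev/shm so it can be
# passed from the worker process to the main process, which fails once the
# 64M shm fills up. num_workers=0 collates batches in the main process, so
# nothing is placed in shared memory.
loader = DataLoader(dataset, batch_size=64, num_workers=0, pin_memory=True)

for data, target in loader:
    pass  # training step goes here

The trade-off with num_workers=0 is slower input pipelining, so enlarging /dev/shm is usually the better long-term fix when running inside a container.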