Description
Implementations of PyTorchLightningEstimator.create_predictor, such as DeepAREstimator.create_predictor and SimpleFeedForwardEstimator.create_predictor, pass device="auto" in the constructor of PyTorchPredictor.
This means that, if a model is trained using a cpu accelerator, but a gpu is present and set up on the machine, the returned predictor will be loaded onto the gpu.
This can be observed when constructing the Estimator and then calling Estimator.train().
What happens is a call to Estimator.train_model, which finishes with a call to create_predictor here.
To Reproduce
This can be shown using the following code sample and nvidia-smi, adapted from the gluonts tutorials.
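A minimal sketch of such a script (assuming the airpassengers dataset and a small DeepAREstimator, as in the quick-start tutorial; the exact estimator and hyperparameters don't matter here):

from gluonts.dataset.repository import get_dataset
from gluonts.torch import DeepAREstimator

dataset = get_dataset("airpassengers")

# Explicitly request the cpu accelerator for training.
estimator = DeepAREstimator(
    freq=dataset.metadata.freq,
    prediction_length=dataset.metadata.prediction_length,
    trainer_kwargs={"max_epochs": 1, "accelerator": "cpu"},
)
predictor = estimator.train(dataset.train)

# The returned predictor's network can still end up on a cuda device,
# because create_predictor is called with device="auto".
print(next(predictor.prediction_net.parameters()).device)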
Error message or code output
Calling nvidia-smi after running this code will indicate that the process is running on a gpu, something like the following:
Wed Jul 31 14:54:10 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.199.02    Driver Version: 470.199.02    CUDA Version: 11.4   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla M60           Off  | 00000000:00:1D.0 Off |                    0 |
| N/A   28C    P0    37W / 150W |    721MiB /  7618MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M60           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   29C    P8    15W / 150W |      6MiB /  7618MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1186      G   /usr/lib/xorg/Xorg                  3MiB |
|    0   N/A  N/A      6479      C   ...ython/ts_torch/bin/python      714MiB |
|    1   N/A  N/A      1186      G   /usr/lib/xorg/Xorg                  3MiB |
+-----------------------------------------------------------------------------+
Workaround
Setting the CUDA_VISIBLE_DEVICES env variable to -1 before code execution should prevent the auto-detection of any cuda gpu (see the snippet below).
This has side effects, as no further code in the process can detect gpus.
This might not work for other device types; I have not tested that myself.
This must be done before gluonts is imported.
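For example, at the very top of the script:

import os

# Hide all cuda devices before anything gets a chance to detect them.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

# Only import gluonts (and anything else that touches torch/cuda) afterwards.
from gluonts.torch import DeepAREstimator  # noqa: E402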
Possible fix
I have had a quick look at the codebase (as a new arrival here, so I might be missing some stuff), and from a naive perspective it seems that this could be a fairly light-touch fix:
Adding the parameter device with a default value of auto to the get_predictor method
Either:
1: passing the kwargs present in the Estimator.train() function args down through to train_model(), and adding a specific kwarg for the predictor device (predictor_device); a toy sketch of this is given below
2: doing something smart with the (if present) trainer_kwargs.accelerator inside the train_model() function
I'm not certain that option 2 is a good idea, as I don't think there is a 1-1 mapping between lightning accelerators and torch device types.
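To make option 1 concrete, here is a toy illustration of the pass-through (this is not gluonts code; the predictor_device name and the signatures are only assumptions about what the change could look like):

import torch


def resolve_device(device: str) -> torch.device:
    # Mirrors the current "auto" behaviour: prefer cuda when it is available.
    if device == "auto":
        return torch.device("cuda" if torch.cuda.is_available() else "cpu")
    return torch.device(device)


class ToyEstimator:
    def create_predictor(self, module: torch.nn.Module, device: str = "auto") -> torch.nn.Module:
        # Today the device is effectively hard-coded to "auto"; here it is a parameter.
        return module.to(resolve_device(device))

    def train_model(self, predictor_device: str = "auto") -> torch.nn.Module:
        module = torch.nn.Linear(4, 1)  # stand-in for the trained network
        # ... training would happen here on whatever accelerator is configured ...
        return self.create_predictor(module, device=predictor_device)

    def train(self, predictor_device: str = "auto", **kwargs) -> torch.nn.Module:
        # The new kwarg is simply forwarded down to where the predictor is built.
        return self.train_model(predictor_device=predictor_device, **kwargs)


predictor = ToyEstimator().train(predictor_device="cpu")
print(next(predictor.parameters()).device)  # cpu, even if a gpu is present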
I'm happy to do the work to implement the fix, but I'd be curious to first know if anyone has ideas for a nicer approach?
Environment
Let me know if you have any questions about my setup, I'll be happy to help
Thanks!