multi-gpu evaluate 'fix' #15

Open
wants to merge 1 commit into neighbors_convert_cpp
Conversation

DavideTisi
Collaborator

This is more an issue with a quick-and-dirty fix than a real PR: I ran into the following problem while trying to run PET through i-PI on izar:

/work/cosmo/tisi/src/pet-venv-gpu/lib/python3.10/site-packages/torch_geometric/nn/data_parallel.py:60: UserWarning: 'DataParallel' is usually much slower than 'DistributedDataParallel' even on a single machine. Please consider switching to 'DistributedDataParallel' for multi-GPU training.
  warnings.warn("'DataParallel' is usually much slower than "
Traceback (most recent call last):
  File "/work/cosmo/tisi/src/i-pi/drivers/py/driver.py", line 237, in <module>
    run_driver(
  File "/work/cosmo/tisi/src/i-pi/drivers/py/driver.py", line 130, in run_driver
    pot, force, vir, extras = driver(cell, pos)
  File "/work/cosmo/tisi/src/i-pi/drivers/py/pes/pet.py", line 94, in __call__
    pot, force = self.pet_calc.forward(pet_structure)
  File "/work/cosmo/tisi/src/pet-venv-gpu/lib/python3.10/site-packages/pet/single_struct_calculator.py", line 91, in forward
    prediction_energy, prediction_forces = self.model([graph])
  File "/work/cosmo/tisi/src/pet-venv-gpu/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/work/cosmo/tisi/src/pet-venv-gpu/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/work/cosmo/tisi/src/pet-venv-gpu/lib/python3.10/site-packages/torch_geometric/nn/data_parallel.py", line 91, in forward
    outputs = self.parallel_apply(replicas, inputs, None)
  File "/work/cosmo/tisi/src/pet-venv-gpu/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 200, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/work/cosmo/tisi/src/pet-venv-gpu/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 108, in parallel_apply
    output.reraise()
  File "/work/cosmo/tisi/src/pet-venv-gpu/lib/python3.10/site-packages/torch/_utils.py", line 705, in reraise
    raise exception
TypeError: Caught TypeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/work/cosmo/tisi/src/pet-venv-gpu/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in _worker
    output = module(*input, **kwargs)
  File "/work/cosmo/tisi/src/pet-venv-gpu/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/work/cosmo/tisi/src/pet-venv-gpu/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
TypeError: PETMLIPWrapper.forward() missing 2 required positional arguments: 'augmentation' and 'create_graph'

The problem arises because the model was trained on multiple GPUs and the wrapper does not really work at evaluation time. On izar the run actually uses a single GPU, but two GPUs are visible, so the code tries to wrap the model for multi-GPU execution and then fails. I raised that limit to 4 so that the if statement evaluates to false.

Of course, this is not a long-term solution.
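For reference, a minimal sketch of the kind of check involved (the function and parameter names here are hypothetical, not the actual PET code): DataParallel wrapping is triggered purely by the visible GPU count, and raising the threshold keeps the bare model, so inference calls forward() with its full argument list instead of going through the DataParallel path, which never supplies the wrapper's required 'augmentation' and 'create_graph' arguments.

```python
import torch
from torch_geometric.nn import DataParallel


def maybe_wrap_multi_gpu(model, min_gpus_for_dp=4):
    # Hypothetical sketch of the check described above. The original logic
    # wrapped the model as soon as more than one GPU was visible; raising
    # the threshold (here to 4) keeps the bare model, so the wrapper's
    # forward() is called directly with all of its positional arguments.
    if torch.cuda.device_count() >= min_gpus_for_dp:
        return DataParallel(model)
    return model
```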

@ceriottm
Contributor

Is the multi-GPU path used to parallelize over structures, or over a single structure? In the former case I think inference should never use multi-GPU; in the latter case it should be an optional parameter, like device.
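One possible shape for that second option, sketched with hypothetical names (the simplified SingleStructCalculator constructor and the use_data_parallel flag are illustrative, not the actual PET API): multi-GPU wrapping becomes an explicit opt-in alongside device, instead of being triggered automatically by the visible GPU count.

```python
import torch
from torch_geometric.nn import DataParallel


class SingleStructCalculator:
    # Illustrative constructor only; names and signature are assumptions.
    def __init__(self, model, device="cuda", use_data_parallel=False):
        if use_data_parallel and torch.cuda.device_count() > 1:
            # Parallelize over (batches of) structures only when asked to.
            self.model = DataParallel(model)
        else:
            # Default: plain single-device model, which is what inference
            # through i-PI needs.
            self.model = model.to(device)
```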
