multi-gpu evaluate 'fix' #15

Open
wants to merge 1 commit into neighbors_convert_cpp
Conversation

DavideTisi
Collaborator

This is more an issue with a quick-and-dirty fix than a real PR: I ran into the following problem while trying to run PET through i-PI on izar:

/work/cosmo/tisi/src/pet-venv-gpu/lib/python3.10/site-packages/torch_geometric/nn/data_parallel.py:60: UserWarning: 'DataParallel' is usually much slower than 'DistributedDataParallel' even on a single machine. Please consider switching to 'DistributedDataParallel' for multi-GPU training.
  warnings.warn("'DataParallel' is usually much slower than "
Traceback (most recent call last):
  File "/work/cosmo/tisi/src/i-pi/drivers/py/driver.py", line 237, in <module>
    run_driver(
  File "/work/cosmo/tisi/src/i-pi/drivers/py/driver.py", line 130, in run_driver
    pot, force, vir, extras = driver(cell, pos)
  File "/work/cosmo/tisi/src/i-pi/drivers/py/pes/pet.py", line 94, in __call__
    pot, force = self.pet_calc.forward(pet_structure)
  File "/work/cosmo/tisi/src/pet-venv-gpu/lib/python3.10/site-packages/pet/single_struct_calculator.py", line 91, in forward
    prediction_energy, prediction_forces = self.model([graph])
  File "/work/cosmo/tisi/src/pet-venv-gpu/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/work/cosmo/tisi/src/pet-venv-gpu/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/work/cosmo/tisi/src/pet-venv-gpu/lib/python3.10/site-packages/torch_geometric/nn/data_parallel.py", line 91, in forward
    outputs = self.parallel_apply(replicas, inputs, None)
  File "/work/cosmo/tisi/src/pet-venv-gpu/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 200, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/work/cosmo/tisi/src/pet-venv-gpu/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 108, in parallel_apply
    output.reraise()
  File "/work/cosmo/tisi/src/pet-venv-gpu/lib/python3.10/site-packages/torch/_utils.py", line 705, in reraise
    raise exception
TypeError: Caught TypeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/work/cosmo/tisi/src/pet-venv-gpu/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in _worker
    output = module(*input, **kwargs)
  File "/work/cosmo/tisi/src/pet-venv-gpu/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/work/cosmo/tisi/src/pet-venv-gpu/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
TypeError: PETMLIPWrapper.forward() missing 2 required positional arguments: 'augmentation' and 'create_graph'

The problem arises because the model was trained on multiple GPUs and the wrapper does not really work at evaluation time. On izar the run actually uses a single GPU, but two GPUs are visible, so the code tries to wrap the model for multi-GPU execution and then fails. I raised that limit to 4 so that the if statement evaluates to false.

Of course, this is not a long-term solution.
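For reference, a minimal sketch of the kind of check involved (the function and parameter names here are hypothetical, not the actual PET code): DataParallel wrapping is triggered purely by the visible GPU count, and raising the threshold keeps the bare model, so inference calls forward() with its full argument list instead of going through the DataParallel path, which never supplies the wrapper's required 'augmentation' and 'create_graph' arguments.

```python
import torch
from torch_geometric.nn import DataParallel


def maybe_wrap_multi_gpu(model, min_gpus_for_dp=4):
    # Hypothetical sketch of the check described above. The original logic
    # wrapped the model as soon as more than one GPU was visible; raising
    # the threshold (here to 4) keeps the bare model, so the wrapper's
    # forward() is called directly with all of its positional arguments.
    if torch.cuda.device_count() >= min_gpus_for_dp:
        return DataParallel(model)
    return model
```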

@ceriottm
Contributor

Is the multi-GPU path used to parallelize over structures, or over a single structure? In the former case I think inference should never use multi-GPU; in the latter case it should be an optional parameter, like device.
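One possible shape for that second option, sketched with hypothetical names (the simplified SingleStructCalculator constructor and the use_data_parallel flag are illustrative, not the actual PET API): multi-GPU wrapping becomes an explicit opt-in alongside device, instead of being triggered automatically by the visible GPU count.

```python
import torch
from torch_geometric.nn import DataParallel


class SingleStructCalculator:
    # Illustrative constructor only; names and signature are assumptions.
    def __init__(self, model, device="cuda", use_data_parallel=False):
        if use_data_parallel and torch.cuda.device_count() > 1:
            # Parallelize over (batches of) structures only when asked to.
            self.model = DataParallel(model)
        else:
            # Default: plain single-device model, which is what inference
            # through i-PI needs.
            self.model = model.to(device)
```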
