The current experimentation code in benchmarking runs evaluation in the same process as the subsequent training runs. This is a problem when using DDP: the first evaluation (Line 295) spawns one process per GPU, and each of those processes then tries to spawn its own training processes, causing DDP port clashes.
Describe the solution you'd like
The evaluation/testing should run in a separate process, à la `run_training_job`; this issue would then not occur.
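A minimal sketch of what this could look like: launch the evaluation in a fresh interpreter and wait for it to exit before starting the next training job. The `run_job` helper and the idea of a `python -m`-invokable entry point are illustrative assumptions, not taken from the benchmarking code.

```python
import subprocess
import sys


def run_job(module: str, *args: str) -> int:
    """Run `python -m <module> <args...>` as its own OS process and wait.

    Because the evaluation gets a fresh interpreter, any DDP worker
    processes it spawns bind their ports, finish, and are torn down
    before control returns here - so a subsequent training job started
    from this process cannot clash with stale DDP ports.
    """
    cmd = [sys.executable, "-m", module, *args]
    return subprocess.run(cmd).returncode
```

The experimentation loop would then call `run_job("my_benchmark.evaluate", ...)` (hypothetical module name) instead of invoking the evaluation function in-process, mirroring how `run_training_job` already isolates training.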
**Additional info:**
See the discussion on Lightning forum: Lightning-AI/pytorch-lightning#2537