Python3.8
torch=2.0.1
torch-tb-profiler=0.4.3 # built from source
In TensorBoard, the Overview view reports communication as 0.
In the Distributed view:
No bar charts are shown for Synchronizing/Communication Overview.
The table at the bottom, Communication Operation stats, shows 0 in the total latency, avg latency, data transfer time, and avg data transfer time columns.
When I instead use:
Python3.8
torch=1.11.0
torch-tb-profiler=0.4.3 # built from source
There are no issues and the views render properly.
However, from torch=1.12 onward, the communication and distributed views fail to show up properly.
Does anyone have any insight into why this may be the case?
I'm looking at the .json logs for both of these runs.
One observation: in the .json generated by torch=2.0.1, the objects named "ncclKernel_AllReduce_RING_LL_Sum_float(ncclDevComm*, unsigned long, ncclWork*)" have the same value in their External id and correlation fields, whereas in torch=1.11.0 these two fields have different values.
In torch=1.11.0, the External id also matches the External id of various other .json objects, with names such as cudaEventRecord and cudaLaunchKernel.
This is not the case in the .json generated by torch=2.0.1.
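For anyone who wants to repeat this comparison on their own traces, here is a rough sketch of how the check could be scripted. It assumes the trace is a Chrome-trace style file with a top-level "traceEvents" list and that the External id and correlation fields sit under each event's "args" object; the function names and the "ncclKernel" name prefix are my own choices for illustration, not part of the profiler or the plugin.

```python
import gzip
import json
import sys
from collections import Counter


def load_trace(path):
    """Load a profiler trace from a .json or .json.gz file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        return json.load(f)


def _events(trace):
    # Assumption: events live in a top-level "traceEvents" list.
    return [ev for ev in trace.get("traceEvents", []) if isinstance(ev, dict)]


def summarize_nccl_ids(trace, prefix="ncclKernel"):
    """Count NCCL kernel events by whether External id equals correlation."""
    counts = Counter()
    for ev in _events(trace):
        if not str(ev.get("name", "")).startswith(prefix):
            continue
        # Assumption: both fields are stored under the event's "args".
        args = ev.get("args") or {}
        ext, corr = args.get("External id"), args.get("correlation")
        if ext is None or corr is None:
            counts["missing"] += 1
        elif ext == corr:
            counts["same"] += 1
        else:
            counts["different"] += 1
    return counts


def linked_event_names(trace, prefix="ncclKernel"):
    """Count non-NCCL events that share an External id with an NCCL kernel."""
    events = _events(trace)
    nccl_ids = set()
    for ev in events:
        if str(ev.get("name", "")).startswith(prefix):
            ext = (ev.get("args") or {}).get("External id")
            if ext is not None:
                nccl_ids.add(ext)
    linked = Counter()
    for ev in events:
        name = str(ev.get("name", ""))
        if name.startswith(prefix):
            continue
        if (ev.get("args") or {}).get("External id") in nccl_ids:
            linked[name] += 1
    return linked


if __name__ == "__main__":
    # Usage: python compare_ids.py <trace.json or trace.json.gz>
    trace = load_trace(sys.argv[1])
    print("NCCL kernels:", dict(summarize_nccl_ids(trace)))
    print("Linked events:", dict(linked_event_names(trace)))
```

Running it once on the torch=1.11.0 trace and once on the torch=2.0.1 trace should make the difference described above easy to see: mostly "different" with linked cudaLaunchKernel/cudaEventRecord entries in the first case, mostly "same" with few or no linked entries in the second, if the observation holds for the whole file.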
Hi, I am using the sample script resnet50_ddp_profiler.py from this repository: https://github.com/pytorch/kineto/blob/main/tb_plugin/examples/resnet50_ddp_profiler.py