
[torchbench] torchrec_dlrm fails to run #548

Open
pbchekin opened this issue Feb 22, 2024 · 5 comments
@pbchekin (Contributor):

./inductor_xpu_test.sh torchbench amp_fp16 inference accuracy xpu 0 static 1 0 torchrec_dlrm
Traceback (most recent call last):
  File "/home/jovyan/pytorch/benchmarks/dynamo/torchbench.py", line 481, in <module>
    torchbench_main()
  File "/home/jovyan/pytorch/benchmarks/dynamo/torchbench.py", line 477, in torchbench_main
    main(TorchBenchmarkRunner(), original_dir)
  File "/home/jovyan/pytorch/benchmarks/dynamo/common.py", line 3041, in main
    process_entry(0, runner, original_dir, args)
  File "/home/jovyan/pytorch/benchmarks/dynamo/common.py", line 2998, in process_entry
    return maybe_fresh_cache(
  File "/home/jovyan/pytorch/benchmarks/dynamo/common.py", line 1661, in inner
    return fn(*args, **kwargs)
  File "/home/jovyan/pytorch/benchmarks/dynamo/common.py", line 3451, in run
    ) = runner.load_model(
  File "/home/jovyan/pytorch/benchmarks/dynamo/torchbench.py", line 313, in load_model
    module = importlib.import_module(c)
  File "/home/jovyan/.conda/envs/python-3.9/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/home/jovyan/benchmark/torchbenchmark/canary_models/torchrec_dlrm/__init__.py", line 7, in <module>
    from .data.dlrm_dataloader import get_dataloader
  File "/home/jovyan/benchmark/torchbenchmark/canary_models/torchrec_dlrm/data/dlrm_dataloader.py", line 13, in <module>
    from torchrec.datasets.criteo import (
  File "/home/jovyan/.conda/envs/python-3.9/lib/python3.9/site-packages/torchrec/__init__.py", line 8, in <module>
    import torchrec.distributed  # noqa
  File "/home/jovyan/.conda/envs/python-3.9/lib/python3.9/site-packages/torchrec/distributed/__init__.py", line 36, in <module>
    from torchrec.distributed.model_parallel import DistributedModelParallel  # noqa
  File "/home/jovyan/.conda/envs/python-3.9/lib/python3.9/site-packages/torchrec/distributed/model_parallel.py", line 24, in <module>
    from torchrec.distributed.planner import EmbeddingShardingPlanner, Topology
  File "/home/jovyan/.conda/envs/python-3.9/lib/python3.9/site-packages/torchrec/distributed/planner/__init__.py", line 22, in <module>
    from torchrec.distributed.planner.planners import EmbeddingShardingPlanner  # noqa
  File "/home/jovyan/.conda/envs/python-3.9/lib/python3.9/site-packages/torchrec/distributed/planner/planners.py", line 19, in <module>
    from torchrec.distributed.planner.constants import BATCH_SIZE, MAX_SIZE
  File "/home/jovyan/.conda/envs/python-3.9/lib/python3.9/site-packages/torchrec/distributed/planner/constants.py", line 10, in <module>
    from torchrec.distributed.embedding_types import EmbeddingComputeKernel
  File "/home/jovyan/.conda/envs/python-3.9/lib/python3.9/site-packages/torchrec/distributed/embedding_types.py", line 14, in <module>
    from fbgemm_gpu.split_table_batched_embeddings_ops_training import EmbeddingLocation
  File "/home/jovyan/.conda/envs/python-3.9/lib/python3.9/site-packages/fbgemm_gpu/__init__.py", line 23, in <module>
    from . import _fbgemm_gpu_docs, sparse_ops  # noqa: F401, E402  # noqa: F401, E402
  File "/home/jovyan/.conda/envs/python-3.9/lib/python3.9/site-packages/fbgemm_gpu/_fbgemm_gpu_docs.py", line 19, in <module>
    torch.ops.fbgemm.jagged_2d_to_dense,
  File "/home/jovyan/.conda/envs/python-3.9/lib/python3.9/site-packages/torch/_ops.py", line 761, in __getattr__
    raise AttributeError(
AttributeError: '_OpNamespace' 'fbgemm' object has no attribute 'jagged_2d_to_dense'
@pbchekin added labels: bug (Something isn't working), tests: e2e on Feb 22, 2024
@vlad-penkin added this to the "03. E2E pass rate" milestone on Feb 24, 2024
@gshimansky (Contributor):

The problem with this benchmark is that it unconditionally imports torchrec, which in turn unconditionally imports fbgemm. Both libraries appear to exist only for CUDA (fbgemm especially) and are not supposed to work on any other GPUs.
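One possible mitigation (a sketch, not the benchmark's actual code; the helper name is hypothetical) is to guard the import chain so a harness can skip the model instead of crashing when torchrec/fbgemm cannot load:

```python
import importlib


def try_import(module_name: str):
    """Attempt an import chain; return the module, or None on failure.

    torchrec unconditionally imports fbgemm_gpu, so any failure in that
    chain (missing package, CUDA-only binary, ABI mismatch in the .so)
    surfaces here as ImportError, OSError, or AttributeError.
    """
    try:
        return importlib.import_module(module_name)
    except (ImportError, OSError, AttributeError) as exc:
        print(f"skipping model, dependency unavailable: {exc}")
        return None


# A benchmark harness could mark torchrec_dlrm as SKIP instead of FAIL:
if try_import("torchrec") is None:
    pass  # record the model as skipped rather than errored
```

This does not fix fbgemm itself; it only turns a hard crash into a recorded skip, which is often what a cross-vendor benchmark run wants.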

@gshimansky (Contributor) commented Feb 27, 2024:

The torchrec README suggests installing fbgemm_gpu for CPU using the command pip install fbgemm-gpu --index-url https://download.pytorch.org/whl/nightly/cpu, but with this version I am getting AttributeError: '_OpNamespace' 'fbgemm' object has no attribute 'jagged_2d_to_dense' again.
The problem is caused by an incompatibility between the fbgemm nightly and our PyTorch version. When the native code is loaded I see an error /home/jovyan/.conda/envs/triton-no-conda-3.10-stonepia/lib/python3.10/site-packages/fbgemm_gpu/fbgemm_gpu_py.so: undefined symbol: _ZNK5torch8autograd4Node4nameEv, and therefore there are no native function definitions.
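The undefined-symbol failure can be reproduced in isolation by dlopen-ing the extension library directly, e.g. with ctypes (the path below is illustrative; point it at the installed fbgemm_gpu_py.so):

```python
import ctypes


def diagnose_extension(so_path: str) -> str:
    """Try to load a compiled extension and report why it failed.

    An ABI mismatch between the extension and the installed PyTorch
    shows up as an OSError mentioning an undefined C++ symbol, e.g.
    "undefined symbol: _ZNK5torch8autograd4Node4nameEv"
    (demangle it with `c++filt` to see the missing torch function).
    """
    try:
        ctypes.CDLL(so_path)
    except OSError as exc:
        return f"load failed: {exc}"
    return "loaded"


print(diagnose_extension("fbgemm_gpu_py.so"))
```

This separates "the wheel is built against a different PyTorch" (load fails with an undefined symbol) from "the ops were simply never registered", which the Python-level AttributeError alone does not distinguish.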

@gshimansky (Contributor):

OK, it is possible to make this benchmark work, but it takes considerable effort.

  1. The fbgemm_gpu package is built for CUDA by default, and we cannot use the CPU binaries because they do not match our installation of PyTorch, so we need to build fbgemm_gpu from source.
  2. There is a bug in the fbgemm_gpu sources; they need to be patched. These three lines should be deleted https://github.com/pytorch/FBGEMM/blob/main/fbgemm_gpu/src/jagged_tensor_ops/jagged_tensor_ops_cpu.cpp#L1661-L1663 because they should exist only under an #ifdef.
  3. Some extra packages are required to build the library successfully. Here are the configuration commands I used:
mkdir build
cd build
cmake -DUSE_SANITIZER=address -DFBGEMM_LIBRARY_TYPE=shared -DPYTHON_EXECUTABLE=`which python3` -DFBGEMM_BUILD_DOCS=OFF -DFBGEMM_BUILD_BENCHMARKS=OFF -DCMAKE_INSTALL_PREFIX=${CONDA_PREFIX} ..
make -j
make install
cd ../fbgemm_gpu
export package_name=fbgemm_gpu_cpu
export python_tag=py310
export ARCH=$(uname -m)
export python_plat_name="manylinux2014_${ARCH}"
python setup.py bdist_wheel \
    --package_variant=cpu \
    --package_name="${package_name}" \
    --python-tag="${python_tag}" \
    --plat-name="${python_plat_name}"
python setup.py install --package_variant=cpu

It is possible that the C++ library is not required for Python; maybe it is enough to build just the Python fbgemm_gpu package.
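After installing a rebuilt fbgemm_gpu, a quick sanity check (a sketch; the function name is hypothetical) is to probe for the exact op that originally raised the AttributeError, without letting a broken install crash the check itself:

```python
def check_jagged_op() -> str:
    """Post-install sanity check for torch.ops.fbgemm.jagged_2d_to_dense.

    Returns a status string instead of raising, so it can run even in
    environments where torch or fbgemm_gpu are absent or broken.
    """
    try:
        import torch
        import fbgemm_gpu  # noqa: F401  # importing registers torch.ops.fbgemm.*
    except (ImportError, OSError) as exc:
        # OSError covers undefined-symbol failures when the .so was
        # built against a different PyTorch ABI.
        return f"not installed: {exc}"
    if not hasattr(torch.ops.fbgemm, "jagged_2d_to_dense"):
        return "fbgemm_gpu imported but ops are missing (ABI mismatch?)"
    return "ok"


print(check_jagged_op())
```

If this prints "ok", the rebuilt wheel matches the installed PyTorch and the benchmark's import chain should get past the point of the original failure.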

@gshimansky (Contributor):

Filed a bug report on the fbgemm_gpu build: pytorch/FBGEMM#2362

@vlad-penkin (Contributor):

The issue is still reproducible.
