
[Question] GenAI ROCm #3411

Open

robertgshaw2-neuralmagic opened this issue Nov 23, 2024 · 12 comments

robertgshaw2-neuralmagic commented Nov 23, 2024

Hello!

I work on the vllm-project. I worked with FBGEMM in the past for the Llama-405B launch in vLLM. As part of our 2025 roadmap planning, we are evaluating options for FP8 compute on ROCm. I noticed several PRs in the v1.0.0 release (https://github.com/pytorch/FBGEMM/releases/tag/v1.0.0) include FP8 GEMM support, and the documentation suggests ROCm is a target for the kernels. However, I also noticed that the ROCm build in the CMakeLists.txt (https://github.com/pytorch/FBGEMM/blob/main/fbgemm_gpu/CMakeLists.txt#L195) skips the genai compilation.

I also spent a few hours trying to get things to build for ROCm following https://pytorch.org/FBGEMM/fbgemm_gpu-development/BuildInstructions.html using rocm/rocm-terminal:6.2.0 and rocm/dev-ubuntu-22.04:6.2.2, but was unsuccessful.

I have a couple of questions:

  • Is there any reason why the genai codepath is currently skipped for ROCm that we should be aware of?
  • Is there a recommended build image that could be shared for ROCm 6.2?

Thanks!

q10 (Contributor) commented Nov 23, 2024

Hi @robertgshaw2-neuralmagic, we are currently working on enabling ROCm for the GenAI variant of the build. To my knowledge, the GenAI section of the codebase depends on code that exists in the develop branch of ROCm/CK and is not available in any public release yet.

I will work with the team to see if we can work around this limitation, otherwise we are blocked from making the code build on OSS until some of those changes in the develop branch land into a public release of CK.

@robertgshaw2-neuralmagic (Author)

I see. Thanks for the clear and quick response! I will check back in a few days.

@robertgshaw2-neuralmagic (Author)

Hey @q10 - I was wondering if you had any updates.

@amathews-amd
@shajrawi @gshtras @sunway513 for vLLM support on ROCm.

shajrawi commented Dec 4, 2024

For FP8 GEMM compute performance with Llama, my current recommendation is ROCm 6.3, because the math library (hipblaslt) has a lot of GEMM performance improvements, plus PyTorch nightly for the scaled_mm and tunable-ops improvements.

For the ROCm 6.2 question, I would recommend this Docker image, which builds a newer hipblaslt library + PyTorch: https://github.com/ROCm/vllm/blob/main/Dockerfile.rocm
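
For reference, a minimal sketch of turning on PyTorch's TunableOp path (which lets hipBLASLt select and tune GEMM solutions on ROCm); the environment variables are TunableOp's documented knobs, and the shapes are placeholders rather than a vLLM configuration:

import os

# TunableOp is read at runtime; setting these before importing torch is the
# safest way to make the tuned GEMM path active for every call.
os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"  # enable the tunable GEMM path
os.environ["PYTORCH_TUNABLEOP_TUNING"] = "1"   # allow online tuning of new GEMM shapes

import torch

# Placeholder shapes -- substitute the GEMM sizes your model actually hits.
a = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
c = a @ b  # the first occurrence of a shape triggers tuning; later calls reuse the chosen solution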

robertgshaw2-neuralmagic (Author) commented Dec 4, 2024

@shajrawi

  • I understand that there are other options for FP8 GEMM on AMD. My goal is to assess bringing in FBGEMM since it supports fusing per-channel weight scales and per-token activation scales onto the GEMM. The key issue with torch._scaled_mm is that it has only supported per-tensor weight and activation scales, which means we do not have an option for fusing the epilogues. You can see the code here: https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/utils/w8a8_utils.py#L150. Dynamic per-token activations and channelwise weights are the recommended pathway for best accuracy, so this currently incurs a significant slowdown on MI300X for channelwise, dynamic per-token models. Perhaps the API for torch._scaled_mm has changed in recent torch versions. (A rough sketch of the difference is below this list.)

  • I understand that we can build vLLM with that image; my question was about how to build the FBGEMM FP8 kernels on ROCm. It seems that this is not publicly available yet, per @q10's note.
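
To make the gap concrete, a minimal sketch (not vLLM's actual code path) contrasting per-tensor torch._scaled_mm with the unfused fallback needed for per-token / per-channel scales. The shapes are placeholders, the FP8 dtype shown is the CUDA one (MI300X uses float8_e4m3fnuz), and whether newer torch versions accept row-wise scales directly in _scaled_mm is exactly the open question above:

import torch

M, K, N = 16, 256, 512  # _scaled_mm wants dimensions divisible by 16
a_fp8 = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)      # activations
b_fp8 = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fn).t()  # weights, column-major as _scaled_mm expects

# Per-tensor scales: scalar scale_a/scale_b fuse into the scaled_mm epilogue.
one = torch.tensor(1.0, device="cuda")
out = torch._scaled_mm(a_fp8, b_fp8, scale_a=one, scale_b=one, out_dtype=torch.bfloat16)

# Per-token activation / per-channel weight scales: without a fused epilogue,
# the scales have to be applied in a separate elementwise pass after the GEMM,
# which is the extra memory traffic on the channelwise + dynamic-per-token path.
scale_a_row = torch.rand(M, 1, device="cuda")  # one scale per token
scale_b_col = torch.rand(1, N, device="cuda")  # one scale per output channel
acc = torch._scaled_mm(a_fp8, b_fp8, scale_a=one, scale_b=one, out_dtype=torch.float32)
out_unfused = (acc * scale_a_row * scale_b_col).to(torch.bfloat16)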

@aazz44ss

@robertgshaw2-neuralmagic hipblaslt supports D = Activation(A x B * scaleA * scaleB * scaleAlphaVector + beta * C + bias), where scaleA and scaleB can be scalars or vectors.
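
To spell out what that epilogue computes, a plain-PyTorch reference of the same math (this is not the hipBLASLt API itself, just the arithmetic it fuses; the shapes and the relu activation are placeholders):

import torch

# Reference for D = Activation(A x B * scaleA * scaleB * scaleAlphaVector + beta * C + bias),
# with scaleA as a per-row (per-token) vector and scaleB as a per-column (per-channel) vector.
M, K, N = 8, 64, 32
A = torch.randn(M, K)
B = torch.randn(K, N)
C = torch.zeros(M, N)
bias = torch.randn(N)

scale_a = torch.rand(M, 1)           # per-row scale on A (could also be a scalar)
scale_b = torch.rand(1, N)           # per-column scale on B (could also be a scalar)
scale_alpha_vec = torch.ones(1, N)   # optional per-column alpha vector
beta = 0.0

D = torch.relu(A @ B * scale_a * scale_b * scale_alpha_vec + beta * C + bias)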

@robertgshaw2-neuralmagic (Author)

Great - can you point me to an example of this API?

@aazz44ss

hipBLASLt/clients/samples/15_gemm_scale_a_b_ext/sample_hipblaslt_gemm_with_scale_a_b_ext.cpp

@robertgshaw2-neuralmagic (Author)

Does this get exposed via PyTorch already or should I use it directly?

@aazz44ss

You have to use it directly.

tjtanaa commented Dec 22, 2024

@q10
I have tried enabling the GenAI compilation on gfx942 using the repo https://github.com/ROCm/FBGEMM (branch main).

I managed to get down to compiling the last 3 HIPCC objects.

  1. I am using rocm/rocm-terminal:6.2.1. In the docker container we need to run the following to install the libraries required to compile the code.
sudo apt update
sudo apt install amdgpu-dkms rocm
  2. Installed torch 2.6.0.
python3 -m pip install --pre \
                torch==2.6.0.dev20241113+rocm6.2 \
                'setuptools-scm>=8' \
                torchvision==0.20.0.dev20241113+rocm6.2 \
                --extra-index-url https://download.pytorch.org/whl/nightly/rocm6.2
  3. Followed the instructions in https://rocm.docs.amd.com/en/latest/how-to/llm-fine-tuning-optimization/model-acceleration-libraries.html#fbgemm-and-fbgemm-gpu

  4. Included composable-kernel (branch develop) in CMakeLists.txt and enabled the GenAI compilation step in FBGEMM/fbgemm_gpu/experimental/gen_ai/CMakeLists.txt

if(NOT FBGEMM_CPU_ONLY)
  # TODO: Re-enable gen_ai for ROCm once ck/tensor_operation/gpu/device/impl/device_gemm_multiple_d_xdl_cshuffle_v3_ab_scale.hpp
  # lands into latest ROCm
  add_subdirectory(experimental/gen_ai)
endif()
  5. I found that the signature ck::tensor_operation::device::GroupedGemmTileLoopKernelArguments has changed to ck::tensor_operation::device::GroupedGemmKernelArgument.

  6. I managed to get down to compiling the last 3 HIPCC objects.

  7. I am encountering an error caused by lld: error: undefined hidden symbol: unsigned short ck::atomic_add<unsigned short>(unsigned short*, unsigned short const&). Does this symbol require the internal composable-kernel version that you mentioned?

The error logs are as follows:

Target file: /app/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.12/cmake-build/fbgemm_gpu_tbe_training_backward.so
Resetting RPATH to $ORIGIN ...
################################################################################
[87/90] Building HIPCC object experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai_py.dir/src/quantize/ck_extensions/fp8_rowwise/kernels/fbgemm_gpu_experimental_gen_ai_py_generated_fp8_rowwise_64x16x16x256_16x16_1x1_16x4x1_16x4x1_1x4x1x16_4x4x1_1x1_intrawave_v1.hip.o
FAILED: experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai_py.dir/src/quantize/ck_extensions/fp8_rowwise/kernels/fbgemm_gpu_experimental_gen_ai_py_generated_fp8_rowwise_64x16x16x256_16x16_1x1_16x4x1_16x4x1_1x4x1x16_4x4x1_1x1_intrawave_v1.hip.o /app/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.12/cmake-build/experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai_py.dir/src/quantize/ck_extensions/fp8_rowwise/kernels/fbgemm_gpu_experimental_gen_ai_py_generated_fp8_rowwise_64x16x16x256_16x16_1x1_16x4x1_16x4x1_1x4x1x16_4x4x1_1x1_intrawave_v1.hip.o 
cd /app/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.12/cmake-build/experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai_py.dir/src/quantize/ck_extensions/fp8_rowwise/kernels && /root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/cmake/data/bin/cmake -E make_directory /app/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.12/cmake-build/experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai_py.dir/src/quantize/ck_extensions/fp8_rowwise/kernels/. && /root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/cmake/data/bin/cmake -D verbose:BOOL=OFF -D build_configuration:STRING=RELEASE -D generated_file:STRING=/app/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.12/cmake-build/experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai_py.dir/src/quantize/ck_extensions/fp8_rowwise/kernels/./fbgemm_gpu_experimental_gen_ai_py_generated_fp8_rowwise_64x16x16x256_16x16_1x1_16x4x1_16x4x1_1x4x1x16_4x4x1_1x1_intrawave_v1.hip.o -P /app/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.12/cmake-build/experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai_py.dir/src/quantize/ck_extensions/fp8_rowwise/kernels/fbgemm_gpu_experimental_gen_ai_py_generated_fp8_rowwise_64x16x16x256_16x16_1x1_16x4x1_16x4x1_1x4x1x16_4x4x1_1x1_intrawave_v1.hip.o.cmake
In file included from /app/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_64x16x16x256_16x16_1x1_16x4x1_16x4x1_1x4x1x16_4x4x1_1x1_intrawave_v1.hip:9:
In file included from /app/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_common.h:29:
In file included from /app/FBGEMM/fbgemm_gpu/composable_kernel/library/include/ck/library/reference_tensor_operation/cpu/reference_gemm.hpp:11:
/app/FBGEMM/fbgemm_gpu/composable_kernel/include/ck/library/utility/host_tensor.hpp:248:40: warning: implicit capture of 'this' with a capture default of '=' is deprecated [-Wdeprecated-this-capture]
  248 |                     call_f_unpack_args(mF, GetNdIndices(iw));
      |                                        ^
/app/FBGEMM/fbgemm_gpu/composable_kernel/include/ck/library/utility/host_tensor.hpp:245:23: note: add an explicit capture of 'this' to capture '*this' by reference
  245 |             auto f = [=] {
      |                       ^
      |                        , this
In file included from /app/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_64x16x16x256_16x16_1x1_16x4x1_16x4x1_1x4x1x16_4x4x1_1x1_intrawave_v1.hip:9:
In file included from /app/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_common.h:32:
/app/FBGEMM/fbgemm_gpu/composable_kernel/include/ck/library/utility/fill.hpp:99:18: warning: implicit capture of 'this' with a capture default of '=' is deprecated [-Wdeprecated-this-capture]
   99 |             n += step_;
      |                  ^
/app/FBGEMM/fbgemm_gpu/composable_kernel/include/ck/library/utility/fill.hpp:97:37: note: add an explicit capture of 'this' to capture '*this' by reference
   97 |         std::generate(first, last, [=, n = init_value_]() mutable {
      |                                     ^
      |                                      , this
2 warnings generated when compiling for gfx942.
lld: error: undefined hidden symbol: unsigned short ck::atomic_add<unsigned short>(unsigned short*, unsigned short const&)
>>> referenced by /tmp/fp8_rowwise_64x16x16x256_16x16_1x1_16x4x1_16x4x1_1x4x1x16_4x4x1_1x1_intrawave_v1-gfx942-62abe2.o:(_ZN2ck35kernel_gemm_xdl_cshuffle_v3_multi_dINS_34GridwiseGemmMultiD_xdl_cshuffle_v3INS_13tensor_layout4gemm8RowMajorENS3_11ColumnMajorENS_5TupleIJS4_S5_EEES4_DB8_S8_ffNS6_IJffEEEtNS_16tensor_operation12element_wise11PassThroughESC_12RowwiseScaleLNSA_6device18GemmSpecializationE7ELi64ELi16ELi16ELi256ELi16ELi16ELi16ELi16ELi1ELi1ENS_8SequenceIJLi16ELi4ELi1EEEENSG_IJLi1ELi0ELi2EEEESI_Li2ELi1ELi16ELb0ELi0ESH_SI_SI_Li2ELi1ELi16ELb0ELi0ELi1ELi1ENSG_IJLi1ELi16ELi1ELi4EEEENSG_IJLi1ELi1ELi1EEEELNS_26BlockGemmPipelineSchedulerE0ELNS_24BlockGemmPipelineVersionE0ES8_S8_S8_S8_EELb1ELNS_25InMemoryDataOperationEnumE1ELi1ELNS_10TailNumberE10EEEvNT_8ArgumentE)
>>> referenced by /tmp/fp8_rowwise_64x16x16x256_16x16_1x1_16x4x1_16x4x1_1x4x1x16_4x4x1_1x1_intrawave_v1-gfx942-62abe2.o:(_ZN2ck35kernel_gemm_xdl_cshuffle_v3_multi_dINS_34GridwiseGemmMultiD_xdl_cshuffle_v3INS_13tensor_layout4gemm8RowMajorENS3_11ColumnMajorENS_5TupleIJS4_S5_EEES4_DB8_S8_ffNS6_IJffEEEtNS_16tensor_operation12element_wise11PassThroughESC_12RowwiseScaleLNSA_6device18GemmSpecializationE7ELi64ELi16ELi16ELi256ELi16ELi16ELi16ELi16ELi1ELi1ENS_8SequenceIJLi16ELi4ELi1EEEENSG_IJLi1ELi0ELi2EEEESI_Li2ELi1ELi16ELb0ELi0ESH_SI_SI_Li2ELi1ELi16ELb0ELi0ELi1ELi1ENSG_IJLi1ELi16ELi1ELi4EEEENSG_IJLi1ELi1ELi1EEEELNS_26BlockGemmPipelineSchedulerE0ELNS_24BlockGemmPipelineVersionE0ES8_S8_S8_S8_EELb1ELNS_25InMemoryDataOperationEnumE1ELi1ELNS_10TailNumberE10EEEvNT_8ArgumentE)
>>> referenced by /tmp/fp8_rowwise_64x16x16x256_16x16_1x1_16x4x1_16x4x1_1x4x1x16_4x4x1_1x1_intrawave_v1-gfx942-62abe2.o:(_ZN2ck35kernel_gemm_xdl_cshuffle_v3_multi_dINS_34GridwiseGemmMultiD_xdl_cshuffle_v3INS_13tensor_layout4gemm8RowMajorENS3_11ColumnMajorENS_5TupleIJS4_S5_EEES4_DB8_S8_ffNS6_IJffEEEtNS_16tensor_operation12element_wise11PassThroughESC_12RowwiseScaleLNSA_6device18GemmSpecializationE7ELi64ELi16ELi16ELi256ELi16ELi16ELi16ELi16ELi1ELi1ENS_8SequenceIJLi16ELi4ELi1EEEENSG_IJLi1ELi0ELi2EEEESI_Li2ELi1ELi16ELb0ELi0ESH_SI_SI_Li2ELi1ELi16ELb0ELi0ELi1ELi1ENSG_IJLi1ELi16ELi1ELi4EEEENSG_IJLi1ELi1ELi1EEEELNS_26BlockGemmPipelineSchedulerE0ELNS_24BlockGemmPipelineVersionE0ES8_S8_S8_S8_EELb1ELNS_25InMemoryDataOperationEnumE1ELi1ELNS_10TailNumberE10EEEvNT_8ArgumentE)
>>> referenced 13 more times
clang++: error: amdgcn-link command failed with exit code 1 (use -v to see invocation)
failed to execute:/opt/rocm/llvm/bin/clang++  --offload-arch=gfx942 -O3  -c -x hip /app/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_64x16x16x256_16x16_1x1_16x4x1_16x4x1_1x4x1x16_4x4x1_1x1_intrawave_v1.hip -o "/app/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.12/cmake-build/experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai_py.dir/src/quantize/ck_extensions/fp8_rowwise/kernels/./fbgemm_gpu_experimental_gen_ai_py_generated_fp8_rowwise_64x16x16x256_16x16_1x1_16x4x1_16x4x1_1x4x1x16_4x4x1_1x1_intrawave_v1.hip.o" -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_BFLOAT16_CONVERSIONS__=1 -D__HIP_NO_HALF2_OPERATORS__=1 -DTORCH_USE_HIP_DSA -D_GLIBCXX_USE_CXX11_ABI=0 -mavx2 -mf16c -mfma -std=c++20 -fno-gpu-rdc -Wno-defaulted-function-deleted -Wno-\#pragma-messages -Wno-\#warnings -fclang-abi-compat=17 -Wno-cuda-compat -Wno-deprecated-declarations -Wno-format -Wno-ignored-attributes -Wno-unused-result -DNDEBUG -DUSE_ROCM -DUSE_DISTRIBUTED -DUSE_C10D_GLOO -DUSE_RPC -DUSE_TENSORPIPE -DUSE_C10D_NCCL -DUSE_PROF_API=1 -DUSE_DISTRIBUTED -DUSE_C10D_GLOO -DUSE_RPC -DUSE_TENSORPIPE -D__HIP_PLATFORM_AMD__=1 -D__HIP_PLATFORM_AMD__ -DUSE_PROF_API=1 -D__HIP_PLATFORM_AMD__=1 -D__HIP_PLATFORM_AMD__ -DUSE_DISTRIBUTED -DUSE_C10D_GLOO -DUSE_RPC -DUSE_TENSORPIPE -DUSE_C10D_NCCL -DUSE_PROF_API=1 -DUSE_DISTRIBUTED -DUSE_C10D_GLOO -DUSE_RPC -DUSE_TENSORPIPE -D__HIP_PLATFORM_AMD__=1 -D__HIP_PLATFORM_AMD__ -DUSE_PROF_API=1 -D__HIP_PLATFORM_AMD__=1 -D__HIP_PLATFORM_AMD__ -fPIC -I/opt/rocm/include -I/app/FBGEMM/fbgemm_gpu -I/opt/rocm/include -I/app/FBGEMM/fbgemm_gpu/../include -I/app/FBGEMM/fbgemm_gpu -I/app/FBGEMM/fbgemm_gpu/include -I/app/FBGEMM/fbgemm_gpu/../include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/app/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/app/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/app/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/app/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/app/FBGEMM/fbgemm_gpu/../external/json/include -I/app/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -I/app/FBGEMM/fbgemm_gpu/composable_kernel/include -I/app/FBGEMM/fbgemm_gpu/composable_kernel/library/include -I/app/FBGEMM/fbgemm_gpu/composable_kernel/include -I/app/FBGEMM/fbgemm_gpu/composable_kernel/library/include -I/opt/rocm/include -I/app/FBGEMM/fbgemm_gpu -I/opt/rocm/include -I/app/FBGEMM/fbgemm_gpu/../include -I/app/FBGEMM/fbgemm_gpu -I/app/FBGEMM/fbgemm_gpu/include -I/app/FBGEMM/fbgemm_gpu/../include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/app/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/app/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/app/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/app/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/app/FBGEMM/fbgemm_gpu/../external/json/include -I/app/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include 
-I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include -I/opt/rocm/include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include -I/opt/rocm-6.2.1/include -I/opt/rocm/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include/hiprand -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include -I/opt/rocm/include -I/opt/rocm-6.2.1/include -I/opt/rocm/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include/hiprand -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include -I/opt/rocm/include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include -I/opt/rocm-6.2.1/include -I/opt/rocm/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include/hiprand -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include -I/opt/rocm/include -I/opt/rocm-6.2.1/include -I/opt/rocm/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include/hiprand -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/app/FBGEMM/fbgemm_gpu/composable_kernel/include -I/app/FBGEMM/fbgemm_gpu/composable_kernel/library/include
CMake Error at fbgemm_gpu_experimental_gen_ai_py_generated_fp8_rowwise_64x16x16x256_16x16_1x1_16x4x1_16x4x1_1x4x1x16_4x4x1_1x1_intrawave_v1.hip.o.cmake:200 (message):
  Error generating file
  /app/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.12/cmake-build/experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai_py.dir/src/quantize/ck_extensions/fp8_rowwise/kernels/./fbgemm_gpu_experimental_gen_ai_py_generated_fp8_rowwise_64x16x16x256_16x16_1x1_16x4x1_16x4x1_1x4x1x16_4x4x1_1x1_intrawave_v1.hip.o


ninja: build stopped: subcommand failed.
[SETUP.PY] ARGV: ['setup.py', 'bdist_wheel', '--package_variant=rocm', '--python-tag=py312', '--plat-name=manylinux2014_x86_64', '-DHIP_ROOT_DIR=/opt/rocm', '-DCMAKE_C_FLAGS=-DTORCH_USE_HIP_DSA', '-DCMAKE_CXX_FLAGS=-DTORCH_USE_HIP_DSA']
[SETUP.PY] Parsed setup.py arguments: Namespace(verbose=False, debug=False, dryrun=False, package_variant='rocm', package_channel='nightly', nvml_lib_path=None, nccl_lib_path=None, use_fb_only=False, cxxprefix=None)
[SETUP.PY] Other arguments: ['bdist_wheel', '--python-tag=py312', '--plat-name=manylinux2014_x86_64', '-DHIP_ROOT_DIR=/opt/rocm', '-DCMAKE_C_FLAGS=-DTORCH_USE_HIP_DSA', '-DCMAKE_CXX_FLAGS=-DTORCH_USE_HIP_DSA']
[SETUP.PY] Determined the Python package name: 'fbgemm_gpu_nightly-rocm'
[SETUP.PY] Not running under Nova workflow context; ignoring variant_version
[SETUP.PY] Extracting the package version ...
[SETUP.PY] TAG: None, BRANCH: main, SHA: fbf3cd074a79ef2036b2c83176404e4330fd1266
[SETUP.PY] Package is for NIGHTLY; using timestamp for the versioning
[SETUP.PY] Setting the full package version string: 2024.12.22
[SETUP.PY] Not running under Nova workflow context; ignoring variant_version
[SETUP.PY] Extracting the package version ...
[SETUP.PY] TAG: None, BRANCH: main, SHA: fbf3cd074a79ef2036b2c83176404e4330fd1266
[SETUP.PY] Package is for NIGHTLY; using timestamp for the versioning
[SETUP.PY] Setting the full package version string: 2024.12.22
[SETUP.PY] Generating version file at: /app/FBGEMM/fbgemm_gpu/fbgemm_gpu/docs/version.py
[SETUP.PY] Passing CMake arguments: ['-DCMAKE_PREFIX_PATH=/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch', '-DGLIBCXX_USE_CXX11_ABI=0', "-DCMAKE_C_FLAGS=''", "-DCMAKE_CXX_FLAGS=''", '-DHIP_ROOT_DIR=/opt/rocm', '-DCMAKE_C_FLAGS=-DTORCH_USE_HIP_DSA', '-DCMAKE_CXX_FLAGS=-DTORCH_USE_HIP_DSA']
