
[Question] GenAI ROCm #3411

Open

robertgshaw2-neuralmagic opened this issue Nov 23, 2024 · 12 comments

robertgshaw2-neuralmagic commented Nov 23, 2024

Hello!

I work on the vllm-project. I worked with FBGEMM in the past for the Llama-405B launch in vLLM. As part of our 2025 roadmap planning, we are evaluating options for FP8 compute on ROCm. I noticed several PRs in the v1.0.0 release (https://github.com/pytorch/FBGEMM/releases/tag/v1.0.0) include FP8 GEMM support, and the documentation suggests ROCm is a target for the kernels. However, I also noticed that the ROCm build in the CMakeLists.txt (https://github.com/pytorch/FBGEMM/blob/main/fbgemm_gpu/CMakeLists.txt#L195) skips the genai compilation.

I also spent a few hours trying to get things to build for ROCm following https://pytorch.org/FBGEMM/fbgemm_gpu-development/BuildInstructions.html using rocm/rocm-terminal:6.2.0 and rocm/dev-ubuntu-22.04:6.2.2, but was unsuccessful.

I have a couple of questions:

  • Is there any reason why the genai codepath is currently skipped for ROCm that we should be aware of?
  • Is there a recommended build image that could be shared for ROCm 6.2?

Thanks!

q10 (Contributor) commented Nov 23, 2024

Hi @robertgshaw2-neuralmagic, we are currently working on enabling ROCm for the GenAI variant of the build. To my knowledge, the GenAI section of the codebase depends on code that exists in the develop branch of ROCm/CK and is not available in any public release yet.

I will work with the team to see if we can work around this limitation, otherwise we are blocked from making the code build on OSS until some of those changes in the develop branch land into a public release of CK.

@robertgshaw2-neuralmagic (Author)

I see. Thanks for the clear and quick response! I will check back in a few days.

@robertgshaw2-neuralmagic (Author)

Hey @q10 - I was wondering if you had any updates.

@amathews-amd
@shajrawi @gshtras @sunway513 for vLLM support on ROCm.

shajrawi commented Dec 4, 2024

For FP8 GEMM compute performance with Llama, my current recommendation is ROCm 6.3, because the math library (hipblaslt) has a lot of GEMM performance improvements, plus PyTorch nightly for the scaled_mm and tunable-ops improvements.

For the ROCm 6.2 question, I would recommend this Docker image, which builds a newer hipblaslt library + PyTorch: https://github.com/ROCm/vllm/blob/main/Dockerfile.rocm
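
For reference, a minimal sketch of turning on PyTorch's TunableOp path (which lets hipBLASLt select and tune GEMM solutions on ROCm); the environment variables are TunableOp's documented knobs, and the shapes are placeholders rather than a vLLM configuration:

import os

# TunableOp is read at runtime; setting these before importing torch is the
# safest way to make the tuned GEMM path active for every call.
os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"  # enable the tunable GEMM path
os.environ["PYTORCH_TUNABLEOP_TUNING"] = "1"   # allow online tuning of new GEMM shapes

import torch

# Placeholder shapes -- substitute the GEMM sizes your model actually hits.
a = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
c = a @ b  # the first occurrence of a shape triggers tuning; later calls reuse the chosen solution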

robertgshaw2-neuralmagic (Author) commented Dec 4, 2024

@shajrawi

  • I understand that there are other options for FP8 GEMM on AMD. My goal is to assess bringing in FBGEMM since it supports fusing per-channel weight scales and per-token activation scales onto the GEMM. The key issue with torch._scaled_mm is that it has only supported per-tensor weight and activation scales, which means we do not have an option for fusing the epilogues. You can see the code here: https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/utils/w8a8_utils.py#L150. Dynamic per-token activations and channelwise weights are the recommended pathway for best accuracy, so this currently incurs a significant slowdown on MI300X for channelwise, dynamic per-token models. Perhaps the API for torch._scaled_mm has changed in recent torch versions. (A rough sketch of the difference is below this list.)

  • I understand that we can build vLLM with that image; my question was about how to build the FBGEMM FP8 kernels on ROCm. It seems that this is not publicly available yet, per @q10's note.
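
To make the gap concrete, a minimal sketch (not vLLM's actual code path) contrasting per-tensor torch._scaled_mm with the unfused fallback needed for per-token / per-channel scales. The shapes are placeholders, the FP8 dtype shown is the CUDA one (MI300X uses float8_e4m3fnuz), and whether newer torch versions accept row-wise scales directly in _scaled_mm is exactly the open question above:

import torch

M, K, N = 16, 256, 512  # _scaled_mm wants dimensions divisible by 16
a_fp8 = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)      # activations
b_fp8 = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fn).t()  # weights, column-major as _scaled_mm expects

# Per-tensor scales: scalar scale_a/scale_b fuse into the scaled_mm epilogue.
one = torch.tensor(1.0, device="cuda")
out = torch._scaled_mm(a_fp8, b_fp8, scale_a=one, scale_b=one, out_dtype=torch.bfloat16)

# Per-token activation / per-channel weight scales: without a fused epilogue,
# the scales have to be applied in a separate elementwise pass after the GEMM,
# which is the extra memory traffic on the channelwise + dynamic-per-token path.
scale_a_row = torch.rand(M, 1, device="cuda")  # one scale per token
scale_b_col = torch.rand(1, N, device="cuda")  # one scale per output channel
acc = torch._scaled_mm(a_fp8, b_fp8, scale_a=one, scale_b=one, out_dtype=torch.float32)
out_unfused = (acc * scale_a_row * scale_b_col).to(torch.bfloat16)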

@aazz44ss

@robertgshaw2-neuralmagic hipblaslt supports D = Activation(A x B * scaleA * scaleB * scaleAlphaVector + beta * C + bias), where scaleA and scaleB can be scalars or vectors.
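
To spell out what that epilogue computes, a plain-PyTorch reference of the same math (this is not the hipBLASLt API itself, just the arithmetic it fuses; the shapes and the relu activation are placeholders):

import torch

# Reference for D = Activation(A x B * scaleA * scaleB * scaleAlphaVector + beta * C + bias),
# with scaleA as a per-row (per-token) vector and scaleB as a per-column (per-channel) vector.
M, K, N = 8, 64, 32
A = torch.randn(M, K)
B = torch.randn(K, N)
C = torch.zeros(M, N)
bias = torch.randn(N)

scale_a = torch.rand(M, 1)           # per-row scale on A (could also be a scalar)
scale_b = torch.rand(1, N)           # per-column scale on B (could also be a scalar)
scale_alpha_vec = torch.ones(1, N)   # optional per-column alpha vector
beta = 0.0

D = torch.relu(A @ B * scale_a * scale_b * scale_alpha_vec + beta * C + bias)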

@robertgshaw2-neuralmagic (Author)

Great - can you point me to an example of this API?

@aazz44ss

hipBLASLt/clients/samples/15_gemm_scale_a_b_ext/sample_hipblaslt_gemm_with_scale_a_b_ext.cpp

@robertgshaw2-neuralmagic (Author)

Does this get exposed via PyTorch already or should I use it directly?

@aazz44ss

You have to use it directly.

tjtanaa commented Dec 22, 2024

@q10
I have tried enabling the GenAI compilation on gfx942 using the repo https://github.com/ROCm/FBGEMM (branch main).

I managed to get down to compiling the last 3 HIPCC objects.

  1. I am using rocm/rocm-terminal:6.2.1. In the docker container we need to run the following to install the libraries required to compile the code.
sudo apt update
sudo apt install amdgpu-dkms rocm
  2. Installed torch 2.6.0.
python3 -m pip install --pre \
                torch==2.6.0.dev20241113+rocm6.2 \
                'setuptools-scm>=8' \
                torchvision==0.20.0.dev20241113+rocm6.2 \
                --extra-index-url https://download.pytorch.org/whl/nightly/rocm6.2
  3. Followed the instructions in https://rocm.docs.amd.com/en/latest/how-to/llm-fine-tuning-optimization/model-acceleration-libraries.html#fbgemm-and-fbgemm-gpu

  4. Included composable-kernel (branch develop) in CMakeLists.txt and enabled the GenAI compilation step in FBGEMM/fbgemm_gpu/experimental/gen_ai/CMakeLists.txt

if(NOT FBGEMM_CPU_ONLY)
  # TODO: Re-enable gen_ai for ROCm once ck/tensor_operation/gpu/device/impl/device_gemm_multiple_d_xdl_cshuffle_v3_ab_scale.hpp
  # lands into latest ROCm
  add_subdirectory(experimental/gen_ai)
endif()
  5. I found that the signature ck::tensor_operation::device::GroupedGemmTileLoopKernelArguments has changed to ck::tensor_operation::device::GroupedGemmKernelArgument.

  6. I managed to get down to compiling the last 3 HIPCC objects.

  7. I am encountering an error caused by lld: error: undefined hidden symbol: unsigned short ck::atomic_add<unsigned short>(unsigned short*, unsigned short const&). Does this symbol require the internal composable-kernel version that you mentioned?

The error logs are as follows:

Target file: /app/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.12/cmake-build/fbgemm_gpu_tbe_training_backward.so
Resetting RPATH to $ORIGIN ...
################################################################################
[87/90] Building HIPCC object experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai_py.dir/src/quantize/ck_extensions/fp8_rowwise/kernels/fbgemm_gpu_experimental_gen_ai_py_generated_fp8_rowwise_64x16x16x256_16x16_1x1_16x4x1_16x4x1_1x4x1x16_4x4x1_1x1_intrawave_v1.hip.o
FAILED: experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai_py.dir/src/quantize/ck_extensions/fp8_rowwise/kernels/fbgemm_gpu_experimental_gen_ai_py_generated_fp8_rowwise_64x16x16x256_16x16_1x1_16x4x1_16x4x1_1x4x1x16_4x4x1_1x1_intrawave_v1.hip.o /app/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.12/cmake-build/experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai_py.dir/src/quantize/ck_extensions/fp8_rowwise/kernels/fbgemm_gpu_experimental_gen_ai_py_generated_fp8_rowwise_64x16x16x256_16x16_1x1_16x4x1_16x4x1_1x4x1x16_4x4x1_1x1_intrawave_v1.hip.o 
cd /app/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.12/cmake-build/experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai_py.dir/src/quantize/ck_extensions/fp8_rowwise/kernels && /root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/cmake/data/bin/cmake -E make_directory /app/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.12/cmake-build/experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai_py.dir/src/quantize/ck_extensions/fp8_rowwise/kernels/. && /root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/cmake/data/bin/cmake -D verbose:BOOL=OFF -D build_configuration:STRING=RELEASE -D generated_file:STRING=/app/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.12/cmake-build/experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai_py.dir/src/quantize/ck_extensions/fp8_rowwise/kernels/./fbgemm_gpu_experimental_gen_ai_py_generated_fp8_rowwise_64x16x16x256_16x16_1x1_16x4x1_16x4x1_1x4x1x16_4x4x1_1x1_intrawave_v1.hip.o -P /app/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.12/cmake-build/experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai_py.dir/src/quantize/ck_extensions/fp8_rowwise/kernels/fbgemm_gpu_experimental_gen_ai_py_generated_fp8_rowwise_64x16x16x256_16x16_1x1_16x4x1_16x4x1_1x4x1x16_4x4x1_1x1_intrawave_v1.hip.o.cmake
In file included from /app/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_64x16x16x256_16x16_1x1_16x4x1_16x4x1_1x4x1x16_4x4x1_1x1_intrawave_v1.hip:9:
In file included from /app/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_common.h:29:
In file included from /app/FBGEMM/fbgemm_gpu/composable_kernel/library/include/ck/library/reference_tensor_operation/cpu/reference_gemm.hpp:11:
/app/FBGEMM/fbgemm_gpu/composable_kernel/include/ck/library/utility/host_tensor.hpp:248:40: warning: implicit capture of 'this' with a capture default of '=' is deprecated [-Wdeprecated-this-capture]
  248 |                     call_f_unpack_args(mF, GetNdIndices(iw));
      |                                        ^
/app/FBGEMM/fbgemm_gpu/composable_kernel/include/ck/library/utility/host_tensor.hpp:245:23: note: add an explicit capture of 'this' to capture '*this' by reference
  245 |             auto f = [=] {
      |                       ^
      |                        , this
In file included from /app/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_64x16x16x256_16x16_1x1_16x4x1_16x4x1_1x4x1x16_4x4x1_1x1_intrawave_v1.hip:9:
In file included from /app/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_common.h:32:
/app/FBGEMM/fbgemm_gpu/composable_kernel/include/ck/library/utility/fill.hpp:99:18: warning: implicit capture of 'this' with a capture default of '=' is deprecated [-Wdeprecated-this-capture]
   99 |             n += step_;
      |                  ^
/app/FBGEMM/fbgemm_gpu/composable_kernel/include/ck/library/utility/fill.hpp:97:37: note: add an explicit capture of 'this' to capture '*this' by reference
   97 |         std::generate(first, last, [=, n = init_value_]() mutable {
      |                                     ^
      |                                      , this
2 warnings generated when compiling for gfx942.
lld: error: undefined hidden symbol: unsigned short ck::atomic_add<unsigned short>(unsigned short*, unsigned short const&)
>>> referenced by /tmp/fp8_rowwise_64x16x16x256_16x16_1x1_16x4x1_16x4x1_1x4x1x16_4x4x1_1x1_intrawave_v1-gfx942-62abe2.o:(_ZN2ck35kernel_gemm_xdl_cshuffle_v3_multi_dINS_34GridwiseGemmMultiD_xdl_cshuffle_v3INS_13tensor_layout4gemm8RowMajorENS3_11ColumnMajorENS_5TupleIJS4_S5_EEES4_DB8_S8_ffNS6_IJffEEEtNS_16tensor_operation12element_wise11PassThroughESC_12RowwiseScaleLNSA_6device18GemmSpecializationE7ELi64ELi16ELi16ELi256ELi16ELi16ELi16ELi16ELi1ELi1ENS_8SequenceIJLi16ELi4ELi1EEEENSG_IJLi1ELi0ELi2EEEESI_Li2ELi1ELi16ELb0ELi0ESH_SI_SI_Li2ELi1ELi16ELb0ELi0ELi1ELi1ENSG_IJLi1ELi16ELi1ELi4EEEENSG_IJLi1ELi1ELi1EEEELNS_26BlockGemmPipelineSchedulerE0ELNS_24BlockGemmPipelineVersionE0ES8_S8_S8_S8_EELb1ELNS_25InMemoryDataOperationEnumE1ELi1ELNS_10TailNumberE10EEEvNT_8ArgumentE)
>>> referenced by /tmp/fp8_rowwise_64x16x16x256_16x16_1x1_16x4x1_16x4x1_1x4x1x16_4x4x1_1x1_intrawave_v1-gfx942-62abe2.o:(_ZN2ck35kernel_gemm_xdl_cshuffle_v3_multi_dINS_34GridwiseGemmMultiD_xdl_cshuffle_v3INS_13tensor_layout4gemm8RowMajorENS3_11ColumnMajorENS_5TupleIJS4_S5_EEES4_DB8_S8_ffNS6_IJffEEEtNS_16tensor_operation12element_wise11PassThroughESC_12RowwiseScaleLNSA_6device18GemmSpecializationE7ELi64ELi16ELi16ELi256ELi16ELi16ELi16ELi16ELi1ELi1ENS_8SequenceIJLi16ELi4ELi1EEEENSG_IJLi1ELi0ELi2EEEESI_Li2ELi1ELi16ELb0ELi0ESH_SI_SI_Li2ELi1ELi16ELb0ELi0ELi1ELi1ENSG_IJLi1ELi16ELi1ELi4EEEENSG_IJLi1ELi1ELi1EEEELNS_26BlockGemmPipelineSchedulerE0ELNS_24BlockGemmPipelineVersionE0ES8_S8_S8_S8_EELb1ELNS_25InMemoryDataOperationEnumE1ELi1ELNS_10TailNumberE10EEEvNT_8ArgumentE)
>>> referenced by /tmp/fp8_rowwise_64x16x16x256_16x16_1x1_16x4x1_16x4x1_1x4x1x16_4x4x1_1x1_intrawave_v1-gfx942-62abe2.o:(_ZN2ck35kernel_gemm_xdl_cshuffle_v3_multi_dINS_34GridwiseGemmMultiD_xdl_cshuffle_v3INS_13tensor_layout4gemm8RowMajorENS3_11ColumnMajorENS_5TupleIJS4_S5_EEES4_DB8_S8_ffNS6_IJffEEEtNS_16tensor_operation12element_wise11PassThroughESC_12RowwiseScaleLNSA_6device18GemmSpecializationE7ELi64ELi16ELi16ELi256ELi16ELi16ELi16ELi16ELi1ELi1ENS_8SequenceIJLi16ELi4ELi1EEEENSG_IJLi1ELi0ELi2EEEESI_Li2ELi1ELi16ELb0ELi0ESH_SI_SI_Li2ELi1ELi16ELb0ELi0ELi1ELi1ENSG_IJLi1ELi16ELi1ELi4EEEENSG_IJLi1ELi1ELi1EEEELNS_26BlockGemmPipelineSchedulerE0ELNS_24BlockGemmPipelineVersionE0ES8_S8_S8_S8_EELb1ELNS_25InMemoryDataOperationEnumE1ELi1ELNS_10TailNumberE10EEEvNT_8ArgumentE)
>>> referenced 13 more times
clang++: error: amdgcn-link command failed with exit code 1 (use -v to see invocation)
failed to execute:/opt/rocm/llvm/bin/clang++  --offload-arch=gfx942 -O3  -c -x hip /app/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_64x16x16x256_16x16_1x1_16x4x1_16x4x1_1x4x1x16_4x4x1_1x1_intrawave_v1.hip -o "/app/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.12/cmake-build/experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai_py.dir/src/quantize/ck_extensions/fp8_rowwise/kernels/./fbgemm_gpu_experimental_gen_ai_py_generated_fp8_rowwise_64x16x16x256_16x16_1x1_16x4x1_16x4x1_1x4x1x16_4x4x1_1x1_intrawave_v1.hip.o" -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_BFLOAT16_CONVERSIONS__=1 -D__HIP_NO_HALF2_OPERATORS__=1 -DTORCH_USE_HIP_DSA -D_GLIBCXX_USE_CXX11_ABI=0 -mavx2 -mf16c -mfma -std=c++20 -fno-gpu-rdc -Wno-defaulted-function-deleted -Wno-\#pragma-messages -Wno-\#warnings -fclang-abi-compat=17 -Wno-cuda-compat -Wno-deprecated-declarations -Wno-format -Wno-ignored-attributes -Wno-unused-result -DNDEBUG -DUSE_ROCM -DUSE_DISTRIBUTED -DUSE_C10D_GLOO -DUSE_RPC -DUSE_TENSORPIPE -DUSE_C10D_NCCL -DUSE_PROF_API=1 -DUSE_DISTRIBUTED -DUSE_C10D_GLOO -DUSE_RPC -DUSE_TENSORPIPE -D__HIP_PLATFORM_AMD__=1 -D__HIP_PLATFORM_AMD__ -DUSE_PROF_API=1 -D__HIP_PLATFORM_AMD__=1 -D__HIP_PLATFORM_AMD__ -DUSE_DISTRIBUTED -DUSE_C10D_GLOO -DUSE_RPC -DUSE_TENSORPIPE -DUSE_C10D_NCCL -DUSE_PROF_API=1 -DUSE_DISTRIBUTED -DUSE_C10D_GLOO -DUSE_RPC -DUSE_TENSORPIPE -D__HIP_PLATFORM_AMD__=1 -D__HIP_PLATFORM_AMD__ -DUSE_PROF_API=1 -D__HIP_PLATFORM_AMD__=1 -D__HIP_PLATFORM_AMD__ -fPIC -I/opt/rocm/include -I/app/FBGEMM/fbgemm_gpu -I/opt/rocm/include -I/app/FBGEMM/fbgemm_gpu/../include -I/app/FBGEMM/fbgemm_gpu -I/app/FBGEMM/fbgemm_gpu/include -I/app/FBGEMM/fbgemm_gpu/../include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/app/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/app/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/app/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/app/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/app/FBGEMM/fbgemm_gpu/../external/json/include -I/app/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -I/app/FBGEMM/fbgemm_gpu/composable_kernel/include -I/app/FBGEMM/fbgemm_gpu/composable_kernel/library/include -I/app/FBGEMM/fbgemm_gpu/composable_kernel/include -I/app/FBGEMM/fbgemm_gpu/composable_kernel/library/include -I/opt/rocm/include -I/app/FBGEMM/fbgemm_gpu -I/opt/rocm/include -I/app/FBGEMM/fbgemm_gpu/../include -I/app/FBGEMM/fbgemm_gpu -I/app/FBGEMM/fbgemm_gpu/include -I/app/FBGEMM/fbgemm_gpu/../include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/app/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/app/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/app/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/app/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/app/FBGEMM/fbgemm_gpu/../external/json/include -I/app/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include 
-I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include -I/opt/rocm/include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include -I/opt/rocm-6.2.1/include -I/opt/rocm/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include/hiprand -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include -I/opt/rocm/include -I/opt/rocm-6.2.1/include -I/opt/rocm/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include/hiprand -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include -I/opt/rocm/include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include -I/opt/rocm-6.2.1/include -I/opt/rocm/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include/hiprand -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch/include -I/opt/rocm/include -I/opt/rocm-6.2.1/include -I/opt/rocm/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include/hiprand -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/opt/rocm-6.2.1/include -I/app/FBGEMM/fbgemm_gpu/composable_kernel/include -I/app/FBGEMM/fbgemm_gpu/composable_kernel/library/include
CMake Error at fbgemm_gpu_experimental_gen_ai_py_generated_fp8_rowwise_64x16x16x256_16x16_1x1_16x4x1_16x4x1_1x4x1x16_4x4x1_1x1_intrawave_v1.hip.o.cmake:200 (message):
  Error generating file
  /app/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.12/cmake-build/experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai_py.dir/src/quantize/ck_extensions/fp8_rowwise/kernels/./fbgemm_gpu_experimental_gen_ai_py_generated_fp8_rowwise_64x16x16x256_16x16_1x1_16x4x1_16x4x1_1x4x1x16_4x4x1_1x1_intrawave_v1.hip.o


ninja: build stopped: subcommand failed.
[SETUP.PY] ARGV: ['setup.py', 'bdist_wheel', '--package_variant=rocm', '--python-tag=py312', '--plat-name=manylinux2014_x86_64', '-DHIP_ROOT_DIR=/opt/rocm', '-DCMAKE_C_FLAGS=-DTORCH_USE_HIP_DSA', '-DCMAKE_CXX_FLAGS=-DTORCH_USE_HIP_DSA']
[SETUP.PY] Parsed setup.py arguments: Namespace(verbose=False, debug=False, dryrun=False, package_variant='rocm', package_channel='nightly', nvml_lib_path=None, nccl_lib_path=None, use_fb_only=False, cxxprefix=None)
[SETUP.PY] Other arguments: ['bdist_wheel', '--python-tag=py312', '--plat-name=manylinux2014_x86_64', '-DHIP_ROOT_DIR=/opt/rocm', '-DCMAKE_C_FLAGS=-DTORCH_USE_HIP_DSA', '-DCMAKE_CXX_FLAGS=-DTORCH_USE_HIP_DSA']
[SETUP.PY] Determined the Python package name: 'fbgemm_gpu_nightly-rocm'
[SETUP.PY] Not running under Nova workflow context; ignoring variant_version
[SETUP.PY] Extracting the package version ...
[SETUP.PY] TAG: None, BRANCH: main, SHA: fbf3cd074a79ef2036b2c83176404e4330fd1266
[SETUP.PY] Package is for NIGHTLY; using timestamp for the versioning
[SETUP.PY] Setting the full package version string: 2024.12.22
[SETUP.PY] Not running under Nova workflow context; ignoring variant_version
[SETUP.PY] Extracting the package version ...
[SETUP.PY] TAG: None, BRANCH: main, SHA: fbf3cd074a79ef2036b2c83176404e4330fd1266
[SETUP.PY] Package is for NIGHTLY; using timestamp for the versioning
[SETUP.PY] Setting the full package version string: 2024.12.22
[SETUP.PY] Generating version file at: /app/FBGEMM/fbgemm_gpu/fbgemm_gpu/docs/version.py
[SETUP.PY] Passing CMake arguments: ['-DCMAKE_PREFIX_PATH=/root/miniconda3/envs/fbgemm/lib/python3.12/site-packages/torch', '-DGLIBCXX_USE_CXX11_ABI=0', "-DCMAKE_C_FLAGS=''", "-DCMAKE_CXX_FLAGS=''", '-DHIP_ROOT_DIR=/opt/rocm', '-DCMAKE_C_FLAGS=-DTORCH_USE_HIP_DSA', '-DCMAKE_CXX_FLAGS=-DTORCH_USE_HIP_DSA']
