
[ROCm] Unable to Run FPX Weights #967

Open
Beinsezii opened this issue Sep 28, 2024 · 4 comments

Comments

@Beinsezii

Beinsezii commented Sep 28, 2024

Compiling ao from source using pip install git+https://github.com/pytorch/ao.git results in a very fun throw

NotImplementedError: Could not run 'torchao::quant_llm_linear' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'torchao::quant_llm_linear' is only available for these backends: [Meta, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradMPS, AutogradXPU, AutogradHPU, AutogradLazy, AutogradMeta, Tracer, AutocastCPU, AutocastXPU, AutocastMPS, AutocastCUDA, FuncTorchBatched, BatchedNestedTensor, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PreDispatch, PythonDispatcher].

when running FPX weights using the script below

import torch
from diffusers.pipelines.stable_diffusion_xl.pipeline_stable_diffusion_xl import StableDiffusionXLPipeline
from torchao.quantization import fpx_weight_only, quantize_

@torch.no_grad()
def main():
    pipe = StableDiffusionXLPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda")
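    # fpx_weight_only(3, 2) -> FP6 with 3 exponent bits and 2 mantissa bits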
    quantize_(pipe.unet, fpx_weight_only(3, 2))
    pipe(
        prompt="high resolution dslr photograph of a kitten in a field of flowers",
        negative_prompt="blurry, noisy, cropped",
        num_inference_steps=20,
        guidance_scale=5,
        generator=torch.Generator("cuda").manual_seed(0),  # SDXL takes a generator rather than a seed kwarg
    ).images[0].save("fp6.png")

if __name__ == "__main__":
    main()

Setup is 1x 7900 XTX on torch 2.5+rocm6.2. All other quantizations work just fine, with the exception of float8_dynamic_activation_float8_weight, because gfx11 currently does not implement torch's _scaled_mm() function.

Using bfloat16 as the base dtype instead actually does run, but it's wicked slow from conversions. The floatx README states to use float16, so I assume that's the correct way.
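
For reference, the bfloat16 variant that does run is just the same script with the base dtype swapped out; a minimal sketch of the relevant change only:

import torch
from diffusers.pipelines.stable_diffusion_xl.pipeline_stable_diffusion_xl import StableDiffusionXLPipeline
from torchao.quantization import fpx_weight_only, quantize_

# Same repro as above, but with bfloat16 as the base dtype. This runs on the
# 7900 XTX, just very slowly, presumably because the dequantization falls back
# to plain PyTorch ops instead of the fused kernel.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.bfloat16
).to("cuda")
quantize_(pipe.unet, fpx_weight_only(3, 2))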

Python traceback
traceback.txt

@gau-nernst
Collaborator

FPx quantization is backed by a custom CUDA kernel, so it is not available to ROCm.

https://github.com/pytorch/ao/tree/main/torchao/csrc/cuda/fp6_llm

It's strange that it runs with bfloat16, though; perhaps it is slow precisely because that path doesn't use the CUDA kernel. I don't know ROCm well enough to say, but maybe it's not so hard to port the kernel to ROCm.
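
One quick way to confirm what actually got registered is to ask the dispatcher whether the op has a kernel for the CUDA key (ROCm builds of PyTorch also dispatch through the CUDA key). This sketch relies on a private PyTorch API, so treat it as best-effort:

import torch
import torchao  # importing torchao registers the custom ops if the C++ extension built

# True means a GPU kernel was registered for torchao::quant_llm_linear;
# False matches the NotImplementedError above.
print(torch._C._dispatch_has_kernel_for_dispatch_key("torchao::quant_llm_linear", "CUDA"))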

@Beinsezii
Author

It actually compiles something when I install from source; I see 5 threads light up. I thought torch used the hipify script to try to auto-convert C extensions? Usually, if something isn't supported by ROCm, it gets caught when the wheel builds. Additionally, the error is different between the source-compiled build and the pip wheel. I can fetch the pip version's error later, but it's a lot more boring, essentially just saying that the function doesn't exist.

@gau-nernst
Collaborator

Interesting. I don't know much about how PyTorch handles building for ROCm.

Can you run this script? https://github.com/pytorch/ao/blob/main/benchmarks/benchmark_fp6.py

It will help verify whether you can run the FPx kernel correctly.
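
If the full benchmark is awkward to run, a stripped-down smoke test along these lines exercises the same code path. This is a sketch assuming the to_scaled_tc_floatx and quant_llm_linear helpers that benchmark_fp6.py imports, so adjust the names if they have moved:

import torch
from torchao.dtypes.floatx import to_scaled_tc_floatx
from torchao.ops import quant_llm_linear

M, N, K = 4, 8192, 8192
fp16_act = torch.randn(M, K, dtype=torch.half, device="cuda")
fp16_weight = torch.randn(N, K, dtype=torch.half, device="cuda")

# Pack the fp16 weight into the tensor-core FP6 layout (3 exponent bits,
# 2 mantissa bits) plus per-row scales.
fp6_weight, scales = to_scaled_tc_floatx(fp16_weight, 3, 2)

# This call goes through the custom torchao::quant_llm_linear kernel.
out = quant_llm_linear(3, 2, fp16_act, fp6_weight, scales)
print(out.shape)  # expected: (M, N)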

@Beinsezii
Author

Beinsezii commented Sep 28, 2024

Same exact traceback as my original post.

The one example I know of a custom kernel working on both ROCm and CUDA is exllama. It uses torch cpp_extensions in ext.py, and the file list is a pretty good chunk of cpp/cu sources. Combing through the code, there's almost no HIP/ROCm-specific code, since the hipify script swaps out all references to libraries like cuBLAS for the ROCm equivalents.
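
A rough sketch of that pattern with torch.utils.cpp_extension (file names here are placeholders, not exllama's actual sources):

import torch
from torch.utils.cpp_extension import load

# On a ROCm build of PyTorch, load() runs hipify over the .cu sources and
# builds against the HIP equivalents automatically, so the same source list
# can work for both CUDA and ROCm.
ext = load(
    name="my_ext",                         # placeholder extension name
    sources=["binding.cpp", "kernel.cu"],  # placeholder source files
    verbose=True,
)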
