
SingleDecodeWithKVCache hits an illegal memory access when input tensors are placed on cuda:1 #452

Open
jason-huang03 opened this issue Aug 17, 2024 · 7 comments
Labels
bug Something isn't working

Comments

@jason-huang03

This is the example given in the repo:

import torch
import flashinfer

device_id = 1

kv_len = 2048
num_kv_heads = 32
head_dim = 128

k = torch.randn(kv_len, num_kv_heads, head_dim).half().to(device_id) 
v = torch.randn(kv_len, num_kv_heads, head_dim).half().to(device_id) 

# decode attention

num_qo_heads = 32
q = torch.randn(num_qo_heads, head_dim).half().to(device_id)

o = flashinfer.single_decode_with_kv_cache(q, k, v) # decode attention without RoPE on-the-fly
o_rope_on_the_fly = flashinfer.single_decode_with_kv_cache(q, k, v, pos_encoding_mode="ROPE_LLAMA") # decode with LLaMA style RoPE on-the-fly

# append attention
append_qo_len = 128
q = torch.randn(append_qo_len, num_qo_heads, head_dim).half().to(device_id) # append attention, the last 128 tokens in the KV-Cache are the new tokens
o = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=True) # append attention without RoPE on-the-fly, apply causal mask
o_rope_on_the_fly = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=True, pos_encoding_mode="ROPE_LLAMA") # append attention with LLaMA style RoPE on-the-fly, apply causal mask

# prefill attention
qo_len = 2048
q = torch.randn(qo_len, num_qo_heads, head_dim).half().to(device_id) # prefill attention
o = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=False) # prefill attention without RoPE on-the-fly, do not apply causal mask

When device_id=0, everything is fine. However, when device_id=1, the following error is thrown:

    out = _decode.single_decode_with_kv_cache(
RuntimeError: SingleDecodeWithKVCache kernel launch failed, error: an illegal memory access was encountered

I am using an A100 (SM 80). I thought this problem had been fixed in the commit related to #349, but I still hit it. I want to deploy a 70B model across multiple GPUs, so being able to run the kernel on different GPUs is really important. Can you see why this happens? Thanks a lot!
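
For reference, a workaround sketch I would try (untested; it assumes the kernel launch and the cudaFuncSetAttribute call target the current CUDA device, which defaults to cuda:0, rather than the device the tensors live on):

import torch
import flashinfer

device_id = 1
# Untested workaround: make cuda:1 the current device before calling the
# flashinfer kernels, so the launch happens on the same device as the tensors.
torch.cuda.set_device(device_id)

kv_len, num_kv_heads, num_qo_heads, head_dim = 2048, 32, 32, 128
q = torch.randn(num_qo_heads, head_dim, device=f"cuda:{device_id}").half()
k = torch.randn(kv_len, num_kv_heads, head_dim, device=f"cuda:{device_id}").half()
v = torch.randn(kv_len, num_kv_heads, head_dim, device=f"cuda:{device_id}").half()

o = flashinfer.single_decode_with_kv_cache(q, k, v)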

@yzh119
Collaborator

yzh119 commented Aug 17, 2024

Hi @jason-huang03, which version of flashinfer were you using? I believe the issue was fixed in 0.0.9.

I can't reproduce it with the latest version of flashinfer (v0.1.5).

@jason-huang03
Author

jason-huang03 commented Aug 18, 2024

I checked out v0.1.5 and rebuilt using pip install --no-cache-dir --force-reinstall -e . However, the problem persists. The full error message is:

CUDA Error: an illegal memory access was encountered (700) /mnt/huanghaofeng/flashinfer/python/include/flashinfer/attention/decode.cuh: line 658 at function cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size)
Traceback (most recent call last):
  File "/mnt/huanghaofeng/flashinfer/test.py", line 19, in <module>
    o_rope_on_the_fly = flashinfer.single_decode_with_kv_cache(q, k, v, pos_encoding_mode="ROPE_LLAMA") # decode with LLaMA style RoPE on-the-fly
  File "/mnt/huanghaofeng/flashinfer/python/flashinfer/decode.py", line 194, in single_decode_with_kv_cache
    out = _decode.single_decode_with_kv_cache(
RuntimeError: SingleDecodeWithKVCache kernel launch failed, error: an illegal memory access was encountered

You can see that the failure comes from cudaFuncSetAttribute.

I am using CUDA 11.8 and torch 2.2.0 in a containerized development environment. Could that be the problem?
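
For completeness, a quick sanity check of the environment from the Python side (standard torch calls only; the example output reflects my setup):

import torch

print(torch.__version__)             # 2.2.0
print(torch.version.cuda)            # 11.8
print(torch.cuda.device_count())     # should be >= 2 for this repro
print(torch.cuda.current_device())   # defaults to 0 unless set explicitly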

@jason-huang03
Author

Also, I find that device_id in the function SinglePrefillWithKVCacheDispatched in python/include/flashinfer/attention/prefill.cuh seems to be 0 regardless of the device_id set in the Python code.
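
A Python-side check that would be consistent with this (a sketch, assuming at least two visible GPUs): the current CUDA device stays at 0 even though the tensors live on cuda:1, so a launch keyed off the current device would go to device 0.

import torch

q = torch.randn(32, 128, device="cuda:1").half()
print(q.device)                     # cuda:1
print(q.device.index)               # 1
print(torch.cuda.current_device())  # 0 -- the current device is still cuda:0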

@yzh119
Collaborator

yzh119 commented Aug 18, 2024

@jason-huang03 would you mind checking the device id here?

@jason-huang03
Author

jason-huang03 commented Aug 18, 2024

Using std::cout, device.index() here prints as empty, but device is correct (e.g. cuda:1). I am now trying CUDA 12.4 and torch 2.4 to see whether that solves the problem.
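
For comparison, the Python-side analogue of an "empty" index (just to illustrate what is expected; the C++ device.index() behavior on CUDA 11.8 / torch 2.2.0 may differ):

import torch

print(torch.device("cuda:1").index)  # 1 -- expected when the device is cuda:1
print(torch.device("cuda").index)    # None -- a device without an explicit index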

@jason-huang03
Author

jason-huang03 commented Aug 18, 2024

After switching to PyTorch 2.4 and CUDA 12.4, the error disappears. Thanks for your time. It seems the device and device-index API changed between those CUDA or PyTorch versions.

@yzh119 yzh119 added the bug Something isn't working label Aug 18, 2024
@yzh119
Collaborator

yzh119 commented Aug 18, 2024

Thanks for reporting, I'll check the behavior on cu118 platforms.
