[FEAT REQ][CUDA GRAPH] Allow explicit control flag to force enable/disable split KV #397

Open
AgrawalAmey opened this issue Jul 26, 2024 · 2 comments

Comments

@AgrawalAmey

Hello @yzh119,

Currently, we use two independent API calls for prefill and decode in a mixed-batch setting, which makes defining a CUDA graph layout considerably harder. Ideally, we would compute both prefill and decode attention in the prefill kernel, which would simplify the CUDA graph layout considerably. The main barrier to doing this right now is that we have no explicit control over when split-KV is used. For mixed batches, split-KV appears to be beneficial in most cases, but it gets disabled for certain batch compositions, which significantly hurts latency. Would it be possible to add an optional override knob for this? Thanks!
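To make the request concrete, here is a minimal sketch of the kind of tri-state override being asked for. Everything in it is hypothetical: `should_split_kv`, `force_split_kv`, and the placeholder heuristic are illustrative names and logic, not FlashInfer's actual API or scheduler.

```python
from typing import Optional


def should_split_kv(
    num_ctas_without_split: int,
    num_sms: int,
    force_split_kv: Optional[bool] = None,
) -> bool:
    """Decide whether to use split-KV (hypothetical sketch, not FlashInfer code).

    force_split_kv=None  -> defer to the heuristic
    force_split_kv=True  -> always split
    force_split_kv=False -> never split
    """
    if force_split_kv is not None:
        # An explicit override wins, so the decision no longer depends
        # on batch composition.
        return force_split_kv
    # Placeholder heuristic (an assumption for illustration): split only when
    # the unsplit launch would leave some SMs without work.
    return num_ctas_without_split < num_sms
```

With the flag left at `None` the behavior is unchanged, while `True` or `False` pins the choice for every batch, which is the property that would help when the same layout has to be reused across batches.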

@yzh119
Collaborator

yzh119 commented Jul 26, 2024

Hi @AgrawalAmey, I actually found that our scheduler can be optimized further so that there is no wave quantization, and I'm working on a refactor for that. After this change, I suppose split-KV will always be enabled.
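For readers unfamiliar with the term, wave quantization is the loss that occurs when the number of thread blocks (CTAs) is not a multiple of the number the GPU can run at once, so the last wave is only partly full. A rough sketch of the arithmetic follows; `wave_efficiency` is a hypothetical helper, and it assumes every CTA takes the same time, which real attention workloads generally do not.

```python
import math


def wave_efficiency(num_ctas: int, num_sms: int, ctas_per_sm: int = 1) -> float:
    """Fraction of available CTA slots doing useful work across all waves.

    Illustrative only: assumes equal-duration CTAs and a fixed number of
    resident CTAs per SM.
    """
    if num_ctas <= 0 or num_sms <= 0 or ctas_per_sm <= 0:
        raise ValueError("all arguments must be positive")
    slots_per_wave = num_sms * ctas_per_sm
    # Number of waves needed; the last one may be only partially filled.
    waves = math.ceil(num_ctas / slots_per_wave)
    return num_ctas / (waves * slots_per_wave)
```

Under these assumptions, 108 CTAs on 108 SMs fill exactly one wave, while 109 CTAs need a second wave that runs a single CTA, so roughly half of the slots sit idle. A scheduler that sizes the work to avoid that partial wave removes the main reason a heuristic would want to turn split-KV off.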

@AgrawalAmey
Author

Oh, that is great! Looking forward to it, thank you @yzh119!
