[feat] int8 flash attention #952

Open
felipemello1 opened this issue Sep 26, 2024 · 2 comments
@felipemello1

felipemello1 commented Sep 26, 2024

Hi all, I saw this tweet and thought of sharing it. The accuracy degradation doesn't look too good, but maybe the speed makes it worth it?

https://x.com/papers_anon/status/1839131401322639805?s=46

To be clear: I am not requesting the feature, mostly just sharing it. Thanks! :)

felipemello1 changed the title from [new feat] int8 flash attention to [feat] int8 flash attention on Sep 26, 2024
@jcaip
Contributor

jcaip commented Sep 26, 2024

cc @cpuhrsch @HDCharles I think we could do this with flexattention? Flagging just so you are aware there's interest.

@cpuhrsch
Contributor

cpuhrsch commented Oct 1, 2024

@jcaip - Worth a try. Essentially you'd need to dequantize within the score mod (before the softmax), and the inputs would have to be quantized. I think at this point only query and key could be quantized, because the values need to be matmul'd with the result of the softmax.
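
For anyone curious what that could look like, here is a minimal sketch with flex_attention, assuming per-tensor symmetric int8 quantization of query and key (the `quantize_per_tensor_int8` helper and the shapes below are made up for illustration, not an existing torchao API). The score_mod hook rescales the raw scores by the product of the Q and K scales before the softmax, while V stays in the original dtype:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention


def quantize_per_tensor_int8(x: torch.Tensor):
    # Symmetric per-tensor quantization: one scale per tensor,
    # values rounded and clamped to the int8 range [-128, 127].
    scale = x.abs().amax() / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127)
    return q, scale


B, H, S, D = 2, 8, 1024, 64
q = torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16)
k = torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16)
v = torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16)

q_int, q_scale = quantize_per_tensor_int8(q)
k_int, k_scale = quantize_per_tensor_int8(k)


def dequant_score_mod(score, b, h, q_idx, kv_idx):
    # score is the dot product of the quantized q and k; multiplying by the
    # product of the per-tensor scales recovers an approximation of the
    # original fp score before the softmax is applied.
    return score * (q_scale * k_scale)


# flex_attention expects floating-point inputs, so the quantized integer
# values are carried in bf16 here; V stays unquantized because the softmax
# output has to be matmul'd against it.
out = flex_attention(q_int, k_int, v, score_mod=dequant_score_mod)
```

Note this only sketches the numerics: the speed win would have to come from an actual int8 Q·K kernel rather than carrying the quantized values in a floating-point dtype.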
