[feat] int8 flash attention #952

Open
felipemello1 opened this issue Sep 26, 2024 · 2 comments
@felipemello1

felipemello1 commented Sep 26, 2024

Hi all, I saw this tweet and thought of sharing it. The accuracy degradation doesn't look too good, but maybe the speed makes it worth it?

https://x.com/papers_anon/status/1839131401322639805?s=46

To be clear: I am not requesting the feature, mostly just sharing it. Thanks! :)

felipemello1 changed the title from [new feat] int8 flash attention to [feat] int8 flash attention on Sep 26, 2024
@jcaip
Contributor

jcaip commented Sep 26, 2024

cc @cpuhrsch @HDCharles I think we could do this with flexattention? Flagging just so you are aware there's interest.

@cpuhrsch
Contributor

cpuhrsch commented Oct 1, 2024

@jcaip - Worth a try. Essentially you'd need to dequantize within the score mod (before the softmax), and the inputs would have to be quantized. I think at this point only query and key could be quantized, because the values need to be matmul'd with the result of the softmax.
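
For anyone curious what that could look like, here is a minimal sketch with flex_attention, assuming per-tensor symmetric int8 quantization of query and key (the `quantize_per_tensor_int8` helper and the shapes below are made up for illustration, not an existing torchao API). The score_mod hook rescales the raw scores by the product of the Q and K scales before the softmax, while V stays in the original dtype:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention


def quantize_per_tensor_int8(x: torch.Tensor):
    # Symmetric per-tensor quantization: one scale per tensor,
    # values rounded and clamped to the int8 range [-128, 127].
    scale = x.abs().amax() / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127)
    return q, scale


B, H, S, D = 2, 8, 1024, 64
q = torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16)
k = torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16)
v = torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16)

q_int, q_scale = quantize_per_tensor_int8(q)
k_int, k_scale = quantize_per_tensor_int8(k)


def dequant_score_mod(score, b, h, q_idx, kv_idx):
    # score is the dot product of the quantized q and k; multiplying by the
    # product of the per-tensor scales recovers an approximation of the
    # original fp score before the softmax is applied.
    return score * (q_scale * k_scale)


# flex_attention expects floating-point inputs, so the quantized integer
# values are carried in bf16 here; V stays unquantized because the softmax
# output has to be matmul'd against it.
out = flex_attention(q_int, k_int, v, score_mod=dequant_score_mod)
```

Note this only sketches the numerics: the speed win would have to come from an actual int8 Q·K kernel rather than carrying the quantized values in a floating-point dtype.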
