
Support causal masking in FlashAttention #1947

Draft: wants to merge 2 commits into base branch perf_attn.

Conversation

@jopperm (Contributor) commented Aug 20, 2024

A proof-of-concept (PoC) extension of the advance path to handle causal masking in FlashAttention-2.
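The causal-masking semantics this PR lowers can be sketched in NumPy (this is only the reference semantics, not the generated kernel code):

```python
import numpy as np

def causal_attention_scores(q, k):
    """Reference semantics of causal masking: score[i, j] is set to
    -inf whenever key position j lies ahead of query position i, so
    softmax assigns it zero weight."""
    scores = q @ k.T                    # (seq_q, seq_k) raw scores
    i = np.arange(q.shape[0])[:, None]  # query positions (column vector)
    j = np.arange(k.shape[0])[None, :]  # key positions (row vector)
    return np.where(j <= i, scores, -np.inf)

q = np.ones((4, 8), dtype=np.float32)
k = np.ones((4, 8), dtype=np.float32)
s = causal_attention_scores(q, k)
# Entries above the diagonal (j > i) are -inf; the rest keep the raw scores.
```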

Summary of changes:

  • Support for `tt.make_range` throughout all passes of the advance path.
  • Extended `TritonToTritonGPUWarp` and `MatchTargetSize` to support two dependent attention `for`-loops.
  • Extended the lowering of `tt.broadcast` for row vectors: here we need to select and splat a single value per thread, so I introduced an op in TritonGEN for querying the lane ID.
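The per-thread select-and-splat step in the last bullet can be sketched in plain Python (the lane-ID parameter and subgroup width here are illustrative assumptions, not the actual TritonGEN op):

```python
# Sketch of lowering a row-vector tt.broadcast across a tile: each
# thread (lane) owns one column, selects the single element of the
# row vector that corresponds to its column, and splats it down the
# tile's rows.
SUBGROUP_SIZE = 16  # assumed SIMD/subgroup width

def broadcast_row_vector(row, num_rows, lane_id):
    """Each lane owns one column of a num_rows x SUBGROUP_SIZE tile.
    It selects row[lane_id] (this is why a lane-ID query is needed)
    and splats that value over its column."""
    value = row[lane_id]          # per-lane select
    return [value] * num_rows     # splat over the tile's rows

row = list(range(SUBGROUP_SIZE))  # the 1x16 row vector
col_for_lane3 = broadcast_row_vector(row, num_rows=4, lane_id=3)
# lane 3 ends up holding [3, 3, 3, 3]
```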

The generated code passes result verification against PyTorch (both causal=True and causal=False).

Remaining issues:

  • Enabling the schedule-load pass leads to invalid IR (an operation uses its own result).
  • No lit tests yet.

@jopperm jopperm marked this pull request as ready for review August 22, 2024 15:02
@jopperm jopperm self-assigned this Aug 22, 2024
@etiotto (Contributor) left a comment


As discussed please post a PR against the llvm-target branch for this feature.

@etiotto etiotto marked this pull request as draft August 26, 2024 19:37
whitneywhtsang pushed a commit that referenced this pull request Aug 30, 2024
Add support for distributing `tt.make_range` ops according to the
desired warp size. This is a PoC which assumes that multiple warps are
only needed along one dimension.

See #1947 for
more context / the complete PoC.

---------

Signed-off-by: Julian Oppermann <[email protected]>
whitneywhtsang pushed a commit that referenced this pull request Sep 4, 2024
…#2013)

Detect and handle flash attention with causal masking in
`-convert-triton-to-tritongpu-warp` by supporting `tt.make_range` and
*two* dependent attention-`for`-loops.

See #1947 for more context / the complete PoC.

---------

Signed-off-by: Julian Oppermann <[email protected]>
whitneywhtsang pushed a commit that referenced this pull request Sep 4, 2024
…2043)

Splits `make_range` into SG-sized subranges, and handles row-vector
broadcasts (e.g. `1x64 -> 16x64`) in `MatchTargetSize`.

See #1947 for
more context / the complete PoC.

---------

Signed-off-by: Julian Oppermann <[email protected]>
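The subrange splitting and row-vector broadcast described in the commit above can be illustrated in NumPy (the shapes come from the commit message; the subgroup size of 16 is an assumption):

```python
import numpy as np

SG_SIZE = 16  # assumed subgroup size

# Splitting a make_range result into SG-sized subranges: a 64-element
# range becomes 4 contiguous subranges of 16 lanes each.
full_range = np.arange(64)
subranges = full_range.reshape(-1, SG_SIZE)

# Row-vector broadcast 1x64 -> 16x64, as in the commit message:
# the single row is replicated over 16 rows.
row = np.arange(64).reshape(1, 64)
tile = np.broadcast_to(row, (16, 64))
```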
Dewei-Wang-sh pushed a commit that referenced this pull request Sep 9, 2024
Support canonicalization of dependent `scf.for` loops by re-gluing
individual results after the loop.

See #1947 for
more context / the complete PoC.

---------

Signed-off-by: Julian Oppermann <[email protected]>

Successfully merging this pull request may close these issues:

  • [#6 Attention Performance] extend attention support for Causal = True