[TT-TO-TTGWARP] Detect and handle flash attention with causal masking #2013
Conversation
Signed-off-by: Julian Oppermann <[email protected]>
@@ -383,6 +385,51 @@ class ConvertTritonToTritonGPUWarp
      }
      return WalkResult::advance();
    });

    if (loops.size() == 2 && workloads.front() == Workload::Attention &&
Why do there have to be exactly 2 loops?
Flash attention with a causal mask has 2 loops.
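For intuition, here is a minimal scalar sketch of that two-loop structure (illustrative only, not the actual Triton kernel; all names and sizes are made up). For a given query position, KV blocks strictly before its own block are fully visible under the causal mask, while the diagonal block has to mask out future positions:

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
  constexpr int SEQ = 8, BLOCK = 4;
  const int qIdx = 5; // query position, inside the second KV block
  std::vector<float> s(SEQ), v(SEQ); // toy scores q.k_j and values
  for (int j = 0; j < SEQ; ++j) { s[j] = 0.1f * j; v[j] = float(j); }

  // Streaming-softmax state, as in flash attention.
  float m = -INFINITY, l = 0.0f, acc = 0.0f;
  auto update = [&](float score, float value) {
    float mNew = std::max(m, score);
    float scale = std::exp(m - mNew);
    float p = std::exp(score - mNew);
    l = l * scale + p;
    acc = acc * scale + p * value;
    m = mNew;
  };

  int qBlockStart = (qIdx / BLOCK) * BLOCK;
  // Loop 1: KV blocks fully below the diagonal, no masking needed.
  for (int j = 0; j < qBlockStart; ++j)
    update(s[j], v[j]);
  // Loop 2: the diagonal KV block, future positions masked with -inf.
  for (int j = qBlockStart; j < qBlockStart + BLOCK; ++j)
    update(j <= qIdx ? s[j] : -INFINITY, v[j]);

  std::printf("output for query %d: %f\n", qIdx, acc / l);
}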
(Two review comments on third_party/intel/lib/TritonToTritonGPUWarp/TritonToTritonGPUWarpPass.cpp were marked outdated and resolved.)
    return;

    if (auto cst = dyn_cast<arith::ConstantOp>(op)) {
      transformArithConstantOp(cst, blockLayout);
Here we just assume the type without an encoding has blockLayout, right?
I have a local change that aims to cover the full propagation, which would make most cases hit the early return at L405.
Yes, this IR walk just patches previously unhandled ops, but it is completely specific to causal flash attention. I added a FIXME above to make that clearer. It would be great if we could drop this workaround soon.
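To make that concrete, here is a hedged sketch of the kind of patching such a walk does for constants: attaching blockLayout to a tensor type that has no encoding yet. The helper name and the exact rebuild steps are illustrative, not the actual transformArithConstantOp:

// Illustrative only: rebuild an un-encoded arith.constant so that its
// tensor type carries the block layout.
static void patchConstantSketch(mlir::arith::ConstantOp cst,
                                mlir::Attribute blockLayout) {
  auto ty = mlir::dyn_cast<mlir::RankedTensorType>(cst.getType());
  if (!ty || ty.getEncoding())
    return; // not a ranked tensor, or already encoded
  auto newTy = mlir::RankedTensorType::get(ty.getShape(),
                                           ty.getElementType(), blockLayout);
  // Re-type the dense value; the shape is unchanged and only the
  // encoding differs, so reshape is a no-op on the data.
  auto newVal =
      mlir::cast<mlir::DenseElementsAttr>(cst.getValue()).reshape(newTy);
  mlir::OpBuilder b(cst);
  auto newCst = b.create<mlir::arith::ConstantOp>(cst.getLoc(), newVal);
  cst.getResult().replaceAllUsesWith(newCst);
  cst.erase();
}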
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overall LGTM
Signed-off-by: Julian Oppermann <[email protected]>
Detect and handle flash attention with causal masking in -convert-triton-to-tritongpu-warp by supporting tt.make_range and two dependent attention for-loops. See #1947 for more context / the complete PoC.
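And a rough sketch of the tt.make_range part (again illustrative, reusing the op and blockLayout variables from the walk shown above): tt.make_range produces the 1-D offset tensor that the causal mask comparison is built from, so it needs a layout assigned like any other tensor-producing op.

// Illustrative only: give an un-encoded tt.make_range result the block
// layout, so later mask-building ops see a consistent encoding.
if (auto range = llvm::dyn_cast<mlir::triton::MakeRangeOp>(op)) {
  auto ty = mlir::cast<mlir::RankedTensorType>(range.getType());
  if (!ty.getEncoding())
    range.getResult().setType(mlir::RankedTensorType::get(
        ty.getShape(), ty.getElementType(), blockLayout));
}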