[Attention Performance] Flash Attention performance get to 80%~90% of XeTLA #773

Open
Dewei-Wang-sh opened this issue Mar 28, 2024 · 3 comments

Comments

@Dewei-Wang-sh self-assigned this on Mar 28, 2024
@Dewei-Wang-sh added this to the 04. Core performance milestone on Mar 28, 2024
@Dewei-Wang-sh changed the title from "[Attention(Forward) Performance] Attention with typical shape performance up to ~80% of pytorch" to "[Attention Performance] Attention(forward of typical shape) performance up to 80% of pytorch" on Mar 28, 2024
@Dewei-Wang-sh changed the title to "[Attention Performance] Attention(forward of typical shape) performance get to 80% of pytorch" on Mar 28, 2024
@tdeng5 changed the title to "[Attention Performance] Flash Attention v2.0 (forward of typical shape) performance get to 80% of XeTLA" on Apr 1, 2024
@vlad-penkin changed the title to "[Attention Performance] Flash Attention v2.0 (forward of typical shape) performance get to between 80%+ and 90% of XeTLA" on Apr 15, 2024
@vlad-penkin added the enhancement (New feature or request) label on Apr 17, 2024
@Dewei-Wang-sh changed the title to "[Attention Performance] Flash Attention v2.0 (1x2x1024x32) performance get to between 80%~90% of XeTLA" on Apr 18, 2024
@Dewei-Wang-sh changed the title to "[Attention Performance] Flash Attention v2.0 (1x2x1024x32) performance get to 80%~90% of XeTLA" on Apr 18, 2024
@Dewei-Wang-sh changed the title to "[Attention Performance] Flash Attention performance get to 80%~90% of XeTLA" on Apr 28, 2024

Dewei-Wang-sh commented May 13, 2024

For case fwd_4x48x1024x64_false (batch, num_head, n_ctx, dim_head, causal), it reaches 60% of XeTLA performance on GPU Max 1550.
The end-to-end run works with some hack code, but the result data mismatches.
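
For reference, a minimal sketch of how that case could be expressed as a standalone benchmark; the `xpu` device string and the use of `scaled_dot_product_attention` as a stand-in for the Triton flash-attention kernel are assumptions here, not the repo's actual harness:

```python
# Shape from this issue: fwd_4x48x1024x64_false -> (batch, num_head, n_ctx, dim_head), non-causal.
import torch
import triton

BATCH, N_HEAD, N_CTX, D_HEAD, CAUSAL = 4, 48, 1024, 64, False

q = torch.randn(BATCH, N_HEAD, N_CTX, D_HEAD, device="xpu", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

def fwd():
    # Stand-in for the Triton flash-attention forward kernel under test.
    return torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=CAUSAL)

ms = triton.testing.do_bench(fwd)
# Two N_CTX x N_CTX x D_HEAD matmuls per head (QK^T and PV), 2 FLOPs per MAC.
flops = 4 * BATCH * N_HEAD * N_CTX * N_CTX * D_HEAD
print(f"{ms:.3f} ms, {flops / (ms * 1e-3) / 1e12:.2f} TFLOP/s")
```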

Dewei-Wang-sh commented May 20, 2024

Fixed the data mismatch; perf data is now collected only from runs that flush the cache first.
For case fwd_4x48x1024x64_false (batch, num_head, n_ctx, dim_head, causal), it reaches 66% of XeTLA performance on GPU Max 1550.
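
The cache-flush step can be approximated as below; this is only a sketch, and the 256 MB buffer size and the `torch.xpu` calls are assumptions about the setup rather than the repo's actual benchmark code. `triton.testing.do_bench` applies the same idea by zeroing a large buffer between repetitions.

```python
# Measure latency with a cold cache: overwrite a large buffer before each run
# so q/k/v are no longer resident in cache, then time the kernel.
import time
import torch

cache = torch.empty(256 * 1024 * 1024, dtype=torch.int8, device="xpu")  # assumed flush size

def bench_cold_cache(fn, n_iters=100):
    times = []
    for _ in range(n_iters):
        cache.zero_()                      # evict previously touched data from the GPU caches
        torch.xpu.synchronize()
        t0 = time.perf_counter()
        fn()
        torch.xpu.synchronize()
        times.append(time.perf_counter() - t0)
    return 1e3 * sum(times) / len(times)   # mean latency in ms
```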

Dewei-Wang-sh commented:

need #1102 to close this umbrella issue.
