[Attention Performance] Flash Attention performance get to 80%~90% of XeTLA #773

Open
Dewei-Wang-sh opened this issue Mar 28, 2024 · 3 comments

Comments

@Dewei-Wang-sh self-assigned this on Mar 28, 2024
@Dewei-Wang-sh added this to the 04. Core performance milestone on Mar 28, 2024
@Dewei-Wang-sh changed the title from "[Attention(Forward) Performance] Attention with typical shape performance up to ~80% of pytorch" to "[Attention Performance] Attention(forward of typical shape) performance up to 80% of pytorch" on Mar 28, 2024
@Dewei-Wang-sh changed the title to "[Attention Performance] Attention(forward of typical shape) performance get to 80% of pytorch" on Mar 28, 2024
@tdeng5 changed the title to "[Attention Performance] Flash Attention v2.0 (forward of typical shape) performance get to 80% of XeTLA" on Apr 1, 2024
@vlad-penkin changed the title to "[Attention Performance] Flash Attention v2.0 (forward of typical shape) performance get to between 80%+ and 90% of XeTLA" on Apr 15, 2024
@vlad-penkin added the enhancement (New feature or request) label on Apr 17, 2024
@Dewei-Wang-sh changed the title to "[Attention Performance] Flash Attention v2.0 (1x2x1024x32) performance get to between 80%~90% of XeTLA" on Apr 18, 2024
@Dewei-Wang-sh changed the title to "[Attention Performance] Flash Attention v2.0 (1x2x1024x32) performance get to 80%~90% of XeTLA" on Apr 18, 2024
@Dewei-Wang-sh changed the title to "[Attention Performance] Flash Attention performance get to 80%~90% of XeTLA" on Apr 28, 2024

Dewei-Wang-sh commented May 13, 2024

For case fwd_4x48x1024x64_false (batch, num_head, n_ctx, dim_head, causal), it reaches 60% of XeTLA performance on GPU Max 1550.
The end-to-end run works with some hack code, but the result data mismatches.
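
For reference, a minimal sketch of how that case could be expressed as a standalone benchmark; the `xpu` device string and the use of `scaled_dot_product_attention` as a stand-in for the Triton flash-attention kernel are assumptions here, not the repo's actual harness:

```python
# Shape from this issue: fwd_4x48x1024x64_false -> (batch, num_head, n_ctx, dim_head), non-causal.
import torch
import triton

BATCH, N_HEAD, N_CTX, D_HEAD, CAUSAL = 4, 48, 1024, 64, False

q = torch.randn(BATCH, N_HEAD, N_CTX, D_HEAD, device="xpu", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

def fwd():
    # Stand-in for the Triton flash-attention forward kernel under test.
    return torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=CAUSAL)

ms = triton.testing.do_bench(fwd)
# Two N_CTX x N_CTX x D_HEAD matmuls per head (QK^T and PV), 2 FLOPs per MAC.
flops = 4 * BATCH * N_HEAD * N_CTX * N_CTX * D_HEAD
print(f"{ms:.3f} ms, {flops / (ms * 1e-3) / 1e12:.2f} TFLOP/s")
```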

Dewei-Wang-sh commented May 20, 2024

Fixed the data mismatch; perf data is now collected only from runs that flush the cache first.
For case fwd_4x48x1024x64_false (batch, num_head, n_ctx, dim_head, causal), it reaches 66% of XeTLA performance on GPU Max 1550.
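
The cache-flush step can be approximated as below; this is only a sketch, and the 256 MB buffer size and the `torch.xpu` calls are assumptions about the setup rather than the repo's actual benchmark code. `triton.testing.do_bench` applies the same idea by zeroing a large buffer between repetitions.

```python
# Measure latency with a cold cache: overwrite a large buffer before each run
# so q/k/v are no longer resident in cache, then time the kernel.
import time
import torch

cache = torch.empty(256 * 1024 * 1024, dtype=torch.int8, device="xpu")  # assumed flush size

def bench_cold_cache(fn, n_iters=100):
    times = []
    for _ in range(n_iters):
        cache.zero_()                      # evict previously touched data from the GPU caches
        torch.xpu.synchronize()
        t0 = time.perf_counter()
        fn()
        torch.xpu.synchronize()
        times.append(time.perf_counter() - t0)
    return 1e3 * sum(times) / len(times)   # mean latency in ms
```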

Dewei-Wang-sh commented:

need #1102 to close this umbrella issue.
