
Is FlashAttention really used when using HuggingFaceModel as one of the supported ComposerModel types? #2564

Open
harishankar-gopalan opened this issue Sep 25, 2023 · 3 comments
Labels
enhancement New (engineering) enhancements, such as features or API changes.

Comments

@harishankar-gopalan

Given that, from PyTorch 2.0 onward, the dynamic dispatch to FlashAttention happens when the required conditions are satisfied, I cannot find a way to verify whether FlashAttention is actually used by default. Also, because the general GPT recipes depend on HuggingFace models, which do not seem to use PyTorch's F.scaled_dot_product_attention, I am wondering whether FlashAttention is really used when training with Composer. Any ideas on how to easily enable FlashAttention when using an HF model together with Composer?
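For reference, here is a minimal sketch (tensor shapes, dtype, and sequence length are illustrative) of how one can check that PyTorch's SDPA dispatcher is able to pick the FlashAttention kernel for a given set of inputs, using the torch.backends.cuda.sdp_kernel context manager available in PyTorch 2.0/2.1:

```python
# Minimal check: can SDPA dispatch to FlashAttention for these inputs?
import torch
import torch.nn.functional as F

# Illustrative shapes: (batch, heads, seq_len, head_dim) in fp16 on GPU,
# which is the kind of input FlashAttention can handle.
q = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)

# Restrict dispatch to the FlashAttention backend only. If the inputs do not
# meet FlashAttention's constraints (dtype, head dim, masking), this raises a
# RuntimeError instead of silently falling back to the math kernel.
with torch.backends.cuda.sdp_kernel(
    enable_flash=True, enable_math=False, enable_mem_efficient=False
):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print("Dispatched to the FlashAttention backend")
```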

@harishankar-gopalan added the enhancement label Sep 25, 2023
@snarayan21
Contributor

Hey, we'd recommend that you use our llm-foundry repo, which uses composer extensively and also supports using HF models. Check it out here!

@harishankar-gopalan
Author

harishankar-gopalan commented Sep 26, 2023

> Hey, we'd recommend that you use our llm-foundry repo, which uses composer extensively and also supports using HF models. Check it out here!

Hi @snarayan21, thanks for the response. This however does not answer my original question. Even in llm-foundry, when HuggingFace models are used for the recipes, I do not see any functionality that ensures the attention computation goes through PyTorch's F.scaled_dot_product_attention, which is what dispatches to FlashAttention or memory-efficient attention when the current model parameters allow it. Any insights into this?
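For context, something like the following is roughly what I would like to be able to do. This is only a sketch: it assumes a transformers release that accepts the attn_implementation argument (e.g. "sdpa" or "flash_attention_2"), and the checkpoint name is just a placeholder.

```python
# Hedged sketch: load an HF model so its attention goes through
# F.scaled_dot_product_attention, then wrap it for Composer.
# Assumes a transformers version that supports attn_implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from composer.models import HuggingFaceModel

name = "gpt2"  # placeholder checkpoint, for illustration only
tokenizer = AutoTokenizer.from_pretrained(name)
hf_model = AutoModelForCausalLM.from_pretrained(
    name,
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",  # route attention through F.scaled_dot_product_attention
)

# Composer wraps the HF model as-is and does not change the attention
# implementation selected above.
composer_model = HuggingFaceModel(hf_model, tokenizer=tokenizer, use_logits=True)
```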

@snarayan21
Contributor

Hey, so there are three cases you'll run into when using llm-foundry:

First, using an MPT model. This has configurable attention and supports FlashAttention (see the sketch at the end of this comment).
Second, using a Llama model. There is an option to patch in FlashAttention, as configured in llm-foundry.
Third, using a HuggingFace model. Foundry will use whatever attention implementation the underlying HuggingFace model uses.

You can see our attention implementations in foundry in this folder. Hope this helps!
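For the MPT case, the attention implementation is selected through the model's attn_config. A minimal sketch following the pattern documented on the MPT model cards (checkpoint name and dtype are illustrative):

```python
# Hedged sketch: select MPT's attention implementation via attn_config,
# following the pattern from the MPT model cards.
import torch
import transformers

name = "mosaicml/mpt-7b"  # illustrative checkpoint
config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.attn_config["attn_impl"] = "triton"  # or "flash" / "torch", depending on what is installed

model = transformers.AutoModelForCausalLM.from_pretrained(
    name,
    config=config,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
```

In the llm-foundry training YAMLs the same knob is exposed under the model section's attn_config, if I recall correctly, so you can set it there instead of in Python.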

@snarayan21 snarayan21 reopened this Sep 27, 2023