[FP8 options] Float8Linear vs TransformerEngine #462
Comments
@weifengpy @awgu
Good question. torchtitan + float8_experimental (or TorchAO for new dtypes in general) is where we showcase that everything composes well together (fp8, parallelism, torch.compile, activation checkpointing) using PyTorch APIs. We plan to benchmark the perf improvement of fp8 over bf16 so it can be compared against TE, but we do not have specific numbers yet. TE is more like a partner/customer of ours; we welcome them adopting PyTorch APIs to better fit their needs.
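To make the composability point concrete, here is a minimal sketch of what that stack can look like in user code. It assumes the `swap_linear_with_float8_linear` / `Float8Linear` entry points from float8_experimental (the exact names may differ across versions) and Hopper-class hardware; treat it as an illustration rather than the exact torchtitan integration.

```python
# Sketch only: assumes float8_experimental's swap helper and Float8Linear
# (entry points may differ by version) and an H100-class GPU for fp8 gemms.
import torch
import torch.nn as nn
from float8_experimental.float8_linear import Float8Linear
from float8_experimental.float8_linear_utils import swap_linear_with_float8_linear

model = nn.Sequential(
    nn.Linear(4096, 4096),
    nn.GELU(),
    nn.Linear(4096, 4096),
).to(torch.bfloat16).cuda()

# Swap eligible nn.Linear modules for Float8Linear so the matmuls run in fp8.
swap_linear_with_float8_linear(model, Float8Linear)

# In a real job, FSDP wrapping and activation checkpointing would be applied
# here; both compose with the swapped modules through standard PyTorch APIs.

# torch.compile fuses the scaling/casting overhead around the fp8 gemms.
model = torch.compile(model)

x = torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16, requires_grad=True)
model(x).sum().backward()
```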
@yundai424 , I can speak to PyTorch's float8 modeling plans but can't comment on the other things you asked about. From the POV of float8_experimental, we care about performance, composability with key PyTorch systems (autograd, distributed, compile), debuggability, and readability. Please feel free to file issues in https://github.com/pytorch-labs/float8_experimental/tree/main if you have more specific questions and we will be happy to help.
Any updates on benchmarking a well-tuned e2e Float8Linear vs TE comparison? Also, are there any examples of running such a comparison?
This isn't something the PyTorch team is likely to publish in the near term, but we definitely welcome benchmarks from the community on this topic.
I think that would be really nice! It also isn't something the PyTorch team is likely to focus on, but would be great if someone from the community drove this and shared their findings. From what I know, getting a meaningful performance boost
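For anyone in the community who wants to drive such a benchmark, a barebones per-step timing harness is usually enough to get started. The `bf16_model` / `fp8_model` / `batch` names below are placeholders for whichever two variants are being compared, not existing torchtitan objects.

```python
# Illustrative harness only: bf16_model / fp8_model / batch are placeholders
# (e.g. a bf16 baseline vs the same network after an fp8 linear swap or a
# TransformerEngine integration).
import torch

def bench_ms(step_fn, iters=50, warmup=10):
    """Average milliseconds per call, measured with CUDA events."""
    for _ in range(warmup):
        step_fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        step_fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

def train_step(model, batch):
    model(batch).sum().backward()

# print(bench_ms(lambda: train_step(bf16_model, batch)))
# print(bench_ms(lambda: train_step(fp8_model, batch)))
```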
If possible, can you speak to whether running torch.compile on top of TE would provide an additional benefit, and whether there are any examples of that combination?
Thanks!
TE has handwritten kernels for the important float8 fusions, which is why running torch.compile on TE would have a limited benefit.
I'm not aware of any, but it would be great if someone helped out with this.
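For context on how TE's handwritten-kernel path is driven from user code, the usual pattern (per TE's public docs; details such as the recipe arguments can vary between releases) is an fp8 autocast around TE modules rather than torch.compile:

```python
# Sketch of the TransformerEngine eager path (not torchtitan code); argument
# names follow TE's documented API but may differ between TE releases.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID, amax_history_len=16)

layer = te.Linear(4096, 4096, bias=True).cuda()
x = torch.randn(16, 4096, device="cuda", requires_grad=True)

# TE dispatches to its own fused fp8 kernels inside this context, which is
# why layering torch.compile on top adds comparatively little.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)
y.sum().backward()
```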
Hi team, first of all, thanks for this great repo showcasing how to leverage the latest techniques in the torch ecosystem; it's been super useful and insightful :) I have a naive question about FP8 options and would like to know more about how you view them.
There's https://github.com/NVIDIA/TransformerEngine from NVIDIA for fp8 training on Hopper, and it has started to be integrated into downstream frameworks like HF, Lightning, etc. However, I'm also seeing https://github.com/pytorch-labs/float8_experimental evolving quickly, and the fact that it's more lightweight and potentially more composable with the rest of the torch stack is also important to us. I'm wondering if you have some insight into the pros and cons of each, how Float8Linear's performance compares to TE, and whether you would recommend going with TE or Float8Linear for LLM pretraining/finetuning use cases. Thanks a lot!