[FP8 options] Float8Linear vs TransformerEngine #462

yundai424 · 2024-07-16T03:54:29Z

Hi team, first of all thanks for this great repo for showcasing how to leverage the latest techniques in torch ecosystem, it's been super useful and insightful :) I have a naive question about FP8 options and would like to know more about how you view it.

There's the https://github.com/NVIDIA/TransformerEngine by nvidia for fp8 training on hopper and it's started to be integrated into downstream frameworks like HF, lightning etc. However I'm also seeing https://github.com/pytorch-labs/float8_experimental evolving quickly and the fact that it's more lightweight & potentially more composable w/ remaining torch techniques is also important to us. I'm wondering if you have some insight about the pros and cons of each of them, how would Float8Linear's performance compare to TE, and if you would recommend going with TE or Float8Linear for LLM pretraining/finetuning use cases. Thanks a lot!

tianyu-l · 2024-07-16T04:13:29Z

@weifengpy @awgu
Any insights or thoughts to share?

weifengpy · 2024-07-16T18:41:05Z

good question. torchtitan + float8_experimental (or TorchAO for new dtypes in general) is a place where we showcase everything compose well together (fp8, parallalsim, torch.compile, activation checkpointing) using pytorch APIs. We have plans to benchmark perf improvement of fp8 over bf16

for perf comparison with TE, we do not have specific numbers yet. TE is more like our parterners/customers. We welcome adoption of pytorch APIs to fit their needs better

vkuzo · 2024-07-16T20:03:32Z

@yundai424 , I can speak to PyTorch's float8 modeling plans and can't comment on other things you asked about. From the POV of float8_experimental, we care about performance, composability with key PyTorch systems (autograd, distributed, compile), debuggability and readability. Please feel free to file issues in https://github.com/pytorch-labs/float8_experimental/tree/main if you have more specific questions and we will be happy to help.

jeromeku · 2024-08-12T12:29:50Z

@vkuzo @awgu

Any updates on benchmarking well-tuned e2e torchtitan / float8_experimental against a comparable TransformerEngine implementation?

Also, are there any examples of running torchtitan with all the composability benefits (e.g., seamless integration with torch.compile, FSDP2, etc.) but with TransformerEngine instead of float8_experimental?

vkuzo · 2024-08-12T14:42:50Z

Any updates on benchmarking well-tuned e2e torchtitan / float8_experimental against a comparable TransformerEngine implementation?

This isn't something the PyTorch team is likely to publish in the near term, but we definitely welcome benchmarks from the community on this topic.

Also, are there any examples of running torchtitan with all the composability benefits (e.g., seamless integration with torch.compile, FSDP2, etc.) but with TransformerEngine instead of float8_experimental?

I think that would be really nice! It also isn't something the PyTorch team is likely to focus on, but would be great if someone from the community drove this and shared their findings. From what I know, getting a meaningful performance boost
from torch.compile + TE might be difficult because TE extensively uses custom CUDA kernels.

jeromeku · 2024-08-12T15:05:53Z

@vkuzo @awgu

If possible, can you speak to whether:

registering TE kernels as custom operators + torch.compile with torchtitan would result in performance degradation vs a purely native torch workflow (float8_experimental, etc.)?
are you aware of any benchmarks from the community on how various fp8 implementations compose with torch (torch.compile, torchtitan, FSDP2, etc.)?

Thanks!

vkuzo · 2024-08-12T17:33:23Z

registering TE kernels as custom operators + torch.compile with torchtitan would result in performance degradation vs a purely native torch workflow (float8_experimental, etc.)?

TE has handwritten kernels for the important float8 fusions, which is why running torch.compile on TE would have a limited benefit. torchao.float8 does not ship with any handwritten kernels, so a compiler is required to recover performance.

are you aware of any benchmarks from the community on how various fp8 implementations compose with torch (torch.compile, torchtitan, FSDP2, etc.)?

I'm not aware of any, but it would be great if someone helped out with this.

tianyu-l added the question Further information is requested label Jul 16, 2024

This was referenced Aug 22, 2024

Question: How to use Float8InferenceLinear with FSDP1/2? pytorch/ao#704

Open

[Feature Request] Fused fp8 matmul kernel (quant + dequant + matmul) pytorch/ao#752

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FP8 options] Float8Linear vs TransformerEngine #462

[FP8 options] Float8Linear vs TransformerEngine #462

yundai424 commented Jul 16, 2024 •

edited

Loading

tianyu-l commented Jul 16, 2024

weifengpy commented Jul 16, 2024

vkuzo commented Jul 16, 2024

jeromeku commented Aug 12, 2024

vkuzo commented Aug 12, 2024

jeromeku commented Aug 12, 2024

vkuzo commented Aug 12, 2024

[FP8 options] Float8Linear vs TransformerEngine #462

[FP8 options] Float8Linear vs TransformerEngine #462

Comments

yundai424 commented Jul 16, 2024 • edited Loading

tianyu-l commented Jul 16, 2024

weifengpy commented Jul 16, 2024

vkuzo commented Jul 16, 2024

jeromeku commented Aug 12, 2024

vkuzo commented Aug 12, 2024

jeromeku commented Aug 12, 2024

vkuzo commented Aug 12, 2024

yundai424 commented Jul 16, 2024 •

edited

Loading