
Low bit Optimizers & FA-3 #742

Open
asahni04 opened this issue Dec 16, 2024 · 4 comments

@asahni04

Hi, have there been any tests with FA-3 and the low-bit optimizers from torchao, such as FP8 Adam or 8-bit Adam? I see divergence in training when resuming an FA-2 checkpoint with FA-3, or when using 8-bit AdamW.
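For context, the torchao low-bit optimizers referred to here are intended as drop-in replacements for torch.optim.AdamW. A minimal sketch, assuming the torchao.prototype.low_bit_optim module path from late-2024 releases (class names and paths may differ in other versions):

```python
# Minimal sketch of swapping in torchao's low-bit optimizers; the module path
# and class names (AdamW8bit, AdamWFp8) are assumed from the prototype API and
# may differ depending on the installed torchao version.
import torch
from torchao.prototype.low_bit_optim import AdamW8bit  # , AdamWFp8

model = torch.nn.Linear(4096, 4096).cuda()  # toy stand-in for the real model

# optim = torch.optim.AdamW(model.parameters(), lr=3e-4)  # fp32 baseline
optim = AdamW8bit(model.parameters(), lr=3e-4)             # 8-bit optimizer states
# optim = AdamWFp8(model.parameters(), lr=3e-4)            # fp8 optimizer states (recent GPUs)

for _ in range(3):
    loss = model(torch.randn(8, 4096, device="cuda")).pow(2).mean()
    loss.backward()
    optim.step()
    optim.zero_grad(set_to_none=True)
```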
@fegin
Contributor

fegin commented Dec 16, 2024

cc: @weifengpy

@weifengpy
Contributor

Hey @asahni04, do you happen to have a breakdown?

  • baseline: load the FA-2 checkpoint with the FA-2 model and AdamW
  • switch only to FA-3
  • switch only to 8-bit AdamW

That would help clarify whether it's FA-3 (the model state dict) or 8-bit AdamW (the optimizer state dict).
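A minimal sketch of that breakdown, for concreteness (the build_model factory, the checkpoint layout, and the commented-out torchao import are placeholders standing in for the real training script):

```python
# Sketch of the ablation: change one variable at a time so a failure points at
# either the model/attention path or the optimizer path. build_model and the
# checkpoint layout are placeholders, not the actual training script.
import torch
from torch.optim import AdamW

def build_model(attn_impl: str) -> torch.nn.Module:
    # placeholder for a factory that selects the FA-2 vs FA-3 attention kernel
    return torch.nn.Linear(16, 16)

# stand-in for an FA-2 checkpoint holding both model and optimizer state
base = build_model("fa2")
base_opt = AdamW(base.parameters(), lr=3e-4)
base(torch.randn(4, 16)).sum().backward()
base_opt.step()
ckpt = {"model": base.state_dict(), "optim": base_opt.state_dict()}

# (1) baseline: FA-2 model + AdamW, resume both states -> should track the old run
model = build_model("fa2")
model.load_state_dict(ckpt["model"])
optim = AdamW(model.parameters(), lr=3e-4)
optim.load_state_dict(ckpt["optim"])

# (2) switch only the attention kernel: FA-3 model, same weights, same AdamW
model_fa3 = build_model("fa3")
model_fa3.load_state_dict(ckpt["model"])  # if this alone breaks, suspect the model / state-dict path

# (3) switch only the optimizer: FA-2 model, 8-bit AdamW with fresh optimizer state
# from torchao.prototype.low_bit_optim import AdamW8bit  # assumed torchao module path
# optim_8bit = AdamW8bit(model.parameters(), lr=3e-4)    # if this alone diverges, suspect the optimizer path
```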

@asahni04
Author

asahni04 commented Dec 19, 2024

Hi @weifengpy, sorry for the delayed response. Yes:

  1. The baseline is an FA-2 checkpoint trained with AdamW.
  2. Switching to FA-3 directly for inference (single-GPU and multi-GPU TP-based) on the model trained with FA-2 leads to broken results. However, fine-tuning from scratch with FA-3 seems to work and gives around a 30% speedup depending on the parallelism config.
  3. With 8-bit Adam the loss seems to diverge after some iterations. I tried various block_size values and am using the torchao implementation. Any suggestions to help solve it? Could it be an error due to the TP/DP config?
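Not a fix, but one way to localize it might be a single-GPU repro before involving TP/DP: same init and data, fp32 AdamW vs torchao's AdamW8bit at a couple of block_size values, then watch where the loss curves split. A rough sketch (the torchao import path and the block_size keyword are assumptions based on the prototype low-bit optim API; verify against the installed version):

```python
# Single-GPU divergence check: identical init and data, fp32 AdamW vs torchao's
# AdamW8bit at two block_size values. The torchao import path and block_size
# kwarg are assumptions; verify against the installed version.
import copy
import torch
import torch.nn.functional as F
from torchao.prototype.low_bit_optim import AdamW8bit

torch.manual_seed(0)
base = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 1024)
).cuda()

runs = {
    "adamw_fp32": (copy.deepcopy(base), lambda p: torch.optim.AdamW(p, lr=3e-4)),
    "adamw8bit_bs256": (copy.deepcopy(base), lambda p: AdamW8bit(p, lr=3e-4, block_size=256)),
    "adamw8bit_bs2048": (copy.deepcopy(base), lambda p: AdamW8bit(p, lr=3e-4, block_size=2048)),
}
opts = {name: make_opt(model.parameters()) for name, (model, make_opt) in runs.items()}

for step in range(500):
    x = torch.randn(32, 1024, device="cuda")
    y = x.roll(1, dims=-1)  # arbitrary fixed target so every run sees the same task
    for name, (model, _) in runs.items():
        loss = F.mse_loss(model(x), y)
        loss.backward()
        opts[name].step()
        opts[name].zero_grad(set_to_none=True)
        if step % 100 == 0:
            print(f"step {step:4d} {name}: {loss.item():.5f}")
```

If the 8-bit runs already drift apart from the fp32 baseline here, the quantized optimizer state itself is probably the issue rather than the TP/DP setup.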

@gnadathur
Contributor

cc: @vkuzo
