neilmovva 8 hours ago

Today, training in "low precision" probably means computing FP8 x FP8 -> FP32. The FP32 accumulation is still important, but otherwise, yes, this works, especially if we're talking about MXFP8 as supported on Blackwell [0]. A rough sketch of the numerics is below.
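
To make that concrete, here's a sketch in plain PyTorch (not the actual Blackwell kernel path) that emulates the numerics: the operands are rounded to FP8 E4M3 with a per-tensor scale, but the matmul itself accumulates in FP32. The function name and the per-tensor scaling scheme are just illustrative.

```python
# Rough sketch: emulate FP8 x FP8 -> FP32 matmul numerics in plain PyTorch.
# Operands are rounded to FP8 (E4M3) with a per-tensor scale, but the matmul
# accumulates in FP32 -- the part that keeps training stable.
# Needs PyTorch >= 2.1 for the float8 dtypes; names and scaling are illustrative.
import torch

def fp8_matmul_fp32_accum(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Per-tensor scales so values fit E4M3's finite range (max ~448).
    scale_a = a.abs().max().clamp(min=1e-12) / 448.0
    scale_b = b.abs().max().clamp(min=1e-12) / 448.0
    # Round inputs to FP8, then upcast so the matmul runs (and accumulates) in FP32.
    a_fp8 = (a / scale_a).to(torch.float8_e4m3fn).to(torch.float32)
    b_fp8 = (b / scale_b).to(torch.float8_e4m3fn).to(torch.float32)
    return (a_fp8 @ b_fp8) * (scale_a * scale_b)

x = torch.randn(256, 512)
w = torch.randn(512, 128)
err = (fp8_matmul_fp32_accum(x, w) - x @ w).abs().max()
print(err)  # small, driven by FP8 rounding of the inputs
```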

What's less proven is a recipe using MXFP4 x MXFP4 -> FP32 compute, e.g. [1], which needs more involved techniques to work. But if you get it to work stably, that pathway runs at full throughput on a 5090.
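
For context, the "MX" part means microscaling: blocks of 32 values share a single power-of-two scale, and each element is stored in a tiny format like FP4 (E2M1). Here's an illustrative sketch of just that quantization step; real MXFP4 training recipes like [1] layer stochastic rounding and other tricks on top, so treat the scale choice and block handling here as assumptions for demonstration.

```python
# Illustrative sketch of MX-style (microscaling) FP4 quantization: blocks of
# 32 values share one power-of-two scale, and each element is snapped to the
# FP4 (E2M1) grid {0, 0.5, 1, 1.5, 2, 3, 4, 6}. Numerics only -- real MXFP4
# training recipes add stochastic rounding etc.; this assumes the tensor
# length is a multiple of 32.
import torch

FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes

def mxfp4_quantize(x: torch.Tensor, block: int = 32) -> torch.Tensor:
    blocks = x.reshape(-1, block)
    # Shared per-block scale: a power of two chosen so the block max fits within +/-6.
    amax = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    scale = torch.exp2(torch.ceil(torch.log2(amax / 6.0)))
    scaled = blocks / scale
    # Snap each value to the nearest FP4 magnitude, keep the sign, then rescale.
    idx = (scaled.abs().unsqueeze(-1) - FP4_GRID).abs().argmin(dim=-1)
    return (FP4_GRID[idx] * scaled.sign() * scale).reshape(x.shape)

w = torch.randn(64, 32)
w_q = mxfp4_quantize(w)
print((w - w_q).abs().mean())  # much coarser than FP8, hence the extra tricks
```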

[0]: https://arxiv.org/abs/2506.08027

[1]: https://arxiv.org/abs/2502.20586

  • laidoffamazon 7 hours ago

    Interesting. My assumption was that one of the innovations of DeepSeek and the modern GPT models was low-precision pretraining, rather than just low-precision finetuning. I didn't realize you still need accumulation at a higher precision anyway.

storus 8 hours ago

Only GPU-poors run Q-GaLore and similar tricks.

  • Twirrim 5 hours ago

    Even the large cloud AI services are focused on this, because it drives down the average "cost per query", or whatever you want to call it. For inference, arguably even more than for training, the smaller and more efficient they can make the models, the better their bottom line.