Comment by ainch a day ago

That's an interesting idea; it sounds similar to the principles behind low-precision models like BitNet (where each weight is +1, -1, or 0).
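
For concreteness, here's a rough sketch of the kind of ternary weight quantization BitNet-style models use (an absmean scheme, as in the BitNet b1.58 paper). The function name and the per-tensor scaling are just illustrative assumptions, not BitNet's exact implementation:

```python
import numpy as np

def ternary_quantize(w: np.ndarray, eps: float = 1e-5):
    """Sketch of absmean ternary quantization: scale by the mean absolute
    value, round, and clip so every weight lands in {-1, 0, +1}.
    The scale is kept so outputs can be rescaled at matmul time."""
    scale = np.abs(w).mean() + eps            # per-tensor scale (real models may use per-group)
    w_q = np.clip(np.round(w / scale), -1, 1)
    return w_q.astype(np.int8), scale

# Toy usage: quantize a random weight matrix and check the codebook.
w = np.random.randn(4, 4).astype(np.float32)
w_q, s = ternary_quantize(w)
print(np.unique(w_q))               # subset of [-1, 0, 1]
print(np.abs(w - w_q * s).mean())   # reconstruction error
```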

That said, I know DeepSeek use fp32 for their gradient updates even though they use fp8 for inference. And a recent paper shows that RL+LLM training is shakier at bf16 than at fp16. Both of those suggest that numerical precision in gradients still matters.
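
To illustrate the pattern I mean (not DeepSeek's actual pipeline): the matmuls run in a low precision, but a fp32 "master" copy of the weights absorbs the updates so tiny gradient contributions aren't rounded away. A minimal sketch, using fp16 as a stand-in since NumPy has no fp8 dtype:

```python
import numpy as np

rng = np.random.default_rng(0)
w_master = rng.standard_normal((8, 8)).astype(np.float32)  # fp32 master weights
x = rng.standard_normal((4, 8)).astype(np.float16)         # low-precision activations
lr = 1e-3

for step in range(10):
    w_low = w_master.astype(np.float16)           # low-precision copy for compute
    y = x @ w_low.T                               # forward pass in fp16
    grad_low = y.T @ x                            # backward for loss = 0.5*||y||^2, still fp16
    w_master -= lr * grad_low.astype(np.float32)  # update accumulated in fp32
```

If you instead updated w_low directly, updates smaller than fp16's resolution around each weight would vanish entirely, which is the kind of thing people worry about when gradients get squeezed into fewer bits.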