Comment by laidoffamazon 7 hours ago

Interesting. My assumption was that one of the innovations of DeepSeek and the modern GPT models was doing the pretraining itself in low precision, rather than only the later finetuning. I didn't realize you still need to accumulate at a higher precision anyway.
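
For anyone else wondering why the accumulator matters, here is a toy sketch in NumPy. It is not how any particular model's kernels work; the helper name `sum_with_accumulator` and the input values are made up for illustration. It sums the same float16 inputs twice, once with a float16 running total and once with a float32 running total:

```python
import numpy as np

def sum_with_accumulator(values, acc_dtype):
    """Sum values one at a time, rounding the running total to acc_dtype."""
    acc = acc_dtype(0.0)
    for v in values:
        # Each partial sum is rounded to the accumulator's precision.
        acc = acc_dtype(acc + acc_dtype(v))
    return float(acc)

# Hypothetical inputs: many small, low-precision terms, loosely like the
# products summed inside one long dot product.
values = np.full(4096, 0.01, dtype=np.float16)

exact = float(values.astype(np.float64).sum())    # reference total
low = sum_with_accumulator(values, np.float16)    # low-precision accumulator
high = sum_with_accumulator(values, np.float32)   # higher-precision accumulator

print(f"reference: {exact:.4f}")
print(f"fp16 acc:  {low:.4f}")
print(f"fp32 acc:  {high:.4f}")
```

Once the float16 running total gets large enough, each new term is smaller than half the spacing between adjacent float16 values, so adding it changes nothing and the total stalls. The float32 accumulator has enough mantissa bits to keep absorbing the small terms, which is the intuition behind keeping inputs in low precision while accumulating in a wider format.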