Comment by laidoffamazon
Comment by laidoffamazon 7 hours ago
Interesting. My assumption was one of the innovations of DeepSeek and the modern GPT models was performing low precision pretraining rather than just finetuning further. I didn't realize you still need accumulation at a higher precision anyway