Comment by HarHarVeryFunny 21 hours ago

I'm sure that's part of it, which is why it's nice to see Hugging Face sharing this. Still, it clearly reflects the reality that large LLMs are difficult to train, and perhaps for more reasons than just gradient issues (I haven't read that HF doc yet).

For simpler nets, like ResNet, it may just be that modern initialization and training recipes avoid most gradient issues, even though the potential for them is otherwise still there.
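To illustrate the initialization point with a toy example (my own sketch, not from the HF doc): in a plain deep ReLU stack with no normalization or residuals, naive unit-variance weights make activation magnitudes blow up with depth, while He/Kaiming-style scaling (std = sqrt(2/fan_in)) keeps them roughly stable.

```python
import numpy as np

def final_activation_std(depth=50, width=256, weight_std=None, seed=0):
    # Push a random batch through `depth` plain ReLU layers (no norm,
    # no residuals) and return the std of the final activations.
    # weight_std=None means naive unit-variance weight entries.
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((32, width))
    for _ in range(depth):
        std = weight_std if weight_std is not None else 1.0
        W = rng.standard_normal((width, width)) * std
        x = np.maximum(x @ W, 0.0)  # ReLU
    return float(x.std())

naive = final_activation_std()                             # unit-variance init: explodes
he = final_activation_std(weight_std=(2.0 / 256) ** 0.5)   # He init: stays O(1)
```

With naive init each layer multiplies the activation std by roughly sqrt(width/2), so after 50 layers the activations are astronomically large; the He-scaled version stays on the order of 1. Real training recipes layer normalization, residual connections, and warmup on top of this, which is part of why these issues rarely bite in practice for nets like ResNet.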