Comment by HarHarVeryFunny a day ago

> It seems to me that in 2016 people did (have to) play a lot more tricks with the backpropagation than today

Perhaps, but maybe that's because there was more experimentation with different neural net architectures and layer/node counts back then?

Nowadays the training problems are better understood, gradient clipping is built into the frameworks, and it's easy to find training examples online with clipping enabled.
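For example, a minimal sketch of what "built into the frameworks" looks like in practice, using PyTorch's clip_grad_norm_ (the model/optimizer/loss_fn names are just placeholders, not from any particular example):

    import torch

    # Hypothetical training step; model, optimizer, loss_fn and the batch are placeholders.
    def train_step(model, optimizer, loss_fn, inputs, targets):
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        # A single built-in call clips the global gradient norm; no hand-rolled backprop tricks.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        return loss.item()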

The problem itself didn't actually go away. ReLU (or GELU) is still the default activation for most networks, and training an LLM is apparently still something of a black art. Hugging Face just released their "Smol Training Playbook: a distillation of hard earned knowledge to share exactly what it takes to train SOTA LLMs", so evidently even in 2025 training isn't exactly a turn-key affair.

joe_the_user a day ago

I wonder if it's just that the important neural nets are now being trained by large, secretive corporations that aren't interested in sharing their knowledge.

  • HarHarVeryFunny 15 hours ago

    I'm sure that's part of it, which is why it's nice to see Hugging Face sharing this, but it still obviously reflects the reality that large LLMs are difficult to train, for whatever reason (maybe more than just gradient issues; I haven't read that HF doc yet).

    For simpler nets, like ResNet, it may just be that modern initialization and training recipes avoid most gradient issues, even though they are otherwise potentially still there.
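    As a rough illustration (my assumption of what "modern initialization" covers, not anything specific from the HF doc): He/Kaiming init, which keeps gradient variance roughly stable through ReLU layers, is a one-liner in PyTorch, and the framework's defaults are already close to it:

        import torch.nn as nn

        # Sketch only: He/Kaiming init for ReLU-style nets. PyTorch's default layer
        # init is already a close variant, so "good init" now mostly comes for free.
        def init_weights(module):
            if isinstance(module, (nn.Linear, nn.Conv2d)):
                nn.init.kaiming_normal_(module.weight, nonlinearity='relu')
                if module.bias is not None:
                    nn.init.zeros_(module.bias)

        net = nn.Sequential(nn.Conv2d(3, 64, 3), nn.ReLU(), nn.Conv2d(64, 64, 3))
        net.apply(init_weights)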