samsartor 2 days ago

Yes. Pretraining and fine-tuning use standard Adam optimizers (usually with weight decay). Reinforcement learning has historically been the odd one out, but these days almost all RL algorithms also use backprop and gradient descent.
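
For concreteness, here's a minimal PyTorch sketch of one such training step, assuming a toy model and a dummy batch (the model size, learning rate, and weight-decay value are illustrative, not anyone's actual config):

    import torch
    import torch.nn as nn

    # Toy stand-in for an LLM; the real thing is a transformer with billions of parameters.
    model = nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, 128))

    # Adam with decoupled weight decay (AdamW), the usual choice for pretraining/fine-tuning.
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

    x = torch.randn(32, 128)       # dummy input batch
    target = torch.randn(32, 128)  # dummy targets

    loss = nn.functional.mse_loss(model(x), target)
    loss.backward()        # backprop: compute gradients for every parameter
    optimizer.step()       # Adam update using those gradients
    optimizer.zero_grad()  # clear gradients before the next step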

ForceBru 2 days ago

Are LLMs still trained by (variants of) stochastic GRADIENT descent? AFAIK what used to be called "backprop" is nowadays known as "automatic differentiation", which is widely used in PyTorch, JAX, etc.

  • imtringued 2 days ago

    Whether it's gradient descent specifically doesn't matter here: second-order and higher-order methods still need the lower-order derivatives, so the gradient gets computed either way.

    Backpropagation is reverse-mode automatic differentiation; they are the same thing.

    And for those who don't know what backpropagation is: it's just an efficient method for computing the gradient of the loss with respect to all of the parameters at once.
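
    To make that concrete, here's a minimal PyTorch illustration (the toy model and data are made up): a single loss.backward() call runs reverse-mode AD and fills in the gradient for every parameter in one backward pass.

        import torch
        import torch.nn as nn

        model = nn.Linear(10, 1)                      # toy "network" with 11 parameters
        x, y = torch.randn(4, 10), torch.randn(4, 1)  # dummy batch

        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()  # reverse-mode AD: one pass computes d(loss)/d(param) for all params

        for name, p in model.named_parameters():
            print(name, p.grad.shape)  # every parameter now holds its gradient, ready for an optimizer step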