drivebyhooting 2 days ago

I have a naive question about backprop and optimizers.

I understand how SGD is just taking a step proportional to the gradient and how backprop computes the partial derivative of the loss function with respect to each model weight.

But with more advanced optimizers the gradient is not really used directly: it gets per-weight normalization, is fudged with momentum, clipped, etc.

So really, how important is computing the exact gradient using calculus, vs just knowing the general direction to step? Would that be cheaper to calculate than full derivatives?

blackbear_ 2 days ago

Two thoughts:

> how important is computing the exact gradient using calculus

Normally the gradient is computed with a small "minibatch" of examples, meaning that on average over many steps the true gradient is followed, but each individual step never moves exactly along the true gradient. This noisy walk is actually quite beneficial for the final performance of the network (https://arxiv.org/abs/2006.15081, https://arxiv.org/abs/1609.04836), so much so that people started wondering about the best way to "corrupt" this approximate gradient even further to improve performance (https://arxiv.org/abs/2202.02831, among many other works on SGD noise).
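
To see the point numerically, here is a minimal numpy sketch (linear regression is just my toy example here): each minibatch gradient is an unbiased but noisy estimate of the full-batch gradient.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 10))            # toy dataset
    y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=1000)
    w = np.zeros(10)                           # model weights

    def grad(idx):
        # Gradient of mean squared error over the selected examples.
        Xb, yb = X[idx], y[idx]
        return 2 * Xb.T @ (Xb @ w - yb) / len(idx)

    full = grad(np.arange(1000))               # "true" gradient over all data
    mini = grad(rng.choice(1000, size=32))     # noisy minibatch estimate
    print(np.linalg.norm(mini - full))         # nonzero: each step deviates,
                                               # but the estimate is unbiased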

> vs just knowing the general direction to step

I can't find the relevant papers now, but I seem to recall that the Hessian eigenvalues of the loss function decay rather quickly, which means that taking a step in most directions will barely change the loss. That is to say, you have to know which direction to go quite precisely for an SGD-like method to work. People have been trying to visualize the loss and the trajectory taken during optimization: https://arxiv.org/pdf/1712.09913, https://losslandscape.com/
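
To make the "most directions are flat" claim concrete, here is a hedged toy sketch (the quadratic and the decay rate are my own invention, not from those papers): with a fast-decaying spectrum, a random direction barely changes the loss compared to the top eigendirection.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 1000
    eigs = 1.0 / np.arange(1, d + 1) ** 2      # assumed fast-decaying spectrum
    Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    H = Q @ np.diag(eigs) @ Q.T                # toy Hessian

    def loss_change(direction, step=0.1):
        v = step * direction / np.linalg.norm(direction)
        return 0.5 * v @ H @ v                 # quadratic model of the loss

    print(loss_change(rng.normal(size=d)))     # tiny: a random step is nearly flat
    print(loss_change(Q[:, 0]))                # much larger: the top direction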

  • raindeer2 2 days ago

    The first bit is why it is called stochastic gradient descent: you follow the gradient of a randomly chosen minibatch at each step, which basically makes you "vibrate" down along the gradient.

ssivark 2 days ago

> So really, how important is computing the exact gradient using calculus, vs just knowing the general direction to step? Would that be cheaper to calculate than full derivatives?

Yes, absolutely -- a lot of ideas inspired by this have been explored in the field of optimization, and also in machine learning. The very idea of "stochastic" gradient descent using mini-batches is basically a cheap (hardware-compatible) approximation to the gradient for each step.

For a relatively extreme example of how we might circumvent the computational effort of backprop, see Direct Feedback Alignment: https://towardsdatascience.com/feedback-alignment-methods-7e...
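
For intuition, here is a rough numpy sketch of the idea (my own minimal toy, not the exact method from the article): instead of backpropagating the error through the transpose of the forward weights, DFA projects the output error back through a fixed random matrix.

    import numpy as np

    rng = np.random.default_rng(0)
    # Toy 2-layer net: x -> h = tanh(W1 x) -> y_hat = W2 h
    W1 = rng.normal(size=(64, 10)) * 0.1
    W2 = rng.normal(size=(1, 64)) * 0.1
    B = rng.normal(size=(64, 1)) * 0.1         # fixed random feedback matrix
    lr = 0.01

    x = rng.normal(size=(10, 32))              # batch of 32 inputs
    y = rng.normal(size=(1, 32))               # toy targets

    h = np.tanh(W1 @ x)
    y_hat = W2 @ h
    e = y_hat - y                              # output error (MSE gradient)

    # Backprop would propagate W2.T @ e; DFA uses the random B instead:
    dh = (B @ e) * (1 - h ** 2)                # tanh' applied elementwise
    W2 -= lr * e @ h.T / 32
    W1 -= lr * dh @ x.T / 32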

Ben Recht has an interesting survey of how various learning algorithms used in reinforcement learning relate to techniques in optimization (and how each plays with the gradient in different ways): https://people.eecs.berkeley.edu/~brecht/l2c-icml2018/ (there's nothing special about RL... as far as optimization is concerned, the concepts work the same even when all the data is given up front rather than generated on the fly based on interactions with the environment)

GistNoesis 2 days ago

>computing the exact gradient using calculus

First of all, gradient computation with backprop (aka reverse-mode automatic differentiation) is exact to numerical precision (except for edge cases that are not relevant here), so it's not about the way of computing the gradient.

What Andrej is trying to say is that when you create a model, you have freedom of design in the shape of the loss function, and that in this design what matters for learning is not so much the value of the loss function as its slopes and curvature (peaks and valleys).

The problematic case is flat valleys surrounded by steep cliffs (picture the Grand Canyon).

Advanced optimizers in deep learning like Adam are still first-order, with a diagonal approximation of the curvature, which means that in addition to the gradient the optimizer has an estimate of the scale sensitivity of each parameter independently. So the cheap thing it can reasonably do is modulate the gradient with this scale.
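
As a concrete illustration, a minimal sketch of the standard Adam update: the second-moment estimate v is exactly that per-parameter scale, and the raw gradient gets divided by its square root.

    import numpy as np

    def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        # One Adam update: momentum (m) plus a diagonal scale estimate (v).
        m = b1 * m + (1 - b1) * g              # first moment: smoothed gradient
        v = b2 * v + (1 - b2) * g ** 2         # second moment: per-weight scale
        m_hat = m / (1 - b1 ** t)              # bias correction for early steps
        v_hat = v / (1 - b2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # gradient modulated by scale
        return w, m, v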

The length of the gradient vector is often problematic, so what classical optimizers would usually do is something called "line search": determining the optimal step size along this direction. But the cost of doing that is usually 10-100 evaluations of the cost function, which is often not worth the effort in a noisy stochastic context compared to just taking several smaller steps.
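
For reference, a minimal backtracking line-search sketch (one common variant, with conventional Armijo constants): every trial step costs a full extra loss evaluation, which is exactly the expense mentioned above.

    import numpy as np

    def backtracking_line_search(f, x, g, d, t=1.0, beta=0.5, c=1e-4):
        # Shrink the step until the Armijo sufficient-decrease condition holds.
        fx = f(x)
        while f(x + t * d) > fx + c * t * g @ d:
            t *= beta                          # each iteration: one extra f() call
        return t

    f = lambda x: x @ x                        # toy loss
    x = np.array([3.0, 4.0])
    g = 2 * x                                  # its gradient
    print(backtracking_line_search(f, x, g, -g))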

Higher-order optimizers require the loss function to be twice differentiable, so non-linearities like ReLU, which are cheap to compute, can't be used.

Zeroth-order global optimizers don't even require the gradient, which is useful when the energy-function landscape has lots of local minima (picture an egg box).

mgh95 2 days ago

> But with more advanced optimizers the gradient is not really used directly. It gets per weight normalization, fudged with momentum, clipped, etc.

Why would these things be "fudging"? Vanishing gradients are a real thing (see the original batch norm paper), and ensuring that the relative magnitudes are in some sense "smooth" between layers makes for an easier optimization problem.

> So really, how important is computing the exact gradient using calculus, vs just knowing the general direction to step? Would that be cheaper to calculate than full derivatives?

Very. In high-dimensional spaces, small steps can move you extremely far from a proper solution; see adversarial examples.
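
A quick worked example of why "small" is deceptive in high dimensions (the numbers are my own illustration): a perturbation of one intensity level per pixel is invisible pointwise, yet its Euclidean length is eps * sqrt(d).

    import numpy as np

    d = 224 * 224 * 3              # a flattened ImageNet-sized image
    eps = 1 / 255                  # one intensity level per pixel
    delta = np.full(d, eps)
    print(np.linalg.norm(delta))   # ~1.52: a long move in L2 distance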

danielmarkbruce 2 days ago

Calculus isn't that complicated, at least not what's done in backprop.

How do you propose calculating the "general direction" ?

And, an example "advanced optimizer" - AdamW - absolutely uses gradients. It just does more, but not less.

killerstorm 15 hours ago

I don't think there's a way to find "general direction" without actually using calculus.

If you write down an explicit expression for the partial derivatives, it will contain sums. The sign (which is what defines the general direction) depends on everything in the sum, and you can't avoid calculating it.
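
To make that concrete (least squares is just my example here), the partial derivative really is a sum over the batch:

    L(w) = \frac{1}{N} \sum_{n=1}^{N} (w^\top x_n - y_n)^2,
    \qquad
    \frac{\partial L}{\partial w_j} = \frac{2}{N} \sum_{n=1}^{N} (w^\top x_n - y_n) \, x_{n,j}

Individual terms can point either way; only the full sum tells you whether to increase or decrease w_j.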

ainch a day ago

That's an interesting idea; it sounds similar to the principles behind low-precision models like BitNet (where each weight is +1, -1, or 0).

That said, I know DeepSeek uses fp32 for its gradient updates even though it uses fp8 for inference. And a recent paper shows that RL+LLM training is shakier at bf16 than at fp16. Both of these imply that numerical precision in gradients still matters.

macleginn 2 days ago

It is possible to compute an approximate gradient (the direction to step) without using the formulas: change the value of each parameter individually, recompute the loss, then set each parameter to the value that reduced the loss, and repeat. This means, however, that we have to do number-of-parameters forward passes for one optimization step, which is very expensive. With the formulas, we can compute all these values in one backward pass.
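
That is essentially zeroth-order (finite-difference) gradient estimation; a minimal sketch with a toy loss of my own:

    import numpy as np

    def fd_gradient(f, w, h=1e-6):
        # Approximate the gradient by nudging each parameter in turn.
        g = np.zeros_like(w)
        fw = f(w)
        for i in range(len(w)):                # one extra forward pass per parameter
            w2 = w.copy()
            w2[i] += h
            g[i] = (f(w2) - fw) / h
        return g

    f = lambda w: np.sum(w ** 2)               # toy loss
    print(fd_gradient(f, np.ones(5)))          # ~[2 2 2 2 2], after 6 forward passes
    # Backprop gets the same vector from one forward and one backward pass.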

HarHarVeryFunny a day ago

You don't need exact gradients, since gradient descent is self-correcting (which can make gradient-calculation bugs hard to find!). One approach using inexact gradients is predicted "synthetic gradients", which avoid having to wait for the backward pass before applying weight updates.
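
Very roughly, the synthetic-gradient idea looks like this (a heavily simplified sketch with invented dimensions, not the exact method from the DNI paper): a small side model predicts the gradient arriving at a layer, the layer updates immediately, and the predictor is trained later against the true gradient.

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(32, 16)) * 0.1        # layer weights
    M = rng.normal(size=(32, 32)) * 0.01       # linear synthetic-gradient predictor
    lr = 0.01

    x = rng.normal(size=(16, 8))               # a batch of inputs to the layer
    h = W @ x
    dh_pred = M @ h                            # predicted gradient w.r.t. h
    W -= lr * dh_pred @ x.T / 8                # update now, without waiting

    # ...later, when the real backward pass delivers dh_true...
    dh_true = rng.normal(size=(32, 8))         # stand-in for the true gradient
    M -= lr * (dh_pred - dh_true) @ h.T / 8    # train the predictor toward it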

imtringued 2 days ago

All first-order methods use the gradient or Jacobian of a function. Calculating the first-order derivatives is really cheap.

Non-stochastic gradient descent has to optimize over the full dataset. This doesn't matter for non-machine-learning applications, because there often is no such thing as a dataset in the first place and the objective has a small fixed size. The gradient here is exact.

With stochastic gradient descent you turn gradient descent into an online algorithm, processing a finite subset of the dataset at a time. Obviously the gradient is no longer exact, but you still have to calculate it.

So "exactness" seems not to be that useful a property for optimization. Also, I can't stress this enough: calculating first-order derivatives is so cheap there is no need to bother approximating them. It's roughly 2x the cost of evaluating the function in the first place.

It's the second-order derivatives that you want to approximate using first-order derivatives. That's how BFGS and Gauss-Newton work.
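
For instance, SciPy's BFGS asks only for the function value and the first derivatives, and builds its curvature estimate internally from successive gradient differences:

    from scipy.optimize import minimize, rosen, rosen_der

    # BFGS receives f and its gradient (jac) only; the inverse-Hessian
    # approximation is accumulated internally across iterations.
    result = minimize(rosen, x0=[1.3, 0.7, 0.8, 1.9, 1.2],
                      method="BFGS", jac=rosen_der)
    print(result.x)                            # converges to [1 1 1 1 1]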

joe_the_user a day ago

There are a lot of approximation methods involved in training neural networks. But the main thing is that, while learning calculus is challenging, actually calculating the derivative of a function at a point using algorithmic differentiation is extremely fast and exact: nearly as exact as calculating the function's value itself, and inherently more efficient than finite-difference approximations to the derivative. Algorithmic differentiation is nearly "magic".
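
A quick PyTorch illustration of that exactness (the function is my own toy): reverse-mode autodiff matches the hand-derived derivative to machine precision, while a finite-difference estimate carries visible truncation error.

    import math
    import torch

    x = torch.tensor(2.0, requires_grad=True)
    y = torch.sin(x) * x ** 2                  # toy function
    y.backward()                               # reverse-mode autodiff

    analytic = math.cos(2.0) * 4.0 + 2 * 2.0 * math.sin(2.0)
    h = 1e-4
    fd = (math.sin(2.0 + h) * (2.0 + h) ** 2
          - math.sin(2.0 - h) * (2.0 - h) ** 2) / (2 * h)

    print(x.grad.item() - analytic)            # ~0: exact to machine precision
    print(fd - analytic)                       # small but nonzero truncation error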

But remember, that is for taking the derivative at a single data point - what's hard is the average derivative over the entire set of points, and that's where sampling and approximations (SGD etc.) come in.