Comment by gwern
> Note again that a residual connection is not just an arbitrary shortcut connection or skip connection (e.g., 1988)[LA88][SEG1-3] from one layer to another! No, its weight must be 1.0, like in the 1997 LSTM, or in the 1999 initialized LSTM, or the initialized Highway Net, or the ResNet. If the weight had some other arbitrary real value far from 1.0, then the vanishing/exploding gradient problem[VAN1] would raise its ugly head, unless it was under control by an initially open gate that learns when to keep or temporarily remove the connection's residual property, like in the 1999 initialized LSTM, or the initialized Highway Net.
After reading Lang & Witbrock 1988 (https://gwern.net/doc/ai/nn/fully-connected/1988-lang.pdf), I'm not sure how convincing I find this explanation.
For residual networks with an infinite number of layers it is absolutely correct. For a residual network with finitely many layers, you can get away with any non-zero constant weight, as long as the weight is chosen appropriately for the fixed network depth. The problem is simply that c^n gives you very big or very small numbers for large n and large deviations of c from 1.
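A quick numerical illustration of the c^n point (a minimal sketch; the particular constants are arbitrary):

```python
# How a constant skip weight c compounds over n layers: the gradient picks up a
# factor of roughly c^n, which vanishes or explodes unless c is very close to 1
# (or tuned to the specific, fixed depth).
for c in (0.9, 0.99, 1.0, 1.01, 1.1):
    for n in (10, 100, 1000):
        print(f"c={c:<4} n={n:<4} c^n={c**n:.3e}")
```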
Now let me address the other possibility that you are raising: what if residual connections aren't necessary? What if there is another way? What are the criteria for avoiding an exploding or vanishing gradient, or slow learning in the absence of both?
For that we first need to know why residual connections work. There is no way around calculating the backpropagation formula by hand, but there is an easy trick to keep it simple: we don't care about the number of parameters in the network, only about the flow of the gradient. So just take a single input and a single output, hidden size 1, and two hidden layers.
Each layer has a single weight, a bias, and an activation function.
Let's assume you initialize each weight and bias with zero. The forward pass returns zero for any input and the gradient is zero: in this artificial scenario the gradient starts out vanished and stays vanished. The reason is pretty obvious when you apply backpropagation by hand. The second layer's zero weight wipes out the gradient of the first layer. If there were only a single layer, its gradient would be non-zero and yield a non-zero update, rescuing the network from the vanishing gradient.
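A hedged sketch of that degenerate case, assuming PyTorch and a tanh activation (both are my choices; the setup above doesn't specify a framework or nonlinearity):

```python
# Toy network from the text: one scalar input, hidden size 1, two layers,
# every weight and bias initialized to zero.
import torch

x = torch.tensor(1.0)       # arbitrary scalar input
target = torch.tensor(0.5)  # arbitrary scalar target

w1 = torch.zeros(1, requires_grad=True)
b1 = torch.zeros(1, requires_grad=True)
w2 = torch.zeros(1, requires_grad=True)
b2 = torch.zeros(1, requires_grad=True)

h1 = torch.tanh(w1 * x + b1)   # = 0 at zero init
h2 = torch.tanh(w2 * h1 + b2)  # = 0 at zero init
loss = (h2 - target) ** 2
loss.backward()

# dL/dw2 carries a factor of h1 (= 0), and everything in the first layer
# carries a factor of w2 (= 0), so those gradients are exactly zero and those
# parameters can never move. (With tanh the last bias b2 does still receive a
# gradient; with ReLU even that would be zero.)
print(w1.grad, b1.grad, w2.grad, b2.grad)
```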
Now what if you add residual connections? The forward pass is still degenerate (each zero-initialized block just passes its input through), but the backward pass changes for two layers and beyond. The gradient for the second layer's weight is just the derivative of the second layer's activation multiplied by the first layer's output from the forward pass. The first layer's gradient contains the familiar term that flows through the second layer (the second-layer expression with the first layer's output replaced by its derivative), but because it is a residual net you also add the gradient of just the first layer: the term that reaches the loss through the identity skip and bypasses the second layer entirely.
In other words, the first layer is trained on a term that is independent of the layers that come after it, while also getting feedback from the higher layers on top. This allows it to become non-zero, which then lets the second layer become non-zero, which lets the third become non-zero, and so on.
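Continuing the same hedged PyTorch/tanh sketch, now with identity (weight-1.0) skips around each zero-initialized layer:

```python
# Same toy network as before, but each layer sits inside a residual block with
# an identity skip connection of weight 1.0.
import torch

x = torch.tensor(1.0)
target = torch.tensor(0.5)

w1 = torch.zeros(1, requires_grad=True)
b1 = torch.zeros(1, requires_grad=True)
w2 = torch.zeros(1, requires_grad=True)
b2 = torch.zeros(1, requires_grad=True)

h1 = x + torch.tanh(w1 * x + b1)    # residual block 1
h2 = h1 + torch.tanh(w2 * h1 + b2)  # residual block 2
loss = (h2 - target) ** 2
loss.backward()

# dL/dw1 now includes a term that reaches the loss through the second block's
# identity skip, bypassing w2 = 0 entirely, so all four parameters receive a
# non-zero gradient and the network can escape the all-zero state.
print(w1.grad, b1.grad, w2.grad, b2.grad)
```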
Since the degenerate case of a zero-initialized network makes things easy to conceptualise, it should help you figure out what other ways there are to accomplish the same task.
For example, what if we apply the loss to every layer's output as a regularizer? That is essentially doing the same thing as a residual connection, but with skip connections that sum up the outputs. You could even replace the sum with a weighted sum where the weights are not equal to 1.0.
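A hedged sketch of that per-layer auxiliary loss idea (again assuming PyTorch; the layer sizes, the shared head, and the loss are placeholders of mine):

```python
# Every layer's output is mapped to a prediction and compared against the
# target, and the per-layer losses are summed, so each layer receives a direct
# gradient even when the layers above it are still degenerate.
import torch
import torch.nn as nn

layers = nn.ModuleList([nn.Sequential(nn.Linear(4, 4), nn.Tanh()) for _ in range(3)])
head = nn.Linear(4, 1)      # shared head mapping any layer's output to a prediction
criterion = nn.MSELoss()

x = torch.randn(8, 4)       # toy batch
target = torch.randn(8, 1)

h = x
loss = torch.tensor(0.0)
for layer in layers:
    h = layer(h)
    loss = loss + criterion(head(h), target)  # auxiliary loss on this layer's output
loss.backward()
```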
But what if you don't want skip connections either, because they are too similar to residual networks? A residual network already has a skip connection, so summing things up in a slightly different way is uninteresting. It is also too reliant on each layer being encouraged to produce an output that is matched against the label.
In other words, what if we wanted the inner layers not to be subject to any correlation with the output data? You would need something that forces the gradients away from zero but also away from excessively high values, e.g. weight regularization, or layer normalisation with a fixed non-zero bias.
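One way to picture the layer-normalisation suggestion (my own reading, sketched in PyTorch; the depth, init scale, and the 0.1 shift are arbitrary):

```python
# Stacking many layers with a small random init. Without normalisation the
# activation scale shrinks layer by layer; with layer normalisation plus a
# fixed non-zero bias it stays in a healthy range, with no skip connections.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
depth, width = 50, 64
weights = [0.05 * torch.randn(width, width) for _ in range(depth)]

h_plain = h_norm = torch.randn(8, width)
for w in weights:
    h_plain = torch.tanh(h_plain @ w)           # plain stack: scale decays toward zero
    z = F.layer_norm(h_norm @ w, (width,))      # rescale pre-activations
    h_norm = torch.tanh(z + 0.1)                # fixed non-zero bias keeps them off zero
print(f"plain: {h_plain.abs().mean():.2e}   normalised: {h_norm.abs().mean():.2e}")
```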
Predictive coding and especially batched predictive coding could also be a solution to this.
Predictive coding predicts the input of the next layer, so the only requirement is that the forward pass produces a non-zero output. There is no requirement for the gradient to flow through the entire network.
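To illustrate only that last point, the absence of an end-to-end gradient path, here is a deliberately crude local-training sketch (it is not a faithful predictive-coding implementation, which would also involve iterative inference over the activities; each layer just gets a local prediction loss and a detached input):

```python
# Each layer has a small decoder that tries to predict that layer's own input
# (i.e. the output of the layer below), and detach() makes every loss purely
# local, so no gradient ever has to flow through the entire network.
import torch
import torch.nn as nn

layers = nn.ModuleList([nn.Sequential(nn.Linear(4, 4), nn.Tanh()) for _ in range(3)])
decoders = nn.ModuleList([nn.Linear(4, 4) for _ in range(3)])  # local predictors
criterion = nn.MSELoss()

x = torch.randn(8, 4)

h = x
for layer, decoder in zip(layers, decoders):
    inp = h.detach()                         # cut the gradient path to earlier layers
    h = layer(inp)
    local_loss = criterion(decoder(h), inp)  # predict this layer's own input
    local_loss.backward()                    # gradients stay inside this layer and decoder
```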