Comment by bjourne
I'm not a giant like Schmidhuber, so I might be wrong, but imo there are at least two features that distinguish ResNet-style residual connections from LSTMs:
1. In LSTMs, skip connections help propagate gradients backwards through time. In ResNets, skip connections help propagate gradients across layers (see the first sketch after this list).
2. Forking the dataflow is itself part of the novelty, not only the residual computation. The shortcut branch can contain things like batch norm, downsampling, or any other operation, whereas LSTM "residual learning" is much more rigid (see the second sketch below).
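To make point 1 concrete, here is a minimal sketch (assuming PyTorch; the framework, dimensions, and gate values are illustrative, not from the comment). The LSTM's additive cell-state update gives gradients an unattenuated path backwards through time, while the ResNet's `x + F(x)` gives them one across layers:

```python
import torch
import torch.nn as nn

# --- LSTM-style: additive path through TIME ----------------------------
# c_t = f_t * c_{t-1} + update_t; the gate values here are toy stand-ins.
c = torch.zeros(1, 8)
for t in range(5):                        # unrolled over 5 time steps
    f_gate = torch.sigmoid(torch.randn(1, 8))  # forget gate (toy values)
    update = torch.tanh(torch.randn(1, 8))     # candidate cell input (toy)
    c = f_gate * c + update               # the "+" carries gradients back in time

# --- ResNet-style: additive path across LAYERS --------------------------
class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                               nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.f(x)              # d(out)/dx = I + dF/dx

x = torch.randn(1, 8, requires_grad=True)
y = x
for _ in range(5):                        # 5 stacked residual blocks
    y = ResidualBlock(8)(y)
y.sum().backward()                        # gradient reaches x via the identity paths
```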
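And a minimal sketch of point 2 (again assuming PyTorch; the channel counts and strides are hypothetical): in a ResNet block the shortcut is a genuine fork in the dataflow and can itself contain operations, such as a strided 1x1 convolution plus batch norm when the block downsamples, whereas the LSTM cell-state path admits no such substitutions:

```python
import torch
import torch.nn as nn

class DownsamplingResidualBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=2):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # The shortcut is not a bare identity: it downsamples and
        # renormalizes so the two branches can be added together.
        self.shortcut = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        return torch.relu(self.main(x) + self.shortcut(x))

block = DownsamplingResidualBlock(64, 128)
out = block(torch.randn(1, 64, 32, 32))   # -> shape (1, 128, 16, 16)
```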