Understanding LSTM networks (colah.github.io)
99 points by michael_nielsen on Aug 27, 2015 | 17 comments



What's amazing to me is that, if I understand correctly, backprop still works. It is very odd that SGD on the error function for some training data is conceptually equivalent to teaching all the gates for each hidden feature when to open/close given the next input in a sequence.


> What's amazing to me is that, if I understand correctly, backprop still works.

Yep! One computes the gradient with backpropagation and trains the LSTM with gradient descent on it.

> It is very odd that SGD on the error function for some training data is conceptually equivalent to teaching all the gates for each hidden feature when to open/close given the next input in a sequence.

Agreed, it's pretty remarkable the things one can learn with gradient descent. I'd like to understand this better.
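
To make the first point a bit more concrete: backprop through time is nothing exotic, just the chain rule applied to the unrolled recurrence. Here's a toy scalar sketch (my own made-up example, not anything from the post) where a hand-written backward pass matches a finite-difference check:

  import numpy as np

  # Toy recurrence s_t = tanh(w * s_{t-1} + x_t), loss = s_T^2.
  # The backward pass below is the chain rule unrolled over time;
  # it should match a finite-difference estimate of dL/dw.

  def loss(w, xs):
      s = 0.0
      for x in xs:
          s = np.tanh(w * s + x)
      return s * s

  def grad_bptt(w, xs):
      states = [0.0]                         # forward pass, keeping states
      for x in xs:
          states.append(np.tanh(w * states[-1] + x))
      dL_ds = 2.0 * states[-1]               # dL/ds_T
      dL_dw = 0.0
      for t in range(len(xs), 0, -1):        # backward pass
          pre = w * states[t - 1] + xs[t - 1]
          dtanh = 1.0 - np.tanh(pre) ** 2
          dL_dw += dL_ds * dtanh * states[t - 1]   # this step's contribution
          dL_ds = dL_ds * dtanh * w                # pass gradient to s_{t-1}
      return dL_dw

  w, xs, eps = 0.7, [0.1, -0.3, 0.5, 0.2], 1e-6
  numeric = (loss(w + eps, xs) - loss(w - eps, xs)) / (2 * eps)
  print(grad_bptt(w, xs), numeric)           # the two agree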


I don't understand why this is remarkable; could you elaborate? You have a smooth map from inputs to outputs, you differentiate it via automated applications of the chain rule, and you use this to make the error function smaller. Why should that be remarkable?


If I didn't know that neural nets worked, and someone explained the idea, I'd strongly expect them to get stuck in local minima, instead of learning this elaborate behavior. (That's basically what happens with a normal RNN, I think.)

It's probably a case of my intuitions about optimization being really broken in high-dimensional spaces, but I'd like to improve that. I'd also like to understand how the cost surface we're optimizing on interacts with network architecture decisions. Both, unfortunately, are very hard problems.


>If I didn't know that neural nets worked, and someone explained the idea, I'd strongly expect them to get stuck in local minima

This was the general consensus (based on intuition) until recently. It turns out it's not as big a problem as thought. Almost all local minima have very similar error values. The bigger issue is the combinatorially large number of saddle points, where the gradient becomes zero with very few downward-curving directions. The saddle-free Newton method is one way around that.

Open Problem: The landscape of the loss surfaces of multilayer networks, Choromanska, LeCun, Ben Arous: http://jmlr.org/proceedings/papers/v40/Choromanska15.pdf
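
A toy 2-D picture of why that helps (my own sketch of the core idea, not the actual algorithm from the saddle-free Newton paper): f(x, y) = x^2 - y^2 has a saddle at the origin. A plain Newton step treats it like a minimum and jumps straight onto it, while rescaling by the absolute eigenvalues of the Hessian flips the sign along the negative-curvature direction and moves away:

  import numpy as np

  # f(x, y) = x^2 - y^2: a saddle at the origin.
  def grad(p):
      x, y = p
      return np.array([2.0 * x, -2.0 * y])

  H = np.array([[2.0, 0.0],
                [0.0, -2.0]])                  # Hessian (constant here)

  eigvals, eigvecs = np.linalg.eigh(H)
  H_inv = eigvecs @ np.diag(1.0 / eigvals) @ eigvecs.T
  H_abs_inv = eigvecs @ np.diag(1.0 / np.abs(eigvals)) @ eigvecs.T

  p = np.array([1.0, 0.01])                    # start near the x-axis

  newton = p - H_inv @ grad(p)                 # lands exactly on the saddle
  saddle_free = p - H_abs_inv @ grad(p)        # moves away along y

  print("Newton step      ->", newton)         # [0. 0.]
  print("Saddle-free step ->", saddle_free)    # [0.   0.02]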


Unless there is some major advance in nonconvex optimization here, (sub)gradient descent on the function you describe in your post is almost certainly converging to a local minimum as well. I guess maybe the surprise is that local minima can perform well? But again, this does not seem like such a surprise when you consider the number of parameters being fitted. I do not have much experience with LSTMs per se, but modern conv-nets, for instance, are basically O(10^8)-dimensional nonlinear maps. It just doesn't seem all that unexpected that you can embed a huge number of patterns into such an object.


If it were a smooth map, yes. But neural network error surfaces are messier.

Computing the gradient on random mini-batches ends up helping "average things out". But with more layers, different architectures, and different regularization techniques, you get very different results than you would from the simpler mini-batch-only approach, where my expectation would be more of an "averaging" effect along a smooth surface rather than the kind of learning we see from a neural network.

It's remarkable because it seems you can get away with ignoring certain assumptions about the smoothness of the error surface and bootstrap predictive models from raw (read: cheaper to generate) data rather than features pre-designed to live in a nice space together. A lot of people didn't think you could do that, and the question of why you can remains unanswered.
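
To pin down the "average things out" bit with a toy (my own sketch, least squares rather than a neural net): the expected mini-batch gradient is exactly the full-batch gradient; the noise around that expectation is what SGD rides.

  import numpy as np

  # Least-squares toy: the mean of many mini-batch gradients matches the
  # full-batch gradient, even though each individual mini-batch is noisy.
  rng = np.random.default_rng(0)
  N, D = 10_000, 5
  X = rng.normal(size=(N, D))
  y = X @ rng.normal(size=D) + 0.1 * rng.normal(size=N)
  w = np.zeros(D)

  def grad(idx):
      Xb, yb = X[idx], y[idx]
      return 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)

  full = grad(np.arange(N))
  mb_mean = np.mean([grad(rng.choice(N, size=64, replace=False))
                     for _ in range(2000)], axis=0)

  print(full)
  print(mb_mean)    # close to the full gradient; single batches scatter around it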


Whilst the mathematics is clear, it's not necessarily clear that it should work well in practice.

The weights are initialized as noise, the optimization problem isn't convex so there are numerous local optima to get stuck in, and the distance between the relevant input and output can be dozens or hundreds of timesteps. The last point is quite important, as you can end up with very odd gradients (exploding or vanishing gradients). Troublesome gradients are why traditional RNNs don't work particularly well: they can't actually make the connection between input and output.
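
A back-of-the-envelope way to see the gradient problem (a 1-D linear toy of my own, not anything rigorous):

  # Toy 1-D "RNN": h_t = w * h_{t-1} + x_t. The sensitivity of h_T to an
  # input that arrived T - 1 steps earlier is w^(T-1): one Jacobian factor
  # per timestep. Unless that factor sits very close to 1, the gradient
  # vanishes or explodes.
  T = 100
  for w in (0.9, 1.0, 1.1):
      print("w = %.1f:  d h_T / d x_1 = %.3e" % (w, w ** (T - 1)))

  # w = 0.9 -> ~2.9e-05  (vanishes: the early input is effectively invisible)
  # w = 1.0 ->  1.0e+00
  # w = 1.1 -> ~1.3e+04  (explodes)
  # The LSTM's additive cell update, c_t = f_t * c_{t-1} + i_t * g_t, keeps
  # the per-step factor near 1 when the forget gate is near 1, which is how
  # it carries information across dozens or hundreds of timesteps.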

Others can likely explain the intricacies better :)


An LSTM is slightly different from a standard feedforward unit. An LSTM unit has internal state that depends on past inputs. Gradient descent trains the LSTM not only on which output to produce for a given input, but also on how to update the internal state given an input. That the whole thing should still be differentiable is not intuitive to me.

edit: It's kind of like, if not actually equivalent to, programming a Turing machine by its gradient on training data.
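
For concreteness, here's roughly what one step looks like written out in numpy (my own variable names and shapes, following the standard LSTM formulation rather than anything specific to the post). Every piece is a matrix multiply, a sigmoid/tanh, or an elementwise product, so the state update is a composition of smooth functions and the chain rule goes straight through it:

  import numpy as np

  def sigmoid(z):
      return 1.0 / (1.0 + np.exp(-z))

  def lstm_step(x_t, h_prev, c_prev, W, b):
      """One LSTM step. W: (4*H, D+H), b: (4*H,). Returns (h_t, c_t)."""
      H = h_prev.shape[0]
      z = W @ np.concatenate([x_t, h_prev]) + b
      f = sigmoid(z[0*H:1*H])        # forget gate: how much of c_prev to keep
      i = sigmoid(z[1*H:2*H])        # input gate: how much new info to write
      g = np.tanh(z[2*H:3*H])        # candidate values to write
      o = sigmoid(z[3*H:4*H])        # output gate: how much state to expose
      c_t = f * c_prev + i * g       # internal state update
      h_t = o * np.tanh(c_t)         # what the rest of the network sees
      return h_t, c_t

  D, H = 3, 4
  rng = np.random.default_rng(0)
  W = 0.1 * rng.normal(size=(4 * H, D + H))
  b = np.zeros(4 * H)
  h, c = np.zeros(H), np.zeros(H)
  for x in rng.normal(size=(5, D)):  # unroll over a short input sequence
      h, c = lstm_step(x, h, c, W, b)
  print(h)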


You're solving a problem that isn't convex, in most interesting cases.


colah, your posts combine a deep level of understanding with exceptional clarity. These are both rare, especially in the cargo-cult-driven world of neural networks and deep learning.

I hope you keep writing as much as you can. Thanks!


Thanks colah, that was a very readable walk-through. I've been making my way through Bishop's PRML ch. 5 to get as much of a handle as possible on NNs, but your intro here to LSTMs makes me want to jump ahead and skip to the new stuff :)


Michael, this post nicely completes your book about neural networks. I was a little surprised you didn't write it yourself.


Anybody know what he uses for his diagrams?


All of the diagrams in this post were made in Inkscape, with the LaTeX plugin for equations.


Great explanation, many thanks! Hochreiter is a genius!


> Hochreiter is a genius!

It's really impressive to me how farsighted his work on LSTMs was.

In any case, I'm glad you liked the post.



