Personally, I've found that I don't retain much of this sort of material without working through exercises. If you learn the same way, you might want to check out the series of progressive exercises from Andrew Ng here: http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial
For reference, I have a copy of my solutions here: https://github.com/danluu/UFLDL-tutorial. Debugging broken learning algorithms can be tedious in a way that's not particularly educational, so I tried to find a reference I could compare against when I was doing the exercises, and every copy I found had bugs. Hope having this reference helps someone.
I've followed the developments in Neural Networks somewhat, but have never applied deep learning so far. This seems like a good place to ask a couple of questions I've been having for a while.
1. When does it make sense to apply deep learning? Could it potentially be applied successfully to any difficult problem given enough data? Could it also be good at the types of problems that Random Forests and Gradient Boosting Machines are traditionally good at, as opposed to the problems that SVMs are traditionally good at (Computer Vision, NLP)? [1]
2. How much data is enough?
3. What degree of tuning is required to make it work? Are we at the point yet where deep learning works more or less out of the box?
4. Is it fair to say that dropout and maxout always work better in practice? [2]
5. What is the computational effort? How long, e.g., does it take to classify an ImageNet image (on a CPU / GPU)? How long does it take to train a model like that?
6. How on earth does this fit into memory? Say in ImageNet you have (256 pixels * 256 pixels) * (10,000 classes) * 4 bytes = 2.4 GB, for a NN without any hidden layers.
[1] I am overgeneralizing somewhat, I know. It's my way to avoid overfitting.
I don't have great answers to the other questions, though I too am interested in them.
#5) [1] has some Python code and timings mixed into the docs. One such example (stacked denoising autoencoders on MNIST):
By default the code runs 15 pre-training epochs for each layer,
with a batch size of 1. The corruption level for the first layer is
0.1, for the second 0.2 and 0.3 for the third. The pretraining
learning rate is 0.001 and the finetuning learning rate is
0.1. Pre-training takes 585.01 minutes, with an average of 13
minutes per epoch. Fine-tuning is completed after 36 epochs in
444.2 minutes, with an average of 12.34 minutes per epoch. The
final validation score is 1.39% with a testing score of
1.3%. These results were obtained on a machine with an Intel Xeon
E5430 @ 2.66GHz CPU, with a single-threaded GotoBLAS.
#6) The size of the NN is not typically num_features * num_classes, but rather on the order of num_features * hidden_layer_size, times a handful of layers (commonly 3-10 or so). If you want a (multi-class) classifier, you first feed your neural network a bunch of examples, unsupervised. Then once you've got your NN built, you feed the outputs of the NN to a classifier like SVM or SGD. The idea is that the net provides more meaningful features than you would have if you used hand-crafted features or the raw input data itself.
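To make the memory arithmetic concrete, here's a back-of-the-envelope sketch in Python. The hidden-layer sizes are invented purely for illustration, not taken from any real ImageNet model:

    # Rough parameter count for a small fully connected stack on 256x256
    # inputs, with a linear classifier on top. Layer sizes are invented.
    def n_params(layer_sizes):
        # weights + biases between each pair of consecutive layers
        return sum(n_in * n_out + n_out
                   for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

    feature_layers = [256 * 256, 1000, 1000, 1000]  # input + 3 hidden layers
    classifier = [feature_layers[-1], 10000]        # learned features -> classes

    total = n_params(feature_layers) + n_params(classifier)
    print(total, "parameters, about", round(total * 4 / 1e9, 2), "GB at 4 bytes each")

With these made-up sizes the dominant term is the first hidden layer, and the whole thing comes out to a few hundred MB rather than the 2.4 GB figure above, since nothing connects the raw pixels directly to 10,000 classes.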
It's ironic that deep neural networks have become the biggest machine learning breakthrough of 2013: they were also the biggest machine learning breakthrough of 1957. The idea dates back to the Perceptron, one of the oldest ideas in AI.
One thing to note: although there was a lot of initial excitement about Restricted Boltzmann Machines, Auto-encoders, and other unsupervised approaches, the best results in the last year or so have all used the conventional back-propagation algorithm from 1974, with a few tweaks.
http://en.wikipedia.org/wiki/Backpropagation
The history of AI is really interesting. Perceptrons were extremely oversold by their inventor, Frank Rosenblatt, after he introduced them in 1958. This led to a lot of funding and interest in AI and perceptrons. Then, in 1969, Marvin Minsky coauthored a book, Perceptrons, which harshly criticized how underpowered perceptrons were. Most famously, the book proved that a perceptron could not model a simple XOR function. In other words, a technique that many had been led to believe would one day emulate human-like intelligence couldn't even emulate a dead-simple logic gate! The book was devastating and effectively led to a dark age of AI where funding and interest dried up (later, the term "AI winter" was coined).
The next big boom in AI (ignoring some logic/rules-based research in the 70s that I don't think is very interesting from an AI perspective) occurred in the 80s, when computational power increased and researchers discovered/rediscovered neural approaches, including the obscure 1974 research on backpropagation. This led to tons of press and funding from governments who dreamed of killer AI robots and what-not. But, once again, imagination raced ahead of reality and funding dried up when said robots didn't materialize. The field didn't really die off, but funding in AI went way down, leading to another major "AI winter".
I'd say the next big era of AI is the one we're in, driven largely by applied statistics that became known as "machine learning". This has been by far the most successful era, and has probably added 100s of billions of dollars to the economy (I'd argue Google is a machine learning company, for example). I think it's also the most pragmatic era, as people in the field have really learned from the past mistakes of overpromising. In fact, when I was studying "AI" in grad school, my professors warned me to always refer to what I did as machine learning because the concept of "intelligence" was such a joke to so many in the field.
Signal processing mysticism repeats itself every 20 years and has been fueled by tremendous hype since its debut 400 years ago:
1. Linear Regression (which, admittedly, was amazing)
2. Fourier Analysis (which is linear regression on orthonormal bases of functions. it blew people's minds)
3. Perceptrons (which is linear regression but with a logistic loss. it went back to its old name of "logistic regression" once its insane cachet of biological plausibility faded)
4. Neural Networks (stack of logistic regressors. popular with people who didn't know how to filter their inputs through fixed or random nonlinearities before applying linear regression)
5. Self Organizing Maps and Recurrent Nets (which were neural nets that feed back on themselves)
6. Fractals (which is recursion. they were useful for enticing children into math classes)
7. Chaos (which is recursion that's hard to model. useful for movie plots)
8. Wavelets (which is recursive Fourier analysis, and probably still way under-used)
9. Support Vector Machines (which replaces logistic regression's smooth loss with a kink that makes it hard to use a fast optimizer. often conflated with the "kernel trick", which appealed to people who didn't want to pass their inputs through nonlinearities explicitly)
10. Deep Nets (which are bigger neural networks. the jury's out whether they work better because they're deeper, or because they're bigger and require a lot of data to train, or because they require a programmer to spend years developing a learning algorithm for each new dataset. also whether they do actually work better).
Once this Deep Net thing blows over again, my money's on Kernelized Recurrent Deep Self Organizing Maps.
(On a serious note: MNIST is considered a trivial dataset and doesn't require the heavy machinery of deep nets. linear regression on almost any random nonlinearity applied to the data (say f(x;w,t)=cos(w'x+t) with w~N(0,I) and t~U[0,2*pi]) will get you >98% accuracy on MNIST.)
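For the curious, here's a minimal sketch of that claim with scikit-learn. The number of random features and the scale of w are my own untuned guesses (in practice the bandwidth of w matters quite a bit), so treat the accuracy as something to verify rather than a quoted result:

    # Random cosine features ("random kitchen sinks") + a plain linear
    # classifier on MNIST. Feature count and the scale of w are untuned
    # placeholder choices.
    import numpy as np
    from sklearn.datasets import fetch_openml
    from sklearn.linear_model import LogisticRegression

    X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
    X = X / 255.0  # scale pixels to [0, 1]
    X_train, X_test = X[:60000], X[60000:]
    y_train, y_test = y[:60000], y[60000:]

    rng = np.random.RandomState(0)
    n_features = 4000
    W = rng.normal(scale=0.1, size=(X.shape[1], n_features))  # w ~ N(0, sigma^2 I)
    t = rng.uniform(0, 2 * np.pi, size=n_features)            # t ~ U[0, 2*pi]

    def featurize(X):
        # f(x; w, t) = cos(w'x + t)
        return np.cos(X @ W + t)

    clf = LogisticRegression(max_iter=200)  # a plain linear model on top
    clf.fit(featurize(X_train), y_train)
    print("test accuracy:", clf.score(featurize(X_test), y_test))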
There are some gross simplifications here:
8. Wavelets are not "recursive Fourier analysis". If you want to make it simple, it's more like a spatially localized Fourier expansion. I agree that they are under-used though.
9. SVM: the "kernel trick" is a big deal because sometimes defining a vector space of linear features out of non-vector objects won't give you good performance, and you're better off defining a dot product.
10. Deep Nets are not only bigger neural networks. The buzz is about how you train them. It's about improving how you train a net in general.
I would say that the next big thing is more:
Realizing even more that training Neural Nets is an optimization problem and, instead of using some heuristics, waiting for some Russian mathematician to derive the right SGD schedule / batch solver for the problem. Then what 1,000 Google computers have been able to do for the cat face detector, we'll be able to do on a smartphone chip.
People have to realize that Deep Learning is a bit of a "brute force" solution for the moment (each node is a linear model). We need to derive smarter algorithms.
you've actually probably done this yourself. it's often called "featurization". for example, instead of applying a linear learner on vectors x in R^d, you apply it to vectors f(x), where f computes a bunch of features on x. a popular choice for f is the d-th order monomials. hashing families are another good idea (Alex Smola does this). more generally, any random nonlinear function f is a good candidate (i call that analysis "Random Kitchen Sinks"). when x is structured data, f usually just returns counts in histogram bins of some kind.
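For example, a tiny scikit-learn illustration with f as the degree-2 monomials (the dataset and classifier here are arbitrary placeholder choices):

    # Fit the same linear learner on raw x and on f(x), where f computes
    # all degree-2 monomials of x.
    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    X, y = load_digits(return_X_y=True)

    linear_on_x = LogisticRegression(max_iter=2000)
    linear_on_fx = make_pipeline(PolynomialFeatures(degree=2),
                                 LogisticRegression(max_iter=2000))

    print("raw x:          ", cross_val_score(linear_on_x, X, y).mean())
    print("deg-2 monomials:", cross_val_score(linear_on_fx, X, y).mean())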
Yep, the ideas from the 50's have definitely reappeared now that we have the compute power and methods to implement them at a large scale. That article gives a nice perspective.
One of the best breakthroughs has been this notion of layer-wise pretraining, which allows the backpropagation algorithm to not get stuck in local minima so easily. It provides a good guess at the starting points for the weights. Otherwise, the biggest issue with backpropagation historically has been the diffusion of the error signal as the layers increase; it is hard to attribute what portion of the weight update should be applied to each node, since the effect shrinks or grows exponentially with depth. This pretraining idea helps against that.
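Roughly, the greedy recipe looks like the sketch below (using scikit-learn's BernoulliRBM as the unsupervised layer; the hyperparameters are untuned placeholders, and the final supervised backprop fine-tuning pass is left out):

    # Greedy layer-wise pretraining sketch: each RBM is trained
    # unsupervised on the outputs of the layer below, then a classifier
    # is trained on top of the learned features.
    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import BernoulliRBM
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import minmax_scale

    X, y = load_digits(return_X_y=True)
    X = minmax_scale(X)  # RBMs expect inputs in [0, 1]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    model = Pipeline([
        ("rbm1", BernoulliRBM(n_components=256, learning_rate=0.06,
                              n_iter=20, random_state=0)),
        ("rbm2", BernoulliRBM(n_components=128, learning_rate=0.06,
                              n_iter=20, random_state=0)),
        ("clf", LogisticRegression(max_iter=2000)),
    ])
    model.fit(X_tr, y_tr)
    print("test accuracy:", model.score(X_te, y_te))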
That's what I thought too! But according to my friends on the Google Brain team, unsupervised pretraining is now thought to be an irrelevant detour.
In 2006, Hinton introduced greedy layer-wise pretraining, which was intended to solve the problem of backpropagation getting stuck in poor local optima. The theory was that you'd pretrain to find a good initial set of connection weights, then apply backprop to "fine-tune" discriminatively. And the theory seemed correct since the experimental results were good:
http://www.cs.toronto.edu/~hinton/absps/fastnc.pdf
http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS20...
But that same year, a student in Geoff Hinton's lab discovered that if you added information about the 2nd-derivatives of the loss function to backpropagation ("Hessian-free optimization"), you could skip pretraining and get the same or better results:
http://machinelearning.wustl.edu/mlpapers/paper_files/icml20...
And around 2012, a bunch of researchers reported that you don't even need 2nd-derivative information. You just have to initialize the neural net properly. Apparently, all the most recent results in speech recognition just use standard backpropagation with no unsupervised pretraining. (Although people are still trying more complex variants of unsupervised pretraining algorithms, often involving multiple types of layers in the neural network.)
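For anyone wondering what "initialize properly" means in practice: one commonly cited recipe from that period is the scaled random initialization of Glorot & Bengio (2010). A quick sketch, with arbitrary layer sizes:

    # Glorot/Xavier initialization: draw each weight uniformly in [-a, a]
    # with a = sqrt(6 / (fan_in + fan_out)), chosen so activation and
    # gradient variances stay roughly constant from layer to layer.
    import numpy as np

    def glorot_uniform(fan_in, fan_out, rng):
        limit = np.sqrt(6.0 / (fan_in + fan_out))
        return rng.uniform(-limit, limit, size=(fan_in, fan_out))

    rng = np.random.RandomState(0)
    sizes = [784, 500, 500, 10]  # arbitrary layer sizes
    weights = [glorot_uniform(a, b, rng) for a, b in zip(sizes, sizes[1:])]
    biases = [np.zeros(b) for b in sizes[1:]]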
So now, after seven years of work, we're back where we started: the plain ol' backpropagation algorithm from 1974 worked all along.
This whole topic is really interesting to me from a history of science perspective. What other old, discarded ideas from the past might be ripe, now that we have millions of times more data and computation?
Yes this is really interesting. I haven't read those other papers yet (definitely plan on it now thanks for the links), but Bengio's latest paper on denoising autoencoders from earlier this year (http://arxiv.org/abs/1305.6663) still used the unsupervised pretraining. Also the Theano implementation that I run experiments with uses it as well (but that code could be a year or two old).
Definitely going to be researching this more throughout the year.
Very interesting. I was not aware unsupervised pretraining was a distant second to the availability of data and FLOPS. So really, deep learning is essentially the same old MLP of recent peasant-like status (90's). Stacks of backpropagating perceptrons with the ancient logistic regression on top, now with more stacking! This makes sense.
Machine learning is really just a form of non-human scripting. After all, every ML system running on a PC is either Turing equivalent or less. An analogy would be something that tries to generate the minimal set of regular expressions (that match non deterministically) which cover given examples. The advantage of an ML model vs a collection of regexes is many interesting problems are vulnerable to calculus (optimize) or counting (probability, integration etc.)
So like good notation, the stacking allows more complicated things to be said more compactly. But more complicated things need more explanation and more thinking to understand.
> And around 2012, a bunch of researchers reported that you don't even need 2nd-derivative information. You just have to initialize the neural net properly.
This sounds very interesting. How do you properly initialize the weights? Do you have a link to a paper about this?
RBMs and auto-encoders use backprop too. They just use it for "fine tuning" (due to running pre-training first) instead of propagating error derivatives from randomly initialized weights.
(Thus concludes the smartest thing I've said all day.)
Google, Twitter, Netflix, Yelp, Pandora and more are speaking on Deep Learning and RecSys this Friday at MLconf in San Francisco. We're trying to get a streaming solution going as well for those who can't make it. http://mlconf.com
That's fine for me, but I'm getting ligatures for every st. I thought it was deliberate (and a little pretentious really) but you're not getting them so hurray webfont or something I suppose.
It appears to be a bug with a combination of the ::selection pseudo-element in conjunction with the font Georgia. My guess is it's a Chrome-on-Mac bug (Firefox is fine), not a site coding error.
Disabling either the font or the selection style fixes it. Most likely a text rendering issue. At work we've noticed Chrome getting buggier in relation to that, as well as retaining DOM node properties via redraws.
Linear algebra. Bayesian statistics. MUST know these inside out, upside down.
Vector calculus. Convex optimization.
A boatload of machine learning literature. The ideas coalescing into deep learning are based on more than a decade of research.
If you know nothing about math... I can't imagine getting to the point of understanding deep learning (which is a fairly rapidly evolving area) without at least 2-3 years of very hard work.
This class (https://www.coursera.org/course/neuralnets) is a reasonable attempt to give a quick intro to one major source for DBNs; understanding this course is a good benchmark.
This is a really good write-up. For people looking for practical experience with these types of methods, I'd also recommend checking out Theano and/or pylearn2 (which is built with Theano).
Very interesting stuff written in a clear way.
I'm actually finishing my master's thesis on music genre recognition through machine learning, which is focused more on traditional ensemble learning, but I think it would be nice to study deep learning in greater detail.
Thanks!
Can anyone recommend a good book on the topic? And maybe other recent neural network research topics; I'm especially interested in recurrent networks like LSTM.