Personally, I've found that I don't retain much of this sort of material without working through exercises. If you learn the same way, you might want to check out the series of progressive exercises from Andrew Ng here: http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial
For reference, I have a copy of my solutions here: https://github.com/danluu/UFLDL-tutorial. Debugging broken learning algorithms can be tedious in a way that's not particularly educational, so I tried to find a reference I could compare against when I was doing the exercises, and every copy I found had bugs. Hope having this reference helps someone.
I've followed the developments in Neural Networks somewhat, but have never applied deep learning so far. This seems like a good place to ask a couple of questions I've been having for a while.
1. When does it make sense to apply deep learning? Could it potentially be applied successfully to any difficult problem given enough data? Could it also be good at the types of problems that Random Forests and Gradient Boosting Machines are traditionally good at, as opposed to the problems that SVMs are traditionally good at (Computer Vision, NLP)? [1]
2. How much data is enough?
3. What degree of tuning is required to make it work? Are we at the point yet where deep learning works more or less out of the box?
4. Is it fair to say that dropout and maxout always work better in practice? [2]
5. What is the computational effort? How long, e.g., does it take to classify an ImageNet image (on a CPU / GPU)? How long does it take to train a model like that?
6. How on earth does this fit into memory? Say in ImageNet you have (256 pixels * 256 pixels) * (10,000 classes) * 4 bytes = 2.4 GB, for a NN without any hidden layers.
[1] I am overgeneralizing somewhat, I know. It's my way to avoid overfitting.
I don't have great answers to the other questions, though I too am interested in them.
#5) [1] has some Python code and timings mixed into the docs. One such example (stacked denoising autoencoders on MNIST):
By default the code runs 15 pre-training epochs for each layer,
with a batch size of 1. The corruption level for the first layer is
0.1, for the second 0.2 and 0.3 for the third. The pretraining
learning rate is 0.001 and the finetuning learning rate is
0.1. Pre-training takes 585.01 minutes, with an average of 13
minutes per epoch. Fine-tuning is completed after 36 epochs in
444.2 minutes, with an average of 12.34 minutes per epoch. The
final validation score is 1.39% with a testing score of
1.3%. These results were obtained on a machine with an Intel Xeon
E5430 @ 2.66GHz CPU, with a single-threaded GotoBLAS.
#6) The size of the NN is not typically num_features * num_classes, but rather on the order of num_features * hidden_layer_size, times a handful of layers (commonly 3-10 or so). If you want a (multi-class) classifier, you first feed your neural network a bunch of examples, unsupervised. Then once you've got your NN built, you feed the outputs of the NN to a classifier like SVM or SGD. The idea is that the net provides more meaningful features than you would have if you used hand-crafted features or the raw input data itself.
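To make the memory arithmetic concrete, here's a back-of-the-envelope sketch in Python. The hidden-layer sizes are invented purely for illustration, not taken from any real ImageNet model:

    # Rough parameter count for a small fully connected stack on 256x256
    # inputs, with a linear classifier on top. Layer sizes are invented.
    def n_params(layer_sizes):
        # weights + biases between each pair of consecutive layers
        return sum(n_in * n_out + n_out
                   for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

    feature_layers = [256 * 256, 1000, 1000, 1000]  # input + 3 hidden layers
    classifier = [feature_layers[-1], 10000]        # learned features -> classes

    total = n_params(feature_layers) + n_params(classifier)
    print(total, "parameters, about", round(total * 4 / 1e9, 2), "GB at 4 bytes each")

With these made-up sizes the dominant term is the first hidden layer, and the whole thing comes out to a few hundred MB rather than the 2.4 GB figure above, since nothing connects the raw pixels directly to 10,000 classes.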
It's ironic that deep neural networks have become the biggest machine learning breakthrough of 2013: they were also the biggest machine learning breakthrough of 1957. The idea dates back to the Perceptron, one of the oldest ideas in AI.
One thing to note: although there was a lot of initial excitement about Restricted Boltzmann Machines, Auto-encoders, and other unsupervised approaches, the best results in the last year or so have all used the conventional back-propagation algorithm from 1974, with a few tweaks.
http://en.wikipedia.org/wiki/Backpropagation
The history of AI is really interesting. Perceptrons were extremely oversold by their inventor, Frank Rosenblatt, after he introduced them in 1958. This led to a lot of funding and interest in AI and perceptrons. Then, in 1969, Marvin Minsky coauthored a book, Perceptrons, which harshly criticized how underpowered perceptrons were. Most famously, the book proved that a perceptron could not model a simple XOR function. In other words, a technique that many had been led to believe would one day emulate human-like intelligence couldn't even emulate a dead-simple logic gate! The book was devastating and effectively led to a dark age of AI where funding and interest dried up (later, the term "AI winter" was coined).
The next big boom in AI (ignoring some logic/rules-based research in the 70s that I don't think is very interesting from an AI perspective) occurred in the 80s, when computational power increased and researchers discovered/rediscovered neural approaches, including the obscure 1974 research on backpropagation. This led to tons of press and funding from governments who dreamed of killer AI robots and what-not. But, once again, imagination raced ahead of reality and funding dried up when said robots didn't materialize. The field didn't really die off, but funding in AI went way down, leading to another major "AI winter".
I'd say the next big era of AI is the one we're in, driven largely by applied statistics that became known as "machine learning". This has been by far the most successful era, and has probably added 100s of billions of dollars to the economy (I'd argue Google is a machine learning company, for example). I think it's also the most pragmatic era, as people in the field have really learned from the past mistakes of overpromising. In fact, when I was studying "AI" in grad school, my professors warned me to always refer to what I did as machine learning because the concept of "intelligence" was such a joke to so many in the field.
Signal processing mysticism repeats itself every 20 years and has been fueled by tremendous hype since its debut 400 years ago:
1. Linear Regression (which, admittedly, was amazing)
2. Fourier Analysis (which is linear regression on orthonormal bases of functions. it blew people's minds)
3. Perceptrons (which is linear regression but with a logistic loss. it went back to its old name of "logistic regression" once its insane cachet of biological plausibility faded)
4. Neural Networks (stack of logistic regressors. popular with people who didn't know how to filter their inputs through fixed or random nonlinearities before applying linear regression)
5. Self Organizing Maps and Recurrent Nets (which were neural nets that feed back on themselves)
6. Fractals (which is recursion. they were useful for enticing children into math classes)
7. Chaos (which is recursion that's hard to model. useful for movie plots)
8. Wavelets (which is recursive Fourier analysis, and probably still way under-used)
9. Support Vector Machines (which replaces logistic regression's smooth loss with a kink that makes it hard to use a fast optimizer. often conflated with the "kernel trick", which appealed to people who didn't want to pass their inputs through nonlinearities explicitly)
10. Deep Nets (which are bigger neural networks. the jury's out whether they work better because they're deeper, or because they're bigger and require a lot of data to train, or because they require a programmer to spend years developing a learning algorithm for each new dataset. also whether they do actually work better).
Once this Deep Net thing blows over again, my money's on Kernelized Recurrent Deep Self Organizing Maps.
(On a serious note: MNIST is considered a trivial dataset and doesn't require the heavy machinery of deep nets. linear regression on almost any random nonlinearity applied to the data (say f(x;w,t)=cos(w'x+t) with w~N(0,I) and t~U[0,2*pi]) will get you >98% accuracy on MNIST.)
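For the curious, here's a minimal sketch of that claim with scikit-learn. The number of random features and the scale of w are my own untuned guesses (in practice the bandwidth of w matters quite a bit), so treat the accuracy as something to verify rather than a quoted result:

    # Random cosine features ("random kitchen sinks") + a plain linear
    # classifier on MNIST. Feature count and the scale of w are untuned
    # placeholder choices.
    import numpy as np
    from sklearn.datasets import fetch_openml
    from sklearn.linear_model import LogisticRegression

    X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
    X = X / 255.0  # scale pixels to [0, 1]
    X_train, X_test = X[:60000], X[60000:]
    y_train, y_test = y[:60000], y[60000:]

    rng = np.random.RandomState(0)
    n_features = 4000
    W = rng.normal(scale=0.1, size=(X.shape[1], n_features))  # w ~ N(0, sigma^2 I)
    t = rng.uniform(0, 2 * np.pi, size=n_features)            # t ~ U[0, 2*pi]

    def featurize(X):
        # f(x; w, t) = cos(w'x + t)
        return np.cos(X @ W + t)

    clf = LogisticRegression(max_iter=200)  # a plain linear model on top
    clf.fit(featurize(X_train), y_train)
    print("test accuracy:", clf.score(featurize(X_test), y_test))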
There are some gross simplifications here:
8. Wavelets are not "recursive Fourier analysis". If you want to make it simple, it's more like a spatially localized Fourier expansion. I agree that they are under-used though.
9. SVM: the "kernel trick" is a big deal because sometimes defining a vector space of linear features out of non-vector objects won't give you good performance, and you're better off defining a dot product.
10. Deep Nets are not only bigger neural networks. The buzz is about how you train them. It's about improving how you train a net in general.
I would say that the next big thing is more:
Realizing even more that training Neural Nets is an optimization problem and, instead of using some heuristics, waiting for some Russian mathematician to derive the right SGD schedule / batch solver for the problem. Then what 1,000 Google computers have been able to do for the cat face detector, we'll be able to do on a smartphone chip.
People have to realize that Deep Learning is a bit of a "brute force" solution for the moment (each node is a linear model). We need to derive smarter algorithms.
you've actually probably done this yourself. it's often called "featurization". for example, instead of applying a linear learner on vectors x in R^d, you apply it to vectors f(x), where f computes a bunch of features on x. a popular choice for f is the d-th order monomials. hashing families are another good idea (Alex Smola does this). more generally, any random nonlinear function f is a good candidate (i call that analysis "Random Kitchen Sinks"). when x is structured data, f usually just returns counts in histogram bins of some kind.
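For example, a tiny scikit-learn illustration with f as the degree-2 monomials (the dataset and classifier here are arbitrary placeholder choices):

    # Fit the same linear learner on raw x and on f(x), where f computes
    # all degree-2 monomials of x.
    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    X, y = load_digits(return_X_y=True)

    linear_on_x = LogisticRegression(max_iter=2000)
    linear_on_fx = make_pipeline(PolynomialFeatures(degree=2),
                                 LogisticRegression(max_iter=2000))

    print("raw x:          ", cross_val_score(linear_on_x, X, y).mean())
    print("deg-2 monomials:", cross_val_score(linear_on_fx, X, y).mean())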
Yep, the ideas from the 50's have definitely reappeared now that we have the compute power and methods to implement them at a large scale. That article gives a nice perspective.
One of the best breakthroughs has been this notion of layer-wise pretraining, which allows the backpropagation algorithm to not get stuck in local minima so easily. It provides a good guess at the starting points for the weights. Otherwise, the biggest issue with backpropagation historically has been the diffusion of the error signal as the layers increase; it is hard to attribute what portion of the weight update should be applied to each node, since the effect shrinks or grows exponentially with depth. This pretraining idea helps against that.
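Roughly, the greedy recipe looks like the sketch below (using scikit-learn's BernoulliRBM as the unsupervised layer; the hyperparameters are untuned placeholders, and the final supervised backprop fine-tuning pass is left out):

    # Greedy layer-wise pretraining sketch: each RBM is trained
    # unsupervised on the outputs of the layer below, then a classifier
    # is trained on top of the learned features.
    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import BernoulliRBM
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import minmax_scale

    X, y = load_digits(return_X_y=True)
    X = minmax_scale(X)  # RBMs expect inputs in [0, 1]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    model = Pipeline([
        ("rbm1", BernoulliRBM(n_components=256, learning_rate=0.06,
                              n_iter=20, random_state=0)),
        ("rbm2", BernoulliRBM(n_components=128, learning_rate=0.06,
                              n_iter=20, random_state=0)),
        ("clf", LogisticRegression(max_iter=2000)),
    ])
    model.fit(X_tr, y_tr)
    print("test accuracy:", model.score(X_te, y_te))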
That's what I thought too! But according to my friends on the Google Brain team, unsupervised pretraining is now thought to be an irrelevant detour.
In 2006, Hinton introduced greedy layer-wise pretraining, which was intended to solve the problem of backpropagation getting stuck in poor local optima. The theory was that you'd pretrain to find a good initial set of connection weights, then apply backprop to "fine-tune" discriminatively. And the theory seemed correct since the experimental results were good:
http://www.cs.toronto.edu/~hinton/absps/fastnc.pdf
http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS20...
But that same year, a student in Geoff Hinton's lab discovered that if you added information about the 2nd-derivatives of the loss function to backpropagation ("Hessian-free optimization"), you could skip pretraining and get the same or better results:
http://machinelearning.wustl.edu/mlpapers/paper_files/icml20...
And around 2012, a bunch of researchers reported that you don't even need 2nd-derivative information. You just have to initialize the neural net properly. Apparently, all the most recent results in speech recognition just use standard backpropagation with no unsupervised pretraining. (Although people are still trying more complex variants of unsupervised pretraining algorithms, often involving multiple types of layers in the neural network.)
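For anyone wondering what "initialize properly" means in practice: one commonly cited recipe from that period is the scaled random initialization of Glorot & Bengio (2010). A quick sketch, with arbitrary layer sizes:

    # Glorot/Xavier initialization: draw each weight uniformly in [-a, a]
    # with a = sqrt(6 / (fan_in + fan_out)), chosen so activation and
    # gradient variances stay roughly constant from layer to layer.
    import numpy as np

    def glorot_uniform(fan_in, fan_out, rng):
        limit = np.sqrt(6.0 / (fan_in + fan_out))
        return rng.uniform(-limit, limit, size=(fan_in, fan_out))

    rng = np.random.RandomState(0)
    sizes = [784, 500, 500, 10]  # arbitrary layer sizes
    weights = [glorot_uniform(a, b, rng) for a, b in zip(sizes, sizes[1:])]
    biases = [np.zeros(b) for b in sizes[1:]]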
So now, after seven years of work, we're back where we started: the plain ol' backpropagation algorithm from 1974 worked all along.
This whole topic is really interesting to me from a history of science perspective. What other old, discarded ideas from the past might be ripe, now that we have millions of times more data and computation?
Yes this is really interesting. I haven't read those other papers yet (definitely plan on it now thanks for the links), but Bengio's latest paper on denoising autoencoders from earlier this year (http://arxiv.org/abs/1305.6663) still used the unsupervised pretraining. Also the Theano implementation that I run experiments with uses it as well (but that code could be a year or two old).
Definitely going to be researching this more throughout the year.
Very interesting. I was not aware unsupervised pretraining was a distant second to the availability of data and FLOPS. So really, deep learning is essentially the same old MLP of recent peasant-like status (90's). Stacks of backpropagating perceptrons with the ancient logistic regression on top, now with more stacking! This makes sense.
Machine learning is really just a form of non-human scripting. After all, every ML system running on a PC is either Turing equivalent or less. An analogy would be something that tries to generate the minimal set of regular expressions (that match non deterministically) which cover given examples. The advantage of an ML model vs a collection of regexes is many interesting problems are vulnerable to calculus (optimize) or counting (probability, integration etc.)
So like good notation, the stacking allows more complicated things to be said more compactly. But more complicated things need more explanation and more thinking to understand.
> And around 2012, a bunch of researchers reported that you don't even need 2nd-derivative information. You just have to initialize the neural net properly.
This sounds very interesting. How do you properly initialize the weights? Do you have a link to a paper about this?
RBMs and auto-encoders use backprop too. They just use it for "fine tuning" (due to running pre-training first) instead of propagating error derivatives from randomly initialized weights.
(Thus concludes the smartest thing I've said all day.)
Google, Twitter, Netflix, Yelp, Pandora and more are speaking on Deep Learning and RecSys this Friday at MLconf in San Francisco. We're trying to get a streaming solution going as well for those who can't make it. http://mlconf.com
That's fine for me, but I'm getting ligatures for every st. I thought it was deliberate (and a little pretentious really) but you're not getting them so hurray webfont or something I suppose.
It appears to be a bug with a combination of the ::selection pseudo-element in conjunction with the font Georgia. My guess is it's a Chrome-on-Mac bug (Firefox is fine), not a site coding error.
Disabling either the font or the selection style fixes it. Most likely a text rendering issue. At work we've noticed Chrome getting buggier in relation to that, as well as retaining DOM node properties via redraws.
Linear algebra. Bayesian statistics. MUST know these inside out, upside down.
Vector calculus. Convex optimization.
A boatload of machine learning literature. The ideas coalescing into deep learning are based on more than a decade of research.
If you know nothing about math... I can't imagine getting to the point of understanding deep learning (which is a fairly rapidly evolving area) without at least 2-3 years of very hard work.
This class (https://www.coursera.org/course/neuralnets) is a reasonable attempt to give a quick intro to one major source for DBNs; understanding this course is a good benchmark.
This is a really good write-up. For people looking for practical experience with these types of methods, I'd also recommend checking out Theano and/or pylearn2 (which is built with Theano).
Very interesting stuff written in a clear way.
I'm actually finishing my master's thesis on music genre recognition through machine learning, which is focused more on traditional ensemble learning, but I think it would be nice to study deep learning in greater detail.
Thanks!
Can anyone recommend a good book on the topic? And maybe other recent neural network research topics; I'm especially interested in recurrent networks like LSTM.