What's Wrong with Deep Learning?

discardorama · on June 14, 2015

If someone were to ask me what's wrong with DL (and not that anyone would, since I'm an unknown), I'd say the lack of theory. Most DL results look very hacky to me. Someone says Max Pooling works; someone again comes along and says it's not necessary. Someone says sigmoid or tanh are the best activation functions; someone else says ReLUs are better. And so on. Why? Why is one better than the other?

I'm no biologist, but I don't think our brains are going around trying to do a grid search for the best hyperparameters. Most DL results today are the result of throwing 1000s of Titans on the problem and then sitting back for a week for the beast to cough up a solution.

Tangential nitpick: one (very minor) nit I have with Prof LeCun's presentations is that I don't see him give more credit to Hinton and Schmidhuber. Hinton is mentioned a couple of times (3), but Schmidhuber is totally ignored; for example, when he mentions LSTM, it's cited as [Hochreiter 1997], even though it was a join publication with Schmidhuber. It should be cited as [Hochreiter et 1997], as he does in the very next line.

dave_sullivan · on June 14, 2015

I agree re: lack of theory. No easy answer on that one other than "keep looking and get more people to help". We are making major practical gains along the way (although many are quick to discount those--"that's it???"). It's science in practice, theory follows.

I disagree re: the 1000s of titans thing. Google, Baidu, etc are building large GPU clusters and have basically shown "similar resources = similar results", but everyone else is mostly using single machine--maybe multi-GPU--and doing fine. A single Titan X is a BEAST for deep learning--nobody is using 1000s and you only need 1 for great results on most datasets I've seen.

On the subject of Schmidhuber, I saw him speak once and he spent half the talk explaining how he invented everything he's talking about (EVERYTHING!) and the other half talking about how no one gives him credit. I'm half joking, but I think there's more to his story. Or it's a miscarriage of justice.

rasz_pl · on June 14, 2015

Funny what you say about Schmidhuber. I took Hinton's machine learning class, and he spend more than half of it teaching about things that didnt work.

versteegen · on June 15, 2015

OK, but what's the relevance to Schmidhuber?

Houshalter · on June 14, 2015

Max pooling tests if a feature occurs anywhere in a certain area, rather than being sensitive to the exact location.

ReLUs fit combinations of piece-wise linear functions. Whereas sigmoids are more nonlinear and can be harder to optimize. They were originally continuous approximations of binary threshold functions.

All these things can approximate each other. Neurons can approximate the max function, and ReLUs can approximate sigmoids. So there really isn't much to fret over.

It's like asking for a theory of which programming language is better. In practice they will have different advantages in different domains, but they are all Turing complete.

joe_the_user · on June 15, 2015

It's like asking for a theory of which programming language is better.

There's nothing wrong with just ignoring programming-language theory and just deciding on one, seat of the pants style. But this is because programming as it exists now is a static "art form" with only marginal progress expected.

However, assuming deep learning currently works unexplainably well and one aims to scientifically explain that good working, one would want an explanation which guides one's approach to extending the process.

I've done a bit of applied math, where knowing which kind of function to pull out of one's toolbox for which situation was the really-smart-people's purview, a fairly well guarded folk-knowledge, actually. I'm used to the "little bit of this, little bit of that" kind of explanation for which functions to use when and why. If one weighs them long enough, I assume one can intuitively figure out what to do.

But if we're aiming to advance fundamentally beyond the state-of-the-art, we would aim to quantify these advantages and disadvantages, to automate one more layer. So here we really should know and have a "real" theory here.

nightski · on June 15, 2015

Do you think no one is trying? Should researchers just ignore all results until the underlying theory is found? What if we don't find it for another 50 years? I find it incredibly hard to be critical in this situation.

discardorama · on June 15, 2015

> Max pooling tests if a feature occurs anywhere in a certain area, rather than being sensitive to the exact location.

From Geoff Hinton's AMA on Reddit: The pooling operation used in convolutional neural networks is a big mistake and the fact that it works so well is a disaster.

Houshalter · on June 15, 2015

Hinton doesn't like that pooling loses track of the exact locations where features are located, and just tests if a feature occurs in some area.

The basic effect of this is to decrease the resolution, so it's more tractable to operate on. Without pooling you are stuck with a huge resolution at each layer.

cec · on June 14, 2015

> I'm no biologist, but I don't think our brains are going around trying to do a grid search for the best hyperparameters.

Is that not essentially the process of evolution through natural selection? One giant optimisation problem?

discardorama · on June 14, 2015

But not in the same brain. Sure, evolution happened and picked the right parameters; but today's brain comes with the hyperparameters baked in (with some small amount of randomness). It is able to do all the things it can do without the luxury of parallel training and tuning.

For example: you don't need to show a baby a 1000 photos of mugs before it can tell what's a mug and what isn't. Just show it 1 example of a mug a couple of times, and from then on it's able to identify mugs and mug-like objects pretty reliably.

modeless · on June 14, 2015

> [the brain] is able to do all the things it can do without the luxury of parallel training and tuning.

I beg to differ. Newborn babies can hardly do anything. Their brains are undergoing "parallel training and tuning" 24/7 starting even before they are born. Babies train themselves on thousands of hours of visual stimuli to gain the 3D object recognition capabilities to reliably identify objects such as mugs.

discardorama · on June 15, 2015

Newborn babies can hardly do anything because their parameters have not been set; but their hyperparameters (things like, to use NN analogies: activation function, learning rate, etc.) are baked in.

modeless · on June 15, 2015

Right. Evolution picked the hyperparameters, and experiential learning picks the parameters. (To first order; certainly there are some instinctual behaviors programmed by evolution and I wouldn't be surprised if some hyperparameters are influenced by experience as well.)

rasz_pl · on June 14, 2015

One data point is perfectly fine for NN, it doesnt stop learning after first sample

https://www.youtube.com/watch?v=1GhNXHCQGsM

sgt101 · on June 14, 2015

No, I don't think so. I think that evolution is a complex system in the sense that it creates feedback loops altering the fitness landscapes it optimizes over. Organisms interact with each other through competition and predation and with their environment, for example by liberating oxygen from water. If evolution were about optimization then the current set of organisms would be "fitter" compared with dinosaurs (for example). I don't fancy my chances vs. a T.Rex, more to the point I don't fancy anything that is around at the moment's chances v. a T.Rex!

aswanson · on June 14, 2015

But the T.Rex was a local maxima. I don't fancy a T. Rexes chance against humankind with the technology, intelligence and group organization we have. The T.Rex was optimized for physical force, but that proved useless against asteroids. Humans would have a chance against that type of threat.

ma2rten · on June 14, 2015

Lack of theory is actually mentioned as one of the issues in the presentations.

I don't think your examples are good though, Max polling reduces noise. RuLU learn faster than Sigmoid or tanh.

_ibu9 · on June 14, 2015

> I don't think your examples are good though, Max polling reduces noise. RuLU learn faster than Sigmoid or tanh.

That's not theory, that's just observation of the results. Why should we expect it to work that way?

jokoon · on June 15, 2015

> I'd say the lack of theory.

Well that's what makes it research then. It's a new field, and obviously there needs to be more science and creative thinking involved into drawing theories about deep learning. The field seems already pretty hard for beginners, so of course there will be less scientists involved into making theories.

jhartmann · on June 14, 2015

It is always impressive to me how both Professor LeCun, Bengio & Hinton stuck to their guns and worked on these problems while others were not so interested. I'm very glad Hinton's group at DNN Research was able to blow away the competition in the Imagenet challenge. Now many more people are working on these ideas and some really amazing things have been accomplished in a very short time. I can't wait to see where we are in five years, and I love that LeCun's points out the areas where we should focus and what we are not good at yet.

fizixer · on June 14, 2015

Do you have any information about what were the hot topics just before this new resurgence of NNs (meaning around late 90's, and early 2000's).

I know symbolic AI was big in 60s, and 80s, but not sure about recent past.

oergiR · on June 14, 2015

Probabilistic models. Recent research often focuses on Bayesian models.

Probabilistic models have never really gone away. This presentation by LeCun actually suggests embedding neural networks inside of various types of probabilistic models: factor graphs and conditional random fields. This is, for example, how speech recognition works: the output of a neural network is fed into a probabilistic model (a hidden Markov model).

jhartmann · on June 14, 2015

Actually, state of the art speech recognition has switched over to having a Recursive Neural Network directly run over the audio input. Take a look at the paper at http://arxiv.org/abs/1412.5567 and http://usa.baidu.com/deep-speech-lessons-from-deep-learning/

However combining learning features with other systems is a very powerful approach and combining SVM's on top of the learned features of a Neural Network I would say is common. I personally am more interested in approaches like Deep Fried Convnets (http://arxiv.org/abs/1412.7149) that combine kernel methods as part of the Neural Networks themselves.

agibsonccc · on June 14, 2015

Not to nitpick. I just want people to realize there are actually recursive nets that rely on a parser to be built (this is the recursive net that relies on backpropagation through structure). Then there is the recurrent net (LSTMs,multimodal) that rely on backpropagation through time.

Talking to some of the users of Recursive nets, they will be renaming them to tree rnns which should help clear up confusion a bit.

oergiR · on June 14, 2015

I know that Andrew Ng and colleagues say that they don't use HMMs. I haven't spoken with them (I haven't seen them at speech conferences) so I do not know whether they actually believe this themselves.

I believe the best comparison between "CTC" (which is billed as recurrent neural networks without the HMMs) and the traditional approach is by people at Google, Sak et al, "Learning Acoustic frame labeling for speech recognition with recurrent neural networks", ICASSP 2015. (I can't find a PDF online.)

speechduh · on June 14, 2015

state of the art is still very much using WFSTs and DNNHMMs. IBM and Google are still beating baidu.

pigscantfly · on June 14, 2015

In the case of vision, there were a lot of things going on, but support vector machines, sliding window search, descriptors with invariances to different transforms like SIFT and HOG, spatial pyramids, deformable parts models, and mixtures of gaussians are all hot topics that you'll regularly see in papers from the early 2000's. There was a lot of work on improving runtime for these techniques going on, since training SVM experts and evaluating anything over a sliding window search space is very expensive. You'll also see a lot of work on approaches rooted in graph theory like conditional random fields, different flavors of Markov models, min-cut/max-flow based algorithms, and other types of probabilistic graphical models. Most of this stuff is still in widespread use; AI is a big field.

albertzeyer · on June 14, 2015

Support Vector Machines was one of the hot topics.

discardorama · on June 14, 2015

As far as I remember, everyone was jumping on top of SVMs and Markov Models.

spin · on June 15, 2015

I really like the book "Pattern Recognition and Machine Learning" by Christopher Bishop. It's packed full of all the latest-and-greatest algorithms.

WRT your question, an interesting "feature" of that book is that it was published just before deep neural networks started taking off, so there's no mention of DNNs in the book. You can see what the world was like right when they started taking off.

bra-ket · on June 14, 2015

kernels

Animats · on June 14, 2015

Is that document available in some standard format? The player that's playing it from Google Docs is buggy, and about 20% of the slides display an error message.

"10:01:47.662 Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource at https://drive.google.com/viewerng/img?id=ACFrOgBySwSrGvI-XLL.... (Reason: CORS header 'Access-Control-Allow-Origin' missing).1 <unknown> "

ipsin · on June 14, 2015

Look to the nav bar at the top of the page. The source file is (purportedly) a PDF and there's a download button.

sgt101 · on June 14, 2015

The final 20 or so slides about building general AI using deep learning strike me as really interesting. Seymour Papet said that if you can fit concepts into your cognitive architecture then they are learned. I think that this part of the presentation speaks to a need to demonstrate this as "learning" proper. It's strange, because I believe that this needs to happen, but the idea that you would do it with an "all nn all the way down" architecture, rather than breaking out into a symbolic layer a-la SOAR just seems odd.

codewithcheese · on June 14, 2015

Wow what in incredible amount of knowledge in those slides. Is there a video of the keynote?

sjtrny · on June 14, 2015

CVPR usually makes all talks available when the conference has concluded. Watch this page http://www.pamitc.org/cvpr15/.

chestervonwinch · on June 14, 2015

I would like to see or hear more regarding the theory slides - in particular on the objective being a piecewise polynomial, and the distribution of weights using random matrix theory. Anyone know where I could find more?

selimthegrim · on June 14, 2015

http://arxiv.org/abs/1412.0233 (and the references within he cites by Gerard Ben Arous)

kragen · on June 15, 2015

For some reason, most of these slides say, "¡Vaya! Hubo un problema para cargar la página." Is there a better URL, maybe with the PDF itself? (https://doc-04-4c-docs.googleusercontent.com/docs/securesc/h... doesn't look like it's going to work reliably for other people.)

abecedarius · on June 16, 2015

I downloaded the pdf; will mail you.

Animats · on June 14, 2015

See slides 134-135. It's amazing that works. They get induction without any understanding at all. It still gets the right answer.

Intelligence may be dumber than we thought it was.

rlucente · on June 15, 2015

I have attempted to put together the math stack for deep learning at

http://rlucente.blogspot.com/2014/08/deep-learning-mathemati...

aswanson · on June 14, 2015

Is there a volume summary of research papers or recommended book(s) for DNN?