Stanford Stats 385: Theories of Deep Learning (stats385.github.io)
217 points by capocannoniere on Nov 7, 2017 | 36 comments



I feel like asking: did they solve the problem?

Let me see if I can state the problem: Neural Networks are non-linear because of their activation functions. You need a differentiable function in order to take the derivative so you can back-prop the error, more or less.

The consequence of the non-linearity is that you can't do some kind of short hand calculation to figure out what the network will do. You have to crank through the network to see the result. There is no economy of thought. That is to say, there is no theory.

I am excited that they are working on it, but I would love to have a summary or overview of how they approach what I consider to be the basic problem.


> Neural Networks are non-linear because of their activation functions. You need a differentiable function in order to take the derivative so you can back-prop the error, more or less.

To clarify: linear functions would still be differentiable without an activation function. The problem is that the composition of linear functions is just another linear function, so you gain nothing by having multiple layers; you might as well just do some kind of linear regression or classification. Activation functions introduce nonlinearity so that deep learning methods can hopefully learn things linear methods can't.


> Activation functions introduce nonlinearity so that deep learning methods can hopefully learn things linear methods can't.

I would just like to clarify: deep learning methods definitely DO learn things that linear methods can't.

Using linear functions, no matter how many layers, essentially boils down to the "perceptron" architecture. You can Google it and most results will be about its limitations (for instance, it's famously unable to learn XOR; Google "perceptron XOR"). Essentially, the issue is that XOR is not linearly separable, so it can only be expressed as a nonlinear function.

You can give a simple proof of this by looking at what a neural network layer (without an activation function) does in matrix terms. Take the input as a column vector X (nx1), the layer weights W (mxn), and the output vector O (mx1).

Then computing the layer output is simply O = WX.

Now we can envision what happens with two layers, W (mxn) and V (kxm). The output of running two layers then becomes:

O = V(WX)

However, matrix multiplication is associative:

O = V(WX) = (VW)X, and VW is just a matrix

So for every 2-layer linear neural network there is a 1-layer network that gives the exact same result, so there's no reason to have 2 layers; and by induction the same holds for any number of linear layers.
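
To make this concrete, here's a minimal numpy sketch (my own illustration, not from the course) checking that two stacked linear layers produce exactly the same output as the single collapsed layer VW:

    import numpy as np

    rng = np.random.default_rng(0)

    n, m, k = 4, 5, 3            # input, hidden, and output dimensions
    X = rng.normal(size=(n, 1))  # input as a column vector
    W = rng.normal(size=(m, n))  # first linear layer
    V = rng.normal(size=(k, m))  # second linear layer

    two_layers = V @ (W @ X)     # run the "deep" linear network
    collapsed = (V @ W) @ X      # pre-multiply the weights into one matrix

    print(np.allclose(two_layers, collapsed))  # True: the layers collapse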

A famous related result is that "deep" networks (with activation functions) are universal function approximators (explained here: [1]). Note that this "deep" should be understood in the 1995 meaning of deep neural networks, which is essentially "at least 2 layers", and usually "exactly 2 layers", not the 2010+ one where it means 10-to-200-layer-deep networks.

[1] http://neuralnetworksanddeeplearning.com/chap4.html
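
If you want to play with the construction from [1] yourself, here's a rough numpy sketch (weights hand-picked for illustration, nothing is learned): pairs of very steep sigmoids form "bumps", and a single hidden layer made of such bumps approximates an arbitrary 1-D function.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def bump(x, left, right, height, steepness=1000.0):
        # Two steep sigmoids approximate step functions at `left` and `right`;
        # their difference is (roughly) a rectangular bump of the given height.
        return height * (sigmoid(steepness * (x - left))
                         - sigmoid(steepness * (x - right)))

    # Approximate f(x) = sin(2*pi*x) on [0, 1] with a sum of bumps, i.e. a
    # single hidden layer of sigmoid units whose weights were chosen by hand.
    xs = np.linspace(0.0, 1.0, 500)
    edges = np.linspace(0.0, 1.0, 41)          # 40 bumps = 80 hidden units
    centers = (edges[:-1] + edges[1:]) / 2
    approx = sum(bump(xs, l, r, np.sin(2 * np.pi * c))
                 for l, r, c in zip(edges[:-1], edges[1:], centers))

    # Max approximation error; it shrinks as you add more (narrower) bumps.
    print(np.max(np.abs(approx - np.sin(2 * np.pi * xs))))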


Wait a second, a 2-layer perceptron can solve the XOR problem. The first layer maps it to a linearly separable problem which the second layer solves.

Furthermore, a 2-layer network with non-linear activations can approximate any arbitrary function, so the argument for deeper networks is learnability.

https://en.wikipedia.org/wiki/Universal_approximation_theore...


Given that every perceptron layer does a linear transformation of the input, I have to disagree. There is no linear function that separates XOR, and that also means there is no sequence of linear functions that separates XOR.

That means there is no 1-layer perceptron that can learn XOR, and there is no multilayer perceptron that can learn XOR.


The XOR problem is famously solvable by adding a layer to a single layer perceptron assuming a unit step function. This is a very basic exercise taught in many intro courses.

I agree with every second sentence:

"Given that every perceptron layer does a linear transformation of the input" <- True

"There is no linear function that separates XOR, and that also means there is no sequence of linear functions that separates XOR." <- False

"That means there is no 1-layer perceptron that can learn XOR," <- True

"and there is no multilayer perceptron that can learn XOR." <- False

Please look it up. Here are a few links:

http://toritris.weebly.com/perceptron-5-xor-how--why-neurons...

The graph in slide 3 of this link helps explain it: http://www.di.unito.it/~cancelli/retineu11_12/FNN.pdf

http://www.mind.ilstu.edu/curriculum/artificial_neural_net/x...
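
And for completeness, a tiny sketch of the standard construction from those links (weights hand-picked, unit step activations, nothing is learned): the hidden layer computes OR and AND of the inputs, and the output unit fires when OR is true but AND is false, which is exactly XOR.

    import numpy as np

    def step(z):
        # Unit step activation used in the classic perceptron.
        return (z >= 0).astype(int)

    def xor_net(x1, x2):
        x = np.array([x1, x2])
        # Hidden layer: first unit computes OR(x1, x2), second computes AND(x1, x2).
        W = np.array([[1.0, 1.0],
                      [1.0, 1.0]])
        b = np.array([-0.5, -1.5])
        h = step(W @ x + b)
        # Output unit: fires when OR is true and AND is false, i.e. XOR.
        v = np.array([1.0, -2.0])
        return step(v @ h - 0.5)

    for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(x1, x2, "->", xor_net(x1, x2))  # 0, 1, 1, 0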


Ah I see. My confusion comes from what are called multilayer perceptrons, which do have activation functions. Presumably that is done exactly because multiple layers don't make sense without them.

But that makes multilayer perceptrons different from ordinary perceptrons in more than just the multilayer part, which is very confusing.


When I wrote that, I was thinking of the unit step function, for which the derivative is not defined.


There are two approaches to this type of problem, bottom-up and top-down:

Bottom-up is the preferred approach because you can derive everything from first principles; math is always bottom-up. Physics tries to be but occasionally fails.

Top-down is investigatory, you have something that exists and you want to understand it. Chemistry, biology, and genetics are excellent examples of systems that we can't derive from first principles that we've done well with using the top-down approach.

Charting the space of the meta-problem (this config --> this behavior) is a valuable first step in better understanding these structures even if we can never derive their behavior.


Why is the lack of theory a problem? At some point we have to accept that some problems are "out of our league" and use whatever is available even without fully understanding it. We can't even understand simple specializations in computer vision, yet we expect to understand a more general method? Even 15 years ago there were math theorems proven by enumerating cases on a computer. I understand it bruises the psyche and pride of some scientists, but so what? The universe can't be expected to fit into humanity's collective brain.


Because it's much easier to work with and improve something if there's a theory behind it?

There's a spectrum between "just keep trying tons of crap and see what works" and "do this simple, well-understood calculation to see exactly what will work." Is it not obvious why it's nicer to be on the latter end of the spectrum than the former?


Sure, but currently deep learning is more like experimental physics. You try stuff and see what works and empirically improve your understanding. Then you can generalize some heuristics from this and use that "recipe" in the future. You figured out ReLU suddenly made something work, yet Swish turned out better so you can forget about ReLU now. And as you can treat deep learning (in supervised mode) as non-linear optimization, I doubt we'll come up with a proper theory unless P=NP. We can't even understand far simpler non-linear optimization problems, not to mention ones which can be arbitrarily parametrized in 2nd order...


But to just restate the comment you're replying to, clearly theory can prune a lot of branches on the "iterative experiment-based design refinement" method you are proposing.

Also, I'll mention that you're being too pessimistic about what theory can accomplish. The learning problem is much more constrained than generic NP-hard optimization.

The use, for example, of a training set of N examples, drawn iid, and evaluation on samples drawn from the same distribution imposes a lot of structure. I don't know if you're familiar with VC theory, but it's an example of the kind of "surprising" guarantees that can be derived in this setting. Other general examples are weak learning, the bias/variance tradeoff, and (in SVMs) the notion of large margin classifiers.
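
For a flavor of the kind of guarantee VC theory gives (I'm quoting the standard bound from memory, so treat the constants as approximate): with probability at least 1 - delta over a draw of N i.i.d. training examples, every classifier f in a hypothesis class of VC dimension h satisfies

    R(f) \le \hat{R}_N(f) + \sqrt{ \frac{ h\,(\ln(2N/h) + 1) + \ln(4/\delta) }{ N } }

where R is the true risk and \hat{R}_N the training error. The point is that generalization is controlled by capacity relative to sample size, uniformly over the data distribution; that's structure a generic non-linear optimization problem doesn't have.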

An applicable "theory" of design is what separates engineering from just mucking around.


>> Sure, but currently deep learning is more like experimental physics. You try stuff and see what works and empirically improve your understanding.

The way I understand experimental physics, its purpose is either (a) to make observations on which new theories can be built or (b) to experimentally verify, or falsify, existing theories.

I don't think it's quite like throwing stuff at a wall and looking to see what sticks.


Inscrutable models? Check. "I know it when I see it" success criteria? Check. Unjustified optimism? Check.

Sounds more like alchemy than experimental physics to me.


>> It's not like 15 years ago there weren't math theorems proven by enumerating them on a computer.

Well, the reason why we have computers today in the first place is because Turing (and Church, btw) solved a long-standing problem in mathematical logic: the Entscheidungsproblem, Hilbert's decision problem. That was a theoretical problem, and his solution was also purely theoretical: Turing machines (Church's solution was the lambda calculus; btw, both solutions turned on a new problem, the halting problem. I digress).

Without these advances in theory, do you think we could have ever invented computers? If you do, you shouldn't. Instead, you should try to imagine trying to invent a computer by enumerating math theorems without a computer.

Today of course, we have computers. However, we also still have problems that cannot be solved by brute-force enumeration of solutions- just like pre-computer mathematicians had problems they couldn't solve just by enumerating solutions without a computer. For this type of problem, we have to use our brains and come up with theoretical solutions in order to make progress.


I wouldn't consider it "a problem," but a theory is surely better than the lack of one.

It may be out of our league, or we may just need a genius to help figure it out.

Calculus and Newtonian Mechanics and Relativity were probably out of our league before they weren't.


Is that the only problem? Also, why make a network a certain depth and not deeper? Depth and its relation to the degree of the estimated polynomial is well understood. What's less understood is why not go deeper.


I didn't mean to say that it is the only problem. It is more like the "meta problem" as far as theory is concerned.

There are many parameters (layers, nodes, connectivity, dropout, number of trials, etc.). Right now they are guessed at by trial and error. The thing preventing you from saying, "Oh, just reduce the layers by 1 and add more nodes" is the non-linearity.


Does anybody know if they plan on releasing the lecture videos? I couldn't find them on the site and this looks very interesting.


They are being recorded and posted to YouTube as unlisted videos. You can find links to some of them on Twitter[1] and ResearchGate[2]. It looks like highlights of the lectures are being posted to Twitter and the full lectures are being posted to ResearchGate.

[1] https://twitter.com/stats385

[2] https://www.researchgate.net/project/Theories-of-Deep-Learni...


Oh, those full lectures on ResearchGate are awesome. Thanks for sharing!


I'm a bit surprised that Soatto's and Tishby's papers aren't on the reading list (https://stats385.github.io/readings). I think they have some of the most interesting theories, at the moment, about why Deep Learning works.

https://arxiv.org/abs/1706.01350

https://openreview.net/pdf?id=ry_WPG-A-


Is anyone looking into/using algebraic topology and sheaves to analyze or interpret these deep networks?


What a privilege to study at that college.


From the downvotes, I suppose people thought I was being sarcastic. Quite the contrary, I was really in awe about what a cool lecture (series) this is! If you're into that kind of topic, of course, YMMV. But if you are, you will agree that this is an excellent class -- it underlines (not surprisingly) what an outstanding place to study Stanford is.


I didn't down-vote you but I suspect people reacted negatively because the implication is you have to stop learning after college. They may feel you can always learn it at any time. You can take deep learning and math courses online etc.


Around here, he was likely downvoted by all the Stanford grads who thought he was being sarcastic. Or by the dropout hackers who make just as much as their peers without having gone to college. Who knows :)


> I didn't down-vote you but I suspect people reacted negatively because the implication is you have to stop learning after college.

How on earth does their comment imply that?


I misread it as "to study that at college" instead of "study at that college"!

I now have no idea why it was down-voted.


Why? All the material is available online.


Do you not think there is a difference between having access to course material online, and being able to interact with the people who teach that stuff and do research in that area?


Certainly there is. However, how much interaction do you expect to get if you are one of hundreds of students? Also, there are probably other people who also teach/research this. It might be more relevant if you are doing a PhD.


Can general public attend these sessions?


Technically, no, because you have to be enrolled as a student (not just a student of life ;).

However, Quora seems to think you may be able to sneak in, at least sometimes:

https://www.quora.com/What-does-Stanford-do-about-non-Stanfo...


The pictures on the first page look exactly like what I expect graduate students and professors of Stats to look like.



