Failures of Deep Learning (arxiv.org)
210 points by stochastician on March 24, 2017 | 44 comments



Just yesterday there was a big discussion on here about how academic papers needlessly complicate simple ideas, mainly by replacing nice explanations with impenetrable math notation in an attempt to seem more formal. This paper is very guilty of it.

E.g. on page 5 they attempt to explain a really simple idea: they generate images of random lines at random angles, label each line as a positive or negative example based on whether its angle is greater than 90 degrees, then take sets of these examples and label each set based on whether it contains an even or odd number of positive examples.
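Written out as code, the whole setup fits in a few lines. This is just a rough sketch of my reading of it (the names and defaults like n_lines are mine, and I'm skipping the actual rendering of the line images):

    import numpy as np

    def make_set_example(n_lines=4, threshold_deg=90, rng=None):
        if rng is None:
            rng = np.random.default_rng()
        # One random angle per line (actually drawing the line images is omitted here).
        angles = rng.uniform(0, 180, size=n_lines)
        # Label each line +1 / -1 by whether its angle exceeds 90 degrees.
        line_labels = np.where(angles > threshold_deg, 1, -1)
        # Label the whole set by whether it contains an even or odd number of positive lines.
        set_label = 1 if (line_labels == 1).sum() % 2 == 0 else -1
        return angles, line_labels, set_label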

They take several paragraphs, over half a page, to explain this, filled with dense mathematical notation. If you don't know what symbols like U, :, ->, or ~ mean, you are screwed, because that's not googleable. It takes way longer to parse than it should, especially since I just wanted to quickly skim the ideas, not painfully reverse engineer them.

Hell, even the concept of even or odd is pointlessly redefined and complicated as multiplying +1's and -1's together. I was scratching my head for a few minutes just trying to figure out what the purpose of that was. It's like reading bad code without any comments: even if you are very familiar with the language and know what the code does, it takes a lot of effort to figure out why it's written that way if it's not explained properly.

The worst part is, no one ever complains about this stuff because they are afraid of looking stupid. I sure fear that by posting this very comment. I actually am familiar with the notation used in this example. I still find it unnecessary and exhausting to decode.


May I ask what your background is? I come from a maths + stats background and the "impenetrable math notation" is literally second nature to me now: I can read and understand it with far more speed and ease than I do any programming language.

Academic papers make the very reasonable assumption that you are familiar with basic mathematical notation and, probably, with the basics of your field.

Their intended audience already knows these things; criticising these papers because, essentially, "I don't like the notation" is a pretty useless exercise.

Side note: maths papers used to be written with words instead of notation, but we dropped that, because it was inefficient and difficult to read and understand.


Math notation isn't inherently bad. Even in this example, it could be justified. If it was accompanied by any sort of explanation of what it meant.

No matter how familiar you are with the notation, it's always going to be less clear and take time to unravel. "OK, I see the author is multiplying a bunch of numbers together? Why are they doing this? [Reads over it 10 more times]... Oh... it's to tell if it's even or odd. Why on Earth couldn't they just say that.."

And literally every use of notation in this paper is like that. I was incredibly fortunate they decided to stop at some point and just say they then draw a line with those parameters. They could have kept going and defined a procedure for drawing lines. I would not have been remotely surprised.

I believe it's entirely about signalling. Math notation looks professional and academic, just like describing yourself as "we". There's also something called the illusion of transparency: a bias where people believe they are much more understandable than they actually are. If you explain an idea to someone, you will expect them to have a much higher chance of understanding it than they actually do. I believe people who write papers are incredibly guilty of that.

And every freaking academic paper is like this. So many papers pointlessly give exact equations for a neural network instead of just saying "we created a 3-layer neural net with a softmax output, trained with stochastic gradient descent." But figuring out that that's what the equations describe is going to take 15 minutes and a lot of confusion.
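For comparison, that one sentence pins the model down about as tightly as code would. Something like this (a rough sketch; the layer sizes here are made up by me purely for illustration):

    import torch
    import torch.nn as nn

    # A 3-layer net with a softmax output, trained with plain SGD.
    # Input/hidden/output sizes (784, 128, 10) are invented for the example.
    model = nn.Sequential(
        nn.Linear(784, 128), nn.ReLU(),
        nn.Linear(128, 128), nn.ReLU(),
        nn.Linear(128, 10),  # softmax is folded into the cross-entropy loss below
    )
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)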

Really, imagine a programmer who explained his ideas entirely in code without any comments, with one-letter, undescriptive variable names, and who used as much premature optimization as possible to obscure it further. I doubt such a post would make it to the top of HN. But this crap regularly does.


> If it was accompanied by any sort of explanation of what it meant.

Most of the paper is just text, explaining their results in plain English, interspersed with technical vocabulary.

> "OK, I see the author is multiplying a bunch of numbers together? Why are they doing this? [Reads over it 10 more times]... Oh... it's to tell if it's even or odd. Why on Earth couldn't they just say that.."

Using powers of -1 to test for parity is a fairly standard trick, and they explain it right at the bottom of page 2. Did you overlook that, were you unfamiliar with this use, or is the explanation not clear enough? Only one of these possibilities can be blamed on the authors.

Or maybe you were talking about the formulas at the top of page 4, where they are multiplying two of those random parity functions together? That's not really to check whether they are even or odd, it's to prove their mutual orthogonality, which it says in the sentence above. In these transformations, the notation using powers of -1 is very convenient, since it allows one to apply simple algebraic transformations. Those would have been very tedious had they used a boolean even/odd indicator.
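For anyone following along, the trick itself is tiny. A toy illustration in my own notation (not the paper's exact construction): with labels in {-1, +1}, the product of the labels is +1 exactly when the number of -1's is even, so the parity check is a single multiplication.

    import numpy as np

    labels = np.array([1, -1, -1, 1, -1])      # toy +/-1 labels, not taken from the paper
    parity_via_product = int(np.prod(labels))  # +1 iff the number of -1's is even
    parity_via_count = 1 if (labels == -1).sum() % 2 == 0 else -1
    assert parity_via_product == parity_via_count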

> They could have kept going and defined a procedure for drawing lines.

They do this: "Note that as the images space is discrete, we round the values corresponding to the points on the lines to the closest integer coordinate." They don't use notation for this because it's not required for clarity.

Basically, I think the ideas in this paper are laid out pretty clearly for someone who is already familiar with the notation and conventions used. Research papers are primarily about communicating ideas to people working in the same field, who don't need to have the basics explained to them. That this makes the results hard to understand for a more general audience, even if they are experts in their field, is unfortunate, but basically unavoidable. Adding links to definitions and examples in introductory textbooks to every research paper would be pretty awesome, but it shouldn't be the author's burden.


The first author on this paper does have a great intro book on precisely this area where the notation ramps up more slowly.


But he's not testing whether it's even or odd. The lines don't necessarily pass through the origin.


Great, but if what GP says is true, the math notation in this case was so terrible that it took up more space than a simple English explanation and still took longer to parse for someone versed in the notation.


I've complained about this sort of thing too (usually more privately). The upside is supposed to be precision and concision, but as you point out in this case they spent all their concision on buying precision. For this precision I think code should be considered more often instead of the usual informal math notation, which sometimes gets sloppy and harder to follow from outside the author's research community. (Edit: I'm not accusing that passage of sloppiness, but it's a problem that's frustrated me before with math in place of code.)


I will dance around the day someone figures out an AI to translate papers into readable code that compiles into a nice Jupyter-like notebook format.

Some papers are quite easy to comprehend and truly groundbreaking, e.g. the CNN paper by Alex and Hinton.

Others, like this one, are more "WTF are you even trying to say?"


Code will be hard to follow outside of the particular language's community. There are efforts to standardize pseudo-code, but then you're basically back to math notation.


Code has the advantage that functions and operators are googleable, while mathematical notation is not. I think the search problem is a fairly serious issue for maths actually.


True, but it's not maths' fault that Google and OSs don't support copy/paste for notation. Also, if you have to google e.g. the expectation operator, you probably shouldn't be reading research papers in theoretical machine learning.


And the even greater plus of being runnable. If it's confusing/ambiguous to you, you're not stuck.


I think mathematical formalism is necessary to communicate precisely what you're doing; however, I do wish academic papers supplemented the math with just some intuitive plain English describing what's going on.

Sometimes you just want to get the gist of what the authors are doing without having to solve math problems in your head. Other times you really are interested in actually reproducing what they did, and so you really need the math/code.


I do not agree with your point, for various reasons already elaborated upon in the comments section here.

Regardless of that, I find it very puzzling that you don't simply email the authors with your critique instead of writing a pretty thorough comment on HN that they will likely not read. That would be far more constructive, IMHO.

Keep in mind that this is a preprint. From my experiences in this field, preprints are usually very sloppy and are simply a means to lay "first claim" to a discovery. There are usually a ton of typos/other issues to be sorted out, or even worse, bugs in many key proofs - I am not saying this paper is guilty of this, just that expectations should be low regarding exposition quality in preprints.


I do see this kind of behavior in engineering-oriented fields. Maybe the authors try to be as formal as possible but don't know when to stop? Or maybe this is how they naturally think? (I doubt this, but because it is so common, I wonder if it is actually the case...)

I had a similar problem earlier in my career, because I tried to be as precise as possible but, for conciseness, didn't bother to explain anything. Later on I learned to say things in English, or give the intuition, before writing out the formal definitions.

Fortunately, papers the reviewers cannot understand have a hard time getting into top conferences in my area (CS theory).


>Or maybe this is how they naturally think?

I doubt this. If you watch a lecture by an academic, they will usually speak in understandable English, even if the subject is highly technical. They will give motivation and explanation for why things are the way they are. It seems only in academic papers do they go to great lengths to remove those things.


Keep in mind that the theoretical work in machine learning tends to come from a rather different academic sphere from a lot of work in application. That means a different professional vernacular, different notational conventions, and different ways of expressing certain concepts.

I don't think it's quite reasonable to fault an academic for writing under the conventions of their field. That's not being needlessly obtuse, that's being as clear as possible for the intended audience. There's no call for sour grapes just because you happen to not be the intended audience of an academic paper in a field you don't actively participate in.


It's not unnecessary, you're just not the target audience. I often implement algorithms documented only in academic papers. Those that lack the precision you're complaining about are often useless. I need the math.


I don't agree that they're redefining even vs. odd. It's more like positive sloping vs. negative sloping. The reason they've introduced the y variable instead of referring to the slope of those lines is that they're going to use the y value as a class label later.

I agree that the paper is written in dense technical language, but the point is that the authors are writing for other academics, not laypeople. They could go to some lengths to avoid using symbols like ~, but those symbols are completely standard in stats and ML. If you haven't seen them, there's a good chance you don't have the right background to follow their paper anyway.

All writing styles have their place. In textbooks and blog posts, it's completely sensible to have an index of notation, and to use natural language to express ideas. But when writing for an audience of trained scientists, it's a different matter.


Otherwise it won't get past arxiv's nonsense moderation.


I think some of the most exciting and interesting work comes out of proving not just capabilities but constraints for systems, be it Gödel, Shannon, Aaronson, or any of the others in the smaller-than-desirable tradition of those who say "No". I think a better understanding of what Deep Learning can't do (well) is fertile material for better understanding the kinds of problems it can do, and I am very excited to see more work in this space and movement towards an underlying structural theory.


The paradox of the heap[1] shows that vagueness is an inherent, inescapable part of reality.

Machine learning would provide a way around that paradox: a heap is what a neural network says is a heap!

If this worked, it would be too good to be true. Therefore, it cannot work.

[1] https://en.wikipedia.org/wiki/Sorites_paradox


> The paradox of the heap[1] shows that vagueness is an inherent, inescapable part of reality.

Or it's an inescapable part of how we interface with reality, i.e. it's an artifact of the structure of the human brain.


Bak, of the Bak-Tang-Wiesenfeld model, called his model the sandpile basically explicitly to remind one of the Sorites paradox. Interesting thing to talk about. You can definitely take it as a sort of different descendant of the Ising model: it has had much less influence on the world compared to backpropagation neural nets.

The dynamical nature of the definition stays, which has bedeviled Sorites people for millennia, but McClelland has been talking about the dynamical nature of representation for decades too. I think some of the anti-neural-net cognitive science people (Pinker, Fodor maybe? I forget) brought up some Sorites arguments when fighting the 80's connectionists.


I think the issue with Deep Learning is that it is a hyperdimensional sequence of optimization heuristics, surpassing what the human mind can comprehend and pushing the limits of computing (a depth of 100 layers max at the moment). Given how difficult far more trivial optimization techniques are, and how hard it is to prove bounds for them, it seems the times when we could just define a new approach and prove some nice properties about it are behind us :-(


Why can't we pull out the statistical mechanics? You can't understand the 10^27 variables in a lump of coal either, but you can still do thermodynamics. People already do it: Ganguli, some stuff from Bengio.


I wish I were so optimistic - if you use statistics, you might remove most of the interesting outliers and study just the more stable prevailing characteristics and not the problematic ones, like using Hooke's law to design buildings while ignoring its limited scope. And physics hits hard math limits when studying dynamical systems (e.g. fractals) anyway.

We can always invoke Ramsey theory, which says that complete disorder is impossible and you can always find some regularity in anything, yet if you need to scan 10^132 items to find the order, it quickly becomes uninteresting.


An RNN is natively and obviously dynamical, but an FFNN can also be seen as a transient dynamical system.

Little Ramsey theory has ever been used in neural network land, but the theory of dynamical systems has been used for quite a long time, from Werbos's BPTT, which was explicitly envisioned as a dynamical algorithm, to the description of the dynamical instability of gradients in RNNs.

The statistics of statistical mechanics don't necessarily need to be CLTish sorts of things, you know?


Possibly ;-) I'll study these things in detail soon (hopefully); so far I have just practical experience with all the fun things from DL, like self-driving cars, composing music with RNNs, etc., but having extensive operations research experience in the past, I tend to be careful.


The high dimensionality is merely for the convenience of optimization, so that we get a smoother manifold on which a good approximate solution can be found. In that regard, it is interesting to think about the space of models that we humans can interpret, and the possibility of "distilling" deep neural nets into models in that space.


I don't think the problem here is what the human mind can comprehend, because identifying where these methods break down is actually pretty easy from a math perspective, and the breakdown doesn't change as you go up and down in the number of dimensions. What has always surprised me about ML and DL is how far we can stretch simple regression techniques and how useful they are on such a variety of problems.

I agree that it has turned into an NP problem in a lot of cases now but the concepts behind the problem of N-dimensional optimization are pretty well understood.


It's not like we don't know anything about NP-complete problems, either: the critical phase transition in alpha for random k-SAT and other stuff, the realization that this is neither necessary nor sufficient for NP-completeness but just a really common complementary phenomenon, etc. etc.


If you use 40M+ optimization variables and a really, really bad optimization technique (gradients), which is however fast, as is the case in DL, the number of nice practical things (e.g. limits) you might be able to say could be very low.

Yes, we were all stunned when somebody pulled off some awesome trick and found a better upper/lower bound on something, yet sometimes it's better to be realistic - maybe once we move to quantum computers, we can test more limits practically by utilizing the parallel power of QP.


A lot of useful analogies for N-dimensional space can be abstracted from the simple transition from 2 to 3 dimensions.


"To deal with a 14-dimensional space, visualize a 3-D space and say 'fourteen' to yourself very loudly." -- Geoff Hinton


A math student and an engineering student went together to a presentation called "14-dimensional space topology". As the presentation progressed, the engineering student became more and more frustrated: she could only grasp a few simple initial examples, and the deeper the presentation went, the worse her understanding got. Yet the math student was absolutely enthusiastic, often asking the presenter various complicated questions and looking like he was enjoying himself a lot. At the end, with a massive headache, the engineering student turned to the math student and asked: "How could you understand the presentation? I was able to follow 2D and 3D, but once we increased the number of dimensions I got lost and never understood anything about 14D!" The math student looked at her with a confident, carefree look and said: "It was simple. I imagined everything in N dimensions, and then just reduced it to 14."


"and then let N=14" is how I've heard it.


I strongly disagree. Broken down, neural networks are surprisingly simple and elegant; it's the combination that makes them so powerful. Ideally, an elegant theory would abstract over these pieces to provide bounds for arbitrary combinations, or groups of combinations.


Well, they already had a big instance of that in the Minsky-Papert book "Perceptrons", talking about linear separability. Talking about backpropagation (as opposed to the delta rule) in opposition to that is a sort of mushing of history, but it's interesting to think about.


The lead author will be giving a talk on this work next week (which will be live streamed and recorded) as part of a workshop on Representation Learning:

https://simons.berkeley.edu/talks/shai-shalev-shwartz-2017-3...



The biggest failure of deep learning is the lack of common sense.


It is growing gradually (ontologies, word embeddings, mechanical dynamics prediction), but it will take some time. I don't know why there isn't more of a push to bring together all the common-sense resources, either.




