Measuring the Tendency of CNNs to Learn Surface Statistical Regularities (arxiv.org)
66 points by tim_sw on Jan 17, 2018 | 36 comments



Adversarial noise is becoming quite a crisis in the deep learning field and is causing a lot of heated debate. There have been efforts at defenses using distillation and denoising auto-encoders, but the only thing that has worked well is actually training on adversarial noise itself.

However, the bigger insight is that deep networks aren't learning high-level features. This paper gives very strong evidence that deep networks are just fancier statistical models, as opposed to developing more abstract representations. If you make a network deeper, you are just building a higher-capacity model, not one with higher-level features. This is quite damning... If you were hoping the current form of deep learning would open the door to AGI, your hopes should be shattered by now.

Some of the things to investigate are (1) whether proper regularization can help turn this around (for example, forcing input gradients to be as small as possible), and (2) whether there are any other architectures trained by gradient descent that might actually develop abstract features.
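
To make (1) concrete, here is a minimal sketch of penalizing the norm of the loss gradient with respect to the input (sometimes called double backpropagation). This is just an illustration, not anything from the paper; the model, input shapes, and lambda_penalty are placeholders.

    import torch
    import torch.nn.functional as F

    def loss_with_input_gradient_penalty(model, x, y, lambda_penalty=0.1):
        # Standard cross-entropy loss on a batch of inputs x with labels y.
        x = x.clone().requires_grad_(True)
        ce = F.cross_entropy(model(x), y)
        # Gradient of the loss w.r.t. the input pixels, kept in the graph
        # (create_graph=True) so the penalty itself can be backpropagated.
        grad_x, = torch.autograd.grad(ce, x, create_graph=True)
        penalty = grad_x.pow(2).flatten(1).sum(dim=1).mean()
        return ce + lambda_penalty * penalty

The intuition is that if the loss gradient at the input is small, small pixel perturbations can only move the loss a little, which is exactly the leverage adversarial noise exploits.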

This is definitely crisis time in the field and therefore more interesting than ever :).


Adversarial noise is interesting, but I think it's more of a toy problem that will go away as we switch to more sophisticated models with life-long learning that are able to develop complicated, hierarchical/compositional classifications.

For instance, there was a talk at NIPS this year where the researcher developed virtual stickers that can be added to a photo and cause the image to be misclassified (e.g. a "banana" sticker that you can apply to any picture to make the classifier call that picture a banana). The underlying issue, IMO, is that the classifier has only a limited number of categories from which to choose, and can't develop new categories (such as "adversarially perturbed picture"). The modified image does look more like a banana than anything else, and since the classifier never trained on adversarial images as a separate category, it doesn't really make sense for it to make any other classification.
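
Roughly, such a sticker is just optimized to maximize the target class over random placements. A toy sketch of the idea (not the actual NIPS demo): the model, patch size, learning rate, and stand-in images are all placeholders, normalization is omitted, and the pretrained-weights API assumes a recent torchvision.

    import torch
    import torch.nn.functional as F
    from torchvision import models

    # Toy adversarial-patch loop: learn a 50x50 patch that pushes any
    # image it is pasted onto toward ImageNet class 954 ("banana").
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()
    patch = torch.rand(1, 3, 50, 50, requires_grad=True)
    opt = torch.optim.Adam([patch], lr=0.01)
    target = torch.tensor([954])
    batches = [torch.rand(8, 3, 224, 224) for _ in range(10)]  # stand-in images

    for images in batches:
        x = images.clone()
        i, j = torch.randint(0, 224 - 50, (2,)).tolist()  # random placement
        x[:, :, i:i+50, j:j+50] = patch.clamp(0, 1)
        loss = F.cross_entropy(model(x), target.expand(x.size(0)))
        opt.zero_grad()
        loss.backward()
        opt.step()

The real attack also optimizes over rotations, scales, and large sets of natural images, but the core loop is about this simple.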


It would also be interesting to know whether people with apperceptive visual agnosia are susceptible to adversarial examples.


Agreed, it is an interesting question. Humans are just far less likely to build a monoculture of vision capabilities, so the applicability of any single adversarial example is likely far more limited.


Aren't optical illusions human adversarial examples?


It's probably fair to see them as such, but they operate in the semantic domain rather than the pixel domain.


Do they? Specifically, the graffiti-style illusions that appear 3D to people. Those seem to work by specifically lining up the "pixels" you see.


It's not seemingly random (meaningless) perturbations that cause the misclassification/misrecognition, but a very specific (meaningful) tuning of features that serves to trigger it. The distinction here is whether the modified features themselves carry semantic information (i.e. have a representation in the semantic domain with respect to the content of the image). In the case of graffiti illusions, it seems that they do: the misrecognition is due to a particular kind of coherence with the observer's perspective and the relative positioning/alignment of the features in the image. Both the perspective and the relative alignment of features are meaningful.


I think you are somewhat shifting the goalposts, though. For adversarial images, you are changing what you know the system is looking at, in such a way that it is confused into "seeing" something else.

For humans, we just happen to somewhat understand the image that is being used against us. How is this different?


The difficulty here is operationalizing the terms involved so that the distinction between semantic domain and pixel domain is meaningful at all. But I think this criterion makes it a substantive distinction: whether the perturbation has a representation in the semantic domain with respect to the content of the image.

There are various semantic features of an image, and we presume that our visual system is extracting and operating on these semantic features. So it seems important to ask whether the features causing the misclassification can be represented in a semantic domain of the image. Shapes, (relative) sizes, gradients, high-level patterns, etc. are examples of semantic domains. There seems to be a distinction between the kinds of optical illusions humans are susceptible to and the kinds CNNs are. In the graffiti illusions, the placement of the image that causes our 3D object-recognition system to kick in can be described through angles, lines, focal points, focal lengths, etc. These are all semantic features of the scene depicted in an image. Contrast this with CNN adversarial perturbations, which have zero semantic features that we recognize. This seems like an important result: it means that our visual system is robust against certain classes of adversarial images, namely the small imperceptible deltas that trip up CNNs. To trip up our visual system requires certain combinations of semantic input, which are harder to exploit (a larger delta is harder to exploit).


Then I offer the ease with which we see faces where there are none. Consider the constellations.

I get your point to an extent. However, I think it is stated more strongly than it deserves. Specifically, the semantics of imagery are not much more than a non-frequency analysis of the pixel data your eye sees.


There's no semantic reason why we see this image as moving, but everyone does, even animals.

http://24.p3k.hu/app/uploads/2014/05/Rotsnake1.gif


>There's no semantic reason

It's hard to be certain because we don't know the details of our visual system.


>fancier statistical models

This would be true even if there were intermediate representations being learned.


Interesting paper that looks at the generalization abilities of deep CNN architectures. The authors begin by highlighting the generalization gap, i.e. the difference between the training and test learning curves, and how in deep CNN architectures this gap is typically relatively small. They then go on to note that this small generalization gap is often attributed to deep CNNs learning high-level semantic meaning. They counter that common notion by pointing to the recent and popular research on adversarial examples, noting the high sensitivity to adversarial perturbations: if a CNN is learning semantic meaning, then why does adding static to the image break it? Also related is the recent work by Zhang, Chiyuan, et al., "Understanding deep learning requires rethinking generalization.", in which they show that CNN architectures can perfectly fit random labels, which pushes us further toward the conclusion that the generalization capabilities of CNNs are currently not understood by the community. They then go on to introduce their experiment to try and isolate what CNNs are doing.

http://www.eggie5.com/129-Paper-Review-Measuring-the-tendenc...
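
For reference, the Zhang et al. random-label setup mentioned above amounts to roughly the following sketch (CIFAR-10 is just one of the datasets they use; the training loop itself is omitted here):

    import torch
    from torchvision import datasets, transforms

    # Replace every CIFAR-10 label with a uniformly random one. Zhang et
    # al. observe that standard CNNs still reach ~100% training accuracy,
    # i.e. they can memorize labels that carry no semantic information.
    train = datasets.CIFAR10("data", train=True, download=True,
                             transform=transforms.ToTensor())
    train.targets = torch.randint(0, 10, (len(train.targets),)).tolist()
    # ...train any standard CNN on `train` and watch training error go to zero.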


After stating their hypothesis, the authors qualify the comparison of CNNs to human beings with the following:

... we feel the need to stress that it is not fair to compare the generalization performance of CNN to a human being. In contrast to a CNN, a human being is exposed to an incredibly diverse range of lighting conditions, viewpoint variations, occlusions, among a myriad of other factors.

Why is it that evolutionary heritage is so rarely stated as an important factor in these comparisons (as far as I have seen)? I appreciate that evolution can be framed as another form of learning, but it is much more powerful than that employed by standard CNNs, in that it can change the network structure (and the I/O interface) as well as the network weights.


Humans do well with blurred images because our vision is blurry everywhere except our fovea and our current focal plane. So I'm a bit skeptical that this tells us something about the difference between CNNs and humans, especially since the models did seem to adapt well to the modified domain when trained on it.


>Humans do well with blurred images because our vision is blurry everywhere except our fovea and our current focal plane.

I don't see how this follows. What we're not focused on is blurred, yes, but what we're focused on is at least partially related to our ability to recognize patterns (e.g. reading from peripheral vision is very difficult).

The fact that CNNs are tricked by imperceptible perturbations while the semantic content is held constant is highly informative; don't reject it for superficial reasons.
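
For concreteness, the canonical "imperceptible perturbation" is a single signed-gradient step (FGSM, Goodfellow et al.). A minimal sketch, with the model and eps as placeholders:

    import torch
    import torch.nn.functional as F

    def fgsm_perturb(model, x, y, eps=2/255):
        # Take one step of size eps in the sign of the loss gradient:
        # a pixel-wise change far below what a human notices, yet often
        # enough to flip the classifier's prediction.
        x = x.clone().requires_grad_(True)
        loss = F.cross_entropy(model(x), y)
        grad, = torch.autograd.grad(loss, x)
        return (x + eps * grad.sign()).clamp(0, 1).detach()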

The standard fare of CNN+ReLU+residuals is a very powerful modelling toolkit, but that also means these models are prone to modelling degenerate regularities if they exist. This paper shows that such regularities do exist and that these models are picking up on them, at least to some significant extent.


>I don't see how this follows. What we're not focused on is blurred, yes, but what we're focused on is at least partially related to our ability to recognize patterns (e.g. reading from peripheral vision is very difficult).

Object detection from peripheral vision is not difficult though. I think you're overestimating how much of our vision is actually clear.

>The fact that CNNs are tricked by imperceptible perturbations while the semantic content is held constant is highly informative; don't reject it for superficial reasons.

Yes, but that is not this paper. These perturbations are not imperceptible, and unlike other adversarial examples the model adapted well when allowed to train on them.

Also, it looks like the model did reasonably well on the random filtered versions, only failing on the blurred versions. The random filtered images looked much more corrupted to me than the blurred ones, which is consistent with blurred images being part of the training for the human visual system, but not randomly filtered ones.


>Object detection from peripheral vision is not difficult though.

Because we understand high level semantic information, which is largely robust against blurring.

>The random filtered images looked much more corrupted to me than the blurred ones, which is consistent with blurred images being part of the training for the human visual system,

It's also consistent with the neural network training on surface level regularities, which a random filter would corrupt at a much lower rate than a blur filter would.


The radial masking in the Fourier domain is pretty much imperceptible for me, but it seems to cause the most problems for the CNNs.


I'm pretty sure that's because you are constantly trained on that type of image in your day-to-day life. "Radial masking in the Fourier domain" is pretty much identical to an image that is out of focus or seen with a low-resolution part of your eye (most of your vision).
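
For anyone curious what these filters actually do, here is a rough numpy sketch of the two kinds of Fourier-domain masking being discussed (the radius and keep fraction are arbitrary choices, not the paper's settings):

    import numpy as np

    def fourier_mask(img, mode="radial", radius=30, keep=0.6):
        # Filter a grayscale image (2D array) in the Fourier domain.
        # "radial" keeps only low frequencies within `radius` of the
        # centre, which looks like defocus blur; "random" zeroes a random
        # subset of coefficients, which looks far more alien to us.
        spectrum = np.fft.fftshift(np.fft.fft2(img))
        h, w = img.shape
        yy, xx = np.ogrid[:h, :w]
        if mode == "radial":
            mask = (yy - h / 2) ** 2 + (xx - w / 2) ** 2 <= radius ** 2
        else:
            mask = np.random.rand(h, w) < keep
        return np.real(np.fft.ifft2(np.fft.ifftshift(spectrum * mask)))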


> Especially since the models did seem to adapt well to the modified domain when trained on it.

But only when trained on it. The models trained on only one modified domain didn't generalize well to the others. Although the model trained on all datasets together generalized quite well to the corresponding test sets, that doesn't mean that there isn't some other kind of modification that can fool it.

This is what the authors say:

Our last set of experiments involves training on the fully augmented training set, which now enjoys a variance of its Fourier image statistics. We note that this sort of data augmentation was able to close the generalization gap. However, we stress that it is doubtful that this sort of data augmentation scheme is sufficient to enable a machine learning model to truly learn the semantic concepts present in a dataset. Rather this sort of data augmentation scheme is analogous to adversarial training [40, 9]: there is a nontrivial regularization benefit, but it is not a solution to the underlying problem of not learning high level semantic concepts, nor do we aim to present it as such.


Is it just me, or does the part of the article you cite basically say that CNNs are nice statistical tools but nowhere close to forming a practical conceptual understanding of the data set they're trained on?


The first part of this paragraph (up to 'however') seems to be saying that training with their modified images did improve the ability of the resulting networks to generalize, while the remainder of the paragraph seems to be cautioning against assuming that this means the problem is solved.


While the two clauses in your first sentence both seem to state established facts, is there any evidence that the first is explained by the second?


If we couldn't deal with blurry images through some means, we couldn't recognize objects, because our fovea is too small to contain them (about a quarter held at arm's length).


This is an unrealistic model of vision - we don't perceive the world as a tiny spot of clarity surrounded by a blur. We scan to compensate.

To say that we compensate for the blurring is quite different than saying we can do better (than CNNs?) with blurred images because most of the visual field is blurred. Did you mean to say that we are good at visual recognition despite most of the visual field being blurred? If so, I do not think it throws much light on what is happening in CNNs.


Compensation happens within our visual system as well, not just by saccading. As an experiment, with your eyes focused on your phone, identify the objects around you without moving your eyes. You can still do it very well, and likely need to do it all the time.


Yes, we are agreed that compensation happens, but how does that justify "Humans do well with blurred images because our vision is blurry everywhere except our fovea and our current focal plane"? Was that statement supposed to begin "we know that..."? - because, as it stands, the sentence states that our generally blurry visual field causally explains why we can identify things in blurred images, something that I still have seen no evidence for.


The assumption is that our brains are trained to deal with anything our eyes send them. The observation is that blurred images are a common thing our eyes send, while images with random frequencies deleted are not. This model would explain why the blurred images in the paper look relatively normal to us, while the images put through the random filter don't. (Note that this is the inverse of the CNNs' response: they were better able to classify the random-filter images than the blurred ones.)


This seems to me to be as speculative as the sentence that you won't offer any justification for. I don't think your stare-at-the-phone experiment shows what you claim, firstly because, as I am not generally aware of my saccades, I am not sure that I am suppressing them, and secondly, I am generally aware of what is around me, so I can make an informed guess as to what is in my peripheral vision. A better experiment would be to present images that I cannot guess, and this can be done by picking a website that has changing advertisements in the margin. Generally, I cannot tell what they are, unless it is something simple or guessable, until I look at them. I think you are exaggerating our ability to identify things in our peripheral visual field.


You can try the phone experiment while loading this url on the screen behind it. Sit far enough away that the screen is out of focus when you are focusing on your phone. https://picsum.photos/1024/768/?random


The same as before: a much-reduced ability to recognize features. In one image, I mistook a panoramic vista for the surface of a pond just a few feet away, and it's not so easy to avoid refocusing.


I think the real question that needs to be asked is: does predicting things mean the same thing as learning? CNNs can predict fairly well, but do they actually learn? This paper says no.


This paper says that CNNs learn from different features than we do, features that can be filtered out in the frequency domain. That's still learning.





