I don't know that I'm looking to refute you per se but...

A better example than face masks might be the recent controversy over Twitter's image-cropping AI and Obama photos (https://www.theverge.com/2020/9/20/21447998/twitter-photo-pr...).

A lot was made of racial issues, which is fine, but the broader issue is why subtle changes to photos, like cropping, should confuse the model so completely.

The target piece (the focus of this HN thread) sums itself up this way:

"There’s a good set of params somewhere nearby. When we start walking to it, we can’t ever get stuck along the way, because there are no local optima. Once we’ve stumbled upon a good set of parameters, we’ll know it and we can just stop."

I think there are some useful insights there, but this is in many ways the definition of a local optimum. What I'd argue is that because there are so many points in high-dimensional parameter space that satisfy a given classification goal, it's "easy" to find one that works with respect to the population that defines the model development space (training + test). However, that model development space/population is itself implicitly defined by a certain set of constraints -- it's a subpopulation of some broader population. The population you actually want to generalize to, the one that should define overfitting, is broader. You can avoid overfitting to your model development space and still be overfitting with respect to the broader set of possible inputs.
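
To make that concrete, here's a toy sketch (NumPy + scikit-learn, with invented data and an invented spurious feature, so treat it as the shape of the argument rather than anything from the article): a model fit on one narrow source scores well on a held-out split from that same source because it leans on a cue that's an artifact of how the source was collected, then degrades on a broader population where the cue disappears.

  import numpy as np
  from sklearn.linear_model import LogisticRegression

  rng = np.random.default_rng(0)

  def make_data(n, spurious_cue):
      # The "real" signal: x1 predicts y, but only noisily.
      x1 = rng.normal(0.0, 1.0, n)
      y = (x1 + rng.normal(0.0, 1.0, n) > 0).astype(int)
      if spurious_cue:
          # Idiosyncrasy of the development source: x2 is an easy,
          # source-specific cue that happens to track the label here
          # (think camera type, background, compression artifacts).
          x2 = rng.normal(0.0, 1.0, n) + 2.0 * (2 * y - 1)
      else:
          # Broader population: x2 still exists but carries no signal.
          x2 = rng.normal(0.0, 1.0, n)
      return np.column_stack([x1, x2]), y

  # Model development space: train and held-out test from the same narrow source.
  X_train, y_train = make_data(5000, spurious_cue=True)
  X_test, y_test = make_data(5000, spurious_cue=True)

  # Broader population of interest: same task, spurious cue gone.
  X_broad, y_broad = make_data(5000, spurious_cue=False)

  clf = LogisticRegression().fit(X_train, y_train)
  print("held-out accuracy, same source:", clf.score(X_test, y_test))
  print("accuracy on broader population:", clf.score(X_broad, y_broad))

By the usual train/test definition this model is not overfit at all; it only looks overfit once you ask about the broader population.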

Whether the constraints that define the model development population/space are important and reasonable considerations -- e.g., in your argument, not having access to things like body language -- varies from case to case. Sometimes the implicit defining characteristics of the model development population are meaningful; in other cases they're hidden.

In Twitter's case, you find out later that there were probably weird things defining the space of their images that they never intended. It's only in the adversarial case that you learn about this.

In classical statistics, you talk about generalization and overfitting, but there's an implicit population you're sampling from that defines both of those things. That is, you have a training/fitting/initial sample, and you ask yourself how well your model would perform on a test/validation sample. But implicit in that is some assumption about what it means to be a random sample from the same population.

I think lots of times with DL, the cross-validation/test sample is also implicitly defined as coming from some population. But the population isn't some model, it's some source. Some image dataset, something like that. There will be things about that source that are "of interest", but other things that are idiosyncratic about it, and unrepresentative of the "real" population of interest. In this way, I'm not sure that held-out samples from some source are really the right way to think of generalization and overfitting -- I think the adversarial setting is the generalization setting.

https://www.sciencemag.org/news/2020/05/eye-catching-advance...
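
To illustrate "the adversarial setting is the generalization setting", here's a small FGSM-style sketch on a toy linear model (NumPy + scikit-learn, synthetic data, nothing from either linked article): held-out accuracy from the same source looks near-perfect, while tiny per-feature perturbations aimed at the model, the linear analogue of an unexpected crop or re-encode, knock it down sharply.

  import numpy as np
  from sklearn.linear_model import LogisticRegression

  rng = np.random.default_rng(1)

  # Toy, linearly separable data: held-out accuracy will look excellent.
  n, d = 2000, 50
  X = rng.normal(0, 1, (n, d))
  w_true = rng.normal(0, 1, d)
  y = (X @ w_true > 0).astype(int)

  X_train, y_train = X[:1000], y[:1000]
  X_test, y_test = X[1000:], y[1000:]

  clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
  print("held-out accuracy, same source:", clf.score(X_test, y_test))

  # FGSM-style attack specialized to a linear model: for logistic loss the
  # input gradient is (sigmoid(w.x + b) - y) * w, so its sign is +/- sign(w)
  # depending on the true label. A small epsilon per feature adds up across
  # 50 dimensions.
  eps = 0.15
  w = clf.coef_.ravel()
  grad_sign = np.sign(np.outer(2 * y_test - 1, -w))  # loss-increasing direction
  X_adv = X_test + eps * grad_sign
  print("accuracy on adversarially perturbed inputs:", clf.score(X_adv, y_test))

No held-out split from the original source would have surfaced this, because the source never contains inputs chosen to probe the model.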

Along the way from the classical to the DL regime, I think there have been some overlooked questions about what it means to generalize, what you're really sampling from, and what your "population" actually is. It parallels the tension between theory and experimentation, because having a population to sample from in the classical sense requires some data-generating theory, which is often lacking in DL. The closest classical analogue to DL generalization theory is maybe a sort of ultra-high-dimensional bootstrapping with random effects: showing that your bootstrap samples are close to your observed sample isn't the same as showing they're close to the population, or to other samples drawn from that population, especially in the presence of random effects.
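
The bootstrap point in one toy sketch (NumPy, invented numbers; "sites" stand in for whatever random effect separates your data source from the population): the sample covers only a few sites, and a naive bootstrap over observations produces a tight, confident interval around the sample mean that says nothing about how far that sample sits from the population.

  import numpy as np

  rng = np.random.default_rng(2)

  # Population with a random effect: each "site" (dataset, camera, hospital...)
  # has its own offset, and observations are drawn within sites.
  site_effects = rng.normal(0.0, 1.0, 10_000)  # between-site sd = 1
  population_mean = 0.0                        # true target

  def draw_sample(n_sites, n_per_site):
      # A realistic sample: many observations, but from only a few sites.
      sites = rng.choice(site_effects, size=n_sites, replace=False)
      return np.concatenate([rng.normal(s, 1.0, n_per_site) for s in sites])

  sample = draw_sample(n_sites=3, n_per_site=2000)  # 6000 observations

  # Naive bootstrap: resample observations, ignoring the site structure.
  boot_means = np.array([
      rng.choice(sample, size=sample.size, replace=True).mean()
      for _ in range(2000)
  ])
  lo, hi = np.percentile(boot_means, [2.5, 97.5])

  print(f"sample mean:            {sample.mean(): .3f}")
  print(f"naive bootstrap 95% CI: ({lo: .3f}, {hi: .3f})")
  print(f"true population mean:   {population_mean: .3f}")

The interval is typically narrow around the sample mean, because resampling 6000 observations hides the fact that you effectively observed only 3 sites; being sure about your sample is not the same as being sure about the population.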
