Training computer vision models on random noise instead of real images (unite.ai)
133 points by Hard_Space on Dec 10, 2021 | 64 comments



Generate a scribble. For a positive example, take two random crops of the scribble. For a negative example, take crops of two different scribbles. Ask the network if they both came from the same scribble.

The meat of the paper is that the scribbles can be remarkably shitty and still get you decent pretraining.

Compared to alternatives:

imagenet: exhaustively annotated, disk-space-consuming, non-free-for-commercial-use real-world photos - huge and has licensing issues

rotnet: take random images, rotate them and ask the network whether they're right side up or not - easy, but still takes up disk space

3d renders: render realistic 3d scenes on the fly and use those as input - nvidia has enough of my money already, thx

fractaldb: generate fractal images and pretrain by predicting the fractal's parameters. - why bother with fractals if you don't need them
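
For anyone who wants the gist in code, the positive/negative pair construction is roughly this (my own sketch, not the paper's code; the crop size and the "scribble" placeholders are made up):

    import random
    import torch

    def random_crop(img, size=64):
        # One random square crop from an image tensor of shape (C, H, W).
        _, h, w = img.shape
        top = random.randint(0, h - size)
        left = random.randint(0, w - size)
        return img[:, top:top + size, left:left + size]

    def make_pair(scribbles, positive=True):
        # Positive pair: two crops of the same scribble.
        # Negative pair: crops of two different scribbles.
        if positive:
            img = random.choice(scribbles)
            return random_crop(img), random_crop(img)
        a, b = random.sample(scribbles, 2)
        return random_crop(a), random_crop(b)

    # Placeholder "scribbles": any procedurally generated images would do here.
    scribbles = [torch.rand(3, 128, 128) for _ in range(16)]
    x1, x2 = make_pair(scribbles, positive=True)  # pairs like these feed a contrastive loss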


The title is click-bait, then, because it is not only unit/pixel-level noise but also "noise" at the gestalt level (e.g. large geometric shapes).

It is basically an experiment in how the mid-to-low-level features in NNs generalize from randomly generated abstract images to real images.

This is not a very surprising result, because it is known NNs generalize fairly well and at these levels you mostly only have blobs and edges that do not look much different from artificial ones.


>> This is not a very surprising result, because it is known NNs generalize fairly well and at these levels you mostly only have blobs and edges that do not look much different from artificial ones.

ANNs generalise atrociously badly, hence the need to train with ever bigger data to cover every nook and cranny of the instance space of each class of interest.

Incidentally, when people say that ANNs "generalise" they mean many different things, for example that they "generalise on the test set", which is usually only observed when the test set is known in advance (and so has been used in tuning hyperparameters and the like), or even, incredibly, that they "generalise on the training set" (i.e. the validation partitions in cross-validation). Conversely, there is a glut of novel terminology like "out-of-sample" or "out-of-distribution" to describe generalisation beyond the test set, but this kind of generalisation is typically held up as a weakness of ANNs, because they're generally really bad at it.

In any case, strong evidence of robust generalisation on out-of-sample data from few examples and with no or little pre-training, in ANNs, would be a surprise, indeed.


Exactly this.

The difference between interpolation and extrapolation is almost the most important concept in all of machine learning practice.

It's vanishingly rare (from what I've seen so far) that a practical machine learning model can perform high-quality extrapolation, for many different metrics of quality.

There are almost always far, far too many confounding variables.


Depends on how you define interpolation and extrapolation.

In high dimensional spaces basically everything is extrapolation including in pixel space and embedding space.

> The notion of interpolation and extrapolation is fundamental in various fields from deep learning to function approximation. Interpolation occurs for a sample x whenever this sample falls inside or on the boundary of the given dataset's convex hull. Extrapolation occurs when x falls outside of that convex hull. One fundamental (mis)conception is that state-of-the-art algorithms work so well because of their ability to correctly interpolate training data. A second (mis)conception is that interpolation happens throughout tasks and datasets, in fact, many intuitions and theories rely on that assumption. We empirically and theoretically argue against those two points and demonstrate that on any high-dimensional (>100) dataset, interpolation almost surely never happens. Those results challenge the validity of our current interpolation/extrapolation definition as an indicator of generalization performances.

https://arxiv.org/abs/2110.09485

Also:

> The location of decision boundaries inside the convex hull of training set can be investigated in relation to the training samples. However, our analysis shows that in standard image classification datasets, all testing images are considerably outside that convex hull, in the pixel space, in the wavelet space, and in the internal representations learned by deep networks. Therefore, the performance of a trained model partially depends on how its decision boundaries are extended outside the convex hull of its training data.

https://arxiv.org/abs/2101.09849
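
To make the convex-hull point concrete, here's a quick, entirely unofficial sanity check (not from either paper): sample Gaussian data and count how many held-out points fall inside the convex hull of the training set as the dimension grows. The fraction collapses towards zero well before dimension 100.

    import numpy as np
    from scipy.optimize import linprog

    def in_convex_hull(x, points):
        # x is in the hull of `points` iff x = points.T @ w has a solution
        # with w >= 0 and sum(w) == 1; test feasibility with a trivial LP.
        n = points.shape[0]
        A_eq = np.vstack([points.T, np.ones(n)])
        b_eq = np.concatenate([x, [1.0]])
        res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq,
                      bounds=[(0, None)] * n, method="highs")
        return res.success

    rng = np.random.default_rng(0)
    for dim in (2, 10, 50, 100):
        train = rng.normal(size=(1000, dim))
        test = rng.normal(size=(200, dim))
        inside = np.mean([in_convex_hull(t, train) for t in test])
        print(f"dim={dim:>3}: fraction of test points inside training hull = {inside:.2f}")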


The problem with the second paper is that having a "test set" is meaningless when you can access that test set during training - "you" as in the researcher, who develops a system. This is especially so for machine vision datasets like the ones in the paper that have been "done to death". Basically anyone who has access to the test set of a popular benchmark and wants to get their paper published will do everything that can be done to ensure their system does well on the test set.

That is a big flaw in machine learning research in general, but that's for another conversation, I guess. My point above is that if neural nets could generalise well, they wouldn't need so much data. In a sense, even if trained neural net models can generalise to instances outside the dense region of instance space circumscribed by their training set, that is not that important if that region has to be gigantic for this generalisation to be possible in the first place. For one thing, at that point it becomes difficult to separate what is "training" and what is "test", especially so when test sets are four times the size of training sets as in typical practice.


I think that the parent comment is referring to extrapolation in the semantic space (ex: use brown cats as training data and see if the ML algorithm can recognize albino cats).

Edit: or take photos of brown cats indoors from the front and see if the model recognizes albino cats from the side outside.


You still have to deal with the precise definition of interpolation vs. extrapolation.

And it does not matter what space you are using: as long as you operate under the convex hull definition of interpolation vs. extrapolation, you will need exponentially more samples as the intrinsic dimensionality of the space increases.

This means that even under the manifold hypothesis, as long as the intrinsic dimensionality is reasonably high, i.e. in the low hundreds, models will be doing extrapolation.


I think we've just jumped from my "stochastic software engineering tip of the day" to a postgraduate level examination of validation and test set decision boundaries vs dataset boundaries.

The thing is also, any individual high-dimensional case can be outside the training set's convex hull and still be correctly classified.

However, you would still have to quantify how the dimensions with the highest feature importance relate to that space. Which is why the second paper is so fascinating.

From the perspective of issues product/engineering teams face in the field, I'd definitely maintain that fire alarms should start sounding once you see any sort of extrapolation and you should dive deeper.

Unfortunately, the maturity level of this space is still at the point where peer review of data-set transformations before deploying to production, or committing Jupyter notebooks to GitHub, is a heated in-office discussion.

The majority of the commercial world is a long way from that kind of best practice.


> ANNs generalise atrociously badly

This statement is meaningless without controlling for model complexity and data type. For their simplicity, ANNs generalize well on a wide variety of data. GPT-3 yields almost human-level generalization ability for some tasks.

I also clarified that the generalization is probably not far. There is not much complex "realism" to be found in low-to-mid-level features; they're almost mathematical in their simplicity, similar to basis functions.


>> GPT-3 yields almost human-level generalization ability for some tasks.

That's an extravagant claim. There's no machine learning or other system, or algorithm, or technique, that can approach the ability of humans to "generalise" in any task, no matter how you want to define "generalisation". The models built by ANNs in particular are shallow and over-specialised and have none of the depth or complexity of whatever "models" of the world and the entities in the world that humans build in our heads.

Evaluations that show "superhuman" ability are poorly designed. Machine learning research is following benchmarks and metrics that mean nothing and show nothing, beyond the ability to beat said benchmarks with said metrics, which is then blithely taken to mean "progress" towards the approximation of human intelligence. This then leads to hyperbole like in your comment.


> no matter how you want to define "generalisation"

... and how you want to define "task". For some prompts/"tasks", GPT-3 does generate impressive (more than trivial) outputs that cannot be found on the internet and that are indistinguishable from what a human would respond, so it generalizes in that sense. Maybe human ingenuity and generalization is also just slightly perturbed interpolation? It is very difficult to produce something truly novel, so we are also rather tightly limited by prior experience. Who knows? Also, who cares if submarines swim? Anyhow, it seems 50% of the internet is bikeshedding about definitions.


Language generation is a very good example of a task that is very hard to evaluate with any degree of objectivity and for which there are no good metrics.

So, suppose you say that a particular bit of text generated by GPT-3 is "indistinguishable from what a human would respond". If I say it isn't, how can we decide who is right in a way that we can both agree on?

And that's all before we try to figure out "generalisation".


Is this not literally the (generalized version of the) Turing Test?

You simply hand the text to someone and ask to guess if it was produced by a human or not.

Or hand them two (or ten, or ten thousand) texts and ask them to label the human and AI texts without knowing the actual distribution.


In the Turing Test you have one human judge. I'm asking what happens when two humans disagree about the human-ness of some automatically generated text.


In complete agreement, not only the title but the entire article is cluttered with click-bait/hype wording. Simply refer to the paper instead [1], which is well written and should be accessible to even those with a modest understanding of the area if you stay away from the technical details. From my own scientific viewpoint, I think the finding is very neat and at least moderately surprising. So I did enjoy my quick read of the work and even wonder if the finding will be transferable to my own research area.

[1]: https://proceedings.neurips.cc/paper/2021/hash/14f2ebeab937c...

However, this is now the second time this week I have spotted shoddy writing coming from unite.ai – I am referring to the conspiracy-esque spin from [2] that we saw just a few days ago. Apparently they even come from the same author, who seems to put a great deal of emphasis on their articles reaching the Hacker News front page [3]. I am not sure how I feel about this; I would like to believe that “we” as a community are better than to fall for hype and click bait, and I am also very uncomfortable with the idea of there being professional prestige in getting onto the front page.

[2]: https://www.unite.ai/a-cartel-of-influential-datasets-are-do...

[3]: https://martinanderson.ai


The second paragraph of your comment should have been at the top of the linked article. Thank you for the explanation!


Thanks, this is the perfect explanation that was missing from the original article.


> nvidia has enough of my money already, thx

Wednesday was RSU vesting day. Your purchase is appreciated. <3


I'm confused. Are they just saying that if you throw noise-like images at your hot dog classifier and label them 'not hot dog' it improves performance?


Hi, author here. To hopefully clarify, our work is in the context of representation learning, which is a bit different from a "standard" classification.

For example, to classify a hotdog it might be useful to first generate an intermediate representation of the image (think "cylindrical, brown, meaty thing"). Such a representation can then fairly easily be mapped to the concept "hot dog".

These representations can be learned from large image datasets alone (they do not require labels!). In our work we show that you don't even need real images, but that images that are generated from noise processes are enough to train such representations, and that these representations are surprisingly good for classification.
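
Very schematically (this is just a toy illustration, not our actual code), the downstream use looks like: pretrain an encoder on the generated images with a contrastive objective, then freeze it and fit only a linear classifier on real labelled data.

    import torch
    import torch.nn as nn

    # Toy stand-in encoder; in practice this is a standard vision backbone.
    encoder = nn.Sequential(nn.Flatten(),
                            nn.Linear(3 * 32 * 32, 128), nn.ReLU(),
                            nn.Linear(128, 64))

    # 1) Pretrain `encoder` on the synthetic images with a contrastive loss (not shown).

    # 2) Freeze it and train only a linear classifier on a small labelled dataset.
    for p in encoder.parameters():
        p.requires_grad = False

    linear_probe = nn.Linear(64, 10)                 # e.g. 10 target classes
    opt = torch.optim.SGD(linear_probe.parameters(), lr=0.1)

    images = torch.randn(8, 3, 32, 32)               # placeholder batch of real images
    labels = torch.randint(0, 10, (8,))

    with torch.no_grad():
        feats = encoder(images)                      # frozen representations
    loss = nn.functional.cross_entropy(linear_probe(feats), labels)
    loss.backward()
    opt.step()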

Hope this clarifies things a bit, and happy to answer any other questions!


Is this basically teaching the network how to do pattern processing, like line and corner detection, and then using that trained network as a starting point when training on real images?


The part you lost me on is noise processes - what goes in to the noise and how does it help if it is random?


One thing to note is that here noise != Gaussian iid noise, so these are not typical white noise images. I think we were not really clear on that part, but for us noise is basically a random process, which takes a seed as input (plus potentially some very low-level assumptions over image statistics, such as a 1/f spectrum) and produces a synthetic image.

It is then possible to generate arbitrary amounts of these images as samples from the stochastic process - these images exhibit certain image-like structures (such as oriented edges), but are as a whole still random and extremely varied, which is good and necessary for the representation learning.

In terms of helping, though, it is important to note that we do not achieve state-of-the-art performance yet, and when looking at absolute performance for a task like image classification, using real images is still better. That being said, something that is in the paper but generally seems to get lost is that our representations work very well when analyzing data that is very different from normal images, such as medical images or satellite images.
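
As a toy example of what such a process might look like (this is not one of the exact generators from the paper, just an illustration of "seed plus 1/f spectrum in, image out"):

    import numpy as np

    def sample_spectral_image(size=128, exponent=1.0, seed=None):
        # White noise shaped in the Fourier domain to have a 1/f**exponent
        # amplitude falloff, then transformed back to pixel space.
        rng = np.random.default_rng(seed)
        freqs = np.fft.fftfreq(size)
        fx, fy = np.meshgrid(freqs, freqs)
        radius = np.sqrt(fx ** 2 + fy ** 2)
        radius[0, 0] = 1.0                            # avoid dividing by zero at DC
        spectrum = rng.normal(size=(size, size)) + 1j * rng.normal(size=(size, size))
        spectrum /= radius ** exponent
        img = np.real(np.fft.ifft2(spectrum))
        return (img - img.min()) / (img.max() - img.min())   # normalise to [0, 1]

    # Arbitrary amounts of training images, one per seed.
    images = [sample_spectral_image(seed=s) for s in range(4)]

Each seed gives a different image, and the dataset as a whole is just samples from the stochastic process.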


But then you aren't really throwing "random noise" at it, are you? It's more like you are throwing generated datasets with abstract structures at it, and using the randomization to ensure that it does not overfit on other accidental structures that might be present in an individual image, because the randomization ensures that there are no other structures to speak of on "average" (which does sound like a very sensible way to train a network on abstract structures). Or do I misunderstand the method here?


I'm pretty sure the title is clickbait, it's not uncorrelated random per-pixel noise as it gives the impression of.


Oh for sure, and I don't mean to accuse the authors; if pop-sci articles spread confusion about their work that's not their fault. I just want to clear things up for myself


It could be very interesting to try Gabor noise.


Why would that be interesting?


It's not random noise. Look at the images in the paper. Horizontal lines, vertical lines, snakeskin patterns, Minecraft textures. Examples of miscellaneous surface patterns, in other words.

Back before deep learning, people used to make recognizers for features like that as a lower level of feature recognition. Now it's expected that features will be derived automatically from real imagery. This is kind of a return to that level.

A useful training set might be a big texture library used for game development or animation. Those are easily available.


That would indeed be an interesting thing to try, use real data, but only in terms of textures - so effects like occlusions, perspective, etc. would not be present.

I would expect it to be somewhere in the ballpark of our StyleGAN images, which also look very "textural" but lack these effects that are a result of imaging the 3D world. Interestingly, modelling these effects without realistic textures seems to result in worse performance - this is for example the case for images taken from CLEVR or generated from Minecraft, and both perform worse than the StyleGAN images.


This is a fascinating paper. Does training on generated noisy images make the resultant classifier more resistant to standard adversarial examples?


Hi, out of curiosity, what is the expected performance of a random classifier on your test datasets?


Consider an analog. Say you wanted to design an image compression format in the spirit of jpeg, so you take a bunch of 8x8 blocks of image data and you use a principal component analysis to design an optimal orthogonal transformation to apply to 8x8 blocks to compact their energy into fewer coefficients so you can compress it.

But collecting lots of image data is hard, so someone comes along and points out that the most important characteristic of your images is that they appear to be drawn from an autoregressive process... That is, each pixel is (say) 95% correlated with the pixel one away, 95%^2 with the pixel two away, 95%^3 with the pixel three away, and so on. Since you only need the covariance matrix to compute the PCA anyway, you just generate an AR(0.95) covariance matrix and use the transform derived from that. And you find this works pretty well too.

This work is along those lines: they're generating random images with some simple natural-like statistical properties, training the first part of the classifier using them, and getting useful results.

One interesting, promising part is that this line of thinking may lead to better network designs or insights that allow skipping training these initial layers. Going back to the block-transform example, the next step would be to notice that the PCA of the AR(0.95) matrix is (approximately) the discrete cosine transform, which is amenable to an extremely efficient implementation.
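
If you want to see that last bit numerically (rough sketch, nothing from the paper), the eigenvectors of an AR(0.95) covariance line up closely with the DCT basis:

    import numpy as np

    n, rho = 8, 0.95

    # Covariance of a 1-D AR(1) process: C[i, j] = rho ** |i - j|.
    cov = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))

    # PCA basis = eigenvectors of the covariance, ordered by decreasing variance.
    eigvals, eigvecs = np.linalg.eigh(cov)
    pca_basis = eigvecs[:, ::-1]

    # Orthonormal DCT-II basis for comparison.
    i, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    dct_basis = np.sqrt(2 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    dct_basis[:, 0] /= np.sqrt(2)

    # Compare component by component, up to sign.
    for j in range(n):
        sim = abs(np.dot(pca_basis[:, j], dct_basis[:, j]))
        print(f"component {j}: |cosine similarity to DCT| = {sim:.3f}")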


If I'm reading it right, it seems to be saying that if you throw random noise at your image classifier, it's able to produce an intermediate artifact that you can then train more effectively when you throw hot dogs at it. It won't be quite as good as you get with the best models, but it's strikingly good for random noise. The paper puts it like so: "Although GANs need real images to learn the network parameters, we show in this paper that they introduce a structural prior useful to encode image properties without requiring any training."

https://openreview.net/pdf?id=RQUl8gZnN7O


I’m not an expert, but my light reading of the paper makes it seem like it isn’t “random noise” in the colloquial sense, but more like “there are properties to the randomness even though the images look like noise.”

The abstract seems to say that, but again, I could be misinterpreting: “Our findings show that it is important for the noise to capture certain structural properties of real data but that good performance can be achieved even with processes that are far from realistic.”


That is correct -- by "noise" we don't mean pixel noise, but rather a stochastic process (in fact, different processes with different properties, which we compare in the paper) from which we can sample large amounts of varied training images.


Except you're not actually training on noise, you're training on procedurally generated images which include random modifications. It's just that the summary is really bad at describing the process.


No. If I understand correctly, what they are doing is generating "noise" samples with statistics similar to natural images. Then they use unsupervised contrastive learning to create representations of these noise images. This network is then employed in some classification tasks and it does well under a specific training mode (linear classifier training in the final layer only). The details of the evaluation aside, what is being shown is that a network trained to generate representation based on these artificially generated images (noise) can encode a good prior for most vision tasks thereby potentially reducing the need for a very large number of real training images.

Edited: typo, clarity


I'm also confused. They seem to be missing a paragraph near the start of the article.


>and have found that instead of producing garbage, the method is surprisingly effective:

I'm not enjoying this article. Effective at what exactly?


I can’t tell if you want an actual answer or an ELI5.

Actual: various detection tasks, as compared to other image networks.

Simple version: Many nn tasks work by feeding a lot of data into a network, then “refining” it with a task you want it to actually do.

So let’s say I want to detect corgis.

I take a network that can detect all kinds of shit, and feed it a bunch of images of dogs and tell it, no, only these ones please.

Why?

…because you don’t have 10TB of dog images, you have 10TB of pictures of random crap, and 1GB of corgi images.

…and fairly intuitively, this works. If something knows how to tell dogs from cats from cars, it's not a big stretch to become more specific and only detect corgis.

Now, this paper is showing that if, instead of feeding the initial network labelled images of cats, dogs, etc., you just feed it procedural noise, it still works.

That’s a) surprising (wtf, why does telling the difference between a squiggly and another squiggly help you tel corgis from cats?) …

And b) really important, because it means you don’t need to spend hundreds (thousands) of hours or $$$ collecting the initial datasets.

Practically, what does that mean?

Well, here’s some food for thought: Google / Amazon etc are considered to have a fairly defensible moat for their voice recognition tech, because they are the only ones with enough data to train good models.

…but that moat vanishes in a blinding flash of steam if you can get comparable results from just feeding enough generated noise into a network.

So this is pretty interesting stuff.


Very interesting, and aligns very well with my experience with synthetic data as well. (In that diversity trumps realism by a LARGE margin).


It is interesting. I wish you had written the article.


Interestingly, this has some parallels to how the brain may get wired up. During development (i.e. before the eyes have opened), there are spontaneously generated waves of activity in the retina that feed into the brain and may help structure the connections formed there.

https://en.m.wikipedia.org/wiki/Retinal_waves

[edit: I see the authors have already made this connection in the paper]


Using the term "procedural noise processes" instead of "random noise" in the title would have been less click-baity and closer to what the paper is about.


meh. any kind of 'random noise' is drawn from a distribution, one shouldn't assume 'random' just means uniformly random bits in some common pixel format.


No, but you'd assume that all the pixels in a "random noise" image are independent and identically distributed. Procedural noise generally has correlation between pixels


:) I had started to write in my comment that really the only thing I'd find surprising is that it's not IID, but these are random pictures, not random pixels.

An image of IID pixels is a very unusual one, and I assume that picture to picture their process is IID.

Another way of looking at it: say you have an image with iid exponentially distributed pixels. The bits in the image file, however, would not be iid. So just because you can point to some part and say it's not iid doesn't make it wrong to call it random; it's just a question of what scale you're operating at.

Similarly, if you made an image that was 1/f instead of totally spectrally flat, it wouldn't be IID (looking at the pixels alone, again) -- but I don't think anyone would fail to call such an image "random noise".


That would sound more click-baity


The actual paper is https://openreview.net/pdf?id=RQUl8gZnN7O and the title is "Learning to see by looking at noise". I think a lot of the negative reactions here are from the rather ridiculous and inaccurate re-interpretation of the paper in the article.

The paper itself seems ok. It's certainly not ultra groundbreaking, but I think the research is useful and presents a pretraining step that could be used in many applications.


There are tons of hand-crafted features in the 'random' datasets. The utility of these same features (edges, corners, gradients, etc.) was 'discovered' over time by previous generations of CV researchers. Saying "training neural nets on known useful training data leads to decent performance" is not interesting or surprising.


A conceptually related finding: computer vision models can effectively classify patient race from significantly noised x-ray images.

https://arxiv.org/abs/2107.10356

> However, our findings that AI can trivially predict self-reported race -- even from corrupted, cropped, and noised medical images -- in a setting where clinical experts cannot, creates an enormous risk for all model deployments in medical imaging: if an AI model secretly used its knowledge of self-reported race to misclassify all Black patients, radiologists would not be able to tell using the same data the model has access to.

Potentially the method in OP's paper could be used as an adversarial critic or similar counterbalancing effort to eliminate "secret knowledge" when it's not desirable.


A recent HN submission on physics talked about discovering hidden symmetries of gravity.

Somehow this seems a little bit similar. Neural networks have a great many parameters. If training on very abstract images gives good results, then it seems to indicate that there are certain symmetries that make the neural network better, irrespective of dataset or task.

If that hypothesis is true, then it may be possible to change the architecture of the network to directly provide those properties without any training at all.


Interesting approach, and (in hindsight) not surprising that it works.

Consider that dimension reduction with the Locality Sensitive Hashing algo, or something like this [], has proven utility. You could make some hand-wavey argument that the approach in the article is similar in extracting features from randomness.

[] https://scikit-learn.org/stable/modules/random_projection.ht...
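
e.g. something like this trivial sketch:

    import numpy as np
    from sklearn.random_projection import GaussianRandomProjection

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10_000))      # 100 samples in a 10k-dimensional space

    proj = GaussianRandomProjection(n_components=256, random_state=0)
    X_small = proj.fit_transform(X)         # data-independent random features

    print(X_small.shape)                    # (100, 256)

The projection matrix is pure random noise, yet pairwise distances are roughly preserved -- hand-wavily, the same flavour of "useful structure from randomness".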


TLDR: One can use fake noise-generated images to pre-train detection models. But "noise" here means something like the output of a GAN when triggered with a random input, not the "TV with no signal" kind of noise.

I expect this result to be not that interesting for most AI programmers, because you'll use a pretrained ResNet for preprocessing anyway. But it is a very elegant solution to the licensing issues that come with large unsupervised image collections like ImageNet.


This fits the larger theme of self-supervised learning, where you can programmatically generate labels without humans and train to get decent weights for further tuning.


Bad title. It's a grabby headline that doesn't turn out to be what it claims. Unless this definition of "noise" is some AI/math version that most people don't recognise, like the way they use "bias" to mean overfitting, but happily flip-flop to mean "racist" when people are looking for a dramatic soundbite or headline. Should be changed IMO, it's not noise.


Bias is underfitting.


This caught my attention:

The researchers suggest that the current crop of machine learning architectures may be inferring something far more fundamental (or, at least, unexpected) from images than was previously thought...

Is it more plausible that this shows they are inferring something fundamental, rather than that they are differentiating images on the basis of some of their accidental (i.e. non-essential) features?


How many real images do you need to match this performance though? Say I take my camera and make a 10 minute video filming my apartment, my street and the park. That's a dataset with thousands of images with structure and decent variation, and very cheap.


To me it seems like feeding garbage to train the model would expose any sort of 'bias' the model has, no?


I'm no ML expert, but my understanding is that the bias is usually in the training data, not the model. The idea here is that you can generate bias-free "garbage" and therefore reduce the bias of the trained model


Yeah, very interesting... Like you could do a preliminary training that uses really good quality labeled data. Then almost infinitely continue training on a negative only dataset of random noise, which would theoretically hammer out any false positives that can be generated by the model.



