Author here. Was not expecting to see this pop up on Hacker News!
This was a short write-up of a set of experiments exploring the importance of attention in transformers. I'm glad to see that people are still reading it and hopefully finding it interesting.
Side note: the title should probably have (2021) in it, as this was posted to arXiv in May 2021 (concurrently with a number of similar works such as MLP-Mixer).
An issue with using the same benchmark for everything is that you eventually start training on the test set by natural selection, via people throwing out all the attempts that don't work.
I'm not sure how to prevent this; maybe you can preregister your programmers?
If you only care about identically distributed test data, test-set overfitting doesn't happen that fast: if you evaluate M models on N test samples, the overfitting error is on the order of sqrt(log M / N). And even as this error becomes noticeable, the relative ranks among the models are more stable still, because you can apply small-variance bounds. This has actually been verified on models proposed for CIFAR-10.
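To make the scaling concrete, here's a rough back-of-the-envelope sketch in Python. It just evaluates the Hoeffding-plus-union-bound estimate sqrt(log M / (2N)) for the 0-1 loss; the model count of 10^5 is made up for illustration, not a real count of CIFAR-10 submissions.

```python
import math

def adaptive_overfitting_bound(num_models: int, num_test_samples: int) -> float:
    """Union bound + Hoeffding: worst-case gap between test accuracy and true
    accuracy across all evaluated models, for 0-1 loss on i.i.d. test data."""
    return math.sqrt(math.log(num_models) / (2 * num_test_samples))

# CIFAR-10's test set has 10,000 images; pretend the community has tried ~1e5 models.
print(adaptive_overfitting_bound(10**5, 10_000))  # ~0.024, i.e. about 2.4 accuracy points
```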
No. I was referring to the "standard concentration bound" in that paper, which applies when you have separate validation and test sets. I think the argument can usually be improved by applying small-variance inequalities such as Bernstein's to excess-risk-like quantities such as l(f_hat(x), y) - l(f_ref(x), y), to show that the accuracy difference / relative rank enjoys better guarantees. For ImageNet we can use the 0-1 loss and set f_ref to a SoTA classifier which, while having its loss bounded away from 0, is "mostly similar" to most f_hat's, and thus leads to a small excess risk.
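If it helps, here's a toy sketch of why the excess-risk version is tighter: when f_hat and f_ref agree on most samples, the per-sample loss difference has tiny variance, so an empirical Bernstein bound (Maurer & Pontil style) on that difference is much narrower than a Hoeffding bound on either loss alone. The 5% disagreement rate and the synthetic data below are made up for illustration.

```python
import numpy as np

def empirical_bernstein_radius(diffs: np.ndarray, delta: float = 0.05) -> float:
    """One-sided empirical Bernstein deviation bound (Maurer & Pontil, 2009)
    for the mean of samples in [-1, 1] (range 2)."""
    n = len(diffs)
    var = diffs.var(ddof=1)
    return np.sqrt(2 * var * np.log(2 / delta) / n) + 7 * 2 * np.log(2 / delta) / (3 * (n - 1))

rng = np.random.default_rng(0)
n = 50_000  # ImageNet-val-sized test set
# Per-sample loss differences l(f_hat(x), y) - l(f_ref(x), y) in {-1, 0, +1};
# assume the two classifiers disagree on ~5% of samples.
diffs = rng.choice([-1, 0, 1], size=n, p=[0.025, 0.95, 0.025])
print(diffs.mean(), empirical_bernstein_radius(diffs))  # radius ~0.003, i.e. ~0.3 points
```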
The CIFAR experiments I mentioned are in https://arxiv.org/pdf/1806.00451.pdf. That paper doesn't contain this argument (unfortunate wording on my part), but its results appear to support it well.
Looking forward to the (very productive) tinkering phase of neural networks transitioning to a synthesis period, where we understand better the essential ingredients and why they’re essential.
Looks like the attention layers aren't as important as everyone thought, because using a similarly sized feed-forward layer works almost as well (74% vs. 77% top-1 accuracy).
Wouldn't attention networks use less computation for the same number of weights? Feed-forward networks have higher connectivity, no?
If a network is the same size, less computationally demanding, and gives you a 3% improvement, it seems extremely worthwhile, especially given the (mostly) diminishing returns of just adding more weights/layers.
Going from 25% top-1 error to 22% top-1 error is a massive jump on ImageNet and very meaningful in a lot of applications. That said, there is no reason to believe attention-based models are the only way forward for image classification. The humble ResNet-50 can get near 22% top-1 error when trained properly.
Whether we need attention or not is a more interesting question on seq2seq models on text data.
Not an expert, but as I understand it, "attention" is just a learned feature weighting: more important features get higher weights, less important features get lower weights. I can see why this would be useful as an explicit step in terms of computational efficiency, e.g. "calculate the most important features and operate on them." However, it seems like such information could easily be learned implicitly by other parts of the model without needing to model attention directly. Which is, I guess, the result in this paper.
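For what it's worth, the distinction shows up in a few lines of toy numpy: self-attention builds its token-mixing weights from the input itself, whereas the feed-forward replacement studied here learns one fixed mixing matrix shared across all inputs. Shapes and names below are illustrative, not the paper's actual implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
num_tokens, dim = 16, 64
x = rng.standard_normal((num_tokens, dim))  # one toy "image" as a sequence of tokens

# Self-attention: the (tokens x tokens) mixing matrix depends on the input x.
Wq, Wk, Wv = (rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3))
attn = softmax((x @ Wq) @ (x @ Wk).T / np.sqrt(dim))
out_attention = attn @ (x @ Wv)

# Feed-forward replacement: a learned but input-independent mixing across tokens,
# i.e. the same matrix is applied no matter what x is.
W_mix = rng.standard_normal((num_tokens, num_tokens)) / np.sqrt(num_tokens)
out_feedforward = W_mix @ x
```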
"MLP Mixer: An all-MLP Architecture for Vision" is #1 on the list. The paper we are commenting on is #2 on the list. They were uploaded to arxiv two days apart!
This paper came out after. You read the date wrong. The weird formatting and very short length also point to this. This paper is a clear rip-off and an instance of academic fraud.
There is plenty of feedback within the retina (e.g. horizontal cell modulation of photoreceptor activity) and from layer 6 of cortex to the dorsal lateral geniculate nucleus, modulating transmission characteristics and suppressing noise in the LGN input to layer 4.