A stack of feed-forward layers does surprisingly well on ImageNet (arxiv.org)
97 points by fzliu on Jan 18, 2023 | 25 comments



Author here. Was not expecting to see this pop up on Hacker News!

This was a short write-up of a set of experiments exploring the importance of attention in transformers. I'm glad to see that people are still reading it and hopefully finding it interesting.

Side note: the title should probably have (2021) in it, as this was posted to arXiv in May 2021 (concurrently with a number of similar works such as MLP-Mixer).


An issue with using the same benchmark for everything is that you eventually start training on the test set by natural selection, as people throw out all their attempts that don't work.

I'm not sure how to prevent this; maybe you can preregister your programmers?


If you only care about identically distributed test data, test set overfitting doesn't happen that fast: if you evaluate M models on N test samples, the overfitting error is on the order of sqrt(log M / N). And even as this error becomes more noticeable, the relative ranks among the models are even more stable, since you can apply small-variance bounds. This has actually been verified on models proposed for CIFAR-10.
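
For a rough sense of scale, a minimal sketch (assuming the sqrt(log M / N) rate above, N = 50,000 ImageNet validation images, and made-up values of M; constants are omitted):

    import math

    N = 50_000                        # ImageNet validation set size
    for M in (10, 1_000, 100_000):    # hypothetical numbers of models ever evaluated
        err = math.sqrt(math.log(M) / N)
        print(f"M = {M}: overfitting error ~ {err:.3f}")
    # prints roughly 0.007, 0.012, 0.015 -- even after 100k tries the
    # adaptive-overfitting term is only ~1.5 accuracy points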


That’s cool, is that from https://proceedings.mlr.press/v97/feldman19a.html ?

I hadn’t seen that result before, definitely interested in related work


No. I was referring to the "standard concentration bound" in that paper, which applies when you have separate validation and test sets. I think the argument can usually be improved by applying small-variance inequalities such as Bernstein's to excess-risk-like quantities such as l(f_hat(x), y) - l(f_ref(x), y), to show that the accuracy difference / relative rank enjoys better guarantees. For ImageNet we can use the 0-1 loss and set f_ref to a SoTA classifier which, while having its loss bounded away from 0, is "mostly similar" to most f_hat's and thus leads to a small excess risk.
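
Roughly the kind of bound I mean, as a sketch only (the constants depend on which form of Bernstein's inequality you use):

    % Bernstein's inequality applied to the excess 0-1 loss
    % Z_i = l(f_hat(x_i), y_i) - l(f_ref(x_i), y_i), with |Z_i| <= 1, Var(Z_i) <= sigma^2
    \Pr\left( \left| \frac{1}{N}\sum_{i=1}^{N} Z_i - \mathbb{E}[Z] \right| \ge t \right)
        \le 2 \exp\left( \frac{-N t^2}{2\sigma^2 + \tfrac{2}{3} t} \right)
    % For the 0-1 loss, sigma^2 <= P(f_hat and f_ref disagree), which is small when
    % the two classifiers mostly agree, so the excess risk concentrates faster than
    % the raw sqrt(log M / N) rate suggests.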

The CIFAR experiments I mentioned were https://arxiv.org/pdf/1806.00451.pdf. It doesn't contain this argument (unfortunate wording) but appears to support it well.


Very insightful comment. A sort of publication/survivor bias for models that survived the benchmark.


A similar paper showed this for language modeling and vision back in 2021: https://arxiv.org/abs/2105.08050


Looking forward to the (very productive) tinkering phase of neural networks transitioning to a synthesis period, where we understand better the essential ingredients and why they’re essential.


Great result :)

Looks like the attention layers aren't as important as everyone thought, because a similarly sized feed-forward layer works almost as well (74% vs. 77% top-1 accuracy).


Wouldn't attention networks use less computation for the same number of weights? Feed-forward networks have higher connectivity, no?

If a network is the same size, less computationally demanding, and gives you a 3% improvement, it seems extremely worthwhile, especially given the (mostly) diminishing returns of just adding more weights/layers.
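
For a rough sense of the cost, a back-of-envelope sketch (assuming ViT-Base-like shapes, n = 196 tokens and d = 768 channels, and comparing one single-head self-attention layer to one token-mixing feed-forward layer of hidden width n; the channel MLPs that both architectures share are ignored, and these are not the paper's exact layers):

    # Multiply-accumulates (MACs) and parameter counts per image, made-up shapes
    n, d = 196, 768

    # Single-head self-attention: Q/K/V/output projections + the n^2 token mixing
    attn_params = 4 * d * d
    attn_macs   = 4 * n * d * d + 2 * n * n * d   # projections + (QK^T, attn @ V)

    # Token-mixing feed-forward layer (an MLP across tokens, applied per channel)
    mlp_params = 2 * n * n
    mlp_macs   = 2 * n * n * d

    print(f"attention: {attn_params/1e6:.2f}M params, {attn_macs/1e6:.0f}M MACs")
    print(f"token MLP: {mlp_params/1e6:.2f}M params, {mlp_macs/1e6:.0f}M MACs")
    # roughly: attention ~2.4M params / ~520M MACs; token MLP ~0.08M params / ~59M MACs
    # i.e. most of an attention layer's cost is in its (feed-forward) projections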


It is less general, though, so this is interesting because it shows you can learn a lot without adding very much structure.


Going from 25% top-1 error to 22% top-1 error is a massive jump on ImageNet and very meaningful in a lot of applications. That said, there is no reason to believe attention-based models are the only way forward on image classification. The humble ResNet-50 can get near 22% top-1 error when trained properly.

Whether we need attention or not is a more interesting question on seq2seq models on text data.


What top-1 error do humans get on this problem?


Not an expert, but as I understand it, "attention" is just a learned feature weighting. More important features get higher weights; less important features get lower weights. I can see why this would be useful as an explicit step in terms of computational efficiency, e.g. "calculate the most important features and operate on them." However, it seems like such information could easily be learned implicitly by other parts of the model without needing to model attention directly. Which is, I guess, the result in this paper.
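
To make "learned feature weighting" concrete, a minimal numpy sketch of standard scaled dot-product attention (shapes and values are made up):

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    n, d = 4, 8                                # 4 tokens with 8-dim features
    rng = np.random.default_rng(0)
    x = rng.normal(size=(n, d))
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

    q, k, v = x @ Wq, x @ Wk, x @ Wv
    weights = softmax(q @ k.T / np.sqrt(d))    # input-dependent weights, each row sums to 1
    out = weights @ v                          # each token is a weighted mix of all tokens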


A curated list of papers in the same vein:

https://github.com/fawazsammani/awesome-mlp-mixer

"MLP Mixer: An all-MLP Architecture for Vision" is #1 on the list. The paper we are commenting on is #2 on the list. They were uploaded to arxiv two days apart!


But maybe the replacement layer is doing attention?


This is somewhat related: how do we build datasets that are better discriminators of model architectures, and are also easier to experiment on?

https://greydanus.github.io/2020/12/01/scaling-down/


Any good resource out there explaining feed-forward networks and how they achieve learning?


(How) is Geoffrey Hinton's recent result related?


Not at all, if you're thinking of the forward-forward stuff


This paper is a bad copy of MLP-Mixer.


Check the date of the papers before you lazily criticize.


This paper came out after; you read the date wrong. The weird formatting and very short length also point to this. This paper is a clear rip-off and an instance of academic fraud.


Residual neural nets are a stack of feed-forward layers, and are equivalent to how the brain actually processes images
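
For the first half of that claim, a minimal numpy sketch of a residual block built from plain feed-forward layers (the width is made up, and real ResNets use convolutional layers):

    import numpy as np

    d = 64                                     # made-up feature width
    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(d, d)) * 0.1
    W2 = rng.normal(size=(d, d)) * 0.1

    def residual_block(x):
        # two feed-forward layers plus a skip connection: x + F(x)
        return x + np.maximum(x @ W1, 0) @ W2

    x = rng.normal(size=(d,))
    y = residual_block(residual_block(x))      # stacking blocks = a stack of feed-forward layers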


There is plenty of feedback within the retina (e.g. horizontal-cell modulation of photoreceptor activity) and from layer 6 of cortex to the dorsal lateral geniculate nucleus, to modulate transmission characteristics and suppress noise in the LGN input to layer 4.



