Author here. Was not expecting to see this pop up on Hacker News!
This was a short write-up of a set of experiments exploring the importance of attention in transformers. I'm glad to see that people are still reading it and hopefully finding it interesting.
Side note: the title should probably have (2021) in it, as this was posted to arXiv in May 2021 (concurrently with a number of similar works such as MLP-Mixer).
An issue with using the same benchmark for everything is that you eventually start training on the test set by natural selection, via people throwing out all the attempts that don't work.
I'm not sure how to prevent this; maybe you can preregister your programmers?
If you only care about identically distributed test data, test-set overfitting doesn't happen that fast: if you evaluate M models on N test samples, the overfitting error is on the order of sqrt(log M / N). And even as this error becomes noticeable, the relative ranks among the models are more stable still, because you can apply small-variance bounds. This has actually been verified on models proposed for CIFAR-10.
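To make the scaling concrete, here's a rough back-of-the-envelope sketch in Python. It just evaluates the Hoeffding-plus-union-bound estimate sqrt(log M / (2N)) for the 0-1 loss; the model count of 10^5 is made up for illustration, not a real count of CIFAR-10 submissions.

```python
import math

def adaptive_overfitting_bound(num_models: int, num_test_samples: int) -> float:
    """Union bound + Hoeffding: worst-case gap between test accuracy and true
    accuracy across all evaluated models, for 0-1 loss on i.i.d. test data."""
    return math.sqrt(math.log(num_models) / (2 * num_test_samples))

# CIFAR-10's test set has 10,000 images; pretend the community has tried ~1e5 models.
print(adaptive_overfitting_bound(10**5, 10_000))  # ~0.024, i.e. about 2.4 accuracy points
```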
No. I was referring to the "standard concentration bound" in that paper, which applies when you have separate validation and test sets. I think the argument can usually be improved by applying small-variance inequalities such as Bernstein's to excess-risk-like quantities such as l(f_hat(x), y) - l(f_ref(x), y), to show that the accuracy difference / relative rank enjoys better guarantees. For ImageNet we can use the 0-1 loss and set f_ref to a SoTA classifier which, while having its loss bounded away from 0, is "mostly similar" to most f_hat's, and thus leads to a small excess risk.
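If it helps, here's a toy sketch of why the excess-risk version is tighter: when f_hat and f_ref agree on most samples, the per-sample loss difference has tiny variance, so an empirical Bernstein bound (Maurer & Pontil style) on that difference is much narrower than a Hoeffding bound on either loss alone. The 5% disagreement rate and the synthetic data below are made up for illustration.

```python
import numpy as np

def empirical_bernstein_radius(diffs: np.ndarray, delta: float = 0.05) -> float:
    """One-sided empirical Bernstein deviation bound (Maurer & Pontil, 2009)
    for the mean of samples in [-1, 1] (range 2)."""
    n = len(diffs)
    var = diffs.var(ddof=1)
    return np.sqrt(2 * var * np.log(2 / delta) / n) + 7 * 2 * np.log(2 / delta) / (3 * (n - 1))

rng = np.random.default_rng(0)
n = 50_000  # ImageNet-val-sized test set
# Per-sample loss differences l(f_hat(x), y) - l(f_ref(x), y) in {-1, 0, +1};
# assume the two classifiers disagree on ~5% of samples.
diffs = rng.choice([-1, 0, 1], size=n, p=[0.025, 0.95, 0.025])
print(diffs.mean(), empirical_bernstein_radius(diffs))  # radius ~0.003, i.e. ~0.3 points
```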
The CIFAR experiments I mentioned are in https://arxiv.org/pdf/1806.00451.pdf. That paper doesn't contain this argument (unfortunate wording on my part), but its results appear to support it well.
Looking forward to the (very productive) tinkering phase of neural networks transitioning to a synthesis period, where we understand better the essential ingredients and why they’re essential.
Looks like the attention layers aren't as important as everyone thought, because using a similarly sized feed-forward layer works almost as well (74% vs. 77% top-1 accuracy).
Wouldn't attention networks use less computation for the same number of weights? Feed-forward networks have higher connectivity, no?
If a network is the same size, less computationally demanding, and gives you a 3% improvement, it seems extremely worthwhile, especially given the (mostly) diminishing returns of just adding more weights/layers.
Going from 25% top-1 error to 22% top-1 error is a massive jump on ImageNet and very meaningful in a lot of applications. That said, there is no reason to believe attention-based models are the only way forward for image classification. The humble ResNet-50 can get near 22% top-1 error when trained properly.
Whether we need attention or not is a more interesting question on seq2seq models on text data.
Not an expert, but as I understand it, "attention" is just a learned feature weighting: more important features get higher weights, less important features get lower weights. I can see why this would be useful as an explicit step in terms of computational efficiency, e.g. "calculate the most important features and operate on them." However, it seems like such information could easily be learned implicitly by other parts of the model without needing to model attention directly. Which is, I guess, the result in this paper.
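For what it's worth, the distinction shows up in a few lines of toy numpy: self-attention builds its token-mixing weights from the input itself, whereas the feed-forward replacement studied here learns one fixed mixing matrix shared across all inputs. Shapes and names below are illustrative, not the paper's actual implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
num_tokens, dim = 16, 64
x = rng.standard_normal((num_tokens, dim))  # one toy "image" as a sequence of tokens

# Self-attention: the (tokens x tokens) mixing matrix depends on the input x.
Wq, Wk, Wv = (rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3))
attn = softmax((x @ Wq) @ (x @ Wk).T / np.sqrt(dim))
out_attention = attn @ (x @ Wv)

# Feed-forward replacement: a learned but input-independent mixing across tokens,
# i.e. the same matrix is applied no matter what x is.
W_mix = rng.standard_normal((num_tokens, num_tokens)) / np.sqrt(num_tokens)
out_feedforward = W_mix @ x
```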
"MLP Mixer: An all-MLP Architecture for Vision" is #1 on the list. The paper we are commenting on is #2 on the list. They were uploaded to arxiv two days apart!
This paper came out after. You read the date wrong. The weird formatting and very short length also point to this. This paper is a clear rip-off and an instance of academic fraud.
There is plenty of feedback within the retina (e.g. horizontal cell modulation of photoreceptor activity) and from layer 6 of cortex to the dorsal lateral geniculate nucleus, modulating transmission characteristics and suppressing noise in the LGN input to layer 4.