Going from 25% to 22% top-1 error is a massive jump on ImageNet and very meaningful in a lot of applications. That said, there is no reason to believe attention-based models are the only way forward on image classification. The humble ResNet-50 can get near 22% top-1 error when trained properly.
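To make that claim easy to check, here is a minimal sketch of measuring a ResNet-50's top-1 error with a torchvision checkpoint trained under an improved recipe; the `./imagenet/val` path, batch size, and worker count are placeholders, and the exact number you get depends on the recipe and evaluation setup.

```python
# Minimal sketch (assumes torchvision >= 0.13 and an ImageNet-1k val split
# laid out as class folders under ./imagenet/val).
import torch
from torchvision import datasets, models

# IMAGENET1K_V2 weights come from an improved training recipe; the exact
# top-1 figure depends on the recipe and the evaluation pipeline.
weights = models.ResNet50_Weights.IMAGENET1K_V2
model = models.resnet50(weights=weights).eval()

preprocess = weights.transforms()  # resize/crop/normalize used at eval time
val_set = datasets.ImageFolder("./imagenet/val", transform=preprocess)
loader = torch.utils.data.DataLoader(val_set, batch_size=128, num_workers=8)

correct = total = 0
with torch.no_grad():
    for images, labels in loader:
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.numel()

print(f"top-1 error: {1 - correct / total:.3f}")
```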
Whether we need attention or not is a more interesting question for seq2seq models on text data.