Right, but part of the argument in the paper, as I understand it, is that the self-attention layers can increase the intrinsic dimension of the input data if you feed it additional, relevant context.
I guess you could also use this result to find that a smaller network might be sufficient for your particular problem.
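For concreteness, here is a minimal sketch of one common intrinsic-dimension estimator, TwoNN (Facco et al., 2017). I'm not claiming this is the estimator the paper uses; it's just an illustration of how you might measure the intrinsic dimension of your data (or of a layer's activations) to inform a decision about model size. The function and example data below are my own.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def twonn_intrinsic_dimension(X):
    """TwoNN intrinsic-dimension estimate (Facco et al., 2017).

    X: (n_samples, n_features) array of data points or layer activations.
    Returns the maximum-likelihood estimate of the intrinsic dimension.
    """
    # Distances to the two nearest neighbours (index 0 is the point itself).
    dists, _ = NearestNeighbors(n_neighbors=3).fit(X).kneighbors(X)
    r1, r2 = dists[:, 1], dists[:, 2]

    # Drop duplicate points so the ratio is well defined.
    mask = r1 > 0
    mu = r2[mask] / r1[mask]

    # The ratios follow a Pareto law whose shape parameter is the intrinsic
    # dimension; this is its maximum-likelihood estimate.
    return len(mu) / np.sum(np.log(mu))

# Example: a noisy 2-D manifold embedded in a 10-D ambient space.
rng = np.random.default_rng(0)
u = rng.uniform(size=(2000, 2))                                  # 2 latent coordinates
embed = np.hstack([u, np.sin(u[:, :1]), u[:, :1] * u[:, 1:]])    # nonlinear 4-D embedding
X = np.hstack([embed, np.zeros((2000, 6))]) + 0.01 * rng.normal(size=(2000, 10))
print(twonn_intrinsic_dimension(X))  # should come out close to 2
```

If the estimated dimension is much lower than the ambient dimension, that is the kind of evidence you might use to argue a narrower network could be sufficient for the problem.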
If you have additional context that is relevant, feed it to the network. Why wouldn't you? As for the size of the network, that is not a straightforward benefit, because you have to account for the trade-off between model size and training efficiency.