Ah, I see. Well, the data has a specific intrinsic dimension. You don't get to choose that. And, in any case, you want a network quite a bit larger than the intrinsic dimension, because deep learning needs redundancy in its weights in order to train correctly.
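To make "intrinsic dimension" concrete, here is a rough sketch of one way to estimate it. This is only a linear proxy (counting PCA components), not the method from the paper, and everything in it is made up for illustration: the function name `pca_intrinsic_dim`, the 0.99 variance threshold, and the synthetic data with 5 degrees of freedom embedded in 64 dimensions. Real data usually sits on a curved manifold, where a linear estimate like this will tend to overestimate.

```python
import numpy as np

def pca_intrinsic_dim(X, var_threshold=0.99):
    """Smallest number of principal components whose cumulative
    explained variance reaches var_threshold (a linear proxy only)."""
    Xc = X - X.mean(axis=0)
    # Squared singular values of the centered data are proportional
    # to the variance along each principal direction.
    s = np.linalg.svd(Xc, compute_uv=False)
    cum = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(cum, var_threshold) + 1)

# Hypothetical data: 5 latent degrees of freedom, observed in 64-D.
rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 5))
# Orthonormal embedding, so each latent direction keeps equal variance.
q, _ = np.linalg.qr(rng.normal(size=(64, 5)))
X = latent @ q.T + 1e-3 * rng.normal(size=(1000, 64))

d_hat = pca_intrinsic_dim(X)
print(d_hat)
```

The point is just that the ambient size (64 here) and the number you recover are different things, and the second one is a property of the data rather than something you pick.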
Right, but part of the argument in the paper, as I understand it, is that the self-attention layers can increase the intrinsic dimension of the input data if you feed them additional, relevant context.
I guess you could also use this result to find that a smaller network might be sufficient for your particular problem.
If you have additional context that is relevant, feed it to the network. Why wouldn't you? As for the size of the network, a smaller one is not a simple win, because you need to account for the trade-off between model size and training efficiency.