But the point is to focus on the intrinsic dimension[1], not the dimensionality of the vector itself. I meant "dense" in the sense that the two vectors are close to each other, relative to another vector that is not so close. Perhaps a poor choice of words on my part.
Ah, I see. Well, the data has a specific intrinsic dimension; you don't get to choose it. And in any case, you want a representation quite a bit larger than the intrinsic dimension, because deep learning needs redundancy in its weights in order to train properly.
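To make "a specific intrinsic dimension" concrete, here is a minimal sketch (assuming numpy and scikit-learn; PCA only sees linear structure, so it is a crude proxy for the estimators discussed in [1]). The data sits in 100 ambient dimensions but has only 5 intrinsic degrees of freedom, and the estimate recovers 5 no matter how wide the vectors are:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    latent = rng.normal(size=(1000, 5))    # 5 intrinsic degrees of freedom
    embed = rng.normal(size=(5, 100))      # random linear map into 100-D
    X = latent @ embed + 0.01 * rng.normal(size=(1000, 100))  # slight noise

    pca = PCA().fit(X)
    cum = np.cumsum(pca.explained_variance_ratio_)
    print(int(np.searchsorted(cum, 0.99)) + 1)  # components for 99% variance: 5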
Right, but part of the argument in the paper, as I understand it, is that the self-attention layers can increase the intrinsic dimension of the input data if you feed the model additional, relevant context.
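One hypothetical way to probe that (not the paper's experiment; the participation-ratio estimator, the shapes, and the untrained PyTorch layer below are all my own assumptions) is to give the same queries a larger pool of keys/values and compare an effective-dimension estimate of the outputs. With random weights this only illustrates the measurement; the claimed effect would have to be tested on a trained model's hidden states:

    import torch

    torch.manual_seed(0)
    attn = torch.nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)

    def effective_dim(h):
        # participation ratio of the covariance spectrum: (sum l_i)^2 / sum l_i^2
        h = h - h.mean(dim=0, keepdim=True)
        eig = torch.linalg.eigvalsh(h.T @ h)
        return (eig.sum() ** 2 / (eig ** 2).sum()).item()

    tokens = torch.randn(1, 16, 64)    # the original input sequence
    context = torch.randn(1, 48, 64)   # extra, supposedly relevant context
    pool = torch.cat([tokens, context], dim=1)

    with torch.no_grad():
        out_plain, _ = attn(tokens, tokens, tokens)  # attend over the input alone
        out_ctx, _ = attn(tokens, pool, pool)        # same queries, richer keys/values

    print(effective_dim(out_plain[0]), effective_dim(out_ctx[0]))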
I guess you could also use this result to find that a smaller network might be sufficient for your particular problem.
If you have additional context that is relevant, feed it to the network. Why wouldn't you? As for the size of the network, a smaller one is not a simple win, because you have to account for the trade-off between model size and training efficiency.
[1]: https://en.wikipedia.org/wiki/Intrinsic_dimension