
Okay. So more weights = more parameter space for expression.

And?




The way I understood it, it's more like the opposite. That is, if you feed the non-linear layers "dense" data, i.e. data with higher intrinsic dimension, they perform better. Thus, you could potentially get by with smaller non-linear layers by "condensing" the input before passing it through them.


This doesn't make any sense. Higher dimension means less dense. Far less dense, actually.


But the point is to focus on the intrinsic dimension[1], not the dimension of the vector itself. I meant dense in the sense that the intrinsic dimension is close to the vector's dimension, as opposed to a vector where the two are far apart. Perhaps a poor choice of words on my part.

[1]: https://en.wikipedia.org/wiki/Intrinsic_dimension
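
To make the distinction concrete, here is a minimal sketch in plain Python. It uses linear rank as a crude stand-in for intrinsic dimension, which is an assumption on my part: rank only captures data lying on a linear subspace, and this is not the method from the paper. The names `numerical_rank`, `low_rank_points` and `full_rank_points` are made up for illustration.

```python
import random

def numerical_rank(rows, tol=1e-9):
    """Rank of a matrix (list of row lists) via Gaussian elimination with partial pivoting."""
    m = [list(r) for r in rows]
    n_cols = len(m[0])
    rank = 0
    for col in range(n_cols):
        # Pick the not-yet-used row with the largest entry in this column.
        pivot = max(range(rank, len(m)), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue  # no usable pivot: this column adds no new direction
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
        if rank == len(m):
            break
    return rank

random.seed(0)
AMBIENT_DIM = 10  # length of each vector

# "Sparse" case: 10-dimensional vectors that are all combinations of 2 basis vectors.
basis = [[random.gauss(0, 1) for _ in range(AMBIENT_DIM)] for _ in range(2)]
low_rank_points = []
for _ in range(50):
    a, b = random.gauss(0, 1), random.gauss(0, 1)
    low_rank_points.append([a * u + b * v for u, v in zip(basis[0], basis[1])])

# "Dense" case: 10-dimensional vectors with independent coordinates.
full_rank_points = [[random.gauss(0, 1) for _ in range(AMBIENT_DIM)] for _ in range(50)]

print(numerical_rank(low_rank_points))   # linear dimension actually used by the first set
print(numerical_rank(full_rank_points))  # linear dimension actually used by the second set
```

Both sets have vectors of length 10, but the first only occupies a 2-dimensional subspace, while the second fills all 10 dimensions. That second case is what I was calling "dense".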


Ah, I see. Well, the data has an intrinsic dimension of a specific size. You don't get to choose that. And, in any case, you want something quite a bit larger than the intrinsic dimension, because deep learning needs redundancy in its weights in order to train correctly.


Right, but part of the argument in the paper, as I understand it, is that the self-attention layers can increase the intrinsic dimension of the input data if you feed them additional, relevant context.

I guess you could also use this result to find that a smaller network might be sufficient for your particular problem.


If you have additional context that is relevant, feed it to the network. Why wouldn't you? As to the size of the network, this is not a simple benefit, because you need to account for the trade-off between model size and training efficiency.



