
> First, if an attention node has low confidence, it can already assign similar scores pre-softmax. Then we get what looks like a uniform distribution as output.
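As a quick sketch of the quoted mechanism (my own illustration, not from the comment): when the pre-softmax scores are nearly equal, the softmax output is close to uniform, i.e. near-maximum entropy; a single dominant score gives a peaked, near-zero-entropy output.

```python
# Illustration: near-equal pre-softmax attention scores -> near-uniform
# (maximum-entropy) attention weights; one dominant score -> peaked output.
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(ps):
    return -sum(p * math.log(p) for p in ps if p > 0)

flat = softmax([0.01, 0.02, 0.0, 0.015])  # near-equal scores
peaked = softmax([10.0, 0.0, 0.0, 0.0])   # one dominant score

print(flat)             # each weight close to 0.25
print(entropy(flat))    # close to log(4) ~ 1.386, the maximum for 4 outcomes
print(entropy(peaked))  # close to 0
```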

Disagree here. I think neural nets are quite bad at implicitly learning low entropy transforms, similar to how they struggle to model the identity function, which is what necessitates residual connections. In both cases the change doesn't increase expressivity, but it does bake in needle-in-a-haystack transformations that may otherwise be hard to reach with gradient descent.
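The residual-connection analogy can be sketched numerically (my own illustration, using a random small-init weight matrix): a plain layer y = Wx must drive W all the way to the identity matrix to pass its input through, while a residual layer y = x + Wx only needs W near zero, which is where gradient descent starts.

```python
# Illustration: with a residual connection, "learn the identity" becomes
# "learn zero", which is trivially close to a small random initialization.
import random

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

n = 3
x = [1.0, -2.0, 0.5]
# Small random init, as in a freshly initialized layer.
W_small = [[random.uniform(-0.01, 0.01) for _ in range(n)] for _ in range(n)]

plain = matvec(W_small, x)                                      # near zero, far from x
residual = [xi + yi for xi, yi in zip(x, matvec(W_small, x))]   # already ~ x

print(plain)     # a plain layer at init destroys the input
print(residual)  # a residual layer at init passes it through
```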

Can't speak to how useful it is though.

Surely you mean high-entropy, i.e., uniform? We are talking about extremely low-entropy predictions as being the problem here.


yep - always get that the wrong way round haha
