
> First, if an attention node has low confidence, it can already assign similar scores pre-softmax. Then we get what looks like a uniform distribution as output.
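As a quick sketch of the quoted mechanism (my own illustration, not from the comment): when the pre-softmax scores are nearly equal, the softmax output is close to uniform, i.e. near-maximum entropy; a single dominant score gives a peaked, near-zero-entropy output.

```python
# Illustration: near-equal pre-softmax attention scores -> near-uniform
# (maximum-entropy) attention weights; one dominant score -> peaked output.
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(ps):
    return -sum(p * math.log(p) for p in ps if p > 0)

flat = softmax([0.01, 0.02, 0.0, 0.015])  # near-equal scores
peaked = softmax([10.0, 0.0, 0.0, 0.0])   # one dominant score

print(flat)             # each weight close to 0.25
print(entropy(flat))    # close to log(4) ~ 1.386, the maximum for 4 outcomes
print(entropy(peaked))  # close to 0
```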

Disagree here. I think neural nets are quite bad at implicitly learning low entropy transforms, similar to how they struggle to model the identity function, which is what necessitates residual connections. In both cases the change doesn't increase expressivity, but it does bake in needle-in-a-haystack transformations that may otherwise be hard to reach with gradient descent.
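The residual-connection analogy can be sketched numerically (my own illustration, using a random small-init weight matrix): a plain layer y = Wx must drive W all the way to the identity matrix to pass its input through, while a residual layer y = x + Wx only needs W near zero, which is where gradient descent starts.

```python
# Illustration: with a residual connection, "learn the identity" becomes
# "learn zero", which is trivially close to a small random initialization.
import random

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

n = 3
x = [1.0, -2.0, 0.5]
# Small random init, as in a freshly initialized layer.
W_small = [[random.uniform(-0.01, 0.01) for _ in range(n)] for _ in range(n)]

plain = matvec(W_small, x)                                      # near zero, far from x
residual = [xi + yi for xi, yi in zip(x, matvec(W_small, x))]   # already ~ x

print(plain)     # a plain layer at init destroys the input
print(residual)  # a residual layer at init passes it through
```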

Can't speak to how useful it is though.

Surely you mean high-entropy, i.e., uniform? We are talking about extremely low-entropy predictions as being the problem here.


yep - always get that the wrong way round haha
