This is very informative and well written. Thanks OP. This stood out:
> Self-attention allows a model to “look back” or “look elsewhere” in the data to figure out what to do in its current location. For example, when translating French to Chinese, to choose the next Chinese word to output, it can look back at all the different input French words to help decide.
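For anyone who wants to see that "looking back" idea concretely, here's a minimal sketch of scaled dot-product self-attention in plain NumPy (the function name, shapes, and toy data are illustrative, not from the article): every output position computes weights over all input positions from the data itself, then takes a weighted average of them.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal attention sketch: Q, K, V are (seq_len, d) arrays."""
    d = Q.shape[-1]
    # How strongly each position should "look at" every other position.
    scores = Q @ K.T / np.sqrt(d)                 # (seq_len, seq_len)
    # Softmax turns raw scores into attention weights that sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted average over all value vectors,
    # i.e. the model "looking back" across the whole input.
    return weights @ V

# Toy example: 4 "tokens" with 8-dimensional embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)       # self-attention: Q = K = V
print(out.shape)                                  # (4, 8)
```

In a real translation model the queries would come from the Chinese side being generated and the keys/values from the French input (that's cross-attention), but the mechanism is the same weighted lookup shown above.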