Attention is a convex combination of the values (meaning the coefficients are nonnegative and sum to 1), where the coefficients are generated at runtime.

The values are given by the Value kernel, which is often just a linear map.

The strength of each coefficient is given by the dot product of the query with the keys. The queries and keys are given by the Query and Key kernels, both of which are usually linear maps.

For each token, you have a (query, key, value) triplet of vectors, each of dimension D. You take the query (D, 1) and the matrix (N x D) of the keys of all tokens, and do a matrix-vector product to get the scaling scores. You apply softmax to make them convex coefficients, and now you have a vector (N, 1). You then do another matrix-vector product against the (N x D) matrix of values, which gives you the output for the current token (1, D).

Repeat N times, and that’s it.
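
Here is a minimal NumPy sketch of exactly that per-token loop. The sizes, the random weights, and the softmax helper are illustrative assumptions, not any particular library's API:

    import numpy as np

    def softmax(z):
        z = z - z.max()            # stabilise before exponentiating
        e = np.exp(z)
        return e / e.sum()

    N, D = 5, 8                    # N tokens, embedding dimension D
    rng = np.random.default_rng(0)
    X = rng.normal(size=(N, D))    # token embeddings

    # The Query, Key, and Value kernels as plain linear maps (no bias).
    W_q = rng.normal(size=(D, D))
    W_k = rng.normal(size=(D, D))
    W_v = rng.normal(size=(D, D))

    Q = X @ W_q                    # (N, D): one query per token
    K = X @ W_k                    # (N, D): one key per token
    V = X @ W_v                    # (N, D): one value per token

    out = np.zeros((N, D))
    for i in range(N):             # "repeat N times"
        scores = K @ Q[i]          # (N,): query i dotted with every key
        coeffs = softmax(scores)   # (N,): nonnegative, sums to 1
        out[i] = coeffs @ V        # (D,): convex combination of the values

(The usual 1/sqrt(D) scaling of the scores before the softmax is left out to match the description above.)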

If you would like to change which tokens each token considers, e.g. so that a token knows only of preceding tokens, you can take the element-wise product of the coefficient vector with a mask vector that describes which tokens the current token may consider, and then divide by their inner product. This renormalises the valid coefficients and keeps the rest at 0, so they are ignored when you combine all the values together.
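
Continuing the sketch above, here is that masking step with a causal mask (token i sees only tokens 0..i), applied exactly as described: element-wise product, then renormalise by the inner product:

    for i in range(N):
        coeffs = softmax(K @ Q[i])  # (N,) convex coefficients, as before
        mask = np.zeros(N)
        mask[: i + 1] = 1.0         # 1 where token i may look, else 0
        masked = coeffs * mask      # invalid coefficients become 0
        masked /= coeffs @ mask     # inner product renormalises to sum 1
        out[i] = masked @ V

(Implementations usually add -inf to the masked scores before the softmax instead, which produces the same coefficients.)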

If you would like to increase the number of attention heads, which can give you finer control of the scaling process, you can split the query, key, and value vectors into segments, one segment per head. E.g. the first 2 indices go to head 1, indices 3 and 4 go to head 2, and so on. Then you use those segments to do attention, repeating the process above for each head, and combine the results into a single vector.
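
A sketch of that head-splitting, reusing Q, K, V and the softmax helper from the first snippet; H = 2 heads is an arbitrary choice:

    H = 2                              # number of heads; assumes D % H == 0
    d = D // H                         # segment length per head
    out_mh = np.zeros((N, D))
    for h in range(H):
        s = slice(h * d, (h + 1) * d)  # the indices this head owns
        Qh, Kh, Vh = Q[:, s], K[:, s], V[:, s]
        for i in range(N):
            coeffs = softmax(Kh @ Qh[i])  # per-head convex coefficients
            out_mh[i, s] = coeffs @ Vh    # write this head's segment back

Each head writes into its own segment, so the rows of out_mh are already the combined single vectors.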

Idk if I confused queries with keys, but this is the gist of it.

In practice, the operations are much more optimised than what I described; everything is batched into large matrix-matrix products rather than per-token loops.
