This part of his post where he explains vector embeddings of the input/output tokens just looks wrong to me:
>This vector seems to get taller every model year, for example the recent LLaMA 2 model from Meta uses an embedding vector of length 3,204, which works out to 6KB+ in half-precision floating-point, just to represent one word in the vocabulary, which typically contains 30,000 - 50,000 entries.
>Now if you’re a memory-miserly C programmer like me, you might wonder, why in the world are these AI goobers using 6KB to represent something that ought to take, like 2 bytes tops? If their vocabulary is less than 2^16=65,384, we only need 16 bits to represent an entry, yeah?
>Well, here is what the Transformer is actually doing: it transforms (eh?) that input vector to an output vector of the same size, and that final 6KB output vector needs to encode absolutely everything needed to predict the token after the current one. The job of each layer of the Transformer is quite literally adding information to the original, single-word vector. This is where the residual (née skip) connections come in: all of the attention machinery is just adding supplementary material to that original two bytes’ worth of information, analyzing the larger context to indicate, for instance, that the word pupil is referring to a student, and not to the hole in your eye.
Firstly, he is confusing representation with encoding--he's right that 2 bytes is enough to encode any token. That is in fact approximately how it's done: the token ID is used as an index into a code book (stored as a long integer in PyTorch, at least when I last worked with it ~6 months ago). The purpose of the embedding is to allow the model to learn a representation of the token, a la word2vec. (Though this initial representation depends only on the token itself, not on its context, so it does not distinguish between "student" and "eye" in the case of "pupil" as in his example.)
Secondly, his description of each layer's function as adding information to the original vector misses the mark IMO--it is more like the original input is convolved with the weights of the transformer into the output. I am probably missing the mark a bit here as well.
Lastly, his statement that the embedding vector of the final token output needs all the info for the next token is plainly incorrect. The final decoder layer, when predicting the next token, uses all the information from the previous layer's hidden layer, which is the size of the hidden units times the number of tokens so far.
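To make the first point concrete, here is a minimal pure-Python sketch of the index-versus-representation split. The sizes are toy values and the names (`VOCAB_SIZE`, `D_MODEL`, `codebook`, `embed`) are made up for illustration; a real model keeps a learned table in a tensor library rather than random lists.

```python
import random
import struct

VOCAB_SIZE = 1_000   # toy value; the post quotes 30,000 - 50,000 entries
D_MODEL = 8          # toy width; the post quotes 3,204 for LLaMA 2

random.seed(0)
# The "code book": one row of D_MODEL floats per vocabulary entry.
# Random values stand in for weights that would normally be learned.
codebook = [[random.gauss(0.0, 0.02) for _ in range(D_MODEL)]
            for _ in range(VOCAB_SIZE)]

def embed(token_id):
    """Return the representation (a vector) for an encoded token id."""
    return codebook[token_id]

token_id = 737
# The encoding: an unsigned 16-bit integer is enough, since 2**16 == 65536
# is larger than the vocabulary.
encoded = struct.pack("<H", token_id)
# The representation: D_MODEL numbers looked up by that index.
vector = embed(token_id)

# Size of one vector at the width quoted in the post, at 2 bytes per value.
bytes_per_vector_fp16 = 3204 * 2
print(len(encoded), len(vector), bytes_per_vector_fp16)
```

The two bytes identify which token it is; the vector is where the model has room to store what it has learned about that token.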
> Secondly, his description of each layer's function as adding information to the original vector misses the mark IMO--it is more like the original input is convolved with the weights of the transformer into the output. I am probably missing the mark a bit here as well.
> Lastly, his statement that the embedding vector of the final token output needs all the info for the next token is plainly incorrect. The final decoder layer, when predicting the next token, uses all the information from the previous layer's hidden layer, which is the size of the hidden units times the number of tokens so far.
I think the author is correct. Information is only moved between tokens in the attention layers, not in the MLP layers or in the final linear layer before the softmax. You can see how it’s implemented in nanoGPT:
https://github.com/karpathy/nanoGPT/blob/f08abb45bd2285627d1...
At training time, probabilities for the next token are computed at every position, so if we feed in a sequence of n tokens, we basically get n training examples, one for each position. At inference time, we only compute the prediction for the last position, since we’ve already output the preceding tokens.
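A toy sketch of that claim, in plain Python rather than nanoGPT's actual code: a causal attention step mixes information across positions, while the MLP stand-in and the final linear layer act on each position separately. Everything here (`causal_attention`, `per_position`, `W_MLP`, `W_OUT`, the sizes) is hypothetical, and the attention has no learned query/key/value projections.

```python
import math
import random

random.seed(0)
D, VOCAB = 4, 6

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

W_MLP = rand_matrix(D, D)      # stand-in for the MLP block
W_OUT = rand_matrix(VOCAB, D)  # final linear layer before the softmax

def causal_attention(xs):
    """Each position adds a softmax-weighted mix of itself and earlier positions."""
    out = []
    for i, q in enumerate(xs):
        scores = [sum(a * b for a, b in zip(q, xs[j])) for j in range(i + 1)]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        mixed = [sum(w * xs[j][k] for j, w in enumerate(weights)) / total
                 for k in range(D)]
        out.append([a + b for a, b in zip(q, mixed)])  # residual add
    return out

def per_position(xs):
    """MLP stand-in: applied to every position independently, with a residual add."""
    return [[a + math.tanh(b) for a, b in zip(x, matvec(W_MLP, x))] for x in xs]

def forward(xs):
    return per_position(causal_attention(xs))

tokens = rand_matrix(3, D)               # three toy token embeddings
hidden = forward(tokens)
logits_last = matvec(W_OUT, hidden[-1])  # uses only the last position's vector

# Wiping the other positions' final vectors leaves those logits unchanged ...
tampered = [[0.0] * D for _ in hidden[:-1]] + [hidden[-1]]
logits_tampered = matvec(W_OUT, tampered[-1])

# ... but changing an earlier input token does change them, via attention.
changed = [[x + 1.0 for x in tokens[0]]] + tokens[1:]
logits_changed = matvec(W_OUT, forward(changed)[-1])
```

In this sketch the context reaches the last position only through the attention step, and whatever it contributes has to fit into that one vector of length `D` before the final linear layer sees it.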
>This vector seems to get taller every model year, for example the recent LLaMA 2 model from Meta uses an embedding vector of length 3,204, which works out to 6KB+ in half-precision floating-point, just to represent one word in the vocabulary, which typically contains 30,000 - 50,000 entries.
>Now if you’re a memory-miserly C programmer like me, you might wonder, why in the world are these AI goobers using 6KB to represent something that ought to take, like 2 bytes tops? If their vocabulary is less than 2^16=65,384, we only need 16 bits to represent an entry, yeah?
The reason we have 3204 2B allocations is that each of the 2B contains info (latent space dimension). If you go to just the 2B representation, it is effectively a one-hot encoding, which completely defeats the purpose of word embeddings.
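As a rough illustration of the one-hot point: a bare token id carries the same information as a one-hot vector, and multiplying that one-hot vector by an embedding table just selects a row. The table values and helper names below are made up.

```python
VOCAB, D = 5, 3
# Toy embedding table: row r holds [10r, 10r + 1, 10r + 2].
table = [[float(10 * r + c) for c in range(D)] for r in range(VOCAB)]

def one_hot(token_id, size):
    return [1.0 if i == token_id else 0.0 for i in range(size)]

def one_hot_times_table(token_id):
    """Multiply a one-hot row vector by the table; this picks out one row."""
    oh = one_hot(token_id, VOCAB)
    return [sum(oh[r] * table[r][c] for r in range(VOCAB)) for c in range(D)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

row_by_index = table[3]
row_by_matmul = one_hot_times_table(3)

# Any two distinct one-hot vectors are orthogonal, so on their own they say
# nothing about which tokens are similar; dense rows can.
similarity_one_hot = dot(one_hot(1, VOCAB), one_hot(2, VOCAB))
similarity_dense = dot(table[1], table[2])
```

The id alone tells you which token it is and nothing else; the dense row is where a notion of similarity between tokens can live.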
> The reason we have 3204 2B allocations is that each of the 2B contains info (latent space dimension).
I think the author is more correct than you are. It is not necessarily the case that we need 3,204 dimensions to represent the information contained in the tokens; in fact, the token embeddings live in a low-dimensional subspace; see footnote 6 here:
> We performed PCA analysis of token embeddings and unembeddings. For models with large d_model, the spectrum quickly decayed, with the embeddings/unembeddings being concentrated in a relatively small fraction of the overall dimensions. To get a sense for whether they occupied the same or different subspaces, we concatenated the normalized embedding and unembedding matrices and applied PCA. This joint PCA process showed a combination of both "mixed" dimensions and dimensions used only by one; the existence of dimensions which are used by only one might be seen as a kind of upper bound on the extent to which they use the same subspace.
So some of the embedding dimensions are used to encode the input tokens and some are used to pick the output tokens (some are used for both), and everything else is only used in intermediate computations. This suggests that you might be able to improve on the standard transformer architecture by varying the dimension across layers (increasing it, or increasing and then decreasing it), rather than using the same embedding dimensionality at each layer.