This part of his post where he explains vector embeddings of the input/output tokens just looks wrong to me:

>This vector seems to get taller every model year, for example the recent LLaMA 2 model from Meta uses an embedding vector of length 3,204, which works out to 6KB+ in half-precision floating-point, just to represent one word in the vocabulary, which typically contains 30,000 - 50,000 entries.

>Now if you’re a memory-miserly C programmer like me, you might wonder, why in the world are these AI goobers using 6KB to represent something that ought to take, like 2 bytes tops? If their vocabulary is less than 2^16=65,384, we only need 16 bits to represent an entry, yeah?

>Well, here is what the Transformer is actually doing: it transforms (eh?) that input vector to an output vector of the same size, and that final 6KB output vector needs to encode absolutely everything needed to predict the token after the current one. The job of each layer of the Transformer is quite literally adding information to the original, single-word vector. This is where the residual (née skip) connections come in: all of the attention machinery is just adding supplementary material to that original two bytes’ worth of information, analyzing the larger context to indicate, for instance, that the word pupil is referring to a student, and not to the hole in your eye.

Firstly, he is confusing representation with encoding--he's right that 2 bytes is enough to encode any token. That is in fact approximately how it's done: a code book is indexed into (with a longint in pytorch, at least last I worked with it ~6 months ago). The purpose of the embedding is to let the model learn a representation of the token, a la word2vec. (Though this input representation is a property of the token alone, with no context, so it does not distinguish between "student" and "eye" in the case of "pupil" as in his example.)
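
To make that concrete, here's a minimal PyTorch sketch of the distinction (the sizes are just the figures from the quoted post, not any particular model's): the token is encoded as a small integer index, and the embedding table turns that index into the learned representation.

    import torch
    import torch.nn as nn

    vocab_size = 32_000   # tokenizer entries (illustrative)
    d_model = 3_204       # embedding width from the quoted post

    # The code book: one learned d_model-dimensional row per token id.
    embedding = nn.Embedding(vocab_size, d_model)

    # The encoding really is just small integers (int64 indices in PyTorch)...
    token_ids = torch.tensor([17, 4093, 250], dtype=torch.long)

    # ...and the representation is the learned vector each id indexes into.
    vectors = embedding(token_ids)   # shape: (3, 3204)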

Secondly, his description of each layer's function as adding information to the original vector misses the mark IMO--it is more like the original input is convolved with the weights of the transformer into the output. I am probably missing the mark a bit here as well.

Lastly, his statement that the embedding vector at the final token position needs all the info for the next token is plainly incorrect. The final decoder layer, when predicting the next token, uses all the information in the previous layer's hidden states, whose total size is the hidden dimension times the number of tokens so far.




> Secondly, his description of each layer's function as adding information to the original vector misses the mark IMO--it is more like the original input is convolved with the weights of the transformer into the output. I am probably missing the mark a bit here as well.

I think his description is basically correct, given how the residual stream works: the output of each sublayer is added onto its input. See https://transformer-circuits.pub/2021/framework/index.html
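
Roughly, in a pre-norm block it looks like this (illustrative code, not any particular implementation; attn and mlp stand in for the usual sublayers):

    import torch.nn as nn

    class Block(nn.Module):
        def __init__(self, d_model, attn, mlp):
            super().__init__()
            self.ln1 = nn.LayerNorm(d_model)
            self.ln2 = nn.LayerNorm(d_model)
            self.attn = attn   # multi-head self-attention sublayer
            self.mlp = mlp     # feed-forward sublayer

        def forward(self, x):
            x = x + self.attn(self.ln1(x))  # sublayer output added onto the residual stream
            x = x + self.mlp(self.ln2(x))   # same for the MLP sublayer
            return x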

> Lastly, his statement that the embedding vector of the final token output needs all the info for the next token is plainly incorrect. The final decoder layer, when predicting the next token, uses all the information from the previous layer's hidden layer, which is the size of the hidden units times the number of tokens so far.

I think the author is correct. Information is only moved between tokens in the attention layers, not in the MLP layers or in the final linear layer before the softmax. You can see how it’s implemented in nanoGPT: https://github.com/karpathy/nanoGPT/blob/f08abb45bd2285627d1...

At training time, probabilities for the next token are computed at every position, so if we feed in a sequence of n tokens we basically get n training examples, one per position. At inference time, we only compute the next token, since we've already output the preceding ones.
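
A sketch in the spirit of nanoGPT's forward pass (not its exact code): the logits at position i depend only on the hidden state at position i, so training scores every position, while inference only needs the last one.

    import torch.nn.functional as F

    def next_token_logits(hidden, lm_head, targets=None):
        # hidden: (batch, n_tokens, d_model), per-position outputs of the last block
        if targets is not None:
            # training: every position predicts its successor -> n examples per sequence
            logits = lm_head(hidden)                        # (batch, n_tokens, vocab)
            return F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        # inference: only the final position's vector is needed for the next token
        return lm_head(hidden[:, [-1], :])                  # (batch, 1, vocab)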


Yep the author is completely wrong on point one:

>This vector seems to get taller every model year, for example the recent LLaMA 2 model from Meta uses an embedding vector of length 3,204, which works out to 6KB+ in half-precision floating-point, just to represent one word in the vocabulary, which typically contains 30,000 - 50,000 entries.

>Now if you’re a memory-miserly C programmer like me, you might wonder, why in the world are these AI goobers using 6KB to represent something that ought to take, like 2 bytes tops? If their vocabulary is less than 2^16=65,384, we only need 16 bits to represent an entry, yeah?

The reason we have 3,204 2-byte values is that each of them carries information (a latent-space dimension). If you drop down to just the 2-byte token ID, it is effectively a one-hot encoding, which completely defeats the purpose of a word embedding.
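
One way to see it (a small illustrative check, nothing from the post): looking up row i of the embedding matrix gives exactly what multiplying a one-hot vector by that matrix would, so the 2-byte ID only says which row to fetch; the information lives in the row.

    import torch

    vocab_size, d_model = 32_000, 3_204    # the sizes being discussed
    E = torch.randn(vocab_size, d_model)   # stands in for a learned embedding matrix

    token_id = 4093
    one_hot = torch.zeros(vocab_size)
    one_hot[token_id] = 1.0

    # Indexing by id and multiplying by the one-hot give the same vector.
    assert torch.allclose(E[token_id], one_hot @ E)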


> The reason we have 3,204 2-byte values is that each of them carries information (a latent-space dimension).

I think the author is more correct than you are. It is not necessarily the case that we need 3,204 dimensions to represent the information contained in the tokens; in fact, the token embeddings live in a low-dimensional subspace; see footnote 6 here:

https://transformer-circuits.pub/2021/framework/index.html

> We performed PCA analysis of token embeddings and unembeddings. For models with large d_model, the spectrum quickly decayed, with the embeddings/unembeddings being concentrated in a relatively small fraction of the overall dimensions. To get a sense for whether they occupied the same or different subspaces, we concatenated the normalized embedding and unembedding matrices and applied PCA. This joint PCA process showed a combination of both "mixed" dimensions and dimensions used only by one; the existence of dimensions which are used by only one might be seen as a kind of upper bound on the extent to which they use the same subspace.

So some of the embedding dimensions are used to encode the input tokens and some are used to pick the output tokens (some are used for both), and everything else is only used in intermediate computations. This suggests that you might be able to improve on the standard transformer architecture by increasing (or increasing and then decreasing) the dimension, rather than using the same embedding dimensionality at each layer.
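
Here's a hedged sketch of the kind of check that footnote describes, assuming you can pull a model's token-embedding matrix (a random matrix stands in below, and its spectrum will be flat; the quick decay only shows up for trained embeddings):

    import torch

    # Smaller random stand-in for a real (vocab_size x d_model) embedding matrix.
    E = torch.randn(8_000, 512)

    E_centered = E - E.mean(dim=0, keepdim=True)
    singular_values = torch.linalg.svdvals(E_centered)

    variance = singular_values ** 2
    cumulative = torch.cumsum(variance, dim=0) / variance.sum()
    effective_dims = int((cumulative < 0.95).sum()) + 1
    print(f"dimensions holding 95% of the variance: {effective_dims} of {E.shape[1]}")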



