
> misses the query/key/value weight

Did you click the right link? The words "query", "key", and "value" are in the image! For the rest, you'll want to read the paper: https://arxiv.org/abs/2102.11174

Embeddings were around long before transformers.

The image only depicts a single attention head, of course.




Ah, I didn’t notice the picture came from Jürgen Schmidhuber. I understand his arguments, and his accomplishments are significant, but his 90s designs were not transformers, and they lacked substantial elements that make transformers so efficient to train. He does have a bit of a reputation for claiming that many recent discoveries should be attributed to, or at least credit, his early designs, which, while not completely unfounded, mostly stretches the truth. Schmidhuber’s 2021 paper is interesting, but it describes a different design, one that is not how the GPT family (or Llama 2, etc.) was trained.

The transformer absolutely uses many things that were first suggested in earlier papers, but its specific implementation and combination is what makes it work well. Take the query/key/value system: if the fully-connected layer in the image is supposed to be some combination of the key and value weight matrices, the dimensionality is off. The embedding typically has the same vector size as the value (more precisely, the combined size of the values across attention heads, but the image has no attention heads), so that each transformer block has the same input structure. The query weight matrix is missing entirely. And while the dotted lines are not explained in the image, the way the weights are optimized doesn’t seem to match what is shown.
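
To make the dimensionality point concrete, here is a minimal sketch of standard single-head scaled dot-product attention (the usual transformer formulation, not the diagram under discussion). The names and sizes are illustrative assumptions; with a single head, d_v equals the embedding size d_model, so each block's output has the same shape as its input, and all three weight matrices W_q, W_k, W_v are needed:

  import numpy as np

  def softmax(x, axis=-1):
      x = x - x.max(axis=axis, keepdims=True)
      e = np.exp(x)
      return e / e.sum(axis=axis, keepdims=True)

  def single_head_attention(X, W_q, W_k, W_v):
      """X: (seq_len, d_model); W_q, W_k: (d_model, d_k); W_v: (d_model, d_v)."""
      Q = X @ W_q                      # queries
      K = X @ W_k                      # keys
      V = X @ W_v                      # values
      d_k = Q.shape[-1]
      scores = Q @ K.T / np.sqrt(d_k)  # (seq_len, seq_len) attention logits
      A = softmax(scores, axis=-1)     # attention weights
      return A @ V                     # (seq_len, d_v); with d_v == d_model the
                                       # output feeds the next block unchanged

  # Example: d_model = 8, so the output shape matches the input shape (4, 8).
  rng = np.random.default_rng(0)
  X = rng.normal(size=(4, 8))
  W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
  out = single_head_attention(X, W_q, W_k, W_v)
  assert out.shape == X.shape

Dropping W_q (as the image appears to do) leaves nothing to form the attention logits from, which is the core of the objection above.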



