
Do you have insight into the choice of the term "attention", which, according to this article's author, bears very little resemblance to the human sense of the word (i.e., human attention is selective, not averaging)?



No.

But to your point, note that in 2020 neuroscientists introduced the Tolman-Eichenbaum Machine (TEM) [1], a mathematical model of the hippocampus that bears a striking resemblance to transformer architecture.

Artem Kirsanov has a very nice piece on TEM, "Can we Build an Artificial Hippocampus?" [2] The link is directly to the spot where he makes the connection to transformers, although you should watch the whole video for context.

Because I wasn't clear on the chronology, I went back and asked one of the "Attention" authors whether mathematical models of the hippocampus had inspired their paper. His answer was "no". If TEM was developed without prior knowledge of transformers, then it's a very deep result IMHO.

[1] https://www.sciencedirect.com/science/article/pii/S009286742...

[2] https://www.youtube.com/watch?v=cufOEzoVMVA&t=1254s


There’s a video[1] of Karpathy recounting an email correspondence he had with Bahdanau. The email explains that the word “Attention” comes from Bengio, who, in one of his final reviews of the paper, determined it to be preferable to Bahdanau’s original idea of calling it “RNNSearch”.

[1] https://youtu.be/XfpMkf4rD6E?t=18m23s


"RNNSearch is all you need" probably wouldn't catch on and we'd still be ChatGPT-less.


Worked with PageRank and "map reduce", tho.


Nerds pay attention nevertheless.


Not OP and have no insight, but the thing that caused it to click for me was when I heard “this token attends to that token”. Basically, there’s a new value created that represents how much one thing (in an LLM, a token) cares about another thing.

Saying “attends to” vs “attention” helped clarify (for me) the mechanics of what’s going on.
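
To make that concrete, here's a toy NumPy sketch (my own illustration, nothing from the papers) of computing how much one token attends to the others, using the usual scaled dot product:

    import numpy as np

    d = 4                                   # toy embedding size
    rng = np.random.default_rng(0)
    query = rng.normal(size=d)              # vector for the token doing the attending
    keys = rng.normal(size=(3, d))          # vectors for the tokens being attended to

    scores = keys @ query / np.sqrt(d)      # one raw score per other token
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax: how much it "cares"
    print(weights)                          # sums to 1; larger = attends more

Each entry of weights is that "new value": how strongly this token cares about each of the others.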


An attention layer transforms word vectors by adding information from the other words in the sequence. The amount of information added from each neighboring word is regulated by a weight called the "attention weight". If the attention weight for one of the neighbors is enormously large, then nearly all of the added information comes from that word; in contrast, if the attention weight for a neighbor is zero, that neighbor adds no information at all. This is called an 'attention mechanism' because it literally decides which information passes through the network, i.e., which other words the model should 'pay attention to' when it is considering a particular word.
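
Here's a rough NumPy sketch of that description (simplified: I use the word vectors themselves as queries, keys, and values, whereas a real layer learns separate projections for each):

    import numpy as np

    def attention(X):
        # X: (seq_len, d) matrix of word vectors
        d = X.shape[-1]
        scores = X @ X.T / np.sqrt(d)                   # pairwise attention scores
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
        return weights @ X                              # blend information from neighbors

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 8))     # 5 words, 8-dim vectors
    out = attention(X)              # row i mixes the words that word i attends to

If one weight in a row is close to 1, the output for that word is almost entirely the corresponding neighbor's vector; if a weight is 0, that neighbor contributes nothing, exactly as described above.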


Mm, attention as used in earlier papers makes a lot more sense with respect to the term... there were several where it was literally used to focus on some part of an image at a higher resolution, for example.



