Coding Self-Attention, Multi-Head Attention, Cross-Attention, Causal-Attention

f38zf5vdt · 2024-01-14T16:41:31 1705250491

As mentioned, these are all toy implementations and you should not use them in production. If you want to the fast, easy, and extremely optimized way of doing things, use torch.nn.MultiheadAttention or torch.nn.functional.scaled_dot_product_attention so that you get the optimal implementations. You can use xformers scaled dot product attention if you want the bleeding edge of performance.

> (Note that the code presented in this article is intended for illustrative purposes. If you plan to implement self-attention for training LLMs, I recommend considering optimized implementations like Flash Attention, which reduce memory footprint and computational load.)

Flash attention is already part of torch's kernels as of torch 2, but the latest versions and optimizations land in xformers first.

radarsat1 · 2024-01-14T18:41:03 1705257663

It seems that there are some popular attention methods such as relative embeddings and rotary embeddings (rope embeddings?) that are not possible to implement using pytorch's implementation, if I understand correctly. Do these then require the "slow path" versions that can be more easily modified?

rasbt · 2024-01-14T21:10:18 1705266618

Yes, totally agree. These implementations are meant for educational purposes. You could in theory use them to train a model though (GPT-2 also had a from-scratch implementation if I recall correctly). In practice, you probably want to use FlashAttention though. You use it through `torch.nn.functional.scaled_dot_product_attention` etc.

nightski · 2024-01-17T17:58:16 1705514296

Do you typically take tutorial blog posts and put them in production?

f38zf5vdt · 2024-01-28T20:25:10 1706473510

More than I would like to admit

atticora · 2024-01-14T16:22:22 1705249342

  conscious, kŏn′shəs, adjective -- Characterized by or having an awareness of one's environment and one's own existence, sensations, and thoughts. synonym: aware.

Self-attention seems to be at least a proxy for "awareness of ... one's own existence." If that closed loop is the thing that converts sensibility into sentience, then maybe it's the source of LLM's leverage too. Is this language comprehension algorithm a sort of consciousness algorithm?

sk11001 · 2024-01-14T17:02:46 1705251766

ML attention is nothing like human attention. I think it’s madness to attempt to map concepts from one field we barely understand to another field we also barely understand just because they use overlapping language.

jampekka · 2024-01-14T18:07:56 1705255676

Having done some research into human attention, I have to agree with Hommel et al: No one knows what attention is [1].

In current ANNs "attention" is quite well defined: how to weigh some variables based on other variables. But anthropomorphizing such concepts indeed muddies things more than it clarifies. Including calling interconnected summation units with non-linear transformations "neural networks".

But such (wrong) intuition pumping terminology does attract, well, attention, so they get adopted.

[1] https://link.springer.com/article/10.3758/s13414-019-01846-w

kmeisthax · 2024-01-14T19:49:00 1705261740

No. Self-attention is more akin to kernel smoothing[0] on memorized training data that spits out a weighted probability graph. As for consciousness, LLMs are not particularly well aware of their own strengths and limitations, at least not unless you finetune them to know what they are and aren't good at. They also don't have sensors, so awareness of any environment is not possible.

If you trained a neural network with an attention mechanism using data obtained from, say, robotics sensors; then it might be able to at least have environmental awareness. The problem is that current LLM training approaches rely on large amounts of training data - easy to obtain for text, nonexistent for sensor input. I suspect awareness of one's own existence, sensations, and thoughts would additionally require some kind of continuous weight update[1], but I have no proof for that yet.

[0] https://en.wikipedia.org/wiki/Kernel_smoother

[1] Neural network weights are almost always trained in one big run, occasionally updated with fine-tuning, and almost never modified during usage of the model. All of ChatGPT's ability to learn from prior input comes from in-context learning which does not modify weights. This is also why it tends to forget during long conversations.

dlkf · 2024-01-14T16:34:20 1705250060

It’s debatable to what degree ”attention” in LLMs relates to ”attention” in psychology. See Cosma Shalizi’s note on this http://bactra.org/notebooks/nn-attention-and-transformers.ht...

ben_w · 2024-01-14T17:58:42 1705255122

Careful, depending on who you ask there's 40 different definitions of the term. Any given mind, natural or artificial, may well pass some of these without passing all of them.