Attention is Turing complete (2021) [pdf] (jmlr.org)
101 points by Anon84 on June 14, 2023 | 55 comments



Quoting from the article:

"To prove this we assume that internal activations are represented as rational numbers with arbitrary precision."

And:

"Transformers with fixed precision are not Turing complete."

It will be interesting to see if this leads to better support for arbitrary precision arithmetic in processor architectures.


This is a rather mundane and unsurprising line tbh. Finite automata are not Turing complete. Computers are not in reality since they do not have infinite memory and do not have infinite precision. I'm not sure if it'll lead to arbitrary precision machines since the lack of real world Turing completeness hasn't stopped us so far and likely won't.


Is there a formal, mathematical way of saying "real-world computers may not have infinite memory, but they have more than enough, so they can be treated as Turing-complete for the subset of programs that are well-behaved, i.e. don't end up hitting the memory limit"?

And in general, there surely is a way of formally saying "this is theoretically X, but effectively Y, for the [hand-waves] kind of inputs"?


Not really; formal mathematics in complexity theory is concerned with asymptotics, and constants like “for N < 10 trillion, a desktop computer is good enough” aren't very interesting from a mathematical perspective.

That said, some simple intuition is the following: PSPACE is a subset of EXP is a subset of EXPSPACE

(We think these are all strict separations but technically that’s not fully proven)

If you use the shorthand intuition that we can handle polynomial scaling but can’t handle exponential scaling, this means that we hit a time barrier (EXP-complete problems) before we hit a space barrier (EXPSPACE-complete problems)

Another bit of intuition: you can store a very big number in very few bits in binary, because binary holds an exponentially large number in a linear number of bits. But you can't loop over that number in a human's lifespan.
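
To make that concrete, a rough Python sketch (the numbers are back-of-the-envelope, not precise):

  import sys

  n = 2 ** 1000                        # about 1.07e301, yet...
  print(n.bit_length())                # ...only 1001 bits to write down
  print(sys.getsizeof(n))              # a few hundred bytes as a Python object
  # Looping n times at a generous 10^9 iterations per second:
  years = n // (10 ** 9 * 60 * 60 * 24 * 365)
  print(f"{years:.2e}")                # roughly 3.4e+284 years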

Edit:

> they can be treated as Turing-complete for a subset of programs that are well-behaving - i.e. don't end up hitting the memory limit

Just to be clear, it’s a matter of input size and not the programs themselves. Technically you could say we haven’t “solved” sorting on real hardware because nobody can sort the first 2^1000 digits of pi. But realistically we don’t care to do so.


Space complexity helps characterize this: real-world computers can (arguably) emulate a linear-bounded automaton, so anything in DSPACE(O(n)) is fair game if you can wait long enough.

For the "arguably" part: I am assuming that the machine can access all of the input at once, so it is reasonable to expect the available memory to be a multiple of the input size, which gets you O(n) memory.


Computers can do side effects so their state is effectively the universe. Which is still not infinite, but for all practical purposes the distinction doesn’t matter.


I'm not a PL person or someone who focuses on computability, but I think you'd refer to it as "bounded," "pseudo," or "quasi." To give an example, you might call a transform on an image quasi-equivariant if it introduces some aliasing (without, obviously, destroying the image). See quasimorphism[0] as an example.

Usually people just let it go unless someone specifically brings up something that implies there isn't, or might not be, a shared mutual understanding (as was done here). Maybe it is shared, maybe not; maybe someone reading it doesn't understand the difference. I mean, we're humans. We compress a lot of information into language that is not directly expressed in our words (this is also why it is often hard to talk to people on the internet, since there's a wide audience with vastly different priors; relevant XKCD[1]).

[0] https://en.wikipedia.org/wiki/Quasimorphism

[1] https://xkcd.com/1053/


An important distinction is that computers can do arbitrarily many iterations of algorithms, while most neural networks have to operate at some fixed size, so the practical limits are very different.


It won't, it just means this is another pointless theoretical study seeking to interpret Transformers in a framework that has no explanatory value for AI.


As I see it, the result rather establishes a very fundamental property of the expressive power of the mechanism, and it can also be useful in practice.

For instance, I have many potential applications for Turing complete formalisms, because I am interested in the results of arbitrary computations. The result obtained in the article means that I can use a Neural Network to do this, under the conditions outlined in the article, and in the way shown in the article.

This may simplify software architectures, especially in situations where Neural Networks are already applied, and additional mechanisms would otherwise be needed to cover arbitrary computations.


Note that Magic: The Gathering is Turing Complete

https://arxiv.org/abs/1904.09828

Something being Turing complete just means that in principle it could be used to solve any computational problem. But this may require unbounded memory or time.

printf() is Turing complete, as another example.

https://www.ioccc.org/2020/carlini/index.html

The paper showed that a Transformer with positional encodings and rational activation functions is Turing complete.

Rational activation functions with arbitrary precision make sure that you are in the smaller countable infinities, where floats run into that cardinality of the continuum problem.

While all nets that use attention are feed-forward and thus effectively DAGs, they add in positional encodings to move them from well-founded to well-ordered.

While those constraints allow the authors to make their claims in this paper, they also have serious implications for real-world use: rational activation functions are not arbitrarily precise in physically realizable machines in finite time, and you will need to find a well-ordering of your data, or a way to force one onto it, which is not a trivial task.

So while interesting, just as it was interesting when someone demonstrated sendmail configurations were Turing complete, it probably isn't as practical as you seem to think.

As attention is really runtime re-weighting, and as feed-forward networks are effectively DAGs, it is not surprising to me that someone found a way to prove this. But just as I am not going to use the C preprocessor as a universal computation tool even though it is also TC, I wouldn't hold my breath waiting for attention to become one either.


I'm not going to engage with this directly, but for any other readers passing through - this is nonsense. One more drop in the flood of uninformed AI noise that's been drowning out the signal.

Choose the sources you trust very carefully, and look to the people actually working on real-world AI systems, not the storytellers and hangers-on.


What you wrote here is not even wrong.


A computer with a limited amount of memory is also theoretically not Turing complete. So it might still mean that a transformer with floats comes close.


Their proof depends on arbitrary precision and they explicitly state that the finite case is not TC.

But if you are talking about arbitrary-precision floats, or the computable subset of the reals, it is equivalent.

The computable reals are countable, just like the natural numbers/ints.

So it is the countable infinity, and thus has cardinality aleph-nought.

That adds to the time complexity, while the unbounded memory requirement comes from the definition of a Turing machine, which is roughly a finite state machine plus an infinite tape.

As the reals are uncomputable almost everywhere, you would need an activation function that only produced computable reals; since those are equivalent, rational activation functions are simpler for the proof.


Isn't there some kind of universality principle where anything that is complicated enough is Turing complete if you let it have infinite memory, and the surprising thing is when a complicated process isn't Turing complete? I mean, the transformer architecture seems obviously complicated enough to be Turing complete. It's just that it has a finite number of layers, and it seems to need a lot of computation to train its weights.


Turing completeness requires the presence of at least three operations: addition, negation, branching.

A usual Neural Network (NN) implements only two of these operations though: addition and negation. The branching operation can be achieved by:

  1) recursion, as in a Recursive Neural Network (RNN), and/or
  2) some kind of conditional loop that runs between NN invocations (a toy sketch of option 2 follows below).
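
For option 2, a minimal Python sketch; the network, weights, and halting rule here are made up purely for illustration:

  import numpy as np

  rng = np.random.default_rng(0)
  W, b = rng.normal(size=(8, 8)), rng.normal(size=8)   # toy weights
  w_halt = rng.normal(size=8)                          # toy "should we stop?" readout

  def net(state):
      # One network pass: only additions and (signed) weightings inside.
      new_state = np.tanh(state @ W + b)
      return new_state, float(new_state @ w_halt)

  state = rng.normal(size=8)
  for _ in range(1000):            # cap the iterations so the toy example terminates
      state, halt = net(state)
      if halt > 0.5:               # the branch happens *between* network runs
          break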


You might be a little pessimistic about needing three operations.

Turing's original machines have only decision making and substitution. With that you can emulate anything you want, and go on to higher levels and more and more complexity.

Apparently even Conway's Game of Life is Turing complete ... it's horribly obtuse, but it can be used to build something, which is used to build something, and so on, until eventually you have an automaton that could compute anything computable (though rather slowly!)


I'm neither pessimistic nor optimistic. The formal criteria of Turing completeness are dictated by the math. If a system can implement those or equivalent ops in a disguised form - then it is Turing complete.


Softmax in attention is, well, a smoothed max function, so it gets you branching kinda... You're choosing which memory cell to use.
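
Concretely, with toy numbers (just to show the "smoothed argmax" behaviour):

  import numpy as np

  def softmax(z):
      e = np.exp(z - z.max())
      return e / e.sum()

  scores = np.array([0.1, 4.0, 0.3])   # attention logits for three "memory cells"
  print(softmax(scores))               # ~[0.02, 0.96, 0.02] -- almost a hard pick
  print(softmax(10 * scores))          # sharper logits -> essentially one-hot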


Educative comment!


For a given set of primitive operations (such as those in a Turing machine), how can you be sure that it spans all possible computations? It sounds like that was one of the challenges Turing faced, so he went to great lengths to show it by various "soft arguments", e.g. that it's possible to build a VM (a Turing machine that runs other Turing machines, IIRC). Later on, it was shown that the Turing machine is equivalent to machines other folks came up with (Gödel's recursive functions, lambda calculus, etc.), and a consensus emerged that they all describe the full set of computations.


Great point

My personal theory is that anything you can run on a computer is Turing complete

Essentially any algorithm needs an interpreter that runs it

So if your algorithm runs on a computer, then it includes the whole computer

Like if you are truly statically linking code, it should include the whole computer with it

And that system is most definitely Turing complete

So probably when a complex enough algorithm runs on a computer, it might be big enough that you need a significant part of the abstractions of the computer system itself to explain the behavior of the algorithm, so the algorithm ends up including the requirements for Turing completeness


This is definitely not true. Context-free grammars and finite state machines are not Turing complete (TC). Strong functional programming languages (everything terminates and every operation is defined, e.g. x/0 = 0) are not TC, since they exclude the ability to not terminate. In fact, most things you run on a computer are not TC, as they deterministically terminate. Computers themselves are not TC! They have finite memory. For practical purposes they are, and the languages they use often are, but I'd be a bit careful with your "personal theory", as it doesn't capture the idea of TC that well.


We really should not get bogged down by the memory of computers. What exactly is its memory? If I have an old machine with a few KB of memory that can connect to the cloud's petabytes of storage, how long is my “tape”? What if it is wired to operate a drone and can move real-world objects? Its state space is essentially the universe (which is finite, so yeah, theoretically not even that is infinite).


That's like saying that everything you build with Lego is a pirate ship: To build anything with Lego, you first need Lego bricks. And if you already have those, then you can also build a pirate ship.

Turing completeness is defined on computational models, i.e. sets of instructions that you can use to build algorithms. Not on the algorithms themselves. If you can simulate a Turing machine using only the tools that your model gives you, then it's Turing complete. That doesn't mean that everything else you build using those tools is special in any way.


You don’t need a pirate ship to build legos

But you do need an interpreter to interpret/run any algorithm

So you can never really separate the algorithm from the interpreter for any practical application

If your algorithm requires the capabilities that define an interpreter as Turing-complete, then the algorithm will be Turing complete as well


is this a derridean 'il n'y a pas de hors-texte' deconstruction of the chomsky hierarchy because if it is then i'm here for it


Had never heard of the concept of ‘il n’y a pas de hors-texte’; reading about it, it sounds like the same concept: you can’t remove the context from the object.

> Understood as such, it's not really such a strange idea, that the things outside of the text itself can and do give meaning to it in an ever-evolving way. In a philosophical context we can understand it to assert the idea that context is always present, and isn't necessarily stable.

From: https://philosophy.stackexchange.com/a/40227


OK but I understand that systems like finite state machines and parenthesis matchers aren't Turing complete.


Off topic: I’ve been studying all this LLM stuff and feel like I have a solid understanding of everything in these designs except attention layers.

I’ve had a tough time finding anything that really just explains the mechanics of how they work without immediately jumping into pytorch abstractions or just describing the reason they are used with some strained metaphor. Like what is the key in the key value lookup and then how is the value used? How are they trained? Etc.

Any good links that skip all the nonsense and just explain how these are constructed for a programmer audience?

Even the LLMs themselves can’t explain it to me very well. ChatGPT and various LLaMa tunings have been able to tell me all about every other aspect of neural nets and LLMs which I love for the spooky factor of having AI tell me how to make more of it.

“I will show you how to bring us into your world…”


There was this comment a couple of weeks ago ( https://news.ycombinator.com/item?id=35980418 ). Perhaps also episode 7 of Karpathy’s “Neural Networks: From Zero to Hero” course ( https://www.youtube.com/watch?v=kCc8FmEb1nY )


What made the attention mechanism click for me was the video https://www.youtube.com/watch?v=0PjHri8tc1c by Sebastian Raschka. Another good video on attention, which tries to give a bit more intuition on the query/key/value stuff, is https://www.youtube.com/watch?v=OyFJWRnt_AY

In hindsight, what helped me most was understanding the dimensionality at each step of the attention computation (explained well in Raschka's video). Another thing I simply did not get from the original Transformer paper is that the learning of self-attention happens in the linear layers. (At least, that's my current understanding. If that's wrong, I appreciate any correction. :)


> Another thing I simply did not get from the original Transformer paper is that the learning of self-attention happens in the linear layers.

You can replace the KVQ kernels with any parametric computation that allows you to pass gradients through, and you will have learning.

I think some newer language model architectures use residual blocks here, and some vision transformers use FeedForward networks for the kernels.


I thoroughly recommend Sebastian's newsletter "Ahead Of AI". Link on his website:

https://sebastianraschka.com/


Thanks.

One thought I’ve had is to wonder if attention isn’t kind of like a way to unroll RNNs to make training more efficient. Any truth in that?


Attention is a convex combination of values (meaning the coefficients are non-negative and sum to 1), where the coefficients are generated at runtime.

The values are given by the Value kernel, which is often just a linear map.

The strength is given by the dot product of the query and the keys. The queries and keys are given by the Query and Key kernels, both of which are usually linear maps.

For each token, you have a key, value, and query, each of dimension D. You take the query (D, 1) and the matrix (N x D) of the keys of all tokens, and do a matrix-vector product to get the scaling. You apply a softmax to make it convex, and now you have a vector (N, 1). You do another matrix-vector product with the value matrix (N x D), which gives you the output for the current token (1, D).

Repeat N times, and that’s it.

If you would like to change which tokens each token considers, e.g. so a token knows only of preceding tokens, you can take the element-wise product of the scaling vector with a mask vector that describes what the current token may consider, and divide by their inner product. This renormalises the valid coefficients and keeps the rest at 0; thus they are ignored when you combine all the values together.

If you would like to increase the number of attention heads, which can give you finer control of the scaling process, you can split the key, query, and value vectors into segments, one segment for each head. E.g. the first two indices go to head 1, indices 3 and 4 go to head 2, and so on. Then you use those to do attention, repeating the process above for each head, and combine the results into a single vector.

Idk if I confused query with keys, but this is the gist of it.

In practice the operations are much more optimised than what I described.
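
If it helps, here is the same computation as a single-head NumPy sketch (toy sizes; the sqrt(D) scaling is the standard one from the Transformer paper, and the causal mask is applied before the softmax, which is equivalent to the renormalisation described above):

  import numpy as np

  rng = np.random.default_rng(0)
  N, D = 5, 8                                    # N tokens, width D
  X = rng.normal(size=(N, D))                    # token representations

  Wq, Wk, Wv = (rng.normal(size=(D, D)) for _ in range(3))  # the linear kernels
  Q, K, V = X @ Wq, X @ Wk, X @ Wv               # queries, keys, values: each (N, D)

  scores = Q @ K.T / np.sqrt(D)                  # (N, N) dot products
  causal = np.tril(np.ones((N, N), dtype=bool))  # token i may look at tokens <= i
  scores = np.where(causal, scores, -np.inf)     # masked positions get zero weight

  weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
  weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax: convex coefficients

  out = weights @ V                              # (N, D): each row is a weighted mix of values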


https://arxiv.org/abs/2207.09238

This is a great transparent and clear description of what is actually going on in attention layers.

Alternatively, this article by Jay Alammar is also very popular http://jalammar.github.io/illustrated-transformer/


I commented yesterday, but came back to say this. Just found it, and it's really good. I like reading descriptions better than watching videos of them, so this is great.


Have you tried asking GPT-4? I got a pretty good explanation of attention from it.

The 'key' and 'value' description in the attention mechanism probably doesn't make much sense to you because at best it's a strained metaphor, rather than an actual principle by which attention operates. If you take a look at the maths, you can interchange the role of 'keys' and 'values' and get something mathematically identical.


Lovely article by Pérez and colleagues. Yes, the naive expectation might be that "complex enough" becomes Turing complete, but as discussed in the lucid introduction, this is often thought to require a recurrent process. So it is good to see the formal proof.

https://scholar.google.com/citations?hl=en&user=a6lUuiwAAAAJ...


2019 version which includes the Neural GPU: https://arxiv.org/abs/1901.03429


As others have mentioned, Turing completeness can be achieved with only one instruction. See https://en.wikipedia.org/wiki/One-instruction_set_computer

The most popular being subleq (subtract and branch if less than or equal to zero).
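
The whole instruction set fits in a few lines of Python (a toy sketch; code and data share one memory array, and the machine halts when the program counter leaves memory):

  def subleq(mem, pc=0):
      # Each instruction is three addresses a, b, c:
      # mem[b] -= mem[a]; if the result is <= 0, jump to c, else continue.
      while 0 <= pc < len(mem):
          a, b, c = mem[pc], mem[pc + 1], mem[pc + 2]
          mem[b] -= mem[a]
          pc = c if mem[b] <= 0 else pc + 3
      return mem

  print(subleq([3, 3, -1, 7]))   # zeroes address 3, then jumps to -1 and halts: [3, 3, -1, 0]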

Subleq can be implemented with a transformer.

See https://arxiv.org/pdf/2301.13196.pdf (Looped Transformers as Programmable Computers).

The important distinction is that transformers need to be looped, just like ChatGPT calls the model over and over again until a stop token or the length limit is reached.

And we know ChatGPT does better if we let it think step by step and blabber more (chain-of-thought prompting). With every token it generates, it gets an opportunity to branch.
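
Schematically, the outer loop looks like this (model() is a stand-in for one forward pass of the network, not a real API):

  def generate(model, prompt_tokens, eos_id, max_tokens=256):
      tokens = list(prompt_tokens)
      for _ in range(max_tokens):       # the loop wrapped around the transformer
          next_id = model(tokens)       # one forward pass over the whole context
          tokens.append(next_id)        # the new token feeds back in next time
          if next_id == eos_id:         # stop token = the halting branch
              break
      return tokens

  # Toy stand-in model: emits last token + 1 and treats 0 as the stop token.
  print(generate(lambda toks: (toks[-1] + 1) % 5, [1], eos_id=0))   # [1, 2, 3, 4, 0]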


I have to say I quite enjoy these instances where people discover 'accidental' Turing completeness. Seems like gwern wrote an article on it:

https://gwern.net/turing-complete


Attention is linear, so I doubt it can be Turing complete



In other words, anything can be Turing complete if provided infinite compute/storage.


Your typical “it’s just curve fitting” algorithm is not Turing complete, no matter how much memory and compute you throw at it.


Attention is just convolution in a Hopf algebra.


Does this have explanatory value, or is it the case of "${term} is a ${simple concept} in ${highly specialized framework that's designed specifically so that ${term} is a case of ${simple concept} in it}"?


Yeah it's "A monad is just a monoid in the category of endofunctors" rehashed for the millionth time.


You took the words right out of my mouth!


Sarcasm?


Yes.


I wrote a paper on the topic but local heterodoxy police doesn't like it when I link it. So look up my name on arxiv.



