This is simplified a bit - It's just a "machine" that maps [set of inputs] -> [set of probabilities of the next output]
First you define a list of tokens - let's say 24 letters because that's easier.
They are a machine that takes an input sequence of tokens, does a deterministic series of matrix operations, and outputs a list of probabilities, one for every token in the vocabulary.
"Learning" is just the process of setting some of the numbers inside the matrices used for some of those operations.
Notice that there's only a single "if" statement in their final code, and it's for evaluating the result's accuracy. All of the "logic" comes from these matrix operations.
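A minimal sketch of that shape in numpy (placeholder random weights, and a mean-pooling stand-in for the real transformer layers - this is my own toy, not the article's code):

    # Toy "machine": sequence of token ids -> a probability for every token.
    import numpy as np

    rng = np.random.default_rng(0)

    VOCAB_SIZE = 24          # the "24 letters" above
    EMBED_DIM = 8

    # "Learning" would mean adjusting the numbers in these matrices;
    # here they are just random placeholders.
    embedding = rng.normal(size=(VOCAB_SIZE, EMBED_DIM))
    output_proj = rng.normal(size=(EMBED_DIM, VOCAB_SIZE))

    def next_token_probs(token_ids):
        """Deterministic matrix ops: ids -> vectors -> scores -> probabilities."""
        vectors = embedding[token_ids]           # look up a vector per token
        pooled = vectors.mean(axis=0)            # stand-in for the transformer layers
        scores = pooled @ output_proj            # one score per vocabulary token
        exp = np.exp(scores - scores.max())      # softmax turns scores into
        return exp / exp.sum()                   # probabilities that sum to 1

    probs = next_token_probs(np.array([3, 17, 5]))
    print(probs.shape, probs.sum())              # (24,) 1.0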
It's kind of hard to interpret these things as "automata" in the sense that one might usually think of them.
Everything is usually a little fuzzy in a neural network. There's rarely anything like an if/else statement, although (as in the transformer example) you have some cases of "masking" values with 0 or -∞. The output is almost always fuzzy as well, being a collection of scores or probabilities. For example, a model that distinguishes cat pictures and dog pictures might emit a result like "dog:0.95 cat:0.05", and we say that it predicted a dog because the dog score is higher than the cat score.
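A toy illustration of those two non-fuzzy bits - the -∞ masking and "the higher score wins" - with made-up numbers:

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    scores = np.array([2.9, -0.05])          # raw scores for ["dog", "cat"]
    probs = softmax(scores)                  # roughly [0.95, 0.05]
    labels = ["dog", "cat"]
    print(labels[int(np.argmax(probs))])     # "dog": the higher score wins

    # Masking: setting a score to -inf forces its probability to exactly 0.
    masked = np.array([1.0, -np.inf, 0.5])
    print(softmax(masked))                   # middle entry becomes 0.0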
In fact, the core of the transformer, the attention mechanism, is based on a kind of "soft lookup" operation. In a non-fuzzy system, you might want to do something like loop through each token in the sequence, check if that token is relevant to the current token, and take some action if it's relevant. But in a transformer, relevance is not a binary decision. Instead, the attention mechanism computes a continuous relevance score between each pair of tokens in the sequence, and uses those scores to take further action.
But some things are not easily generalized directly from a system based on binary decisions. For example, those relevance scores are used as weights to compute a weighted average over the tokens in the sequence, and thereby obtain an "average token" for the current position. I don't think there's an easy way to interpret this as an extension of some process based on branching logic.
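Roughly, the soft lookup boils down to this (single head, no learned projections or masking - just the pairwise relevance scores and the weighted average of value vectors over the sequence):

    import numpy as np

    def soft_lookup(Q, K, V):
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)                # relevance of every pair of tokens
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights = weights / weights.sum(axis=-1, keepdims=True)   # each row sums to 1
        return weights @ V                           # weighted average of value vectors

    rng = np.random.default_rng(1)
    seq_len, d = 5, 8
    X = rng.normal(size=(seq_len, d))                # one vector per token in the sequence
    out = soft_lookup(X, X, X)                       # "self"-attention: Q = K = V = X
    print(out.shape)                                 # (5, 8): an averaged vector per position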
Jürgen Schmidhuber reminds me of Richard Feynman in one very specific way: while everybody else in their respective fields uses math to hide the deep insights they've been mining for publications, Schmidhuber and Feynman just simply tell you the big insight, and then proceed to refine and illuminate it with math.
Neural networks are Turing machines. You can make them perform any computation by carefully setting up their weights. It would be nice to have compilers for them that were not based on approximation, though.
People typically set the weights of a neural network using heuristic approximation algorithms, by looking at a large set of example inputs/outputs and trying to find weights that perform the needed computation as accurately as possible. This approximation process is called training. But this approximation happens because nobody really knows how to set the weights otherwise. It would be nice if we had "compilers" for neural networks, where you write an algorithm in a programming language, and you get a neural network (architecture+weights) that performs the same computation.
TFA is a beautiful step in that direction. What I want is an automated way to do this, without having to hire vgel every time.
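In the small you can already do that "compilation" by hand. A toy example (mine, not from TFA): a one-hidden-layer ReLU network hard-wired to compute XOR exactly, no training involved:

    import numpy as np

    def relu(x):
        return np.maximum(x, 0)

    # Hand-set weights: h1 = a + b, h2 = relu(a + b - 1) = (a AND b)
    W1 = np.array([[1.0, 1.0],
                   [1.0, 1.0]])
    b1 = np.array([0.0, -1.0])
    W2 = np.array([1.0, -2.0])     # output: h1 - 2*h2 = (a + b) - 2*(a AND b) = XOR

    def xor_net(a, b):
        h = relu(np.array([a, b]) @ W1 + b1)
        return float(h @ W2)

    for a in (0, 1):
        for b in (0, 1):
            print(a, b, "->", xor_net(a, b))   # 0, 1, 1, 0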
A Turing-complete system isn't necessarily useful; it just means it's equivalent to a Turing machine. The ability to describe any possible algorithm is not that powerful in itself.
As an example, algebraic type systems are often TC simply because general recursion is allowed.
Feed-forward networks are effectively DAGs, and while you may be able to express arbitrary algorithms using them, they are also piecewise linear with respect to their inputs.
Statistical learning is powerful at finding and matching patterns, but graph rewriting, which is what you're doing with initial random weights and training, is not trivial.
More importantly it doesn't make issues like the halting problem decidable.
I don't see why the same limits that were explored in graph-rewriting languages in the 90s won't also apply when using feed-forward networks as computation systems, short of applying nation-state-scale computing power.
The point of training is to create computer programs through optimization, because there are many problems (like understanding language) that we just don't know how to write programs to do.
It's not that we don't know how to set the weights - neural networks are only designed with weights because it makes them easy to optimize.
There is no reason to use them if you plan to write your own code for them. You won't be able to do anything that you couldn't do in a normal programming language, because what makes NNs special is the training process.
Why would you do that, when it's better to do the opposite: given a model, quantize it and compile it to direct code objects that do the same thing much, much faster?
The generality of the approach [NNs] implies that they are effectively a union of all programs that may be represented, and as such they need the capacity for that. That capacity comes in the form of size, which makes them wasteful for exact solutions.
It is fairly trivial to create FFNNs that behave as decision trees using just ReLUs, if you can encode your problem as a continuous problem with a finite set of inputs. Then you can very well say that this decision tree is, well, a program, and there you have it.
The actual problem is the encoding, which is why NNs are so powerful: they learn the encodings themselves through gradient descent and its variants.
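For what it's worth, here's a tiny version of that construction (my own sketch, not a general recipe): a two-split decision tree written entirely as linear combinations and ReLUs, with the hard thresholds approximated by steep ramps, so it could be flattened into an ordinary feed-forward net:

    import numpy as np

    def relu(z):
        return np.maximum(z, 0.0)

    K = 1000.0  # steepness: larger -> closer to a true step function

    def gt(value, threshold):
        """Soft indicator of value > threshold, built from two ReLUs."""
        z = K * (value - threshold)
        return relu(z) - relu(z - 1.0)          # ramps from 0 to 1 over width 1/K

    def tree_net(x, y):
        """Decision tree:  x > 2 ? (y > 1 ? 3.0 : 1.0) : 0.0"""
        ix = gt(x, 2.0)                          # "x > 2" branch indicator
        iy = gt(y, 1.0)                          # "y > 1" branch indicator
        both = relu(ix + iy - 1.0)               # AND of the two indicators
        # Leaves selected by (approximately) 0/1 indicators:
        return 3.0 * both + 1.0 * relu(ix - iy) + 0.0 * (1.0 - ix)

    print(tree_net(3.0, 2.0))   # 3.0   (x > 2 and y > 1)
    print(tree_net(3.0, 0.0))   # 1.0   (x > 2, y <= 1)
    print(tree_net(1.0, 5.0))   # 0.0   (x <= 2)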
That makes no entropy or cybernetic sense at all. You would just get a neural network that outputs the exact formula or algorithm. Like, if you just did a sine, it would be a Taylor series encoded into neurons.
It's like going from computing pi as a constant to computing it as a gigantic float.
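A loose illustration of that point (toy example, not an actual network): the "compiled" version of sine is just the same Taylor polynomial stored as a bank of fixed coefficients, versus the one-line exact algorithm:

    import math
    import numpy as np

    # Coefficients of sin(x) ~= sum_n (-1)^n x^(2n+1) / (2n+1)!  -- these are
    # the fixed numbers a "compiled" network would end up encoding.
    N_TERMS = 8
    coeffs = np.array([(-1) ** n / math.factorial(2 * n + 1) for n in range(N_TERMS)])

    def sin_as_weights(x):
        powers = np.array([x ** (2 * n + 1) for n in range(N_TERMS)])  # fixed "features"
        return float(coeffs @ powers)                                   # fixed "weights"

    x = 1.2
    print(math.sin(x))        # the constant-sized exact algorithm
    print(sin_as_weights(x))  # the same thing, unrolled into stored numbers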
The whole point is that we don’t know what rules to encode and we want the system to derive the weights for itself as part of the training process. We have compilers that can do deterministic code - that’s the normal approach.
Would I be able to solve the Travelling Salesman Problem with a Transformer with appropriately assigned weights? That would be an achievement - you'd beat some known bounds on the complexity of TSP.
My point was that Transformers and neural networks as they are now are not Turing machines if you don't allow the model to grow with the input size. That said, it has to grow in "depth", not just in parameter count. The fact that people think fixed-depth computation can universally compute everything is a worrying trend.