A related line of work is "Thinking Like Transformers" [1]. They introduce a primitive programming language, RASP, which is composed of operations capable of being modeled with transformer components, and demonstrate how different programs can be written with it, e.g. histograms and sorting. Sasha Rush and Gail Weiss have an excellent blog post on it as well [2]. Follow-on work demonstrated how RASP-like programs can actually be compiled into model weights without training [3].
Huge fan of RASP et al. If you enjoy this space, it might be fun to take a glance at some of my work on HandCrafted Transformers [1], wherein I hand-pick the weights in a transformer model to do longhand addition similar to how humans learn to do it in grade school.
I thought I understood transformers well, even though I had never implemented them. Then one day I implemented them, and they didn't work or train nearly as well as the standard PyTorch transformer.
I eventually realized that I had ignored the dropout, because I thought my data could never overfit. (I trained the transformer to add numbers, and I never showed it the same pair twice.) Turns out dropout has a much bigger role than I had realized.
TLDR, just go and implement a transformer.
The more from scratch the better.
Everyone I know who tried it, ended up learning something they hadn't expected.
From how training is parallelized over tokens (sketched below) down to how backprop really works.
It's different for every person.
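To make the parallelization point concrete, here's a rough sketch of the target-shifting half of it (PyTorch assumed; the random logits stand in for a real model's output):

    import torch
    import torch.nn.functional as F

    batch, seq_len, vocab = 8, 65, 100
    tokens = torch.randint(0, vocab, (batch, seq_len))   # (batch, seq_len) token ids
    inputs, targets = tokens[:, :-1], tokens[:, 1:]      # targets are inputs shifted by one

    # A causal mask inside the model keeps position t from peeking at t+1..,
    # so model(inputs) can score every next-token prediction in one forward pass.
    logits = torch.randn(batch, seq_len - 1, vocab)      # stand-in for model(inputs)
    loss = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))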
Andrej Karpathy has a series on YouTube where he builds everything up from scratch. He implements a transformer when he builds a GPT-style model: https://www.youtube.com/watch?v=kCc8FmEb1nY
There are a lot of "simple transformer" implementations on GitHub. But I'll recommend Microsoft's new Phi 1.5: https://huggingface.co/microsoft/phi-1_5/blob/main/modeling_...
It's well written, and very modern, including rotary embeddings and a kv-cache for inference.
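If the kv-cache part is unfamiliar: during generation the keys and values of past tokens never change, so you compute them once and keep appending. A rough sketch of the idea (numpy; not Phi's actual code, and the learned projections are omitted):

    import numpy as np

    def attend(q, K, V):
        # q: (d,) query for the new token; K, V: (t, d) cached keys/values so far
        scores = K @ q / np.sqrt(q.shape[0])
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ V

    K_cache, V_cache = [], []
    for step in range(5):
        x = np.random.randn(16)               # hidden state of the token just generated
        q, k, v = x, x, x                     # real models apply learned projections here
        K_cache.append(k); V_cache.append(v)  # only the new token's k, v get computed
        out = attend(q, np.stack(K_cache), np.stack(V_cache))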
I've been kicking around a similar idea for a while. Why can't we have an intuitive interface to the weights of a model that a domain expert can tweak by hand to accelerate training? For example, in a vision model, they could increase the "orangeness" collection of weights when detecting traffic cones. That way, instead of requiring thousands/millions more examples to calibrate "orangeness" right, it's accelerated by a human expert. The difficulty is obviously having this interface map to the collections of weights that mean different things, but is there a technical reason this can't be done?
Great example. Right, the deep learning approach uncovers all kinds of hidden features and relationships automatically that a team of humans might miss.
I guess I'm thinking about this problem from the perspective of these GPT models requiring more training data than a normal person can acquire. Currently, it seems you need the entire internet's worth of training data (and a lot of money) to get something that can communicate reasonably well. But most people can communicate reasonably well, so it would be cool if that basic communication knowledge could somehow be used to accelerate training and minimize the reliance on training data.
I am still learning transformers, but I believe part of the issue may be that the weights do not necessarily correlate to things like "orangeness"
Instead of a dedicated weight for each color, you have something like 5 to 100 weights that represent some arbitrary combination of colors. The arbitrariness is literally defined by the dataset and the number of weights allocated.
They may even represent more than just color.
So I am not sure if a weight is actually a "dial" like you are describing it, where you can turn up or down different qualities. I think the relationship between weights and features is relatively chaotic.
Like you may increase orangeness but decrease "cone shapedness" or accidentally make it identify deer as trees or something, all by just changing 1 value on 1 weight
It is possible that the parameters, like weights in a machine learning model, interact to yield outcomes in a manner analogous to the interactions between genes in biological systems, which produce traits. These interactions involve complex interdependencies, so there really aren't 1 to 1 dials.
> the deep learning approach uncovers all kinds of hidden features and relationships automatically that a team of humans might miss
Sitting in a lecture by a decent deep-learning practitioner, I heard two questions from the audience (among others). The first was: "How can we check the results using other models, so that computers will catch the errors that humans miss?"
The second question was more like "when a model is built across a non-trivial input space, the features and classes that come out are one set of possibilities, but there are many more possibilities. How can we discover more about the model that is built, knowing that there are inherent epistemological conflicts in any model?"
I also thought it was interesting that the two questioners were from large but very different demographic groups, and at different stages of learning and practice (the second question was from a senior coder).
The reason you're looking for is called "The Bitter Lesson".
The short version is, trying to give human assistance to AIs is almost always less cost-effective than making them run on more computing power.
By the time your human expert has calibrated your weight layers to detect orange traffic cones, your GPU cluster has trained its AI to detect traffic cones, traffic lights, trees, other cars, traffic cones with a slightly different shade of orange, etc.
"The bitter lesson of machine learning is that building knowledge into agents does not work in the long run, and breakthrough progress comes from scaling computation by search and learning.02 This applies to domains where domain knowledge is weak or hard to express mathematically. The rapid progress of ML applied to LQCD, mol.2 dyn., protein folding, and computer graphics is the result of combining domain knowledge with ML."
This passage says that the advantages of scaling a certain kind of learning are especially good under those two conditions, but a side effect of that statement is: when the knowledge is well known, and maybe straightforward to express, these kinds of learning systems aren't as great. And that's true.
Without taking on other big topics, I think this "bitter lesson" is unspecific enough to carry some self-serving utility.. just tell the other camps to give up, you lost. That sort of thing.
The number of layers and weights is really not at a scale we can handle updating manually, and even if we could, the downstream effects of modifying weights are way too hard to manage. Say you are updating the model to be better at orange: unless you can monitor all the other colours for correctness at the same time, you are probably creating issues for other colours without realizing it.
The technical reason it can't be done (or would be very difficult to do) is that weights are typically very uninterpretable. There aren't specific clusters of neurons that map to one concept or another, everything kind of does everything.
I wonder if an expert can "impose" weights onto a model and the model will opt to continue with them when it resumes training. For example, in the vision example, the expert may not know where "orangeness" currently exists, but if they impose their own collection of weight adjustments that represent orangeness, will the model continue to use these weights as the path of least resistance when continuing to optimize? Just spitballing, but if we can't pick out which neurons do what, the alternative would seem to be to encourage the model to adopt a control interface of neurons.
Humans have much more compressed models; trying to transfer human learning to machine learning could potentially be a way to get more efficient models.
The way we do that currently is by labeling data for training, but maybe there are better ways to do it. Like some kind of semi-code written as hints for the model. Or, instead of one big pile of labeled data trained on in parallel, you could have a series of "lectures" of labeled data that lead to a good end state.
You don't teach a child calculus by showing them a million calculus problems, after all; you start with simple numbers and then slowly ramp up to more concepts. But to do that we would need to change how we train models.
Edit: By doing it that way you could see the skill of the model after each lecture, and update the lecture to try to make the model learn better. Not sure exactly how, but such ways of working with parts of models are a potential way forward.
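Here's a rough sketch of the "lectures" idea under my own assumptions (a toy regression task in PyTorch; each stage is one lecture, and you check the model's skill before moving on):

    import torch
    import torch.nn as nn

    # Toy curriculum for learning y = 2x: start with small inputs, then widen the range.
    stages = [torch.linspace(0, 1, 100), torch.linspace(0, 2, 100), torch.linspace(0, 4, 100)]
    model = nn.Linear(1, 1)
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)

    for i, xs in enumerate(stages):
        xs = xs.unsqueeze(1)
        ys = 2 * xs
        for _ in range(500):
            loss = nn.functional.mse_loss(model(xs), ys)
            opt.zero_grad(); loss.backward(); opt.step()
        # "see the skill of the model after each lecture" before adjusting the next one
        print(f"after lecture {i}: loss {loss.item():.4f}")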
There are techniques for this called "curriculum learning" and "textbooks".
I'm not sure exactly what is in the textbooks since I admit to not reading the papers yet.
I'm personally wondering if you could increase the reliability of training on web data by labeling it with where each document came from, so it knows different authors disagree on things. But this brings back the issue where people don't like it if a model can write "in the style of Author Name"…
On the idea of interpreting the weights: I've been very interested in whether it's possible to compute basis vectors of the weight matrices to define the core concepts within the model, and then do a change of basis to reorganize the model around concepts humans understand.
I think the inherent compression of a specific training set into a matrix makes this more difficult, because the basis vectors likely won't contain clean representations of human ideas. But I also wonder whether starting a new training run with an initialized (or fixed) matrix of human-defined concepts would help align the model's weights to something that can be interpreted.
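The closest off-the-shelf tool I know of for that is an SVD of a weight matrix: the singular vectors give an orthogonal basis ordered by how much of the layer's map they carry, though nothing guarantees they line up with human concepts. A hedged sketch (numpy, with a random matrix standing in for a learned one):

    import numpy as np

    W = np.random.randn(512, 512)              # stand-in for a learned weight matrix
    U, S, Vt = np.linalg.svd(W)                # rows of Vt: an orthonormal input basis

    # "Change of basis" view of the layer: project onto the basis, scale, project back out.
    x = np.random.randn(512)
    y_direct = W @ x
    y_basis = U @ (np.diag(S) @ (Vt @ x))
    print(np.allclose(y_direct, y_basis))      # True: same map, different coordinates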
The attention mechanisms present in transformers don’t seem easy to map to semantics that humans can understand. There are too many parameters involved
I always wanted to at least have a shallow understanding of Transformers but the paper was way too technical for me.
This really helped me understand how they work! Or at least I understood your example, it was very clear. And I also got to brush up my matrix stuff from uni lol.
This is simplified a bit - It's just a "machine" that maps [set of inputs] -> [set of probabilities of the next output]
First you define a list of tokens - let's say 24 letters because that's easier.
They are a machine that takes an input sequence of tokens, does a deterministic series of matrix operations, and outputs a list with the probability of every token.
"learning" is just the process of setting some of the numbers inside of a matrix(s) used for some of the operations.
Notice that there's only a single "if" statement in their final code, and it's for evaluating the result's accuracy. All of the "logic" is from the result of these matrix operations.
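Stripped all the way down, that machine can be sketched like so (numpy; the single matrix here stands in for the whole stack of transformer layers):

    import numpy as np

    vocab = list("abcdefghijklmnopqrstuvwx")       # a toy 24-token vocabulary
    V = len(vocab)

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    W = np.random.randn(V, V)                      # "learning" = choosing these numbers

    def next_token_probs(token):
        x = np.zeros(V)
        x[vocab.index(token)] = 1.0                # one-hot encode the input token
        return softmax(W @ x)                      # deterministic matrix ops -> probabilities

    p = next_token_probs("a")
    print(vocab[int(p.argmax())], float(p.sum()))  # most likely next token; probs sum to 1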
It's kind of hard to interpret these things as "automata" in the sense that one might usually think of them.
Everything is usually a little fuzzy in a neural network. There's rarely anything like an if/else statement, although (as in the transformer example) you have some cases of "masking" values with 0 or -∞. The output is almost always fuzzy as well, being a collection of scores or probabilities. For example, a model that distinguishes cat pictures and dog pictures might emit a result like "dog:0.95 cat:0.05", and we say that it predicted a dog because the dog score is higher than the cat score.
In fact, the core of the transformer, the attention mechanism, is based on a kind of "soft lookup" operation. In a non-fuzzy system, you might want to do something like loop through each token in the sequence, check if that token is relevant to the current token, and take some action if it's relevant. But in a transformer, relevance is not a binary decision. Instead, the attention mechanism computes a continuous relevance score between each pair of tokens in the sequence, and uses those scores to take further action.
But some things are not easily generalized directly from a system based on binary decisions. For example, those relevance scores are used as weights to compute a weighted average over the tokens in the sequence, and thereby obtain an "average token" for the current position. I don't think there's an easy way to interpret this as an extension of some process based on branching logic.
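Concretely, the soft lookup is: score every pair of positions, softmax the scores into weights, and take the weighted average of the value vectors. A minimal single-head sketch (numpy; real models add learned projections, multiple heads, and a causal mask):

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    T, d = 5, 16                         # 5 positions, 16-dimensional vectors
    Q, K, V = (np.random.randn(T, d) for _ in range(3))

    scores = Q @ K.T / np.sqrt(d)        # continuous "relevance" of each position to each other
    weights = softmax(scores)            # each row sums to 1 -- no hard yes/no decision
    out = weights @ V                    # each position becomes a weighted average of values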
Jürgen Schmidhuber reminds me of Richard Feynman in one very specific way: while everybody else in their respective fields uses math to hide the deep insights they've been mining for publications, Schmidhuber and Feynman just simply tell you the big insight, and then proceed to refine and illuminate it with math.
Neural networks are Turing machines. You can make them perform any computation by carefully setting up their weights. It would be nice to have compilers for them that were not based on approximation, though.
People typically set the weights of a neural network using heuristic approximation algorithms, by looking at a large set of example inputs/outputs and trying to find weights that perform the needed computation as accurately as possible. This approximation process is called training. But this approximation happens because nobody really knows how to set the weights otherwise. It would be nice if we had "compilers" for neural networks, where you write an algorithm in a programming language, and you get a neural network (architecture+weights) that performs the same computation.
TFA is a beautiful step in that direction. What I want is an automated way to do this, without having to hire vgel every time.
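For intuition on what setting the weights directly means in the simplest case, here's a hedged toy sketch of my own (not TFA's method): a two-layer ReLU net whose weights are written down rather than trained, and which computes XOR exactly.

    import numpy as np

    relu = lambda z: np.maximum(z, 0)

    # Hand-chosen weights, no training involved.
    W1 = np.array([[1., 1.],
                   [1., 1.]])
    b1 = np.array([0., -1.])
    w2 = np.array([1., -2.])

    def xor(x):
        h = relu(W1 @ x + b1)
        return w2 @ h

    for a in (0, 1):
        for b in (0, 1):
            print(a, b, xor(np.array([a, b], dtype=float)))   # prints 0, 1, 1, 0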
A Turing-complete system isn't necessarily a useful one; it just means that it's equivalent to a Turing machine. The ability to describe any possible algorithm is not that powerful in itself.
As an example, algebraic type systems are often TC simply because general recursion is allowed.
Feed-forward networks are effectively DAGs, and while you may be able to express any algorithm using them, they are also piecewise linear with respect to their inputs.
Statistical learning is powerful for finding and matching patterns, but graph rewriting, which is what you're doing with initial random weights and training, is not trivial.
More importantly it doesn't make issues like the halting problem decidable.
I don't see why the same limits on graph rewriting languages that were explored in the 90s won't hit feed-forward networks used as computation systems, outside of the application of nation-state-scale computing power.
The point of training is to create computer programs through optimization, because there are many problems (like understanding language) that we just don't know how to write programs to do.
It's not that we don't know how to set the weights - neural networks are only designed with weights because it makes them easy to optimize.
There is no reason to use them if you plan to write your own code for them. You won't be able to do anything that you couldn't do in a normal programming language, because what makes NNs special is the training process.
Why would you do that when it's better to do the opposite? Given a model, quantize it and compile it to direct code objects that do the same thing much, much faster.
The generality of the approach [NNs] implies that they are effectively a union of all programs that may be represented, and as such there needs to be the capacity for that, this capacity is in size, which makes them wasteful for exact solutions.
it is fairly trivial to create FFNNs that behave as decision trees using just relus if you can encode your problem as a continuous problem with a finite set of inputs. Then you can very well say that this decision tree is, well, a program, and there you have it.
The actual problem is the encoding, which is why NNs are so powerful, that is, they learn the encodings themselves through grad descent and variants.
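A tiny example of that: two ReLUs make a step function "is x > t?", which is a one-node decision tree, and combining a few of them gives tree-like logic (numpy; my own construction, with the input encoded as a continuous value as the parent comment says):

    import numpy as np

    relu = lambda z: np.maximum(z, 0)

    def soft_step(x, t, k=100.0):
        # ~0 below threshold t, ~1 above it; the ramp in between has width 1/k
        return relu(k * (x - t)) - relu(k * (x - t) - 1.0)

    # "decision tree": predict 1 if 0.3 < x < 0.7, else 0
    def in_band(x):
        return soft_step(x, 0.3) - soft_step(x, 0.7)

    for x in (0.1, 0.5, 0.9):
        print(x, round(float(in_band(x)), 3))    # 0.0, 1.0, 0.0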
That makes no entropy or cybernetic sense at all. You would just get a neural network that outputs the exact formula, or algo. Like, if you just did a sine, it would be a Taylor series encoded into neurons.
It's like going from computing pi as a constant to computing it as a gigantic float.
The whole point is that we don’t know what rules to encode and we want the system to derive the weights for itself as part of the training process. We have compilers that can do deterministic code - that’s the normal approach.
Would I be able to solve the Travelling Salesman Problem with a Transformer with the appropriately assigned weights? That would be an achievement. You'd beat some known bounds of the complexity of TSP.
My point was that Transformers and neural networks as they are now are not Turing machines if you don't allow for the model to grow with the input size. That said, it has to grow in "depth" not just parameters. The fact that people think fixed depth computation can universally compute everything is a worrying trend.
> maybe even feel inspired to make your own model by hand as well!
Other than a learning exercise to satisfy your curiosity, what are you doing with this? I'm starting to get the feeling that anything complex with ML models is unreasonable for an at-home blog reader?
In nanoGPT, you pre-train a model on Shakespeare, and in 3 minutes it gets to a Lewis Carroll "Jabberwocky" level of fidelity on the source material. It makes up lots of plausible-seeming old English words, learns the basics of English grammar, the layout of the plays, etc. I was pretty amazed that it got that good in such a short period of time.
I think locally training a bunch of models to the fidelity of Shakespeare-from-Wish.com might tell you when you've hit on a winning architecture, and when to try scaling up.
Author states in the first paragraph of their blog post:
"I've been wanting to understand transformers and attention better for awhile now—I'd read The Illustrated Transformer, but still didn't feel like I had an intuitive understanding of what the various pieces of attention were doing. What's the difference between q and k? And don't even get me started on v!"
Minor request: can we have 'neural network' or something in the title? This is related to the machine learning 'transformer' architecture, rather than the bundle of coils that couples two circuits electromagnetically.
"Electrical transformer without any formal training in electric engineering" is basically how the title read to me. Followed by a lot of confusion when the article was not that.
Good on him, it's just... an odd way to phrase it :)
Heh, I went a step further and thought "ooh, a novice EE project, but titling it like that everyone's going to be confused about it involving machine learning" until I realized I was the one that got it backwards...
Well, one comment at the root was flagged and the thread was deleted, so hopefully I can reply to you here; plus you were starting to reply to another guy thinking he was me.
My answer was: "Gender-affirming care and gender transition (as the results on Google will call it if you accidentally search "sex transition"), none of this changes your sex, most mammals (including humans) have the XY sex determination system, taking estrogens or other hormones doesn't make you lose your Y and gain an extra X, nor does it make you start producing ova, the only difference is how you look and sound, which (hopefully) makes you more comfortable with your gender, and that's why it's called "gender-affirming care" and "gender transition"."
I don't understand the focus on genes that some people seem to be placing. After the gonads are formed the Y chromosome does not really matter (unless you have a particularly nasty mutation). All Y does is carry the SRY gene. If you did magically manage to lose your Y chromosome, would that make you a woman? No! You would stay a man and literally nothing would change in your life. Same goes if you have two X chromosomes and one magically gets replaced with a Y.
I am also worried about the focus on genes given the existence of intersex people. It's especially weird since sex categories existed long before the creation of modern biology and at least for some people the term sex got redefined to exclude certain men and women who often don't even realize it.
The term gender affirming care seems accurate to me, as part of affirming their gender, trans people (who used to be called transsexual) attempt to transition so that their sex matches their gender identity. On the other hand, the term gender transition seems to include both sex changes and social transition.
I should also mention that estrogen does not change how you sound; it does however cause other changes not related to one's appearance.
The XY sex-determination system is based on the fact that the chromosomes one has at birth cannot change naturally throughout their lifetime, so magical or hypothetical scenarios where chromosomes can be altered don't align with our current scientific understanding, but if it makes you happier, sure, in a magic world humans would be able to change their sex.
Also, I don't see how the existence of intersex people should be somehow problematic in this regard; intersex is a congenital condition, like people who are born with six fingers, and that doesn't make the fact that humans have 5 fingers controversial.
To be honest I think I'm kind of stating the obvious and most people (trans people included obv) would agree with me, which is why they use "gender affirming care", "gender transition" and "transgender" (instead of transsexual which is in fact seen as an outdated term). If you don't care about true biological data (plural of datum), but only the opinions of trans people, I can ask my trans friends (one nonbinary and two trans women) what they think about it, even though I actually know them well, and they would agree with me.
Please help me understand this. You are in fact claiming that if somehow someone lost their Y chromosome that would make him a woman despite the lack of any physical or mental changes? I think you have a radically different view of sex compared to most people if so as it is fundamentally incompatible with our social structure.
The claim that most humans have 5 fingers or that most men have XY chromosomes is not controversial. However the claim that you must have 5 fingers to be a human is! This is what the XY sex-determination system is based on, they noticed that most men have XY chromosomes and that most women have XX chromosomes and declared intersex men and women who do not match this model to actually be of the other sex by making an inaccurate map from genes to sex (which is perceived socially and existed before said model).
I believe that the above 2 arguments together are enough to show that it is not an accurate nor a useful sex determination system from a logical perspective.
I think people nowadays avoid the term transsexual because it might be confused with a sexual orientation. I also feel like some people consider the term sex to be vulgar, which is why even those who consider gender=sex often prefer the term gender even when it does not refer to the grammatical gender. Finally, the term transgender is often used by some in order to be inclusive to people who are not medically transitioning.
You are free to show this thread to your friends if you want. I am sure there are trans people in both camps of the argument.
I think I was very clear when I said: "The XY sex-determination system is based on the fact that the chromosomes one has at birth cannot change naturally throughout their lifetime, so magical or hypothetical scenarios where chromosomes can be altered don't align with our current scientific understanding"; if people managed to change their chromosomes I guess the sex-determination system would probably change, but we don't live in that magical world so things are much easier (yeah!)
>However the claim that you must have 5 fingers to be a human is! This is what the XY sex-determination system is based on
No idea what you're talking about ahah, people with a congenital condition are still human and no one said otherwise.
Gender-affirming care, also known as transition (transition what? Their sex!), is done in order to make trans people more comfortable in society, in their inner self (hormones affect mood, behavior, feelings, etc) and in their body by changing their sex. This is why they are prescribed sex hormones as these are what guides gene expression and decide sex-dependent characteristics. While some trans people would be content with just changing their physical appearance most of them seem to actually aim to change their sex.
We are not discussing whether sex is a spectrum or not though.
Was the nonbinary person that you talked with undergoing medical transition? If not this might explain their view.
[1] https://arxiv.org/abs/2106.06981
[2] https://srush.github.io/raspy/
[3] https://arxiv.org/abs/2301.05062