Beyond self-attention: How a small language model predicts the next token (shyam.blog)
474 points by tplrbv 8 months ago | 85 comments



Some of the topics in the parent post should not be a major surprise to anyone who has read https://people.math.harvard.edu/~ctm/home/text/others/shanno... ! If we have not read the foundations of the field we are in, we are doomed to be mystified by unexplained phenomena which arise pretty naturally as consequences of already-distilled work!

That said, on a first, cursory pass the experiments seem very thorough; I appreciate the amount of detail that went into them.

The choice between learning existing theory and attempting to re-derive it from scratch is, I think, a hard tradeoff: not having the traditional foundation allows for the discovery of new things, while having it allows for a deeper understanding of certain phenomena. There is a tradeoff either way.

I've seen several people here in the comments seemingly shocked that a model that maximizes the log likelihood of a sequence given the data somehow does not magically deviate from that behavior when run at inference time. It's a density estimation model; do you want it to magically recite Shakespeare from the void?

Please! Let's stick to the basics, it will help experiments like this make much more sense as there already is a very clear mathematical foundation which clearly explains it (and said emergent phenomena).

If you want more specifics, there are several layers to it; Shannon's treatment of ergodic systems is a good start (the setup here deviates from that slightly, but it is likely a 'close enough' match to what's happening to be properly instructive about the general dynamics of what is going on).
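
For anyone who hasn't gone back to the source, here is a minimal sketch of the kind of n-th order character approximation Shannon describes: generate text purely from conditional frequencies counted in a corpus (the file name and order are arbitrary placeholders):

  # Sketch of a Shannon-style n-gram character model: estimate
  # P(next char | previous n chars) by counting, then sample from it.
  import random
  from collections import Counter, defaultdict

  def build_ngram_counts(text, n=3):
      counts = defaultdict(Counter)
      for i in range(len(text) - n):
          counts[text[i:i + n]][text[i + n]] += 1
      return counts

  def generate(counts, seed, length=200):
      out, n = seed, len(seed)
      for _ in range(length):
          dist = counts.get(out[-n:])
          if not dist:                      # unseen context: stop early
              break
          chars, weights = zip(*dist.items())
          out += random.choices(chars, weights=weights)[0]
      return out

  text = open("corpus.txt", encoding="utf-8").read()   # placeholder corpus
  counts = build_ngram_counts(text, n=3)
  print(generate(counts, seed=text[:3]))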


> the topics in the parent post should not be a major surprise to anyone who has read https://people.math.harvard.edu/~ctm/home/text/others/shanno... !

> which clearly explains it (and said emergent phenomena)

Very smart information theory people have looked at neural networks through the lens of information theory and published famous papers about it years ago. It couldn't explain many things about neural networks, but it was interesting nonetheless.

FWIW it's not uncommon for smart people to say "this mathematical structure looks like this other idea with [+/- some structure]!!" and that it totally explains everything... (kind of with so and so exceptions, well and also this and that and..). Truthfully, we just don't know. And I've never seen theorists in this field actually take the theory and produce something novel or make useful predictions with it. It's all try stuff and see what works, and then retroactively make up some crud on why it worked, if it did work (otherwise brush it under the rug).

There was this one posted recently on transformers being kernel smoothers: https://arxiv.org/abs/1908.11775


I think there is more here than a backward look.

The article introduced a discrete, algorithmic method for approximating the gradient-optimized model.

It would be interesting to optimize the discrete algorithm for both design and inference time, and see if any space or time advantages over gradient learning could be found. Or whether new ideas popped up as a result of optimization successes or failures.

It also might have an advantage in terms of algorithm adjustments. For instance: given the most likely responses at each step, discard the most likely one whenever the runners-up are not too far below it - and see if that reliably avoided copyright issues.
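
A minimal sketch of what such an adjustment could look like as a sampling rule over the next-token distribution (the margin value and the names are invented for illustration):

  # Hypothetical decoding tweak: if the runner-up is close in probability
  # to the top token, drop the top token before sampling, so the single
  # most-memorized continuation can't dominate.
  import random

  def sample_without_dominant(token_probs, margin=0.1):
      # token_probs: dict mapping token -> probability
      ranked = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)
      if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < margin:
          ranked = ranked[1:]               # discard the most likely token
      tokens, weights = zip(*ranked)
      return random.choices(tokens, weights=weights)[0]

  print(sample_without_dominant({"the": 0.40, "a": 0.35, "his": 0.25}))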

A lot easier to poke around a discrete algorithm, with zero uncertainty as to what is happening, vs. vast tensor models.


> It's all try stuff and see what works, and then retroactively make up some crud on why it worked

People have done this in earlier days too. The theory around control systems was developed after PID controllers had been successfully used in practice.


> It's all try stuff and see what works, and then retroactively make up some crud on why it worked, if it did work (otherwise brush it under the rug).

Reminds me of how my ex-client's data scientists would develop ML models.


I appreciate what you're saying, but convergence (via alternative paths, of various depths) is its own signal. Repeated rediscovery perhaps isn't necessarily wastefulness, but affirmation and validation of deep truth for which there are multiple paths of arrival :)


I wish that this worked out in the long run! However, watching the field spin its wheels in the mud over and over with silly pet theories and local results makes it pretty clear that a lot of people are just chasing the butterfly, then after a few years grow disenchanted and sort of just give up.

The bridge comes when people connect concepts to those that are well known and well understood, and that is good. It is all well and good to say in theory that rediscovering things is bad -- it is not, necessarily! But when it becomes Groundhog Day for years on end without significant theoretical change, that is an indicator that something is amiss in how we learn and interpret information in the field.

Of course, this is just my crotchety young opinion coming up on 9 years in the field, so please take it with a grain of salt and all that.


Meanwhile, in economics you have economists arguing that the findings of anthropologists are invalid because they don't understand modern economic theory. It's history that needs to change.


In an adjacent thread, people are talking about the copyright implications of a neural network conforming to its training data within some error margin.

Many textbooks on information theory already call out the content-addressable nature of such networks[1], and they're even used in applications like compression because of this property[2][3], so it's no surprise that when the NYT prompted OpenAI models with a few paragraphs of their articles, the models reproduced them nearly verbatim.

[1] https://www.inference.org.uk/itprnn/book.pdf

[2] https://bellard.org/nncp/

[3] https://pub.towardsai.net/stable-diffusion-based-image-compr...


Yes! This is a consequence of empirical risk minimization via maximum likelihood estimation. To have a model not reproduce the density of the data it trained on would be like trying to get a horse and buggy to work well at speed, "now just without the wheels this time". It would generally not go all that well, I think! :'D
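
To spell out why (a sketch, assuming the simplest tabular/categorical parameterization rather than a neural network, so this is about the objective rather than the architecture): per context, maximizing the log-likelihood simply recovers the empirical next-token frequencies, so reproducing the training density is exactly what the objective asks for:

  \max_{\theta}\; \sum_{(c,t)\in\mathcal{D}} \log\theta_{t\mid c}
  \quad\text{s.t.}\quad \sum_{t}\theta_{t\mid c}=1 \;\;\forall c
  \;\;\Longrightarrow\;\;
  \hat{\theta}_{t\mid c}=\frac{\mathrm{count}(c,t)}{\mathrm{count}(c)}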


Ok, but why didn't Shannon get us GPT?


He was busy getting us towards wifi first.


I get the feeling you may not have read the paper as closely as you could have! Section 8 followed by Section 2 may look a tiny bit different if you consider it from this particular perspective.... ;)


Kudos for plugging Shannon's masterpiece


I had the exact same idea after seeing Google point out that you can[0] get ChatGPT to regurgitate verbatim training data by asking it to repeat the same word over and over again[1]. I'm glad to see someone else actually bring it to fruition.

This, of course, brings two additional questions:

1. Is this "AI, hold the AI" approach more energy-efficient than having gradient descent backpropagation compress a bunch of training data into a model that can then be run on specialized AI coprocessors?

2. Will this result wind up being evidence in the ongoing lawsuits against OpenAI and Stability AI?

[0] Could. OpenAI now blocks generation if you fill the context window with a single word.

[1] https://arxiv.org/abs/2311.17035


This approach cannot possibly be more efficient than running the original model, because it relies on running the original model to get the activations, then searching the text corpus for strings with similar activations, and then computing the next-token statistics from those matches. You don't get to skip many steps, and you end up doing a bunch of extra work.

I'd be surprised if doing this with two completely separate corpora, one for training the model and the other to search for strings with similar activations, didn't lead to much the same results, because the hard part is constructing similar activations for strings with similar next-token statistics in the first place.

Note that in the per-layer weights [0.01, 0.01, 0.1, 1.5, 6, 0.01] the penultimate layer is the most important, where the input has already been transformed a lot. So you can't expect to use this to replace a transformer with a simple grep over the training data. (My guess as to why the penultimate layer has a much higher weight than the final one is that this is due to induction heads https://transformer-circuits.pub/2021/framework/index.html which implement copying repeated strings from the input, with the penultimate layer determining what to look for and the final layer doing the copying.)
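
For concreteness, a rough sketch of the kind of layer-weighted similarity search being discussed here (not the author's actual code; the cosine metric and the top-k aggregation are my assumptions):

  # Sketch: score corpus positions by similarity of their per-layer
  # activations to the query's, weighting layers as in the post, then
  # aggregate the next tokens of the best matches into a distribution.
  import numpy as np
  from collections import Counter

  LAYER_WEIGHTS = np.array([0.01, 0.01, 0.1, 1.5, 6, 0.01])

  def cosine(a, b):
      return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

  def weighted_similarity(query_acts, cand_acts):
      # each argument: one activation vector per layer
      sims = np.array([cosine(q, c) for q, c in zip(query_acts, cand_acts)])
      return float(LAYER_WEIGHTS @ sims)

  def next_token_distribution(query_acts, corpus, k=50):
      # corpus: list of (per_layer_activations, next_token) pairs,
      # precomputed by running the original model over the training text
      best = sorted(corpus, reverse=True,
                    key=lambda item: weighted_similarity(query_acts, item[0]))[:k]
      counts = Counter(tok for _, tok in best)
      total = sum(counts.values())
      return {tok: n / total for tok, n in counts.items()}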


I'm confused - you had the exact same idea that LLM output is based on the probability of the next token, which is based on the training data?

If that's the case, then no, it's unlikely this result will end up becoming evidence; that is well known and fundamental.

The author's contribution to the discussion is showing this to a technical audience by writing their own GPT; as they note, most "how to implement this?" material focuses on transformers.


Much of the sales hype and other literature surrounding LLMs specifically obfuscates the role that training data plays in the model. Training data is "learned from", but that's implying the data goes away after the training process ends and you have a model that's solely composed of uncopyrightable knowledge about how to write or draw. If the models are actually retaining training data, and we have a way to extract that data, then the models didn't learn - legally speaking[0], they copied training set data.

The idea I had wasn't "LLMs are based on probabilities", it was "what if you benchmarked an LLM against a traditional search index over the training corpus". The linked blog post doesn't completely rip out the LLMs entirely, just the feed-forward layer, but the result is what I thought would happen: an attention-augmented search index that is producing nearly identical probability distributions to the 66% of the model that was removed.

[0] Programmers talking about copyright usually get tripped up on this, so I'll spell it out: copyright is a matter of data provenance, not bit-exactness. Just because the weights are harder to inspect does not mean no copyright infringement has occurred. Compression does not launder copyright.


To establish this up front: people know this, it's not controversial. It's not known only to a few. It's how it works.

Also note that this example purposefully minimizes the training data down to an absurdity, so that it is possible to correlate the next letter's probabilities 1:1 with the input. The key to the rest of this comment, and the discussions you reference, is the observation that this becomes vastly harder once the training data is measured in terabytes, to the point that the question becomes interesting.

As for the argument you're describing: the people you think are speaking literally are speaking figuratively. They know it reproduces _some_ training data; e.g., 2+2=4 was surely in the training data. Or cf. NY Times v. OpenAI, where they were able to get it to complete an article given its first ~5 paragraphs.

The unsettled question, in US legal parlance, is if LLMs are sufficiently transformative of the training data that it becomes fair use.

Eschewing US legal parlance: where, exactly, on the spectrum from "completely original" to "photocopier with perfect recall" do LLMs fall, given we know they aren't at either of those extremes? What responsibility does that give someone operating an LLM commercially to the entities who originated the training data?


In my experience before they blocked it: it hallucinates something that looks like training data. A GitHub readme that under closer inspection doesn't actually exist and is incoherent. Some informational brochure about nothing. A random dialogue.


I found it interesting that in the arxiv paper you linked they are talking about an attack, ethics and responsible disclosure.

But when it comes to scraping the entirety of the internet to train such models that's never referred to as an attack.


Scraping the whole web isn't considered an attack because, well, that's just how search engines work. That being said, there are all sorts of norms (e.g. robots.txt) qualifying what kinds of scraping are accepted.

As far as I can tell, AI researchers assumed they could just piggyback on top of those norms to get access to large amounts of training data. The problem is that it's difficult to call copying an attack unless you go full MAFIAA[0]brain and argue that monopoly rents on creative works are the only functional backstop to the 1st Amendment. Hell, even if you do, the EU and Japan[1] both have a statutory copyright exception explicitly legalizing AI training on other people's text. It's not even accepted dogma among copyright holders that this is an attack.

[0] Music And Film Industry Association of America, a fictional industry association purported to be the merger of the MPAA and RIAA announced on April 1st, 2006: http://mafiaa.org/

[1] Yes, the same country whose copyright laws infamously have no Fair Use equivalent. In Japan, it is illegal to review or parody a copyrighted work without a license, but it is legal to train an AI on it.


Alternatively, you can just believe in standard copyright law, which says you need a license to distribute much content. Most file-sharing cases ruled in favor of that.

The AI companies have been bundling and distributing copyrighted works for pretraining. They engage in illegal activities just to build the AIs. That's before considering the models regurgitating training data or generating derivative works. So there's a lot of risk which they're just ignoring for money.


I don't want to have copyright law as it currently exists. It is a badly-negotiated bargain. The public gets very little out of it, the artists get very little protection out of it, and the only people who win are intermediaries and fraudsters.

Keep in mind, this is the same copyright that gave us Prenda Law, an extortion scheme that bilked millions of dollars in bullshit settlements. Prenda Law would create shell companies that created porn, post it on BitTorrent, then have the shell companies sue anyone who downloaded it. Prenda Law would even post all their ongoing litigation on their website with the express purpose of making sure everyone Googling for your name saw the porn, just to embarrass you into settling faster.

This scheme was remarkably profitable, and only stopped being profitable because Prenda slipped up and revealed the fraud[0]. Still, the amount of fraud you have to commit is very minuscule compared to the settlements you can extract out of people for doing this, and there's been no legal reform to try and cut off these sorts of extortion suits. Prenda isn't even the only entity that tried this; Strike 3 Holdings did the same thing.

[0] If you upload your own content to BitTorrent, the defense could argue that this is implied license. Prenda's shell companies would lie about having uploaded the content themselves.


re: 2... if you copyright a work, then surely you also hold rights to a zip file of that work. So why not also the probability distribution of letters in that work?


To be precise, you don’t hold rights to a zip file, copyright doesn’t know anything about files. You hold rights to a work, an abstract legal concept. Your rights to the work allow you to control the reproduction of that work, and distributing a zip file is an instance of reproducing the work.

Probability distributions don’t contain enough information to reproduce a work (since they don’t preserve order). They are not copyrightable in and of themselves, and distributing a probability distribution of a work doesn’t amount to reproduction.


If the probability distribution is enough to reproduce a copyrighted work to the level of substantial similarity, then yes, a copy has legally been made.

However, that's not the only question involved in a copyright lawsuit[0].

So far most of the evidence of copying has been circumstantial: a regurgitated Quake lighting function here, a Getty Images watermark there, but everything else has looked like wholly original output. We know from how these models are trained that copyrighted work is involved somewhere, but a court could just say it's Fair Use to scrape data and train a model on it. However, that defense is way harder to make if we can actually open up a model and show "ok, this is where and how it's storing copied training set data". At a minimum, it takes the "how much was used" Fair Use factor from "a few watermarks" to "your honor, the entire fucking Internet".

[0] As usual we will assume jurisdiction in US court


Just to play devil's advocate, if I were on Microsoft's team, I'd say: You just proved our case. Having the "whole fucking internet" == having general knowledge. Synthesizing general knowledge is fair use. If it was trained only on the works of a certain artist, it would be plagiarism. But if it's trained on all artists, it's not really trained on anyone.


>I trained a small (~10 million parameter) transformer following Andrej Karpathy’s excellent tutorial, Let’s build GPT: from scratch, in code, spelled out

As soon as I learned about Andrej Karpathy's NanoGPT, I trained it on War and Peace (in Russian), and what I found interesting is that it almost grokked Russian grammar despite being just a 3 MB model. Russian has a complex synthetic-inflectional structure. For example, the preposition "na" ("upon") requires the following noun to be in the accusative case, which is manifested as the ending -a for animate masculine nouns, but as a null ending for inanimate nouns, or as -ia for nouns which end in a "soft consonant", -u for feminine nouns, etc., etc. Or the verb "to use" requires the following noun to be in the instrumental case if it's used as a tool.

Although it's not perfect and made mistakes, I found it interesting that NanoGPT was able to infer certain complex rules in just 3 minutes of training - and I searched the text for the exact examples it generated and found nothing verbatim.

However, despite more or less getting the grammar right, it was, semantically, complete nonsense.
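
For anyone who wants to try the same experiment, the character-level setup from that tutorial boils down to building a vocabulary of the distinct characters in the file plus integer encode/decode; a minimal sketch (the file name is a placeholder):

  # Char-level tokenization in the style of the "GPT from scratch" tutorial:
  # the vocabulary is simply every distinct character in the corpus.
  text = open("war_and_peace_ru.txt", encoding="utf-8").read()
  chars = sorted(set(text))
  stoi = {ch: i for i, ch in enumerate(chars)}
  itos = {i: ch for ch, i in stoi.items()}

  encode = lambda s: [stoi[c] for c in s]
  decode = lambda ids: "".join(itos[i] for i in ids)

  data = encode(text)   # integer ids fed to the tiny transformer
  print(len(chars), "character vocabulary,", len(data), "tokens")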


Not too surprising, since the inflections would be among the most common tokens in the training text.


This was a good 3D visualization of the same systems, and they probably should be read together for maximum effect:

LLM Visualization (https://bbycroft.net/llm) https://news.ycombinator.com/item?id=38505211


I appreciate the effort that went into this visualization; however, as someone who has worked with neural networks for 9 years, I found it far more confusing than helpful. I believe that was due to it trying to present everything at once instead of deferring to abstract concepts, but I am not entirely sure. <3 :'))))


nice project, but the model being studied is really just a toy model (both in size and training data). as such, this model can indeed be approximated by simpler models (I would suspect even n-gram LMs), but it might not be representative of how the larger LMs work.


This is probably true - i.e. you could make an even smaller model and then likely come up with an even-simpler explanation for how it worked.


Is the author claiming that LLMs are Markov Chain text generators? That is, the probability distribution of the next token generated is the same as the probability of those token sequences in the training data?

If so, does it suggest we could “just” build a Markov Chain using the original training data and get similar performance to the LLM?


LLMs are Markov chains in the following sense: states are vectors of context-length many tokens. The model then describes a transition matrix: for a given context-length-sized vector of tokens, it gives you the probabilities of the next context-length-sized vector of tokens.
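
Written out as a sketch of that formalization, with V the token vocabulary and n the context length:

  S = V^{n},\qquad
  P\big((t_{2},\dots,t_{n+1})\,\big|\,(t_{1},\dots,t_{n})\big)
    = p_{\mathrm{LM}}\big(t_{n+1}\,\big|\,t_{1},\dots,t_{n}\big)

Transitions to any state that is not a one-token shift of the current window have probability zero.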


Could you elaborate what context length means in this context? Maybe an example?


The length of the input in tokens. For the simple case of tokens just being characters, an LLM does nothing but take a string of length n, the context length, and calculate for each character in the alphabet the probability that this character is the next character following the input. Then it picks one character at random according to that distribution, outputs it as the first character of the response, appends it to the input, discards the first character of the input to get it back to length n, and then repeats the entire process to produce the next character of the response.
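
In code, that loop is roughly the following sketch, where next_char_distribution stands in for whatever the model actually computes from a length-n context:

  # Sketch of the character-level autoregressive loop described above: the
  # model is only ever a map from a length-n context to a distribution over
  # the next character; everything else is sliding the window.
  import random

  def generate(next_char_distribution, prompt, n, length=100):
      context, out = prompt[-n:], []
      for _ in range(length):
          dist = next_char_distribution(context)    # {char: probability}
          chars, weights = zip(*dist.items())
          ch = random.choices(chars, weights=weights)[0]
          out.append(ch)
          context = (context + ch)[-n:]             # drop the oldest character
      return "".join(out)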


No, because the LLM isn't just copying from the same text. Rather, it's "classifying" the text using self-attention, and then applying a simple Markov chain (supposedly). The classification is the hard part, because how do you know what text from the training data is "similar" to the prompt text?

From the blog post for example:

Original string: 'And only l'

Similar strings: 'hat only l' 's sickly l' ' as\nthey l' 'r kingly l'


From the post:

> I implemented imperative code that does what I’m proposing the transformer is doing. It produces outputs very similar to the transformer.

This means there is probably a way to bypass transformers and get the same results. It would be interesting to see whether it's more efficient - like, given a foundation model, train something else and run it on a much smaller device.


I explained that it's not bypassing transformers and not more efficient in another comment: https://news.ycombinator.com/item?id=39254966


I'm having a very hard time understanding exactly what the author is claiming to show. I've read the "Interpretation: Why Does the Approximation Work?" section a few times, but it feels like it's just a mechanical description of the steps of a transformer. What's the core claim?


Is the behavior that the attention + FF displacements tend to point in the same direction known? I am kind of surprised they are even in the same latent space across layers. The FF network could be doing arbitrary rotations, right? I suspect I misunderstand what is going on.


It's a 2D representation of very high-dimensional vectors. Something has to be left out and accurately depicting arbitrary rotations in the high-dimensional space is one of those things.


Best to replace attention addition with scaling and see.


First of all, thank you for the insights. Could you please provide a PDF version of your article with proper formatting, so it can also be read offline?


A thousand hands on a Ouija board.


Is that an analogy? If so, it is an extremely interesting one, and I would like to know where it comes from :D



I think OP was asking about the expression "a thousand hands on a ouija board", not what a ouija board is.


This is a weird post. "What the transformer is actually doing"? You can just follow the code and see what it's doing. It's not doing something more or less than that. It's not doing some other thing.


I'll go with an example to demonstrate why that's not always enough. Many people are quite keen to know what this (for example) is actually doing:

  float InvSqrt(float x){
      float xhalf = 0.5f * x;
      int i = *(int*)&x;
      i = 0x5f3759df - (i >> 1);
      x = *(float*)&i;
      x = x*(1.5f - xhalf*x*x);
      return x;
  }
From https://betterexplained.com/articles/understanding-quakes-fa...

In my case I don't have a huge amount of time to chase down every rabbit hole, but I'd love to accelerate my intuition for LLMs. Multiple points of intuition or comparison really help. I'm also not a Python expert - what you see and what I see from a line of code will be quite different.


The author is attempting to build an explicit mental model of what a bunch of weights are "doing". It's not really the same thing. They are minimizing the loss function.

People try to (and often do) generate intuition for architectures that will work given the layout of the data. But the reason models are so big now is that trying to understand what the model is "doing" in a way humans understand didn't work out so well.


The post is long and complicated and I haven't read most of it, so whether it's actually any good I shan't try to decide. But the above seems like a very weird argument.

Sure, the code is doing what it's doing. But trying to understand it at that level of abstraction seems ... not at all promising.

Consider a question about psychology. Say: "What are people doing when they decide what to buy in a shop?".

If someone writes an article about this, drawing on some (necessarily simplified) model of human thinking and decision-making, and some experimental evidence about how people's purchasing decisions change in response to changes in price, different lighting conditions, mood, etc., ... would you say "You can just apply the laws of physics and see what the people are doing. They're not doing something more or less than that."?

I mean, it would be true. People, so far as we know, do in fact obey the laws of physics. You could, in principle, predict what someone will buy in a given situation by modelling their body and surroundings at the level of atoms or thereabouts (quantum physics is a thing, of course, but it seems likely that a basically-classical model could be good enough for this purpose). When we make decisions, we are obeying the laws of physics and not doing some other thing.

But this answer is completely useless for actually understanding what we do. If you're wondering "what would happen if the price were ten cents higher?" you've got no way to answer it other than running the whole simulation again. Maybe running thousands of versions of it since other factors could affect the results. If you're wondering "does the lighting make a difference, and what level of lighting in the shop will lead to people spending least or most?" then you've got no way to answer it other than running simulations with many different lighting conditions.

Whereas if you have a higher-level, less precise model that says things like "people mostly prefer to spend less" and "people try to predict quality on the basis of price, so sometimes they will spend more if it seems like they're getting something better that way" and "people like to feel that they're getting a bargain" and so on, you may be able to make predictions without running an impossibly detailed person-simulation zillions of times. You may be able to give general advice to someone with a spending problem who'd like to spend more wisely, or to a shopkeeper who wants to encourage their customers to spend more.

Similarly with language models and similar systems. Sure, you can find out what it does in some very specific situation by just running the code. But what if you have some broader question than that? Then simply knowing what the code does may not help you at all, because what the code does is gazillions of copies of "multiply these numbers together and add them".

Again, I make no claim about whether the particular thing linked here offers much real insight. But it makes zero sense, so far as I can see, to dismiss it on the grounds that all you need to do is read the code.


You’re spot on; it’s like saying you can understand the game of chess by simply reading the rules. In a certain very superficial sense, yes. But the universe isn’t so simple. The same reason even a perfect understanding of what goes on at the level of subatomic particles isn’t thought to be enough to say we ‘understand the universe’. A hell of a lot can happen in between the setting out of some basic rules and the end — much higher level — result.


And yet... Alpha Zero.


My entire point is that implementation isn’t sufficient for understanding. Alpha Zero is the perfect example of that; you can create an amazing chess playing machine and (potentially) learn nothing at all about how to play chess.

…so what’s your point? I’m not getting it from those two words.


Understanding how the machine plays or how you should play? They aren't the same thing. And that is the point - trying to analogize to some explicit, concrete function you can describe is backwards. These models are gigantic (even the 'small' ones); they are minimizing a loss function by searching a multi-thousand-dimensional space. It is the very opposite of something that fits in a human brain in any explicit fashion.


So is what happens in an actual literal human brain.

And yet, we spend quite a lot of our time thinking about what human brains do, and sometimes it's pretty useful.

For a lot of this, we treat the actual brain as a black box and don't particularly care about how it does what it does, but knowing something about the internal workings at various levels of abstraction is useful too.

Similarly, if for whatever reason you are interested in, or spend some of your time interacting with, transformer-based language models, then you might want some intuition for what they do and how.

You'll never fit the whole thing in your brain. That's why you want simplified abstracted versions of it. Which, AIUI, is one thing that the OP is trying to do. (As I said before, I don't know how well it does it; what I'm objecting to is the idea that trying to do this is a waste of time because the only thing there is to know is that the model does what the code says it does.)


Sure, good abstractions are good. But bad abstractions are worse than none. Think of all the nonsense abstractions about the weather before people understood and could simulate the underlying process. No one in modern weather forecasting suggests there is a way to understand that process at some high level of abstraction. Understand the low level, run the calcs.


> Understanding how the machine plays or how you should play? They aren't the same thing.

And yet, seeing Alpha Zero play has indeed led to new human chess strategies.


Alpha Zero didn't read the rules, it trained within the universe of the rules for 44 million games.


...in fact, one could argue that not only did it not read the rules — it has no conception of rules whatsoever.


It is very promising. In fact, in industry there are jokes about how getting rid of linguists has helped language modeling.

Trying to understand it at some level of abstraction that humans can fit in their head has been a dead end.


Trying to build systems top-down using principles humans can fit in their head has arguably been a dead end. But this doesn't mean that we cannot try to understand parts of current AI systems at a higher level of abstraction, right? They may not have been designed top-down with human-understandable principles, but that doesn't mean that trained, human-understandable principles couldn't have emerged organically from the training process.

Evolution optimized the human brain to do things over an unbelievably long period of time. Human brains were not designed top-down with human-understandable principles. But neuroscientists, cognitive scientists, and psychologists have arguably had success with understanding the brain partially at a higher level of abstraction than just neurons, or just saying "evolution optimized these clumps of matter for spreading genes; there's nothing more to say". What do you think is the relevant difference between the human brain and current machine learning models that makes the latter just utterly incomprehensible at any higher level of abstraction, but the former worth pursuing by means of different scientific fields?


I don't know neuroscience at all, so I don't know if that's a good analogy. I'll make a guess though: consider a standard RAG application. That's a system which uses at least a couple of models. A person might reasonably say "the embeddings in the db are where the system stores memories. The LLM acts as the part of the brain that reasons over whatever is in working memory plus its sort-of implicit knowledge." I'd argue that's reasonable. But systems and models are different things.

People use many abstractions in AI/ML. Just look at all the functionality you get in PyTorch as an example. But they are abstractions of pieces of a model, or pieces of the training process etc. They aren't abstractions of the function the model is trying to learn.


Right, I've used PyTorch before. I'm just trying to understand why the question of "how does a transformer work?" is only meaningfully answered by describing the mechanisms of self-attention layers, treating that as the highest admissible level of abstraction, with anything higher being nonsense. More specifically, why we should have a ban on any higher level of abstraction in this scenario when we can answer the question of "how does the human mind work?" at not just the atom level, but also the neuroscientific level or psychological level. Presumably you could say the same thing about this question: The human mind is a bunch of atoms obeying the laws of physics. That's what it's doing. It's not something else.

I understand you're emphasizing the point that the connectionist paradigm has had a lot more empirical success than the computationalist paradigm - letting AI systems learn organically, bottom-up is more effective than trying to impose human mind-like principles top-down when we design them. But I don't understand why this means understanding bottom-up systems at higher level of abstractions is necessarily impossible when we have a clear example of a bottom-up system that we've had some success in understanding at a high level of abstraction, viz. the human mind.


It would be great if they were good, but they seem to be bad; it seems that they must be bad, given the dimensionality of the space, and humans latch onto simple explanations even when they are bad.

Think about MoE models. Each expert learns to be good at completing certain types of inputs. It sounds like a great explanation for how it works. Except, it doesn't seem to actually work that way. The mixtral paper showed that the activated routes seemed to follow basically no pattern. Maybe if they trained it differently it would? Who knows. It certainly isn't a good name regardless.

Many fields/things can be understood at higher and higher levels of abstraction. Computer science is full of good high level abstractions. Humans love it. It doesn't work everywhere.


Right, of course we should validate explanations based on empirical data. We rejected the idea that there was a particular neuron that activated only when you saw your grandmother (the "grandmother neuron") after experimentation. But just because explanations have been bad, doesn't mean that all future explanations must also be bad. Shouldn't we evaluate explanations on a case-by-case basis instead of dismissing them as impossible? Aren't we better off having evaluated the intuitive explanation for mixtures of experts instead of dismissing them a priori? There's a whole field - mechanistic interpretability - where researchers are working on this kind of thing. Do you think that they simply haven't realized that the models they're working on interpreting are operating in a high-dimensional space?


Mechanistic interpretability studies a bunch of things though. Like, the mixtral paper where they show the routing activations is mechanistic interpretability. That sort of feature visualization stuff is good. But I don't know what % of the field is spending their time trying to interpret the models in a higher-level, human-explainable, "it approximates the following code" sort of way. I'm certainly not the only one who thinks it's a waste of time, and I don't believe anything I've said in this thread is original in any way.

I... don't know if the people involved in that specific stuff have really grokked that they are working in a high-dimensional space? A lot of otherwise smart people work in macroeconomics, where for decades they haven't really made any progress because it's so complex. It seems stupid to suggest a whole field of smart people don't realize what they are up against, but sheesh, it kinda seems that way, doesn't it? Maybe I'll be eating my words in 10 years.


They certainly understand they're working in a high dimensional space. No question. What they deny is that this necessarily means the goal of interpretability is a futile one.

But the main thrust of what I'm saying is that we shouldn't be dismissing explanations a priori - answers to "how does a transformer work?" that go beyond descriptions of self-attention aren't necessarily nonsensical. You can think it's a waste of time (...frankly, I kind of think it's a waste of time too...), but just like any other field, it's not really fair to close our eyes and ears and dismiss proposals out of hand. I suppose "Maybe I'll be eating my words in 10 years" indicates you understand this, though.


Understanding how a given CPU (+ the other computer hardware) works, does not suffice to understand what is going on when a particular program is running. For that, you need to either read the program, or an execution trace, or both, or something along these lines, which is specific to the program being run.


This is the wrong analogy. The transformer block is a bunch of code and weights. It's a set of instructions laying out which numbers to run which operations on. The optimizer changes weights to minimize a loss function during training and then the code implementing a forward pass just runs during inference. That's what it is doing. It's not doing something else.

If the argument is that a model is a function approximator, then it certainly isn't approximating some function that performs worse at the task at hand, and it certainly isn't approximating a function we can describe in a few hundred words.


We have no reason at all to be certain of the latter.


There is pretty good reason. If it could be described explicitly in a few hundred words, it would be extremely unlikely that we'd have seen a jump in capability with model size.


Last week I trained an LLM to auto-unrotate a Caesar cipher, without it knowing the rotation key. It turns out there are only a few possible ways this can be done, e.g. by analyzing letter frequencies. Of course, the LLM must use one of these algorithms! Determining this high-level algorithm would be "understanding" what it is "actually doing".

Through analysis, I proved it indeed counts each letter's frequency. It then combines the "overall" letter frequencies with another distribution of "first letter" frequencies. This increased its accuracy to nearly 100%.

To me, this description is much nicer than had I "followed the code" and described it as millions of multiplications and additions. (Much of this ends up being noise anyway.)
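
For comparison, the classical (non-LLM) version of that frequency-analysis algorithm is tiny; here is a minimal sketch (assuming lowercase English text, with a rounded frequency table):

  # Classic frequency analysis for a Caesar cipher: try every rotation and
  # keep the one whose letter frequencies best match English (chi-squared).
  ENGLISH_FREQ = {
      'a': .082, 'b': .015, 'c': .028, 'd': .043, 'e': .127, 'f': .022,
      'g': .020, 'h': .061, 'i': .070, 'j': .002, 'k': .008, 'l': .040,
      'm': .024, 'n': .067, 'o': .075, 'p': .019, 'q': .001, 'r': .060,
      's': .063, 't': .091, 'u': .028, 'v': .010, 'w': .024, 'x': .002,
      'y': .020, 'z': .001,
  }

  def rotate(text, k):
      return "".join(chr((ord(c) - 97 - k) % 26 + 97) if c.isalpha() else c
                     for c in text.lower())

  def chi_squared(text):
      letters = [c for c in text if c.isalpha()]
      n = len(letters) or 1
      return sum((letters.count(c) - f * n) ** 2 / (f * n)
                 for c, f in ENGLISH_FREQ.items())

  def crack(ciphertext):
      # pick the rotation whose output looks most like English
      return min((rotate(ciphertext, k) for k in range(26)), key=chi_squared)

  secret = rotate("frequency analysis needs enough letters to be reliable", -7)
  print(crack(secret))   # should recover the original sentence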


A walkthrough of what the data looks like at each point is actually pretty useful.


Sure, it is. But trying to explain it as though the weights have some goal is weird. They aren't trying to do anything. You have a loss function. The optimizer keeps moving weights around in an attempt to minimize the loss function. It's not more or less than that.


This is just wrong.

First of all, you're rejecting teleological anthropomorphizing (saying it's weird to act as if "the weights have some goal") but then in the very next line you talk about the optimizer making "an attempt" to accomplish a goal. All of which misses the point, since the question is about _explanations_ not goals and intentions.

Then you reject out of hand any other level of explanation than the one you favor, saying "it's not more or less than that" when in fact it is both more and less than that; you can climb the ladder of abstraction either way to build more or less abstract explanations. We can dig down and talk about how the optimizer adjusts weights, or how tensor math works and how it's used in this case, or about how GPUs work, or gates, or transistors, etc. Or we could climb up and talk about (as this article does) and talk about what attention heads do, and why they work, when they work, when they don't, etc.


The optimizer has a goal. The weights in the model do not. The optimizer isn't the model. There is no contradiction if you know how it works.

Climbing the layer of abstraction from model weights doesn't seem to work in this field. Just saying it's so doesn't make it so.


> Just saying it's so doesn't make it so.

Does that apply to you as well?


Of course. You shouldn't take my word for it. You can learn the basics of AI/ML from a number of good texts. Simon Prince just released a very approachable one, although it doesn't cover much in the way of history, so you don't see the move to "more data/more compute, less human-led abstraction". I think Norvig's book covers that, but I haven't read the latest version.


You sure like to make assumptions, don't you? :)


> You can just follow the code and see what it's doing. It's not doing something more or less than that.

And that's why we'll never have fun things like an obfuscated code contest. Oh wait[1]...

[1]: https://www.ioccc.org/



