By minimizing a loss functional with respect to a bunch of numbers that amount to entries in matrices (or tensors, whatever) using an approximate hill-climbing approach. I’m not sure what insights there are to be gained here; it doesn’t seem more exotic or interesting to me than asking “how does the pseudoinverse of A ‘learn’ to approximate the solution of Ax=b?”. Maybe this seems reductive, but once you nail down what the loss functional is (often MSE loss for regression or diffusion models, cross entropy for classification, and many others) and perhaps the particulars of the model architecture (feed-forward vs recurrent, fully connected bits vs convolutions, encoder/decoders), then it’s unclear to me what is left for us to discover about how “learning” works beyond understanding old fundamental algorithms like Newton-Krylov for minimizing nonlinear functions (which subsumes basically all deep learning and goes well beyond). My gut tells me that the curious among you should spend more time learning about the fundamentals of optimization than puzzling over some special (and probably non-existent) alchemy inherent in deep networks.
> it doesn’t seem more exotic or interesting to me than asking “how does the pseudoinverse of A ‘learn’ to approximate the solution of Ax=b?”
Asking things like properties of the pseudoinverse against a dataset on some distribution (or even properties of simple regression) is interesting and useful. If we could understand neural networks as well as we understand linear regression, it would be a massive breakthrough, not a boring "it's just minimizing a loss function" statement.
Hell even if you just ask about minimizing things, you get a whole theory of M estimators [0]. This kind of dismissive comment doesn't add anything.
You raise a fair point; I do think that it’s important to understand how the properties of the data manifest in the least-squares solution to Ax=b. Without that, the only insights we have are from analysis, while we would be remiss to overlook the more fundamental theory, which is linear algebra. However, my suspicion is that the answers to these same questions, applied to nonlinear function approximators, are probably not much different from the insights we have already gained in more basic systems. That said, the overly broad title of the manuscript doesn’t seem to point toward those kinds of questions (specifically, things like “how do properties of the data manifold manifest in the weight tensors”) and I’m not sure that one should equate those things to “learning”.
This is overly reductive. Understanding what they're doing at a higher level is useful. Even if you knew everything about neuron activations and how they change with stimulus, that wouldn't be enough for a human to develop a syllabus for teaching maths, even if they "understand how people learn".
What you describe also doesn't answer the question of how to structure and train a model, which surely is quite important. How do the choices impact real world problems?
Sure, but their title seems poorly chosen and doesn't match what they are claiming in the article itself, which includes understanding how GPT-2 makes its predictions.
How does GPT-2 learn, for example, that copying a word from way back in the context helps it to minimize the prediction error? How does it even manage to copy a word from the context to the output? We know that it is minimizing prediction errors, and learned to do so via gradient descent, but HOW is it doing it? (we've discovered a few answers, but it's still a research area)
I haven’t read the manuscript yet, and am not sure that I will. However, I don’t agree with the question. Gradient descent and the properties of the loss function are the “how”. It seems like you want to know how some properties of the data are manifested in the network itself during/after training (what these properties are doesn’t seem to be something that people know they are looking for). Maybe that’s what the authors are interested in as well. If I could bet money in Vegas on the answer to that question, my bet would be that in most cases the structures we probe in the network, the ones in which we see correlations to aspects of the problem or task that we (as humans) can recognize, will very likely boil down to approximations of fundamental and eminently useful quantities like, say, approximate singular value decompositions of regions of the data manifold, or approximate eigenfunctions, etc. I could see how these kinds of empirical investigations are interesting, but what would their impact be? Another guess: these investigations may lead to insights that help engineers design better architectures or incrementally improve training methods. But I think that’s about it - this type of research strikes me as engineering and application.
Outside of pure interest in how these LLMs are working, the utility/impact of understanding them would be the ability to control them - how to remove capabilities you don't want them to have (safety), or perhaps even add capabilities, or just steer their behavior in some desirable way.
Pretty much everything about NNs is engineering - it's basically an empirical technology, not one that we have much theoretical understanding of outside of the very basics.
> Pretty much everything about NNs is engineering - it's basically an empirical technology, not one that we have much theoretical understanding of outside of the very basics.
This pretty much answers the question some have asked: “why are the world’s preeminent mathematicians not working on AI if AGI will solve everything eventually anyway?”.
At least for now, the skills required to make progress in AI (machine learning as it largely is now) are those of an engineer rather than a mathematician.
> By minimizing a loss functional with respect to a bunch of numbers that amount to entries in matrices (or tensors, whatever) using an approximate hill climbing approach.
Are the rules of chess all there is to it? Is there really no more to be said?
Well, if neural nets are nothing more than their optimization problem then why isn't there a mathematical proof of this already?
And why isn't that reductionism? We don't say human learning is merely the product of millions of years of random evolution, and leave it at that. So if we take a position on a reductionist account of learning, then how do we prove it or disprove it?
Are there arguments that don't rest on our gut feelings? Otherwise this is just different expert factions arguing that "neural nets are/aren't superautocomplete / stochastic parrots" but with more technobabble.
I'm with you. My only understanding of ML is a class in 2016 where we implemented basic ML algos - not neural nets, GPTs or whatever - but I always assumed it's not radically different.
Take a bunch of features or make up a billion features, find a function that best predicts the greatest number of outputs correctly. Any "emergent" behavior I imagine is just a result of finding new features or sets of features.
I agree with your interpretation. There is something there to be learned for sure, but I’m doubtful whatever that thing is will be a breakthrough in machine learning or optimization, nor that it will come by applying the tools of analysis. The idea of “emergence” is interesting, although vague and bordering on unscientific. Maybe complexity theory, graph theory, and information theory might provide some insights. But in the end, I would guess those insights’ impact will be limited to tricks that can be used to engineer marginally better architectures or marginally faster training methods.
It's all really just basic calculus, with a couple nifty tricks layered on top:
1) Create a bunch of variables and initialize them to random values. We're going to add and multiply these variables. The specific way that they're added and multiplied doesn't matter so much, though it turns out in practice that certain "architectures" of addition and multiplication patterns are better than others. But the key point is that it's just addition and multiplication.
2) Take some input, or a bunch of numbers that convey properties of some object, say a house (think square feet, number of bedrooms, number of bathrooms, etc) and add/multiply them into the set of variables we created in step 1. Once we plug and chug through all the additions and multiplications, we get a number. This is the output. At first this number will be random, because we initialized all our variables to random numbers. Measure how far the output is from the expected value corresponding to the given inputs (say, purchase price of the house). This is the error or "loss". In the case of purchase price, we can just subtract the predicted price from the expected price (and then square it, to make the calculus easier).
3) Now, since all we're doing is adding and multiplying, it's very straightforward to set up a calculus problem that minimizes the error of the output with respect to our variables. The number of multiplication/addition steps doesn't even matter, since we have the chain rule. It turns out this is very powerful: it gives us a procedure to minimize the error of our system of variables (i.e. model), by iteratively "nudging" the variables according to how they affect the "error" of the output. The iterative nudging is what we call "learning". At the end of the procedure, rather than producing random outputs, the model will produce predictions of house prices that correlate with the distribution of input square footage, bedrooms, bathrooms, etc. we saw in the training set.
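To make steps 1-3 concrete, here's a minimal NumPy sketch, with a single linear "layer" rather than a real multi-layer network; the house data, learning rate, and step count are made up purely for illustration:

```python
import numpy as np

# Step 1: variables ("weights") initialized to random values.
rng = np.random.default_rng(0)
w = rng.normal(size=3)   # one weight per input feature
b = rng.normal()         # bias term

# Toy training data: [square feet, bedrooms, bathrooms] -> price.
X = np.array([[1400, 3, 2], [2000, 4, 3], [900, 2, 1], [1700, 3, 2]], dtype=float)
y = np.array([250_000, 340_000, 150_000, 295_000], dtype=float)
X = (X - X.mean(axis=0)) / X.std(axis=0)   # normalize so one learning rate suits all features

lr = 0.1
for step in range(500):
    # Step 2: add/multiply the inputs with the variables to get an output,
    # then measure the squared error ("loss") against the expected prices.
    pred = X @ w + b
    err = pred - y
    loss = np.mean(err ** 2)

    # Step 3: the chain rule gives the gradient of the loss w.r.t. each variable;
    # "nudge" every variable a little in the direction that reduces the loss.
    grad_w = 2 * X.T @ err / len(y)
    grad_b = 2 * err.mean()
    w -= lr * grad_w
    b -= lr * grad_b

print(loss, w, b)   # loss shrinks; w and b now encode the price pattern in the toy data
```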
In a sense, ML and AI are really just the next logical step of calculus once we have big data and computational capacity.
Calculus is all you need! Neural nets are trained to minimize their errors (what they actually output vs what we want them to output). When we build a neural net we know the function corresponding to the output error, so training them (finding the minimum of the error function) is done just by following the gradient (derivative) of the error function.
I think there are still open questions about this that are worth asking.
It is clear enough that following gradients of a bounded differentiable function can bring you to a local minimum of the function (unless, I guess, there’s a path that heads away from the starting location, going off to infinity, along which the function is always decreasing and asymptotically approaching some value; but this sort of situation can be prevented by adding loss terms that penalize parameters being too big).
But, what determines whether it reaches a global minimum? Or, if it doesn’t reach a global minimum, what kinds of local minima are there, and what determines which kinds it is more likely to end up in? Does including momentum and stochastic stuff in the gradient descent influence the kinds of local minima that are likely to be approached? If so, in what way?
Local minima aren't normally a problem for neural nets since they usually have a very large number of parameters, meaning that the loss/error landscape has a correspondingly high number of dimensions. You might be at a local minimum along one of those dimensions, but the probability of simultaneously being at a local minimum along all of them is vanishingly small.
Different learning rate schedules, as well as momentum/etc, can also help avoid getting stuck for too long in areas of the loss landscape that may not be local minima, but may still be slow to move out of. One more modern approach is to cycle between higher and lower learning rates rather than just use monotonically decreasing ones.
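For what it's worth, a cyclical schedule of the kind mentioned is easy to write down. A minimal sketch of a triangular cycle (the base/max rates and cycle length here are arbitrary, not taken from any particular recipe):

```python
def triangular_lr(step, base_lr=1e-4, max_lr=1e-2, cycle_len=2000):
    """Cycle linearly from base_lr up to max_lr and back down, every cycle_len steps."""
    pos = (step % cycle_len) / cycle_len       # position within the current cycle, in [0, 1)
    tri = 1.0 - abs(2.0 * pos - 1.0)           # ramps 0 -> 1 -> 0 over the cycle
    return base_lr + (max_lr - base_lr) * tri

# e.g. use triangular_lr(step) as the step size in the gradient update at each iteration
```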
I'm not sure what the latest research is, but things like batch size and learning rate can certainly affect the minimum found, with some resulting in better generalization than others.
Indeed, hopefully they can be diverted from interest in LLMs towards actual science, like the neuroscience which revealed the existence of said mirror neurons.
Your point is a salient one. It would be useful if we could provide guarantees/bounds on generalization, or representation power, or understand how brittle a model is to shifts in the data distributions. Are these questions of the kind that are answered in part by the authors? I haven’t read the manuscript, but the title doesn’t indicate this is the aim of the research; rather, it indicates an eye toward something much broader and vaguer (“learning”).
What would be the requirements in order for most researchers to agree that a “conclusive” answer has been established to the question “How do neural networks work?”
I ask not because this paper isn’t insightful research, but rather because if you search Google Scholar or arXiv for papers purporting to describe how neural networks “actually work”, you get thousands upon thousands of results all claiming to answer the question, and yet you never really come away with the sense that the question has truly been resolved in a satisfactory way (see also: the measurement problem).
I’ve noticed that each paper uses a totally different approach to addressing the matter that just happens to correspond to the researchers’ existing line of work (It’s topology! No, group theory has the answer! Actually, it’s compressed sensing... computational complexity theory... rebranded old-school quantum chemistry techniques... and so on.)
I suppose my question is more about human psychology than neural networks, since neural networks seem to be working just fine regardless of how well we understand them. But I think it could be useful to organize a multi-disciplinary conference where some open questions regarding machine learning are drafted (similar to Hilbert’s problems), that upon their successful resolution would mean neural networks are no longer widely considered “black boxes”.
If a paper provided general tools that let me sit down with pen and paper and derive that SwiGLU is better than ReLU, and that batch normalization trains 6π/√5 times faster than no normalization, etc., and I could get the derivation right before running the experiment, then I would believe that I understood how neural networks train.
Pretty much this: so long as you can't make any prediction ahead of time of the approximate performance of an architecture on a dataset, you shouldn't claim to have a good understanding of how a network "learns".
This doesn't make sense as a requirement - you make no mention of the data. "How it works" will always depend, in part, on the data, you can't fully separate the algorithm from the data if you want to answer questions about why learning works the way it does. IMO.
A conclusive answer to how neural networks work should be both descriptive and prescriptive. It should not only tell you how they work but give you new insights into how to train them or how to fix their errors.
For example, does your theory tell you how to initialize weights? How the weights in the NN were derived from specific training samples? If you removed a certain subset of training samples, how would the weights change? If the model makes a mistake, which neurons/layers are responsible? Which weights would have to change, and what training data would need to be added/removed to have the model learn better weights?
If you can't answer these sorts of questions, you can't really say you know how they work. Kind of like steam engines before Carnot, or Koch's principles in microbiology, a theory is often only as good as it can be operationalized.
There are reasonable attempts at answering those questions. We're not lost in the woods, we have a lot of precedent in research that has brought us closer to understanding these things.
Would you say that there’s a significant enough effort currently to gain an understanding? I work in fintech, and based on the kind and variety of security and compliance controls we have in place, I can’t imagine a world where we are permitted to include generative AI technology anywhere near the transaction hot path without >=1 human in the loop. Unless the value is so enormous that it dwarfs the risk. I can see at least one use case here, especially at the integration layer, which has a very large and boring spend as partners change/modernize and new partners enter the market.
We already understand enough, and have for many years, to know that the Achilles heel of any system we have today that we consider "AI" is that they are fundamentally statistical methods that cannot be formally verified to act correctly in all cases. Modern day chatbots are going to have the worst time at this since there is very little constraining their behavior and they are explicitly built to be general-purpose. You can make the case for special, constrained tools that have limited variability within defined and appropriate limits, but you can't make the case that the no free lunch theorem has been defeated just because a statistical learning system happens to write text kind of like an English-speaking human might.
It's my personal opinion that there should never be a decision system based on statistical approximations without a human in the loop, particularly if the consequences can affect lives and livelihoods.
Perhaps part of the description of the problem should also be that we understand descriptions of how neural networks work, but there are a lot of things about training them that we don’t understand. And by “understanding” I mean we often don’t have a better way of training things than trial and error - we lack prescriptive definitions. Maybe it works, maybe it doesn’t, maybe it’s suboptimal (it almost certainly is). And a lot of these things are data-dependent.
I think there's also an element of moving goalposts. Maybe we understand certain structures of a fully connected nn, but then we need to understand what's happening within a transformer.
They're going to continue to get more complex, and so we will always have more to understand.
I think what you're hoping for is some theory that once discovered will apply to all future NN architectures (and very likely help us find the "best" ones). Do you think that exists?
The problem lies in our understanding of the concept "understanding". It is still pretty unclear at a fundamental level what it means to understand, learn, or conceptualize things.
This quickly leads to thought about consciousness and other metaphysical issues that have not been resolved, and probably never will be.
But we don't have a clue what "understanding" truly means when it comes to animals, including humans, which is the more relevant problem for this particular thread[1]. There is an intuitive sense of "understanding" but we don't have a good formalism around it, and this intuitive sense is easily tricked. But since we don't know how to formalize understanding, we are not yet capable of building AI which truly understands concepts (abstract and concrete) in the same way that birds and mammals do.
As a specific example, I suspect we are several decades away from AI being able to safely perform the duties of guide dogs for the blind, even assuming the robotics challenges are solved. The fundamental issue is that dogs seem to intuitively understand that blind people cannot see, and dogs can therefore react proactively to a wide variety of situations they would have never been trained on, and they can insist on disagreeing with blind people (intelligent disobedience) rather than being gaslit into thinking a dangerous crosswalk is actually safe. The approach of "it works well most of the time, but you gotta check its work" might fly for an LLM but humans need to trust seeing-eye dogs to make good decisions without human oversight.
In particular, being a seeing-eye AI seems much more difficult than fully autonomous driving, even considering that the time constraints are relaxed. Buildings are far more chaotic and unpredictable than streets.
[1] Note these concerns are not at all relevant for the research described in the article, where "learn" means "machine learning" and does not imply (or require) "understanding."
> we don't have a clue what "understanding" truly means when it comes to animals, including humans
Who is "we"? The philosophical literature has some very insightful things to say here. The narrow scientistic presumption that the answer must be written in the language of mechanism requires revisiting. Mechanism is intrinsically incapable of accounting for such things as intentionality.
Furthermore, I would not conflate human understanding with animal perception writ large. I claim that one feature that distinguishes human understanding vs. whatever you wish to call the other is the capacity for the abstract.
> we don't have a good formalism around it
This tacitly defines understanding as having a formalism for something. But why would it? What does that even mean here? Is "formalism" the correct term here? Formalism by definition ignores the content of what's formalized in order to render, say, the invariant structure conspicuous. And intentionality is, by definition, nothing but the meaning of the thing denoted.
> AI which truly understands concepts (abstract and concrete)
Concepts are by definition abstract. It isn't a concept if it is concrete. "Triangularity" is a concept, while triangles in the real are concrete objects (the mental picture of a triangle is concrete, but this is an image, not a concept). When I grasp the concept "Triangularity", I can say that I understand what it means to be a triangle. I have a possession, there is intentionality, that I can predicate of concrete instances. I can analyze the concept to determine things like the 180 degree property. Animals, I claim, perceive concrete instances only, as they have no language in the full human sense of the word.
AI has nothing to do with understanding, but simulation. Even addition does not, strictly speaking, objectively occur within computers (see Kripke's "quaddition"/"quus" example). Computers themselves are not objectively speaking computers (see Searle's observer-relativity). So the whole question of whether computers "understand" is simply nonsensical, not intractable or difficult or vague or whatever. Computers do not "host" concepts. They can only manipulate what could be said, by analogy, to be like images, but even then, objectively speaking, there is no fact of the matter that these things are images, or images of what is said to be represented. There is nothing about the representation of the number 2 that makes it about the number 2 apart from the conventions human observers hold in their own heads.
You seem to attack scientism for being narrow, which I find valid. However, if I understand it correctly, you then proceed to offer solutions by referring to other philosophical interpretations. I would say that those are also limited in a way.
My original intention was to suggest that, as there are multiple possible interpretations, and no good way to decide on which is best, that we simply do not get to fully understand how thinking works.
Science typically would shy away from the issue, by stating that it is an ill-defined problem. The Wittgenstein reference seems to do something similar.
Recent advancements in LLMs might give science a new opportunity to make sense of it all. Time will tell.
It was resolved just over 100 years ago by Wittgenstein. Either you fully define "understanding", in which case you've answered your question, or you don't clearly define it, in which case you can't have a meaningful discussion about it because you don't even have an agreement on exactly what the word means.
That sounds clever, but it is actually devoid of both insight and information. There is nothing to be learned from that wit, witty as it is. Contrast with the question it is cleverly dismissing, which could potentially help us move technology forward.
Bertrand's paradox (in probability) (https://en.wikipedia.org/wiki/Bertrand_paradox_(probability)) is a great counterexample showing how you can have a meaningful discussion about a poorly defined question. It's a bit of a meta example: the meaningful discussion is about how what appears to be a well-defined question isn't, and how pinning the question down in different ways gives different answers. It also shows that there are multiple correct answers for different definitions of the question. While they disagree, they are all correct within their own definition.
Back to the topic of neural networks, just talking about why the question is hard to clearly define can be a meaningful discussion.
Didn't Wittgenstein shift a bit later in his career? All 'word' play is built on other words, and language turns into a self-referential house of cards. Didn't his grand plan to define everything fall apart, leading him to give up on it?
I thought problem was you couldn't get anywhere because of the question, but you're saying the problem is the question didn't get an answer that satisfies your taste?
The problem, as stated by the parent, is that it is "still pretty unclear at a fundamental level what it means to understand, learn, or conceptualize things."
Which Wittgenstein didn't resolve, he describes how to kick the can down the road. Which is fine, every science needs to make assumptions to move on, but in no way is that a "resolution" to the problem of "what it means to understand, learn, or conceptualize things."
A strict definition is almost never required outside maths. We got so far being unable to define "woman" it turns out.
The most naive meaning of understanding, such as "demonstrating ability to apply a concept to a wide range of situations" is good enough for many cases, Goedel be damned.
That's a pretty useless answer. Just because you cannot fully define something doesn't mean you cannot define parts of it or have different useful definitions of it.
A technique that is applied to widely different fields obviously yields a large set of interpretations, each through the lens of its own field.
But that doesn't invalidate any of those interpretations no?
It seems they are trying to answer "WHAT do NN's learn?", and "How do NN's WORK?", as much as their title question of "How do NN's learn?".
Here's an excerpt from the article:
"The researchers found that a formula used in statistical analysis provides a streamlined mathematical description of how neural networks, such as GPT-2, a precursor to ChatGPT, learn relevant patterns in data, known as features. This formula also explains how neural networks use these relevant patterns to make predictions."
The trite answer to "HOW do NN's learn?" is obviously gradient descent - error minimization, with the features being learnt being those that best support error minimization by the higher layers, effectively learning some basis set of features that can be composed into more complex higher level patterns.
The more interesting question perhaps is WHAT (not HOW) do NN's learn, and there doesn't seem to be any single answer to that - it depends on the network architecture. What a CNN learns is not the same as what an LLM such as GPT-2 (which they claim to address) learns.
What an LLM learns is tied to the question of how does a trained LLM actually work, and this is very much a research question - the field of mechanistic interpretability (induction head circuits, and so forth). I guess you could combine this with the question of HOW does an LLM learn if you are looking for a higher level transformer-specific answer, and not just the generic error minimization answer: how does a transformer learn those circuits?
Other types of NN may be better understood, but anyone claiming to fully know how an LLM works is deluding themselves. Companies like Anthropic don't themselves fully know, and in fact have mechanistic interpretability as a potential roadblock to further scaling since they have committed to scaling safely, and want to understand the inner workings of the model in order both to control it and provide guarantees that a larger model has not learnt to do anything dangerous.
In short, we do know how NNs learn and work, but not what NNs learn. The corollary being that we don't understand where the emergent properties come from.
It depends on the type of NN, and also what level of explanation you are looking for. At the basic level we do of course know how NN's learn, and what any architecture is doing (what each piece is doing), since we designed them!
In the case of LLMs like ChatGPT, while we understand the architecture, and how it works at that level (attention via key matching, etc), what is missing is how the architecture is actually being utilized by the trained model. For example, it turns out that consecutive pairs of attention heads sometimes learn to coordinate and can look words (tokens) up in the context and copy them to the output - this isn't something you could really have predicted just by looking at the architecture. The companies like Anthropic developing these have discovered a few such insights into how they are actually working, but not too many!
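To make "attention via key matching" a bit more concrete (this is just the generic scaled dot-product mechanism, not the specific coordinated-head circuits interpretability researchers have found): when a query lines up with one particular key, the softmax puts nearly all of its weight on that position, so the corresponding value is effectively copied to the output.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w, w @ V

rng = np.random.default_rng(0)
K = rng.normal(size=(4, 8))      # one key per context position
V = rng.normal(size=(4, 8))      # one value per context position

q = 20.0 * K[2:3]                # a query strongly aligned with the key at position 2
weights, out = attention(q, K, V)
print(weights.round(3))          # ~[0, 0, 1, 0]: nearly all weight on position 2
print(np.abs(out - V[2]).max())  # tiny: position 2's value is copied through to the output
```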
Yes, we don't really understand where emergent capabilities are coming from, at least not to extent of being able to predict them ahead of time ("if we feed it this amount of data, of this type, it'll learn to do X"). New emergent capabilities arise, from time to time, as models are scaled up, but no one can predict exactly what their next-gen model is going to be capable of.
>Yes, we don't really understand where emergent capabilities are coming from, at least not to extent of being able to predict them ahead of time ("if we feed it this amount of data, of this type, it'll learn to do X"). New emergent capabilities arise, from time to time, as models are scaled up, but no one can predict exactly what their next-gen model is going to be capable of.
While finite-precision, finite-width transformers aren't TC (Turing complete), I don't see why the same property as the Game of Life, where one cannot predict the end state from the starting state without running it, wouldn't hold.
As we know, transformers are at least as powerful as TC^0, which contains AC^0, which is as powerful as first-order logic; it is undecidable and thus may be similar to HALT, where we will never be able to accurately predict when emergence happens, so approximation may be the best we can do unless there are constraints, through something like the parallelism tradeoff, that allow for it.
If you consider PCP[O(log n),O(1)] = NP, or that only O(log n) bits are required for NP, the results of this paper seem more plausible.
I don't see that the difficulty of predicting/anticipating emergent capabilities is really related to undecidability, although there is perhaps a useful computer analogy... We could think of the trained LLM as a computer, and the prompt as the program, and certainly it would be difficult/impossible to predict the output without just running the program.
The problem with trying to anticipate the capabilities of a new model/training-set is that we don't even know what the new computer itself will be capable of, or how it will now interpret the program.
The way I'd tend to view it is that an existing trained model has some set of capabilities which reflect what can be done by combining the set of data-patterns/data-manipulations ("thought patterns" ?) that it has learnt. If we scale up the model and add more training data (perhaps some of a different type than has been used before), then there are two unknowns:
1) What new data-patterns/data-manipulations will it be able to learn ?
2) What new capabilities will become possible by using these new patterns/manipulations in combination with what it had before ?
Maybe it's a bit like having a construction set of various parts, and considering what new types of things could be built with it if we added some new parts (e.g. a beam, or gear, or wheel), except we are trying to predict this without even knowing what those new parts will be.
No - emergent properties are primarily a function of scaling up NN size and training data. I don't think they are much dependent on the training process.
Of course they are? If you train in a different order, start with different weights, or change the gradient delta amount, different things will emerge out of an otherwise identical NN.
You can see this in videos where people train a NN to do something multiple times and each time, the NN picks up on something slightly different. Slight variances in what is fed as inputs during training can actually cause high variation in what is picked up on.
I’m getting decently annoyed with HN’s constant pretending that this is all just “magic”.
You're talking about something a bit different - dependence on how the NN is initialized, etc. When people talk about "emergent properties" of LLMs, this is not what they are talking about - they are talking about specific capabilities that the net has that were not anticipated. For example, LLMs can translate between different languages, but were not trained to do this - this would be considered as an emergent property.
Nobody is saying this is magic - it's just something that is (with our current level of knowledge) impossible to predict will happen. If you scale a model up, and/or give it more training data, then it'll usually get better at what it could already do, but it may also develop some new (emergent) capabilities that no-one had anticipated.
Finding unexpected connections is something we’ve known LLMs are good at for ages. Connecting things you didn’t even know are connected is like “selling LLM business to business 101”. It’s the first line of a sales pitch dude.
And that’s still beside the point that the properties that emerge can greatly differ just by changing the ordering of your training.
Again, we see this on NNs training to play games. The strategies that emerge are completely unexpected, and when you train a NN multiple times, often differ greatly, or slightly.
It's curve fitting at its core. It just so happens that when you fit a function with enough parameters, you can start using much more abstract and higher-level methods to more quickly describe other outcomes and applications of this curve fitting. Really simple algebra just with a lot of variables. It's not a black box at all and it's disingenuous when people call it that. It outputs what it does because it's multiplying A and B and adding C to the input you gave it.
This is a misunderstanding. You're right that you can always find some weights that approximate any reasonable (that is, continuous) mathematical function. The problem is, you have to find those weights in a small amount of computation time, and you need to estimate them from only finitely many samples. In general, ***no neural network architecture can solve this problem efficiently in every single case***. See "No Free Lunch theorem". The fact that current architectures work well in practice, despite the NFL theorem - for seemingly difficult problems - is surprising, and demands an explanation.
Appendix: By "current architecture", I mean Transformers plus Stochastic Gradient Descent.
We don't understand how GenAI systems work, so we can't analytically predict their behavior; I think they're artifacts of synthetic biology, or chaos theory, not engineering or computer science. Giving GenAI goals and the ability to act outside of a sandbox is roughly parallel to releasing a colony of foreign predators in Australia, or creating an artificial pathogen that is so foreign the immune system doesn't have the tools to fight it.
That's why I consider understanding the internals of GenAI systems to be very important, independent of human psychology.
AI is not and really never has been computer science -- if you trust the pedigree of statistical learning as the appropriate one for modern AI, computers don't enter the picture til you want to approximate a solution numerically to an otherwise well-posed analytical question.
Interesting: Given an input x to a layer f(x)=Wσ(x), where σ is an activation function and W is a weight matrix, the authors define the layer's "neural feature matrix" (NFM) as WᵀW, and show that throughout training, it remains proportional to the average outer product of the layer's gradients, i.e., WᵀW ∝ mean(∇f(x)∇f(x)ᵀ), with the mean computed over all samples in the training data. The authors posit that layers learn to up-weight features strongly related to model output via the NFM.
The authors do interesting things with the NFM, including explaining why pruning should even be possible and why we see grokking during learning. They also train a kernel machine iteratively, at each step alternating between (1) fitting the model's kernel matrix to the data and (2) computing the average gradient outer product of the model and replacing the kernel matrix with it. The motivation is to induce the kernel machine to "learn to identify features." The approach seems to work well. The authors' kernel machine outperforms all previous approaches on a common tabular data benchmark.
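For anyone who wants the two quantities side by side, here's a rough sketch of how I read the definitions, in my own toy notation (a one-hidden-layer net; this is not the authors' code, and the gradient I take is of the network output with respect to the layer input, which is my reading of the ansatz):

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, n = 5, 16, 200
W1 = rng.normal(size=(h, d)) / np.sqrt(d)   # first-layer weights
w2 = rng.normal(size=h) / np.sqrt(h)        # output weights
X = rng.normal(size=(n, d))                 # training inputs

def f(x):
    """Scalar network output f(x) = w2 . relu(W1 x)."""
    return w2 @ np.maximum(W1 @ x, 0.0)

def grad_f(x):
    """Gradient of f with respect to the input x (chain rule by hand)."""
    z = W1 @ x
    return W1.T @ (w2 * (z > 0))            # W1^T diag(relu'(z)) w2

# Neural feature matrix of the first layer.
NFM = W1.T @ W1

# Average gradient outer product over the training inputs.
AGOP = np.mean([np.outer(grad_f(x), grad_f(x)) for x in X], axis=0)

# The claim, as I understand it, is that *after training* NFM stays roughly
# proportional to AGOP; at random initialization like this they needn't match.
# The point here is only to make the two quantities concrete.
print(NFM.shape, AGOP.shape)   # both (d, d)
```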
This NFM is a curious quantity. It has the flavor of a metric on the space of inputs to that layer. However, the fact that W’W remains proportional to DfDf’ seems to be an obvious consequence of the very form of f… since Df is itself Ds’WW’Ds, then this should be expected under some assumptions (perhaps mild) on the statistics of Ds, no?
It has more than just the flavor of a metric; it’s essentially a metric, because any “square” matrix (M’M) is positive semidefinite (since x’M’Mx=y’y=<y,y>, where y=Mx), and positive definite whenever M has full column rank. It could be interpreted as a metric which only cares about distances along dimensions that the NN cares about.
I agree that the proportionality seems a little obvious, I think what’s most interesting is what they do with the quantity, but I’m surprised no one else has tried this.
Yes of course any positive definite matrix can be used as a metric on the corresponding Euclidean space - but that doesn’t mean it’s necessarily useful as a metric. Hence I think it’s useful to distinguish things which could be a metric (in that a metric can be constructed from them), versus things which when applied as a metric actually provide some benefit.
In particular, if we believe the manifold hypothesis, then one should expect a useful metric on features to be local and not static - the quantity W’W clearly does not depend on the inputs to the layer at inference time, and so is static.
How are you defining the utility of a metric here? It’s not clear to me why a locally-varying metric would be necessarily more 'useful' than a global one in the context of the manifold hypothesis.
Moreover, if I’m understanding their argument right, then W’W is proportional to an average of the exterior derivative of the manifold representing the prediction surface of any given NN layer (averaging with respect to the measure defined by the data-generating process). While this averaging by definition leaves some of the local information on the cutting room floor, the result is going to be far more interpretable (because we've discarded all that distracting local data) and I would assume will still retain the large-scale structure of the underlying manifold (outside of some gross edge cases).
If one thinks of a metric as a “distance measure”, which is to say, how similar is some input x to the “feature” encoded by a layer f(x), and if this feature corresponds to some submanifold of the data, then naturally this manifold will have curvature and the distance measure will do better to account for this curvature. Then generally the metric (in this case, defining a connection on the data manifold) should encode this curvature and therefore is a local quantity. If one chooses a fixed metric, then implicitly the data manifold is being treated as a flat space - like a vector space - which generally it is not. My favorite example for this is the earth, a 2-sphere that is embedded in a higher dimensional space. The correct similarity measure between points is the length of a geodesic connecting those points. If instead one were to just take a flat map (choice of coordinates) and compare their Euclidean distance, it would only be a decent approximation of similarity if the points are already very close. This is like the flat earth fallacy.
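A quick numerical version of that flat-earth analogy, using the straight-line distance through the sphere as the stand-in for a flat metric (nothing here is specific to the paper, just an illustration of how a flat distance misjudges distances on a curved manifold):

```python
import numpy as np

def to_xyz(lat, lon):
    """Point on the unit sphere from latitude/longitude in radians."""
    return np.array([np.cos(lat) * np.cos(lon), np.cos(lat) * np.sin(lon), np.sin(lat)])

a = to_xyz(0.0, 0.0)
b = to_xyz(0.0, np.pi * 0.9)                      # two points far apart on the equator

chord = np.linalg.norm(a - b)                     # straight-line ("flat") distance
geodesic = np.arccos(np.clip(a @ b, -1.0, 1.0))   # great-circle distance along the surface

print(chord, geodesic)   # ~1.98 vs ~2.83: the flat distance badly understates the surface distance
```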
But this argument seems analogous to someone saying that average height is less “correct” than the full original data set because every individual’s height is different. In one sense it’s not wrong, but it kind of misses the point of averaging. The full local metric tensor defined on the manifold is going to have the same “complexity” as the manifold itself; it’s a bad way of summarizing a model because it’s not any simpler than the model. Their approach is to average that metric tensor over the region of the manifold swept out by the training data, and they show that this average empirically reflects something meaningful about the underlying response manifold in problems that we’re interested in. Whether or not this average quantity can entirely reproduce that original manifold is kind of irrelevant (and indeed undesirable), the point is that it (a) represents something meaningful about the model and (b) it’s low dimensional enough for a human to reason about. Although globally it will not be accurate to distances along the surface, presumably it is “good enough” to at least first order for much of the support of the training data.
Yes, that sounds right, but it doesn't make the work less worthwhile or less interesting.
"But these networks remain a black box whose inner workings engineers and scientists struggle to understand."
"We are trying to understand neural networks from first principles,""
There is a large contingent of CS people in HN that think that since we built AI, and can examine the code, the models, the weights, that this means we understand it.
Here’s a question I’ve been asking myself with the latest ML advancements: what is the difference between understanding and pretending to understand really, really well?
>what is the difference between understanding and pretending to understand really, really well?
Predicting the outcome of an experiment. A person/thing who understands something can predict outcomes to a degree. A person pretending to understand cannot predict with any degree of reliability.
LLMs are literally predicting the outcome of an experiment really well constantly, yet they are best described as pretending to understand really, really well ...
People who fake understanding by creating an ad hoc internal model to sound like an expert often get a lot of things right. You catch such people by asking more complex questions, and then you often get extremely alien responses that no sane person would say if they understood. And yes, LLMs give such responses all the time, every single one of them, none of them really understands much, they have memorized a ton of things and have some ad hoc flows to get through simple problems, but they don't seem to really understand much at all.
Humans who just try to sound like an expert also make similar alien mistakes as LLMs do, so I think since we say such humans don't learn to understand we can also say that such models don't learn to understand. You don't become an expert by trying to sound like an expert. These models are trained to sound like experts, so we should expect them to be more like such humans rather than the humans who become experts.
Hmmm, I don't agree. They do seem to understand some things.
I asked ChatGPT the other day how fast an object would be traveling if "dropped" from the Earth with no orbital velocity, by the time it reached the sun. It brought out the appropriate equations and discussed how to apply them.
(I didn't actually double-check the answer, but the math looked right to me.)
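For what it's worth, the classical answer is a one-line energy-conservation calculation (assuming a drop from rest at 1 AU straight down to the solar surface, and ignoring Earth's own gravity, radiation pressure, and everything else):

```python
import math

GM_sun = 1.327e20    # m^3/s^2, standard gravitational parameter of the Sun
r_start = 1.496e11   # m, 1 AU
r_end = 6.957e8      # m, solar radius

# Energy conservation: (1/2) v^2 = GM_sun * (1/r_end - 1/r_start)
v = math.sqrt(2 * GM_sun * (1 / r_end - 1 / r_start))
print(v / 1000)      # ~616 km/s by the time it reaches the Sun's surface
```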
It also seems to have a calculation or "analysis" function now, which gets activated when asking it specific mathematical questions like this. I've imagined it works by using the LLM to set up a formula, which is then evaluated in a classical way.
There are limits on what it can do, just like any human has similar limits. ChatGPT can answer more questions like this correctly than the average person could from off the street. That seems like understanding to me.
LLMs are the reason everyone is suddenly taking seriously the ideas that machines can be intelligent. Government officials, non-tech pundits, C-suite inhabitants, average people, everyone is now concerned about AI and its implications.
And I would say that's because of LLMs ability to predict the outcome of an experiment really well.
As for the "pretending", I think that comes from the fact LLMs are doing something quite different than humans to produce language. But that doesn't make them unintelligent. Just makes them not human.
> Government officials, non-tech pundits, C-suite inhabitants, average people, everyone is now concerned about AI and its implications.
All you need for that is for the AI to talk like experts, not for the AI to be experts. AI talking like experts without understanding much maps very well to what we see today.
Does Newton's Theory of Gravity actually understand gravity, or is it just a really good tool for 'predicting outcomes'? The understanding is baked into the formula and is a reflection of what Newton experienced in reality.
Newton's theory is ultimately and slightly wrong but still super useful, and many LLMs are basically like this. I can see why all this becomes confusing, but I think part of that is when we use anthropomorphic words to describe these things that are just math models.
Most of the "predictions" you'd make using it, when discovered, were wrong. it is a deeply unpredictive formula, being useless for predicting vast classes of problems (since we are largely ignorant about the masses involved, and cannot compute the dynamics beyond a few anyway).
Science is explanatory, not "predictive" -- this is an antique mistake.
As for 'math models': insofar as these are science, they aren't math. They use mathematical notation as a paraphrase for English, and the English words refer to the world.
F=GMM/r^2 is just a summary of “a force occurs in proportion to the product of the two masses and inversely in proportion to the square of their distance”
note: force, mass, distance, etc. <- terms which describe reality and its properties; not mathematics.
Regarding your last statement, what do you see as the distinction?
Take, for example, continuity. Students are first taught it means you can graph a function in a single stroke of the pen. Later comes epsilon and delta; this is more abstract, but at that stage the understanding is that "nearby values map to nearby values" (or some equivalent).
If the student dives in from there, they're taught the "existence" of real numbers (or any mathematical term) is rather a consequence of a system of symbols and relations that increasingly look nothing like "numbers", instead describing more of a process.
Later that "consequence" and "relation" themselves are formalities. "Pure" math occasionally delivers strange consequences in this sense. But it always boils down to a process that something or another must interpret and carry out.
So I wonder whether the edifice is meaningfully a thing in and of itself. Methods developed in ancient China and India etc. would have been useful to the Greeks and vice versa; however, all of them worked by means of the human brain. "Line" has a distinct meaning to us; the axioms of geometry don't create the line, they allow us to calculate some properties more efficiently. We always need to interpret the result in terms we understand, don't we?
Restating that: How can we train a Mapper AI instead of a Packer AI?
Once you've trained the model, it only has its context window to work with for long-term memory. Adding a memory prosthesis in the manner of MemGPT is likely to result in a superhuman-level Packer.
It's the deep drive for consistency in a knowledge base that mappers possess that would result in the most powerful AGI.
Not strictly speaking of ML, but you can pretend to understand by having a giant map of questions to answers, while being unable to infer an answer to a new question.
Understanding really well means getting to know what it isn't.
A "stochastic parrot" can emulate a topic really well, until you push the limits and it fails at corner cases. If a model "understands really well" then it knows exactly where the boundaries are.
Learning how deep networks learn features isn't obvious, and teasing out the details is valuable research.
Superhuman levels of feature recognition are at play during the training of LLMs, and those insights are compiled into the resulting weights in ways we have little visibility into.
Before I read this, is this yet another paper where physicists believe that modern NNs are trained by gradient descent and consist of only linear FNNs (or another simplification) and can therefore be explained by <insert random grad-level physics method here>? Because we already had hundreds of those.
FNNs are easy to analyze for sequential data because there is no relationship between the elements. Hence, these NNs are a frequent target of simplified analyses. However, they also never led to exciting models which we now call AI.
Instead, real models are an eclectic mix of attention or other sequential mixers, gates, FFNs, norms and positional tomfoolery.
In other words, everything that makes AI models great is what these analyses usually skip.
Of course, while wildly claiming generalized insights about how AI really works.
There’s a dozen papers like that every few months.
>> > believe that modern NNs are trained by gradient descent
>> Are they not?
While technically true, that answer offers almost zero insight into how they work. Maybe another way to say it is that during inference there is no gradient descent happening - the network is already trained. Ignoring that gradient descent might be an overgeneralization of the training process, it tells you nothing about how ChatGPT plays chess or carries a conversation.
Telling someone what methodology was used to create a thing says nothing about how it works. Just like saying our own brain is "a product of evolution" doesn't tell how it works. Nor does "you are a product of your own life experience" put psychologists out of business. "It's just gradient descent" is a great way to trivialize something that nobody really seems to understand yet.
Their proposal is actually that gradient descent is a bad representation of the NN's learning behavior, and that this NFM instead tracks the “learning” behavior of the model better than simply its loss function over training.
It isn't, but the kinds of papers referred to make so many assumptions that the conclusions lack external validity for actual deep learning. In other words, they typically don't actually say anything about state-of-the-art deep networks.
Does anyone else find reading sites like this on mobile unbearable? The number and size of ads is crazy, in terms of pixel space I think there’s more ads than content
Firefox Focus seems to block them all, I just use that as a content blocker in Safari and am rarely bothered by ads.
I really am not opposed to sites using ads to monetise and resisted ad blocking for many years, but advertisers took it way too far with both the number and intrusiveness of ads so I ended up relenting and installing an ad blocker.
If you're fast, you can sometimes even bypass some paywalled articles in reader mode, as some website pre-load their articles before they load a login/paywall screen on top of it.
The article is not so specific on how the research actually works. Then, "We hope this will help democratize AI," but the linked paper is behind a paywall.
Curious to see how they would go about explaining how a network selects its important features
It's the same way we do it, form a number of possible variants and use the ones that work best.
They have the advantage of millions or even billions of times more compute to throw at the learning process. Something that might be a one in a million insight happens consistently at that scale.
My 2c:
The phys.org summary isn't great. The authors are focused on a much narrower topic than simply "how NNs learn"; they're trying to characterize the mechanism by which deep NNs develop 'features'. They identify a quantity which is definable for each layer of the NN (the Gram matrix WᵀW of the layer's weight matrix), and posit that this quantity is proportional to the average outer product of the layer's gradients with respect to its inputs, where the average is taken over all training data. They (claim to, I haven't evaluated) prove this formally for the case of deep FNNs trained with gradient descent. They argue that, by treating this quantity as a measure of 'feature importance', it can be used to explain certain behaviors of NNs that are otherwise difficult to understand. Specifically they address:
* Simple and spurious features (they argue that their proposal can identify when this has occurred)
* "Lottery ticket" NNs (they argue that their proposal can help explain why pruning the connections of a fully connected NN improves its performance)
* Grokking (they argue that their proposal can help explain why NNs can exhibit sudden improvements in test performance, even after training accuracy has reached 100%)
Finally, they propose a heuristic ML algorithm which updates their proposed quantity directly during training, rather than the underlying weights, and show that this achieves superior performance to some existing alternatives.
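If it helps make that alternating procedure concrete, here's a rough sketch of how I understand the loop; this is my own simplification (a Gaussian kernel with kernel ridge regression, and a trace rescaling step), which may well differ from the authors' exact choices of kernel and update rule. The bandwidth, ridge, and number of rounds are arbitrary:

```python
import numpy as np

def mahalanobis_sq(X, Z, M):
    """Pairwise squared distances (x - z)^T M (x - z) for symmetric M."""
    XM, ZM = X @ M, Z @ M
    return (XM * X).sum(1)[:, None] + (ZM * Z).sum(1)[None, :] - 2 * XM @ Z.T

def fit_and_gradients(X, y, M, bw=2.0, ridge=1e-3):
    """Kernel ridge regression with a Gaussian kernel shaped by M; returns the dual
    coefficients and the gradient of the fitted predictor at every training point."""
    K = np.exp(-mahalanobis_sq(X, X, M) / (2 * bw**2))
    alpha = np.linalg.solve(K + ridge * np.eye(len(X)), y)
    # grad f(x_i) = sum_j alpha_j K_ij * (-M (x_i - x_j) / bw^2)
    grads = np.stack([
        -(M @ ((X[i] - X).T * (alpha * K[i])).sum(axis=1)) / bw**2
        for i in range(len(X))
    ])
    return alpha, grads

def recursive_feature_machine(X, y, rounds=5):
    d = X.shape[1]
    M = np.eye(d)                                   # start from an "uninformed" feature matrix
    for _ in range(rounds):
        alpha, grads = fit_and_gradients(X, y, M)   # (1) fit the kernel machine
        agop = grads.T @ grads / len(X)             # (2) average gradient outer product
        M = agop * (d / np.trace(agop))             # replace the feature matrix (rescaled; my choice)
    return M, alpha

# Toy data where only the first coordinate matters; M should concentrate on it.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sin(2 * X[:, 0])
M, _ = recursive_feature_machine(X, y)
print(np.diag(M).round(3))   # the weight on feature 0 should dominate the others
```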
Overall I would say that they have defined a nice tool for measuring NN feature importance, in terms of features that the NN itself is defining (not in terms of the original space). I can definitely see why this has a lot of value, and I'm especially intrigued by their comparisons of the NFM to testing performance in their 'grokking' case study.
With that said, I'm not really active in the NN space, so it seems a little surprising that their result is really that novel. The quantity they define (the Gram matrix of the weight matrix) seems fairly intuitive as a way to rank the importance of inputs at any given layer of the NN, so I'm wondering if truly nobody else has ever done this before? Possibly the novelty is in their derivation of the proportionality, or in the analysis of this quantity over training iterations. I'd guess that their model proposal is totally new, and I'm curious to try it out on some test cases; it seems promising for cases where lightweight models are required. It also seems interesting to point out how both training performance AND the development of feature importance jointly influence testing accuracy, but again I'm surprised that this is really novel. I also have to wonder how this extends to more complicated architectures with, e.g., recurrent elements; it's not discussed anywhere, but seems like it would be an important extension of this framework given where genAI is currently at (although the first draft was pub'd in '22, so it's possible that this just wasn't as pressing when it was being written).