This only applies to intra-distribution "generalisation", which is not the meani...

jebarker · 2024-11-22T17:20:18 1732296018

I don't really follow what you're saying here. I understand that the use of language in the real world is not sampled from a stationary distribution, but it also seems plausible that you could relax that assumption in an LLM, e.g. by conditioning the distribution on time. Intra-distribution generalization would then still be a sensible way to study how well the LLM works on held-out test samples.

Intra-distribution generalization seems like the only rigorously defined kind of generalization we have. Can you provide any references that describe this other kind of generalization? I'd love to learn more.

ericjang · 2024-11-22T17:33:29 1732296809

Intra-distribution generalization is also not well posed in practical real-world settings. Suppose you learn a mapping f : x -> y. Casually, intra-distribution generalization implies that f generalizes for "points from the same data distribution p(x)". Two issues here:

1. In practical scenarios, how do you know if x' is really drawn from p(x)? Even if you could compute log p(x') under the true data distribution, you can only verify that x' lies in the support of p(x), i.e. that its density is non-zero. One sample is not enough to tell you whether x' was drawn from p(x).

2. In high-dimensional settings, an x' that is not exactly equal to an example within the training set can have arbitrarily high generalization error. Here's a criminally under-cited paper discussing this: https://arxiv.org/abs/1801.02774
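To make point 1 concrete, here is a minimal sketch under toy assumptions: p and q are hypothetical 1-D Gaussians that I made up for illustration, not anything from a real dataset. A single point has a perfectly finite log density under both, and its log-likelihood ratio is close to zero, so it says almost nothing about which distribution produced it. Only a large batch of samples separates the two.

```python
import math
import random

def gaussian_log_pdf(x, mean, std):
    """Log density of a 1-D Gaussian N(mean, std^2) at x."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2.0 * math.pi))

# Hypothetical "true" data distribution p and a slightly shifted alternative q.
P_MEAN, Q_MEAN, STD = 0.0, 0.5, 1.0

def log_likelihood_ratio(samples):
    """Sum of log p(x) - log q(x) over the samples; positive favours p."""
    return sum(
        gaussian_log_pdf(x, P_MEAN, STD) - gaussian_log_pdf(x, Q_MEAN, STD)
        for x in samples
    )

# One new point: in the support of both p and q, with nearly equal density.
x_new = 0.3
log_p = gaussian_log_pdf(x_new, P_MEAN, STD)
log_q = gaussian_log_pdf(x_new, Q_MEAN, STD)
single_llr = log_likelihood_ratio([x_new])

# Many points that really are drawn from p: the evidence accumulates.
rng = random.Random(0)
batch = [rng.gauss(P_MEAN, STD) for _ in range(1000)]
batch_llr = log_likelihood_ratio(batch)

print("log p(x'):", log_p, "log q(x'):", log_q)
print("log-likelihood ratio, 1 sample:", single_llr)
print("log-likelihood ratio, 1000 samples:", batch_llr)
```

In practice you have neither the true p nor a named alternative q, which only makes the single-sample question harder.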

mjburgess · 2024-11-22T23:20:51 1732317651

Worse even than this: there are no distributions.

What we mean by x ~ p(x), y ~ p(y|x) is not x -> y s.t. y = f(x)

Reality itself has no probability distributions. Reality follows a causal model, where a causal relation is given in terms of necessity and possibility.

E.g., there is no such thing as Photo ~ P(Photo|PhotoOfCat) to be learned, only (All Causes) -> PhotoOfCat. Thus the setup of ML as y = f(x) is incorrect: there is no `f` which satisfies this formula (in almost all cases).

Consider the LLM case: reality has no P("The War in Ukraine"| TheWarIn2022) -- either the speaker meant TheWarIn2022, or they didn't. There's no sense in which reality has it that the utterance is intrinsically ambiguous (necessarily, for communication to be possible, pragmatics+semantics have to be able to fully resolve meaning).

So what are LLMs learning? Just an implied empirical distribution which is "smoothed over" the data just enough that it "hangs on to it, without repeating it". This is vital: if it were to try to generalise in the scientific sense, it would cease to be meaningful, since no algorithm which computes P(y|x) in this manner could capture the necessary relata that fully resolve meaning. Any system capable of modelling meaning would be probabilistic only in the sense of having a prior over such causal models: P("TheWarInUkraine"|TheWarIn2022, CausalModel) = 1, but P(CausalModel) < 1

So it's always undefined what it means to "generalise" with respect to an empirical distribution -- there aren't any.

When we say scientific theories generalise, we mean their posited necessary causal relations are maintained across irrelevant interventions. E.g., Newton's theory of gravity generalises in that each term (F, M, m, r) is a valid measure of some property, and it remains a valid measure across a very large number of environments.

It fails to generalise for extreme values of M, m, etc.

In the ML sense, all intra-distributional generalisation fails for trivial perturbations of any causal property, e.g., m+dm -- because this induces an entirely new distribution. The "generalisation error" depends on what m+dm does within our model, but regardless, generalisation fails.

Scientific theories do not fail to generalise in this way: irrelevant causal interventions make no difference to the explanatory adequacy (or predictive power) of the theory.
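One way to read the m+dm point, as a toy sketch rather than a claim about any real ML system: fit a curve to (r, F) observations collected while M and m are held fixed, then intervene on m. The least-squares fit on the feature 1/r^2 is my own stand-in for "the learned model"; the masses, radii, and function names are all hypothetical. The fit is essentially exact on its own distribution, but it has no term for m, so it keeps predicting the old force after the intervention, while Newton's formula absorbs the change.

```python
G = 6.674e-11  # gravitational constant in SI units

def newton_force(M, m, r):
    """Newton's law of gravitation: F = G * M * m / r^2."""
    return G * M * m / (r * r)

def fit_line(xs, ys):
    """Ordinary least-squares fit of y ~ a * x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x

# Hypothetical training environment: M and m never vary, only r does.
M, m = 5.972e24, 1000.0
radii = [7.0e6 + 1.0e5 * i for i in range(11)]
features = [1.0 / (r * r) for r in radii]
forces = [newton_force(M, m, r) for r in radii]

# The "learned model" only ever sees F as a function of 1/r^2.
a, b = fit_line(features, forces)

def empirical_force(r):
    return a * (1.0 / (r * r)) + b

# In-distribution: the fit reproduces the training forces.
in_dist_error = max(
    abs(empirical_force(r) - f) / f for r, f in zip(radii, forces)
)

# Intervention m -> m + dm: a new distribution the fit knows nothing about.
dm = 100.0
r_test = radii[5]
true_force = newton_force(M, m + dm, r_test)
shifted_error = abs(empirical_force(r_test) - true_force) / true_force

print("max relative error, same distribution:", in_dist_error)
print("relative error after m -> m + dm:", shifted_error)
```

The size of the error after the intervention is just dm / (m + dm) here, which matches the remark that the error depends on what m+dm does within the model.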

jebarker · 2024-11-23T00:47:37 1732322857

Thanks for the clarification. I understand much better what you mean by "scientific generalization". I can't tell whether you're suggesting that LLMs are a dead end for modeling meaning, or just that viewing LLMs as estimating probability distributions is the wrong way to think about them?

mjburgess · 2024-11-23T10:13:54 1732356834

LLMs fail to model meaning, but in doing so they model empirical distributions of meaningful tokens, which is more useful given the method being used.

If you were only modelling conditional probability, trying to model meaning this way would make your solution worse.

I.e., if LLMs really generalised in the ML sense, that is, sampled randomly and without bias from some hypothetical "Meaning Distribution", they'd perform terribly -- since there is no such distribution to sample from.

By hijacking an empirical distribution, and "replaying it back", it's actually possible to generate useful output.

Think about it this way: probability distributions are just measures of subjective confidence, and each person has their own subjective confidence distribution P("some written words"|WhatTheyMean). If you could actually model this -- which one would you model? If you modelled any one of them, you'd not be able to understand a great deal, since each person's confidence is poorly calibrated and missing meanings (e.g., "acetylcholine").

So the LLM models some half-baked average of the subjective distributions of all speakers on the internet (or rather, in the training data) with respect to next word expectations.
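A rough sketch of the "average of speakers" idea, using a count-based next-word model as a deliberately crude stand-in for an LLM (the speakers, words, and the context "pass the" are hypothetical toy data). Fitting next-word frequencies on the pooled text gives exactly the token-count-weighted average of the per-speaker distributions, and that average need not match what any individual speaker would say.

```python
from collections import Counter

# Hypothetical toy data: words each speaker produced after the context "pass the".
speaker_next_words = {
    "alice": ["pen", "pen", "pen", "salt"],
    "bob": ["salt", "ball"],
}

def empirical_distribution(words):
    """Maximum-likelihood (relative frequency) distribution over the words."""
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Each speaker's own next-word distribution.
per_speaker = {
    name: empirical_distribution(words)
    for name, words in speaker_next_words.items()
}

# What a model fit on the pooled training text would estimate.
pooled_words = [w for words in speaker_next_words.values() for w in words]
pooled = empirical_distribution(pooled_words)

# The same thing, built as a weighted average of the per-speaker distributions,
# with each speaker weighted by how much text they contributed.
averaged = {}
for name, words in speaker_next_words.items():
    weight = len(words) / len(pooled_words)
    for word, prob in per_speaker[name].items():
        averaged[word] = averaged.get(word, 0.0) + weight * prob

print("per speaker:", per_speaker)
print("pooled:", pooled)
print("weighted average:", averaged)
```

Note that the weighting is by volume of text, so prolific speakers dominate the average.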

This is not what we're modelling when we mean things (e.g., when I say, "pass the pen", the cause of my saying it is: 1) need for a pen; 2) you having a pen; etc. -- these reasons are unavailable to the LLM, so it cannot model meaning). But as stated, it would be useless if it actually tried to -- because these methods are incapable of saying, "pass me a pen" and meaning it.