Thanks for the clarification. I understand much better what you mean by "scientific generalization". I can't tell whether you're suggesting that LLMs are a dead end for modeling meaning, or just that describing LLMs as estimating probability distributions is the wrong way to think about them?
LLMs fail to model meaning, but in failing, they model empirical distributions of meaningful tokens, which is more useful given the method being used.
If you are only modelling conditional probability, trying to model meaning that way would make your solution worse.
That is, if LLMs really generalised in the ML sense, i.e., drew unbiased random samples from some hypothetical "Meaning Distribution", they'd perform terribly -- since there is no such distribution to sample from.
By hijacking an empirical distribution and "replaying it back", it's actually possible to generate useful output.
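To make "replaying an empirical distribution" concrete, here's a toy sketch. It is a bigram counter, not how an actual LLM is built; the corpus and the names `next_token_distribution` and `replay` are made up for illustration. It only counts which token followed which in the text it saw, then samples from those counts.

```python
import random
from collections import Counter, defaultdict

# Hypothetical toy corpus; a real model would see vastly more text.
corpus = "pass me the pen . pass me the salt . hand me the pen .".split()

# Count how often each token follows each other token.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_token_distribution(prev):
    """Empirical P(next | prev): relative frequencies seen in the corpus."""
    total = sum(counts[prev].values())
    return {tok: n / total for tok, n in counts[prev].items()}

def replay(start, length, seed=0):
    """Generate text by repeatedly sampling from the empirical distribution."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length):
        dist = next_token_distribution(out[-1])
        if not dist:  # token never had a successor in the corpus
            break
        tokens, probs = zip(*dist.items())
        out.append(rng.choices(tokens, weights=probs)[0])
    return " ".join(out)

print(next_token_distribution("the"))  # "pen" about 0.67, "salt" about 0.33
print(replay("pass", 5))
```

Nothing in there knows why anyone wanted a pen. The output is plausible only because the counts came from people who did.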
Think about it this way: probability distributions are just measures of subjective confidence, and each person has their own subjective confidence distribution P("some written words"|WhatTheyMean). If you could actually model this -- which one would you model? If you modelled any single one of them, you wouldn't be able to understand a great deal, since each person's confidence is poorly calibrated and is missing meanings (e.g., "acetylcholine").
So the LLM models some half-baked average of the subjective distributions of all speakers on the internet (or rather, in the training data) with respect to next-word expectations.
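A rough sketch of what that "average" amounts to, under a big simplifying assumption: each speaker is reduced to a bag of words they wrote after one fixed context (say, "pass the"). The speakers and words are invented for illustration. Pooling everyone's text and counting gives exactly a weighted average of the per-speaker distributions, weighted by how much each speaker wrote.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical data: words each speaker wrote after the same context.
continuations = {
    "chemist": ["acetylcholine", "acetylcholine", "pen"],
    "clerk": ["pen", "pen", "pen", "stapler"],
}

def distribution(words):
    """Relative frequency of each word, as exact fractions."""
    counts = Counter(words)
    return {w: Fraction(n, len(words)) for w, n in counts.items()}

# Each speaker's own distribution (the clerk has no "acetylcholine" at all).
per_speaker = {name: distribution(ws) for name, ws in continuations.items()}

# What a model trained on the pooled text would estimate.
all_words = [w for ws in continuations.values() for w in ws]
pooled = distribution(all_words)

# The same thing, written as a data-weighted average of the speakers.
weights = {name: Fraction(len(ws), len(all_words))
           for name, ws in continuations.items()}
mixture = {}
for name, dist in per_speaker.items():
    for word, p in dist.items():
        mixture[word] = mixture.get(word, 0) + weights[name] * p

assert mixture == pooled
print(pooled)
```

The pooled distribution covers words that individual speakers lack, which is the upside of averaging. But it belongs to no one in particular, and whoever wrote the most gets the most weight.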
This is not what we're modelling when we mean things (e.g., when I say, "pass the pen", the cause of my saying it is: 1) my need for a pen; 2) you having a pen; etc. -- these reasons are unavailable to the LLM, so it cannot model meaning). But as stated, it would be useless if it actually tried to -- because these methods are incapable of saying, "pass me a pen" and meaning it.