I think this submission paper is talking about reinforcement learning as part of/after the main training, then the model does inference as normal.
They might have done that for O1, but the bigger change is the "runtime train of thought" that once the model received the prompt and before giving a definitive answer, it "thinks" with words and readjusts at runtime.
At least that's my understanding from these two approaches, and if that's true, then it's not similar.
AFAIK, OpenAI has been doing reinforcement learning since the first version of ChatGPT and for all later models; that's why you can leave feedback in the UI in the first place.
OpenAI stated [1] that one of the breakthroughs needed for o1's train of thought to work was reinforcement learning to teach it to recover from faulty reasoning.
> Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses. It learns to recognize and correct its mistakes. It learns to break down tricky steps into simpler ones. It learns to try a different approach when the current one isn’t working.
That's incredibly similar to this paper, which discusses the difficulty of finding a training method that guides the model to learn a self-correcting technique (in which subsequent attempts learn from and improve on previous attempts), instead of just "collapsing" into a mode of trying to get the answer right on the very first try.
They are indeed similar and OpenAI did indeed use RL at training time in a way that has not been done before, as does this approach. Yes both also involve some additional inference-time generation, but the problem is that (at least as of now) you can't get standard LLMs to actually do well with extra inference-time generation unless you have a training process that uses RL to teach them to do so effectively. I'm working on a blog post to explain more about this aimed at HN-level audiences. Stay tuned!
Both models generate an answer after multiple turns, where each turn has access to the outputs from a previous turn. Both refer to the chain of outputs as a trace.
Since OpenAI did not specify what exactly is in their reasoning trace, it's not clear what if any difference there is between the approaches. They could be vastly different, or they could be slight variations of each other. Without details from OpenAI, it's not currently possible to tell.
One is talking about an improvement made by changing control flow during inference (no weight updates).
The other is talking about using reinforcement learning to do weight updates during training to promote a particular type of response.
OpenAI had previously used reinforcement learning with human feedback (RLHF), which essentially relies on manual human scoring as its reward function, which is inherently slow and limited.
o1 and this paper talk about using techniques to create a useful reward function to use in RL that doesn't rely on human feedback.
> I think this submission paper is talking about reinforcement learning as part of/after the main training
Reinforcement learning to promote a particular type of self-correction response
> They might have done that for O1, but the bigger change is the "runtime train of thought" that once the model received the prompt and before giving a definitive answer,
Also reinforcement learning to promote certain reasoning trace
> o1 and this paper talk about using techniques to create a useful reward function to use in RL that doesn't rely on human feedback.
I take this to mean during weight updates, e.g. training.
> "runtime train of thought"
I take runtime here to mean inference, not during RL. What does runtime mean to you?
Previous approaches [0] successfully used inference time chain of thought to improve model responses. That has nothing to do with RL though.
The grandparent is wrong about the paper. They are doing chain of thought responses during training and doing RL on that to update the weights, not just during inference/runtime.
I found the paper a tad difficult to understand because it spends a lot of time circling around the main thesis instead of directly describing. So, to the best of my understanding:
We want to improve LLMs' ability to give correct answers to hard problems. One theory is that we can do that by training a "Self Correcting" behavior into the models, where they can take a wrong answer as input and improve it to a better/correct answer.
This has been explored previously, trying to train this behavior using various Reinforcement techniques where the reward is based on how good the "corrected" answer is. So far it hasn't worked well, and the trained behavior doesn't generalize well.
The thesis of the paper is that this is because when the model is presented with a training example of `Answer 1, Reasoning, Corrected Answer`, and a signal of "Make Corrected Answer Better" it actually has _two_ perfectly viable ways to do that. One is to improve `Reasoning, Corrected Answer`, which would yield a higher reward and is what we want. The other, just as valid solution, is to simply improve `Answer 1` and have `Corrected Answer` = `Answer 1`.
The latter is what existing research has shown happens, and why attempts so far to train the desired behavior have failed. The models just try to improve their answers, not their correcting behaviors. This paper's solution is to change the training regimen slightly to encourage the model to use the former approach. And thus, hopefully, get the model to actually learn the desired behavior of correcting previous answers.
This is done by doing two stages of training. In the first stage, the model is forced (by KL divergence loss) to keep its first answers the same, while being rewarded for improving the second answer. This helps keep the model's distribution of initial answers the same, avoiding the issue later where the model doesn't see as many "wrong" answers because wrong answers were trained out of the model. But it helps initialize the "self correcting" behavior into the model.
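To make the first stage a bit more concrete, here's a minimal sketch, assuming a toy setting where the model's first answer is a small discrete distribution over candidate answers. The names `kl_divergence` and `stage_one_objective`, and the weight `beta`, are made up for illustration and are not from the paper:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def stage_one_objective(second_attempt_reward, first_dist, base_first_dist, beta=1.0):
    """Reward for the second answer, minus a penalty for changing the first answer.

    first_dist is the current model's distribution over first answers,
    base_first_dist is the original model's. beta is a hypothetical weight.
    """
    penalty = kl_divergence(first_dist, base_first_dist)
    return second_attempt_reward - beta * penalty

base = [0.5, 0.3, 0.2]
print(stage_one_objective(1.0, base, base))                # first answers unchanged: no penalty
print(stage_one_objective(1.0, [0.9, 0.05, 0.05], base))   # first answers drifted: penalized
```

The point is only that drifting on the first answer costs something, so the cheapest way to earn reward is to get better at the correction step.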
In the second stage the model is free to change the first answer, but they tweak the reward function to give higher rewards for "flips" (where answer 1 was bad, but answer 2 was good). So in this second stage it can use both strategies, improving its first answer or improving its self correcting, but it gets more rewards for the latter behavior. This seems to be a kind of refinement on the model, to improve things overall, while still keeping the self correcting behavior intact.
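One way to picture a reward that favors "flips" is a bonus proportional to how much the second answer improved on the first. This is a sketch under my own assumptions (binary correctness, a made-up bonus weight `alpha`), not necessarily the paper's exact formula:

```python
def shaped_reward(first_correct, second_correct, alpha=2.0):
    """Second-answer reward plus a bonus for improving on the first answer.

    alpha is a hypothetical bonus weight; correctness is treated as 0 or 1.
    """
    r1 = 1.0 if first_correct else 0.0
    r2 = 1.0 if second_correct else 0.0
    return r2 + alpha * (r2 - r1)

for first, second in [(False, True), (True, True), (False, False), (True, False)]:
    print(first, second, shaped_reward(first, second))
```

With `alpha=2.0`, wrong-then-right scores 3.0, right-then-right scores 1.0, wrong-then-wrong scores 0.0, and right-then-wrong scores -2.0, so breaking a correct answer is punished as strongly as fixing a wrong one is rewarded.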
Anyway, blah blah blah, metrics showing the technique working better and generalizing better.
Seems reasonable to me. I'd be a bit worried about the model, in Stage 2, learning to write _worse_ answers for Answer 1 so it can maximize the reward for flipping answers. So you'd need some kind of balancing to ensure Answer 1 doesn't get worse. Not sure if that's in their reward function or not, or if it's even a valid concern in practice.
Circling around the idea in a response describes what I see in a lot of LLM output quite well. I haven't tried o1 myself, but it does seem to fix that problem.
LLMs have no direct recollection of the qualia of their own training. This is at least a major way that I self-correct myself: if I'm about to talk about something I know, I'll try and figure out how/why I know that thing and in so doing, try to gauge whether I actually know that thing, if I'm hallucinating, or if I actually heard it from a less than reliable source etc.
I don't think LLMs can self-correct without remembering their own training in some way.
So you’re saying the solution is to prefix each training batch with a description of a sensory experience (You read the following in a paris cafe in 1997. While you read, you have an excellent baguette and some boiled eggs, and over-roasted coffee. The woman one table over is wearing a beautiful blue hat) and then post-train the final model into recalling the setting where it read any piece of text, or failing to recall any experience when presented with text it didn’t read?
(If someone tries this and it works, I’m quitting my phd and going back to camp counseling)
I don't think that's what they're saying at all. They're talking not about qualia in the human sense, but specifically about "the qualia of their own training". That is, the corpus that LLMs "learn" from and the "experiences" of those texts that are generalized during the training process. Both the raw data and the memory of "learning" is discarded.
So if one were to improve an LLM along those lines, I believe it would be something like: 1) LLM is asked a question. 2) LLM comes up with an initial response. 3) LLM retrieves the related "learning" history behind that answer and related portions of the corpus. 4) LLM compares the initial answer with the richer set of information, looking for conflicts between the initial answer and the broader set, or "learning" choices that may be false. 5) LLM generates a better answer and gives it. 6) LLM incorporates this new "learning".
And that strikes me as a pretty reasonable long-term approach, if not one that fits within the constraints of the current gold rush.
Sort of like this? It does help: Source-Aware Training Enables Knowledge Attribution in Language Models (https://arxiv.org/abs/2404.01019)
From the abstract:
> ... To give LLMs such ability, we explore source-aware training -- a recipe that involves (i) training the LLM to associate unique source document identifiers with the knowledge in each document, followed by (ii) an instruction-tuning stage to teach the LLM to cite a supporting pretraining source when prompted.
I think you're overweighting the value of that in day-to-day use. As folks accumulate knowledge, a common pattern (especially for things not embedded in a framework - trivia-like data) is an "I have no idea why I'd know this, but the answer is X".
But even if it's embedded in a framework, say CS, the qualia fade in the background as time passes. E.g. like everybody in CS, I'm pretty much able to quote O() performance characteristics of a sizeable number of algorithms off the bat. If you ask me where I learned it, for that specific algorithm - that's long receded into the past.
When humans self-correct, the normal process isn't "gauging whether you know the thing" or the even more impressive feat of calling up if you heard it from a "less than reliable source". There's a fuzzy sense of "I don't fully understand it", and self-correction means re-verifying the info from a trusted source.
So, no, I don't think the qualia matter for recall as much as you think.
According to which philologist? In short: they are both weak terms, 'hallucination' and 'confabulation', and we are using them in this context very loosely (and it should be in the open).
About the terms themselves, "confabulate" means "exchanging stories", while "hallucinate" is less clear but probably means "to err". In psychiatry, "hallucinate" was apparently introduced by Esquirol and "confabulate" by Wernicke and Bonhoeffer; neither concept seems to be akin to the substance of the phenomenon of "stochastic parrots bullshitting an unchecked narrative through formal plausibility".
Let's ask chatgpt free (coz I'm cheap) what it not-thinks about these, and ask for more statistically generated bullshit:
-- 8< --
These are some creative and fun suggestions! Here's my take:
Digimagination: This one has a sleek, almost sci-fi ring to it. It captures the digital, machine-driven nature of language model "hallucination," but presents it in a positive, imaginative light. This could highlight the creative aspect of AI responses, even when they aren't grounded in fact.
Stochafubalation: Combining "stochastic" (random probability) with "confabulation" is clever. It emphasizes the statistical nature of language models making up facts, though it could be a bit of a mouthful.
Statillucination: I like how this merges "statistical" and "hallucination." It feels technical and would fit right into the AI research community's vocabulary.
Matmulshitting: This one is tongue-in-cheek and hilarious! It references matrix multiplication (which drives many neural network operations) with a humorous, blunt twist. Probably not for formal use, but great for informal contexts!
Here are a few more neologisms I can propose:
Probafabrication: A combination of "probability" and "fabrication," highlighting the statistical underpinnings of how LLMs generate falsehoods.
Algoillusion: A fusion of "algorithm" and "illusion," suggesting the misleading or illusionary nature of generated information.
Synthifalse: A contraction of "synthetic" and "false," pointing to AI-generated responses that are not grounded in reality.
Pseudoformulation: Emphasizing that the response is a pseudo (false) formulation rather than a real answer.
Modelucination: A blend of "model" and "hallucination," capturing the fact that the machine learning model is generating false or misleading data.
My judgement of yours: I love 'stochfabulating' and 'matmulshitting', but I condemn 'digimagining'. Digimagining is doable in proper terms (no pun intended).
"Pretending" is too human to my taste: it assumes the thing doing the pretend thing knows about the real thing. It's something kids do during play. I'm too afraid of the consequences to admit LLMs are anywhere near this situation wink
Spoiler: You're never going to get rid of hallucinations in the autoregressive, next token prediction paradigm (aka LeCun's Law).
The issue here is people trying to use language models as deterministic problem solvers, rather than for what they actually excel at (semi-creative text generation).
Is LeCun's Law even a thing? Searching for it doesn't yield many results, except for an HN comment where it has a different definition. I guess it could be from some obscure paper, but given how poorly it's documented it seems weird to bring it up in this context.
* Probability e that any produced token takes us outside the set of correct answers
* Probability that answer of length n is correct
* P(correct) = (1-e)^n
* This diverges exponentially
* It's not fixable (without a major redesign)
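To get a feel for the numbers, here's the formula above evaluated directly. It bakes in the argument's own assumption that e is constant and independent across tokens; the value e = 0.01 is just an illustrative choice:

```python
def p_correct(e, n):
    """Probability that an n-token answer is correct if each token
    independently goes wrong with probability e."""
    return (1.0 - e) ** n

for n in (10, 100, 1000):
    print(n, p_correct(0.01, n))
```

P(correct) shrinks toward zero as n grows, which is the sense in which the argument says things get exponentially worse with length.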
Doesn't that argument make the fundamentally incorrect assumption that the space of produced output sequences has pockets where all output sequences with a certain prefix are incorrect?
Design your output space in such way that every prefix has a correct completion and this simplistic argument no longer applies. Humans do this in practice by saying "hold on, I was wrong, here's what's right".
Of course, there's still a question of whether you can get the probability mass of correct outputs large enough.
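A toy way to see the difference: treat generation as a two-state chain (on track / off track) where an off-track sequence can recover. The recovery probability `r` is my own illustrative parameter, and this is a sketch of the idea, not a model of any real LLM:

```python
def p_on_track(e, r, n):
    """Probability of being on a correct prefix after n tokens.

    e: chance that a token takes an on-track sequence off track.
    r: chance that a token brings an off-track sequence back
       (the "hold on, I was wrong" move). Both are assumed constant.
    """
    p = 1.0  # start on track
    for _ in range(n):
        p = p * (1.0 - e) + (1.0 - p) * r
    return p

print(p_on_track(0.01, 0.0, 1000))   # no recovery: same as (1 - e)^n
print(p_on_track(0.01, 0.05, 1000))  # with recovery: settles near r / (e + r)
```

With r = 0 this reduces to the (1-e)^n argument; with any r > 0 the probability levels off at r / (e + r) instead of decaying to zero.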
Wouldn't this apply to all prediction machines that make errors?
Humans make bad predictions all the time but we still seem to manage to do some cool stuff here and there.
Part of an agent's architecture will be for it to minimize e and then ground the prediction loop against a reality check.
Making LLMs bigger gets you a lower e with scale of data and compute, but you will still need it to check against reality. Test-time compute will also play a role, as it can run through multiple scenarios and "search" for an answer.
The difference between LLMs and other kinds of predictive models, or humans, is that those kinds of systems do not produce their output one token at a time, but all in one go, so their error basically stays constant. LeCun's argument is that LLM error increases with every cycle of appending a token to the last cycle's output. That's very specific to LLMs (or, well, to LLM-based chatbots to be more precise).
>> part of an agents architecture will be for it to minimize e and then ground the prediction loop against a reality check.
The problem is that web-scale LLMs can only realistically be trained to maximise the probability of the next token in a sequence, but not the factuality, correctness, truthfulness, etc. of the entire sequence. That's because web-scale data is not annotated with such properties. So they can't do a "reality check" because they don't know what "reality" is, only what text looks like.
The paper above uses an "oracle" instead, meaning they have a labelled dataset of correct answers. They can only train their RL approach because they have this source of truth. This kind of approach just doesn't scale as well as predicting the next token. It's really a supervised learning approach hiding behind RL.
"The difference between LLMs and other kinds of predictive models, or humans, is that those kinds of systems do not produce their output one token at a time, but all in one go, so their error basically stays constant." -- This is a big, unproven assumption. Any non-autoregressive model can be trivially converted to an autoregressive model by: (i) generating a full output sequence, (ii) removing all tokens except the first one, (iii) generating a full-1 output sequence conditioned on the first token. This wraps the non-autoregressive model in an "MPC loop", thereby converting it to an autoregressive model where per-token error is no greater than that of the wrapped non-AR model. The explicit MPC planning behavior might reduce error per token compared to current naive applications of AR transformers, but the MPC-wrappped model is still an AR model, so the problem is not AR per se.
LeCun's argument has some decent points, eg, allocating compute per token based solely on location within the sequence (due to increasing cost of attention ops for later locations) is indeed silly. However, the points about AR being unavoidably flawed due to exponential divergence from the true manifold are wrong and lazy. They're not wrong because AR models don't diverge, they're wrong because this sort of divergence is also present in other models.
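For concreteness, the MPC-loop construction described above might look something like this. `generate_full` stands in for any whole-sequence (non-AR) generator, and `toy_generator` is a made-up placeholder just to make the sketch runnable:

```python
from typing import Callable, List

def mpc_autoregressive(generate_full: Callable[[List[str], int], List[str]],
                       length: int) -> List[str]:
    """Turn a whole-sequence generator into a token-at-a-time one.

    generate_full(prefix, k) is assumed to return k tokens continuing prefix.
    """
    output: List[str] = []
    for step in range(length):
        plan = generate_full(output, length - step)  # plan the whole remainder
        output.append(plan[0])                       # commit only the first token
    return output

def toy_generator(prefix: List[str], k: int) -> List[str]:
    """Hypothetical non-AR model: always continues a fixed target sequence."""
    target = ["a", "b", "c", "d"]
    return target[len(prefix):len(prefix) + k]

print(mpc_autoregressive(toy_generator, 4))
```

The wrapper is autoregressive from the outside (one token per step, conditioned on the previous ones), but each token comes from a full plan, which is why it is no worse per token than the wrapped model. It does call the full generator once per token, so it costs more compute.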
The loop itself is claimed to be the problem. It doesn't matter whether you use an AR or non-AR model. They both have a certain error probability that gets amplified in each iteration.
The per token error of the non-AR model wrapped with MPC is no higher than the per token error of the non-AR model without MPC. Likelihood of the entire sequence being off the true data manifold is just one minus the product of the per token probabilities of staying on it, whether or not you're running with the MPC loop. Ie, wrapping the non-AR model in an MPC loop and thereby converting it to an AR model (with a built-in planning mechanism) doesn't increase its probability of going off track.
Per token error compounding over sequence length happens whether or not the model's autoregressive. The way in which per token errors correlate across a sequence might be more favorable wrt probability of producing bad sequences if you incorporate some explicit planning mechanism -- like the non-AR model wrapped in an MPC loop, but that's a more subtle argument than LeCun makes.
Yes. Also "other kinds of predictive models" in my comment refers to models other than generative language models, e.g. image classifiers or regression models etc. Those don't generate tokens, they output labels and the error of the labeling is constant (well, within error bounds). This was in response to OP's comment about "all prediction machines that make errors."
Could the argument be rescued by some additional assumptions?
I agree with, and have previously also stated, the point you make there about “any non-auto-regressive model can be converted into an equivalent auto-regressive model by […]”, but, if one imposes additional restrictions on e.g. computation time, or something like that, I think that construction no longer works.
Well, of course there are some additional assumptions which would rescue the argument, so I guess my real question is whether there’s some combination of extra assumptions which both rescue the argument, and actually result in it being interesting.
If one makes the assumptions that there is a positive common lower bound on the probability of each token being incorrect assuming each previous token is correct, and that if any token is incorrect, then the whole generated text is incorrect, then of course the argument goes through, though the assumption doesn’t necessarily seem very likely.
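Spelled out, with δ as my own name for the common lower bound and C_i for the event that token i is correct, the two assumptions give:

```latex
P\left(C_i^{c} \mid C_1, \dots, C_{i-1}\right) \ge \delta > 0 \quad \text{for all } i
\;\Longrightarrow\;
P(C_1, \dots, C_n) = \prod_{i=1}^{n} P\left(C_i \mid C_1, \dots, C_{i-1}\right)
\le (1-\delta)^{n} \xrightarrow{\; n \to \infty \;} 0 .
```

The second assumption (any incorrect token makes the whole text incorrect) is what lets us identify "text is correct" with the event that all of C_1, ..., C_n hold.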
Then, if we apply the construction you mentioned to a text generation process with a low enough probability of error, then by the contrapositive, there cannot be an especially high common lower bound on the probability of error per token.
[“edit” prior to posting: I notice that at this point I started using symbols as if I was going to start doing actual algebraic manipulations, but did not actually do any algebraic manipulations which would justify the use of said symbols. I think what I wrote below would be clearer if I had just used words. Unfortunately I don’t want to take the time to rewrite it. I apologize for introducing formalisms without having a good reason to do so.]
If we have the assumption that there is a procedure with error rate < epsilon(x) for generating an entire text response of length l(x), and which can be computed within time t(x), the construction gives an autoregressive method which has error rate less than epsilon(x) for the entire text, and doesn’t have an error rate higher than epsilon’(x) for all of the tokens, and runs in time t’(x) per token (err… I guess it should actually vary between the tokens in the generated string… depends on details I guess), where epsilon’(x) and t’(x) can be computed based on epsilon(x) and t(x) and based on how the construction works,
and epsilon’(x) will be much smaller than epsilon(x), while t’(x) l(x) >> t(x) (at least, assuming l(x) is somewhat large).
So, that particular construction does not preclude the possibility that there is no algorithm that works auto-regressively and which both has an error rate (for overall generated text) as low as [the error rate for some non-auto-regressive model that runs quickly enough], and which runs quickly enough.
If there are cryptographically secure families of hashing functions (in the sense of, asymptotically in the size of the hash length, while the hash can be computed in polynomial time, finding preimages or collisions cannot be done in polynomial time) it seems that there should probably be functions from strings to strings which can be computed in time bounded above by some polynomial, but which can’t be computed autoregressively in time bounded above by a polynomial of the same degree.
(So like, maybe it can be computed in time 5n^4 when not autoregressive, but needs at least 2n^5 time to do auto regressively)
(I’m imagining something like, “compute a string of the form ‘hash(y), y’ where y is the result of some computation done on the input which takes a polynomial amount of time to compute from the input.” So, the easiest way to compute this would be to compute y and then compute hash(y). So, to do this auto-regressively, it would need to compute y again for each token in the hash.)
Of course, a single factor of n might not be that compelling, and appealing to strong hashing functions is probably trying to kill a fly with a sledgehammer (probably there are arguments that work as well without assuming this), but it’s what came to mind.
Perhaps one could do something like this to show that for some problems, any auto-regressive solution that has certain runtime bounds, will have some positive lower bound on the error rate per token?
I think it would be hard to make a solid argument that AR or non-AR is strictly better wrt full sequence error rates, whether or not we place constraints on compute, memory, etc. I'd guess that there's some intrinsic form of complexity inherent to any particular distribution of sequences which requires spending at least some amount of compute to achieve sequence generation error less than some epsilon. I'd also guess that AR and non-AR models could both achieve this bound in principle, though maybe it's practically harder with one or the other. It would be interesting to formally characterize this sort of complexity, but that's above my analytical pay grade.
The hash function example is interesting. I think the model could compute y prior to outputting any tokens and then output the `hash(y), y` sequence deterministically. In architectures like transformers, all the compute in earlier steps can be reused in later steps via attention, so it wouldn't be necessary to recompute y at each step as long as the model commits to a given y up front before starting to generate hash(y).
Ah, yeah, I guess that probably is true of transformers in practice. I was thinking about something which strictly takes in a sequence of tokens and outputs a (possibly 1-hot) probability distribution over all possible next tokens. Such a thing running autoregressively would have to recompute y each time. But, if intermediate computations are cached, as with transformers in practice, then this isn’t necessary.
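Here's a small sketch of the two cases being contrasted, treating each character as a "token". `expensive_y` is a stand-in for the polynomial-time computation (here just a string reversal), and the call counter is only there to show how often y gets recomputed:

```python
import hashlib

calls = {"count": 0}

def expensive_y(x: str) -> str:
    """Stand-in for the costly computation of y from the input."""
    calls["count"] += 1
    return x[::-1]

def target_string(x: str) -> str:
    """The full 'hash(y),y' output for input x."""
    y = expensive_y(x)
    return hashlib.sha256(y.encode()).hexdigest() + "," + y

def next_token_stateless(x: str, prefix: str) -> str:
    """Pure function of (input, prefix): has to rebuild the target on every call."""
    return target_string(x)[len(prefix)]

def generate_stateless(x: str, n_tokens: int) -> str:
    out = ""
    for _ in range(n_tokens):
        out += next_token_stateless(x, out)
    return out

def generate_cached(x: str, n_tokens: int) -> str:
    """Commit to y once up front, then emit tokens from the stored result."""
    return target_string(x)[:n_tokens]
```

Both produce the same output, but `generate_stateless` calls `expensive_y` once per token while `generate_cached` calls it once in total, which is the extra factor of n discussed above.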
No. Many prediction machines can give you a confidence value on the full outcome. By the nature of tokenization and causal inference (you build the output one token at a time, and the tokens are not really semantically connected except through the kv cache lookups, which are generally hidden from the user), the confidence values are thrown out in practice, and even a weak confidence value would be hard to retrieve.
I don't think it's impossible to obtain content with confidence assessments with the transformer architecture, but maybe not in the way it's done now (like maybe another layer on top).
Is this similar to the effect that I have seen when you have two different LLMs talking to each other, where they tend to descend into nonsense? A single error in one LLM's output then pushes the other LLM out of distribution.
A kind of oscillatory effect, where the train of tokens moves further and further out of the distribution of correct tokens.
This is equivalent to the problem of maximum entropy Markov models and their application to sequence output.
After some point you’re conditioning your next decision on tokens that are severely out of the learned path and you don’t even see it’s that bad.
Usually this was fixed with cost sensitive learning or increased sampling of weird distributions during learning and then making the model learn to correct the mistake.
Another approach was to have an inference algorithm that maximize the output probability, but these algorithms are expensive (viterbi and other dynamic programming methods).
Feature modeling in NNs somewhat allowed us to ignore these issues and get good performance but they will show up again.
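For anyone who hasn't seen it, this is roughly what Viterbi decoding looks like. It's a minimal sketch using HMM-style probability tables (start, transition, emission) as an assumption for simplicity; an MEMM would plug in its own conditional scores, but the dynamic program has the same shape:

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the observations."""
    # best[t][s] = (probability of the best path ending in s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            # Pick the predecessor that maximizes the path probability into s.
            prob, prev = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p) for p in states
            )
            column[s] = (prob, prev)
        best.append(column)
    # Backtrack from the best final state.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path
```

The cost is O(T * S^2) for T steps and S states, which is fine for small label sets but hopeless when the "states" are a vocabulary-sized token set with long-range context. That's one reason greedy or sampled decoding is used instead.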
> Is this similar to the effect that I have seen when you have two different LLMs talking to each other, they tend to descend into nonsense ?
Is that really true? I'd expect that with high temperature values, but otherwise I don't see why this would happen, and I've experimented with pitting same models against each other and also different models against different models, but haven't come across that particular problem.
That the chain-of-thought diverges from accepted truth as an incorrect token pushes it into a line of thinking that is not true. The use of RL is there to train the LLM to implement strategies to bring it back from this. In effect, two LLMs would be the same and would slowly diverge into nonsense. Maybe it is something that is not so much of a problem anymore.
Yann LeCun talks about how the correct way to fix this is to use an internally consistent model of the truth; then the chain-of-thought exists as a loop within that consistent model, meaning it cannot diverge. The language is a decoded output of this internal model resolution. He speaks about this here: https://www.youtube.com/watch?v=N09C6oUQX5M
It's quite fitting that the topic of this thread is self-correction. Self-correction is a trivial existence proof that refutes what LeCun is saying, because all the LLM has to say is "I made a mistake, let me start again".
Yes, the main flaw of this reasoning is supposing that e does not depend on previous output. I think this was a good approximation to characterize vanilla LLMs, but the kind of RL in this paper is done with the explicit goal of making e depend on prior output (and specifically to lower it given a long enough chain of thought).
P(correct) converges to zero, so you get almost certainly incorrect, at an exponential rate. The original choice of terms is not the most rigorous, but the reasoning is sound (under the assumption that e is a constant).
Ah yes, I didn't notice that it was the probability of being correct. I misread it as the probability of being incorrect, since the claim was that it diverged.
Simplistic, since it assumes probabilities are uncorrelated, when they clearly aren't. Also, there are many ways of writing the correct solution to a problem (you do not need to replicate an exact sequence of tokens).
“Label bias” or “observation bias” is a phenomenon where going outside of the learned path leaves little room for error correction. LeCun talks about the lack of joint learning in LLMs.
Does anyone here know, has anyone tried something like feeding the perplexity of previous tokens back into the model, so that it has a way of knowing when it's going off the rails? Maybe it could be trained to start responding less confidently in those cases, reducing its desire to hallucinate.
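For reference, the signal itself is cheap to compute if you have per-token log-probabilities. This sketch only computes a sliding-window perplexity; how you'd actually feed it back into the model is the open question, and the window size here is an arbitrary assumption:

```python
import math

def perplexity(token_logprobs):
    """exp of the negative mean natural-log probability of the tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def running_perplexity(token_logprobs, window=4):
    """Perplexity over a sliding window ending at each token."""
    out = []
    for i in range(len(token_logprobs)):
        recent = token_logprobs[max(0, i - window + 1): i + 1]
        out.append(perplexity(recent))
    return out

# Made-up log-probs: confident at first, then a run of unlikely tokens.
logprobs = [-0.1, -0.2, -0.1, -3.0, -2.5, -3.2]
print(running_perplexity(logprobs, window=2))
```

A jump in the running value is the kind of "going off the rails" signal the question is asking about.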
One way I explain it to people: Imagine a corporation that only has a PR department. Extremely good at generating press releases and answering reporter questions. But without the rest of the company, the output text isn't constrained by anything meaningful.
In an alternate universe, one where people understood this, people would be using LLMs for nothing serious, but a whole lot of fun little art projects.
If you're talking about label bias then you don't need to solve label bias to 'solve' hallucinations when the model has already learnt internally when it's bullshitting or going off the rails.
I hate that the AI pundits have succeeded in popularizing the notion of "hallucination", anthropomorphizing these balls of statistics into something that seems like it's actually in some sort of deep thought process akin to a person's mind.
No, it's not "hallucinating". It's not lying, or making things up, or anything like that either. It's spitting out data according to what triggers the underlying weights. If this were a regular JSON API endpoint, you wouldn't say the API is hallucinating, you'd say "This API is shit" because it's broken.
As long as AI-bros are pushing for making AI models seem like more than they are to pad their wallets, there'll be someone like me pointing out that, no, it's not "hallucinating", it's spitting bad data.
You're being pedantic. Your statement that "it's spitting bad data" is incorrect too, as it implies agency. Actually, nothing is happening but electrons flowing. The notion of an "it" that "spits" "data" which is "bad" is your own conceptual overlay.
Tbf, if you assume humans have agency, there’s plenty of people who would claim you’re making the same mistake, because the reductionist view is that people are just deterministic chemical soup (or maybe soup with a bit of randomness baked in).
I know lots of people working on AI. They are among the least bro-y groups of people I have ever met.
There is simply nothing similar to actual bro-y finance culture among AI research engineers. It is entirely a figment of the media and backreaction that we currently have to portray everyone we don’t like as a “bro” - truth be damned.
> I hate that the AI pundits have succeeded in popularizing the notion of "hallucination", anthropomorphizing these balls of statistics into something that seems like it's actually in some sort of deep thought process akin to a person's mind.
I'd argue the opposite: people think a person's mind is in "deep thought" when it's actually just a ball of statistics.
Not intended to be snarky, but what would you consider them? Is it akin to a function in the mathematical sense, that takes (sensory) input and creates output based on that? If so, how does this function work, if not by statistics? I am genuinely interested in your point of view. Also: Don't you think humans can be somewhat compared to a "pretrained model", as in human genetics gives the brain a head start, so that it can start speaking Latin from what you deem "homo sapiens mumbling"?
Not a specialist, but I think each individual is just a small step of gradient descent for the large neural network of humankind.
At our individual scale, we look like a rigid ball of statistics, but at the global scale, we carry a small amount of gradient/delta that pushes humankind in a broader direction.
LLMs have been able to reproduce the former; it is unclear how they can contribute to/replicate the latter.
The right word is "confabulation". Which is when we fill in missing information but may not be aware that we are doing it.
We all confabulate to some degree, as any neural system must, since no training data is stored perfectly.
Human "hallucinations" in contrast, are a particular kind of breakdown in our sensory feedback loops. Which is not a process LLMs even have.
Hallucinations occur when our internal sensory feedback loops overpower actual sensory input, resulting in a stream of false sensory experience/signals being generated and processed. The false running experience might still incorporate some actual sensory information or not.
When we dream, we are hallucinating - our sensory experience loop running free of our actual senses - to a productive purpose.
The reason our senses have feedback is so that we can use our interpretation of sensory input as cues to make interpreting the next moment's input easier. But it's important that our running interpretation can reset when new input significantly diverges from our expectations, so it can quickly reorient.
(Not only is it important to revert to a raw input interpretation to ensure our running interpretation keeps up the actual context changes and corrects misinterpretations, but such resets signal that something novel or unexpected has happened, so likely trigger learning.)
So "hallucinations" was an unfortunate and misleading choice of terminology.
I've got bad news for you – that term was used in deep learning research well before LLMs came on the scene. It has nothing to do with pundits trying to popularize anything or trying to justify LLMs' shortcomings, it was just a label researchers gave to a phenomenon they were trying to study.
A couple papers that use it in this way prior to LLMs:
Maybe an evolutionary / structuralist lens is helpful here: terms that rapidly diffuse through discourse are those that people like most, and most people like to anthropomorphize, so "hallucination" has come to take on a new meaning, and we all (to different degrees) know what it is referring to.
Yeah it's simply model error. All models from Linear Regression to LLMs have error. I guess because this type of error is in the form of deceptively reasonable human language, it gets a different moniker. It's also notably harder to quantify so it might warrant a different name.
I don't see any mention of weight release unfortunately.