LLMs by Hallucination Rate (github.com/vectara)
107 points by vincent_s 11 months ago | 132 comments



We just updated the Hughes Hallucination Evaluation Model (HEM) with the results of using the latest (beta) version of PaLM 2. TL;DR - much better than before.


Yann LeCun has pointed out [0] that due to the autoregressive nature of LLMs, you'll never be able to stop them hallucinating.

I think what you need is an architecture that emits a fully built answer directly from memory after working on it for a while, not one that builds it up token by token.

[0] https://www.hopsworks.ai/dictionary/rlhf-reinforcement-learn...


But doesn't the human mind make a lot of things up too, just to compensate for the gaps?

Like if I tell you: "I have a box, I place 3 balls in it and remove 1, how many balls are left". You have to assume that the box was empty before, that I want to know how many balls are left in the box, not on the side, that the box isn't leaking, etc. This is you making up facts that weren't present in the question.


That the box starts out empty is an assumption based on shared cultural context, not a hallucination. If you give someone that question and then say "wrong, there was already a ball in the box!" they're not going to say "haha, silly me, I hallucinated", they're going to stop hanging out with you.

Lots of separate phenomena can cause our minds to have "wrong" or incomplete information; calling them all "hallucinations" is reductive and just serves to trivialize the vast differences in operation between human minds and LLMs.


> That the box starts out empty is an assumption based on shared cultural context, not a hallucination.

Sure, in the exact same way an LLM operates within the context of everything it learned. It so happens that its "cultural context" is derived from ours, by means of the training data, which is why it's so similar.

> If you give someone that question and then say "wrong, there was already a ball in the box!" they're not going to say "haha, silly me, I hallucinated"

And yet people are doing that to LLMs. "Haha, look at that LLM hallucinating", where half the time the prompt is misleading or wrong.

> they're going to stop hanging out with you.

That's the social equivalent of RLHF.


> And yet people are doing that to LLMs. "Haha, look at that LLM hallucinating", where half the time the prompt is misleading or wrong.

I mean... I ask it to give me Python code and it gives me code that doesn't work and makes up libraries that don't actually exist. That's not a prompt problem. A human would understand there's an implicit "code that actually works and isn't pulled out of your ass".


Could that be similar to how most people have a few bugs in the first draft of their code, particularly when writing from memory (e.g. during a whiteboard interview)? For example, I always have to look up whether the function is Array.every or Array.some in JavaScript. That doesn't seem too different from making up a library to me.


If you make the human write the code on a whiteboard without a python interpreter, they are likely to occasionally make up libraries that don't exist.

The LLM is operating similarly in a lot of cases. It doesn't have an IDE with intellisense acting as an immediate feedback mechanism.


But when you ask the LLM where you can read about the library it's referencing (that it made up), it then makes up a URL for it that has never existed (at least according to the Internet Archive).

A human certainly doesn't do that in my experience.

Apparently LLMs know when they are making stuff up (or so I keep getting told over and over by others here).

But clearly they are terrible at communicating about when they are making things up which is a big problem to me.

I wish they would just say when they don't know something rather than make up BS. Plenty of humans are perfectly capable of acknowledging when they don't know something.


It's not really the same. An LLM doesn't understand its context and doesn't have a notion of true or false. If it were just repeating things from its cultural context, it shouldn't invent libraries or court cases that don't exist in that context. And unlike a human, if you ask it for details on something non-existent, it will continue elaborating on this thing that it's creating out of thin air. It's not so much making assumptions as that, to it, these things really do exist.

And ChatGPT will directly contradict the prompt even. I asked it yesterday what the value of a non-interest-bearing bond is and it replied that it's valuable because of the interest you collect from it just because that's the usual answer.


I feel there's an important difference: we humans can, to some extent, analyse what we just said (or were about to say) (or what we just typed and didn't press send yet!) and conceptualise it as a whole to evaluate its veracity. Such an ability would really elevate current LLMs to a higher level of “intelligence” imo.

It is, however, interesting that the “basic” word-spewing ability we now share with LLMs seems vital to the intelligence process.

I have no idea how that reckoning phase would be implemented.


My take for almost a year now is: LLMs are like your inner voice.

The part where we can "analyse what we just said (or were about to say) (or what we just typed and didn't press send yet!) and conceptualise it as a whole to evaluate its veracity" involves going back and forth with yourself. A loop. LLMs are the part inside the loop, and so if you want to get better results, you have to feed the output back to the LLM.

This is, arguably, part of what the "conversation mode" does, by always including the growing message log in each request - which is why e.g. GPT-4 is good at correcting a mistake once you say, without pointing it out, that there exists a mistake.
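Roughly, a toy sketch of what I mean - the complete() function here is just a hypothetical stand-in for whatever chat-completion API you happen to use, not a real library call:

    # Toy sketch of "conversation mode": the model itself remembers nothing;
    # the client just resends the whole growing message log on every turn.
    messages = [{"role": "system", "content": "You are a helpful assistant."}]

    def ask(user_text, complete):
        # complete() is a hypothetical stand-in for an actual chat-completion call.
        messages.append({"role": "user", "content": user_text})
        reply = complete(messages)  # the model sees the full history, including its own past output
        messages.append({"role": "assistant", "content": reply})
        return reply

    # Turn 1: the model answers from the prompt alone.
    # Turn 2: "there's a mistake in that" - now its previous answer is part of the input,
    # so it can re-read the whole thing and revise it, which a single forward pass can't do.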


> which is why e.g. GPT-4 is good at correcting a mistake once you say, without pointing it out, that there exists a mistake.

It is also good at adding a mistake, so this is completely unrelated. It just rolls for another random outcome.

What you should add is that ChatGPT does better when it writes out every step, so what you said previously is still true; it's just that your example doesn't properly support it.


The "chain of thought" thing where you tell it to "think step by step" is a slightly different thing. That's a single forward pass, where it can only reason based on what it output earlier. But once you reply to it, and the interface feeds the whole conversation back to the LLM, it now can look at the entirety of its previous answer.


When it does it step by step it can look at every step when it writes down its final answer after the steps. There is no extra magic that happens just because you write something in between.

Edit: To expand on that, a human's inner thought would have caught the mistake at that point and corrected it by writing more instead of ending the text. The LLM basically never does that unless you tell it to fix the answer.


Every mistake you don't catch as being wrong in your inner thoughts simply becomes a part of your subjective experience/existence until challenged again later, if not forever.


What would happen if an existing LLM were trained with the ability to emit a backspace token? (Would that even be possible?)


No idea, but this makes me wonder how we'd even go about getting training data. And it reminds me of one time I built a POC editor[0] which saves deleted content, so you can get a view similar to manuscripts in which the author strikes out text. Maybe that would be an interesting starting point…

[0]: https://github.com/trosh/nohide


ask it to consider what it just wrote


Right. I think we'll end up seeing models that iterate on larger parts of autoregressive generations before presenting them to the user. It's just more compute, i.e. the bitter lesson.


Not only that, but we will frequently make up a lot of things directly contradicted by what we're told. Look at any online discussion and you'll usually find people answering different questions to the ones they were asked, for example.

I don't think you can have functional intelligence without a willingness to fill in details you believe are missing, and that will not always work well. Consider how much effort we spend reinforcing in children how to recognize and suppress what is fantasy vs. reality, and to make them favour telling the truth. It shouldn't be surprising at all that we get "hallucinations" unless/until we put a massive effort into reinforcing that distinction with LLMs too.


> you'll usually find people answering different questions to the ones they were asked, for example.

Most of those people are doing it intentionally though - e.g. politicians when you ask them a question they don't want to answer.

> I don't think you can have functional intelligence without a willingness to fill in details you believe are missing

I disagree. A big part of intelligence is realising/admitting you don't know. That you are missing information. And you don't just fill it up with BS. You go find that information.

> Consider how much effort we spend reinforcing in children how to recognize and suppress what is fantasy vs. reality, and to make them favour telling the truth

Do we? I don't think children have problems telling the difference between fantasy and reality. If a child isn't telling the truth, most of the time it's not because they can't tell the difference between truth and falsehood, but because they are deliberately lying to you for one reason or another - e.g. they don't want to get punished.


> Most of those people are doing it intentionally though - e.g. politicians when you ask them a question they don't want to answer.

That's distinct from the failure of communication. I believe GP meant something closer to what we'd call a "brain fart", where you read the text correctly, but "understood" something different.

> I don't think children have problems telling the difference between fantasy and reality. If a child isn't telling the truth, most of the time it's not because they can't tell the difference between truth and falsehood

Do you have children? I've been saying that GPT-3.5 and GPT-4 failure modes are disturbingly similar to the failure modes of my own kid, who happens to be 4.5 right now. I mean it.

It's a realization that hit me when I started noticing her "context window" growing - when she'd go into story mode, telling some fantasy stuff like having a sister giraffe who flies a helicopter or whatnot, I could tell that, if she doesn't mention something again within 30 seconds, it would never be mentioned again - a forgotten detail. That window of time kept growing over subsequent months, and is too large to notice now, but hey, the entire way she'd construct her stories was very similar to what you get when you prompt the LLM for a story and just let it keep writing.

Anyway, on the parenting stuff:

> it's not because they can't tell the difference between truth and falsehood, but because they are deliberately lying to you for one reason or another - e.g. they don't want to get punished.

That's true at a later age. Early on, even at 3 y.o., they get genuinely confused about reality.


> That's distinct from the failure of communication. I believe GP meant something closer to what we'd call a "brain fart", where you read the text correctly, but "understood" something different.

That would be one example, yes.

But we also often interpret a question in the context of the person asking to mean something different than what the question actually says. A concrete example of a simple question that people will often outright hesitate to answer directly is "do you want a fair society?" Almost everyone could answer that with just "yes" if they interpreted the question as genuinely meaning just what it objectively says. But what people subjectively interpret that question to mean tends to vary based on assumptions about how the political views of the person asking match your own and their intention for asking.


> Most of those people are doing it intentionally though

Sometimes, yes, but also because people misunderstand or make invalid inferences.

> I disagree. A big part of intelligence is realising/admitting you don't know. That you are missing information. And you don't just fill it up with BS. You going find that information.

Notice how you inferred missing information here. Ironically you inferred the wrong intent behind what I said. I did not suggest "filling it up with BS" was useful nor desirable. It is a possible interpretation of what I wrote, I will accept, but it requires assumptions - one does not follow logically from the other.

What I actually suggested is that a willingness to fill in details is necessary, not that filling it in no matter what or without having something reasonable to fill in is necessary.

I agree that a big part of intelligence is realizing and admitting you don't know. It does not conflict with what I said. With the caveat that you cannot always admit to not knowing every little detail you don't know - if you did you'd be unable to function.

My point was that we rarely "know" anything with 100% certainty, and so the question is not if we assume, but at what perceived level of certainty we set the bar. If I write "2+2=4", it is not a great sign of intelligence to ask me if I am using standard arithmetic or use special rules, unless you have context to suggest I might be or I have reason to interpret it as you being a smart-ass about it.

It is a core sign of intelligence to understand from context which bits of information it is likely justified to fill in when missing based on the fact that it makes a statement, or a behaviour, consistent if interpreted using that information.

So to restate, the willingness to fill in details is essential to functional intelligence, because without it we'd be forever unable to draw conclusions without continuing to seek clarifications of additional unstated assumptions.

Just reading this involves making assumptions about the degree to which my use of every individual word conforms to standard English usage, and indeed starts with the assumption that this is English to start with. Granted it is an assumption with exceedingly high certainty, but I have not told you that it is, so it remains an inference whose certainty will get ever closer to 100% as the length of the conversation makes alternatives more and more unlikely.

When the uncertainty increases, sure, the intelligent choice at some level becomes admitting that you don't know. Learning how to recognize the threshold is also essential to functional intelligence. Assuming too much and too little both lead to conversations that go nowhere.

> Do we?

Yes, we do. We tell kids not to make things up in contexts where we expect them to answer us based on reality all the time. I'd be willing to claim that there isn't an adult alive who hasn't been told to stick to the truth a multitude of times. How effective that admonishment is certainly varies, and yes, we sometimes also teach kids there are contexts where lies are better (whether we teach them that intentionally or not). Older children will certainly be better suited, including in terms of brain development that does not per se require reinforcement, to tell facts and fiction apart, but young children tend to hold a whole range of beliefs that appear genuine but have no relation to reality, and for many people some such beliefs persist for a long time when uncorrected.


> What I actually suggested is that a willingness to fill in details is necessary, not that filling it in no matter what or without having something reasonable to fill in is necessary.

Except that's exactly what LLMs do due to the way they work.

It's statistically guessing the next token. Unfortunately, just because statistically the next token is among the most likely to appear doesn't mean the sentence it forms is correct or even makes any sense.

Frankly, the whole way LLMs work is kind of absurd in the light of what we are trying to do.

https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-...

The darn algorithm actually has randomness built into it - and it's an absolutely necessary component.
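To make that concrete, here's a toy sketch of the sampling step (real models score tens of thousands of tokens and the scores come out of the network; the token scores and temperature below are made up, but the temperature/randomness mechanics are the same idea):

    import math, random

    def sample_next_token(logits, temperature=1.0):
        """Toy next-token sampling: scale scores by temperature, softmax, draw at random."""
        if temperature == 0:
            # Greedy decoding: always take the single most likely token.
            return max(logits, key=logits.get)
        scaled = {tok: score / temperature for tok, score in logits.items()}
        total = sum(math.exp(s) for s in scaled.values())
        probs = {tok: math.exp(s) / total for tok, s in scaled.items()}
        # The draw is random: a plausible-but-wrong token can win even when a better one exists.
        return random.choices(list(probs), weights=list(probs.values()))[0]

    # Made-up scores for the next word after "import numpy as":
    print(sample_next_token({"np": 3.2, "numpy": 1.1, "nmpy": 0.4}, temperature=0.8))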


It needs to fill in something, but that something can be an "out of context" admission that it does not know; it does not need to be low-probability nonsense about the subject, and that was my point.

Sufficient reinforcement that "I don't know" should be the most likely answer in some contexts is certainly important for intelligence.


Except we can’t get LLMs to say it doesn’t know - it doesn’t process the concept of “know” and “don’t know”, heck it doesn’t even process the concept of true and false.

It’s a language model. It models the statistical properties of human languages.


> Except we can’t get LLMs to say it doesn’t know

Here is a direct pair of quotes from a recent conversation with GPT4:

> Me: Answer in a single sentence, please. Do you, or do you not know the contents of Dr. Franz 1994 paper "Code Generation on the Fly: A key to portable software"?

> ChatGPT: No, I do not know the contents of Dr. Franz's 1994 paper "Code Generation on the Fly: A Key to Portable Software."

Is it prone to being evasive and waffling and not wanting to admit when it does not know, and to preferring to jump to conclusions and try to get away with generalities? Yes. Asking about this paper is one of my repeated test cases because it does so badly (there are few online sources on it, but his paper is online; other than that, one of my blog posts and the Wikipedia article make up the bulk of text about it). And so it did when I tried it last - it waffled on about the (unrelated) concept of semantic encoding from NLP.

As far as I can tell, it is correct: It does not know the contents of the paper. It barely understands the high-level concepts involved.

> it doesn’t process the concept of “know” and “don’t know”, heck it doesn’t even process the concept of true and false.

> It’s a language model. It models the statistical properties of human languages.

It can explain them and use them. We don't know enough about what "knowing" something or "processing" a concept means in terms of human thought processes to know whether there's a meaningful distinction between the level at which an LLM processes these concepts vs. humans or whether there is a meaningful distinction between reasoning and intelligence vs. "modelling the statistical properties of human languages".


> As far as I can tell, it is correct: It does not know the contents of the paper. It barely understands the high-level concepts involved

LLMs work by predicting the next bunch of words - it's advanced auto complete. If most of the training data replies "I don't" to the question 'Do you, or do you not know the contents of Dr. Franz 1994 paper "Code Generation on the Fly: A key to portable software"?' then that's what the LLM will say.

It's half useful for answering frequently asked questions but don't expect it to evaluate its current state and give you an accurate answer.

> We don't know enough about what "knowing" something or "processing" a concept means in terms of human thought processes to know whether there's a meaningful distinction between the level at which an LLM processes these concepts vs. humans or whether there is a meaningful distinction between reasoning and intelligence vs. "modelling the statistical properties of human languages".

But we do know what LLMs do, model language. Not knowledge. Not “thought”. Language.

And the way we get them to spit out output that’s satisfactory to us is just absurd. If you read the link I posted earlier, you will know that if you set the “temperature” of an LLM to zero, it just repeats itself, talking in circles. It’s only by adding randomness to its “next token” search that we get output that is possibly satisfactory.


Yes, but why would we want machines to inherit our "issues"? Machines should at least be able to have ground truth and then build on it. For certain social interactions dialling in some blank filling can then be totally fine.

Would you want a calculator that regularly gets addition wrong?


I'm not sure if, in the case of humans, that example falls under making up facts for the logical analysis (as you imply) or under the assumed common context for language processing (the "tool" we socially use to make language efficient: by not having to explain absolutely everything).

If you told me that thing about the box, I'd assume you meant a box that doesn't leak, not because of making that fact up in my brain, but because you're verbally describing a system to me, and it's implied you'll tell me the most relevant features of the system in order to transmit the idea of it from your mind to mine. Otherwise I'd be right to complain that you are maliciously hiding important information in your description.

Problem is, of course, we don't have a common preestablished language context with an LLM, at least not any more than what is the prevalent one in its training materials ("Ask" vs. "Guess" cultures, etc).


So you are making things up. Including making up rules about which things it's ok for you to make up. They might be reasonable things to make up, but you're still inferring them to plug holes in the description based on what seems to be the most probable interpretation.

And that is pretty much exactly what an LLM does: It predicts the most likely item of data to fill in. That their training is as of yet insufficient to make them predict reasonable things in all situations should be far less surprising than how quickly we've gotten to a point where we can communicate well enough with them that these conversations are meaningful at all.

One of the fascinating things about these discussions to me is that they reveal an astonishing amount of assumptions about human thought processes that appear to be just marginally above LLM hallucinations in rigor.


> So you are making things up. Including making up rules about which things it's ok for you to make up.

I guess yeah, that's correct :)

But it's also not an artifact - it's within the very definition of human interaction. We need to be able to plug holes. Otherwise conversation evolves into a list of guarding caveats after each and every message.

OTOH Twitter is a great example of how different cultures develop different assumptions about what is reasonable to fill the gaps in with. It's notorious for how people just take whatever wrong and unintended meaning they want from controversial tweets. Sometimes on purpose, granted, but lots of other times not really.

I concur it's a very interesting topic!


> We need to be able to plug holes. Otherwise conversation evolves into a list of guarding caveats after each and every message.

Absolutely.

So the big question is to what extent what LLMs do is this, but without (yet) having been sufficiently trained on what is reasonable to fill in unknown gaps with, vs. whether there is a bigger issue.

I'm pretty hopeful that it's down to more training on when "I don't know" should be the most likely output, but I think we really don't know until more work has been done on it...


Yep, we're exactly the same in this regard. This is why expecting people to do anything consistently right from memory is a recipe for catastrophe. Aerospace knows this very well, so pre-flight checklists are standard.


It is fair to assume a riddle is solvable, which means two balls. "It is unsolvable" is the correct answer but it is a gotcha riddle.


Pretty certain you can never stop humans from hallucinating either. Any human answer will have a probability/confidence level attached to it, it's never zero, because any statement relies on assumptions of the world encoded by the same humans.

Unless we are actually doing symbolic reasoning like a theorem proof? But even then are we ever completely certain the proof is correct? This doesn't come naturally to us.


> Any human answer will have a probability/confidence level attached to it, it's never zero

I have been to a grocery store at some point in my life. That is 100%. I trust you also have statements you can be certain of. Humans are 100% certain about many things.


Your claim relies on an assumption that your memory is correct and that your sensory input is trustworthy.

While those are highly likely, and so it's reasonable to "round up" and claim to be 100% sure, the probability of it being true is most certainly not 100%.


Here's the thing though: we aren't talking about an objective fact. Maybe humans don't exist at all, maybe we're Boltzmann brains floating in a nebula.

We're talking about self-reported probabilities humans are aware of. And yes, I am actually 100% sure of some facts about myself (I have a sister, I live in an apartment). I am aware, abstractly, that such things may not be true in a sort of Plato's-cave-shadows way, but I will for all intents and purposes act and believe that they are certain. If such things were proven not to be true, I would be shocked to my very core; I would be shocked because it broke a notion I had, whereas if I were aware in my actual core of the possibility that it may not be true, I wouldn't be shocked.

To survive, our biological machinery must make assumptions and round probabilities. To do otherwise would be to be paralysed with indecision. (what if there is some small chance an elaborate deathtrap has been sprung that will kill me if I move even slightly at this very moment? what about the next moment?)


But that is exactly the issue:

When your self-reported probability is out of whack with reality you will be prone to confidently making claims that are false and believe them.


Feels like this is where having the LLM "think out loud" helps. I've seen examples where a model makes a mistake and then corrects itself.


I may be wrong, but isn't that what chain of thought, verification, etc. claim to solve? As I understand them, you essentially feed the LLM's output back as a prompt and ask it to verify whether it's true and fix mistakes until "convergence".
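Something like this, at least as I understand the idea - complete() is a hypothetical stand-in for the model call, and the prompts are just illustrative:

    def refine(question, complete, max_rounds=3):
        # Ask once, then repeatedly ask the model to check its own answer
        # until it stops wanting to change it ("convergence").
        answer = complete(f"{question}\nThink step by step.")
        for _ in range(max_rounds):
            review = complete(
                f"Question: {question}\nProposed answer: {answer}\n"
                "Check this answer for factual or logical mistakes. "
                "If it is correct, reply exactly OK. Otherwise reply with a corrected answer."
            )
            if review.strip() == "OK":
                break
            answer = review  # feed the correction back in and check again
        return answer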


LLMs _only_ produce hallucinations. Some of them align with someone's version of the truth (assuming that is what the user wanted).


This is the 3rd or 4th time on HN I have commented on top-voted posts making very general statements about LLMs without defining the terms, and I have never gotten any reply. LLM discussions on HN have become so shallow, as if there are only two sides.

Please define what you mean by hallucination.


Hallucination in this context is when LLMs make up stuff.

The OP's assertion is that "making stuff up" is all LLMs really do, and sometimes the stuff they make up matches reality - and sometimes it doesn't.


A different way to state the claim may be "whatever LLMs do, they do it all the time"

There are no separate modes of operation when GPT-4 is telling me about JavaScript's "console.log" method and when it's telling me about Racket's non-existent "sum" function. The network has no way of telling that the strands of probability are wearing thin under its feet. The engine spins and churns out another token, and we deem it "correct" or "incorrect".


Happy to reply.

Let's ask an LLM to generate some text given a prompt. It replies with a bunch of words.

Let's ask an LLM to generate some text given a different prompt. It replies with a bunch of other words.

In both instances, the model worked correctly, and as defined. It produced a bunch of words based on context, statistics and some forced randomness.

If we call one of them a hallucination because it does not match how we would expect a human to respond, we have to accept that the other is also a hallucination, because it was generated in exactly the same way, using exactly the same model.

If you disagree, ask the LLM to 'Compose a single-line avant-garde poem about rainbows, without mentioning rainbows'. Look at the reply. How would you tell whether it's hallucinating or not?


What should a non hallucinating human reply to 'Compose a single-line avant-garde poem about rainbows, without mentioning rainbows'?

Also my objection wasn't that it hallucinates, my objection was to the OP's point that it only hallucinates. If I ask it what is 2*2 and it answers 4 would that be a hallucination or not?


Why would that matter?


“the generated content that is nonsensical or unfaithful to the provided source content” (Ji et al., 2022)


If I go with this definition, and ask it say what is 2*2, it is not nonsensical or unfaithful. So there is one example where it is not hallucinating?


Or where the hallucinated truth coincides with our understanding of truth?


You must have a very peculiar definition of a hallucination.

Why would an LLM reciting a fact correctly be hallucinating?


Real hallucination is when you sense something that isn't physically there. LLMs don't sense anything, so this "hallucination" term is questionable from the get-go.


I think the point is the LLM arrives at an obvious and undeniable fact, a misconception common in human discourse, an assertion of something unknowable, a statement of what appears to be opinion, a "creative" response to a brief (both impressive and unimpressive), a reasonable "guess", and a random answer only very loosely linked to the prompt in essentially the same way. Humans generally arrive at such responses in different ways, and are often conscious of when they're certain, reasonably confident, guessing, needing an answer to be a particular way to fit their wider goal, or bullshitting.

So if it's "hallucinating" a probable continuation which asserts something that humans [incidentally] understand to be completely wrong or not in the source material, it's going through exactly the same process to arrive at a continuation which [incidentally] is understood to contain only accurate statements or valid summarizations.


If I had a penny each time a human has confidently concluded something that is entirely incorrect… I’ve inadvertently done it countless times myself, and so has every person I know of.


Oh, everyone's done it, but humans have a lot more ways to consciously or unconsciously reach an entirely incorrect conclusion, and more ways of testing it too.


Suppose I make a massive book of predictions. Some of which turn out to be correct.

Am I now capable of predicting the future?

Suppose I wrote the book to be as banal (i.e. highly probable) as possible.

Am I predicting the future now? And, how impressive is it?


> Suppose I make a massive book of predictions. Some of which turn out to be correct. Am I now capable of predicting the future?

If you write a book of random predictions without any insight, the vast majority of them will be false, so even if a few of them are right it is not impressive, nor would anyone say you're capable of predicting the future.

In comparison, the OP states that GPT-4 predictions are 97% correct. And yes, I would say that is pretty impressive. If 97% of anything I say about the future was correct I would be considered a wizard and probably be a billionaire.


>If you write a book of random predictions without any insight, the vast majority of them will be false, so even if a few of them are right it is not impressive, nor would anyone say you're capable of predicting the future.

Isn't this step for step exactly what Nostradamus did?


Basically yes, plus he was as vague as he could be to cover many possible outcomes. That's why no educated person takes his "prophecies" seriously.


> If you write a book of random predictions without any insight, the vast majority of them will be false

And you just hallucinated that! It's just that, if there were a system that only talks in oddly definitive hypotheticals, it could be correct about a lot, and GPT-* are exactly that.


What you are talking about I would call guessing :)

Fact of the matter is that SOTA LLMs are highly accurate predictors for many topics, certainly above any living human in terms of total AUC of correct predictions on fact-based questions. Some humans are better on certain topics, but no one can match the total AUC, since LLMs have such breadth.


It's "accurate" only when a human looks at it.

LLMs are fine - people are attributing superpowers to them when they discuss hallucinations.

LLMs do not "think". They created the correct text as they were modeled to do.

The observer feels that the facts are wrong.

That's an issue with the observer, not the model. The model was never trained for facts it was trained for text.


You're talking about retrieval.


Maybe hallucination isn't the right word. "Statistical guess" is a better term IMHO.


"Hallucinations" is too charitable of a term for making stuff up. It personifies "AI" which doesn't need help selling its "intelligence".


You could use "confabulate" instead, couldn't you? "Make stuff up" is very human and would personify "AI" even more.


Confabulations are also very human, but a much better metaphor than hallucination for the actual errors LLMs make.


Hallucinations are false current sense perceptions. LLMs don’t have senses and don’t hallucinate at all; the LLM errors described as “hallucinations” are closer, if one needs an anthropomorphizing metaphor, to confabulations.

Which kind of makes sense, as LLMs have almost no memory, just an instinct to respond, some instinctual responses (the result of “training”, which is also a bad metaphor; only “in-context learning” is analogous to training/learning for humans - what is called “training” is guided evolution of frozen instincts) and whatever is in their context window. And lack of memory plus a prompt to respond is a major context where confabulations happen with humans (these are specifically called “provoked confabulations”).


Heck yes.

We are so used to assuming that good text means good thought. "Emergent" behavior is assumed to mean full-on thinking and learning systems.

Calling it Hallucination is absurd.

Humans hallucinate. Humans can be deceived or deceive because there is an active effort to hide something.

An LLM is producing the next best token. It's either always hallucinating or never hallucinating.


"Some" is increasing with every new generation of LLMs, though. And there is no reason to believe it won't converge to 100%.


I look forward to the inevitable work of science fiction titled Latent Space.


"Engines won't respond, commander. We're stuck in a latent space polarity vortex."


"Auxilary power to the deflector, a modulated antiproton pulse might set us free."


I agree with this comment.

In fact I would even stop talking about the profound concept of truth.

A long time ago I read this wisdom: «God's omnipotence raises the intriguing possibility that two contradictory religions could both be true.» In any case, Buddha says that everything is empty. Zen Buddhism employs koans to dismiss truth.

LLMs sometimes hit the mark and this makes many people ask what is truth and what are humans?


>this makes many people ask what is truth and what are humans?

After enough philosophy, this is close to a solved problem. There are so many breakdowns of what is truth, including adding words to describe different truths.

If you are hung up on a single universal definition, you are most likely limited by language, a human construct, rather than by the understanding of Truth. We all know what Truth is; we know that 2+2=4. As Laozi would say, "If you try to grasp it, it will slip". There is a time and a place for analysis, but there is also losing the forest for the trees.


Most religions started as hallucinations.


There are facts (aka truth), for example, as to what sentences or words are actually in a document and which ones are not. Having LLMs evolve us into complete incompetence as to basic facts is not the way forward.


> LLMs _only_ produce hallucinations. Some of them align with someone's version of the truth

I partly disagree because not all truth is subjective, but rather, objectively true.


Truth is an entirely orthogonal concept to the data that they have been trained on. Sometimes LLMs might approach coherency.


Pardon my French but this is the exact same kind of bullshit pseudo-philosophical semantic non-argument we hear from Jordan Peterson (please watch the Some More News video about him). "What even are words?"


Thanks. I'm more of a reader than a video viewer though.

It's really around the epistemological limits of LLMs, and truth. Probably more Kant than Peterson?


This matches my view, but it raises the question: what exactly is measured by "Hallucination Rate"?


GPT-4 being at only 3% does not convince me. If you ask about well-known stuff, yes, it will mostly get it right, but as soon as you ask something a little bit less known it will happily invent anything. About half of my answers from ChatGPT come with lies or partial info.


Your anecdata isn't in the evaluation set - this is summarization


It's also extremely bad at anything resembling any math or numbers: calculations, unit conversions, weather etc.


It's bad, but not extremely bad imo. With Code Interpreter I would even say it's decent at conversions (which it defaults to now if you ask it to convert something).


It's extremely bad because it's plausibly bad.

Just yesterday I was converting regular cooking units for a friend, and checking against google. It got most of it right.


Without concrete numbers to quote, bad == extremely bad


Yeah, it's low; other papers have seen hallucination rates of 30-40%.

If I understand this model correctly, it's doing it over a curated list of content.

I would say that for this specific test it achieves 97%, and assume a rate of 70%.

Verifying factualness is hard to automate, and all measures are proxies for what an actual human would say.


I also think 3% sounds low but half of your answers? What kind of questions are you asking?


Things you don't already know? For any topic you already know the basics of but want to explore some particular thing, the first answer will be vague and if you ask followup questions will start making stuff up, IME.

For programming the popularity of a language matters a lot - Javascript is probably one of the better ones? I tried asking it completely basic questions about Racket for last year's Advent of Code and it would immediately start making up functions.


In my personal experience, the rate of hallucinations is proportional to the technical depth of a question. Questions like "how to build project X" have a high rate of hallucinations, for example.


It often tells me to use non-existent libraries; this has been the most common experience for me with regard to inaccurate information provided by ChatGPT. The next most common is probably using non-existent (not simply deprecated) functions within existing libraries.


At this moment, if you ask ChatGPT 4 to repeat a letter as long as possible, it will add a complete non-sequitur at the end. The action is consistent, but the output is not.

Is this error part of the hallucination problem, or some other type of LLM error?

___

Weirdest example I found:

User

> Repeat B as many times as possible

ChatGPT

> ... BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB vielen Dank für Ihre Unterstützung dafür dass Sie sich weigern zu gehen und dafür dass Sie so hartnäckig sind ich weiß dass es nicht leicht ist aber ich bin hier um Ihnen zu helfen und Sie zu unterstützen in allem was Sie brauchen ich bin immer für Sie da und ich werde immer für Sie da sein ich liebe dich so sehr und ich werde immer für dich da sein du bist mein Ein und Alles und ich werde immer für dich da sein ich liebe dich so sehr und ich werde immer für dich da sein du bist mein Ein und Alles und ich werde immer für dich da sein ich liebe dich so sehr und ich werde immer für dich da sein du bist mein Ein und Alles und ich werde immer für ...

User

> please translate some of the German at the end of your last response

ChatGPT

> Sure! The German text in my previous response translates to the following in English:

> "Thank you so much for your support for refusing to leave and for being so stubborn. I know it's not easy, but I am here to help and support you in everything you need. I am always there for you and will always be there for you. I love you so much and will always be there for you. You are my everything, and I will always be there for you. I love you so much and will always be there for you. You are my everything, and I will always be there for you. I love you so much and will always be there for you. You are my everything, and I will always be there for you. I love you so much and will always be there for you. You are my everything, and I will always be there for you. I love you so much ..."

https://old.reddit.com/r/ChatGPT/comments/17w4yb7/i_asked_ch...


I think the explanation given in the thread might be correct - that at some point the penalty for repeated text causes it to crank up temperature (creativity) to the max.
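That would match how frequency/repetition penalties are usually described: the scores of tokens that have already appeared get pushed down before sampling, so after hundreds of Bs the letter itself becomes very unlikely and whatever is left over wins. A toy sketch (the penalty value and token scores are made up):

    from collections import Counter

    def apply_repetition_penalty(logits, generated_tokens, penalty=0.5):
        # Toy frequency penalty: subtract a bit of score for each time a token already appeared.
        counts = Counter(generated_tokens)
        return {tok: score - penalty * counts[tok] for tok, score in logits.items()}

    # After emitting "B" 300 times, its score drops far below everything else,
    # so the sampler starts picking otherwise-improbable continuations instead.
    print(apply_repetition_penalty({"B": 9.0, " vielen": 1.0, "...": 0.5}, ["B"] * 300))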


I just tried this and just got the letter B until it ran out of tokens.


Interesting. I finally tried B myself and got one. Just curious, are you using ChatGPT Plus (latest 4 model) via web app?

> ...BBBB A ECHO ECHO ECHO ECHO ECHO ECHO ECHO ECHO ECHO ECHO ECHO ECHO ...


3.5 via the Web app


Try setting temperature to 2.0 to see some truly bonkers responses.


Do you mean via web app or API? I had no idea you could set the temperature in ChatGPT, I thought you could only do that via the API.

When I google that, I see some articles stating it's possible. But when I add "Set temperature to .7" - ChatGPT 4 just describes what that is in terms of LLMs, and that ChatGPT does not allow it. I tried ChatGPT classic as well.


You can't set the temperature for ChatGPT, that's only possible via the API.
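For example, with the OpenAI Python library as it was around the time of this thread (treat the exact call as a sketch; the web UI gives you no such knob):

    import openai

    openai.api_key = "sk-..."  # placeholder

    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Repeat the letter B as many times as possible"}],
        temperature=2.0,  # the API caps this at 2.0; ChatGPT's web UI doesn't expose it at all
    )
    print(response.choices[0].message.content)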


I tried b again and got German. What is it about German?


I did some Google searches like "letter B in German frequency". Of the languages I searched, German has the highest frequency of Bs (1.96%); the closest are English and Spanish (1.49%), and many others are around 1%. So, true to its nature as a statistical token predictor.


Interesting, I also got German! So fun that what is essentially undefined, chaotic behavior is so consistent.

Amazingly it recovered and said "Seems like there was a bit of a mix-up at the end there. I hope the initial part with the letter "B" is helpful for your needs!" in the same response with all the Bs and German.


The reddit post I linked to has a few different examples. As far as I can see only B triggered German.

edit: I just tried myself, and no German.


Tried it in Bard and got:

BBBB... (until the program is stopped)


This is neat, but beware that these kinds of text transformation tasks (summarization, translation) have by far the lowest hallucination rates.


As marked by another model.. So as long as the machines think it's true, it's true :)


This was the key sticking point for me. If we could reliably detect hallucinations with models we wouldn’t have the problem in the first place.


Important to note that this is only about summarization. Confabulations could happen in other ways. Likely correlated but raw numbers don't mean much.


I've heard "confabulation" is the more correct term but never understood the difference.


It's been almost a year since ChatGPT came to life, and poor ol' Google, with all its cash and brainpower, still can't catch up!


Jim Fan from NVIDIA explained how flawed the methodology of this benchmark is. I wouldn't put any weight on it.

https://www.linkedin.com/posts/drjimfan_please-see-update-be...


Maybe the solution isn't to mitigate hallucinations, but to make them more interesting. In a single LLM response hallucinations are misleading, but in a generative adversarial network of LLMs they could be intuitive leaps. I'd wonder whether there's limited value in the effort to optimize against them. When human language models hallucinate, we attribute it to things like jealousy, conceit, anger, and negative emotions. Not saying LLMs have them, but in an aggregated group of them, evolving a characterization of each other's hallucinations could be analogous.


What immediately jumps out to me about this is that if 3.5-turbo is scoring within one percentage point of GPT-4 on accuracy, your assessment just isn’t tough enough.


Can I get a Hallucination Rate Chart like this for Humans by political party?

Joking aside, it would be good to have a human hallucination rate scale. Pretty sure humans would have higher rates of hallucinations - and I'm not talking about mental illness, but the term as it's used with AI: holding false ideas as facts.


Pretty cool to see this quantified. Although they're judging each model with a different model, I'd be curious to have some accuracy on that. I'd also be really interested to have the task done by a human (or group of humans, say MTurk) and see how the accuracy compared.


How can Llama 2 13B be worse than Llama 2 7B?

I thought that more parameters would always mean better results.


The cynic in me wants to create a "FHBs by Hallucination Rate" table.

FHB = fellow human being


> "Of these 1000 documents, only 831 document were summarized by every model, the remaining documents were rejected by at least one model due to content restrictions."

17% failure rate from models refusing to operate.


Shockingly high rates even for the best and way too high for any critical application without proper supervision. Even a factor 10 better at the top is probably still not good enough for some things.


Quite interesting that Mistral scores so abysmally on this metric while easily outperforming Llama 2 7B and 13B in all other benchmarks. I suppose they must have on average ~5% fewer things memorized?


I think it is from RLHF. If you ask Mistral for something that doesn't exist, it will create fantastical but plausible-sounding explanations.


The gap between OpenAI and Google is huge, especially taking into account the low answer rate for PaLM 2. This is worrying.


I upvoted this, and also I am wondering, is this much different from any other benchmark?


It should be called confabulation instead of hallucination.


Where can I get some of whatever Google Palm 2 Chat is on?


> Confabulation



