For anyone who still doesn't understand why more and more folks are pointing out that LLMs "hallucinate" 100% of the time, let me put it to you this way: what is the LLM doing differently when it generates tokens that are "wrong" compared to when the tokens are "right"? If there is a difference, where does that exist? In the mechanism of the LLM, or in your mind?
Bonus question 1: Why do humans produce speech? Is it a representation of text? Why then do humans produce text? Is there an intent to communicate, or merely to construct a "correct" symbolic representation?
Bonus question 2: It's been said that every possible phrase is encoded in the constant pi. Do we think pi has intelligence? Intent?
What is the difference between a Google Maps-style application that shows pixels that are "right" for a road, and pixels that are "wrong" for a road?
A pixel is a pixel; colors cannot be true or false. The only way we can say a pixel is right or wrong in Google Maps is whether the act of the human utilizing that pixel to understand geographical information results in the human correctly navigating the road.
In the same way, an LLM can tell me all kinds of things, and those things are just words, which are just characters, which are just bytes, which are just electrons. There is no true or false value to an electron flowing in a specific pattern in my CPU. The real question is what I, the human, get out of reading those words, if I end up correctly navigating and understanding something about the world based on what I read.
Unfortunately, we do not want LLMs to tell us "all sorts of things," we want them to tell us the truth, to give us the facts. Happy to read how this is the wrong way to use LLMs, but then please stop shoving them into every facet of our lives, because whenever we talk about real-life applications of this tech it somehow is "not the right fit".
> we want them to tell us the truth, to give us the facts.
That's just one use case out of many. We also want them to tell stories, make guesses, come up with ideas, speculate, rephrase, ... We sometimes want facts. And sometimes it's more efficient to say "give me facts" and verify the answer than to find the facts yourself.
I think the impact of LLMs is both overhyped and underestimated. The overhyping is easy to see: people predicting mass unemployment, etc., when this technology reliably fails very simple cognitive tasks and has obvious limitations that scale will not solve.
However, I think we are underestimating the new workflows this tech will enable. It will take time to search the design space and find where the value lies, as well as time for users to adapt to a different way of using computers. Even in fields like law where correctness is mission-critical, I see a lot of potential. But not from the current batch of products that are promising to replace real analytical work with a stochastic parrot.
That's a round peg in a square hole. As I've seen them called elsewhere today, these "plausible text generators" can create a pseudo facsimile of reasoning, but they don't reason, and they don't fact-check. Even when they use sources to build consensus, it's more about volume than authoritativeness.
I was watching the show 3 Body Problem, and there was a great scene where a guy tells a woman to double-check another man's work. Then he goes to the man and tells him to triple-check the woman's work. MoE seems to work this way, but maybe we can leverage different models with different randomness and get to a more logical answer.
We have to start thinking about LLM hallucination differently. When it follows logic correctly and provides factual information, that is also a hallucination, but one that fits our flow of logic.
Sure, but if we label the text as “factually accurate” or “logically sound” (or “unsound”) etc., then we can presumably greatly increase the probability of producing text with targeted properties
What on Earth makes you think that training a model on all factual information is going to do a lick in terms of generating factual outputs?
At that point, clearly our only problem has been we've done it wrong all along by not training these things only on academic textbooks! That way we'll only probabilistically get true things out, right? /s
> The real question is what I, the human, get out of reading those words
So then you agree with the OP that an LLM is not intelligent in the same way that Google Maps is not intelligent? That seems to be where your argument leads, but you're replying in a way that makes me think you are disagreeing with the OP.
I guess I am both agreeing and disagreeing. The exact same problem is true for words in a book. Are the words in a lexicon true, false, or do they not have a truth value?
The words in the book are true or false, making the author correct or incorrect. The question being posed is whether the output of an LLM has an "author," since it's not authored by a single human in the traditional sense. If so, the LLM is an agent of some kind; if not, it's not.
If you're comparing the output of an LLM to a lexicon, you're agreeing with the person you originally replied to. He's arguing that an LLM is incapable of being true or false because of the manner in which its utterances are created, i.e. not by a mind.
So only a mind is capable of making signs that are either true or false? Is a properly calibrated thermometer that reads "true" if the temperature is over 25C incapable of being true? But isn't this question ridiculous in the first place? Isn't a mind required to judge whether or not something is true, regardless of how this was signaled?
Read again; I said he’s arguing that the LLM (i.e. thermometer in your example) is the thing that can’t be true or false. Its utterances (the readings of your thermometer) can be.
This would be unlike a human, who can be right or wrong independently of an utterance, because they have a mind and beliefs.
I’ll cut to the chase. You’re hung up on the definition of words as opposed to the utility of words.
That classical or quantum mechanics are at all useful depends on the truthfulness of their propositions. If we cared about the process, then we would let the non-intuitive nature of quantum mechanics enter into the judgement of the usefulness of the science.
The better question to ask is whether a tool, be it a book, a thermometer, or an LLM, is useful. Error rates affect utility, which means that distinctions between correct and incorrect signals are more important than attempts to define arbitrary labels for the tools themselves.
You're attempting to discount a tool based on everything other than its utility.
Ah okay I understand, I think. So basically that's solipsism applied to a LLM?
I think that's taking things a bit too far though. You can define hallucination in a more useful way. For instance you can say 'hallucination' is when the information in the input doesn't make it to the output. It is possible to make this precise, but it might be impractically hard to measure it.
An extreme version would be an En->FR translation model that translates every sentence into 'omelette du fromage'. Even if it's right, the input didn't actually affect the output one bit, so it's a hallucination. Compared to a model that actually changes the output when the input changes, it's clearly worse.
Conceivably you could check if the probability of a sentence actually decreases if the input changes (which it should if it's based on the input), but given the nonsense models generate at a temperature of 1 I don't quite trust them to assign meaningful probabilities to anything.
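To make that probability check concrete, here is a minimal sketch, assuming a small Hugging Face causal LM (gpt2 purely as a stand-in) and a toy translation example of my own: score the same candidate output against two different inputs, and if the scores barely differ, the output is not conditioned on the input in the sense described above.

```python
# Minimal sketch: does the input actually change the probability of the output?
# gpt2 and the translation example are illustrative assumptions only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def output_logprob(prompt: str, output: str) -> float:
    """Sum of log-probabilities the model assigns to `output` given `prompt`."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + output, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Score only the tokens belonging to `output`, each predicted from its prefix.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    out_positions = range(prompt_ids.shape[1] - 1, full_ids.shape[1] - 1)
    return sum(log_probs[i, full_ids[0, i + 1]].item() for i in out_positions)

candidate = " The cheese omelette is ready."
p_real = output_logprob("Translate to English: L'omelette au fromage est prete.", candidate)
p_scrambled = output_logprob("Translate to English: Le chat dort sur le canape.", candidate)
# If the two scores are nearly identical, the output is not conditioned on the input.
print(p_real, p_scrambled)
```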
No, your constant output example isn't what people are talking about with "hallucination." It's not about destroying information from the input, in the sense that if you asked me a question and I just ignored you, I'm not in general hallucinating. Hallucinating is more about sampling from a distribution which extends beyond what is factually true or actually exists, such as citing a non-existent paper, or inventing a historical figure.
> It's offering support for the claim that LLMs hallucinate 100% of the time, even when their hallucinations happen to be true.
Well this makes the term "hallucinate" completely useless for any sort of distinction. The word then becomes nothing more than a disparaging term for an LLM.
Not really. It distinguishes LLM output from human output even though they look the same sometimes. The process by which something comes into existence is a valid distinction to make, even if the two processes happen to produce the same thing sometimes.
It makes sense to do so in the same way that it’s useful to distinguish quantum mechanics from classical mechanics, even if they make the same predictions sometimes.
A proposition of any kind of mechanics is what can be true or false. The calculations are not what makes up the truth of a proposition, as you’ve pointed out.
But then again, a neighboring country that thinks the land is theirs would say that road isn't a road at all. Depending on what country you're in, it makes a difference.
> The real question is what I, the human, get out of reading those words, if I end up correctly navigating and understanding something about the world based on what I read.
No, the real question is how you will end up correctly navigating and understanding something about the world from a falsehood crafted to be optimally harmonious with the truths that happen to surround it.
A pixel (in the context of an image) could be "wrong" in the sense that its assigned value could lead to an image that just looks like a bunch of noise. For instance, suppose we set every pixel in an image to some random value. The resulting image would look like total noise, and we humans wouldn't recognize it as a sensible image. By providing a corpus of accepted images, we can train a model to generate images (arrays of pixels) which look like images and not, say, random noise. Now it could still generate an image of some place or person that doesn't actually exist, so in that sense the pixels are collectively lying to you.
> let me put it to you this way: what is the LLM doing differently when it generates tokens that are "wrong" compared to when the tokens are "right"?
It is conditioning on latents about truth, falsity, reliability, and calibration. All of these inferred latents have been shown to exist inside LLMs, as they need to exist for LLMs to do their jobs in accurately predicting the next token. (Imagine trying to predict tokens in, say, discussions about people arguing or critiques of fictional stories, or discussing mistakes made by people, and not having latents like that!)
LLMs also model other things: for example, you can use them to predict demographic information about the authors of texts (https://arxiv.org/abs/2310.07298), even though this is something that pretty much never exists IRL, a piece of text with a demographic label like "written by a 28yo"; it is simply a latent that the LLM has learned for its usefulness, and can be tapped into. This is why a LLM can generate text that it thinks was written by a Valley girl in the 1980s, or text which is 'wrong', or text which is 'right', and this is why you see things like in Codex, they found that if the prompted code had subtle bugs, the completions tend to have subtle bugs - because the model knows there's 'good' and 'bad' code, and 'bad' code will be followed by more bad code, and so on.
This should all come as no surprise - what else did you think would happen? - but pointing out that for it to be possible, the LLM has to be inferring hidden properties of the text like the nature of its author, seems to surprise people.
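For readers who want to see what a claim like this looks like in practice, here is a minimal sketch of the standard linear-probe experiment: extract hidden states for labeled statements and check whether a linear classifier can separate them. The model (gpt2), the layer, and the toy dataset are illustrative assumptions, nothing like the scale of the actual studies, and whether such probes generalize is exactly what the replies below dispute.

```python
# Minimal linear-probe sketch over hidden states; all choices here are toy assumptions.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

def last_token_state(text: str, layer: int = 6) -> torch.Tensor:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden = model(ids).hidden_states[layer]
    return hidden[0, -1]  # representation at the final token

statements = [
    ("Paris is the capital of France.", 1),
    ("Two plus two equals four.", 1),
    ("The moon is made of cheese.", 0),
    ("Paris is the capital of Japan.", 0),
]
X = torch.stack([last_token_state(s) for s, _ in statements]).numpy()
y = [label for _, label in statements]

probe = LogisticRegression(max_iter=1000).fit(X, y)
# The interesting question is whether such a probe generalizes to held-out
# statements; fitting the training set alone proves very little.
print(probe.score(X, y))
```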
> It is conditioning on latents about truth, falsity, reliability, and calibration. All of these inferred latents have been shown to exist inside LLMs, as they need to exist for LLMs to do their jobs in accurately predicting the next token.
No, it isn't, and no, they haven't [1], and no, they don't.
The only thing that "needs to exist" for an LLM to generate the next token is a whole bunch of training data containing that token, so that it can condition based on context. You can stare at your navel and claim that these higher-level concepts end up encoded in the bajillions of free parameters of the model -- and hey, maybe they do -- but that's not the same thing as "conditioning on latents". There's no explicit representation of "truth" in an LLM, just like there's no explicit representation of a dog in Stable Diffusion.
Do the thought exercise: if you trained an LLM on nothing but nonsense text, would it produce "truth"?
LLMs "hallucinate" precisely because they have no idea what truth means. It's just a weird emergent outcome that when you train them on the entire internet, they generate something close to enough to truthy, most of the time. But it's all tokens to the model.
[1] I have no idea how you could make the claim that something like a latent conceptualization of truth is "proven" to exist, given that proving any non-trivial statement true or false is basically impossible. How would you even evaluate this capability?
> In this work, we curate high-quality datasets of true/false statements and use them to study in detail the structure of LLM representations of truth, drawing on three lines of evidence: 1. Visualizations of LLM true/false statement representations, which reveal clear linear structure. 2. Transfer experiments in which probes trained on one dataset generalize to different datasets. 3. Causal evidence obtained by surgically intervening in a LLM's forward pass, causing it to treat false statements as true and vice versa. Overall, we present evidence that language models linearly represent the truth or falsehood of factual statements.
You can debate whether the 3 experiments cited back the claim (I don't believe they do), but they certainly don't prove what OP claimed. Even if you demonstrated that an LLM has a "linear structure" when validating true/false statements, that's a whole universe away from having a concept of truth that generalizes, for example, to knowing when nonsense is being generated based on conceptual models that can be evaluated to be true or false. It's also very different to ask a model to evaluate the veracity of a nonsense statement vs. avoiding the generation of a nonsense statement. The former is easier than the latter, and probably could have been done with earlier generations of classifiers.
Colloquially, we've got LLMs telling people to put glue on pizza. It's obvious from direct experience that they're incapable of knowing true and false in a general sense.
> [...] but they certainly don't prove what OP claimed.
OP's claim was not: "LLMs know whether text is true, false, reliable, or is epistemically calibrated".
But rather: "[LLMs condition] on latents *ABOUT* truth, falsity, reliability, and calibration".
> It's also very different to ask a model to evaluate the veracity of a nonsense statement, vs. avoiding the generation of a nonsense statement [...] probably could have been done with earlier generations of classifiers
Yes. OP's point was not about generation, it was about representation (specifically conditioning on the representation of the [con]text).
Your aside about classifiers is not only very apt, it is also exactly OP's point! LLMs are implicit classifiers, and the features they classify have been shown to include those that seem necessary to effectively predict text!
> It's obvious from direct experience that they're incapable of knowing true and false in a general sense.
Yes, otherwise they would be perfect oracles, instead they're imperfect classifiers.
Of course, you could also object that LLMs don't "really" classify anything (please don't), at which point the question becomes how effective they are when used as classifiers, which is what the cited experiments investigate.
> But rather: "[LLMs condition] on latents ABOUT truth, falsity, reliability, and calibration".
Yes, I know. And the paper didn't show that. It projected some activations into low-dimensional space, and claimed that since there was a pattern in the plots, it's a "latent".
The other experiments were similarly hand-wavy.
> Your aside about classifiers is not only very apt, it is also exactly OP's point! LLMs are implicit classifiers, and the features they classify have been shown to include those that seem necessary to effectively predict text!
That's what's called a truism: "if it classifies successfully, it must be conditioned on latents about truth".
> "if it classifies successfully, it must be conditioned on latents about truth"
Yes, this is a truism. Successful classification does not depend on latents being about truth.
However, successfully classifying between text intended to be read as either:
- deceptive or honest
- farcical or tautological
- sycophantic or sincere
- controversial or anodyne
does depend on latent representations being about truth (assuming no memorisation, data leakage, or spurious features)
If your position is that this is necessary but not sufficient to demonstrate such a dependence, or that reverse engineering the learned features is necessary for certainty, then I agree.
But I also think this is primarily a semantic disagreement. A representation can be "about something" without representing it in full generality.
So to be more concrete: "The representations produced by LLMs can be used to linearly classify implicit details about a text, and the LLM's representation of those implicit details condition the sampling of text from the LLM".
My sense is an LLM is like Broca's area. It might not reason well, but it'll make good sounding bullshit. What's missing are other systems to put boundaries and tests on this component. We do the same thing too: hallucinate up ideas reliably, calling it remembering, and we do one additional thing: we (or at least the rational) have a truth-testing loop. People forget that people are not actually rational, only their models of people are.
One of the surprising results in research lately was the theory of mind paper the other week that found around half of humans failed the transparent boxes version of the theory of mind questions - something previously assumed to be uniquely a LLM failure case.
I suspect over the next few years we're going to see more and more behaviors in LLMs that turn out to be predictive of human features.
The terminology is wrong but your point is valid. There is no internal criterion or mechanism for statement verification. As the mind likely is also in part a high-dimensional construct, and LLMs to an extent represent our collective jumble of 'notions', it is natural that their output resonates with human users.
Q1: A ""correct" symbolic representation" of x. What is x? Your "Is there an intent to communicate, or" choice construct is problematic. Why would one require a "symbolic representation" of x, x likely being a 'meaningful thought'. So this is a hot debate whether semantics is primary or merely application. I believe it is primary in which case "symbolic representation" is 'an aid' to gaining a concrete sense of what is 'somehow' 'understood'. You observe a phenomena, and understand its dynamics. You may even anticipate it while observing. To formalize that understanding is the beginning of 'expression'.
Q2: because while there is a function LLM(encodings, q) that emits 'plausible' responses, an equivalent function for Pi does not exist outside of 'pure inexpressible realm of understanding' :)
>I believe it is primary in which case "symbolic representation" is 'an aid' to gaining a concrete sense of what is 'somehow' 'understood'.
There is nothing magic about perception to distinguish it meaningfully from symbolic representation; in point of fact, that which you experience is in and of itself a symbolic representation of the world around you. You do not sense the frequencies outside the physical transduction capabilities of the human ear, or the wavelengths similarly beyond the capability to transduce of the human eye, or feel distinct contact beyond the density of haptic transduction of somatic nerves. Nevertheless, those phenomena are still there, and despite their insensible nature, have an effect on you. Your entire perception is a map, which one would be well advised to not mistake for the territory. To dismiss symbolic representation as something that only occurs on communication after perception is to "lose sight" of the fact that all the data your mind integrates into a perception is itself, symbolic.
Communication and symbolic representation are all there is, and it happens long before we even get to the part of reality where I'm trying to select words to converse about it with you.
> There is nothing magic about perception to distinguish it meaningfully from symbolic representation; in point of fact, that which you experience is in and of itself a symbolic representation of the world around you.
You're right that there's nothing magic about it at all. The mind operates on symbolic representations, but whether those are symbolic representations of external sensory input or symbolic representations of purely endogenous stochastic processes makes for a night-and-day difference.
Perception is a map, but it's a map of real territory, which is what makes it useful. Trying to navigate reality with a map that doesn't represent real territory is not just useless, it's dangerous.
> As the mind likely is also in part a high dimensional construct and LLMs to an extent represent our collective jumble of 'notions' it is natural that their emits resonate with human users.
But humans are equipped with sensory input, allowing us to formulate our notions by reference to external data, not just generate notions by internally extrapolating existing notions. When we fail to do this, and do formulate our notions entirely endogenously, that's when we say we are hallucinating.
Since LLMs are only capable of endogenous inference, and are not able to formulate notions based on empirical observation, they are always hallucinating.
> what is the LLM doing differently when it generates tokens that are "wrong" compared to when the tokens are "right"?
When they don't recall correctly, it is hallucination. When they recall perfectly, it is regurgitation/copyright infringement. We find issue either way.
May I remind you that we also hallucinate; memory plays tricks on us. We often google stuff just to be sure. It is not the hallucination part that is a real difference between humans and LLMs.
> Why do humans produce speech?
We produce language to solve body/social/environment related problems. LLMs don't have bodies but they do have environments, such as a chat room, where the user is the environment for the model. In fact chat rooms produce trillions of tokens per month worth of interaction and immediate feedback.
If you look at what happens with those trillions of tokens - they go into the heads of hundreds of millions of people, who use the LLM assistance to solve their problems, and of course produce real world effects. Then it will reflect in the next training set, creating a second feedback loop between LLM and environment.
By the way, humans don't produce speech individually, if left alone, without humanity as support. We only learn speech when we get together. Language is social. The human brain is not so smart on its own, but language collects experience across generations. We rely on language for intelligence to a larger degree than we like to admit.
Isn't it a mystery how LLMs learned so many language skills purely from imitating us, without their own experience? It shows just how powerful language is on its own. And it shows it can be independent of substrate.
Bonus question 2 is the most ridiculous straw man I've seen in a very long time.
The existence of arbitrary string encodings in transcendental numbers proves absolutely nothing about the processing capabilities of adaptive algorithms.
Exactly. Reading digits of pi doesn’t converge toward anything. (And neither do infinite typewriter monkeys.) Any correct value they get is random, and exceedingly rare.
LLMs corral a similar randomness to statistically answer things correctly more often than not.
Here's the issue: humans do the same thing. The brain builds up a model of the world, but the model is not the world. It is a virtual approximation or interpretation based on training data: past experiences, perceptions, etc.
A human can tell you the sky is blue based on its model. So can any LLM. The sky is blue. So the output from both models is truthy.
> A human can tell you the sky is blue based on its model. So can any LLM. The sky is blue. So the output from both models is truthy.
But a human can also tell you that the sky is blue based on looking at the sky, without engaging in any model-based inference. An LLM cannot do that, and can only rely on its model.
Humans can engage in both empirical observation and stochastic inference. An LLM can only engage in stochastic inference. So while both can be truthy, only humans currently have the capacity to be truthful.
It's also worth pointing out that even if human minds worked the same way as LLMs, our training data consists of an aggregation of exactly those empirical observations -- we are tokenizing and correlating our actual experiences of reality, and only subsequently representing the output of our inferences with words. The LLM, on the other hand, is trained only on that second-order data -- the words -- without having access to the much more thorough primary data that it represents.
A witness to a crime thinks that there were 6 shots fired; in fact there were only 2. They remember correctly the gender of the criminal, the color of their jacket, the street corner where it happened, and the time. There is no difference in their mind between the true memories and the false memory.
I write six pieces of code that I believe have no bugs. One has an off-by-one error. I didn't have any different experience writing the buggy code than I did writing the functional code, and I must execute the code to understand that anything different occurred.
Shall I conclude that myself and the witness were hallucinating when we got the right answers? That intelligence is not the thing that got us there?
> Shall I conclude that myself and the witness were hallucinating when we got the right answers?
If you were recalling stored memories of experiences that were actual interactions with external reality, and some of those memories were subsequently corrupted, then no, you were not hallucinating.
If you were engaging in a completely endogenous stochastic process to generate information independently of any interaction with external reality, then yes, you were hallucinating.
> That intelligence is not the thing that got us there?
It's not. In both cases, the information you are recalling is stored data generated by external input. The storage medium happens to be imperfect, however, and occasionally flips bits, so later reads might not exactly match what was written.
But in neither case was the original data generated via a procedural algorithm independently of external input.
People who are consistently unable to distinguish fiction from reality make terrible witnesses; even an obviously high crackhead would fare better than an LLM on the witness stand.
Do we actually think this way though? When I am talking with someone I am cognating about what information and emotion I want to impart to the person / thinking about how they are perceiving me and the sentence construction flows from these intents. Only the sentence construction is even analogous to token generation, and even then, we edit our sentences in our heads all the time before or while talking. Instead of just a constant forward stream of tokens from the LLM.
>what is the LLM doing differently when it generates tokens that are "wrong" compared to when the tokens are "right"? If there is a difference, where does that exist? In the mechanism of the LLM, or in your mind?
If there were a detectable difference within the mechanism, the problem of hallucinations would be easy to fix. There may be ways to analyze logits to find patterns of uncertainty characteristics related to hallucinations. Perhaps deeper introspection of weights might turn up patterns.
The difference isn't really in your mind either. The difference is simply that the one answer correlates with reality and the other does not.
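To make the "analyze logits for uncertainty" idea above concrete, here is a minimal sketch that tracks the entropy of the next-token distribution during greedy decoding. High entropy is at best a weak, noisy proxy for potential hallucination, and gpt2 is used purely as a stand-in model.

```python
# Sketch: per-step entropy of the next-token distribution as a rough uncertainty signal.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The first person to walk on the moon was"
ids = tok(prompt, return_tensors="pt").input_ids

for _ in range(10):
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-12)).sum().item()
    next_id = torch.argmax(probs).unsqueeze(0).unsqueeze(0)
    print(f"{tok.decode(next_id[0]):>12}  entropy={entropy:.2f}")
    ids = torch.cat([ids, next_id], dim=1)
```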
The point of AI models is to generalize from the training data, that implicitly means generating output that it hasn't seen as input.
Perhaps the issue is not so much that it is generalizing/guessing but the degree to which making a guess is the right call is dependent on context.
If I make a machine that makes random sounds in approximately the human vocal range, and occasionally humans who listen to it hear "words" (in their language, naturally), then is that machine telling the truth when words are heard and "hallucinating" the rest of the time?
>what is the LLM doing differently when it generates tokens that are "wrong" compared to when the tokens are "right"?
When an LLM is trained, it essentially compresses the knowledge of the training data corpus into a world model. "Right" and "wrong" are thereby only emergent when you have a different world model for yourself that tells you a different answer, most likely because the LLM was undertrained on the target domain. But to the LLM, the token with the highest probability will be the most likely correct answer, similarly to how you might have a "gut feeling" when asked about something which you clearly don't understand and have no education in. And you will be wrong just as often. The perceived overconfidence of wrong answers likely stems from human behaviour in the training data as well. LLMs are not better than humans, but they are also not worse. They are just a distilled encoding of human behaviour, which in turn might be all that the human mind is in the end.
LLMs become fluent in constructing coherent, sophisticated text in natural language from training on obscene amounts of coherent, sophisticated text in natural language. Importantly, there is no such corpus of text that contains only accurate knowledge, let alone knowledge as it unambiguously applies to some specific domain.
It's unclear that any such corpus could exist (a millennia-old discussion in philosophy with no possible resolution), but even if you take for granted that such a corpus could, we don't have one.
So what happens is that after learning how to construct coherent, sophisticated text in natural language from all the bullshit-addled general text that includes truth and fiction and lies and fantasy and bullshit and garbage and old text and new text, there is a subsequent effort to sort of tune the model in on generating useful text towards some purpose. And here, again, it's important to distinguish that this subsequent training is about utility ("you're a helpful chatbot", "this will trigger a function call that will supplement results", etc.) and so still can't focus strictly on knowledge.
LLMs can produce intelligent output that may be correct and may be verifiable, but the way they work and the way they need to be trained prevents them from ever actually representing knowledge itself. The best they can do is create text that is more or less fluent and more or less useful.
It's awesome and has lots and lots of potential, but it's a radically different thing than a material individual that's composed of countless disparate linguistic and non-linguistic systems that have never yet been technologically replicated or modeled.
Wrong. This is the common groupspeak on uninformed places like HN, but it is not what the current research says. See e.g. this: https://arxiv.org/abs/2210.13382
Most of what you wrote shows that you have zero education in modern deep learning, so I really wonder what makes you form such strong opinions on something you know nothing about.
The person you are replying to, said it clearly: "there is no such corpus of text that contains only accurate knowledge"
Deep learning learns a model of the world, and this model can be as inaccurate as it gets. Earth may as well have 10 moons for a DL model. In order for Earth to have only 1 moon, there has to be a dataset which encodes only that information, and never more than one moon. A drunk person who stares at the moon, sees more than one moon, and writes about that on the internet has to be excluded from the training data.
Also, the model of the Othello world is very different from a model of the real world. I don't know about Othello, but in chess it is pretty well known that there are more possible chess positions than there are atoms in the universe. For all practical purposes, the dataset of all possible chess positions is infinite.
The dataset of every possible event on earth, every second is also more than all the atoms in the universe. For all practical purposes, it is infinite as well.
Do you know that one dataset is more infinite than the other? Does modern DL state that all infinities are the same?
Wrong again. When you apply statistical learning over a large enough dataset, the wrong answers simply become random normal noise (a consequence of the central limit theorem) - the kind of noise which deep learning has always excelled at filtering out, long before LLMs were a thing - and the truth becomes a constant offset. If you have thousands of pictures of dogs and cats and some were incorrectly labelled, you can still train a perfectly good classifier that will achieve more or less 100% accuracy (and even beat humans) on validation sets. It doesn't matter if a bunch of drunk labellers tainted the ground truth as long as the dataset is big enough. That was the state of DL 10 years ago. Today's models can do a lot more than that. You don't need infinite datasets, they just need to be large enough and cover your domain well.
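As a toy illustration of the label-noise point (and only that point): flip a chunk of training labels on a synthetic problem and a simple classifier still scores well against the clean test labels. The dataset, noise rate, and model below are arbitrary illustrative choices and say nothing about the far messier notion of "truth" in web-scale text.

```python
# Toy sketch: training labels with 15% random flips, evaluated on clean test labels.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
noisy = y_train.copy()
flip = rng.random(len(noisy)) < 0.15          # mislabel 15% of training examples
noisy[flip] = 1 - noisy[flip]

clf = LogisticRegression(max_iter=1000).fit(X_train, noisy)
print("accuracy on clean test labels:", clf.score(X_test, y_test))
```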
> You don't need infinite datasets, they just need to be large enough and cover your domain well.
When you are talking about distinguishing noise from a signal, or truth from not-totally-truth, and the domain is sufficiently small, e.g. a game like Othello or data from a corporation, then I agree with everything in your comment.
When the domain is huge, then distinguishing truth from lies/non-truth/not-totally-truth is impossible. There will not be such a high quality dataset, because everything changes over time, truth and lies are a moving target.
If we humans cannot distinguish between truth and non-truth, but the A.I. is able to, then we are talking about AGI. Then we can put the machines to work discovering new laws of physics. I am all for it, I just don't see it happening anytime soon.
What you're talking about is by definition no longer facts but opinions. Even AGI won't be able to turn opinions into facts. But LLMs are already very good at giving opinions rather than facts thanks to alignment training.
> When an LLM is trained, it essentially compresses the knowledge of the training data corpus into a world model
No, you added an extra 'l'. It's not a world model, it's a word model. LLMs tokenize and correlate objects that are already second-order symbolic representations of empirical reality. They're not producing a model of the world, but rather a model of another model.
Do you have a citation for the existence of an LLM "world model"?
My understanding of retrieval-augmented generation is that it is an attempt to add a world model (based on a domain-specific knowledge database) to the LLM; the result of the article is that the combination still hallucinates frequently.
I might even go further to suggest that the latter part of your comment is entirely anthropomorphization.
> If there is a difference, where does that exist? In the mechanism of the LLM, or in your mind?
Thank you for this sentence: it is hard to get across how often Gen-AI proponents are actually projecting perceived success onto LLMs while downplaying error.
You mostly see people projecting perceived error onto LLMs?
I don't think I've seen a single article about an AI getting things wrong, recently, where there was a nuanced notion about whether it was actually wrong.
I don't think we're anywhere close to "nuanced mistakes are the main problem" yet.
But the errors are fundamental, and the successes actually subjective as a result.
That is, it appears to get things right, really a lot, but the conclusions people draw about why it gets things right are undermined by the nature of the errors.
Like, it must have a world model, it must understand the meaning of... etc.; the nature of the errors they are downplaying fundamentally undermines the certainty of these projections.
Large language models are one particular implementation of machine intelligence, optimized for generating language but not necessarily for being true or for being grounded in, or sourceable to, reference material or facts.
I think we're taking models that have been optimized for a particular thing and repurposing them towards new tasks and then assuming that the principles of large language models as we have developed them are just universally inherent properties of AI, which I think is not at all the case.
Conceivably, you absolutely could train models where their own built-in metric of success involves a notion of appropriateness of citations. The models we have simply don't do that, but I don't think we should proceed to the assumption that we are witnessing the future of AI with hallucinations everywhere, I just think it's a lazy repurposing of systems that were not initially designed with these things in mind.
Designing the metric of success is the hard part. The magic of LLMs is that they can learn a lot with unlabeled data and a relatively simple cost function. They can be trained on the mountains of data we just have lying around without a ton of additional labeling or clever domain-specific metrics.
If this approach doesn't really generalize well, it negates a lot of the usefulness of these methods and suggests we are further away than we might think.
I think the way out of this is training models on multimodal data where one of the reasoning modes is reasoning graphs or knowledge graphs or schematics.
We might gradually improve at constructing graphs from data like text, images, and speech. And we could create systems that correctly create synthetic data from generated graphs to serve as training data for a large multimodal model.
> optimized for a particular thing and repurposing them
A lot of human exuberance (and thus investment) is riding on the questionable idea that a really good text fragments correlation specialist system can usefully impersonate a generalist AI. ("LLM, which rocks are the best to eat?")
Even worse: Some people assume an exceptional text-specialist model will usefully meta-impersonate a generalist model impersonating a specific and different kind of specialist! ("LLM, do my taxes.")
How many of the large systems in the world (civil infrastructure, bureaucracies, the internet, etc) don't involve a great deal of repurposing previously present systems for purposes beyond their original design intent? I don't think we're guaranteed to be stuck with hallucinating AI embedded in everything, but if something gets popular enough to work its way into a lot of tech stacks, it can easily persist well beyond the point where anyone designing things from the ground up would include it simply because it was already there.
Yeah, as far as my personal use cases go: code or basic Q&A that sort of chains together naturally. It's easy enough for me to quickly test these ideas or code. I can typically eyeball it when it is wrong, and if not, I'll run it and see.
Done. It works for me, but mostly because I can easily validate and I'm confident in quick / painless validations on my end.
Something in depth that involves a great deal more multi step reasoning to get from point A to B, I've got my doubts. If anything that's where I see AI wander off the path into a mess.
But how do you do that? The model is a pattern recognition engine and the problem we call "hallucination" is actually that the model is recognizing patterns from data that is similar to, but not an exact match to what we want. In order to eliminate this issue with the current architecture it would require a model trained on only the specific data set we want answers to be drawn from. But that isn't a viable solution, because just getting responses that aren't total nonsense requires a training set so large as to make this impossible for specialized types of data.
> The model is a pattern recognition engine and the problem we call "hallucination" is actually that the model is recognizing patterns from data that is similar to, but not an exact match to what we want.
But that's a terrible fit for law. In law, the differences really matter.
Maybe two models? One like current LLMs, generating the usual bullshit. A second model trained to map output from the first model to reliable citations or mapping to some value from 0 to 1 predicting the confidence of the models accuracy.
Clearly I am just bullshitting myself here, I don't know how to train the second model. Something mapping text to reliable sources...(waves hands)
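One concrete shape the hand-waved "second model" could take is an NLI-style verifier that scores whether a retrieved source passage entails a generated claim. A minimal sketch follows, using a public NLI checkpoint; treating its entailment probability as a 0-to-1 confidence score is exactly the leap being hand-waved here, and retrieving the right source passage in the first place is the hard part.

```python
# Sketch: score a generated claim against a source passage with an NLI model.
# The legal example text is made up for illustration.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "facebook/bart-large-mnli"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

source = "In Dobbs v. Jackson (2022), the Supreme Court overruled Roe v. Wade."
claim = "Roe v. Wade is still controlling precedent."

inputs = tok(source, claim, return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
# Labels are contradiction / neutral / entailment; only entailment by a trusted
# source should count toward the claim's reliability.
for label_id, label in model.config.id2label.items():
    print(label, round(probs[label_id].item(), 3))
```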
But how do you do "reliable citations" with the current architecture? You still have the problem that it is at its core a pattern recognition engine. It will just be "looks similar to all the reliable citations in the training set for similar subjects" not "this is the correct citation for your specific query."
> Something mapping text to reliable sources...(waves hands)
You mean basically Google search? What you want is an intelligent search engine; no such search engine exists today, but not due to lack of trying. This is a trillion-dollar problem.
Not to say that it is easy in absolute terms, but I'd argue that true/false'ing a statement, e.g. "humans should eat 1 rock a day" is a categorically easier problem than answering "What should humans eat"?
For fun/example I asked gpt3.5 "What percent of dieticians would suggest eating one rock a day is good for your health?" And got a pretty solid if wordy 'none'.
I was wondering if these are fly by night kinda random AI startup type companies offering these services.
>we put the claims of two providers, LexisNexis (creator of Lexis+ AI) and Thomson Reuters (creator of Westlaw AI-Assisted Research and Ask Practical Law AI), to the test
LexisNexis, and Thomson Reuters, those are established big companies offering these services.
If I was a law firm, I would not trust my client’s information with Harvey or OpenAI. It honestly doesn’t matter what they promise if my reputation is on the line. No company can promise not to be hacked or leaked.
I agree from the perspective that they're probably using the data to do training, which is a big big no-no in the space, but in terms of hacking, that's true of basically any platform. Certifications like SOC 2 and FedRAMP exist, and even old players like Relativity are moving into the cloud. Unless law firms suddenly start wanting to manage their own IT infrastructure, trusting some slice of tech is necessary.
I’d say a small law firm. They don’t have a security team, they’re routinely targeted by phishers, and their data is easier to convert into money than the huge mass of (mostly completely uninteresting) data fed into ChatGPT.
I'd be interested to see other examples of Lexis+ AI failures, because I thought the example answer shown in the article appeared to be the "least wrong" one given that the answer was correct until Dobbs was decided less than 2 years ago.
I'm not in the LLM training space, but I'd also think that that kind of failure would be the easiest to fix, i.e. instead of starting out with "Currently...", start out with "As of <date>..."
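A minimal sketch of that "As of <date>" idea: stamp every retrieved passage with its date and instruct the model to qualify its answer temporally. The `retrieved` structure, the field names, and the prompt wording below are illustrative assumptions, not any vendor's actual pipeline.

```python
# Sketch: date-stamped retrieved chunks plus an instruction to answer "As of <date>, ...".
from datetime import date

retrieved = [
    {"text": "Roe v. Wade (1973) recognized a constitutional right to abortion.",
     "decided": date(1973, 1, 22)},
    {"text": "Dobbs v. Jackson (2022) overruled Roe v. Wade.",
     "decided": date(2022, 6, 24)},
]

def build_prompt(question: str) -> str:
    # Newest authority first, each chunk labeled with its date.
    chunks = sorted(retrieved, key=lambda c: c["decided"], reverse=True)
    context = "\n".join(f"[decided {c['decided']:%Y-%m-%d}] {c['text']}" for c in chunks)
    return (
        "Answer using only the sources below. Begin the answer with "
        f"'As of {date.today():%Y-%m-%d}, ...' and note any source that has been "
        "overruled by a later one.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

print(build_prompt("Is Roe v. Wade still good law?"))
```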
Thanks very much for linking that. This makes me think that legal support is actually one of the worst possible uses of LLM-based AI (at least as implemented here), primarily because so much of the source material is directly contradictory, e.g. a legislature passes a law which is subsequently overturned by the courts, or decisions in lower courts are reversed by higher courts, or higher courts reverse themselves over time. It feels like you'd absolutely have to annotate all the source material in some way to say whether it was still controlling law/precedent.
Your instinct is correct. The major legal research providers (Thomson Reuters and LexisNexis) both provide “citators”, which are human annotations of which cases and statutes have been overruled, upheld, criticized, etc. One of the issues the paper describes is the fairly ham-handed way this gets integrated into these systems, causing even more trouble.
Pretty much the same when I try to get an LLM to output correct code targeting a certain library, but its training data contains various conflicting versions of the library, and the result is an incompatible mix composed of code for different versions thrown together.
Is there actually an effective way to handle queries with RAG where time periods are relevant? I made a proof-of-concept RAG for documents on a government website shortly after GPT-3.5 came out and remember this being a big problem. The most glaring wrong answer was "Who is currently the governor?" It answered with the previous governor, likely because he was listed as such in 8 years of documents versus the 2 years at the time for the current governor.
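One common mitigation, sketched below, is to make retrieval itself time-aware: blend embedding similarity with a recency decay (or filter on document dates outright) so the many older documents stop outranking the few current ones. The half-life and the multiplicative blend are arbitrary illustrative choices, not a known-good recipe, and the example documents are made up.

```python
# Sketch: rerank retrieved candidates by similarity weighted with a recency decay.
import math
from datetime import date

def recency_weight(doc_date: date, half_life_days: float = 365.0) -> float:
    age = (date.today() - doc_date).days
    return 0.5 ** (age / half_life_days)

def rerank(candidates: list[dict]) -> list[dict]:
    # Candidates are assumed to carry a precomputed embedding similarity
    # and a publication date from document metadata.
    return sorted(
        candidates,
        key=lambda d: d["similarity"] * recency_weight(d["published"]),
        reverse=True,
    )

docs = [
    {"text": "Governor A. Smith announced the 2016 budget.", "similarity": 0.82,
     "published": date(2016, 3, 1)},
    {"text": "Governor B. Jones was sworn in last January.", "similarity": 0.79,
     "published": date(2023, 1, 10)},
]
for d in rerank(docs):
    print(round(d["similarity"] * recency_weight(d["published"]), 3), d["text"])
```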
I'd suggest they face more constraints, to the point of it being impossible for them to create something good here:
1) Executives who don't understand LLMs at all and don't really understand the process of building products under massive uncertainty around product quality.
2) Executives who have grown up building products which didn't need insane amounts of continual tuning.
3) Internal process that isn't set up for the types of products mentioned in 1 and 2
4) Entire groups of stakeholders in the same position as executives, putting pressure on the people building these things.
To make an analogy - this space is closer to the self-driving car project at Google. Zero product risk, the technology is sort of all there, but putting it all together is difficult and it's gonna be ready when it's ready. Pushing something out to meet a schedule ain't gonna work.
This is the (near term) future of white collar jobs: AI does most of the work, but real people will still be required to audit and verify basically everything. I don't know how much net gain productivity there will be...in certain professions an error rate that high is unacceptable and dangerous.
The whole AI workflow sounds incredibly backwards - you give all these creative tasks to computers, which are notoriously bad at creative tasks, and then the drudgery of verification and review goes to humans, which are notoriously bad at verification and review.
Robots replaced most painters a century ago; they're called cameras. So what you are talking about has already happened many times over and started a long time ago; it's just easier to replace art since it doesn't have a precise requirement.
Or see social media replacing news etc, similar situation. It turns out that manipulating our emotions at scale isn't that hard.
Edit: Look at old things that were made; craftsmen put in extra work to make them beautiful. With modern things, machines took all that work and produce standardized things; machines already took over most of the art-making that lots of people used to engage in. Instead, today humans mostly do paperwork.
Edit2: The point was that technology has always made us move from art to ever more tedious work; the only exception was probably farming.
You can absolutely automate ornamentation with CNC. Companies choose not to because there are very few kinds of ornamentation that stand the test of time. The ornate things you see on Antiques Roadshow were, in their time, contemporary with tons and tons of crap that didn't survive. What you're seeing is that automation is used to produce crap to exacting standards, which frees craftsmen up to produce ornate pieces for very high cost. People also don't want to pay for ornament on disposable pieces, and most flat-pack furniture is disposable. So it's far more complex than you make it out to be.
I see a ton of ornamentation in poor old villages and tribes; craftsmen making ornamental pieces has always been a standard part of human culture. Today that part of human culture is mostly lost, and only a few hobbyists engage with it or buy it; most are happy with robot-made ornaments.
Sure, they exist, but they aren't nearly as established or high-status as a century ago. Technology did replace artists; some still do art, but they are very few and mostly just do it in their personal time instead of being professional artists. You can still draw your own art even with image generators; they didn't change that.
Before automatic photos, automatic music playing, etc., people who could play music or draw were highly sought after just to draw and play music everywhere, since there were no alternatives. Technology changed that, and today those are dying breeds.
The reason music performances aren't as high status as centuries ago is that it's been democratized. The barrier to entry is lower. Anyone can get a mic and amp and start busking on the street corner. I think that's a good thing for humanity at large.
Practice of art works better IMO as an avenue for creative expression than it does as a status symbol. This can be done even with no audience at all. Perhaps the audience is all consuming AI-generated work.
Status often comes with rarity. There's more art now, and that's a good thing. Some of it has status, but most is just people doing stuff because they can. And that's not a bad thing.
> The reason music performances aren't as high status as centuries ago is that it's been democratized
I'm talking about the median musician's status, not the higher end; that crashed when automatic music playing started to get good. Before then every little village needed to have musicians around for their celebrations and parties, knowing how to play music was respected as a proper profession then unlike now.
That goes for all the ornamental pieces that used to be done by craftsmen in villages as well, things like making nice signs or celebratory furniture or festival clothing; all that artistry that common people used to do is now mass-produced by machines that just copy a single design over and over.
So overall I'd say machines have so far massively reduced human artistry and cultural creation. The artisans of old didn't just copy either; humans aren't machines that can perfectly copy, so they all put a bit of themselves into their work. That wasn't soulless, and it is now mostly lost.
> Before then every little village needed to have musicians around for their celebrations and parties, knowing how to play music was respected as a proper profession then unlike now.
This isn't really right, I suspect. What is more likely is that before then, lots of people knew how to play some musical instruments, and most were rudimentary. Because music was foundational to cohesion.
"Musician" as a profession is really quite new -- as a sibling comment suggests, patronage/donations/tips/busking would have been the only way.
But then also it's fair to say that an entire class of reliable musical instruments produced at any kind of production scale is also quite new.
Essentially all valved brass and wind instruments are less than about 200 years old in design. The first modern classical guitar is also not much more than 200 years old, surprisingly. The first pianoforte is only 320 years old or so.
Many simple folk instruments are this sort of age -- the balalaika is at most as old as the piano, the ukulele much younger.
Few truly loud melodic instruments existed much beyond 1550; Amati's violin dates to then. Amazingly the rackett, an instrument often used to portray medieval wind music in films, is younger than the violin, and the crumhorn is not much older.
In the bad old days, the only practical way you could really learn music was to get a patron, or somehow make a living out of it. The barrier to entry was sufficiently high that it was a full-time commitment, one way or another.
Today, it's still possible to be a full-time musician, but by far, most musicians have day jobs. You can do music as a hobby.
On one hand, high fidelity music reproduction lowered the demand for performers, as you note. On the other hand, cheap high quality music equipment lowered the barrier to entry. Today you can play your piano piece on a sub-$1k electric keyboard that's portable and never needs to be tuned. It even sounds good. My opinion is that, on balance, the net human artistic output is way up. I also have no data to support this. But it just feels right.
I also think net human artistic and creative output is up and we just don't realise it -- therefore we do not think clearly of the damage generative AI will do to how we feel about the value of our lives.
My understanding was actually that technology enabled more time for leisure activities, one of those being art. Maybe not in the past few decades, but certainly as we moved through the Renaissance period and onwards.
Cameras replaced only a very specific kind of painting, one that some people still do as a hobby (realism).
To your second edit, the fantasy of automated robots has always been to move us in the direction of a leisure-based society. Even today, Elon musk is promoting robots and AI as the means to get "universal high income".
The fact that this often does not pan out (though arguably laundry machines, factory clothing and vacuum cleaners have done more than almost anything else in that regard) does not mean that the idea is in any way uncommon.
> Cameras replaced only a very specific kind of painting, one that some people still do as a hobby (realism).
That was by far the most common way for artists to provide for themselves though, cameras did take their jobs.
Similarly, text-to-image only solves a very specific part of art: art where you aren't very creative but instead draw an image based on a description. Artistry based on coming up with new things to draw will still be there. Sure, most jobs artists do today are based on drawing from descriptions, so it's the same situation as with the cameras.
Those are the equivalent of prompt-engineer artists; they use the new tool to create new art. It still removed most of the jobs; there isn't nearly as much need for photographers as there was for painters.
When you consider that the people who develop this stuff consider that a good way to write a song might be to get an AI to generate a completely generic, formulaic song and then have a musician fix it, it's not so hard to understand why they think it might work for the law.
Because that isn't how you write good songs any more than it's likely how you write good legal filings.
Depends on your definition of good. All popular songs have been highly formulaic for a long time. That's what makes them popular, they aren't challenging to listen to and sound roughly like everything else. I'm not going to get into the nitty gritty of music theory, but any serious musician can tell you that writing a good song is striking a balance between boring and novelty. I think AI generated music does in fact get you 95% of the way there. The other 5% is already being in the elite levels of the music industry.
That's even more confused. They never said formulaic songs become popular, they said there is a formula to make popular songs. Those are different statements, and the latter is clearly not true.
A "hook" is a pretty pure example of musicianship, because it has to be "hooky".
Gregory Bateson said that "information is the difference that makes a difference".
For a musician as for music fans, hooks are like this: a thing will stop being hooky when everyone uses it.
It just becomes part of music language. Indeed you could argue that many of the fundamental qualities of popular music are hooky qualities pushed down a layer or two.
Carly Rae Jepsen's "Call Me Maybe" is an example of this. It's full of very obvious, well-understood musical tricks -- syncopation, breakdowns, subtle pacing changes.
But it's also full of maddeningly effective hooks that are incredibly, even deviously well-crafted. People write songs like this to feel proud of writing songs that get into everyone's heads, but also because it gets into their heads.
Since generative tools don't know how to focus on craft -- or even know of its existence -- a good hook is something a generative algorithm will always struggle with, because it requires innovation, and because it is a complex, fragile element in itself.
The rest of it, as you say... it's likely quicker for an experienced musician to just do the work rather than keep poking a generative tool until it is right.
I bet you all of this holds true -- innovation, craft -- even to some extent in quite banal legal filings.
> but there is already a significant amount of automation for the mechanical parts.
Right -- and there can be ordinary, boring, individual scrutiny of what those pieces do, and data/code fixes for them.
I mean, it's no small amount of irony that the systems lawyers use are close to the kinds of "expert systems" that dominated AI development after the first AI winter.
It's true. I've integrated ChatGPT into my report writing work flow, and about all it's good for is brainstorming, and quickly formatting ideas. You've got to do the hard work yourself.
TL;DR: There's a lot of really crappy, rushed "AI" that has accumulated over the last 18 months. Yet AI is awesome. It's *really* stunning to live it. I cannot stress this enough. You could probably do a GPT wrapper startup today, and as long as you don't rush, roundly beat every single incumbent within a year.
Context:
I'm a sole developer who quit my job at Google 7 months ago, having decided to 18 months ago after getting a first-hand look at how search x AI was being built.
I just wrapped up 2 days of final benchmarking for release.
3-search-query RAG scores 97% on the USMLE, 6 points above "SOTA": a Gemini 1.5 tuned for med that does 4 rounds of N answers each round, then 3 search queries to resolve differences.
How am I possibly roundly beating that? Rushing, on their part. Lack of attention to basic details while just getting to the desired outcome, "we finetuned for med and beat GPT-4". And that's *Google*, infinitely resourced.
And Perplexity? AI startup darling? 76%. That's the $40/month version. The free version is at 66%. I'm absolutely stunned; it didn't seem great, but I didn't know it was actively much worse than just using ChatGPT.
If that's how Google and Perplexity are going, I can't even imagine the shenanigans that are going down at companies like these. They don't have infinite resources, expertise, or have it as a core competency. There's probably more effort put into who gets to work on the AI thing than working on the AI thing.
(I did legal & med benchmarking; for legal, to compare with TFA: Perplexity free 58%, Perplexity pro 67%, Llama 8B w/no internet 65%, GPT-4o x internet 90%.)
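For readers unfamiliar with the setup being described, here is a minimal sketch of a multi-search-query RAG loop applied to a multiple-choice benchmark. This is not the commenter's actual system; `llm` and `search` are hypothetical stand-ins for whatever model API and retrieval backend you plug in.

```python
# Minimal sketch of a "3-search-query RAG" loop for multiple-choice questions.
# `llm` and `search` are placeholder callables the caller supplies.
from typing import Callable, List

def answer_mcq(question: str, options: List[str],
               llm: Callable[[str], str],
               search: Callable[[str], List[str]],
               n_queries: int = 3) -> str:
    # 1. Ask the model to propose a few targeted search queries.
    queries = llm(
        f"Write {n_queries} short search queries (one per line) that would help "
        f"answer this exam question:\n{question}"
    ).splitlines()[:n_queries]

    # 2. Retrieve supporting passages for each query.
    passages = [p for q in queries for p in search(q)]

    # 3. Answer using only the retrieved context plus the question.
    context = "\n\n".join(passages)
    prompt = (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        f"Options: {', '.join(options)}\n"
        "Reply with the single best option, verbatim."
    )
    return llm(prompt).strip()
```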
Industries do explore new products on hype, but they don't generally become widespread or disruptive until they can actually demonstrate some value.
Companies are obliged or incentivized to start introducing products so that customers can see what's achievable, but it's not yet obvious where generative AI products will be able to really take root and where they'll just be written off until some big innovation is made.
And whatever happens, it seems very unlikely that it will apply to white collar jobs in general any time soon. It will apply to some industries, but others almost certainly won't be able to get what they need from what seems possible so far -- or will only find use for it in certain narrow segments of their work.
That and the fact that we're no longer training and leveling up juniors. Companies want to hire just one senior employee and let them use AI as they would a team of junior employees.
But since we're no longer training juniors, we're no longer creating seniors. What happens when there's no one skilled enough to replace those senior employees when they leave?
> Why, if there is rigour and precedent, is there not a solution that doesn't start from a foundation of waffle and imprecision?
Law is a fact-specific discipline, laden with jargon, and where the answers can change on a pretty frequent basis. It's not that far off the mark to say that the answer to every legal question is "it depends." What you need is a system that can take a moderately free-form explanation from a user, tease out the legal aspects from that explanation to match against the legal database (which almost certainly isn't going to use the same terms the user used!) to arrive at the answer, and in the inevitable scenario where the user failed to provide enough information, also be able to recognize that more information is needed to answer the question, and query the user (in their own terms) to get them to provide that information.
LLMs are decently good at handling the language translation aspects, definitely, far better than any previous AI solutions; so it's not hard to see why people try this stuff with LLMs.
So why not ask ChatGPT who to marry and follow that advice instead of your gut, it's just random after all!
No, people really care about reducing error rates. It doesn't matter that some things have higher error rates; if you can reduce the chance of a bad marriage or a bad contract, people really want to do that.
Well, as it relates to humans, typically at an orthodontist office you have the one licensed orthodontist and any number of dental assistants, and the licensed orthodontist can oversee the work of the assistants. The assistants do the majority of the work but aren't credentialed, but because they're overseen by the main orthodontist we regard the work as rising to a standard we associate with professional legitimacy.
I think the net result in these cases is that there's a lot more professional capacity to meet the needs of people who have braces.
I don't suspect AI is coming for our teeth anytime soon but there's precedent for it as an oversight structure that satisfies us of the legitimacy of certain types of work.
The assistants do stuff like flossing the braces and such, but the ortho comes in to actually install and set them up. It's not purely oversight but a skill gap, and a question of what is possible given an amount of training.
I got an ad for an AI legal consultant and it made me upset. Someone is trying to make a cheap buck by outsourcing legal advice to an unreliable source. Now I'm more prepared to explain how it's a bad idea, instead of just why.
I like that the article explains a few different ways hallucinations creep in, besides the obvious. Maybe what's most needed is a better retrieval/search system? If the AI can't fetch high quality data, then it's doomed before it tries.
What really upsets me is the idea of making therapist LLMs.
They may be able to regurgitate an impressive amount of psychology texts, but that doesn't give them the theory of mind, observational skill, experience or judgement needed to be a good psychotherapist.
I know LLMs and genAI are trendy and they seem spectacular and work like magic, but maybe they are not the tool for every problem. I can see them working fine in a lot of aspects/problems, but that doesn't mean they will work fine in every aspect that involves text, even if they are refined, etc.
I guess once the fashion passes (probably not soon, unfortunately), people will see the real value of LLMs and will start applying them to problems that are solvable by them, and not absolutely everything.
No one knows what the real value of LLMs is though. It's a nascent technology. We could be discussing the internet circa 1994 or we could be discussing bitcoin circa 2011.
> the internet circa 1994 or we could be discussing bitcoin circa 2011.
People were clear about the value of both. SMTP by itself was already a clear value for business. The WWW itself was invented in 1989. Same for Bitcoin: the title of the original paper, "Bitcoin: A Peer-to-Peer Electronic Cash System", explains it all. Some capabilities were not there yet, but those were mostly infrastructure and engineering issues (acceptance was another issue for Bitcoin).
LLMs generate tokens. The generation is governed by the model, but the model itself has no concept of usefulness or truth. Books are not enough to train them. The WWW has a lot of garbage information. And natural language is insufficient to describe exact processes without formalizing it.
The fact is that the human mind can already do what LLM fans hope they will achieve. And we can transfer knowledge through media. Getting something done is mostly assembling the right people in a room, recursively. And we built tools to enhance productivity.
I believe that Microsoft and others are hoping that the general public will be ok with wrong results. But the fact is that all the capabilities exposed so far have an alternative and more deterministic process to achieve the same result. But it often requires learning or hiring a specialist and it's sad to see how many people are balking at that.
I'm going to go out on a limb and say that LexisNexis and Thomson Reuters didn't do nearly enough (if any) taxonomical engineering of the corpus before deciding to just do some sliding-window chunking. Without that, the whole "think/plan" part of a natural language query pipeline is all but useless.
I just want to bang a marching bass drum while walking through their office and continually shout, "you still have to do the messy part!"
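For context, this is roughly what "sliding-window chunking" means: fixed-size windows with overlap, cut without any regard for the document's legal structure (sections, headnotes, citations). A minimal sketch:

```python
# Naive sliding-window chunking: fixed-size character windows with overlap.
# Nothing here knows what a section, headnote, or citation is.
def sliding_window_chunks(text: str, window: int = 1000, overlap: int = 200):
    step = window - overlap
    return [text[i:i + window] for i in range(0, max(len(text) - overlap, 1), step)]
```

The taxonomical engineering the comment calls for would presumably split and tag along the corpus's own structure instead, so retrieval can reason about what a chunk actually is rather than treating every 1,000-character slice the same.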
I try to read a few SCOTUS rulings a year and I always wonder if anyone actually reads all the citations, notes, references, etc. It just feels like doing it right would take hundreds of hours. Sometimes I wonder if the judges have read their full final opinions or if it's just the clerks.
I'm genuinely curious, how does this work in regular law? Does anyone ever check all the citations, etc.? Does a judge really deep-read all 200 pages of filings and arguments? Do they just scan them efficiently as a lot of it is familiar?
SCOTUS justices have 4 full-time clerks, and yes they absolutely check all the citations. Getting onto law review in school involves a “cite checking” test, since this is a very important task for lawyers.
Also, some of the citations are routine, and are very well-known by lawyers and judges. For example, when a lawyer sees “Carolene Products footnote 4”, they don’t have to check it to know it’s about interstate commerce. In some SCOTUS opinions, 1/3 of the footnotes could be routine ones like this.
Clerks are portioned out to read a bunch of this. Judges presumably read their own opinions, as do judges who concur with any opinion written. Mind that reading and writing these things is 90% of their job, 10% is the actual hearings, so imagine dedicating at least 40 hrs a week over months and months to this stuff. I imagine it isn't a particularly challenging workload to write.
Lawyers absolutely read every word of court decisions—citations, notes, appendices, concurrences, and dissents. Usually multiple times. But that tends to happen when looking to cite, criticize, or argue about a particular decision that matters. In other words, when you might have to defend your interpretation against another lawyer. And when you have the time. Which is money.
Lawyers reading for other reasons may skim, hunt for specific passages, or even just glance at any provided syllabus, depending on why they are reading. When you've read enough of these, by the same judges, living and dead, over and over, you get a sense not just of formatting and structure but of written and thinking style as well. You get better at determining whether and when it's worth a full, deep read.
One of the things American law schools teach by experience in the first year is that the level of attention, organization, and critical thinking expected in reading is far higher, in its own peculiar way, than what most students are used to putting in, even from very strong academic backgrounds. Then they develop endurance for it, by assigning many cases to read for each class session, several times a week, for several classes at once. All in a competitive environment where your hiring prospects largely come down to grades and your grades come down to a single exam per subject, each awarded on strict statistical curves.
Part of it's that you learn what you need to read. Part of it's that you just grind. When you've been grinding long enough, you don't even feel it anymore. It still hurts, but it's a long, slow abrasion on your mind and personality. Not a stitch in your side anymore.
I'll never tell anyone off from reading Supreme Court opinions. It's your court! As long as you can keep it. But for most folks reading for interest, the syllabus is fine. If you want just a little more than that, read the intro and concluding sections of the "opinion of the court". Then read the intro of each dissent, if there are any.
> Im genuinely curuous, how does this work in regular law? Does anyone ever check all the citations, etc?
Yes. At higher levels (state supreme courts and above), there are staff checking every single citation. It's often checked in a paper copy of the original source, which in rare cases could be a book from e.g. the 1700s.
That's roughly the range of nonsensical or incorrect answers I've gotten from various human lawyers, accountants, financial advisors, and other professionals over the years. I've come to expect about a fifth to a quarter of their advice to have problems with it.
Being wrong a third of the time would be a notable improvement for the human medical professionals I've dealt with. It's basically a coin-flip when it comes to the correctness and quality of what they claim.
Where is the recourse when your financial advisor loses your money, your lawyer loses your case, and your surgeon loses your life? Happens all too often because there literally isn’t any recourse.
I don’t think that word “smarmy” means what you think it means, and there is no valid point about malpractice - that’s the essence of the comment. Laws are only as effective as the means of enforcing them, and in this case not everyone has equal access to the justice system.
I have an uncle who is an attorney in X state. I had him try, using GPT4, a bunch of prompts about X state law in his specialty, and the rate of hallucination was much higher than 1 in 6. Probably half or more were incorrect. Often the answers would be correct for other states, but not for X state. Alternatively, they were correct for X state at a certain point in time, but no longer are.
Geoffrey Hinton says the term "confabulate" is more apt to describe the mutex/mutating memory recall operation of the brain as it summons "known facts" and "gets them wrong." Confabulation is a feature, not a bug, of the brain, and arguably a feature, not a bug, of these models too. If the goal is to reach brainlike computing, we are getting closer.
I feel like LLMs just show how utterly stupid we all are, fundamentally.
All it takes is a machine using full sentences, and suddenly we assume it must be smart. This is why humans fall for scams, and this is why humans get into cults.
I guess that's something LLMs can teach us: how flawed we are in evaluating whether someone is speaking the truth without researching anything.
I wish someone would get free consultations with 100 attorneys and record the meetings. Then grade their answers in the same way the LLMs were graded.
My guess (and it is only a guess) is that the AIs outperform real attorneys during free consultations by hallucinating less and giving truthful answers more often.
We think of truthfulness as boolean — whether in humans or AI models — but it may be more useful to think of it as a skill. You can become more truthful by aligning your inner model with reality and then also by checking it with ground truth. A challenging and intrinsically imperfect process either way!
Seeing a lot of bad takes in the comments, but from an enterprise sales point of view: companies aren't buying or trusting LLMs.
1 out of 6 times the black box produces completely wonky/risky output. And it costs far, far more to fix the changes introduced by the black box, including the hallucinations.
This is the blockchain argument all over again in these comments. Literally just swap blockchain out for AI/LLM and it feels like the crypto bubble peak.
What AI hype is doing is giving companies cover to massively garnish white-collar wages by making everybody think it can replace people (when in reality AI cannot completely replace a sentient human interface), extrapolating to wild science fiction novels. A bit disappointing to see; thought reddit would be more for this type of speculation.
The solution is simple - put the relevant law in the prompt. The larger context size of Claude lets you shove in half-a-dozen cases, your petition, and a few other docs.
When you do that, the quality is more than passable, and instantaneous.
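As a concrete illustration of the "put the relevant law in the prompt" approach, here is a minimal sketch using the Anthropic Python SDK. The model name is a placeholder (use whatever large-context model is current), and the prompt wording is illustrative, not a recommendation from any vendor.

```python
# Sketch: stuff the relevant cases/statutes directly into a long-context prompt.
import anthropic

def ask_with_context(question: str, documents: list[str]) -> str:
    context = "\n\n---\n\n".join(documents)  # cases, your petition, statutes, etc.
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder model name
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": (
                "Answer using ONLY the documents below; cite the document "
                f"you rely on for each point.\n\n{context}\n\nQuestion: {question}"
            ),
        }],
    )
    return response.content[0].text
```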
The article’s cited examples of errors, such as not understanding that a precedent has been overruled or inventing a provision of the bankruptcy code, give a fascinating insight into how LLMs work and their limitations.
I get the sense that the solution to this problem is more use of LLMs (running critical feedback and review in a loop) rather than less use of LLMs.
If you can build good tooling around current kinda dumb LLMs now to lower that number, we will be in a pretty good position as the foundational models continue to improve.
Yeah, I'd imagine the problem is not verifying the output against retrieved documents. If it just hallucinates, it ignores the given context, which is something that can absolutely be verified by another LLM.
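A hedged sketch of that "verify with another LLM pass" idea follows. A second call checks whether the draft is supported by the retrieved documents and the loop retries if not; `llm` is a stand-in for any model API, and the whole thing is an illustration of the pattern, not a claim that it eliminates hallucination.

```python
# Draft-then-verify loop: a second LLM pass checks the draft against the context.
from typing import Callable, List

def draft_and_verify(question: str, documents: List[str],
                     llm: Callable[[str], str], max_rounds: int = 3) -> str:
    context = "\n\n".join(documents)
    feedback = ""
    draft = ""
    for _ in range(max_rounds):
        draft = llm(f"Context:\n{context}\n\nQuestion: {question}\n"
                    f"Previous reviewer feedback (fix these issues): {feedback or 'none'}")
        verdict = llm(f"Context:\n{context}\n\nDraft answer:\n{draft}\n\n"
                      "Does the draft contain any claim not supported by the context? "
                      "Reply 'OK' or list the unsupported claims.")
        if verdict.strip().upper().startswith("OK"):
            return draft
        feedback = verdict
    return draft  # best effort after max_rounds; still needs human review
```

The obvious caveat is that the verifier is the same kind of statistical component as the drafter, so this lowers the error rate rather than guaranteeing correctness.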
When talking to LLMs their responses are always a bit off. The best way I can describe it, it’s like they are speaking a specific dialect but you know they are using expressions that wouldn’t be used in that dialect. Or, it’s like they are way over their head in the specific matter and reiterate things they really don’t understand. Like there’s no substance in what they are saying.
And if they do this for a simple matter, or a matter that you know of, how can you trust their answers in a matter where you can’t evaluate the answer?
It doesn’t get better when they are totally mixing things up but say it with certainty, instead of just saying that they don’t know or admit that they have gaps in what they know. They don’t just get it wrong like humans do, it’s just weirdly wrong. Humans usually get it consistently wrong. Perhaps it’s about what they are not saying.
Would I always be able to tell it’s not a human I’m talking with? Probably not. But chances are that I will, the longer I talk to them.
Well, we are always using our internal models for work like this. When our internal models and what we say don't match reality, we might call it "hallucination" or "getting it wrong" or even "lying".
When we get it right — our internal model matches reality, and we also say something that matches reality — we call it "truthfulness", and it's super valuable.
But I think there are two really different sources of hallucination / inaccuracy / lies. One is that the internal model is wrong — "Oops, I was wrong. The CVS isn't on Main St." The other is when we decide to deceive. "Haha, I sent you to Main St. CVS is really on Center Rd." Two very different internal processes with the same outcome.
If we were only engaging in model-based inference, then we, too, would always be hallucinating. But the very thing you're pointing out -- that we act differently when our internal model is wrong vs. when it is right -- is the crux of the difference. We use models, but then we have the ability to immediately test the output of those models for correctness, because we have semantic, not just syntactic, awareness of both the input and output data. We have criteria for determining the accuracy of what our model is producing.
LLMs don't, and are only capable of engaging in stochastic inference from the pre-defined model, which solely represents syntactic patterns, and have no ability to determine whether anything they output is semantically correct.
LLMs hallucinate 100% of the time - hallucinating is what they Do. Often they’re hallucinating things that also occur in reality, but it sounds like the legal products are only doing so 66-83% of the time.
Words are meaningless if we don't adhere to some common definition. "Hallucination" and "generation" are two separate concepts, and it is unproductive to conflate them.
Then maybe they shouldn't be. "Generating" a legal reference and "hallucinating" it are pretty much the same thing. Lawyers need AIs that "remember" legal references or are better capable of "looking them up" on demand. "Generation" is the completely wrong thing to do here. In fact it is arguably the worst possible thing, inasmuch as simply failing would actually be far preferable.
I wouldn't expect a human to be able to just spit out all relevant case law, completely correctly with every last aspect of the citation accurate down to every last digit, based on solely their internal language model either.
Yes, this is what I mean: the thing the LLM is doing is a statistical guesswork process - that's the core of what the thing is. In many cases, its statistical guesswork is correct - it's guessed the right thing. That is not the same thing as actually referencing a canonical source, and should not be confused as such. So long as people's mental models of LLMs don't match what the LLM is actually doing, we'll continue to be surprised by how often they're wrong about things and we'll continue to try to use them in ways we shouldn't. Recognize what the tool you're using is and use it appropriately.
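To make that "statistical guesswork" point concrete, here is a toy illustration: at every step the model only samples a plausible next token from a probability distribution, and nothing in that step consults a canonical source. The numbers and tokens are made up for illustration.

```python
# Toy next-token sampling: sometimes the sampled continuation happens to match
# reality, sometimes it doesn't, but the mechanism is identical either way.
import random

next_token_probs = {"447": 0.42, "581": 0.31, "U.S.": 0.17, "F.2d": 0.10}
tokens, weights = zip(*next_token_probs.items())
print(random.choices(tokens, weights=weights, k=1)[0])
```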
Memory is generation; the memory palace metaphor is much less accurate IMO than ones that focus on Hebbian learning, learned (Pavlovian) responses, and/or, ironically enough, machine learning. The only difference between “remembering” a legal reference and “generating” one is that the former denotes a much higher level of confidence. Which, ofc, we all agree is an inescapably important metric — an acceptable failure rate for these models would be maybe 5%, MAX. LLMs used alone through a browser window, even with a half-assed “RAG” solution, are not enough IMO.
I think what you'd find in a real lawyer's office is that they don't "remember" or "generate" very many references at all. Those books in the stereotypical image of a lawyer's office weren't just there to look lawyerly, and in modern times those computers are equipped with some expensive subscriptions to legal databases. This is not how legal references are created, not now, not in the past. Depending on an LLM (not "AI", LLM specifically) to do this is insane, stupid, and anyone selling an LLM-based product for this purpose ought to be sued for fraud... and that only coincidentally since they are providing a product to lawyers.
Why is anyone even defending this? LLMs are objectively unsuited for this. LLMs are objectively failing at this. LLM architecture is obviously not a good choice for this task to anyone with even a basic understanding of them. But... who cares? All this means is that LLMs are not suited for this task. I wouldn't be banging on this except for the engineers running around here with stars in their eyes thinking that despite all the evidence they've finally found the tool that does everything. No. We haven't. They don't. This isn't an attack on the viability of AI as a whole or LLMs specifically, any more than pointing out that a hammer is not a good tool for cutting grass is an attack on hammers.
I think many people have forgotten what mastery entails. It's all about being able to reliably reproduce a process or solve a problem. Sometimes the process is deterministic and we engineer a tool for it. Or it's not, and we need to approach it the old way, by learning and training.
Providing truthful information is one of the latter, and one of the flaws is our memory. Writing it down or recording it corrects this, and the next problem is retrieving the snippet we need. That's one of the things a lawyer does. They don't invent laws out of thin air; they're just better at finding the helpful bits. Even today, I spend a lot of time looking at library and language documentation to verify a hunch I have. But I still have to go through the learning phase to know what to search for.
In the Dreyfus model of skill acquisition, we have these four qualities:
Recollection (non-situational or situational)
Recognition (decomposed or holistic)
Decision (analytical or intuitive)
Awareness (monitoring or absorbed)
The novice is described by the left values and the expert by the right values for each attribute. For the novice, every part is costly. For the expert, mastery has reduced the cost of taking action while keeping the possibility of a more deliberate approach when the chance of error is too high. Novices hope LLMs can reproduce mastery; experts know that LLMs' inherent capacity for making errors makes them a liability.
Thanks for the in depth, passionate followup! Right off the bat I want to clarify that I was talking about human cognition, not just typical attorney work — I’d stand by the assertion that it’s hallucination all the way down, at the very least “hallucinating a symbolic representation of the book passage you read 2s ago”.
Re: LLMs and law, I agree with all your complaints 100% if we constrain the discussion to direct/simplistic/“chatbot”-esque systems. But that’s simply not what the frontier is. LLMs are a groundbreaking technique for building intuitive components within a much larger computational system that looks like existing complex software. We’re not excited about (only) crazy groundbreaking products, we’re excited about enhancing existing products with intuitive features.
To briefly touch on your very strong beliefs about LLM models being a bad architecture for legal tasks: I couldn’t disagree more. LLMs specialize in linguistic structures, somewhat tautologically. What’s not linguistic about individual atomic tasks like “review this document for relevant passages” or “synthesize these citations and facts into X format”? Lawyers are smart and do lots of deliberation, sure, but that doesn’t mean they’re above the use of intuition.
As far as we’re in an argument of some kind, my closing argument is that people as a whole can be pretty smart, and there’s a HUGE wave of money going into the AI race all of a sudden. Like, dwarfing the “Silicon Valley era” altogether. What are the chances that you’re seeing the super obvious problem that they’re all missing? Remember that this isn’t just stock price speculation, this is committed investments of huge sums of capital into this specific industry.
Which, indeed, we do. The current understanding of perception is that sense inputs serve to correct the brain's existing model of the world - you do not look at something and then perceive it, you generate an image of it and then use your vision to correct it.
Actual hallucination is quite eye-opening in part because of the realization that perception lenses all our senses _all the time_. So while I agree that it's over-broadening the definition of hallucination, I think it ironically appropriate too.
The point is that LLMs are never right for the right reason. Humans who understand the subject matter can make mistakes, but they are mistakes of a different nature. The issue reminds me of this from Terry Tao (LLMs being not-even pre-rigorous, but adept at forging the style of rigorous exposition):
It is perhaps worth noting that mathematicians at all three of the above stages of mathematical development can still make formal mistakes in their mathematical writing. However, the nature of these mistakes tends to be rather different, depending on what stage one is at:
1. Mathematicians at the pre-rigorous stage of development often make formal errors because they are unable to understand how the rigorous mathematical formalism actually works, and are instead applying formal rules or heuristics blindly. It can often be quite difficult for such mathematicians to appreciate and correct these errors even when those errors are explicitly pointed out to them.
2. Mathematicians at the rigorous stage of development can still make formal errors because they have not yet perfected their formal understanding, or are unable to perform enough “sanity checks” against intuition or other rules of thumb to catch, say, a sign error, or a failure to correctly verify a crucial hypothesis in a tool. However, such errors can usually be detected (and often repaired) once they are pointed out to them.
3. Mathematicians at the post-rigorous stage of development are not infallible, and are still capable of making formal errors in their writing. But this is often because they no longer need the formalism in order to perform high-level mathematical reasoning, and are actually proceeding largely through intuition, which is then translated (possibly incorrectly) into formal mathematical language.
The distinction between the three types of errors can lead to the phenomenon (which can often be quite puzzling to readers at earlier stages of mathematical development) of a mathematical argument by a post-rigorous mathematician which locally contains a number of typos and other formal errors, but is globally quite sound, with the local errors propagating for a while before being cancelled out by other local errors. (In contrast, when unchecked by a solid intuition, once an error is introduced in an argument by a pre-rigorous or rigorous mathematician, it is possible for the error to propagate out of control until one is left with complete nonsense at the end of the argument.)
It makes sense TO conflate them so that people can better understand what's going on under the hood. We need to "de-personalize" these things as much as possible so we don't make stupid mistakes surrounding them.
Think of it like this: How USEFUL is it to say "I googled 'the closest ice cream shop to my house' but it 'hallucinated' the one that was the second-closest'"
> "I googled 'the closest ice cream shop to my house' but it 'hallucinated' the one that was the second-closest'"
A Google search doesn't do that, though. It will actually select ice cream shops from its database, determine their location, your location, and calculate the distance. And then select the closest.
It may mis-calculate, but it will not "hallucinate".
This is incredibly different than what LLMs do, which is why hallucinate is appropriate for an LLM but not appropriate for a conventional search or map engine
I fail to see how this difference strongly justifies the word "hallucination."
The computer is crunching numbers and probabilities of things and giving its best guess based on that, in both cases. If it's appropriate for one, it should be for the other.
(which, I'd say, it isn't, because "hallucination" implies "mistake by a human-like thing")
> The computer is crunching numbers and probabilities of things and giving its best guess based on that, in both cases. If it's appropriate for one, it should be for the other.
Google maps isn't choosing a nearest ice cream store based on probability, it's using an algorithm designed for this purpose.
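To make that contrast concrete, here is a toy sketch of the deterministic lookup being described. The shops and coordinates are made up, and real systems use road distance rather than straight-line distance; the point is only that the answer is computed, not sampled.

```python
# Deterministic nearest-shop lookup over a tiny "database" -- no guessing involved.
import math

shops = {"Scoop Dreams": (3.0, 4.0), "Cone Zone": (1.0, 1.0), "Gelato Go": (6.0, 2.0)}
me = (0.0, 0.0)

closest = min(shops, key=lambda name: math.dist(me, shops[name]))
print(closest)  # always "Cone Zone" for these inputs
```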
For the sake of sanity, I think we just have to acknowledge these are propagandistic terms that serve the interests of those who invent them, and are designed to mystify and confound non-experts. Then, move on.
I don't really think there's any winning this propaganda war until it shatters under its own illusions. It's shouting into the wind.
It's not that you can't build systems with a low rate of success; it's when you couple tightly to a black box controlled by a small number of people, where it's prohibitively expensive to have any insight infrastructure.
If LLMs were just another "framework" that a PHP developer could pick up, we could build systems to deal with these "hallucinations".
Everyone’s giving you grief when you’re clearly merely making the Kantian assertion that all experience must have both immediate and mediated elements, namely “our subjective perceptions” and “the unknowable ineffable objectively-undifferentiable Real World”. The best we can do is construct our little hallucinations and hope (and try to empirically ensure!) that they match whatever Reality might be, knowing that we’ll never have a true fundamental link to it.
Well, it’s either that or you were being a human elitist and portraying a difference of degree as a difference in kind. Idk if we’ve settled on terms yet but I’d probably level ones like “automatist”, “bio supremacist”, or maybe “radical humanist”…
> Everyone’s giving you grief when you’re clearly merely making the Kantian assertion that all experience must have both immediate and mediated elements, namely “our subjective perceptions” and “the unknowable ineffable objectively-undifferentiable Real World”. The best we can do is construct our little hallucinations and hope (and try to empirically ensure!) that they match whatever Reality might be, knowing that we’ll never have a true fundamental link to it.
Insofar as the mechanism by which LLMs operate mirrors the mechanisms by which the brain operates - which they do, to some degree - yes, the fact that LLMs hallucinate reflects the gap between perceptions, thoughts, and "reality" in human behavior, and there is indeed a deeper philosophical lesson to be drawn here. If LLMs manage to disabuse the field of software engineering of its apparently common conceptions about how the brain works and reflects reality, I'll consider it a win.
> Well, it’s either that or you were being a human elitist and portraying a difference of degree as a difference in kind. Idk if we’ve settled on terms yet but I’d probably level ones like “automatist”, “bio supremacist”, or maybe “radical humanist”…
I'm not sure how I could be considered a bio-supremacist for pointing out that a hammer is not a drill. LLMs are phenomenally complicated tools, but as currently deployed, they're tools, not beings. We may at some point create a system that one could credibly claim is an actual artificial intelligence, and that system may contain things that look or operate like LLMs, but right now what we're looking at are large statistical model that we're operating by providing it inputs and reading the outputs.
A first-line criterion here, for me, would be that the system operates continuously (i.e., not just when someone puts a token in the box). The LLMs we're discussing here do not do that. Someone generates a series of tokens, provides those as inputs to the equation, runs the equation, and then decides whether to continue to provide inputs into the equation until they're satisfied by the result. That's a tool, not a "being".
I don't know how else to get this message across, but it does this all the time in all subjects.
It doesn't just occasionally hallucinate mistakes. The mechanism by which it makes non-mistakes is identical and it can't tell the difference.
There is no profession where a) you shouldn't prefer an expert over ChatGPT and b) you won't find experts idiotically using ChatGPT to reduce their workloads.
This is why it's a grotesquely inappropriately positioned and marketed set of products.
GPT-based legal/medical/critical positioning is doomed to failure. These are essentially monopolies protected by humans, and AI, even if it becomes extremely intelligent, cannot infiltrate them unless those humans accept it.
A lawyer would be taking on massive risks by trusting GPT outputs. Even if they think that filtering out the 1-in-6 noise is a form of risk mitigation, they are mistaken. It's a slot machine in some sense, except it gives you a small payout 5/6 times.
For sure, I agree with you 100%. This is basically "Legal Analysis for Dummies" if you choose to rely on a machine to give you help here. Medical is also a bad domain.
What do you call a person who just barely passed the bar exam?
These tools are designed for and marketed to lawyers to use. These are not generalist LLM products. Your "this is why lawyers exist" statement makes no sense in the context of these products.
One of the models studied markets itself with "AI-Assisted Research on Westlaw Precision is the first generative AI offering from Thomson Reuters and will help legal professionals find the answers they need faster and with high confidence."
Another says "Most attorneys know Practical Law as the trusted, up-to-date legal know-how platform that helps them get accurate answers across major practice areas. Its newest enhancement, Ask Practical Law AI is a generative AI search tool that dramatically improves the way you access the trusted expertise from Practical Law."
A third says "Transform your legal work with Lexis+ AI"
Only if you want a factual answer that can be depended on.
If you just want a text bot that makes statements one word at a time, we can stop throwing $100Bs at this sector, and some marketing departments need to come in this weekend and redo a pitch or two.
For commercializing generative AI tools, which needs to happen in order for all the investment to pay off, it absolutely does.
In an adversarial industry like law, where there's often another party actively invested in challenging your work, high error rates in source material translate directly to high risk of failure/rejection in a case or claim.
Putting numbers on the failure rate for current products helps the industry understand whether those products are reliable enough yet and what degree of oversight they require.
LLMs are only capable of hallucinating, whereas humans are capable of hallucinating but are also capable of empirically observing reality. So whatever the human rate is, it's necessarily lower than the LLMs'.
In the sense of just completely making up references out of nowhere? Very low. Humans make mistakes, of course, but they tend not to be so egregious. Grounding a submission on a made up reference is quite likely to be fatal to your case in a way that most human errors aren't.
This is the important bit that's always missing in discussions about LLM/AI applications to existing industries - what is the rate of mistakes for a human worker?
My guess (and this is only a guess) is that LLMs hallucinate on legal questions because they expect things to make sense, but the law does not make sense. Lawyers have a vested interest in keeping the law an indecipherable and self-contradicting mess so that people have to pay them to get answers and help. This is the same reason (corrupt) judges treat attorneys better than people who represent themselves and refuse to point out what the law says in court.
LLMs hallucinate on legal questions because they hallucinate on everything.
Hallucination isn't a special weird form of error: it's the entire process by which LLMs work. Proponents and enthusiasts just call the errors "hallucinations" because it sounds better than admitting "this technology makes no distinction between correct and incorrect, and will often incorrectly correct itself when an error is pointed out".
It's probably best to take a step back and ask what you actually want an LLM to do for you. Using RAG and just having it raise references to consider is probably the best you can expect.
I have sometimes asked legal questions (to which I pretty much know the answer) of LLMs, and my observation is that a lot of the time the section of an Act (UK) is the principal node about which a piece of information is anchored, but the anchoring is very loose and many sections are often mentioned in proximity, leading to poor differentiation of the relevance of a document/sentence/word vector to a particular piece of legislation. It might be fixed by training NER to recognise treaties/acts/conventions and the jurisdiction and always using that to label references; I suspect the "1" in "section 1" or "17 USC 1", say, is not being tokenised as part of the "section" and this contributes to poor results. Maybe someone has worked on this?
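If you want to check that hunch quickly, one way is to inspect how such references tokenise, here using OpenAI's tiktoken library as an example. Other models use different tokenisers, so treat this purely as an illustration of the kind of check involved, not as evidence about any particular product.

```python
# Inspect how legal references split into tokens under one common tokeniser.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for ref in ["section 1", "17 USC 1", "Section 117(a)"]:
    token_ids = enc.encode(ref)
    pieces = [enc.decode([t]) for t in token_ids]
    print(ref, "->", pieces)  # shows whether the number is split from "section"
```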
Also, context for the jurisdiction in which a discussion is taking place is often not really given - can the LLM tell that a law firm is talking about PCT, or EPC rather than USC when it discusses first-to-file for patent law and nothing in the document itself mentions the jurisdiction or any of those three initialisms? How about when the same law firm whose blog is mentioning these things represents the same clients at WIPO, EPO and USPTO? If you're going to fine-tune for that with human question-answer sessions you're going to need some really skilled workers who know the field well.
You probably then need specific prompt templates for legal questions too.
Then they need to layer in precedence and authority of different courts, recognise obiter statements from other statements, recognise summaries of positions of different parties aren't original statements by the speaker, disregard ultra vires statements, ... simples.
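On the prompt-template point above, here is a hedged sketch of what a legal-specific template might look like: make jurisdiction, date, and authority explicit instead of hoping the model infers them. The field names and instructions are illustrative, not taken from any of the products discussed.

```python
# Illustrative legal prompt template; fields are placeholders.
LEGAL_PROMPT = """\
Jurisdiction: {jurisdiction}
Relevant date (law as of): {as_of_date}
Question: {question}

Retrieved authorities (with court and year):
{authorities}

Instructions:
- Answer only for the stated jurisdiction; flag any authority from another one.
- Rank authorities by court level and note any that have been overruled.
- Distinguish holdings from obiter dicta and from summaries of a party's position.
- If the retrieved material is insufficient, say so rather than guessing.
"""

prompt = LEGAL_PROMPT.format(
    jurisdiction="England and Wales",
    as_of_date="2024-06-01",
    question="Does first-to-file apply to this patent dispute?",
    authorities="1. [Court, year] ...\n2. [Court, year] ...",
)
```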
People's hatred for lawyers isn't going to distract them from their hatred of LLMs, it seems, so you get downvoted. People's context window is as short as an LLM's, lmao.