For anyone who still doesn't understand why more and more folks are pointing out that LLMs "hallucinate" 100% of the time, let me put it to you this way: what is the LLM doing differently when it generates tokens that are "wrong" compared to when the tokens are "right"? If there is a difference, where does that exist? In the mechanism of the LLM, or in your mind?
Bonus question 1: Why do humans produce speech? Is it a representation of text? Why then do humans produce text? Is there an intent to communicate, or merely to construct a "correct" symbolic representation?
Bonus question 2: It's been said that every possible phrase is encoded in the constant pi. Do we think pi has intelligence? Intent?
What is the difference between a Google Maps-style application that shows pixels that are "right" for a road, and pixels that are "wrong" for a road?
A pixel is a pixel; colors cannot be true or false. The only way we can say a pixel is right or wrong in Google Maps is whether the act of the human utilizing that pixel to understand geographical information results in the human correctly navigating the road.
In the same way, an LLM can tell me all kinds of things, and those things are just words, which are just characters, which are just bytes, which are just electrons. There is no true or false value to an electron flowing in a specific pattern in my CPU. The real question is what I, the human, get out of reading those words, if I end up correctly navigating and understanding something about the world based on what I read.
Unfortunately, we do not want LLMs to tell us "all sorts of things," we want them to tell us the truth, to give us the facts. Happy to read how this is the wrong way to use LLMs, but then please stop shoving them into every facet of our lives, because whenever we talk about real-life applications of this tech it somehow is "not the right fit".
> we want them to tell us the truth, to give us the facts.
That's just one use case out of many. We also want them to tell stories, make guesses, come up with ideas, speculate, rephrase, ... We sometimes want facts. And sometimes it's more efficient to say "give me facts" and verify the answer than to find the facts yourself.
I think the impact of LLMs is both overhyped and underestimated. The overhyping is easy to see: people predicting mass unemployment, etc., when this technology reliably fails very simple cognitive tasks and has obvious limitations that scale will not solve.
However, I think we are underestimating the new workflows this tech will enable. It will take time to search the design space and find where the value lies, as well as time for users to adapt to a different way of using computers. Even in fields like law where correctness is mission-critical, I see a lot of potential. But not from the current batch of products that are promising to replace real analytical work with a stochastic parrot.
That's a round peg in a square hole. As I've seen them called elsewhere today, these "plausible text generators" can create a pseudo facsimile of reasoning, but they don't reason, and they don't fact-check. Even when they use sources to build consensus, it's more about volume than authoritativeness.
I was watching the show 3 Body Problem, and there was a great scene where a guy tells a woman to double-check another man's work. Then he goes to the man and tells him to triple-check the woman's work. MoE seems to work this way, but maybe we can leverage different models with different randomness and get to a more logical answer.
We have to start thinking about LLM hallucination differently. When it follows logic correctly and provides factual information, that is also a hallucination, but one that fits our flow of logic.
Sure, but if we label the text as “factually accurate” or “logically sound” (or “unsound”) etc., then we can presumably greatly increase the probability of producing text with targeted properties
What on Earth makes you think that training a model on all factual information is going to do a lick in terms of generating factual outputs?
At that point, clearly our only problem has been we've done it wrong all along by not training these things only on academic textbooks! That way we'll only probabilistically get true things out, right? /s
> The real question is what I, the human, get out of reading those words
So then you agree with the OP that an LLM is not intelligent in the same way that Google Maps is not intelligent? That seems to be where your argument leads, but you're replying in a way that makes me think you are disagreeing with the OP.
I guess I am both agreeing and disagreeing. The exact same problem is true for words in a book. Are the words in a lexicon true, false, or do they not have a truth value?
The words in the book are true or false, making the author correct or incorrect. The question being posed is whether the output of an LLM has an "author," since it's not authored by a single human in the traditional sense. If so, the LLM is an agent of some kind; if not, it's not.
If you're comparing the output of an LLM to a lexicon, you're agreeing with the person you originally replied to. He's arguing that an LLM is incapable of being true or false because of the manner in which its utterances are created, i.e. not by a mind.
So only a mind is capable of making signs that are either true or false? Is a properly calibrated thermometer that reads "true" if the temperature is over 25C incapable of being true? But isn't this question ridiculous in the first place? Isn't a mind required to judge whether or not something is true, regardless of how this was signaled?
Read again; I said he’s arguing that the LLM (i.e. thermometer in your example) is the thing that can’t be true or false. Its utterances (the readings of your thermometer) can be.
This would be unlike a human, who can be right or wrong independently of an utterance, because they have a mind and beliefs.
I’ll cut to the chase. You’re hung up on the definition of words as opposed to the utility of words.
That classical or quantum mechanics are at all useful depends on the truthfulness of their propositions. If we cared about the process, then we would let the non-intuitive nature of quantum mechanics enter into the judgement of the usefulness of the science.
The better question to ask is whether a tool, be it a book, a thermometer, or an LLM, is useful. Error rates affect utility, which means that distinctions between correct and incorrect signals are more important than attempts to define arbitrary labels for the tools themselves.
You're attempting to discount a tool based on everything other than its utility.
Ah okay I understand, I think. So basically that's solipsism applied to a LLM?
I think that's taking things a bit too far though. You can define hallucination in a more useful way. For instance you can say 'hallucination' is when the information in the input doesn't make it to the output. It is possible to make this precise, but it might be impractically hard to measure it.
An extreme version would be an En->FR translation model that translates every sentence into 'omelette du fromage'. Even if it's right, the input didn't actually affect the output one bit, so it's a hallucination. Compared to a model that actually changes the output when the input changes, it's clearly worse.
Conceivably you could check if the probability of a sentence actually decreases if the input changes (which it should if it's based on the input), but given the nonsense models generate at a temperature of 1 I don't quite trust them to assign meaningful probabilities to anything.
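To make that probability check concrete, here is a minimal sketch, assuming a small Hugging Face causal LM (gpt2 purely as a stand-in) and a toy translation example of my own: score the same candidate output against two different inputs, and if the scores barely differ, the output is not conditioned on the input in the sense described above.

```python
# Minimal sketch: does the input actually change the probability of the output?
# gpt2 and the translation example are illustrative assumptions only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def output_logprob(prompt: str, output: str) -> float:
    """Sum of log-probabilities the model assigns to `output` given `prompt`."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + output, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Score only the tokens belonging to `output`, each predicted from its prefix.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    out_positions = range(prompt_ids.shape[1] - 1, full_ids.shape[1] - 1)
    return sum(log_probs[i, full_ids[0, i + 1]].item() for i in out_positions)

candidate = " The cheese omelette is ready."
p_real = output_logprob("Translate to English: L'omelette au fromage est prete.", candidate)
p_scrambled = output_logprob("Translate to English: Le chat dort sur le canape.", candidate)
# If the two scores are nearly identical, the output is not conditioned on the input.
print(p_real, p_scrambled)
```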
No, your constant output example isn't what people are talking about with "hallucination." It's not about destroying information from the input, in the sense that if you asked me a question and I just ignored you, I'm not in general hallucinating. Hallucinating is more about sampling from a distribution which extends beyond what is factually true or actually exists, such as citing a non-existent paper, or inventing a historical figure.
> It's offering support for the claim that LLMs hallucinate 100% of the time, even when their hallucinations happen to be true.
Well this makes the term "hallucinate" completely useless for any sort of distinction. The word then becomes nothing more than a disparaging term for an LLM.
Not really. It distinguishes LLM output from human output even though they look the same sometimes. The process by which something comes into existence is a valid distinction to make, even if the two processes happen to produce the same thing sometimes.
It makes sense to do so in the same way that it’s useful to distinguish quantum mechanics from classical mechanics, even if they make the same predictions sometimes.
A proposition of any kind of mechanics is what can be true or false. The calculations are not what makes up the truth of a proposition, as you’ve pointed out.
But then again, a neighboring country that thinks the land is theirs would say that road isn't a road at all. Depending on what country you're in, it makes a difference.
> The real question is what I, the human, get out of reading those words, if I end up correctly navigating and understanding something about the world based on what I read.
No, the real question is how you will end up correctly navigating and understanding something about the world from a falsehood crafted to be optimally harmonious with the truths that happen to surround it.
A pixel (in the context of an image) could be "wrong" in the sense that its assigned value could lead to an image that just looks like a bunch of noise. For instance, suppose we set every pixel in an image to some random value. The resulting image would look like total noise, and we humans wouldn't recognize it as a sensible image. By providing a corpus of accepted images, we can train a model to generate images (arrays of pixels) which look like images and not, say, random noise. Now it could still generate an image of some place or person that doesn't actually exist, so in that sense the pixels are collectively lying to you.
> let me put it to you this way: what is the LLM doing differently when it generates tokens that are "wrong" compared to when the tokens are "right"?
It is conditioning on latents about truth, falsity, reliability, and calibration. All of these inferred latents have been shown to exist inside LLMs, as they need to exist for LLMs to do their jobs in accurately predicting the next token. (Imagine trying to predict tokens in, say, discussions about people arguing or critiques of fictional stories, or discussing mistakes made by people, and not having latents like that!)
LLMs also model other things: for example, you can use them to predict demographic information about the authors of texts (https://arxiv.org/abs/2310.07298), even though this is something that pretty much never exists IRL, a piece of text with a demographic label like "written by a 28yo"; it is simply a latent that the LLM has learned for its usefulness, and can be tapped into. This is why a LLM can generate text that it thinks was written by a Valley girl in the 1980s, or text which is 'wrong', or text which is 'right', and this is why you see things like in Codex, they found that if the prompted code had subtle bugs, the completions tend to have subtle bugs - because the model knows there's 'good' and 'bad' code, and 'bad' code will be followed by more bad code, and so on.
This should all come as no surprise - what else did you think would happen? - but pointing out that for it to be possible, the LLM has to be inferring hidden properties of the text like the nature of its author, seems to surprise people.
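For readers who want to see what a claim like this looks like in practice, here is a minimal sketch of the standard linear-probe experiment: extract hidden states for labeled statements and check whether a linear classifier can separate them. The model (gpt2), the layer, and the toy dataset are illustrative assumptions, nothing like the scale of the actual studies, and whether such probes generalize is exactly what the replies below dispute.

```python
# Minimal linear-probe sketch over hidden states; all choices here are toy assumptions.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

def last_token_state(text: str, layer: int = 6) -> torch.Tensor:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden = model(ids).hidden_states[layer]
    return hidden[0, -1]  # representation at the final token

statements = [
    ("Paris is the capital of France.", 1),
    ("Two plus two equals four.", 1),
    ("The moon is made of cheese.", 0),
    ("Paris is the capital of Japan.", 0),
]
X = torch.stack([last_token_state(s) for s, _ in statements]).numpy()
y = [label for _, label in statements]

probe = LogisticRegression(max_iter=1000).fit(X, y)
# The interesting question is whether such a probe generalizes to held-out
# statements; fitting the training set alone proves very little.
print(probe.score(X, y))
```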
> It is conditioning on latents about truth, falsity, reliability, and calibration. All of these inferred latents have been shown to exist inside LLMs, as they need to exist for LLMs to do their jobs in accurately predicting the next token.
No, it isn't, and no, they haven't [1], and no, they don't.
The only thing that "needs to exist" for an LLM to generate the next token is a whole bunch of training data containing that token, so that it can condition based on context. You can stare at your navel and claim that these higher-level concepts end up encoded in the bajillions of free parameters of the model -- and hey, maybe they do -- but that's not the same thing as "conditioning on latents". There's no explicit representation of "truth" in an LLM, just like there's no explicit representation of a dog in Stable Diffusion.
Do the thought exercise: if you trained an LLM on nothing but nonsense text, would it produce "truth"?
LLMs "hallucinate" precisely because they have no idea what truth means. It's just a weird emergent outcome that when you train them on the entire internet, they generate something close to enough to truthy, most of the time. But it's all tokens to the model.
[1] I have no idea how you could make the claim that something like a latent conceptualization of truth is "proven" to exist, given that proving any non-trivial statement true or false is basically impossible. How would you even evaluate this capability?
> In this work, we curate high-quality datasets of true/false statements and use them to study in detail the structure of LLM representations of truth, drawing on three lines of evidence: 1. Visualizations of LLM true/false statement representations, which reveal clear linear structure. 2. Transfer experiments in which probes trained on one dataset generalize to different datasets. 3. Causal evidence obtained by surgically intervening in a LLM's forward pass, causing it to treat false statements as true and vice versa. Overall, we present evidence that language models linearly represent the truth or falsehood of factual statements.
You can debate whether the 3 experiments cited back the claim (I don't believe they do), but they certainly don't prove what OP claimed. Even if you demonstrated that an LLM has a "linear structure" when validating true/false statements, that's a whole universe away from having a concept of truth that generalizes, for example, to knowing when nonsense is being generated based on conceptual models that can be evaluated to be true or false. It's also very different to ask a model to evaluate the veracity of a nonsense statement vs. avoiding the generation of a nonsense statement. The former is easier than the latter, and probably could have been done with earlier generations of classifiers.
Colloquially, we've got LLMs telling people to put glue on pizza. It's obvious from direct experience that they're incapable of knowing true and false in a general sense.
> [...] but they certainly don't prove what OP claimed.
OP's claim was not: "LLMs know whether text is true, false, reliable, or is epistemically calibrated".
But rather: "[LLMs condition] on latents *ABOUT* truth, falsity, reliability, and calibration".
> It's also very different to ask a model to evaluate the veracity of a nonsense statement, vs. avoiding the generation of a nonsense statement [...] probably could have been done with earlier generations of classifiers
Yes. OP's point was not about generation, it was about representation (specifically conditioning on the representation of the [con]text).
Your aside about classifiers is not only very apt, it is also exactly OP's point! LLMs are implicit classifiers, and the features they classify have been shown to include those that seem necessary to effectively predict text!
> It's obvious from direct experience that they're incapable of knowing true and false in a general sense.
Yes, otherwise they would be perfect oracles, instead they're imperfect classifiers.
Of course, you could also object that LLMs don't "really" classify anything (please don't), at which point the question becomes how effective they are when used as classifiers, which is what the cited experiments investigate.
> But rather: "[LLMs condition] on latents ABOUT truth, falsity, reliability, and calibration".
Yes, I know. And the paper didn't show that. It projected some activations into low-dimensional space, and claimed that since there was a pattern in the plots, it's a "latent".
The other experiments were similarly hand-wavy.
> Your aside about classifiers is not only very apt, it is also exactly OP's point! LLMs are implicit classifiers, and the features they classify have been shown to include those that seem necessary to effectively predict text!
That's what's called a truism: "if it classifies successfully, it must be conditioned on latents about truth".
> "if it classifies successfully, it must be conditioned on latents about truth"
Yes, this is a truism. Successful classification does not depend on latents being about truth.
However, successfully classifying between text intended to be read as either:
- deceptive or honest
- farcical or tautological
- sycophantic or sincere
- controversial or anodyne
does depend on latent representations being about truth (assuming no memorisation, data leakage, or spurious features)
If your position is that this is necessary but not sufficient to demonstrate such a dependence, or that reverse engineering the learned features is necessary for certainty, then I agree.
But I also think this is primarily a semantic disagreement. A representation can be "about something" without representing it in full generality.
So to be more concrete: "The representations produced by LLMs can be used to linearly classify implicit details about a text, and the LLM's representation of those implicit details condition the sampling of text from the LLM".
My sense is an LLM is like Broca's area. It might not reason well, but it'll make good sounding bullshit. What's missing are other systems to put boundaries and tests on this component. We do the same thing too: hallucinate up ideas reliably, calling it remembering, and we do one additional thing: we (or at least the rational) have a truth-testing loop. People forget that people are not actually rational, only their models of people are.
One of the surprising results in research lately was the theory of mind paper the other week that found around half of humans failed the transparent boxes version of the theory of mind questions - something previously assumed to be uniquely a LLM failure case.
I suspect over the next few years we're going to see more and more behaviors in LLMs that turn out to be predictive of human features.
The terminology is wrong but your point is valid. There is no internal criterion or mechanism for statement verification. As the mind likely is also in part a high-dimensional construct, and LLMs to an extent represent our collective jumble of 'notions', it is natural that their output resonates with human users.
Q1: A ""correct" symbolic representation" of x. What is x? Your "Is there an intent to communicate, or" choice construct is problematic. Why would one require a "symbolic representation" of x, x likely being a 'meaningful thought'. So this is a hot debate whether semantics is primary or merely application. I believe it is primary in which case "symbolic representation" is 'an aid' to gaining a concrete sense of what is 'somehow' 'understood'. You observe a phenomena, and understand its dynamics. You may even anticipate it while observing. To formalize that understanding is the beginning of 'expression'.
Q2: because while there is a function LLM(encodings, q) that emits 'plausible' responses, an equivalent function for Pi does not exist outside of 'pure inexpressible realm of understanding' :)
>I believe it is primary in which case "symbolic representation" is 'an aid' to gaining a concrete sense of what is 'somehow' 'understood'.
There is nothing magic about perception to distinguish it meaningfully from symbolic representation; in point of fact, that which you experience is in and of itself a symbolic representation of the world around you. You do not sense the frequencies outside the physical transduction capabilities of the human ear, or the wavelengths similarly beyond the capability to transduce of the human eye, or feel distinct contact beyond the density of haptic transduction of somatic nerves. Nevertheless, those phenomena are still there, and despite their insensible nature, have an effect on you. Your entire perception is a map, which one would be well advised to not mistake for the territory. To dismiss symbolic representation as something that only occurs on communication after perception is to "lose sight" of the fact that all the data your mind integrates into a perception is itself, symbolic.
Communication and symbolic representation are all there is, and it happens long before we even get to the part of reality where I'm trying to select words to converse about it with you.
> There is nothing magic about perception to distinguish it meaningfully from symbolic representation; in point of fact, that which you experience is in and of itself a symbolic representation of the world around you.
You're right that there's nothing magic about it at all. The mind operates on symbolic representations, but whether those are symbolic representations of external sensory input or symbolic representations of purely endogenous stochastic processes makes for a night-and-day difference.
Perception is a map, but it's a map of real territory, which is what makes it useful. Trying to navigate reality with a map that doesn't represent real territory is not just useless, it's dangerous.
> As the mind likely is also in part a high dimensional construct and LLMs to an extent represent our collective jumble of 'notions' it is natural that their emits resonate with human users.
But humans are equipped with sensory input, allowing us to formulate our notions by reference to external data, not just generate notions by internally extrapolating existing notions. When we fail to do this, and do formulate our notions entirely endogenously, that's when we say we are hallucinating.
Since LLMs are only capable of endogenous inference, and are not able to formulate notions based on empirical observation, they are always hallucinating.
> what is the LLM doing differently when it generates tokens that are "wrong" compared to when the tokens are "right"?
When they don't recall correctly, it is hallucination. When they recall perfectly, it is regurgitation/copyright infringement. We find issue either way.
May I remind you that we also hallucinate; memory plays tricks on us. We often google stuff just to be sure. It is not the hallucination part that is a real difference between humans and LLMs.
> Why do humans produce speech?
We produce language to solve body/social/environment related problems. LLMs don't have bodies but they do have environments, such as a chat room, where the user is the environment for the model. In fact chat rooms produce trillions of tokens per month worth of interaction and immediate feedback.
If you look at what happens with those trillions of tokens - they go into the heads of hundreds of millions of people, who use the LLM assistance to solve their problems, and of course produce real world effects. Then it will reflect in the next training set, creating a second feedback loop between LLM and environment.
By the way, humans don't produce speech individually, if left alone, without humanity as support. We only learn speech when we get together. Language is social. The human brain is not so smart on its own, but language collects experience across generations. We rely on language for intelligence to a larger degree than we like to admit.
Isn't it a mystery how LLMs learned so many language skills purely from imitating us, without their own experience? It shows just how powerful language is on its own. And it shows it can be independent of substrate.
Bonus question 2 is the most ridiculous straw man I've seen in a very long time.
The existence of arbitrary string encodings in transcendental numbers proves absolutely nothing about the processing capabilities of adaptive algorithms.
Exactly. Reading digits of pi doesn’t converge toward anything. (And neither do infinite typewriter monkeys.) Any correct value they get is random, and exceedingly rare.
LLMs corral a similar randomness to statistically answer things correctly more often than not.
Here's the issue: humans do the same thing. The brain builds up a model of the world, but the model is not the world. It is a virtual approximation or interpretation based on training data: past experiences, perceptions, etc.
A human can tell you the sky is blue based on its model. So can any LLM. The sky is blue. So the output from both models is truthy.
> A human can tell you the sky is blue based on its model. So can any LLM. The sky is blue. So the output from both models is truthy.
But a human can also tell you that the sky is blue based on looking at the sky, without engaging in any model-based inference. An LLM cannot do that, and can only rely on its model.
Humans can engage in both empirical observation and stochastic inference. An LLM can only engage in stochastic inference. So while both can be truthy, only humans currently have the capacity to be truthful.
It's also worth pointing out that even if human minds worked the same way as LLMs, our training data consists of an aggregation of exactly those empirical observations -- we are tokenizing and correlating our actual experiences of reality, and only subsequently representing the output of our inferences with words. The LLM, on the other hand, is trained only on that second-order data -- the words -- without having access to the much more thorough primary data that it represents.
A witness to a crime thinks that there were 6 shots fired; in fact there were only 2. They remember correctly the gender of the criminal, the color of their jacket, the street corner where it happened, and the time. There is no difference in their mind between the true memories and the false memory.
I write six pieces of code that I believe have no bugs. One has an off-by-one error. I didn't have any different experience writing the buggy code than I did writing the functional code, and I must execute the code to understand that anything different occurred.
Shall I conclude that myself and the witness were hallucinating when we got the right answers? That intelligence is not the thing that got us there?
> Shall I conclude that myself and the witness were hallucinating when we got the right answers?
If you were recalling stored memories of experiences that were actual interactions with external reality, and some of those memories were subsequently corrupted, then no, you were not hallucinating.
If you were engaging in a completely endogenous stochastic process to generate information independently of any interaction with external reality, then yes, you were hallucinating.
> That intelligence is not the thing that got us there?
It's not. In both cases, the information you are recalling is stored data generated by external input. The storage medium happens to be imperfect, however, and occasionally flips bits, so later reads might not exactly match what was written.
But in neither case was the original data generated via a procedural algorithm independently of external input.
People who are consistently unable to distinguish fiction from reality make terrible witnesses; even an obviously high crackhead would fare better than an LLM on the witness stand.
Do we actually think this way though? When I am talking with someone I am cognating about what information and emotion I want to impart to the person / thinking about how they are perceiving me and the sentence construction flows from these intents. Only the sentence construction is even analogous to token generation, and even then, we edit our sentences in our heads all the time before or while talking. Instead of just a constant forward stream of tokens from the LLM.
>what is the LLM doing differently when it generates tokens that are "wrong" compared to when the tokens are "right"? If there is a difference, where does that exist? In the mechanism of the LLM, or in your mind?
If there were a detectable difference within the mechanism, the problem of hallucinations would be easy to fix. There may be ways to analyze logits to find patterns of uncertainty characteristics related to hallucinations. Perhaps deeper introspection of weights might turn up patterns.
The difference isn't really in your mind either. The difference is simply that the one answer correlates with reality and the other does not.
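To make the "analyze logits for uncertainty" idea above concrete, here is a minimal sketch that tracks the entropy of the next-token distribution during greedy decoding. High entropy is at best a weak, noisy proxy for potential hallucination, and gpt2 is used purely as a stand-in model.

```python
# Sketch: per-step entropy of the next-token distribution as a rough uncertainty signal.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The first person to walk on the moon was"
ids = tok(prompt, return_tensors="pt").input_ids

for _ in range(10):
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-12)).sum().item()
    next_id = torch.argmax(probs).unsqueeze(0).unsqueeze(0)
    print(f"{tok.decode(next_id[0]):>12}  entropy={entropy:.2f}")
    ids = torch.cat([ids, next_id], dim=1)
```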
The point of AI models is to generalize from the training data, that implicitly means generating output that it hasn't seen as input.
Perhaps the issue is not so much that it is generalizing/guessing but the degree to which making a guess is the right call is dependent on context.
If I make a machine that makes random sounds in approximately the human vocal range, and occasionally humans who listen to it hear "words" (in their language, naturally), then is that machine telling the truth when words are heard and "hallucinating" the rest of the time?
>what is the LLM doing differently when it generates tokens that are "wrong" compared to when the tokens are "right"?
When an LLM is trained, it essentially compresses the knowledge of the training data corpus into a world model. "Right" and "wrong" are thereby only emergent when you have a different world model for yourself that tells you a different answer, most likely because the LLM was undertrained on the target domain. But to the LLM, the token with the highest probability will be the most likely correct answer, similarly to how you might have a "gut feeling" when asked about something which you clearly don't understand and have no education in. And you will be wrong just as often. The perceived overconfidence of wrong answers likely stems from human behaviour in the training data as well. LLMs are not better than humans, but they are also not worse. They are just a distilled encoding of human behaviour, which in turn might be all that the human mind is in the end.
LLMs become fluent in constructing coherent, sophisticated text in natural language from training on obscene amounts of coherent, sophisticated text in natural language. Importantly, there is no such corpus of text that contains only accurate knowledge, let alone knowledge as it unambiguously applies to some specific domain.
It's unclear that any such corpus could exist (a millennia-old discussion in philosophy with no possible resolution), but even if you take for granted that such a corpus could, we don't have one.
So what happens is that after learning how to construct coherent, sophisticated text in natural language from all the bullshit-addled general text that includes truth and fiction and lies and fantasy and bullshit and garbage and old text and new text, there is a subsequent effort to sort of tune the model in on generating useful text towards some purpose. And here, again, it's important to distinguish that this subsequent training is about utility ("you're a helpful chatbot", "this will trigger a function call that will supplement results", etc.) and so still can't focus strictly on knowledge.
LLMs can produce intelligent output that may be correct and may be verifiable, but the way they work and the way they need to be trained prevents them from ever actually representing knowledge itself. The best they can do is create text that is more or less fluent and more or less useful.
It's awesome and has lots and lots of potential, but it's a radically different thing than a material individual that's composed of countless disparate linguistic and non-linguistic systems that have never yet been technologically replicated or modeled.
Wrong. This is the common groupspeak on uninformed places like HN, but it is not what the current research says. See e.g. this: https://arxiv.org/abs/2210.13382
Most of what you wrote shows that you have zero education in modern deep learning, so I really wonder what makes you form such strong opinions on something you know nothing about.
The person you are replying to, said it clearly: "there is no such corpus of text that contains only accurate knowledge"
Deep learning learns a model of the world, and this model can be as inaccurate as it gets. Earth may as well have 10 moons for a DL model. In order for Earth to have only 1 moon, there has to be a dataset which encodes only that information, and never more than one moon. A drunk person who stares at the moon, sees more than one moon, and writes about that on the internet has to be excluded from the training data.
Also, the model of the Othello world is very different from a model of the real world. I don't know about Othello, but in chess it is pretty well known that there are more possible chess positions than there are atoms in the universe. For all practical purposes, the dataset of all possible chess positions is infinite.
The dataset of every possible event on earth, every second is also more than all the atoms in the universe. For all practical purposes, it is infinite as well.
Do you know that one dataset is more infinite than the other? Does modern DL state that all infinities are the same?
Wrong again. When you apply statistical learning over a large enough dataset, the wrong answers simply become random normal noise (a consequence of the central limit theorem) - the kind of noise which deep learning has always excelled at filtering out, long before LLMs were a thing - and the truth becomes a constant offset. If you have thousands of pictures of dogs and cats and some were incorrectly labelled, you can still train a perfectly good classifier that will achieve more or less 100% accuracy (and even beat humans) on validation sets. It doesn't matter if a bunch of drunk labellers tainted the ground truth as long as the dataset is big enough. That was the state of DL 10 years ago. Today's models can do a lot more than that. You don't need infinite datasets, they just need to be large enough and cover your domain well.
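As a toy illustration of the label-noise point (and only that point): flip a chunk of training labels on a synthetic problem and a simple classifier still scores well against the clean test labels. The dataset, noise rate, and model below are arbitrary illustrative choices and say nothing about the far messier notion of "truth" in web-scale text.

```python
# Toy sketch: training labels with 15% random flips, evaluated on clean test labels.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
noisy = y_train.copy()
flip = rng.random(len(noisy)) < 0.15          # mislabel 15% of training examples
noisy[flip] = 1 - noisy[flip]

clf = LogisticRegression(max_iter=1000).fit(X_train, noisy)
print("accuracy on clean test labels:", clf.score(X_test, y_test))
```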
> You don't need infinite datasets, they just need to be large enough and cover your domain well.
When you are talking about distinguishing noise from a signal, or truth from not-totally-truth, and the domain is sufficiently small, e.g. a game like Othello or data from a corporation, then I agree with everything in your comment.
When the domain is huge, then distinguishing truth from lies/non-truth/not-totally-truth is impossible. There will not be such a high quality dataset, because everything changes over time, truth and lies are a moving target.
If we humans cannot distinguish between truth and non-truth, but the A.I. is able to, then we are talking about AGI. Then we can put the machines to work discovering new laws of physics. I am all for it, I just don't see it happening anytime soon.
What you're talking about is by definition no longer facts but opinions. Even AGI won't be able to turn opinions into facts. But LLMs are already very good at giving opinions rather than facts thanks to alignment training.
> When an LLM is trained, it essentially compresses the knowledge of the training data corpus into a world model
No, you added an extra 'l'. It's not a world model, it's a word model. LLMs tokenize and correlate objects that are already second-order symbolic representations of empirical reality. They're not producing a model of the world, but rather a model of another model.
Do you have a citation for the existence of an LLM "world model"?
My understanding of retrieval-augmented generation is that it is an attempt to add a world model (based on a domain-specific knowledge database) to the LLM; the result of the article is that the combination still hallucinates frequently.
I might even go further to suggest that the latter part of your comment is entirely anthropomorphization.
> If there is a difference, where does that exist? In the mechanism of the LLM, or in your mind?
Thank you for this sentence: it is hard to get across how often Gen-AI proponents are actually projecting perceived success onto LLMs while downplaying error.
You mostly see people projecting perceived error onto LLMs?
I don't think I've seen a single article about an AI getting things wrong, recently, where there was a nuanced notion about whether it was actually wrong.
I don't think we're anywhere close to "nuanced mistakes are the main problem" yet.
But the errors are fundamental, and the successes actually subjective as a result.
That is, it appears to get things right, really a lot, but the conclusions people draw about why it gets things right are undermined by the nature of the errors.
Like, it must have a world model, it must understand the meaning of... etc.; the nature of the errors they are downplaying fundamentally undermines the certainty of these projections.
Large language models are one particular implementation of machine intelligence, optimized for generating language but not necessarily for being true or for being grounded in, or sourceable to, reference material or facts.
I think we're taking models that have been optimized for a particular thing and repurposing them towards new tasks and then assuming that the principles of large language models as we have developed them are just universally inherent properties of AI, which I think is not at all the case.
Conceivably, you absolutely could train models where their own built-in metric of success involves a notion of appropriateness of citations. The models we have simply don't do that, but I don't think we should proceed to the assumption that we are witnessing the future of AI with hallucinations everywhere, I just think it's a lazy repurposing of systems that were not initially designed with these things in mind.
Designing the metric of success is the hard part. The magic of LLMs is that they can learn a lot with unlabeled data and a relatively simple cost function. They can be trained on the mountains of data we just have lying around without a ton of additional labeling or clever domain-specific metrics.
If this approach doesn't really generalize well, it negates a lot of the usefulness of these methods and suggests we are further away than we might think.
I think the way out of this is training models on multimodal data where one of the reasoning modes is reasoning graphs or knowledge graphs or schematics.
We might gradually improve at constructing graphs from data like text, images, and speech. And we could create systems that correctly create synthetic data from generated graphs to serve as training data for a large multimodal model.
> optimized for a particular thing and repurposing them
A lot of human exuberance (and thus investment) is riding on the questionable idea that a really good text fragments correlation specialist system can usefully impersonate a generalist AI. ("LLM, which rocks are the best to eat?")
Even worse: Some people assume an exceptional text-specialist model will usefully meta-impersonate a generalist model impersonating a specific and different kind of specialist! ("LLM, do my taxes.")
How many of the large systems in the world (civil infrastructure, bureaucracies, the internet, etc) don't involve a great deal of repurposing previously present systems for purposes beyond their original design intent? I don't think we're guaranteed to be stuck with hallucinating AI embedded in everything, but if something gets popular enough to work its way into a lot of tech stacks, it can easily persist well beyond the point where anyone designing things from the ground up would include it simply because it was already there.
Yeah, as far as my personal use cases go: code or basic Q&A that sort of chains together naturally. It's easy enough for me to quickly test these ideas or code. I can typically eyeball it when it is wrong, and if not, I'll run it and see.
Done. It works for me, but mostly because I can easily validate and I'm confident in quick / painless validations on my end.
Something in depth that involves a great deal more multi step reasoning to get from point A to B, I've got my doubts. If anything that's where I see AI wander off the path into a mess.
But how do you do that? The model is a pattern recognition engine and the problem we call "hallucination" is actually that the model is recognizing patterns from data that is similar to, but not an exact match to what we want. In order to eliminate this issue with the current architecture it would require a model trained on only the specific data set we want answers to be drawn from. But that isn't a viable solution, because just getting responses that aren't total nonsense requires a training set so large as to make this impossible for specialized types of data.
> The model is a pattern recognition engine and the problem we call "hallucination" is actually that the model is recognizing patterns from data that is similar to, but not an exact match to what we want.
But that's a terrible fit for law. In law, the differences really matter.
Maybe two models? One like current LLMs, generating the usual bullshit. A second model trained to map output from the first model to reliable citations or mapping to some value from 0 to 1 predicting the confidence of the models accuracy.
Clearly I am just bullshitting myself here, I don't know how to train the second model. Something mapping text to reliable sources...(waves hands)
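One concrete shape the hand-waved "second model" could take is an NLI-style verifier that scores whether a retrieved source passage entails a generated claim. A minimal sketch follows, using a public NLI checkpoint; treating its entailment probability as a 0-to-1 confidence score is exactly the leap being hand-waved here, and retrieving the right source passage in the first place is the hard part.

```python
# Sketch: score a generated claim against a source passage with an NLI model.
# The legal example text is made up for illustration.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "facebook/bart-large-mnli"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

source = "In Dobbs v. Jackson (2022), the Supreme Court overruled Roe v. Wade."
claim = "Roe v. Wade is still controlling precedent."

inputs = tok(source, claim, return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
# Labels are contradiction / neutral / entailment; only entailment by a trusted
# source should count toward the claim's reliability.
for label_id, label in model.config.id2label.items():
    print(label, round(probs[label_id].item(), 3))
```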
But how do you do "reliable citations" with the current architecture? You still have the problem that it is at its core a pattern recognition engine. It will just be "looks similar to all the reliable citations in the training set for similar subjects" not "this is the correct citation for your specific query."
> Something mapping text to reliable sources...(waves hands)
You mean basically Google search? What you want is an intelligent search engine; no such search engine exists today, but not due to lack of trying. This is a trillion-dollar problem.
Not to say that it is easy in absolute terms, but I'd argue that true/false'ing a statement, e.g. "humans should eat 1 rock a day" is a categorically easier problem than answering "What should humans eat"?
For fun/example I asked gpt3.5 "What percent of dieticians would suggest eating one rock a day is good for your health?" And got a pretty solid if wordy 'none'.
I was wondering if these are fly by night kinda random AI startup type companies offering these services.
>we put the claims of two providers, LexisNexis (creator of Lexis+ AI) and Thomson Reuters (creator of Westlaw AI-Assisted Research and Ask Practical Law AI), to the test
LexisNexis, and Thomson Reuters, those are established big companies offering these services.
If I was a law firm, I would not trust my client’s information with Harvey or OpenAI. It honestly doesn’t matter what they promise if my reputation is on the line. No company can promise not to be hacked or leaked.
I agree from the perspective that they're probably using the data to do training, which is a big big no-no in the space, but in terms of hacking, that's true of basically any platform. Certifications like SOC 2 and FedRAMP exist, and even old players like Relativity are moving into the cloud. Unless law firms suddenly start wanting to manage their own IT infrastructure, trusting some slice of tech is necessary.
I’d say a small law firm. They don’t have a security team, they’re routinely targeted by phishers, and their data is easier to convert into money than the huge mass of (mostly completely uninteresting) data fed into ChatGPT.
I'd be interested to see other examples of Lexis+ AI failures, because I thought the example answer shown in the article appeared to be the "least wrong" one given that the answer was correct until Dobbs was decided less than 2 years ago.
I'm not in the LLM training space, but I'd also think that that kind of failure would be the easiest to fix, i.e. instead of starting out with "Currently...", start out with "As of <date>..."
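A minimal sketch of that "As of <date>" idea: stamp every retrieved passage with its date and instruct the model to qualify its answer temporally. The `retrieved` structure, the field names, and the prompt wording below are illustrative assumptions, not any vendor's actual pipeline.

```python
# Sketch: date-stamped retrieved chunks plus an instruction to answer "As of <date>, ...".
from datetime import date

retrieved = [
    {"text": "Roe v. Wade (1973) recognized a constitutional right to abortion.",
     "decided": date(1973, 1, 22)},
    {"text": "Dobbs v. Jackson (2022) overruled Roe v. Wade.",
     "decided": date(2022, 6, 24)},
]

def build_prompt(question: str) -> str:
    # Newest authority first, each chunk labeled with its date.
    chunks = sorted(retrieved, key=lambda c: c["decided"], reverse=True)
    context = "\n".join(f"[decided {c['decided']:%Y-%m-%d}] {c['text']}" for c in chunks)
    return (
        "Answer using only the sources below. Begin the answer with "
        f"'As of {date.today():%Y-%m-%d}, ...' and note any source that has been "
        "overruled by a later one.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

print(build_prompt("Is Roe v. Wade still good law?"))
```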
Thanks very much for linking that. This makes me think that legal support is actually one of the worst possible uses of LLM-based AI (at least as implemented here), primarily because so much of the source material is directly contradictory, e.g. a legislature passes a law which is subsequently overturned by the courts, or decisions in lower courts are reversed by higher courts, or higher courts reverse themselves over time. It feels like you'd absolutely have to annotate all the source material in some way to say whether it was still controlling law/precedent.
Your instinct is correct. The major legal research providers (Thomson Reuters and LexisNexis) both provide “citators”, which are human annotations of which cases and statutes have been overruled, upheld, criticized, etc. One of the issues the paper describes is the fairly ham-handed way this gets integrated into these systems, causing even more trouble.
Pretty much the same when I try to get an LLM to output correct code targeting a certain library, but its training data contains various conflicting versions of the library, and the result is an incompatible mix composed of code for different versions thrown together.
Is there actually an effective way to handle queries with RAG where time periods are relevant? I made a proof-of-concept RAG for documents on a government website shortly after GPT-3.5 came out and remember this being a big problem. The most glaring wrong answer was "Who is currently the governor?" It answered with the previous governor, likely because he was listed as such in 8 years of documents versus the 2 years at the time for the current governor.
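One common mitigation, sketched below, is to make retrieval itself time-aware: blend embedding similarity with a recency decay (or filter on document dates outright) so the many older documents stop outranking the few current ones. The half-life and the multiplicative blend are arbitrary illustrative choices, not a known-good recipe, and the example documents are made up.

```python
# Sketch: rerank retrieved candidates by similarity weighted with a recency decay.
import math
from datetime import date

def recency_weight(doc_date: date, half_life_days: float = 365.0) -> float:
    age = (date.today() - doc_date).days
    return 0.5 ** (age / half_life_days)

def rerank(candidates: list[dict]) -> list[dict]:
    # Candidates are assumed to carry a precomputed embedding similarity
    # and a publication date from document metadata.
    return sorted(
        candidates,
        key=lambda d: d["similarity"] * recency_weight(d["published"]),
        reverse=True,
    )

docs = [
    {"text": "Governor A. Smith announced the 2016 budget.", "similarity": 0.82,
     "published": date(2016, 3, 1)},
    {"text": "Governor B. Jones was sworn in last January.", "similarity": 0.79,
     "published": date(2023, 1, 10)},
]
for d in rerank(docs):
    print(round(d["similarity"] * recency_weight(d["published"]), 3), d["text"])
```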
I'd suggest they face more constraints, to the point of it being impossible for them to create something good here:
1) Executives who don't understand LLMs at all and don't really understand the process of building products under massive uncertainty around product quality.
2) Executives who have grown up building products which didn't need insane amounts of continual tuning.
3) Internal process that isn't set up for the types of products mentioned in 1 and 2
4) Entire groups of stakeholders in the same position as executives, putting pressure on the people building these things.
To make an analogy - this space is closer to the self-driving car project at Google. Zero product risk, the technology is sort of all there, but putting it all together is difficult and it's gonna be ready when it's ready. Pushing something out to meet a schedule ain't gonna work.
This is the (near term) future of white collar jobs: AI does most of the work, but real people will still be required to audit and verify basically everything. I don't know how much net gain productivity there will be...in certain professions an error rate that high is unacceptable and dangerous.
The whole AI workflow sounds incredibly backwards - you give all these creative tasks to computers, which are notoriously bad at creative tasks, and then the drudgery of verification and review goes to humans, which are notoriously bad at verification and review.
Robots replaced most painters a century ago; they're called cameras. So what you are talking about has already happened many times over and started a long time ago; it's just easier to replace art since it doesn't have a precise requirement.
Or see social media replacing news etc, similar situation. It turns out that manipulating our emotions at scale isn't that hard.
Edit: Look at old things that were made; craftsmen put in extra work to make them beautiful. With modern things, machines took all that work and produce standardized things; machines already took over most of the art-making that lots of people used to engage in. Instead, today humans mostly do paperwork.
Edit2: The point was that technology has always made us move from art to ever more tedious work; the only exception was probably farming.
You can absolutely automate ornamentation with CNC. Companies choose not to because there are very few kinds of ornamentation that stand the test of time. The ornate things you see on Antiques Roadshow were, in their time, contemporary with tons and tons of crap that didn't survive. What you're seeing is that automation is used to produce crap to exacting standards, which frees craftsmen up to produce ornate pieces for very high cost. People also don't want to pay for ornament on disposable pieces, and most flat-pack furniture is disposable. So it's far more complex than you make it out to be.
I see a ton of ornamentation in poor old villages and tribes; craftsmen making ornamental pieces has always been a standard part of human culture. Today that part of human culture is mostly lost, and only a few hobbyists engage with it or buy it; most are happy with robot-made ornaments.
Sure, they exist, but they aren't nearly as established or high-status as a century ago. Technology did replace artists; some still do art, but they are very few and mostly just do it in their personal time instead of being professional artists. You can still draw your own art even with image generators; they didn't change that.
Before automatic photos, automatic music playing, etc., people who could play music or draw were highly sought after just to draw and play music everywhere, since there were no alternatives. Technology changed that, and today those are dying breeds.
The reason music performances aren't as high status as centuries ago is that it's been democratized. The barrier to entry is lower. Anyone can get a mic and amp and start busking on the street corner. I think that's a good thing for humanity at large.
Practice of art works better IMO as an avenue for creative expression than it does as a status symbol. This can be done even with no audience at all. Perhaps the audience is all consuming AI-generated work.
Status often comes with rarity. There's more art now, and that's a good thing. Some of it has status, but most is just people doing stuff because they can. And that's not a bad thing.
> The reason music performances aren't as high status as centuries ago is that it's been democratized
I'm talking about the median musician's status, not the higher end; that crashed when automatic music playing started to get good. Before then every little village needed to have musicians around for their celebrations and parties, knowing how to play music was respected as a proper profession then unlike now.
That goes for all the ornamental pieces that used to be done by craftsmen in villages as well, things like making nice signs or celebratory furniture or festival clothing; all that artistry that common people used to do is now mass-produced by machines that just copy a single design over and over.
So overall I'd say machines have so far massively reduced human artistry and cultural creation. The artisans of old didn't just copy either; humans aren't machines that can perfectly copy, so they all put a bit of themselves into their work. That wasn't soulless, and it is now mostly lost.
> Before then every little village needed to have musicians around for their celebrations and parties, knowing how to play music was respected as a proper profession then unlike now.
This isn't really right, I suspect. What is more likely is that before then, lots of people knew how to play some musical instruments, and most were rudimentary. Because music was foundational to cohesion.
"Musician" as a profession is really quite new -- as a sibling comment suggests, patronage/donations/tips/busking would have been the only way.
But then also it's fair to say that an entire class of reliable musical instruments produced at any kind of production scale is also quite new.
Essentially all valved brass and wind instruments are less than about 200 years old in design. The first modern classical guitar is also not much more than 200 years old, surprisingly. The first pianoforte is only 320 years old or so.
Many simple folk instruments are this sort of age -- the balalaika is at most as old as the piano, the ukulele much younger.
Few truly loud melodic instruments existed much beyond 1550; Amati's violin dates to then. Amazingly the rackett, an instrument often used to portray medieval wind music in films, is younger than the violin, and the crumhorn is not much older.
In the bad old days, the only practical way you could really learn music was to get a patron, or somehow make a living out of it. The barrier to entry was sufficiently high that it was a full-time commitment, one way or another.
Today, it's still possible to be a full-time musician, but by far, most musicians have day jobs. You can do music as a hobby.
On one hand, high fidelity music reproduction lowered the demand for performers, as you note. On the other hand, cheap high quality music equipment lowered the barrier to entry. Today you can play your piano piece on a sub-$1k electric keyboard that's portable and never needs to be tuned. It even sounds good. My opinion is that, on balance, the net human artistic output is way up. I also have no data to support this. But it just feels right.
I also think net human artistic and creative output is up and we just don't realise it -- therefore we do not think clearly of the damage generative AI will do to how we feel about the value of our lives.
My understanding was actually that technology enabled more time for leisure activities, one of those being art. Maybe not in the past few decades, but certainly as we moved through the Renaissance period and onwards.
Cameras replaced only a very specific kind of painting, one that some people still do as a hobby (realism).
To your second edit, the fantasy of automated robots has always been to move us in the direction of a leisure-based society. Even today, Elon musk is promoting robots and AI as the means to get "universal high income".
The fact that this often does not pan out (though arguably laundry machines, factory clothing and vacuum cleaners have done more than almost anything else in that regard) does not mean that the idea is in any way uncommon.
> Cameras replaced only a very specific kind of painting, one that some people still do as a hobby (realism).
That was by far the most common way for artists to provide for themselves though, cameras did take their jobs.
Similarly, text-to-image only solves a very specific part of art: art where you aren't very creative but instead draw an image based on a description. Artistry based on coming up with new things to draw will still be there. Sure, most jobs artists do today are based on drawing from descriptions, so it's the same situation as with the cameras.
Those are the equivalent of prompt-engineer artists; they use the new tool to create new art. It still removed most of the jobs; there isn't nearly as much need for photographers as there was for painters.
When you consider that the people who develop this stuff consider that a good way to write a song might be to get an AI to generate a completely generic, formulaic song and then have a musician fix it, it's not so hard to understand why they think it might work for the law.
Because that isn't how you write good songs any more than it's likely how you write good legal filings.
Depends on your definition of good. All popular songs have been highly formulaic for a long time. That's what makes them popular, they aren't challenging to listen to and sound roughly like everything else. I'm not going to get into the nitty gritty of music theory, but any serious musician can tell you that writing a good song is striking a balance between boring and novelty. I think AI generated music does in fact get you 95% of the way there. The other 5% is already being in the elite levels of the music industry.
That's even more confused. They never said formulaic songs become popular, they said there is a formula to make popular songs. Those are different statements, and the latter is clearly not true.
A "hook" is a pretty pure example of musicianship, because it has to be "hooky".
Gregory Bateson said that "information is the difference that makes a difference".
For a musician as for music fans, hooks are like this: a thing will stop being hooky when everyone uses it.
It just becomes part of music language. Indeed you could argue that many of the fundamental qualities of popular music are hooky qualities pushed down a layer or two.
Carly Rae Jepsen's "Call Me Maybe" is an example of this. It's full of very obvious, well-understood musical tricks -- syncopation, breakdowns, subtle pacing changes.
But it's also full of maddeningly effective hooks that are incredibly, even deviously well-crafted. People write songs like this to feel proud of writing songs that get into everyone's heads, but also because it gets into their heads.
Since generative tools don't know how to focus on craft -- or even know of its existence -- a good hook is something a generative algorithm will always struggle with, because it requires innovation, and because it is a complex, fragile element in itself.
The rest of it, as you say... it's likely quicker for an experienced musician to just do the work rather than keep poking a generative tool until it is right.
I bet you all of this holds true -- innovation, craft -- even to some extent in quite banal legal filings.
> but there is already a significant amount of automation for the mechanical parts.
Right -- and there can be ordinary, boring, individual scrutiny of what those pieces do, and data/code fixes for them.
I mean, it's no small amount of irony that the systems lawyers use are close to the kinds of "expert systems" that dominated AI development after the first AI winter.
It's true. I've integrated ChatGPT into my report writing work flow, and about all it's good for is brainstorming, and quickly formatting ideas. You've got to do the hard work yourself.
TL;DR: There's a lot of really crappy, rushed "AI" that has accumulated over the last 18 months. Yet AI is awesome. It's *really* stunning to live it. I cannot stress this enough. You could probably do a GPT wrapper startup today, and as long as you don't rush, roundly beat every single incumbent within a year.
Context:
I'm a sole developer who quit my job at Google 7 months ago, having decided to 18 months ago after getting a first-hand look at how search x AI was being built.
I just wrapped up 2 days of final benchmarking for release.
3-search-query RAG scores 97% on the USMLE, 6 points above "SOTA": a Gemini 1.5 tuned for med that does 4 rounds of N answers each round, then 3 search queries to resolve differences.
How am I possibly roundly beating that? Rushing, on their part. Lack of attention to basic details while just getting to the desired outcome, "we finetuned for med and beat GPT-4". And that's *Google*, infinitely resourced.
And Perplexity? AI startup darling? 76%. That's the $40/month version. The free version is at 66%. I'm absolutely stunned; it didn't seem great, but I didn't know it was actively much worse than just using ChatGPT.
If that's how Google and Perplexity are going, I can't even imagine the shenanigans that are going down at companies like these. They don't have infinite resources, expertise, or have it as a core competency. There's probably more effort put into who gets to work on the AI thing than working on the AI thing.
(I did legal & med benchmarking; for legal, to compare with TFA: Perplexity free 58%, Perplexity pro 67%, Llama 8B w/no internet 65%, GPT-4o x internet 90%.)
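For readers unfamiliar with the setup being described, here is a minimal sketch of a multi-search-query RAG loop applied to a multiple-choice benchmark. This is not the commenter's actual system; `llm` and `search` are hypothetical stand-ins for whatever model API and retrieval backend you plug in.

```python
# Minimal sketch of a "3-search-query RAG" loop for multiple-choice questions.
# `llm` and `search` are placeholder callables the caller supplies.
from typing import Callable, List

def answer_mcq(question: str, options: List[str],
               llm: Callable[[str], str],
               search: Callable[[str], List[str]],
               n_queries: int = 3) -> str:
    # 1. Ask the model to propose a few targeted search queries.
    queries = llm(
        f"Write {n_queries} short search queries (one per line) that would help "
        f"answer this exam question:\n{question}"
    ).splitlines()[:n_queries]

    # 2. Retrieve supporting passages for each query.
    passages = [p for q in queries for p in search(q)]

    # 3. Answer using only the retrieved context plus the question.
    context = "\n\n".join(passages)
    prompt = (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        f"Options: {', '.join(options)}\n"
        "Reply with the single best option, verbatim."
    )
    return llm(prompt).strip()
```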
Industries do explore new products on hype, but they don't generally become widespread or disruptive until they can actually demonstrate some value.
Companies are obliged or incentivized to start introducing products so that customers can see what's achievable, but it's not yet obvious where generative AI products will be able to really take root and where they'll just be written off until some big innovation is made.
And whatever happens, it seems very unlikely that it will apply to white collar jobs in general any time soon. It will apply to some industries, but others almost certainly won't be able to get what they need from what seems possible so far -- or will only find use for it in certain narrow segments of their work.
That and the fact that we're no longer training and leveling up juniors. Companies want to hire just one senior employee and let them use AI as they would a team of junior employees.
But since we're no longer training juniors, we're no longer creating seniors. What happens when there's no one skilled enough to replace those senior employees when they leave?
> Why, if there is rigour and precedent, is there not a solution that doesn't start from a foundation of waffle and imprecision?
Law is a fact-specific discipline, laden with jargon, and where the answers can change on a pretty frequent basis. It's not that far off the mark to say that the answer to every legal question is "it depends." What you need is a system that can take a moderately free-form explanation from a user, tease out the legal aspects from that explanation to match against the legal database (which almost certainly isn't going to use the same terms the user used!) to arrive at the answer, and in the inevitable scenario where the user failed to provide enough information, also be able to recognize that more information is needed to answer the question, and query the user (in their own terms) to get them to provide that information.
LLMs are decently good at handling the language translation aspects, definitely, far better than any previous AI solutions; so it's not hard to see why people try this stuff with LLMs.
So why not ask ChatGPT who to marry and follow that advice instead of your gut, it's just random after all!
No, people really care about reducing error rates. It doesn't matter that some things have higher error rates; if you can reduce the chance of a bad marriage or a bad contract, people really want to do that.
Well, as it relates to humans, typically at an orthodontist office you have the one licensed orthodontist and any number of dental assistants, and the licensed orthodontist can oversee the work of the assistants. The assistants do the majority of the work but aren't credentialed, but because they're overseen by the main orthodontist we regard the work as rising to a standard we associate with professional legitimacy.
I think the net result in these cases is that there's a lot more professional capacity to meet the needs of people who have braces.
I don't suspect AI is coming for our teeth anytime soon but there's precedent for it as an oversight structure that satisfies us of the legitimacy of certain types of work.
The assistants do stuff like flossing the braces and such, but the ortho comes in to actually install and set them up. It's not purely oversight but a skill gap, and a question of what is possible given an amount of training.
I got an ad for an AI legal consultant and it made me upset. Someone is trying to make a cheap buck by outsourcing legal advice to an unreliable source. Now I'm more prepared to explain how it's a bad idea, instead of just why.
I like that the article explains a few different ways hallucinations creep in, besides the obvious. Maybe what's most needed is a better retrieval/search system? If the AI can't fetch high quality data, then it's doomed before it tries.
What really upsets me is the idea of making therapist LLMs.
They may be able to regurgitate an impressive amount of psychology texts, but that doesn't give them the theory of mind, observational skill, experience or judgement needed to be a good psychotherapist.
I know LLMs and genAI are trendy and they seem spectacular and work like magic, but maybe they are not the tool for every problem. I can see them working fine in a lot of aspects/problems, but that doesn't mean they will work fine in every aspect that involves text, even if they are refined, etc.
I guess once the fashion passes (probably not soon, unfortunately), people will see the real value of LLMs and will start applying them to problems that are solvable by them, and not absolutely everything.
No one knows what the real value of LLMs is though. It's a nascent technology. We could be discussing the internet circa 1994 or we could be discussing bitcoin circa 2011.
> the internet circa 1994 or we could be discussing bitcoin circa 2011.
People were clear about the value of both. SMTP by itself was already a clear value for business. The WWW itself was invented in 1989. Same for Bitcoin: the title of the original paper, "Bitcoin: A Peer-to-Peer Electronic Cash System", explains it all. Some capabilities were not there yet, but those were mostly infrastructure and engineering issues (acceptance was another issue for Bitcoin).
LLMs generate tokens. The generation is governed by the model, but the model itself has no concept of usefulness or truth. Books are not enough to train them. The WWW has a lot of garbage information. And natural language is insufficient to describe exact processes without formalizing it.
The fact is that the human mind can already do what LLM fans hope they will achieve. And we can transfer knowledge through media. Getting something done is mostly assembling the right people in a room, recursively. And we built tools to enhance productivity.
I believe that Microsoft and others are hoping that the general public will be ok with wrong results. But the fact is that all the capabilities exposed so far have an alternative and more deterministic process to achieve the same result. But it often requires learning or hiring a specialist and it's sad to see how many people are balking at that.
I'm going to go out on a limb and say that LexisNexis and Thomson Reuters didn't do nearly enough (if any) taxonomical engineering of the corpus before deciding to just do some sliding-window chunking. Without that, the whole "think/plan" part of a natural language query pipeline is all but useless.
I just want to bang a marching bass drum while walking through their office and continually shout, "you still have to do the messy part!"
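For context, this is roughly what "sliding-window chunking" means: fixed-size windows with overlap, cut without any regard for the document's legal structure (sections, headnotes, citations). A minimal sketch:

```python
# Naive sliding-window chunking: fixed-size character windows with overlap.
# Nothing here knows what a section, headnote, or citation is.
def sliding_window_chunks(text: str, window: int = 1000, overlap: int = 200):
    step = window - overlap
    return [text[i:i + window] for i in range(0, max(len(text) - overlap, 1), step)]
```

The taxonomical engineering the comment calls for would presumably split and tag along the corpus's own structure instead, so retrieval can reason about what a chunk actually is rather than treating every 1,000-character slice the same.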
I try to read a few SCOTUS rulings a year and I always wonder if anyone actually reads all the citations, notes, references, etc. It just feels like doing it right would take hundreds of hours. Sometimes I wonder if the judges have read their full final opinions or if it's just the clerks.
I'm genuinely curious, how does this work in regular law? Does anyone ever check all the citations, etc.? Does a judge really deep-read all 200 pages of filings and arguments? Do they just scan them efficiently as a lot of it is familiar?
SCOTUS justices have 4 full-time clerks, and yes they absolutely check all the citations. Getting onto law review in school involves a “cite checking” test, since this is a very important task for lawyers.
Also, some of the citations are routine, and are very well-known by lawyers and judges. For example, when a lawyer sees “Carolene Products footnote 4”, they don’t have to check it to know it’s about interstate commerce. In some SCOTUS opinions, 1/3 of the footnotes could be routine ones like this.
Clerks are portioned out to read a bunch of this. Judges presumably read their own opinions, as do judges who concur with any opinion written. Mind that reading and writing these things is 90% of their job, 10% is the actual hearings, so imagine dedicating at least 40 hrs a week over months and months to this stuff. I imagine it isn't a particularly challenging workload to write.
Lawyers absolutely read every word of court decisions—citations, notes, appendices, concurrences, and dissents. Usually multiple times. But that tends to happen when looking to cite, criticize, or argue about a particular decision that matters. In other words, when you might have to defend your interpretation against another lawyer. And when you have the time. Which is money.
Lawyers reading for other reasons may skim, hunt for specific passages, or even just glance at any provided syllabus, depending on why they are reading. When you've read enough of these, by the same judges, living and dead, over and over, you get a sense not just of formatting and structure but of written and thinking style as well. You get better at determining whether and when it's worth a full, deep read.
One of the things American law schools teach by experience in the first year is that the level of attention, organization, and critical thinking expected in reading is far higher, in its own peculiar way, than what most students are used to putting in, even from very strong academic backgrounds. Then they develop endurance for it, by assigning many cases to read for each class session, several times a week, for several classes at once. All in a competitive environment where your hiring prospects largely come down to grades and your grades come down to a single exam per subject, each awarded on strict statistical curves.
Part of it's that you learn what you need to read. Part of it's that you just grind. When you've been grinding long enough, you don't even feel it anymore. It still hurts, but it's a long, slow abrasion on your mind and personality. Not a stitch in your side anymore.
I'll never tell anyone off from reading Supreme Court opinions. It's your court! As long as you can keep it. But for most folks reading for interest, the syllabus is fine. If you want just a little more than that, read the intro and concluding sections of the "opinion of the court". Then read the intro of each dissent, if there are any.
> Im genuinely curuous, how does this work in regular law? Does anyone ever check all the citations, etc?
Yes. At higher levels (state supreme courts and above), there are staff checking every single citation. It's often checked in a paper copy of the original source, which in rare cases could be a book from e.g. the 1700s.
That's roughly the range of nonsensical or incorrect answers I've gotten from various human lawyers, accountants, financial advisors, and other professionals over the years. I've come to expect about a fifth to a quarter of their advice to have problems with it.
Being wrong a third of the time would be a notable improvement for the human medical professionals I've dealt with. It's basically a coin-flip when it comes to the correctness and quality of what they claim.
Where is the recourse when your financial advisor loses your money, your lawyer loses your case, and your surgeon loses your life? Happens all too often because there literally isn’t any recourse.
I don’t think that word “smarmy” means what you think it means, and there is no valid point about malpractice - that’s the essence of the comment. Laws are only as effective as the means of enforcing them, and in this case not everyone has equal access to the justice system.
I have an uncle who is an attorney in X state. I had him try, using GPT4, a bunch of prompts about X state law in his specialty, and the rate of hallucination was much higher than 1 in 6. Probably half or more were incorrect. Often the answers would be correct for other states, but not for X state. Alternatively, they were correct for X state at a certain point in time, but no longer are.
Geoffrey Hinton says the term "confabulate" is more apt to describe the mutex/mutating memory recall operation of the brain as it summons "known facts" and "gets them wrong." Confabulation is a feature, not a bug, of the brain, and arguably a feature, not a bug, of these models too. If the goal is to reach brainlike computing, we are getting closer.
I feel like LLMs just show how utterly stupid we all are, fundamentally.
All it takes is a machine using full sentences, and suddenly we assume it must be smart. This is why humans fall for scams, and this is why humans get into cults.
I guess that's something LLMs can teach us: how flawed we are in evaluating whether someone is speaking the truth without researching anything.
I wish someone would get free consultations with 100 attorneys and record the meetings. Then grade their answers in the same way the LLMs were graded.
My guess (and it is only a guess) is that the AIs outperform real attorneys during free consultations by hallucinating less and giving truthful answers more often.
We think of truthfulness as boolean — whether in humans or AI models — but it may be more useful to think of it as a skill. You can become more truthful by aligning your inner model with reality and then also by checking it with ground truth. A challenging and intrinsically imperfect process either way!
Seeing a lot of bad takes in the comments, but from an enterprise sales point of view: companies aren't buying or trusting LLMs.
1 out of 6 times the black box produces completely wonky/risky output. And it costs far, far more to fix the changes introduced by the black box, including the hallucinations.
This is the blockchain argument all over again in these comments. Literally just swap blockchain out for AI/LLM and it feels like the crypto bubble peak.
What AI hype is doing is giving companies cover to massively garnish white-collar wages by making everybody think it can replace people (when in reality AI cannot completely replace a sentient human interface), extrapolating to wild science fiction novels. A bit disappointing to see; thought reddit would be more for this type of speculation.
The solution is simple - put the relevant law in the prompt. The larger context size of Claude lets you shove in half-a-dozen cases, your petition, and a few other docs.
When you do that, the quality is more than passable, and instantaneous.
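As a concrete illustration of the "put the relevant law in the prompt" approach, here is a minimal sketch using the Anthropic Python SDK. The model name is a placeholder (use whatever large-context model is current), and the prompt wording is illustrative, not a recommendation from any vendor.

```python
# Sketch: stuff the relevant cases/statutes directly into a long-context prompt.
import anthropic

def ask_with_context(question: str, documents: list[str]) -> str:
    context = "\n\n---\n\n".join(documents)  # cases, your petition, statutes, etc.
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder model name
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": (
                "Answer using ONLY the documents below; cite the document "
                f"you rely on for each point.\n\n{context}\n\nQuestion: {question}"
            ),
        }],
    )
    return response.content[0].text
```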
The article’s cited examples of errors, such as not understanding that a precedent has been overruled or inventing a provision of the bankruptcy code, give a fascinating insight into how LLMs work and their limitations.
I get the sense that the solution to this problem is more use of LLMs (running critical feedback and review in a loop) rather than less use of LLMs.
If you can build good tooling around current kinda dumb LLMs now to lower that number, we will be in a pretty good position as the foundational models continue to improve.
Yeah, I'd imagine the problem is not verifying the output against retrieved documents. If it just hallucinates, it ignores the given context, which is something that can absolutely be verified by another LLM.
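A hedged sketch of that "verify with another LLM pass" idea follows. A second call checks whether the draft is supported by the retrieved documents and the loop retries if not; `llm` is a stand-in for any model API, and the whole thing is an illustration of the pattern, not a claim that it eliminates hallucination.

```python
# Draft-then-verify loop: a second LLM pass checks the draft against the context.
from typing import Callable, List

def draft_and_verify(question: str, documents: List[str],
                     llm: Callable[[str], str], max_rounds: int = 3) -> str:
    context = "\n\n".join(documents)
    feedback = ""
    draft = ""
    for _ in range(max_rounds):
        draft = llm(f"Context:\n{context}\n\nQuestion: {question}\n"
                    f"Previous reviewer feedback (fix these issues): {feedback or 'none'}")
        verdict = llm(f"Context:\n{context}\n\nDraft answer:\n{draft}\n\n"
                      "Does the draft contain any claim not supported by the context? "
                      "Reply 'OK' or list the unsupported claims.")
        if verdict.strip().upper().startswith("OK"):
            return draft
        feedback = verdict
    return draft  # best effort after max_rounds; still needs human review
```

The obvious caveat is that the verifier is the same kind of statistical component as the drafter, so this lowers the error rate rather than guaranteeing correctness.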
When talking to LLMs their responses are always a bit off. The best way I can describe it, it’s like they are speaking a specific dialect but you know they are using expressions that wouldn’t be used in that dialect. Or, it’s like they are way over their head in the specific matter and reiterate things they really don’t understand. Like there’s no substance in what they are saying.
And if they do this for a simple matter, or a matter that you know of, how can you trust their answers in a matter where you can’t evaluate the answer?
It doesn’t get better when they are totally mixing things up but say it with certainty, instead of just saying that they don’t know or admit that they have gaps in what they know. They don’t just get it wrong like humans do, it’s just weirdly wrong. Humans usually get it consistently wrong. Perhaps it’s about what they are not saying.
Would I always be able to tell it’s not a human I’m talking with? Probably not. But chances are that I will, the longer I talk to them.
Well, we are always using our internal models for work like this. When our internal models and what we say don't match reality, we might call it "hallucination" or "getting it wrong" or even "lying".
When we get it right — our internal model matches reality, and we also say something that matches reality — we call it "truthfulness", and it's super valuable.
But I think there are two really different sources of hallucination / inaccuracy / lies. One is that the internal model is wrong — "Oops, I was wrong. The CVS isn't on Main St." The other is when we decide to deceive. "Haha, I sent you to Main St. CVS is really on Center Rd." Two very different internal processes with the same outcome.
If we were only engaging in model-based inference, then we, too, would always be hallucinating. But the very thing you're pointing out -- that we act differently when our internal model is wrong vs. when it is right -- is the crux of the difference. We use models, but then we have the ability to immediately test the output of those models for correctness, because we have semantic, not just syntactic, awareness of both the input and output data. We have criteria for determining the accuracy of what our model is producing.
LLMs don't, and are only capable of engaging in stochastic inference from the pre-defined model, which solely represents syntactic patterns, and have no ability to determine whether anything they output is semantically correct.
LLMs hallucinate 100% of the time - hallucinating is what they Do. Often they’re hallucinating things that also occur in reality, but it sounds like the legal products are only doing so 66-83% of the time.
Words are meaningless if we don't adhere to some common definition. "Hallucination" and "generation" are two separate concepts, and it is unproductive to conflate them.
Then maybe they shouldn't be. "Generating" a legal reference and "hallucinating" it are pretty much the same thing. Lawyers need AIs that "remember" legal references or are better capable of "looking them up" on demand. "Generation" is the completely wrong thing to do here. In fact it is arguably the worst possible thing, inasmuch as simply failing would actually be far preferable.
I wouldn't expect a human to be able to just spit out all relevant case law, completely correctly with every last aspect of the citation accurate down to every last digit, based on solely their internal language model either.
Yes, this is what I mean: the thing the LLM is doing is a statistical guesswork process - that's the core of what the thing is. In many cases, its statistical guesswork is correct - it's guessed the right thing. That is not the same thing as actually referencing a canonical source, and should not be confused as such. So long as people's mental models of LLMs don't match what the LLM is actually doing, we'll continue to be surprised by how often they're wrong about things and we'll continue to try to use them in ways we shouldn't. Recognize what the tool you're using is and use it appropriately.
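To make that "statistical guesswork" point concrete, here is a toy illustration: at every step the model only samples a plausible next token from a probability distribution, and nothing in that step consults a canonical source. The numbers and tokens are made up for illustration.

```python
# Toy next-token sampling: sometimes the sampled continuation happens to match
# reality, sometimes it doesn't, but the mechanism is identical either way.
import random

next_token_probs = {"447": 0.42, "581": 0.31, "U.S.": 0.17, "F.2d": 0.10}
tokens, weights = zip(*next_token_probs.items())
print(random.choices(tokens, weights=weights, k=1)[0])
```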
Memory is generation; the memory palace metaphor is much less accurate IMO than ones that focus on Hebbian learning, learned (Pavlovian) responses, and/or, ironically enough, machine learning. The only difference between “remembering” a legal reference and “generating” one is that the former denotes a much higher level of confidence. Which, ofc, we all agree is an inescapably important metric — an acceptable failure rate for these models would be maybe 5%, MAX. LLMs used alone through a browser window, even with a half-assed “RAG” solution, are not enough IMO.
I think what you'd find in a real lawyer's office is that they don't "remember" or "generate" very many references at all. Those books in the stereotypical image of a lawyer's office weren't just there to look lawyerly, and in modern times those computers are equipped with some expensive subscriptions to legal databases. This is not how legal references are created, not now, not in the past. Depending on an LLM (not "AI", LLM specifically) to do this is insane, stupid, and anyone selling an LLM-based product for this purpose ought to be sued for fraud... and that only coincidentally since they are providing a product to lawyers.
Why is anyone even defending this? LLMs are objectively unsuited for this. LLMs are objectively failing at this. LLM architecture is obviously not a good choice for this task to anyone with even a basic understanding of them. But... who cares? All this means is that LLMs are not suited for this task. I wouldn't be banging on this except for the engineers running around here with stars in their eyes thinking that despite all the evidence they've finally found the tool that does everything. No. We haven't. They don't. This isn't an attack on the viability of AI as a whole or LLMs specifically, any more than pointing out that a hammer is not a good tool for cutting grass is an attack on hammers.
I think many people have forgotten what mastery entails. It's all about being able to reliably reproduce a process or solve a problem. Sometimes the process is deterministic and we engineer a tool for it. Or it's not, and we need to approach it the old way, by learning and training.
Providing truthful information is one of the latter, and one of the flaws is our memory. Writing it down or recording it corrects this, and the next problem is retrieving the snippet we need. That's one of the things a lawyer does. They don't invent laws out of thin air; they're just better at finding the helpful bits. Even today, I spend a lot of time looking at library and language documentation to verify a hunch I have. But I still have to go through the learning phase to know what to search for.
In the Dreyfus model of skill acquisition, we have these four qualities:
Recollection (non-situational or situational)
Recognition (decomposed or holistic)
Decision (analytical or intuitive)
Awareness (monitoring or absorbed)
The novice is described by the left values and the expert by the right values for each attribute. For the novice, every part is costly. For the expert, mastery has reduced the cost of taking action while keeping the possibility of a more deliberate approach when the chance of error is too high. Novices hope LLMs can reproduce mastery; experts know that LLMs' inherent capacity for making errors makes them a liability.
Thanks for the in depth, passionate followup! Right off the bat I want to clarify that I was talking about human cognition, not just typical attorney work — I’d stand by the assertion that it’s hallucination all the way down, at the very least “hallucinating a symbolic representation of the book passage you read 2s ago”.
Re: LLMs and law, I agree with all your complaints 100% if we constrain the discussion to direct/simplistic/“chatbot”-esque systems. But that’s simply not what the frontier is. LLMs are a groundbreaking technique for building intuitive components within a much larger computational system that looks like existing complex software. We’re not excited about (only) crazy groundbreaking products, we’re excited about enhancing existing products with intuitive features.
To briefly touch on your very strong beliefs about LLM models being a bad architecture for legal tasks: I couldn’t disagree more. LLMs specialize in linguistic structures, somewhat tautologically. What’s not linguistic about individual atomic tasks like “review this document for relevant passages” or “synthesize these citations and facts into X format”? Lawyers are smart and do lots of deliberation, sure, but that doesn’t mean they’re above the use of intuition.
As far as we’re in an argument of some kind, my closing argument is that people as a whole can be pretty smart, and there’s a HUGE wave of money going into the AI race all of a sudden. Like, dwarfing the “Silicon Valley era” altogether. What are the chances that you’re seeing the super obvious problem that they’re all missing? Remember that this isn’t just stock price speculation, this is committed investments of huge sums of capital into this specific industry.
Which, indeed, we do. The current understanding of perception is that sense inputs serve to correct the brain's existing model of the world - you do not look at something and then perceive it, you generate an image of it and then use your vision to correct it.
Actual hallucination is quite eye-opening in part because of the realization that perception lenses all our senses _all the time_. So while I agree that it's over-broadening the definition of hallucination, I think it ironically appropriate too.
The point is that LLMs are never right for the right reason. Humans who understand the subject matter can make mistakes, but they are mistakes of a different nature. The issue reminds me of this from Terry Tao (LLMs being not-even pre-rigorous, but adept at forging the style of rigorous exposition):
It is perhaps worth noting that mathematicians at all three of the above stages of mathematical development can still make formal mistakes in their mathematical writing. However, the nature of these mistakes tends to be rather different, depending on what stage one is at:
1. Mathematicians at the pre-rigorous stage of development often make formal errors because they are unable to understand how the rigorous mathematical formalism actually works, and are instead applying formal rules or heuristics blindly. It can often be quite difficult for such mathematicians to appreciate and correct these errors even when those errors are explicitly pointed out to them.
2. Mathematicians at the rigorous stage of development can still make formal errors because they have not yet perfected their formal understanding, or are unable to perform enough “sanity checks” against intuition or other rules of thumb to catch, say, a sign error, or a failure to correctly verify a crucial hypothesis in a tool. However, such errors can usually be detected (and often repaired) once they are pointed out to them.
3. Mathematicians at the post-rigorous stage of development are not infallible, and are still capable of making formal errors in their writing. But this is often because they no longer need the formalism in order to perform high-level mathematical reasoning, and are actually proceeding largely through intuition, which is then translated (possibly incorrectly) into formal mathematical language.
The distinction between the three types of errors can lead to the phenomenon (which can often be quite puzzling to readers at earlier stages of mathematical development) of a mathematical argument by a post-rigorous mathematician which locally contains a number of typos and other formal errors, but is globally quite sound, with the local errors propagating for a while before being cancelled out by other local errors. (In contrast, when unchecked by a solid intuition, once an error is introduced in an argument by a pre-rigorous or rigorous mathematician, it is possible for the error to propagate out of control until one is left with complete nonsense at the end of the argument.)
It makes sense TO conflate them so that people can better understand what's going on under the hood. We need to "de-personalize" these things as much as possible so we don't make stupid mistakes surrounding them.
Think of it like this: How USEFUL is it to say "I googled 'the closest ice cream shop to my house' but it 'hallucinated' the one that was the second-closest'"
> "I googled 'the closest ice cream shop to my house' but it 'hallucinated' the one that was the second-closest'"
A Google search doesn't do that, though. It will actually select ice cream shops from its database, determine their location, your location, and calculate the distance. And then select the closest.
It may mis-calculate, but it will not "hallucinate".
This is incredibly different than what LLMs do, which is why hallucinate is appropriate for an LLM but not appropriate for a conventional search or map engine
I fail to see how this difference strongly justifies the word "hallucination."
The computer is crunching numbers and probabilities of things and giving its best guess based on that, in both cases. If it's appropriate for one, it should be for the other.
(which, I'd say, it isn't, because "hallucination" implies "mistake by a human-like thing")
> The computer is crunching numbers and probabilities of things and giving its best guess based on that, in both cases. If it's appropriate for one, it should be for the other.
Google maps isn't choosing a nearest ice cream store based on probability, it's using an algorithm designed for this purpose.
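To make that contrast concrete, here is a toy sketch of the deterministic lookup being described. The shops and coordinates are made up, and real systems use road distance rather than straight-line distance; the point is only that the answer is computed, not sampled.

```python
# Deterministic nearest-shop lookup over a tiny "database" -- no guessing involved.
import math

shops = {"Scoop Dreams": (3.0, 4.0), "Cone Zone": (1.0, 1.0), "Gelato Go": (6.0, 2.0)}
me = (0.0, 0.0)

closest = min(shops, key=lambda name: math.dist(me, shops[name]))
print(closest)  # always "Cone Zone" for these inputs
```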
For the sake of sanity, I think we just have to acknowledge these are propagandistic terms that serve the interests of those who invent them, and are designed to mystify and confound non-experts. Then, move on.
I don't really think there's any winning this propaganda war until it shatters under its own illusions. It's shouting into the wind.
It's not that you can't build systems with a low rate of success; it's when you couple tightly to a black box controlled by a small number of people, where it's prohibitively expensive to have any insight infrastructure.
If LLMs were just another "framework" that a PHP developer could pick up, we could build systems to deal with these "hallucinations".
Everyone’s giving you grief when you’re clearly merely making the Kantian assertion that all experience must have both immediate and mediated elements, namely “our subjective perceptions” and “the unknowable ineffable objectively-undifferentiable Real World”. The best we can do is construct our little hallucinations and hope (and try to empirically ensure!) that they match whatever Reality might be, knowing that we’ll never have a true fundamental link to it.
Well, it’s either that or you were being a human elitist and portraying a difference of degree as a difference in kind. Idk if we’ve settled on terms yet but I’d probably level ones like “automatist”, “bio supremacist”, or maybe “radical humanist”…
> Everyone’s giving you grief when you’re clearly merely making the Kantian assertion that all experience must have both immediate and mediated elements, namely “our subjective perceptions” and “the unknowable ineffable objectively-undifferentiable Real World”. The best we can do is construct our little hallucinations and hope (and try to empirically ensure!) that they match whatever Reality might be, knowing that we’ll never have a true fundamental link to it.
Insofar as the mechanism by which LLMs operate mirrors the mechanisms by which the brain operates - which they do, to some degree - yes, the fact that LLMs hallucinate reflects the gap between perceptions, thoughts, and "reality" in human behavior, and there is indeed a deeper philosophical lesson to be drawn here. If LLMs manage to disabuse the field of software engineering of its apparently common conceptions about how the brain works and reflects reality, I'll consider it a win.
> Well, it’s either that or you were being a human elitist and portraying a difference of degree as a difference in kind. Idk if we’ve settled on terms yet but I’d probably level ones like “automatist”, “bio supremacist”, or maybe “radical humanist”…
I'm not sure how I could be considered a bio-supremacist for pointing out that a hammer is not a drill. LLMs are phenomenally complicated tools, but as currently deployed, they're tools, not beings. We may at some point create a system that one could credibly claim is an actual artificial intelligence, and that system may contain things that look or operate like LLMs, but right now what we're looking at are large statistical model that we're operating by providing it inputs and reading the outputs.
A first-line criterion here, for me, would be that the system operates continuously (i.e., not just when someone puts a token in the box). The LLMs we're discussing here do not do that. Someone generates a series of tokens, provides those as inputs to the equation, runs the equation, and then decides whether to continue to provide inputs into the equation until they're satisfied by the result. That's a tool, not a "being".
I don't know how else to get this message across, but it does this all the time in all subjects.
It doesn't just occasionally hallucinate mistakes. The mechanism by which it makes non-mistakes is identical and it can't tell the difference.
There is no profession where a) you shouldn't prefer an expert over ChatGPT and b) you won't find experts idiotically using ChatGPT to reduce their workloads.
This is why it's a grotesquely inappropriately positioned and marketed set of products.
GPT-based legal/medical/critical positioning is doomed to failure. These are essentially monopolies protected by humans, and AI, even if it becomes extremely intelligent, cannot infiltrate them unless those humans accept it.
A lawyer would be taking on massive risks by trusting GPT outputs. Even if they think that filtering out the 1-in-6 noise is a form of risk mitigation, they are mistaken. It's a slot machine in some sense, except it gives you a small payout 5/6 times.
For sure, I agree with you 100%. This is basically "Legal Analysis for Dummies" if you choose to rely on a machine to give you help here. Medical is also a bad domain.
What do you call a person who just barely passed the bar exam?
These tools are designed for and marketed to lawyers to use. These are not generalist LLM products. Your "this is why lawyers exist" statement makes no sense in the context of these products.
One of the models studied markets itself with "AI-Assisted Research on Westlaw Precision is the first generative AI offering from Thomson Reuters and will help legal professionals find the answers they need faster and with high confidence."
Another says "Most attorneys know Practical Law as the trusted, up-to-date legal know-how platform that helps them get accurate answers across major practice areas. Its newest enhancement, Ask Practical Law AI is a generative AI search tool that dramatically improves the way you access the trusted expertise from Practical Law."
A third says "Transform your legal work with Lexis+ AI"
Only if you want a factual answer that can be depended on.
If you just want a text bot that makes statements one word at a time, we can stop throwing $100Bs at this sector, and some marketing departments need to come in this weekend and redo a pitch or two.
For commercializing generative AI tools, which needs to happen in order for all the investment to pay off, it absolutely does.
In an adversarial industry like law, where there's often another party actively invested in challenging your work, high error rates in source material translate directly to high risk of failure/rejection in a case or claim.
Putting numbers on the failure rate for current products helps the industry understand whether those products are reliable enough yet and what degree of oversight they require.
LLMs are only capable of hallucinating, whereas humans are capable of hallucinating but are also capable of empirically observing reality. So whatever the human rate is, it's necessarily lower than the LLMs'.
In the sense of just completely making up references out of nowhere? Very low. Humans make mistakes, of course, but they tend not to be so egregious. Grounding a submission on a made up reference is quite likely to be fatal to your case in a way that most human errors aren't.
This is the important bit that's always missing in discussions about LLM/AI applications to existing industries - what is the rate of mistakes for a human worker?
My guess (and this is only a guess) is that LLMs hallucinate on legal questions because they expect things to make sense, but the law does not make sense. Lawyers have a vested interest in keeping the law an indecipherable and self-contradicting mess so that people have to pay them to get answers and help. This is the same reason (corrupt) judges treat attorneys better than people who represent themselves and refuse to point out what the law says in court.
LLMs hallucinate on legal questions because they hallucinate on everything.
Hallucination isn't a special weird form of error: it's the entire process by which LLMs work. Proponents and enthusiasts just call the errors "hallucinations" because it sounds better than admitting "this technology makes no distinction between correct and incorrect, and will often incorrectly correct itself when an error is pointed out".
It's probably best to take a step back and ask what you actually want an LLM to do for you. Using RAG and just having it raise references to consider is probably the best you can expect.
I have sometimes asked legal questions (to which I pretty much know the answer) of LLMs, and my observation is that a lot of the time the section of an Act (UK) is the principal node about which a piece of information is anchored, but the anchoring is very loose and many sections are often mentioned in proximity, leading to poor differentiation of the relevance of a document/sentence/word vector to a particular piece of legislation. It might be fixed by training NER to recognise treaties/acts/conventions and the jurisdiction and always using that to label references; I suspect the "1" in "section 1" or "17 USC 1", say, is not being tokenised as part of the "section" and this contributes to poor results. Maybe someone has worked on this?
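If you want to check that hunch quickly, one way is to inspect how such references tokenise, here using OpenAI's tiktoken library as an example. Other models use different tokenisers, so treat this purely as an illustration of the kind of check involved, not as evidence about any particular product.

```python
# Inspect how legal references split into tokens under one common tokeniser.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for ref in ["section 1", "17 USC 1", "Section 117(a)"]:
    token_ids = enc.encode(ref)
    pieces = [enc.decode([t]) for t in token_ids]
    print(ref, "->", pieces)  # shows whether the number is split from "section"
```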
Also, context for the jurisdiction in which a discussion is taking place is often not really given - can the LLM tell that a law firm is talking about PCT, or EPC rather than USC when it discusses first-to-file for patent law and nothing in the document itself mentions the jurisdiction or any of those three initialisms? How about when the same law firm whose blog is mentioning these things represents the same clients at WIPO, EPO and USPTO? If you're going to fine-tune for that with human question-answer sessions you're going to need some really skilled workers who know the field well.
You probably then need specific prompt templates for legal questions too.
Then they need to layer in precedence and authority of different courts, recognise obiter statements from other statements, recognise summaries of positions of different parties aren't original statements by the speaker, disregard ultra vires statements, ... simples.
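On the prompt-template point above, here is a hedged sketch of what a legal-specific template might look like: make jurisdiction, date, and authority explicit instead of hoping the model infers them. The field names and instructions are illustrative, not taken from any of the products discussed.

```python
# Illustrative legal prompt template; fields are placeholders.
LEGAL_PROMPT = """\
Jurisdiction: {jurisdiction}
Relevant date (law as of): {as_of_date}
Question: {question}

Retrieved authorities (with court and year):
{authorities}

Instructions:
- Answer only for the stated jurisdiction; flag any authority from another one.
- Rank authorities by court level and note any that have been overruled.
- Distinguish holdings from obiter dicta and from summaries of a party's position.
- If the retrieved material is insufficient, say so rather than guessing.
"""

prompt = LEGAL_PROMPT.format(
    jurisdiction="England and Wales",
    as_of_date="2024-06-01",
    question="Does first-to-file apply to this patent dispute?",
    authorities="1. [Court, year] ...\n2. [Court, year] ...",
)
```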
People's hatred for lawyers isn't going to distract them from their hatred of LLMs, it seems, so you get downvoted. People's context window is as short as an LLM's, lmao.