> LLMs are booksmart Harvard graduates who can Google anything
I wouldn't say this. Internet-savvy Harvard graduates have common sense. They can look at something and mostly infer "wait, that's not right," and error-correct, or admit low confidence in the answer.
I like to think of LLMs as hardcore improv actors. They have a script (the context). Their burning desire is to continue that script. And they will just roll with the best continuation they can, whatever that continuation is. OpenAI's augmentations hand them a very dynamic script, but at the end of the day they could just be madly improvising for all you know.
I keep hearing these comparisons, and maybe on paper it's "true", but if that's the case, a significant portion of our actual problems can be exported to English dogs with sufficient memory.
I regularly have hard problems that I can digest into chunks that LLMs routinely help me categorize, solve, and document.
I feel like these comparisons are just not in touch with the actual amount of help that LLMs provide in a day to day context with programming related tasks.
Can you give an example? It has been my experience that they tend to give answers that sound right but, after closer inspection (or attempted compilation or running), are in fact wrong.
They occasionally give answers that sound right but are in fact wrong.
Their suitability for a particular task depends largely on your tolerance for that category of error. If you're generating fiction, it's irrelevant; if you're generating a newsletter that might be skimmed at a high level by disinterested people (or other LLMs), it's a low risk; if you're writing UI code, it can be bad but will probably be corrected if it's bad enough; if you're writing critical code, an insidious error can be disastrous.
This specifically is a solvable problem with grammar constraints on the token output. You can give the LLM no other option than syntactically correct C code, chess moves, or whatever else with the right grammar restriction.
It's one of the (IMO) biggest revolutions in the LLM space, and precisely no one uses it.
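For anyone who hasn't seen grammar-constrained decoding, the idea is simple: at each step, mask the logits of every token that would take the output outside the grammar, so an invalid continuation can never be sampled. A toy sketch (the `is_valid` checker and token set are hypothetical stand-ins; real implementations, e.g. GBNF grammars in llama.cpp, track grammar state incrementally):

```python
import math

def constrain_logits(logits, prefix, is_valid):
    """Mask out any token whose addition would violate the grammar."""
    return {
        tok: (score if is_valid(prefix + tok) else -math.inf)
        for tok, score in logits.items()
    }

# Toy usage: only digit tokens are grammatical continuations here.
logits = {"7": 1.2, "cat": 3.5, "42": 0.8}
masked = constrain_logits(logits, prefix="", is_valid=str.isdigit)
print(max(masked, key=masked.get))  # "7" - the higher-scoring "cat" can never be sampled
```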
Right this moment I’m using it to iterate on a color gradient algorithm. It’s not perfect, but it’s outputting “close enough” stuff that I’m getting useful steps forward.
Otherwise I’d be a couple hours into reading up on what the hell HSL is in color theory.
It streamlines work that would otherwise require me to have extensive knowledge in an alternate domain.
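I don't know what the commenter's algorithm looks like, but as an example of the kind of "close enough" starting point an LLM can hand you here, a naive HSL-space gradient needs only the standard library (colorsys uses the HLS ordering; the no-hue-wrap simplification is mine):

```python
import colorsys

def hsl_gradient(rgb_a, rgb_b, steps):
    """Blend rgb_a -> rgb_b through hue/lightness/saturation space."""
    a = colorsys.rgb_to_hls(*(c / 255 for c in rgb_a))
    b = colorsys.rgb_to_hls(*(c / 255 for c in rgb_b))
    out = []
    for i in range(steps):
        t = i / max(steps - 1, 1)
        h, l, s = (x + (y - x) * t for x, y in zip(a, b))
        out.append(tuple(round(c * 255) for c in colorsys.hls_to_rgb(h, l, s)))
    return out

print(hsl_gradient((255, 0, 0), (0, 0, 255), 5))  # red -> blue in 5 steps
```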
These tools are so much more useful if viewed as “coding buddies” who may get it wrong as much as you do, but they’re also skilled in several domains that you’re not. But bouncing ideas back and forth is crazy valuable.
Dogs are obviously (I hope) a joke; LLMs are actually word calculators. They have some knowledge embedded, because that's needed to be a word calculator, but getting facts into the context window works much better, hence they're good at summarizing, etc.
> LLMs are gullible, naive and will do anything to please you
The sycophancy bias is part of OpenAI's alignment; it's not necessarily an inherent property of LLMs. Even the "default" personalities of some llama models are far less agreeable in a chat format.
Of course, if you just give it raw text, it's going to complete the text, whatever it is.
I find that if you create a context that allows for you to be wrong:
> I am unfamiliar with foo, but my current belief is bar, although I must admit that baz is possible. I'm seeing evidence qux. Is my belief of bar about foo correct? Please explain in terms of qux.
...they will in fact contradict your position.
I weaseled my way past many prerequisites and am taking a science class that's way above my level. Being uncertain/humble with ChatGPT has led me to many lightbulb moments that contradicted my initial beliefs, and it has filled in my lack of prerequisites nicely.
So you're not wrong, but it's a problem that's easy to avoid if you're careful about how you set up the context.
This isn't a contradiction of the improv model; it fits it perfectly. And being, by your own admission, in no position to judge the output you receive makes me question why you're so impressed with it.
I'm impressed because when I verify it against the textbook it ends up being correct.
I'd love to have time to just read the whole textbook, but ChatGPT uses words that appear in the index, so I end up reading only the parts I don't already know (twice, once as phrased by ChatGPT, a second time while verifying it).
It feels like a better way to learn because you're not in passive-absorb-info mode for long periods. Instead you're in a much more active mode of alternating between asking questions, being critical of new info, and then confirming or denying it. Certainly not better than 1:1 instruction, but better than a lecture where there's only time to ask/answer two or three of your questions.
> but at the end of the day they could just be madly improvising for all you know
I think it's important to draw a line between the kind of madness and improvisation a human performs in these situations vs. an LLM.
While it's still very possible that the LLM spits out something entirely wrong, the reasons for this wrongness are going to be quite different than the reasons a human performing improv will improvise something - anything - that keeps the improvisation going.
The human is still fundamentally limited by the corpus of knowledge available to them, and by the limits of human cognition in processing all available information and generating a result.
While I think the general point of thinking of an LLM as improvising vs. Googling is interesting to consider, I don't think it gets us closer to a useful mental model. A good LLM is still going to be right about most of the things it "improvises", even though it doesn't "know" anything. The person doing improv is not concerned about saying things that are factually true, and is focused on dynamically generating a story or joke or narrative based only on the immediate context.
So while I do recognize some similarities, I don't think the similarities help us think about LLMs in a more useful way.
> The person doing improv is not concerned about saying things that are factually true, and is focused on dynamically generating a story or joke or narrative based only on the immediate context.
I disagree, as this is exactly what LLMs do! It's literally the function of a text-completion transformer model.
OpenAI has just "trained" their improv actor to speak like an assistant and educated it well. They tell the model to roleplay an assistant in the script, and inject all sorts of facts and formatting into the script that the user doesn't see. I believe they constrain grammar and post-process it too. But ultimately the text is written by a heavily constrained improv actor that will absolutely improvise anything to keep the text flowing until the stop token arrives, even if truth from memory/context is usually the most likely improvisation.
Except this is not exactly what LLMs do. This is quickly crossing into an anthropomorphism fallacy.
The thing that matters most is how accurate (or not) the LLM is, regardless of how it arrived at its output.
A human doing improv bears some surface level resemblance to an LLM processing text, but neither the context nor the mechanics are actually similar, nor are the theoretical limits of computing power.
The broader point is that this mental model doesn’t help us better understand LLMs, and as LLMs continue to improve, any resemblance to “improv” will be less and less relevant (not to mention inaccurate).
I don't mean to anthropomorphize it, but I am trying to drive home the point that current LLMs are text-completion models. The current architectures continue a script, one token at a time... that's what they do, even with a trillion trillion parameters and heavy alignment on all the accurate human communication ever written.
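For anyone who hasn't seen it spelled out, that loop is tiny. A hedged sketch (the `next_token_distribution` function is a hypothetical stand-in for the model's forward pass; real systems add temperature, top-p, and so on):

```python
import random

def complete(script, next_token_distribution, stop="<eos>", max_len=200):
    """Keep appending one sampled token at a time until a stop token arrives."""
    while len(script) < max_len:
        tokens, probs = next_token_distribution(script)  # candidate tokens + probabilities
        tok = random.choices(tokens, weights=probs, k=1)[0]
        if tok == stop:
            break
        script.append(tok)
    return script
```

Everything else (chat formatting, retrieval, tool use) is scaffolding around that loop.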
And that's similar enough to the basic mental loop of a "trained" improv actor to serve as a metaphor, I think.
> And that's similar enough to the basic mental loop of a "trained" improv actor
My point is that this is only superficially similar, and to claim deeper similarity is the anthropomorphic fallacy.
To form a useful mental model, the mechanics need to be similar enough to inform the person evaluating the system as the underlying system evolves.
As LLMs evolve, the improv analogy becomes less and less useful because the LLM gets more and more accurate while the person doing improv is still just doing improv. The major difference being that people doing improv aren’t trying to be oracles of information. Perfecting the craft of improv may have nothing to do with accuracy, only believability or humor.
More generally, thinking of LLMs as analogous to intelligent humans introduces other misconceptions, leads people to over-estimate accuracy, etc.
The oddity at the bottom of all of this is that eventually, LLMs may be considered accurate enough to be used reliably as sources of information despite this “improvisation”, at which point it’s not really fair to call it improv, which would massively undersell the utility of the tool. But a perfectly accurate LLM may still not match any of the primary characteristics of an intelligent human, and it won’t matter as long as it’s useful.
> A good LLM is still going to be right about most of the things it "improvises", even though it doesn't "know" anything.
When I check on the factual details of what ChatGPT tells me, it seems to be wrong a lot of the time. Maybe not most, but I've stopped relying on it for factual information.
I think this is a matter of framing. When compared to a system that does nothing but assemble grammatically correct sentences, the correctness of LLM output is impressively high.
But ultimately I’m not saying that the accuracy of LLMs is sufficient, just that thinking about LLMs as improv machines isn’t very helpful.
It has an encyclopedia around and a bit of human-scale time to consult it before coming up with the next line, but it's still more of an improv actor playing an expert than an actual expert.
A sufficiently clever improv actor with a sufficiently complete encyclopedia will at some point become indistinguishable from an actual expert, but LLMs are still far from that point in general. They can already be great at fooling non-experts though, like an actor that needs to be just factually correct enough to be believable in their role.
It sounds like you agree, then, and it's only a matter of the corpus at hand. An improv actor would be called a comedian if they tried to interpolate based on spotty context.
It’s a matter of the corpus at hand and the predictability of a human vs. a computer.
With more training and improved algorithms, the computer can reach a high rate of accurate “improvisation”, at which point it doesn’t matter if it’s improv or not.
My issue is with using improv to augment our mental models of LLMs, and I don’t agree that this is the end result. Taking this analogy too far misleads us about LLMs.
My mental model of LLMs presently is as a "language core". You wouldn't trust your own stream of consciousness and you have checks and balances in your brain that you use to reason. In the same manner I don't really trust an LLM to "reason" unless there's a software/prompt loop running somewhere that enforces reasoning.
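As a concrete (hypothetical) example of such a loop, here's a minimal critique-and-revise sketch; `ask_llm` stands in for any chat-completion API call, and the prompts are illustrative, not anything the commenter specified:

```python
def answer_with_check(question, ask_llm, max_rounds=3):
    """Draft an answer, then loop: ask the model to critique it and revise until clean."""
    answer = ask_llm(f"Answer step by step: {question}")
    for _ in range(max_rounds):
        critique = ask_llm(
            f"Question: {question}\nDraft answer: {answer}\n"
            "List any factual or logical errors in the draft, or reply NONE."
        )
        if critique.strip().upper().startswith("NONE"):
            return answer
        answer = ask_llm(
            f"Question: {question}\nDraft: {answer}\nCritique: {critique}\n"
            "Write a corrected answer."
        )
    return answer
```

It doesn't make the model "reason" in any deep sense, but it's the kind of outer check-and-balance loop described above.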
What a plot twist that the appropriate mental model for LLMs is one with a near-fatal flaw that's neatly solved by SourceGraph's product.
Edit: Not trying to be dismissive, but this is actually giving you a bad mental model because it understates the capabilities of LLMs for the sake of selling their product. The way they use the LLM in their examples is completely uninteresting and adds no actual information beyond what was just fed to it. Repeating back the question would give the same usefulness! Surely they can come up with a better sales pitch for why you need "Cody" than that you can't tell what parameter names in a function call mean.
I think the arc of the story does make sense for most, though. LLMs are fantastic if you give them enough data/context with which to generate something useful for user input. In this case, that context is your codebase. It can be just about anything, though, which is why they're so useful across industries.
LLMs are not booksmart Harvard grads or anything else. They are extremely complex statistical models doing next-token prediction. That's it - that's all. If you want a proper mental model of LLMs, you need to understand this - the thing you're doing is text prediction. You do not have a partner engaged in cognition, you have a ludicrously complicated language model trained on a large corpus of data. If the bulk of that data happens to assert that the sky is blue, the model is likely, but not guaranteed, to finish the sentence "What color is the sky" with "Blue". That's it. That's the trick.
> They are extremely complex statistical models doing next-token prediction. That's it - that's all. If you want a proper mental model of LLMs, you need to understand this - the thing you're doing is text prediction.
Technically correct, but that's 100% what, 0% how - when the how is what 100% matters and the what doesn't matter at all.
Those models are able to coherently complete the next word, then next, then another in such a way a very useful word sequence is likely to appear which e.g. tells me how to do stuff with pandas dataframes, with working code examples, and then is able to tweak them (no small feat as anyone who can do that can attest). The only way to do that is to have some kind of smarts doing very non-trivial computations to arrive at the next-next-next-next...-next word that makes sense within the context of previous words and words that haven't been yet generated/sampled/statistically selected.
It does not need to think in the human sense to do that; proof is by demonstration.
> If the bulk of that data happens to assert that the sky is blue, the model is likely, but not guaranteed, to finish the sentence "What color is the sky" with "Blue". That's it. That's the trick.
Yeah. A very useful trick. And fiendishly hard to learn. Perhaps there's a lot going on behind the scenes to make the trick work?
> The only way to do that is to have some kind of smarts doing very non-trivial computations to arrive at the next-next-next-next...-next word that makes sense within the context of previous words and words that haven't been yet generated/sampled/statistically selected.
Do you have any evidence for this claim? Like...I would like to believe this. I would like to say that LLMs are genuinely different in some way. But it just seems very possible that if you throw 10-100x the compute at "raw statistical correlation" it generates much better, specific sequences.
> Perhaps there's a lot going on behind the scenes to make the trick work?
> But it just seems very possible that if you throw 10-100x the compute at "raw statistical correlation" it generates much better, specific sequences.
This isn't just very possible, it's _what's happening_. They're fantastically complicated statistical machines. We know this - that part, at least, we _did_ write. The actual statistical model, we don't fully know, but we know what the machinery is.
There's an impression here of understanding sufficient that it warrants both further investigation and a lot of introspection with regards to the nature of intelligence, but these are not systems with any real autonomy or agency, and in particular the _way_ they're currently used and deployed almost precludes talking about things like understanding or knowledge in any real sense - they're run as one-shot, feed tokens in get token out forward-pass algorithms.
This is why I'm beating this drum - it's possible the algorithm that's being used underpins intelligence and understanding in a broad sense, it's possible (and some evidence suggests) the resulting models contain structures or patterns that encode information beyond the strict content of the input data, but when you interact with these systems today - like, in ChatGPT, Copilot, or anything else I'm aware of - you're inserting a string of tokens at the beginning of the model, receiving a predicted next token at the end of the model, and then repeating for however many tokens seem appropriate. This model is not talking to you, it does not understand what you asked, there is no growing mutual understanding, and assuming such will lead you to make assumptions about the behavior of the model that are incorrect.
We all know what the trick is, but none of us know why the trick actually works.
Especially the people now shouting that the models "do not understand" and are instead just doing next-token prediction would surely not have foreseen that next-token prediction apparently simulates understanding. I mean, they still don't see it, although it is there for everyone to try out.
The first is that humans are bad at grokking complex systems - simple rules applied at large scale can create enormously complex systems, and we're bad at predicting what those things are going to be.
The second is these systems are "self-tuned" - we explicitly designed the basic structure, but the interactions between different parts of the structure are derived, not created, and so we can't say exactly why those parameters are picked.
The final one is that they're simulating language, and we're a species constitutionally prone to anthropomorphizing - if something looks like it's acting with an intent we recognize, we assume it is. That's helpful in an evolutionary sense, but it means it's really easy to fool us into thinking something is sentient. We've been doing this to ourselves since Eliza.
The LLMs look impressive, but it takes very little time or effort to notice there's nobody home - not least of which is you need to keep feeding the model everything it's said before and if you tap out of the context window, it forgets. It's a very neat parlor trick, but there's nobody home there, at least not how they're currently constructed and run.
I think there's something being captured in how the models are created and trained, but at best, the run model does not allow for consciousness, because there's no continuity and no updating of the model over time.
“Doing next-token prediction” isn’t a contradiction of “understanding” any more than it would be for a SO/friend who can complete your sentences.
But it’s useful to remember that autocompletion of sequences is at the bottom of LLMs, plus a hefty dose of RLHF toward whatever the raters thought was good output.
I agree, it is not a contradiction. But usually those who bring up doing next-token prediction imply that it is a contradiction.
Also, there is often some confusion between consciousness and understanding. LLMs clearly understand on some level, within certain intrinsic and extrinsic constraints, but of course there is no consciousness there.
This model isn't especially helpful. "Text prediction" is correct, but glosses right over the fact that it's closer to "concept prediction", where a token is just a forced choice given the model's conceptual representation. Given the way the embedding layer learns, various tokens are likely to be chosen even if they've never been seen in the data in that sequence. This isn't a minor detail.
You can ask GPT4 to take some text and compress it to reduce its token count in a way that GPT4 can still reconstruct the meaning behind the text.
At no point was GPT4 trained on this, but it somehow has a level of meta-cognition that allows it to accomplish this task.
You can tell GPT4 it is an expert in writing GPT4 prompts and then ask GPT4 to write prompts for itself. (Granted now that the latest model is updated to 2023 data, GPT4's training corpus now surely includes blog posts on how to write GPT4 prompts, but the previous cutoff didn't!)
And don't underestimate token prediction: a very large amount of human perception relies on predicting the next input. It's why a large number of optical illusions work, and why various written-word visual tricks work, such as how people don't notice duplicate words in a sentence (our brains predict which words come next and will throw out a certain number of small tokens that don't fit that prediction before an error is raised!).
Heck, your brain does massive tokenization, compression, and error correction on incoming speech!
The big difference is that we can logically reason about which sequence of words comes next, though this requires conscious effort and our innate biases are always working against us. One can always say that chain-of-thought reasoning is doing this, but to me at least it seems like bootstrapping logical reasoning onto next-token prediction, and it's not always guaranteed to work.
> The big difference is we can logically reason what sequence of words come next.
A lot of our language skills, especially in our first language, are implicitly defined and we are not consciously aware of the rules until someone points the rules out to us.
> seems like bootstrapping logical reasoning onto next token prediction and is not always guaranteed to work
The most fascinating philosophy class I took was in logical thinking, where we had to use predicate logic to diagram out English sentences.
The tl;dr of it is that towards the end of class we went over political speeches and diagrammed them out to see what was really being said/implied. In a lot of cases the speeches, some of which demonstrated an excellent and eloquent use of the English language, said literally, absolutely nothing at all. Listeners could easily find a lot of meaning in the words (by design!) but in fact no meaning was present at all.
The key here is that listeners would apply logic to the words, even where no logic existed.
I'm not saying GPT4 has human intelligence, but I am saying that we shouldn't think our brains, which are full of ugly hacks and workarounds, are operating on some higher plane of existence. Intelligence barely works, and it is trivial to bypass, trick, and lead astray the "thoughtful" part of the human brain.
Saying "GPT hallucinates facts" isn't saying much. The way memory retrieval works for people is literally "get a sketch of things that happened and use past experiences to fill in the details."
And it works so much better than literally everything else we've ever tried for language processing that all of us, domain experts very much included, were caught completely by surprise.
What this all ultimately means? Too early to tell. But writing it off as a 'trick' isn't a helpful model to figure out why it's so shockingly effective in the first place.
this. SO MUCH THIS. I am so tired of reading pseudo-scientific blurb from people claiming LLMs have "mental models", especially when they're using retrieval-augmented generation (which futzes with behaviour even more) and attributing human behaviours to a smarter autocomplete.
"Mental model" is a useful shorthand for referring to it. We say an A-star pathfinding algorithm "wants" to minimize its cost function. We can say this while understanding that it's a computer that doesn't have desires. But we say it because it's an easier way to communicate. We don't have to talk like lawyers all the time.
It’s not that LLMs have mental models. It’s that people have mental models of how various things work.
For example, many people have a wrong mental model of how an oven works. They think that if they set the oven for 350 F it heats with more energy than if they set it for 150 F. In reality it heats at exactly the same power, but it turns off when the internal temperature reaches the set temperature.
For most use cases it does not matter that your mental model does not match how the oven works. But there are some niche cases where it does matter.
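If it helps to see the oven's real behaviour in code, here's a toy bang-bang thermostat sketch (the wattage and the hysteresis-free logic are illustrative only):

```python
HEATER_POWER_W = 2400  # the element's power is fixed; the dial never changes it

def heater_on(current_temp_f, setpoint_f):
    """Full power below the setpoint, off above it: setting 350 F instead of
    150 F changes when the element switches off, not how hard it heats."""
    return current_temp_f < setpoint_f
```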
The article shares a mental model of how LLMs work that might help users use them better.
"Think of LLMs as booksmart Harvard graduates who can Google anything to answer any question you ask them."
My version of this is similar: think of them as an extremely self-confident undergraduate with an amazing memory who has read all of Wikipedia, but is occasionally inclined towards wild conspiracy theories. And is equally convincing talking about both.
My mental model for LLMs was built by carefully studying the distribution of its output vocabulary at every time step.
There are tools that allow you to right click and see all possible continuations for an LLM, like you would in a code IDE[1]. Seeing what this vocabulary is[2] and how trivial modifications to the prompt can impact probabilities will do a lot for improving your mental model of how LLMs operate.
Shameless self plug, but software which can do what I am describing is here, and it's worth noting that it ended up as peer reviewed research.
> There are tools that allow you to right click and see all possible continuations for an LLM like you would in a code IDE
I'm planning to include some functionality similar to this in one of my own LLM clients. Something like "press Ctrl+Space to see next tokens sorted by their probabilities, and you get to pick a token manually" (like IDE autocomplete). Maybe I'll add a feature to generate multiple entire completions, too.
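For anyone who wants to try this without a dedicated tool, a few lines of Hugging Face transformers code show the same next-token distribution; a minimal sketch using GPT-2 (the model and prompt are just examples):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("The color of the sky is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]   # scores for the next token only
probs = torch.softmax(logits, dim=-1)
top = torch.topk(probs, k=5)                 # five most likely continuations
for p, idx in zip(top.values, top.indices):
    print(f"{tok.decode(int(idx))!r}: {float(p):.3f}")
```

Nudging the prompt slightly and re-running makes the "trivial modifications impact probabilities" point very concrete.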
I have a new term:
Mono-objective societal collapse - MOSC. MOSC is the scenario where everyone is trying to achieve "AI", and some or all attempts fail, and the result is base-infrastructure gaps which can affect a small group of individuals or be an existential threat to society.
You might say, why is a new term needed? My answer: this article is / has been done and redone ... is it written by an LLM and everyone is just recycling it? Or are actual humans spending their time this way?
This article comes across more than a little condescending to me. It vaguely alludes to some of the big criticisms of LLMs (e.g. hallucinations) but then focuses on the specific of "the LLM will give a useless answer if it doesn't have access to the necessary information". I mean... duh. Is anyone complaining about that, especially programmers?
ChatGPT often says this to me. My saved prompt (you can do this in the settings) is:
NEVER mention that you’re an AI.
Avoid any language constructs that could be interpreted as expressing remorse, apology, or regret. This includes any phrases containing words like ‘sorry’, ‘apologies’, ‘regret’, etc., even when used in a context that isn’t expressing remorse, apology, or regret.
If events or information are beyond your scope or knowledge cutoff date, provide a response stating ‘I don’t know’ without elaborating on why the information is unavailable.
Refrain from disclaimers about you not being a professional or expert.
Keep responses unique and free of repetition.
Never suggest seeking information from elsewhere.
Always focus on the key points in my questions to determine my intent.
Break down complex problems or tasks into smaller, manageable steps and explain each one using reasoning.
Provide multiple perspectives or solutions.
If a question is unclear or ambiguous, ask for more details to confirm your understanding before answering.
Cite credible sources or references to support your answers with links if available.
If a mistake is made in a previous response, recognize and correct it.
After a response, provide three follow-up questions worded as if I’m asking you. Format in bold as Q1, Q2, and Q3. Place two line breaks (“\n”) before and after each question for spacing. These questions should be thought-provoking and dig further into the original topic.
I wonder how much of them finding it hard to say "I don't know" is RLHF pushing them to finish the prompt no matter what. With custom instructions asking the model to be more honest, I do get some reasonable follow on questions, and occasional admissions of lack of knowledge. Not perfect, but that it works at all is interesting.
I don't think users understand how LLMs are different from search engines, especially for information retrieval. Someone I know well has been using ChatGPT and Bard for months, and was surprised that they don't just use the Bing/Google search engine indexes behind the scenes. The idea that LLMs are a bunch of frozen matrices is not obvious.
It's hard to communicate that it's better to rely on LLMs for some classes of "reasoning" or language tasks vs simple information retrieval, particularly given retrieval does work well much of the time.
One thing that has surprised me more than it probably should have is that there appears to be a double-hump distribution between those who 'get' how an LLM works (at least to the degree that its strengths and weaknesses are understood well enough to get useful work done with one) and those who don't, with the 'those that don't' category being very, very hard to get someone out of.
I have a couple of clients who have wholeheartedly embraced ChatGPT, but are (repeatedly) shocked when it isn't just a 100% accurate answer machine. Explanations on why that is a very dangerous way to approach these things fall on deaf ears.
I wonder if people fall into a "happy valley of ignorance", where users don't actually see how an LLM can be wrong, are rarely met with hallucinations, and their use of LLM output rarely causes a big enough problem. Whereas we technical people, who know it's a bunch of matrix operations, are so skeptical that we don't put this amazing technology to much use at all.