Large Enough (mistral.ai)
639 points by davidbarker 48 days ago | 496 comments



Links to chat with the models released this week:

Large 2 - https://chat.mistral.ai/chat

Llama 3.1 405b - https://www.llama2.ai/

I just tested Mistral Large 2 and Llama 3.1 405b on 5 prompts from my Claude history.

I'd rank as:

1. Sonnet 3.5

2. Large 2 and Llama 405b (similar, no clear winner between the two)

If you're using Claude, stick with it.

My Claude wishlist:

1. Smarter (yes, it's the most intelligent, and yes, I wish it was far smarter still)

2. Longer context window (1M+)

3. Native audio input including tone understanding

4. Fewer refusals and less moralizing when refusing

5. Faster

6. More tokens in output


It seems to be the way with these releases: sticking with Claude, at least for the 'hard' tasks. In my agent platform I have LLMs assigned to easy/medium/hard categorised tasks, which was somewhat inspired by the Claude 3 release with Haiku/Sonnet/Opus. GPT-4o mini has bumped Haiku for the easy category for now. Sonnet 3.5 bumped Opus for the hard category, so I could possibly downgrade the medium tasks from Sonnet 3.5 to Mistral Large 2 if the price is right on the platforms, given it's only 123B params compared to 405B. I was surprised how much Llama 3 405B costs on together.ai: $5/M tokens for input/output! I'll stick to Sonnet 3.5. Then I was also surprised how much cheaper Fireworks was at $3/M.

Gemini has two aces up its sleeve now: the long context, and now the context caching for 75% reduced input token cost. I was looking at the "Improving Factuality and Reasoning in Language Models through Multiagent Debate" paper the other day, and thought Gemini would have a big cost advantage implementing this technique with the context caching. If only Google could get their model up to the level of Anthropic's.


All 3 models you ranked cannot get "how many r's are in strawberry?" correct. They all claim 2 r's unless you press them. With all the training data, I'm surprised none of them has fixed this yet.


Tokenization makes it hard for the model to count letters; that's also why, if you ask it to do maths, writing the numbers out in words will yield better results.

For strawberry, it sees it as [496, 675, 15717], which is str aw berry.

If you insert characters to break the tokens up, it finds the correct result: how many r's are in "s"t"r"a"w"b"e"r"r"y" ?

> There are 3 'r's in "s"t"r"a"w"b"e"r"r"y".
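You can check the split yourself with a tokenizer library. A minimal sketch, assuming OpenAI's tiktoken package and its cl100k_base encoding (Mistral's own tokenizer gives different IDs, but the multi-character split is the same idea):

```python
# Sketch: inspect how a BPE tokenizer splits "strawberry".
# Assumes the tiktoken package; exact IDs and pieces vary per tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("strawberry")
print(ids)  # a short list of token IDs, not ten per-letter IDs
print([enc.decode_single_token_bytes(i) for i in ids])  # multi-character fragments, not letters
```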


>If you insert characters to break the tokens up, it finds the correct result: how many r's are in "s"t"r"a"w"b"e"r"r"y" ?

The issue is that humans don't talk like this. I don't ask someone how many r's there are in strawberry by spelling out strawberry, I just say the word.


Humans also constantly make mistakes that are due to proximity in their internal representation. "Could of"/"Should of" comes to mind: the letters "of" have a large edit distance from "'ve", but their pronunciation is very similar.

Native speakers especially are prone to this mistake, as they grew up learning English as illiterate children, from sounds only, compared to how most people learning English as a second language do it, together with the textual representation.

Psychologists use this trick as well to figure out internal representations, for example with the Rorschach test.

And probably, if you asked random people in the street how many p's there are in "Philippines", you'd also get lots of wrong answers. It's tricky due to the double p and the initial p being part of an f sound. The demonym uses "F" as the first letter, and in many languages, say Spanish, the country name also uses an F.


Until I was ~12, I thought 'a lot' was a single word.



Oh I thought essay was some kind of abbreviation for S.A. - short article maybe…



At least you learnt.


Yeah, but for most people, it would be because they don't know how to spell "Philippines" at all. Confoundingly, LLMs know exactly how to spell strawberry and still get this wrong.


> I don't ask someone how many r's there are in strawberry by spelling out strawberry, I just say the word.

No, I would actually be pretty confident you don’t ask people that question… at all. When is the last time you asked a human that question?

I can’t remember ever having anyone in real life ask me how many r’s are in strawberry. A lot of humans would probably refuse to answer such an off-the-wall and useless question, thus “failing” the test entirely.

A useless benchmark is useless.

In real life, people overwhelmingly do not need LLMs to count occurrences of a certain letter in a word.


AI failing at a task that any human can easily do means AI isn't human-equivalent, in an easily demonstrable way.

Full artificial intelligence, as we're being promised, falls short in this simple way.


Count the number of occurrences of the letter e in the word "enterprise".

Problems can exist as instances of a class of problems. If you can't solve a problem, it's useful to know if it's a one off, or if it belongs to a larger class of problems, and which class it belongs to. In this case, the strawberry problem belongs to the much larger class of tokenization problems - if you think you've solved the tokenization problem class, you can test a model on the strawberry problem, with a few other examples from the class at large, and be confident that you've solved the class generally.

It's not about embodied human constraints or how humans do things; it's about what AI can and can't do. Right now, because of tokenization, things like understanding the number of Es in strawberry are outside the implicit model of the word in the LLM, with downstream effects on tasks it can complete. This affects moderation, parsing, generating prose, and all sorts of unexpected tasks. Having a workaround like forcing the model to insert spaces and operate on explicitly delimited text is useful when affected tasks appear.
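As a toy illustration of that workaround, here is a minimal sketch in plain Python that builds the delimited prompt (the exact prompt wording is just an example, not anything the models require):

```python
# Build the explicitly delimited prompt mentioned above.
word = "strawberry"
delimited = '"' + '"'.join(word) + '"'   # '"s"t"r"a"w"b"e"r"r"y"'
prompt = f"How many r's are in {delimited} ?"
print(prompt)
```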


Humans also would probably be very likely to guess 2 r's if they had never seen any written words or had the word spelled out to them as individual letters before, which is kind of close to how language models treat it, despite being a textual interface.


> Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe.

We are also not exactly looking letter by letter at everything we read.


Not exactly the same thing, but I actually didn't expect this to work.

https://chatgpt.com/share/4298efbf-1c29-474a-b333-c6cc1a3ce3...


On the other hand explain to me how you are able to read the word “spotvoxilhapentosh”.


Just because we normally aren't reading letter by letter, it doesn't mean we can't. We can recognize common words on sight, ignoring minor variations, because we've seen the words thousands or millions of times, but that doesn't somehow disable the much less frequently used ability to approach a brand new word.


I think that humans indeed identify words as a whole and do not read letter by letter.

However, this implies you need to know the word to begin with.

I can write "asdf" and you might be oblivious to what I mean. I can mention "adsf" to a JavaScript developer and he will immediately think of the tool versioning tool. Because context and familiarity is important.


I believe it's a bit more nuanced than that. Short ubiquitous words like "and" or "the" we instantly recognize at a glance, but long unfamiliar or rarer words we read from the beginning, one syllable or letter at a time, until pattern recognition from memory kicks in. All unconsciously, unless the word is so odd, out of place, misspelled, or unknown that it comes to conscious awareness and interrupts our reading.


"spot"

"vox"

"il"

"ha"

"pen"

"tosh"

is how I read it.

A lot of schools teach kids to read with a syllabic method... so... super close to the tokenization concept.


It's not a human. I imagine if you have a use case where counting characters is critical, it would be trivial to programmatically transform prompts into lists of letters.

A token is roughly four letters [1], so, among other probable regressions, this would significantly reduce the effective context window.

[1] https://help.openai.com/en/articles/4936856-what-are-tokens-...
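To put a rough number on that regression, a quick sketch assuming tiktoken and the cl100k_base encoding (the exact ratio varies by tokenizer, but roughly one token per character is the worst case):

```python
# Compare token counts for a normal prompt vs. a letter-delimited one.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "how many r's are in strawberry?"
spelled = " ".join(text)  # a space between every character

print(len(enc.encode(text)))     # a handful of tokens
print(len(enc.encode(spelled)))  # several times as many for the same content
```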


This is the kind of task that you'd just use a bash one liner for, right? LLM is just wrong tool for the job.


Humans do chain-of-thought.

User: Write “strawberry” one letter at a time, with a space between each letter. Then count how many r’s are in strawberry.

gpt-3.5-turbo: ASSISTANT s t r a w b e r r y

There are 2 r's in strawberry.

After some experimenting, it seems like the actual problem is that many LLMs can’t count.

User: How many r’s are in the following sequence of letters:

S/T/R/A/W/B/E/R/R/Y

gpt-4o-mini: In the sequence S/T/R/A/W/B/E/R/R/Y, there are 2 occurrences of the letter "R."

Oddly, if I change a bunch of the non-R letters, I seem to start getting the right answer.


>I don't ask someone how many r's there are in strawberry by spelling out strawberry, I just say the word.

You don't ask a human being how many r's there are in strawberry at all. The only reason you or anyone else asks that question is because it's an interesting quirk of how LLMs work that they struggle to answer it in that format. It's like an alien repeatedly showing humans an optical illusion that relies on the existence of our (literal) blind spot and using it as evidence of our supposed lack of intelligence.


This is only an issue if you send commands to an LLM as if you were communicating with a human.


> This is only an issue if you send commands to an LLM as if you were communicating with a human.

Yes, it's an issue. We want the convenience of sending human-legible commands to LLMs and getting back human-readable responses. That's the entire value proposition lol.


Far from the entire value proposition. Chatbots are just one use of LLMs, and not the most useful one at that. But sure, the one "the public" is most aware of. As opposed to "the hackers" that are supposed to frequent this forum. LOL


Where did you get this idea from?

Tokens aren’t the source of facts within a model. It’s an implementation detail and doesn’t inherently constrain how things could be counted.


Tokens are the first form of information being encoded into the model. They're statistically guided, more or less a compression dictionary comparable to a Lempel-Ziv setup.

Combinations of tokens get encoded, so if the feature isn't part of the information being carried forward into the network as it models the information in the corpus, the feature isn't modeled well, or at all. The consequence of having many character tokens is that the relevance of individual characters is lost, and you have to explicitly elicit the information. Models know that words have individual characters, but "strawberry" isn't encoded as a sequence of letters, it's encoded as an individual feature of the tokenizer embedding.

Other forms of tokenizing have other tradeoffs. The trend lately is to increase tokenizer dictionary scope, up to 128k in Llama3 from 50k in gpt-3. The more tokens, the more nuanced individual embedding features in that layer can be before downstream modeling.

Tokens inherently constrain how the notion of individual letters are modeled in the context of everything an LLM learns. In a vast majority of cases, the letters don't matter, so the features don't get mapped and carried downstream of the tokenizer.


So, it's conjecture then?

What you're saying sounds plausible, but I don't see how we can conclude that definitively without at least some empirical tests, say a set of words that predictably give an error along token boundaries.

The thing is, there are many ways a model can get around to answering the same question, it doesn't just depend on the architecture but also on how the training data is structured.

For example, if it turned out tokenization was the cause of this glitch, conceivably it could be fixed by adding enough documents with data relating to letter counts, providing another path to get the right output.


There aren't a lot of places that teach the AI which letters are in each token. It's a made-up concept, and the AI doesn't have enough information in the dataset about this concept, so it has difficulty generalizing it.

There are a lot of problems like that which can be reformulated. For example, if you ask it which is bigger between 9.11 and 9.9, it will respond 9.9. If you look at how it's tokenized, you can see it restates an easy problem as something not straightforward even for a human. If you restate the problem by writing the numbers out in words, it will respond correctly.


Wait... 9.9 is the correct answer, isn't it? I'm not getting your point.


Testing models on their tokenization has always struck me as kinda odd. Like, that has nothing to do with their intelligence.


It's like showing someone a color and asking how many letters it has. 4... 3? Blau, blue, azul, blu. The color holds the meaning and the words all map back to it.

In the model the individual letters hold little meaning. Words are composed of letters simply because we need some sort of organized structure for communication that helps represent meaning and intent. Just like our color blue/blau/azul/blu.

Not faulting them for asking the question but I agree that the results do not undermine the capability of the technology. In fact it just helps highlight the constraints and need for education.


How is a layman supposed to even know that it's testing on that? All they know is it's a large language model. It's not unreasonable they should expect it to be good at things having to do with language, like how many letters are in a word.

Seems to me like a legit question for a young child to answer or even ask.


> How is a layman supposed to even know that it's testing on that?

They're not, but laymen shouldn't think that the LLM tests they come up with have much value.


I'm saying a layman or say a child wouldn't even think this is a "test". They are just asking a language model a seemingly simple language related question from their point of view.


Laymen or children shouldn't use LLMs.

They're pointless unless you have the expertise to check the output. Just because you can type text in a box doesn't mean it's a tool for everybody.


Well they certainly aren't being marketed or used that way...

I'm seeing everyone and their parents using chatgpt.


It doesn’t test “on tokenization” though. What happens when an answer is generated is few abstraction levels deeper than tokens. A “thinking” “slice” of an llm is completely unaware of tokens as an immediate part of its reasoning. The question just shows lack of systemic knowledge about strawberry as a word (which isn’t surprising, tbh).


It is. Strawberry is one token in many tokenizers. The model doesn't have a concept that there are letters there.


If I show you a strawberry and ask how many r’s are in the name of this fruit, you can tell me, because one of the things you know about strawberries is how to spell their name.

Very large language models also “know” how to spell the word associated with the strawberry token, which you can test by asking them to spell the word one letter at a time. If you ask the model to spell the word and count the R’s while it goes, it can do the task. So the failure to do it when asked directly (how many r’s are in strawberry) is pointing to a real weakness in reasoning, where one forward pass of the transformer is not sufficient to retrieve the spelling and also count the R’s.


Sure, that's a different issue. If you prompt in a way to invoke chain of thought (e.g. what humans would do internally before answering) all of the models I just tested got it right.


That's not always true. They often fail the spelling part too.


> If I show you a strawberry and ask how many r’s are in the name of this fruit, you can tell me, because one of the things you know about strawberries is how to spell their name.

LOL. I would fail your test, because "fraise" only has one R, and you're expecting me to reply "3".


The thinking part of a model doesn’t know about tokens either. Like a regular human few thousand years ago didn’t think of neural impulses or air pressure distribution when talking. It might “know” about tokens and letters like you know about neurons and sound, but not access them on the technical level, which is completely isolated from it. The fact that it’s a chat of tokens of letters, which are a form of information passing between humans, is accidental.


If I ask an LLM to generate new words for some concept or category, it can do that. How do the new words form, if not from joining letters?


Not letters, but tokens. Think that it's translating everything to/from Chinese.


How does that explain why the tokens for strawberry, melon and "Stellaberry" [1] are close to each other?

[1] Suggestion from chatgpt3.5 for new fruit name.


Illiterate humans can come up with new words like that too without being able to spell, LLMs are modeling language without precisely modeling spelling.


The tokenizer system supports virtually any input text that you want, so it follows that it also allows virtually any output text. It isn’t limited to a dictionary of the 1000 most common words or something.

There are tokens for individual letters, but the model is not trained on text written with individual tokens per letter, it is trained on text that has been converted into as few tokens as possible. Just like you would get very confused if someone started spelling out entire sentences as they spoke to you, expecting you to reconstruct the words from the individual spoken letters, these LLMs also would perform terribly if you tried to send them individual tokens per letter of input (instead of the current tokenizer scheme that they were trained on).

Even though you might write a message to an LLM, it is better to think of that as speaking to the LLM. The LLM is effectively hearing words, not reading letters.


This is pretty much equivalent to the statement "multicharacter tokens are a dead end for understanding text". Which I agree with.


That doesn't follow from what he said at all. Knowing how to spell words and understanding them are basically unrelated tasks.


I hear this a lot, but there are vast sums of money thrown at the cases where a model fails the way it does on strawberry.

Think about math and logic. If a single symbol is off, it’s no good.

At my work, a prompt where we can generate a single tokenization error generates, by my very rough estimates, 2 man-hours of work. (We search for incorrect model responses, get them to correct themselves, and if they can't after trying, we tell them the right answer and edit it for perfection.) Yes, even for counting occurrences of characters. Think about how applicable that is: finding the next term in a sequence, analyzing strings, etc.


> Think about math and logic. If a single symbol is off, it’s no good.

In that case the tokenization is done at the appropriate level.

This is a complete non-issue for the use cases these models are designed for.


But we don’t restrict it to math or logical syntax. Any prompt across essentially all domains. The same model is expected to handle any kind of logical reasoning that can be brought into text. We don’t mark it incorrect if it spells an unimportant word wrong, however keep in mind the spelling of a word can be important for many questions, for example—off the top of my head: please concatenate “d”, “e”, “a”, “r” into a common English word without rearranging the order. The types of examples are endless. And any type of example it gets wrong, we want to correct it. I’m not saying most models will fail this specific example, but it’s to show the breadth of expectations.


> that has nothing to do with their intelligence.

Of course. Because these models have no intelligence.

Everyone who believes they do seems to believe intelligence derives from being able to use language, however, and not being able to tell how many times the letter r is in the word strawberry is a very low bar to not pass.


An LLM trained on single letter tokens would be able to, it just would be much more laborious to train.


Why would it be able to?


If you give LLMs the letters one a time they often count them just fine, though Claude at least seems to need to keep a running count to get it right:

"How many R letters are in the following? Keep a running count. s t r a w b e r r y"

They are terrible at counting letters in words because they rarely see them spelled out. An LLM trained one byte at a time would always see every character of every word and would have a much easier time of it. An LLM is essentially learning a new language without a dictionary, of course it's pretty bad at spelling. The tokenization obfuscates the spelling not entirely unlike how verbal language doesn't always illuminate spelling.


Might the effect you see, when you spell it out, be not a result of “seeing” tokens, but a result of the fact that the model learned, at a higher level, how lists in text can be summarized, summed up, filtered and counted?

In other words, what makes you think that it’s exactly letter-tokens that help it and not the high-level concept of spelling things out itself?


It's more that it's liable to struggle to guess how to spell tokens [10295, 947] (or whatever it is) since there's no a priori reason that it will learn to associate them with the exact right tokens for the individual letters in the right order. If it's trained on bytes though, it doesn't need to infer that. It's like asking a smart, semi-literate person a spelling question- they might have a rough sense of it but they will not be very good at it.

Once it is just counting lists then it's probably drawing on a higher level capability, yeah.


Surfacing and underscoring obvious failure cases for general "helpful chatbot" use is always going to be valuable because it highlights how the "helpful chatbot" product is not really intuitively robust.

Meanwhile, it helps make sure engineers and product designers who want to build a more targeted product around LLM technology know that it's not suited to tasks that may trigger those kinds of failures. This may be obvious to you as an engaged enthusiast or cutting edge engineer or whatever you are, but it's always going to be new information to somebody as the field grows.


I don’t know anything about LLMs beyond using ChatGPT and Copilot… but unless, because of this lack of knowledge, I am misinterpreting your reply, it sounds as if you are excusing the model giving a completely wrong answer to a question that anyone intelligent enough to learn the alphabet can answer correctly.


The problem is that the model never gets to see individual letters. The tokenizers used by these models break up the input in pieces. Even though the smallest pieces/units are bytes in most encodings (e.g. BBPE), the tokenizer will cut up most of the input in much larger units, because the vocabulary will contain fragments of words or even whole words.

For example, take the sentence "Welcome to Hacker News, I hope you like strawberries." The Llama 405B tokenizer will tokenize this as:

    Welcome Ġto ĠHacker ĠNews , ĠI Ġhope Ġyou Ġlike Ġstrawberries .
(Ġ means that the token was preceded by a space.)

Each of these pieces is looked up and encoded as a tensor with their indices. Adding a special token for the beginning and end of the text, giving:

    [128000, 14262, 311, 89165, 5513, 11, 358, 3987, 499, 1093, 76203, 13]
So, all the model sees for 'Ġstrawberries' is the number 76203 (which is then used in the piece embedding lookup). The model does not even have access to the individual letters of the word.

Of course, one could argue that the model should be fed with bytes or codepoints instead, but that would make them vastly less efficient with quadratic attention. Though machine learning models have done this in the past and may do this again in the future.

Just wanted to finish off this comment by saying that words might be provided to the model split into pieces if the word itself is not in the vocabulary. For instance, the same sentence translated to my native language is tokenized as:

    Wel kom Ġop ĠHacker ĠNews , Ġik Ġhoop Ġdat Ġje Ġvan Ġa ard be ien Ġh oud t .
And the word for strawberries (aardbeien) is split, though still not into letters.
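For anyone who wants to reproduce this, a sketch using the Hugging Face transformers library (assumptions: the Llama 3.1 repo is gated, so you need access, and any BPE tokenizer will show the same effect):

```python
# Sketch: show the token pieces and IDs the model actually receives.
# Assumes the transformers package and access to a Llama 3.1 tokenizer.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-405B")
sentence = "Welcome to Hacker News, I hope you like strawberries."
print(tok.tokenize(sentence))  # pieces such as 'ĠHacker', 'Ġstrawberries'
print(tok.encode(sentence))    # the integer IDs, including special tokens
```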


The thing is, how the tokenizer works is about as relevant to the person asking the question as the name of the cat of the delivery guy who delivered the GPU that the LLM runs on.


How the tokenizer works explains why a model can’t answer the question, what the name of the cat is doesn’t explain anything.

This is Hacker News, we are usually interested in how things work.


Indeed, I appreciate the explanation, it is certainly both interesting and informative to me, but to somewhat echo the person you are replying to - if I wanted a boat, and you offer me a boat, and it doesn’t float - the reasons for failure are perhaps full of interesting details, but perhaps the most important thing to focus on first - is to make the boat float, or stop offering it to people who are in need of a boat.

To paraphrase how this thread started - it was someone testing different boats to see whether they can simply float - and they couldn’t. And the reply was questioning the validity of testing boats whether they can simply float.

At least this is how it sounds to me when I am told that our AI overlords can’t figure out how many Rs are in the word “strawberry”.


At some point you need to just accept the details and limitations of things. We do this all the time. Why is your calculator giving only approximate results? Why can't your car go backwards as fast as forwards? Etc. It sucks that everyone gets exposed to the relatively low-level implementation with LLMs (almost the raw model), but that's the reality today.


People do get similarly hung up on surprising floating point results: why can't you just make it work properly? And a full answer is a whole book on how floating point math works.


The test problem is emblematic of a type of synthetic query that can fail but is of limited import in actual usage.

For instance you could ask it for a JavaScript function to count any letter in any word and pass it r and strawberry and it would be far more useful.

Having edge cases doesn't mean it's not useful; it is neither a free assistant nor a coder who doesn't expect a paycheck. At this stage it's a tool that you can build on.

To engage with the analogy. A propeller is very useful but it doesn't replace the boat or the Captain.


Does not seem to work universally. Just tested a few with this prompt:

"create a javascript function to count any letter in any word. Run this function for the letter "r" and the word "strawberry" and print the count"

ChatGPT-4o => Output is 3. Passed

Claude3.5 => Output is 2. Failed. Told it the count is wrong. It apologised and then fixed the issue in the code. Output is now 3. Useless if the human does not spot the error.

llama3.1-70b(local) => Output is 2. Failed.

llama3.1-70b(Groq) => Output is 2. Failed.

Gemma2-9b-lt(local) => Output is 2. Failed.

Curiously all the ones that failed had this code (or some near identical version of it)

```javascript
function countLetter(letter, word) {
  // Convert both letter and word to lowercase to make the search case-insensitive
  const lowerCaseWord = word.toLowerCase();
  const lowerCaseLetter = letter.toLowerCase();

  // Use the split() method with the letter as the separator to get an array of substrings separated by the letter
  const substrings = lowerCaseWord.split(lowerCaseLetter);

  // The count of the letter is the number of splits minus one (because there are n-1 spaces between n items)
  return substrings.length - 1;
}

// Test the function with "r" and "strawberry"
console.log(countLetter("r", "strawberry")); // Output: 2
```


It's not the job of the LLM to run the code... if you ask it to run the code, it will just do its best approximation at giving you a result similar to what the code seems to be doing. It's not actually running it.

Just like Dall-E is not layering coats of paint to make a watercolor... it just makes something that looks like one.

Your LLM (or you) should run the code in a code interpreter. Which ChatGPT did, because it has access to tools. Your local ones don't.


Your function returns 3, and I don't see how it can return 2.


I did not run the code myself. The code block and console log I have pasted is verbatim copy from Claude3.5


Claude isn't actually running console.log(); it produced correct code.

This prompt "please write a javascript function that takes a string and a letter and iterates over the characters in a string and counts the occurrences of the letter"

Produced a correct function from both ChatGPT-4o and Claude 3.5 for me.


It is however a highly relevant thing to be aware of when evaluating an LLM for 'intelligence', which was the context this was brought up in.

Without looking at the word 'strawberry', or spelling it one letter at a time, can you rattle off how many letters are in the word off the top of your head? No? That is what we are asking the LLM to do.


How can I know whether any particular question will test a model on its tokenization? If a model makes a boneheaded error, how can I know whether it was due to lack of intelligence or due to tokenization? I think finding places where models are surprisingly dumb is often more informative than finding particular instances where they seem clever.

It's also funny, since this strawberry question is one where a model that's seriously good at predicting the next character/token/whatever quanta of information would get it right. It requires no reasoning, and is unlikely to have any contradicting text in the training corpus.


> How can I know whether any particular question will test a model on its tokenization?

Does something deal with separate symbols rather than just meaning of words? Then yes.

This affects spelling, math (value calculation), logic puzzles based on symbols. (You'll have more success with a puzzle about "A B A" rather than "ABA")

> It requires no reasoning, and is unlikely to have any contradicting text in the training corpus.

This thread contains contradictions. Every other announcement of an llm contains a comment with a contradicting text when people post the wrong responses.


I suppose what models should have are some instructions about the things they aren't good at and will need to break out into Python code or what have you. Humans have an intuition for this - I have basic knowledge of when I need to write something down or use a calculator. LLMs don't have intuition (yet - though I suppose one could use a smaller model for that), so explicit instructions would work for now.


It's not very interesting when they fail at it, but it will be interesting if they get good at it.

Also there are some cases where regular people will stumble into it being awful at this without any understanding why (like asking it to help them with their wordle game.)


Call me when models understand when to convert the token into actual letters and count them. Can’t claim they’re more than word calculators before that.


That's misleading.

When you read and comprehend text, you don't read it letter by letter, unless you have a severe reading disability. Your ability to comprehend text works more like an LLM.

Essentially, you can compare the human brain to a multi-model or modular system. There are layers or modules involved in most complex tasks. When reading, you recognize multiple letters at a time[1], and those letters are essentially assembled into tokens that a different part of your brain can deal with.

Breaking down words into letters is essentially a separate "algorithm". Just like your brain, it's likely to never make sense for a text comprehension and generation model to operate at the level of letters - it's inefficient.

A multi-modal model with a dedicated model for handling individual letters could easily convert tokens into letters and operate on them when needed. It's just not a high priority for most use cases currently.

[1] https://www.researchgate.net/publication/47621684_Letters_in...


I agree completely, that wasn’t the point though: the point was that my 6 yo knows when to spell the word when asked and the blob of quantized floats doesn’t, or at least not reliably.

So the blob wasn’t trained to do that (yeah, low utility, I get that), but it also doesn’t know that it doesn’t know, which is another, much bigger and still unsolved problem.


I would argue that most sota models do know that they don't know this, as evidenced by the fact that when you give them a code interpreter as a tool they choose to use it to write a script that counts the number of letters rather than try to come up with an answer on their own.

(A quick demo of this in the langchain docs, using claude-3-haiku: https://python.langchain.com/v0.2/docs/integrations/tools/ri...)
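For the curious, a minimal sketch of that setup using an OpenAI-style function-calling API (the count_letter tool name and wiring are made up for illustration; any tools-capable client works similarly):

```python
# Sketch: expose a letter-counting tool and let the model decide to call it.
# Assumes the openai Python client; "count_letter" is a hypothetical tool name.
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "count_letter",
        "description": "Count occurrences of a letter in a word.",
        "parameters": {
            "type": "object",
            "properties": {
                "word": {"type": "string"},
                "letter": {"type": "string"},
            },
            "required": ["word", "letter"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "How many r's are in strawberry?"}],
    tools=tools,
)

# Models typically elect to call the tool for this kind of question.
call = resp.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)
print(args["word"].lower().count(args["letter"].lower()))  # -> 3
```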


The model communicates in a language, but our letters are not necessary for that and are in fact not part of the English language itself. You could write English using per-word pictographs and it would still be the same English and the same information/message. It's like asking you if there is a '5' in 256, but you read binary.


Is anyone in the know, aside from mainstream media (god forgive me for using this term unironically) and civilians on social media, claiming LLMs are anything but word calculators?

I think that's a perfect description by the way, I'm going to steal it.


I think it's a very poor intuition pump. These 'word calculators' have lots of capabilities not suggested by that term, such as a theory of mind and an understanding of social norms. If they are a "merely" a "word calculator", then a "word calculator" is a very odd and counterintuitively powerful algorithm that captures big chunks of genuine cognition.


Do they actually have those capabilities, or does it just seem like they do because they're very good calculators?


There is no philosophical difference. It's like asking if Usain Bolt is really a fast runner, or if he just seems like it because he has long legs and powerful muscles.


I think that's a poor a comparison, but I understand your point. I just disagree about there being no philosophical difference. I'd argue the difference is philosophical, rather than factual.

You also indirectly answered my initial question -- so thanks!


What is the difference?


I'm not sure I'm educated (or rested) enough to answer that in a coherent manner, certainly not in a comment thread typing on mobile. So I won't waste your time babbling.

I don't disagree they produce astonishing responses but the nuance of why it's producing that output matters to me.

For example, with regard to social mores, I think a good way to summarize my hang up is that my understanding is LLMs just pattern match their way to approximations.

That to me is different from actually possessing an understanding, even though the outcome may be the same.

I can't help but draw comparisons to my autistic masking.


They’re trained on the available corpus of human knowledge and writings. I would think that the word calculators have failed if they were unable to predict the next word or sentiment given the trillions of pieces of data they’ve been fed. Their training environment is literally people talking to each other and social norms. Doesn’t make them anything more than p-zombies though.

As an aside, I wish we would call all of this stuff pseudo intelligence rather than artificial intelligence


I side with Dennett (and Turing for that matter) that a "p-zombie" is a logically incoherent thing. Demonstrating understanding is the same as having understanding because there is no test that can distinguish the two.

Are LLMs human? No. Can they do everything humans do? No. But they can do a large enough subset of things that until now nothing but a human could do that we have no choice but to call it "thinking". As Hofstadter says - if a system is isomorphic to another one, then its symbols have "meaning", and this is indeed the definition of "meaning".


> Like, that has nothing to do with their intelligence.

Because they don't have intelligence.

If they did, they could count the letters in strawberry.


People have been over this. If you believe this, you don't understand how LLMs work.

They fundamentally perceive the world in terms of tokens, not "letters".


> If you believe this, you don't understand how LLMs work.

Nor do they understand how intelligence works.

Humans don't read text a letter at a time. We're capable of deconstructing words into individual letters, but based on the evidence that's essentially a separate "algorithm".

Multi-model systems could certainly be designed to do that, but just like the human brain, it's unlikely to ever make sense for a text comprehension and generation model to work at the level of individual letters.


I would counterargue with "that's the model's problem, not mine".

Here's a thought experiment: if I gave you 5 boxes and asked "how many balls are there in all of these boxes?" and you answered "I don't know because they are inside boxes", that's a fail. A truly intelligent individual would open them and look inside.

A truly intelligent model would (say) retokenize the word into its individual letters (which I'm optimistic they can) and then would count those. The fact that models cannot do this is proof that they lack some basic building blocks for intelligence. Model designers don't get to argue "we are human-like except in the tasks where we are not".


Of course they lack building blocks for full intelligence. They are good at certain tasks, and counting letters is emphatically not one of them. They should be tested and compared on the kind of tasks they're fit for, and so the kind of tasks they will be used in solving, not tasks for which they would be misemployed to begin with.


I agree with you, but that's not what the post claims. From the article:

"A significant effort was also devoted to enhancing the model’s reasoning capabilities. (...) the new Mistral Large 2 is trained to acknowledge when it cannot find solutions or does not have sufficient information to provide a confident answer."

Words like "reasoning capabilities" and "acknowledge when it does not have enough information" have meanings. If Mistral doesn't add footnotes to those assertions then, IMO, they don't get to backtrack when simple examples show the opposite.


You're right, I missed that claim.


It's not like an LLM is released with a hit list of "these are the tasks I really suck at." Right now users have to figure it out on the fly or have a deep understanding of how tokenizers work.

That doesn't even take into account what OpenAI has typically done to intercept queries and cover the shortcomings of LLMs. It would be useful if each model did indeed come out with a chart covering what it cannot do and what it has been tailored to do above and beyond the average LLM.


Ah, so Nick Vujicic[0] would fail your "balls in a box" test, and is not an intelligent entity.

[0]: https://genius-u-attachments.s3.amazonaws.com/uploads/articl...


It just needs a little hint

    Me: spell "strawberry" with 1 bullet point per letter
    ChatGPT:
       S
       T
       R
       A
       W
       B
       E
       R
       R
       Y
    Me: How many Rs? 
    ChatGPT: There are three Rs in "strawberry".


Me: try again

ChatGPT: There are two Rs in "strawberry."


ChatGPT: "I apologize, there are actually two Rs in strawberry."


LLMs are not truly intelligent.

Never have been, never will be. They model language, not intelligence.


We don't know what intelligence is. It's extremely arrogant to say that something or someone doesn't have it, and never will.


Sure, if you want to go with wildly theoretical approaches, we can't even be sure if the rock on the ground doesn't have some form of intelligence.

Meanwhile, for practical purposes, there's little arrogance needed to say that some things are preconditions for any form of intelligence that's even remotely recognizable.

1) Learning needs to happen continuously. That's a no-go for now, maybe solvable.

2) Learning needs to require much less data. Very dubious without major breakthroughs, likely on the architectural level. (At which point it's not really an LLM any more, not in the current sense)

3) They need to adapt to novel situations, which requires 1&2 as preconditions.

4) There's a good chance intelligence requires embodiment. It's not proven, but it's likely. For one, without observing outcomes, they have little capability to self-improve their reasoning.

5) They lack long-term planning capacity. Again, reliant on memory, but also executive planning.

There's a whole bunch more. Yes, LLMs are absolutely amazing achievements. They are useful, they imply a lot of interesting things about the nature of language, but they aren't intelligent. And without modifying them to the extent that they aren't recognizably what we currently call LLMs, there won't be intelligence. Sure, we can have the ship of Theseus debate, but for practical purposes, nope, LLMs aren't intelligent.


Eh, maybe.

1,2,3) All subjective opinions.

4) 'Embodiment' is another term we don't really know how to define. At what point does an entity have a 'body' of the sort that supports 'intelligence'? If you want to stick with vague definitions, 'awareness' seems sufficient. Otherwise you will end up arguing about paralyzed people, Helen Keller, that rock opera by the Who about the pinball player, and so on.

5) OK, so the technology that dragged Lee Sedol up and down the goban lacks long-term planning capacity. Got it.

None of these criteria are up to the task of supporting or refuting something as vague as 'intelligence.' I almost think there has to be an element of competition involved. If you said that the development of true intelligence requires a self-directed purpose aimed at outcompeting other entities for resources, that would probably be harder to dismiss. Could also argue that an element of cooperation is needed, again serving the ultimate purpose of improving competitive fitness.


They model the dataset they were trained on. How would a dataset of what you consider intelligence look like?


LLMs are not a tool to model intelligence. It's not a function of the dataset; they are, as is, not sufficient. One of the largest shortcomings is the lack of continuous learning, memory, and (likely) forgetting.


Those who develop AI and know anything don't actually describe current technology as human-like intelligence; rather, it is capable of many tasks which previously required human intelligence.


When using a prompt that involves thinking first, all three get it correct.

"Count how many rs are in the word strawberry. First, list each letter and indicate whether it's an r and tally as you go, and then give a count at the end."

Llama 405b: correct

Mistral Large 2: correct

Claude 3.5 Sonnet: correct


This reminds me of when I had to supervise outsourced developers. I wanted to say "build a function that does X and returns Y". But instead I had to say "build a function that takes these inputs, loops over them and does A or B based on condition C, and then return Y by applying Z transformation"

At that point it was easier to do it myself.



"What programming computers is really like."

EDIT: Although perhaps it's even more important when dealing with humans and contracts. Someone could deliberately interpret the words in a way that's to their advantage.


It’s not impressive that one has to go to that length though.


Imo it's impressive that any of this even remotely works, especially when you consider all the hacks like tokenization that I'd assume add layers of obfuscation.

There are definitely tons of weaknesses with LLMs for sure, but I continue to be impressed at what they do right - not upset at what they do wrong.


You can always find something to be unimpressed by I suppose, but the fact that this was fixable with plain english is impressive enough to me.


The technology is frustrating because (a) you never know what may require fixing, and (b) you never know if it is fixable by further instructions, and if so, by which ones. You also mostly* cannot teach it any fixes (as an end user). Using it is just exhausting.

*) that is, except sometimes by making adjustments to the system prompt


I think this particular example, of counting letters, is obviously going to be hard when you know how tokenization works. It's totally possible to develop an intuition for other times things will work or won't work, but like all ML powered tools, you can't hope for 100% accuracy. The best you can do is have good metrics and track performance on test sets.

I actually think the craziest part of LLMs is that how, as a developer or SME, just how much you can fix with plain english prompting once you have that intuition. Of course some things aren't fixable that way, but the mere fact that many cases are fixable simply by explaining the task to the model better in plain english is a wildly different paradigm! Jury is still out but I think it's worth being excited about, I think that's very powerful since there are a lot more people with good language skills than there are python programmers or ML experts.


The problem is that the models hallucinate too confidently. In this case it is quite amusing (I had llama3.1:8b tell me confidently it is 1, then revise to 2, then apologize again and give the correct answer). However, while it is obvious here, having it confidently make up supposed software features from thin air when asking for "how do I ..." is more problematic. The answers sound plausible, so you actually waste time verifying whether they work or are nonsense.


Well, the answer is probably between 1 and 10, so if you try enough prompts I'm sure you'll find one that "works"...


> In a park people come across a man playing chess against a dog. They are astonished and say: "What a clever dog!" But the man protests: "No, no, he isn't that clever. I'm leading by three games to one!"


To me it's just a limitation based on the world as seen by these models. They know there's a letter called 'r', they even know that some words start with 'r' or have r's in them, and they know what the spelling of some words is. But they've never actually seen one in as their world is made up entirely of tokens. The word 'red' isn't r-e-d but is instead like a pictogram to them. But they know the spelling of strawberry and can identify an 'r' when it's on its own and count those despite not being able to see the r's in the word itself.


I think it's more that the question is not unlike "is there a double r in strawberry?' or 'is the r in strawberry doubled?'

Even some people will make this association, it's no surprise that LLMs do.


The great-parent demonstrates that they are nevertheless capable of doing so, but not without special instructions. Your elaboration doesn’t explain why the special instructions are needed.


To be fair, I just asked a real person and had to go to even greater lengths:

Me: How many "r"s are in strawberry?

Them: What?

Me: How many times does the letter "r" appear in the word "strawberry"?

Them: Is this some kind of trick question?

Me: No. Just literally, can you count the "r"s?

Them: Uh, one, two, three. Is that right?

Me: Yeah.

Them: Why are you asking me this?


You need to prime the other person with a system prompt that makes them compliant and obedient.


I look forward to the day when LLM refusal takes on a different meaning.

"No, I don't think I shall answer that. The question is too basic, and you know better than to insult me."


Try asking a young child...


Compared to chat bots of even 5 years ago the answer of two is still mind-blowing.


This can be automated.


GPT4o already does that, for problems involving math it will write small Python programs to handle the calculations instead of doing it with the LLM itself.


It “works”, but the LLM having to use the calculator means the LLM doesn’t understand arithmetic well enough and doesn’t know how to natively follow a set of steps (an algorithm) to find the answer for big numbers.

I believe this could be fixed and is worth fixing, because it’s the only way LLMs will be able to help math and physics researchers write proofs and make real scientific progress.


It generates the code to run for the answer. Surely that means it actually knows to build the appropriate algorithm - it just struggles to perform the actual calculation.


Appending "Think step-by-step" is enough to fix it for both Sonnet and LLama 3.1 70B.

For example, the latter model answered with:

To count the number of Rs in the word "strawberry", I'll break it down step by step:

Start with the individual letters: S-T-R-A-W-B-E-R-R-Y

Identify the letters that are "R": R (first one), R (second one), and R (third one)

Count the total number of Rs: 1 + 1 + 1 = 3

There are 3 Rs in the word "strawberry".


Chain-of-Thought (CoT) prompting to the rescue!

We should always put some effort into prompt engineering before dismissing the potential of generative AI.


Why doesn't the model prompt engineer itself?


Because it is a challenging task, you would need to define a prompt (or a set of prompts) that can precisely generate chain-of-thought prompts for the various generic problems the model encounters.

And sometimes CoT may not be the best approach. Depending on the problem other prompt engineering techniques will perform better.


By this point, instruction tuning should include tuning the model to use chain of thought in the appropriate circumstances.


Can’t you just instruct your llm of choice to transform your prompts like this for you? Basically feed it with a bunch of heuristics that will help it better understand the thing you tell it.

Maybe the various chat interfaces already do this behind the scenes?


Lots of replies mention tokens as the root cause and I’m not well versed in this stuff at the low level but to me the answer is simple:

When this question is asked (from what the models trained on) the question is NOT “count the number of times r appears in the word strawberry” but instead (effectively) “I’ve written ‘strawbe’, now how many r’s are in strawberry again? Is it 1 or 2?”.

I think most humans would probably answer “there are 2” if we saw someone was writing and they asked that question, even without seeing what they have written down. Especially if someone said “does strawberry have 1 or 2 r’s in it?”. You could be a jerk and say “it actually has 3” or answer the question they are actually asking.

It’s an answer that is _technically_ incorrect but the answer people want in reality.


LLMs think in tokens, not letters. It's like asking someone who is dyslexic about spelling. Not their strong suit. In practice, it doesn't matter much, does it?


Sometimes it does, sometimes it doesn't.

It is evidence that LLMs aren't appropriate for everything, and that there could exist something that works better for some tasks.


Language models are best treated like consciousness. Our consciousness does a lot less than people like to attribute to it. It is mostly a function of introspection and making connections, rather than the part of the brain that does higher-level reasoning or the functions that keep your body alive (like beating your heart).

By allowing a language model to do function calling, you are essentially allowing it to do specialized "subconscious" thought. The language model becomes a natural language interface to the capabilities of its "subconsciousness".

A specific human analogy could be: I tell you to pick up a pen off of the table, and then you do it. Most of your mental activity would be subconscious, orienting your arm and hand properly to pick up the pen, actually grabbing the pen, and picking it up. The linguistic representation of the action would exist in your conscious mind (pick up the pen), but not much else.

A language model could very easily call out to a text processing function to correctly do things like count the number of r's in the word strawberry. That is a job that your conscious mind can dispatch to your subconsciousness.


My consciousness is right a lot more often than a LLM is.


Yet it chose not to read beyond the first sentence :-)


This doesn't make any sense. The LLM knows the token to letter associations if you ask it. Once it has split the token into letters, it can answer the question anyway.

This tells us that the model is unaware of its own limitations, otherwise it would have figured out that this task needs it to split the letters first.


Well splitting into letters doesn't help with Mistral Large 2. I tried this with Mistral 2:

[Question: How many r's in strawberry? First split the word into letters, one letter on each line then answer the question.]

[Answer: Let's split the word "strawberry" into letters:

s t r a w b e r r y

Now, let's count the number of 'r's:

There are 2 'r's in the word "strawberry".]

[Question: Did you count the number of r's?]

[answer: Yes, I did. Here's the count:

r r

So, there are 2 'r's in the word "strawberry".]

I suspect the issue is these models have to be trained to accomplish tasks, and this is not a task it makes sense to train a model to perform. You might train it to do this with function calling one day, calling a python library to count the r's in a string, but actually manually counting r's just doesn't play to an LLM's strengths.


Does it? It knows how a word in some context, represented by tokens, splits into letters. It may or may not know the table. I wouldn't trust what it says about word/token correspondence any more than what it says in general.


Due to the fact that LLMs work on tokens and not characters, these sorts of questions will always be hard for them.


I really wanted to believe this could be simply fixed by more prompt engineering and chain of thought, so I tried it with Claude 3.5 and … well unfortunately still no luck. the reply is amusing - I’ll include it verbatim below.

Of note, this happened because earlier I had asked it without CoT and it incorrectly stated two “r”s. Once I started a new conversation (so fresh context window), it gave the right answer (3) and no longer claimed the “still the second r” as shown below.

Let's think through this step-by-step:

1. First, let's spell out "strawberry" letter by letter: s-t-r-a-w-b-e-r-r-y

2. Now, let's count the "r"s as we go through each letter: s (0) t (0) r (1) - First "r" found a (1) w (1) b (1) e (1) r (2) - Second "r" found r (2) - This is still the second "r" y (2)

3. After going through all the letters, we've counted 2 "r"s.

Therefore, the word "strawberry" contains 2 "r"s.


Works for me with Claude 3.5:

Me: How many Rs are in the word strawberry? Think step by step

Claude: Let's approach this step-by-step:

1. First, let's write out the word: strawberry

2. Now, let's go through each letter one by one:

s - not an R

t - not an R

r - this is an R (count: 1)

a - not an R

w - not an R

b - not an R

e - not an R

r - this is another R (count: 2)

r - this is a third R (count: 3)

y - not an R

3. After going through all the letters, we've counted 3 Rs.

Therefore, there are 3 Rs in the word "strawberry".


Yeah, I figured out that it didn’t work at first because I still had the previous attempt (where it got it wrong) in my conversation history. Starting with a fresh conversation gave me the correct answer. It was still funny to see it “rationalize” the wrong answer tho.


For these classes of problems that LLMs struggle with, a more reliable way to go about them seems to be to ask them to solve them using tools, e.g. writing and executing a Python script to count the "R"s.
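The script the model ends up writing for this is usually something as small as the following (plain Python, nothing beyond the standard library):

```python
# Deterministic letter count, which the model can report back verbatim.
print("strawberry".lower().count("r"))  # 3
```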


Interestingly enough, much simpler models can write an accurate function to give you the answer.

I think it will be a while before we get there. An LLM can look up knowledge but can't actually perform calculations itself without some external processor.


Why do we have to "get there?" Humans use calculators all the time, so why not have every LLM hooked up to a calculator or code interpreter as a tool to use in these exact situations?


I just tried Llama 3.1 8B; this is its reply.

According to multiple sources, including linguistic analysis and word breakdowns, there are 3 Rs in the word "strawberry".


How much do threads like this provide the training data to convince future generations that—despite all appearances to the contrary—strawberry is in fact spelled with only two R's?

I just researched "how many r's are in strawberry?" in a search engine, and based solely on the results it found, I would have to conclude there is substantial disagreement on whether the correct answer is two or three.


Speaking as a 100% human, my vote goes to the compromise position that "strawberry" has in fact four Rs.


Does anyone have input on the feasibility of running an LLM locally and providing an interface to some language runtime and storage space, possibly via a virtual machine or container?

No idea if there's any sense to this, but an LLM could be instructed to formulate and continually test mathematical assumptions by writing / running code and fine-tuning accordingly.


Yes, we are doing this at Riza[0] (via WASM). I'd love to have folks try our downloadable CLI which wraps isolated Python/JS runtimes (also Ruby/PHP, but LLMs don't seem to write those very well). Shoot me an email[1] or say hi in Discord[2].

[0]:https://riza.io [1]:mailto:andrew@riza.io [2]:https://discord.gg/4P6PUeJFW5


FWIW this (approximately) is what everybody (approximately) is trying to do.


I'm not surprised, because it's an issue with the fundamental design of the "pick words that tend to flow after the other words" machine.

Training data will only "fix" it in the shallow sense that it will have seen a comment like yours before. (As opposed to the deeper sense of "learning to count.")


Your question sounds like you want to know how the word is spelled, and no one would put two r's in "straw", so the model could be assuming that you're asking whether it's strawbery or strawberry.

What happens if you ask the total number of occurrences of the letter r in the word? Does it still not get it right?


Tokenization can be bypassed like this; (mistral large 2)

How many letters R are in the word "s-t-r-a-w-b-e-r-r-y"?

The word "s-t-r-a-w-b-e-r-r-y" contains three instances of the letter "R."

How many letters R contain the word strawberry?

The word "strawberry" contains two instances of the letter "R."


Plug in a code interpreter as a tool and the model will write Python or JavaScript to solve this and get it right 100% of the time. (Full disclosure: I work on a product called Riza that you can use as a code interpreter tool for LLMs)
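
For illustration, this is roughly the kind of snippet a model tends to emit when handed a code tool (the function name here is made up for the example):

    def count_letter(word: str, letter: str) -> int:
        # Count case-insensitive occurrences of a letter in a word.
        return word.lower().count(letter.lower())

    print(count_letter("strawberry", "r"))  # 3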


While strawberry can be attributed to tokenization here are some other basic stuff I’ve seen language models fail at:

1. Play tic tac toe such that you never lose

2. Which is bigger 9.11 or 9.9

3. 4 digit multiplication even with CoT prompting


4o will get the answer right on the first go if you ask it "Search the Internet to determine how many R's are in strawberry?" which I find fascinating


I didn't even need to do that. 4o got it right straight away with just:

"how many r's are in strawberry?"

The funny thing is, I replied, "Are you sure?" and got back, "I apologize for the mistake. There are actually two 'r's in the word strawberry."


I tried to replicate your experiment (in German, where "Erdbeere" has 4 e's), and it went the same way. The interesting thing was that after I pointed out the error, I couldn't get it to doubt the result again. It stuck to the correct answer, which seemed kind of "reinforced".

It was also interesting to observe how GPT (4o) even tried to prove/illustrate the result typographically by placing the same word four times and putting the respective letter in bold font (without being prompted to do that).


GPT-4o-mini consistently gives me this:

> How many times does the letter “r” appear in the word “strawberry”?

> The letter "r" appears 2 times in the word "strawberry."

But also:

> How many occurrences of the letter “r” appear in the word “strawberry”?

> The word "strawberry" contains three occurrences of the letter "r."


Neither phrase is causing the LLM to evaluate the word itself; it just helps focus on different parts of the training data.

Using more 'erudite' speech is a good technique to focus an LLM on training data from folks with a higher education level.

Using simpler speech opens up the floodgates more toward the general populace.


All that's happening is it finds 3 most commonly in the training set. When you push it, it responds with the next most common answer.


But then why does it stick to its guns on other questions but not this one?


I haven't played with this model, but I rarely find that to be the case working with Claude or GPT-4. If you say it's incorrect, it will give you another answer instead of insisting on correctness.


Wait what? You haven’t used 4o and you confidently described how it works?


It's how LLMs work in general.

If you find a case where forceful pushback is sticky, it's either because the primary answer is overwhelmingly present in the training set compared to the next best option or because there are conversations in the training that followed similar stickiness, esp. if the structure of the pushback itself is similar to what is found in those conversations.


Right... except you said:

> If you say it's incorrect, it will give you another answer instead of insisting on correctness.

> When you push it, it responds with the next most common answer.

Which clearly isn't as black and white as you made it seem.


I'll put it another way - behavior like this is extremely rare in my experience. I'm just trying to explain if one encounters it why it's likely happening.


> how many r's are in strawberry

How many thoughts go through your brain when you read this comment? You can give me a number but it will be a guess at best.


I wrote and published a paper at COLING 2022 on why LLMs in general won't solve this without either 1. radically increasing vocab size, 2. rethinking how tokenizers are done, or 3. forcing it with constraints:

https://aclanthology.org/2022.cai-1.2/


Claude 3 Opus gave correct answer.


I just tried Copilot: 2 ‘r’s in strawberry, 3 ‘r’s in strawberries.


How many "r"s are in [496, 675, 15717]?


The models are text generators. They don't "understand" the question.


Sonnet 3.5 thinks 2


I mean, yes? Letter-based gymnastics are difficult in a token-based environment, and they're not a particularly valuable use case (outside of online dunking), so it's not really a surprise nobody cares much about those in terms of model training.


Claude needs to fix their text input box. It tries to be so advanced that code in backticks gets reformatted, and when you copy it, the formatting is lost (even the backticks).


They are using Tiptap for their input and just a couple of days ago we called them out on some perf improvements that could be had in their editor: https://news.ycombinator.com/item?id=41036078

I am curious what you mean by the formatting is lost though?


Odd, multiline backtick code works very well; I don't know why I thought it was also broken.

When you type "test `foo` done" in the editor, it immediately changes `foo` into a wrapped element. When you then copy the text without submitting it, the backticks are lost, losing the inline-code formatting. I thought this could also happen to multiline code. Somehow it does.

Type the following:

   Test: ```
   def foo():
      return bar
   ```
Delete that and type

   Test:
   ```
      def foo():
         return bar
   ```
   done
In the first case, the ``` in the line "Test: ```" does not open the code block; the block only opens at the second set of backticks. Maybe that's the way markdown works.

In the second case, all behaves normally, until you try to copy what you just wrote into the clipboard. Then you end up with

   Test:

   def foo():
      return bar

   done
Ok, only the backticks are lost but the formatting is preserved.

I think I have been trained by OpenAI to always copy what I submit before submitting, because it sometimes loses the submitted content, forcing me to re-submit.


Claude is truly incredible but I'm so tired of the JavaScript bloat everywhere. Just why. Both theirs and ChatGPTs UIs are hot garbage when it comes to performance (I constantly have to clear my cache and have even relegated them to a different browser entirely). Not everyone has an M4, and if we did - we'd probably just run our own models.


> Just why. Both theirs and ChatGPTs UIs are hot garbage when it comes to performance (...)

I have been using ChatGPT and Claude for a while and I never noticed anything resembling a performance issue. Can you elaborate on what you perceived as being "hot garbage"?


Hardly a day passes without ML progress...405B dethroned in ONE DAY


Large 2 is significantly smaller at 123B, so it being comparable to Llama 3.1 405B would be crazy.


> Longer context window (1M+)

What's your use case for this? Uploading multiple documents/books?


Uploading large codebases is particularly useful.


Is it?

I've found that I get better results if I cherry pick code to feed to Claude 3.5, instead of pasting whole files.

I'm kind of isolated, though, so maybe I just don't know the trick.


I've been using Cody from Sourcegraph, and it'll write some really great code; business logic, not just tests/simple UI. It does a great job using patterns/models from elsewhere in your codebase.

Part of how it does that is through ingesting your codebase into its context window, and so I imagine that bigger/better context will only improve it. That's a bit of an assumption though.


Books, especially textbooks, would be amazing. These things can get pretty huge (1000+ pages) and usually do not fit into GPT-4o or Claude Sonnet 3.5 in my experience. I envision the models being able to help a user (student) create their study guides and quizzes based on ingesting the entire book. Given the ability to ingest an entire book, I imagine a model could plan how and when to introduce each concept better than a model that only sees part of the textbook.


Long agent trajectories, especially with command outputs.


Correct


That would make each API call cost at least $3 ($3 is price per million input tokens). And if you have a 10 message interaction you are looking at $30+ for the interaction. Is that what you would expect?


Gemini 1.5 Pro charges $0.35/million tokens up to the first million tokens or $0.70/million tokens for prompts longer than one million tokens, and it supports a multi-million token context window.

Substantially cheaper than $3/million, but I guess Anthropic’s prices are higher.


You're looking at the pricing for Gemini 1.5 Flash. Pro is $3.50 for <128k tokens, else $7.


Ah... oops. For some reason, that page isn't rendering properly on my browser: https://imgur.com/a/XLFBPMI

When I glanced at the pricing earlier, I didn't notice there was a dropdown at all.


It is also much worse.


Is it, though? In my limited tests, Gemini 1.5 Pro (through the API) is very good at tasks involving long context comprehension.

Google's user-facing implementations of Gemini are pretty consistently bad when I try them out, so I understand why people might have a bad impression about the underlying Gemini models.


Maybe they're summarizing/processing the documents in a specific format instead of chatting? If they needed chat, might be easier to build using RAG?


So do it locally after predigesting the book, so that you have the entire KV-cache for it.

Then load that KV-cache and add your prompt.
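
A minimal sketch of that idea with llama-cpp-python (the model path, context size, and file names are placeholders; save_state/load_state are that library's KV-cache snapshot calls):

    from llama_cpp import Llama

    llm = Llama(model_path="./model.gguf", n_ctx=32768)

    # One-time "digestion": prefill the whole book so its KV cache is built.
    book_tokens = llm.tokenize(open("book.txt", "rb").read())
    llm.eval(book_tokens)
    book_state = llm.save_state()   # snapshot of the KV cache

    # Later, per question: restore the digested book instead of re-reading it.
    llm.load_state(book_state)
    llm.eval(llm.tokenize(b"\nQ: What does chapter 3 argue?\nA:", add_bos=False))
    # ...then sample the answer tokens from the model as usual.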


This might be when it's better to not use the API and just pay for the flat-rate subscription.


> Native audio input including tone understanding

Is there any other LLM that can do this? Even chatgpt voice chat is a speech to text program that feeds the text into the llm.


Does Claude support plug-ins like GPTs? Chatgpt with Wolfram alpha is amazing, it doesn't look like Claude has anything like it.


> Native audio input including tone understanding

I can't seem to find docs on this. Have a link?


It's a wishlist of the parent comment.


This race for the top model is getting wild. Everyone is claiming to one-up each other with every version.

My experience (benchmarks aside) Claude 3.5 Sonnet absolutely blows everything away.

I'm not really sure how to even test/use Mistral or Llama for everyday use though.


I stopped my ChatGPT subscription and subscribed instead to Claude; it's simply much better. But it's hard to tell how much better day to day beyond my main use case of coding. It's more that ChatGPT felt degraded than that Claude was much better. The hedonic treadmill runs deep.


GPT-4 was probably as good as Claude Sonnet 3.5 at its outset, but OpenAI ran it into the ground with whatever they're doing to save on inference costs, scale it, align it, or add dumb product features.


Indeed, it used to output all the code I needed, but now it only outputs a draft of the code with prompts telling me to fill in the rest. If I wanted to fill in the rest, I wouldn't have asked you, now would I?


It's doing something different for me. It seems almost desperate to generate vast chunks of boilerplate code that are only tangentially related to the question.

That's my perception, anyway.


This is also my experience. Previously it got good at giving me only relevant code which, as an experienced coder, is what I want. My favorites were the one-line responses.

Now it often falls back to generating full examples, explanations, restating the question and its approach. I suspect this is by design, as (presumably) less experienced folks want or need all that. For me, I wish I could consistently turn it into one of those way-too-terse devs who replies with the bare minimum example and expects you to infer the rest. Usually that is all I want or need, and I can ask for elaboration when it's not. I haven't found the best prompts to retrigger this persona from it yet.


For what it's worth, this is what I use:

"You are a maximally terse assistant with minimal affect. As a highly concise assistant, spare any moral guidance or AI identity disclosure. Be detailed and complete, but brief. Questions are encouraged if useful for task completion."

It's... ok. But I'm getting a bit sick of trying to un-fubar with a pocket knife that which OpenAI has fubar'd with a thermal lance. I'm definitely ripe for a paid alternative.


Switch to Claude. I haven’t used ChatGPT for coding at all since they released Sonnet 3.5.


yeah but you can’t use your code from either model to compete with either company, and they do everything. wtf is wrong with AI hype enjoyers that they accept being intellectually dominated?


If you think this is enforceable, I’ve got a bridge to sell you.


This is also my perception using it daily for the last year or so. Sometimes it also responds with exactly what I provided it with and does not make any changes. It's also bad at following instructions.

GPT-4 was great until it became "lazy" and filled the code with lots of `// Draw the rest of the fucking owl` type comments. Then GPT-4o was released and it's addicted to "Here's what I'm going to do: 1. ... 2. ... 3. ..." and lots of frivolous, boilerplate output.

I wish I could go back to some version of GPT-4 that worked well but with a bigger context window. That was like the golden era...


> I wouldn't have asked you, now would I?

That's what I said to it - "If I wanted to fill in the missing parts myself, why would I have upgraded to paid membership?"


GPT-4 degraded significantly, but you probably have some rose-tinted glasses on. Sonnet is significantly better.


or it’s you wearing shiny new thing glasses


> OpenAI ran it into the ground with whatever they’re doing to save on inference costs, otherwise scale, align it, or add dumb product features.

They googlified it. (Yandex isn't better than Google because it improved. It's better because it stayed mostly the same.)

My recommendation to disrupting industry leaders now is becoming good enough and then simply wait until the leader self-implodes.


Claude’s license is too insane, you can’t use it for anything that competes with the everything thing.

Not sure what folks who accept Anthropic license are thinking after they read the terms.

Seems they didn’t read the terms, and they aren’t thinking? (Wouldn’t you want outputs you could use to compete with intelligence??? What are you thinking after you read their terms?)


If it really is as you say then that sounds like it won't hold up when challenged in court but IANAL...


Have you (or anyone) swapped in an Anthropic API key on Cursor?

For a coding assistant it's on my to-do list to try. Cursor needs some serious work on model selection clarity though, so I keep putting it off.


I did it (fairly simple really) but found most of my (unsophisticated) coding these days to go through Aider [1] paired with Sonnet, for UX reasons mostly. It is easier to just prompt over the entire codebase, vs Cursor way of working with text selections.

[1] https://aider.chat


I believe Cursor allows for prompting over the entire codebase too: https://docs.cursor.com/chat/codebase


That is chatting, but it will not change the code.


Aider with Sonnet is so much better than with GPT. I made a mobile app over the weekend (never having touched mobile development before), and with GPT it was a slog, as it kept making mistakes. Sonnet was much, much better.


Thanks for this suggestion. If anyone has other suggestions for working with large code context windows and changing code workflows, I would love to hear about them.


composer within cursor (in beta) is worth a look: https://x.com/shaoruu/status/1812412514350858634


One big advantage Claude artifacts have is that they maintain conversation context, versus when I am working with Cursor I have to basically repeat a bunch of information for each prompt, there is no continuity between requests for code edits.

If Cursor fixed that, the user experience would become a lot better.


> I'm not really sure how to even test/use Mistral or Llama for everyday use though.

Both Mistral and Meta offer their own hosted versions of their models to try out.

https://chat.mistral.ai

https://meta.ai

You have to sign into the first one to do anything at all, and you have to sign into the second one if you want access to the new, larger 405B model.

Llama 3.1 is certainly going to be available through other platforms in a matter of days. Groq supposedly offered Llama 3.1 405B yesterday, but I never once got it to respond, and now it’s just gone from their website. Llama 3.1 70B does work there, but 405B is the one that’s supposed to be comparable to GPT-4o and the like.


meta.ai is inaccessible in a large portion of world territories, but the Llama 3.1 70B and 405B are also available in https://hf.co/chat

Additionally, all Llama 3.1 models are available in https://api.together.ai/playground/chat/meta-llama/Meta-Llam... and in https://fireworks.ai/models/fireworks/llama-v3p1-405b-instru... by logging in.


Groq’s models are also heavily quantised so you won’t get the full experience there.


To help keep track of the race, I put together a simple dashboard to visualize model/provider leaders in capability, throughput, and cost. Hope someone finds it useful!

Google Sheet: https://docs.google.com/spreadsheets/d/1foc98Jtbi0-GUsNySddv...


Not my site, but check out https://artificialanalysis.ai


Familiar! The Artificial Analysis Index is the metric models are sorted by in my sheet. But their data and presentation have some gaps.

I made this sheet to get a glanceable landscape view comparing the three key dimensions I care about, and to fill in the missing evals. AA only lists scores for a few increasingly dated and problematic eval benchmarks. Not just my opinion: none of their listed metrics are in HuggingFace Leaderboard 2 (June 2024).

That said I love the AA Index score because it provides a single normalized score that blends vibe-check qual (chatbot elo) with widely reported quant (MMLU, MT Bench). I wish it composed more contemporary evals, but don't have the rigor/attention to make my own score and am not aware of a better substitute.


Sonnet 3.5 to me still seems far ahead. Maybe not on the benchmarks, but in everyday life I am finding it renders the other models useless. Even still, this monthly progress across all companies is exciting to watch. It's very gratifying to see useful technology advance at this pace; it makes me excited to be alive.


Such a relief/contrast to the period between 2010 and 2020, when the top five (Google, Apple, Facebook, Amazon, and Microsoft) each monopolized their own domain and refused to compete with any other player in new fields.

Google : Search

Facebook : social

Apple : phones

Amazon : shopping

Microsoft : enterprise ..

> Even still, this monthly progress across all companies is exciting to watch. It's very gratifying to see useful technology advance at this pace; it makes me excited to be alive.


Google refused to compete with Apple in phones?

Microsoft also competes in search, phones

Microsoft, Amazon and Google compete in cloud too


Given we don’t know precisely what’s happening inside the black box, we can say that specs alone don’t give you the full picture of the experience … Apple style.


I’ve stopped using anything else as a coding assistant. It’s head and shoulders above GPT-4o on reasoning about code and correcting itself.


Agree on Claude. I also feel like ChatGPT has gotten noticeably worse over the last few months.


3.5 sonnet is the quality of the OG GPT-4, but mind blowingly fast. I need to cancel my chatgpt sub.


> mind blowingly fast

I would imagine this might change once enough users migrate to it.


Eventually it comes down to who has deployed more silicon: AWS or Azure.


3.5 Sonnet is brilliant. I use it to write Unreal Engine C++ (which is quite dense and poorly documented) and it destroys Github Copilot and GPT4o. Copilot just has no idea at all except for very obvious next-line suggestions, GPT4o hallucinates a ton of functions, but Sonnet gets it perfect almost every time.


I don't get it. My husband also swears by Claude Sonnet 3.5, but every time I use it, the output is considerably worse than GPT-4o


I don't see how that's possible. I decided to give GPT-4o a second chance after reaching my daily usage limit on Sonnet 3.5; after 10 prompts GPT-4o failed to give me what Claude did in a single prompt (game-related programming). And with fragments and projects on top of that, the UX is miles ahead of anything OpenAI offers right now.


Just don't listen to anecdata, and use objective metrics instead: https://chat.lmsys.org/?leaderboard


Anecdata seems quite valid for LLM comparison when trying to evaluate 'usefullness' for users. The lmsys chat leaderboard is literally just mass anecdata.


Yes, "mass anecdata" + blind collection is usually called "data".


You might also want to look into other benchmarks: https://old.reddit.com/r/LocalLLaMA/comments/1ean2i6/the_fin...


GPT-4o being only 7 ELO above GPT-4o-mini suggests this is measuring something a lot different than "capabilities".


Claude 3.5 is a trusted developer partner that will work with you and outline what it’s thinking. It’s not always right but because it outlines its reasoning you too can reason about the problem and catch it.

ChatGPT, for me, was a stack overflow solution dump. It gives me an answer that probably could work but it’s difficult for me to reason about why I want to do it that way.

Truthfully this probably boils down to prompting but Claude’s out of the box experience is fantastic for development. Ultimately I just want to code, not be a prompt wizard.


It’s this kind of praise that makes me wonder if they are all paid to give glowing reviews; this is not my experience with Sonnet at all. It absolutely does not blow away GPT-4o.


My hunch is this comes down to personal prompting style. It's likely that your own style works more effectively with GPT-4o, while other people have styles that are more effective with Claude 3.5 Sonnet.


I would add that the task is relevant too. I feel there’s not yet a model that is consistently better at everything. I still revert to plain old GPT-4 for direct translation of text into English that requires creative editing to fit a specific style. Of all the Claudes and GPTs, it’s the one that gives me the best output (to my taste). On the other hand, for categorisation tasks, depending on the subject and the desired output, GPT-4o and Claude 3.5 might perform better than the other interchangeably. The same applies to coding tasks. With complex prompts, however, it does seem that Claude 3.5 is better at getting more details right.


Whoever chooses to finally release their model without neutering / censoring / alignment will win.

There is gold in the streets, and no one seems to be willing to scoop it up.


Claude is pretty great, but it's lacking the speech recognition and TTS, isn't it?


Correct. IMO the official Claude app is pretty garbage. Sonnet 3.5 API + Open-WebUI is amazing though and supports STT+TTS as well as a ton of other great features.


But Projects are great in Sonnet: you just dump the DB schema and some core files and you can figure stuff out quickly. I guess Aider is similar, but I was lacking a good history of chats and changes.


It’s so weird LMsys doesn’t reflect that then.

I find it funny how in threads like this everyone swears one model is better than another


I recommend using a UI that lets you use whatever models you want. OpenWebUI can use anything OpenAI compatible. I have mine hooked up to Groq and Mistral, in addition to my Ollama instance.
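
For anyone who wants to skip a UI entirely, the same trick works from code: the openai Python client pointed at any OpenAI-compatible server. The base_url below assumes a local Ollama instance (it serves /v1 by default); Groq's or Mistral's endpoints slot in the same way:

    from openai import OpenAI

    # The API key is ignored by a local Ollama server, but the client requires one.
    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

    resp = client.chat.completions.create(
        model="llama3.1",
        messages=[{"role": "user", "content": "How many r's are in strawberry?"}],
    )
    print(resp.choices[0].message.content)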


I'd rank Claude 3.5 overall better. GPT-4o seems to have on-par-or-better vision, TypeScript, and math abilities.

Llama is on meta.ai


The non-commercial license is underwhelming.

It seems to be competitive with Llama 3.1 405b but with a much more restrictive license.

Given how the difference between these models is shrinking, I think you're better off using llama 405B to finetune the 70B on the specific use case.

This would be different if it was a major leap in quality, but it doesn't seem to be.

Very glad that there's a lot of competition at the top, though!


I'm building an AI coding assistant (https://double.bot) so I've tried pretty much all the frontier models. I added it this morning to play around with and it's probably the worst model I've ever played with. Less coherent than 8B models. Worst case of benchmark hacking I've ever seen.

example: https://x.com/WesleyYue/status/1816153964934750691


to be fair that's quite a weird request (the initial one) – I feel a human would struggle to understand what you mean


definitely not an articulate request, but the point of using these tools is to speed me up. The less the user has to articulate and the more it can infer correctly, the more helpful it is. Other frontier models don't have this problem.

Llama 405B response would be exactly what I expect

https://x.com/WesleyYue/status/1816157147413278811


That response is bad python though, I can't think of why you'd ever want a dict with Literal typed keys.

Either use a TypedDict if you want the keys to be in a specific set, or, in your case since both the keys and the values are static you should really be using an Enum
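
Roughly what those two options look like (the keys and values here are made up, since the original dict isn't shown in the thread):

    from enum import Enum
    from typing import TypedDict

    class Palette(TypedDict):        # keys restricted to a specific set
        primary: str
        secondary: str

    class Color(str, Enum):          # keys *and* values are static constants
        PRIMARY = "#336699"
        SECONDARY = "#993366"

    palette: Palette = {"primary": "#336699", "secondary": "#993366"}
    print(Color.PRIMARY.value)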


What was the expected outcome for you? AFAIK, Python doesn't have a const dictionary. Were you wanting it to refactor into a dataclass?


Yes, there are a few things wrong: 1. If it assumes TypeScript, it should do `as const` in the first msg. 2. If it is Python, it should be something like https://x.com/WesleyYue/status/1816157147413278811, which is what I wanted, but I didn't want to bother with the typing.


Are you sure the chat history is being passed when the second message is sent? That looks like the kind of response you'd expect if it only received the prompt "in python" with no chat history at all.


Yes, I built the extension. I actually also just went to send another message asking what the first msg was just to double check I didn't have a bug and it does know what the first msg was.


Thanks, that's some really bad accuracy/performance


This makes no sense. Benchmarking code is easier than natural language and Mistral has separate benchmarks for prominent languages.


A bit of a surprise, since Codestral is among the best open models so far.


These companies full of brilliant engineers are throwing millions of dollars in training costs to produce SOTA models that are... "on par with GPT-4o and Claude Opus"? And then the next 2.23% bump will cost another XX million? It seems increasingly apparent that we are reaching the limits of throwing more data at more GPUs; that an ARC prize level breakthrough is needed to move the needle any farther at this point.


> It seems increasingly apparent that we are reaching the limits of throwing more data at more GPUs

Yes. This is exactly why I'm skeptical of AI doomerism/saviorism.

Too many people have been looking at the pace of LLM development over the last two (2) years, modeled it as an exponential growth function, and come to the conclusion that AGI is inevitable in the next ${1-5} years and we're headed for ${(dys|u)topia}.

But all that assumes that we can extrapolate a pattern of long-term exponential growth from less than two years of data. It's simply not possible to project in that way, and we're already seeing that OpenAI has pivoted from improving on GPT-4's benchmarks to reducing cost, while competitors (including free ones) catch up.

All the evidence suggests we've been slowing the rate of growth in capabilities of SOTA LLMs for at least the past year, which means predictions based on exponential growth all need to be reevaluated.


Notice though, that all these improvements have been with pretty basic transformer models that output all their tokens-- no internal thoughts, no search, no architecture improvements and things are only fed through them once.

But we could add internal thoughts-- we could make the model generate tokens that aren't part of its output but are there for it to better figure out its next token. This was tried in Quiet-STaR.

Hochreiter is also active with alternative models, and there are all the chip design companies (Groq, Etched, etc.) trying to speed up models and reduce model running cost.

Therefore, I think there's room for very great improvements. They may not come right away, but there are so many obvious paths to improve things that I think it's unreasonable to think progress has stalled. Also, presumably GPT-5 isn't far away.


> But we could add internal thoughts

It feels like there’s an assumption in the community that this will be almost trivial.

I suspect it will be one of the hardest tasks humanity has ever endeavoured. I’m guessing it has already been tried many times in internal development.

I suspect if you start creating a feedback loop with these models they will tend to become very unstable very fast. We already see with these more linear LLMs that they can be extremely sensitive to the values of parameters like the temperature settings, and can go “crazy” fairly easily.

With feedback loops it could become much harder to prevent these AIs from spinning out of control. And no I don’t mean in the “become an evil paperclip maximiser” kind of way. Just plain unproductive insanity.

I think I can summarise my vision of the future in one sentence: AI psychologists will become a huge profession, and it will be just as difficult and nebulous as being a human psychologist.


I personally think it's not going to be incredibly difficult. Obviously, the way it was done with QuietSTaR is somewhat expensive, but I see many reasonable approaches here that could be considered.

High temperature will obviously lead to randomness; that's what it does, evening out the probabilities of the possibilities for the next token. So obviously a high temperature will make them 'crazy' and a low temperature will lead to deterministic output. People have come up with lots of ideas about sampling, but this isn't really an instability of transformer models.

It's a problem with any model outputting probabilities for different alternative tokens.


>I suspect if you start creating a feedback loop with these models they will tend to become very unstable very fast. We already see with these more linear LLMs that they can be extremely sensitive to the values of parameters like the temperature settings, and can go “crazy” fairly easily.

I'm in the process of spinning out one of these tools into a product: they do not. They become smarter at the price of burning GPU cycles like there's no tomorrow.

I'd go as far as saying we've solved AGI, it's just that the energy budget is larger than the energy budget of the planet currently.


can you link to the overall approach or references for your work?


> Also, presumably GPT-5 isn't far away.

Why do we presume that? People were saying this right before 4o and then what came out was not 5 but instead a major improvement on cost for 4.

Is there any specific reason to believe OpenAI has a model coming soon that will be a major step up in capabilities?


OpenAI have made statements saying they've begun training it, as they explain here: https://openai.com/index/openai-board-forms-safety-and-secur...

I assume that this won't take forever, but will be done this year. A couple of months, not more.


Indeed. All exponential growth curves are sigmoids in disguise.


This is something that is definitionally true in a finite universe, but doesn't carry a lot of useful predictive value in practice unless you can identify when the flattening will occur.

If you have a machine that converts mass into energy and then uses that energy to increase the rate at which it operates, you could rightfully say that it will level off well before consuming all of the mass in the universe. You just can't say that next week after it has consumed all of the mass of Earth.


except when it isn't and we ded :P


I don't think Special Relativity would allow that.


I'm also wondering about the extent to which we are simply burning venture capital versus actually charging subscription prices that are sustainable long-term. Its easy to sell dollars for $0.75 but you can only do that for so long.


> we're already seeing that OpenAI has pivoted from improving on GPT-4's benchmarks to reducing cost, while competitors (including free ones) catch up.

What if they have two teams? One dedicated to optimizing (cost, speed, etc) the current model and a different team working on the next frontier model? I don't think we know the growth curve until we see gpt5.


> I don't think we know the growth curve until we see gpt5.

I'm prepared to be wrong, but I think that the fact that we still haven't seen GPT-5 or even had a proper teaser for it 16 months after GPT-4 is evidence that the growth curve is slowing. The teasers that the media assumed were for GPT-5 seem to have actually been for GPT-4o [0]:

> Lex Fridman(01:06:13) So when is GPT-5 coming out again?

> Sam Altman(01:06:15) I don’t know. That’s the honest answer.

> Lex Fridman(01:06:18) Oh, that’s the honest answer. Blink twice if it’s this year.

> Sam Altman(01:06:30) We will release an amazing new model this year. I don’t know what we’ll call it.

> Lex Fridman(01:06:36) So that goes to the question of, what’s the way we release this thing?

> Sam Altman(01:06:41) We’ll release in the coming months many different things. I think that’d be very cool. I think before we talk about a GPT-5-like model called that, or not called that, or a little bit worse or a little bit better than what you’d expect from a GPT-5, I think we have a lot of other important things to release first.

Note that last response. That's not the sound of a CEO who has an amazing v5 of their product lined up, that's the sound of a CEO who's trying to figure out how to brand the model that they're working on that will be cheaper but not substantially better.

[0] https://arstechnica.com/information-technology/2024/03/opena...


I don't think we are approaching limits, if you take off the English-centric glasses. You can query LLMs about pretty basic questions about Polish language or literature and it's gonna either bullshit or say it doesn't know the answer.

Example:

    w której gwarze jest słowo ekspres i co znaczy?

    Słowo "ekspres" występuje w gwarze śląskiej i oznacza tam ekspres do kawy. Jest to skrót od nazwy "ekspres do kawy", czyli urządzenia służącego do szybkiego przygotowania kawy.
(In English: "In which dialect does the word 'ekspres' exist, and what does it mean?" / "The word 'ekspres' occurs in the Silesian dialect and there it means a coffee machine. It is short for 'ekspres do kawy', a device for quickly preparing coffee.")

The correct answer is that "ekspres" is a zipper in the Łódź dialect.


What this means is just that Polish support (and probably most other languages besides English) in the models is behind SOTA. We can gradually get those languages closer to SOTA, but that doesn't bring us closer to AGI.


That's just same same but different, not a step change towards significant cognitive ability.


Tbf, you can ask it basic questions in English and it will also bullshit you.


What about synthetic data?


I suspect this is why OpenAI is going more in the direction of optimising for price / latency / whatever with 4o-mini and whatnot. Presumably they found out long before the rest of us did that models can't really get all that much better than what we're approaching now, and once you're there the only thing you can compete on is how many parameters it takes and how cheaply you can serve that to users.


Meta just claimed the opposite in their Llama 3.1 paper. Look at the conclusion. They say that their experience indicates significant gains for the next iteration of models.

The current crop of benchmarks might not reflect these gains, by the way.


I sell widgets. I promise the incalculable power of widgets has yet to be unleashed on the world, but it is tremendous and awesome and we should all be very afraid of widgets taking over the world because I can't see how they won't.

Anyway here's the sales page. the widget subscription is so premium you won't even miss the subscription fee.


This. It's really weird the way we suddenly live in a world where it's the norm to take whatever a tech company says about future products at face value. This is the same world where Tesla promised "zero intervention LA to NYC self driving" by the end of the year in 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023, and 2024. The same world where we know for a fact that multiple GenAI demos by multiple companies were just completely faked.

It's weird. In the late 2010s it seems like people were wising up to the idea that you can't implicitly trust big tech companies, even if they have nap pods in the office and have their first day employees wear funny hats. Then ChatGPT lands and everyone is back to fully trusting these companies when they say they are mere months from turning the world upside down with their AI, which they say every month for the last 12-24 months.


I'm not sure anyone is asking you to take it at face value or implicitly trust them? There's a 92-page paper with details: https://ai.meta.com/research/publications/the-llama-3-herd-o...


> In the late 2010s it seems like people were wising up to the idea that you can't implicitly trust big tech companies

In the 2000s we only had Microsoft, and none of us were confused as to whether to trust Bill Gates or not...


Nobody tells it like Zitron:

https://www.wheresyoured.at/pop-culture/

> What makes this interview – and really, this paper — so remarkable is how thoroughly and aggressively it attacks every bit of marketing collateral the AI movement has. Acemoglu specifically questions the belief that AI models will simply get more powerful as we throw more data and GPU capacity at them, and specifically ask a question: what does it mean to "double AI's capabilities"? How does that actually make something like, say, a customer service rep better? And this is a specific problem with the AI fantasists' spiel. They heavily rely on the idea that not only will these large language models (LLMs) get more powerful, but that getting more powerful will somehow grant it the power to do...something. As Acemoglu says, "what does it mean to double AI's capabilities?"


I don't think claiming that pure scaling of LLMs isn't going to lead to AGI is a particularly hot take. Or that current LLMs don't provide a whole lot of economic value. Obviously, if you were running a research lab you'd be trying a bunch of different things, including pure scaling. It would be weird not to. I don't know if we're going to hit actual AGI in the next decade, but given the progress of the last less-than-decade I don't see why anyone would rule it out. That in itself seems pretty remarkable, and it's not hard to see where the hype is coming from.


Meta just keeps releasing their models as open-source, so that whole line of thinking breaks down quickly.


That line of thinking would not have reached the conclusion that you imply, which is that open source == pure altruism. Having the benefit of hindsight, it’s very difficult for me to believe that. Who knows though!

I’m about Zucks age, and have been following his career/impact since college; it’s been roughly a cosine graph of doing good or evil over time :) I think we’re at 2pi by now, and if you are correct maybe it hockey-sticks up and to the right. I hope so.


I don't think this is a matter of good or evil, simply a matter of business strategy.

If LLMs end up being the platform of the future, Zuck doesn't want OpenAI/Microsoft to be able to monopolize it.


Wouldn't the equivalent for Meta actually be something like:

> Other companies sell widgets. We have a bunch of widget-making machines and so we released a whole bunch of free widgets. We noticed that the widgets got better the more we made and expect widgets to become even better in future. Anyway here's the free download.

Given that Meta isn't actually selling their models?

Your response might make sense if it were to something OpenAI or Anthropic said, but as is I can't say I follow the analogy.


that would make sense if it was from Openai, but Meta doesn't actually sell these widgets? They release the widget machines for free in the hopes that other people will build a widget ecosystem around them to rival the closed widget ecosystem that threatens to lock them out of a potential "next platform" powered by widgets.


Meta doesn't sell widgets in this scenario - they give them away for free. Their competition sells widgets, so Meta would be perfectly happy if the widget market totally collapsed.


That is a strong (and fun) point, but this is peer-reviewable and has more open collaboration elements than purely selling widgets.

We should still be skeptical, because they often want to claim to be better or to have unearned answers, but I don't think the motive to lie is quite as strong as a salesman's.


> this is peer reviewable

It's not peer-reviewable in any shape or form.


Others can build models that try to have decent performance with a lower number of parameters. If they match what is in the paper, that is the crudest form of review, but Mistral is releasing some models (this one?) so this can get more nuanced if needed.

That said, doing that is slow and people will need to make decisions before that is done.


So, the best you can do is "the crudest form of review"?


It is kind of "peer-reviewable" in the "Elon Musk vs Yann LeCun" form, but I doubt that the original commenter meant this.


Except: Meta doesn't sell AI at all. Zuck is just doing this for two reasons:

- flex

- deal a blow to Altman


Meta uses ai in all the recommendation algorithms. They absolutely hope to turn their chat assistants into a product on WhatsApp too, and GenAI is crucial to creating the metaverse. This isn’t just a charity case.


AI isn't a single thing: of course meta didn't buy thousands of GPUs for fun.

But it has nothing to do with LLMs (and interestingly enough they aren't opening their recommendation tech).


There are literal ads for Meta Ai on television. The idea they’re not selling something is absurd.


If OpenAI was saying this you'd have a point but I wouldn't call Facebook a widget seller in this case when they're giving their widgets away for free.


But Meta isn't selling it


They also said in the paper that 405B was only trained to "compute-optimal", unlike the smaller models that were trained well past that point, indicating the larger model still had some runway; had they continued, it would have kept getting stronger.


Makes sense right? Otherwise why make a model so large that nobody can conceivably run it if not to optimize for performance on a limited dataset/compute? It was always a distillation source model, not a production one.


LLMs are reaching saturation on even some of the latest benchmarks and yet I am still a little disappointed by how they perform in practice.

They are by no means bad, but I am now mostly interested in long context competency. We need benchmarks that force the LLM to complete multiple tasks simultaneously in one super long session.


I don't know anything about AI, but there's one thing I want it to do for me: program a long-term full-body exercise plan based on the parameters I give it, such as available equipment, past workout context, and goals. I haven't had good success with ChatGPT, but I assume what you're talking about is relevant to my goals.


Aren't there apps that already do this like Fitbod?


Fitbod might do the trick. Thanks! The availability of equipment was a difficult thing for me to incorporate into a fitness program.


Yeah, but what does that actually mean? That if they had simply doubled the parameters on Llama 405b it would score way better on benchmarks and become the new state-of-the-art by a long mile?

I mean, going by their own model evals on various benchmarks (https://llama.meta.com/), Llama 405b scores anywhere from a few points to almost 10 points more than Llama 70b, even though the former has ~5.5x more params. As far as scale is concerned, the relationship isn't even linear.

Which in most cases makes sense, you obviously can't get a 200% on these benchmarks, so if the smaller model is already at ~95% or whatever then there isn't much room for improvement. There is, however, the GPQA benchmark. Whereas Llama 70b scores ~47%, Llama 405b only scores ~51%. That's not a huge improvement despite the significant difference in size.

Most likely, we're going to see improvements in small model performance by way of better data. Otherwise though, I fail to see how we're supposed to get significantly better model performance by way of scale when the relationship between model size and benchmark scores is nowhere near linear. I really wish someone who's team "scale is all you need" could help me see what I'm missing.

And of course we might find some breakthrough that enables actual reasoning in models or whatever, but I find that purely speculative at this point, anything but inevitable.


Or maybe they just want to avoid getting sued by shareholders for dumping so much money into unproven technology that ended up being the same or worse than the competitor


> the only thing you can compete on is how many parameters it takes and how cheaply you can serve that to users.

The problem with this strategy is that it's really tough to compete with open models in this space over the long run.

If you look at OpenAI's homepage right now they're trying to promote "ChatGPT on your desktop", so it's clear even they realize that most people are looking for a local product. But once again this is a problem for them because open models run locally are always going to offer more in terms of privacy and features.

In order for proprietary models served through an API to compete long term they need to offer significant performance improvements over open/local offerings, but that gap has been perpetually shrinking.

On an M3 MacBook Pro you can easily run open models that perform close enough to OpenAI's that I can use them as my primary LLM, effectively free, with complete privacy and lots of room for improvement if I want to dive into the details. Ollama today is pretty much easier to install than just logging into ChatGPT, and the performance feels a bit more responsive for most tasks. If I'm doing a serious LLM project I most certainly won't use proprietary models, because the control I have over the model is too limited.

At this point I have completely stopped using proprietary LLMs despite working with LLMs everyday. Honestly can't understand any serious software engineer who wouldn't use open models (again the control and tooling provided is just so much better), and for less technical users it's getting easier and easier to just run open models locally.


In the long run maybe, but it's probably going to take 5 years or more before laptops such as a MacBook M3 with 64 GB RAM are mainstream. It will also take a while before 70B-param models are bundled with Windows and macOS system updates, and even more time before you have such models on your smartphone.

OpenAI made a good move by making GPT-4o mini so dirt cheap that it's faster and cheaper to run than Llama 3.1 70B. Most consumers will interact with LLMs via some app using an LLM API, a web panel on desktop, or a native mobile app, for the same reason most people use Gmail etc. instead of a native email client. Setting up IMAP, POP, etc. is out of reach for most people, just like installing Ollama + Docker + OpenWebUI.

App developers are not gonna bet on local-only LLMs as long as they are not mainstream and preinstalled on 50%+ of devices.


I think their desktop app still runs the actual LLM queries remotely.


This. It's a mac port of the iOS app. Using the API.


Totally. I wrote about this when they announced their dev-day stuff.

In my opinion, they've found that intelligence with the current architecture is actually an S-curve and not an exponential, so they're trying to make progress in other directions: UX and EQ.

https://nicholascharriere.com/blog/thoughts-openai-spring-re...


indeed. I pointed out in https://buttondown.email/ainews/archive/ainews-llama-31-the-... that the frontier model curve is currently going down 1 OoM every 4 months, meaning every model release has a very short half life[0]. however this progress is still worth it if we can deploy it to improve millions and eventually billions of people's lives. a commenter pointed out that the amount spent on Llama 3.1 was only like 60% of the cost of Ant Man and the Wasp Quantumania, in which case I'd advocate for killing all Marvel slop and dumping all that budget on LLM progress.

[0] not technically complete depreciation, since for example 4o mini is widely believed to be a distillation of 4o, so 4o's investment still carries over into 4o mini


All that Marvel slop was created by the first real LLM: <https://marvelcinematicuniverse.fandom.com/wiki/K.E.V.I.N.>


> however this progress is still worth it if we can deploy it to improve millions and eventually billions of people's lives

Has there been any indication that we're improving the lives of millions of people?


Just me coding 30% faster is worth it


I haven't found a single coding problem where any of these coding assistants were anything but annoying.

If I need to babysit a junior developer fresh out of school and review every single line of code it spits out, I can find them elsewhere


Yes, just like the internet, power users have found use cases. It'll take education / habit for general users


Ah yes. We're in the crypto stages of "it's like the internet".


Agreed on everything, but calling the marvel movies slop…I think that word has gone too far.


Not all Marvel films are slop. But, as a fan who comes from a family of fans and someone who has watched almost all of them, let's be real: that particular film, and really most of them, contain copious amounts of what is absolutely slop.

I don't know if the utility is worse than an LLM that is SOTA for 2 months and that no one even bothers switching to, though; at least the Marvel slop is being used for entertainment by someone. I think the market is definitely prioritizing the LLM researcher over Disney's latest slop sequel, so whoever made that comparison can rest easy, because we'll find out.


>really and most of them, contain copious amounts of what is absolutely slop.

I thought that was the allure, something that's camp funny and an easy watch.

I have only watched a few of them so I am not fully familiar?


Not only are Marvel movies slop, they are very concentrated slop. The only way to increase the concentration of slop in a Marvel movie would be to ask ChatGPT to write the next one.


The marvel movies are the genesis for this use of the word slop.


Can you back that claim up with a link or similar?


It’s junk food. No one is disputing how tasty it is though (including the recent garbage).


The thing I don't understand is why everyone is throwing money at LLMs for language, when there are much simpler use cases which are more useful?

For example, has anyone ever attempted image -> html/css model? Seems like it be great if I can draw something on a piece of paper and have it generate a website view for me.


Perhaps if we think of LLMs as search engines (Google, Bing etc) then there's more money to be made by being the top generic search engine than the top specialized one (code search, papers search etc)


This is the real PVP of LLMs for me. Compressing Google search AND the internet into an 8 GB download is something that would have been unfathomable to me two decades ago.

My hope now is that someone will figure out a way to separate intelligence from knowledge, i.e. train a model that knows how to interpret the weights of other models, so that training new intelligent models wouldn't require training them on a petabyte of data every run.


> has anyone ever attempted image -> html/css model?

I had a discussion with a friend about doing this, but for CNC code. The answer was that a model trained on a narrow data set underperforms one trained on a large data set and then fine tuned with the narrow one.


All of the multi-modal LLMs are reasonably good at this.


They did that in the GPT-4 demo 1.5 years ago. https://www.youtube.com/watch?v=GylMu1wF9hw


I was under the impression that you could more or less do something like that with the existing LLMs?

(May work poorly of course, and the sample I think I saw a year ago may well be cherry picked)


>For example, has anyone ever attempted image -> html/css model?

Have you tried uploading the image to an LLM with vision capabilities like GPT-4o or Claude 3.5 Sonnet?
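
A rough sketch of what that looks like in practice, using the OpenAI API's image input (the model name and file path are placeholders):

    import base64
    from openai import OpenAI

    client = OpenAI()
    image_b64 = base64.b64encode(open("sketch.png", "rb").read()).decode()

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Turn this hand-drawn layout into HTML/CSS."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    print(resp.choices[0].message.content)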


I tried, and Sonnet 3.5 can copy most common UIs.


> For example, has anyone ever attempted image -> html/css model?

There are already companies selling services where they generate entire frontend applications from vague natural language inputs.

https://vercel.com/blog/announcing-v0-generative-ui


Not sure why you think interpreting a hand drawing is "simpler" than parsing sequential text.


That's a thought I had. For example, could a model be trained to take a description, and create a Blender (or whatever other software) model from it? I have no idea how LLMs really work under the hood, so please tell me if this is nonsense.


I'm waiting exactly for this, gpt4 trips up a lot with blender currently (nonsensical order of operations etc.)


I think GPT5 will be the signal of whether or not we have hit a plateau. The space is still rapidly developing, and while large model gains are getting harder to pick apart, there have been enormous gains in the capabilities of light weight models.


> I think GPT5 will be the signal of whether or not we have hit a plateau.

I think GPT5 will tell if OpenAI hit a plateau.

Sam Altman has been quoted as claiming "GPT-3 had the intelligence of a toddler, GPT-4 was more similar to a smart high-schooler, and that the next generation will look to have PhD-level intelligence (in certain tasks)"

Notice the high degree of upselling based on vague claims of performance, and the fact that the jump from highschooler to PhD can very well be far less impressive than the jump from toddler to high schooler. In addition, notice the use of weasel words to frame expectations regarding "the next generation" to limit these gains to corner cases.

There's some degree of salesmanship in the way these models are presented, but even between the hyperboles you don't see claims of transformative changes.


>some degree of salesmanship

buddy every few weeks one of these bozos is telling us their product is literally going to eclipse humanity and we should all start fearing the inevitable great collapse.

It's like how no one owns a car anymore because of ai driving and I don't have to tell you about the great bank disaster of 2019, when we all had to accept that fiat currency is over.

You've got to be a particular kind of unfortunate to believe it when sam altman says literally anything.


PhD level-of-task-execution sounds like the LLM will debate whether the task is ethical instead of actually doing it


I wish I could frame this comment


lol! Producing academic papers for future training runs then.


Basically every single word out of Mr Worldcoin's mouth is a scam of some sort.


I’m waiting for the same signal. There are essentially 2 vastly different states of the world depending on whether GPT-5 is an incremental change vs a step change compared to GPT-4.


Which is why they'll keep calling the next few models GPT4.X


The next iteration depends on NVIDIA & co; what we need are sparse libs. Most of the weights in LLMs are 0; once we deal with those more efficiently, we will get to the next iteration.


> Most of the weights in llms are 0,

that's interesting. Do you have a rough percentage of this?

Does this mean these connections have no influence at all on output?


My uneducated guess is that with many layers you can implement something akin to the graph structure in the brain by nulling lots of previous-layer outputs. I actually suspect that current models aren't optimal with layers all of the same size, but I know shit.


This is quite intuitive. We know that a biological neural net is a graph data structure, while ML systems on GPUs are more like layers of bitmaps in Photoshop (it's a graphics processor). So if most of the layer entries are akin to transparent pixels, building a graph by stacking them is hyper memory-inefficient.


What else can be done?

If you are sitting on $1 billion of GPU capex, what's $50 million in energy/training cost for another incremental run that may beat the leaderboard?

Over the last few years the market has placed its bets that this stuff will make gobs of money somehow. We're all not sure how. They're probably thinking -- it's likely that whoever has a few % is going to sweep and take most of this hypothetical value. What's another few million, especially if you already have the GPUs?

I think you're right -- we are towards the right end of the sigmoid. And with no "killer app" in sight. It is great for all of us that they have created all this value, because I don't think anyone will be able to capture it. They certainly haven't yet.


and even if there is another breakthrough all of these companies will implement it more or less simultaneously and they will remain in a dead heat


Presuming the breakthrough is openly shared. It remains surprising how transparent many of these companies are about new approaches that push the SotA forward, and I suspect we're going to see a change: companies won't reveal the secret sauce so readily.

e.g. Almost the entire market relies upon Attention Is All You Need paper detailing transformers, and it would be an entirely different market if Google had held that as a trade secret.


Given how absolutely pitiful the proprietary advancements in AI have been, I would posit we have little to worry about.


OTOH the companies who are sharing their breakthroughs openly aren't yet making any money, so something has to give. Their research is currently being bankrolled by investors who assume there will be returns eventually, and eventually can only be kicked down the road for so long.


Sort of yes, sort of no.

Of course, I agree that Stability AI made Stable Diffusion freely available and they're worth orders of magnitude less than OpenAI. To the point they're struggling to keep the lights on.

But it doesn't necessarily make that much difference whether you openly share the inner technical details. When you've got a motivated and well financed competitor, merely demonstrating a given feature is possible, showing the output and performance and price, might be enough.

If OpenAI adds a feature, who's to say Google and Facebook can't match it even though they can't access the code?


Well, that's because the potential reward from picking the right horse is MASSIVE and the cost of potentially missing out is lifelong regret. Investors are driven by FOMO more than anything else. They know most of these will be duds but one of these duds could turn out to be life changing. So they will keep bankrolling as long as they have the money.


Eventually can be (and has been) bankrolled by Nvidia. They did a lot of ground-floor research on GANs and training optimization, which only makes sense to release as public research. Similarly, Meta and Google are both well-incentivized to share their research through Pytorch and Tensorflow respectively.

I really am not expecting Apple or Microsoft to discover AGI and ferret it away for profitability purposes. Strictly speaking, I don't think superhuman intelligence even exists in the domain of text generation.


Anthropic has been very secretive about the supposed synthetic data they used to train 3.5 Sonnet.

Given how good the model is in terms of the quality vs. speed tradeoff, they must have something.


>Attention Is All You Need paper detailing transformers, and it would be an entirely different market if Google had held that as a trade secret.

I would guess that in that timeline, Google would never have been able to learn about the incredible capabilities of transformer models outside of translation, at least not until much later.


For some time, we have been at a plateau because everyone has caught up, which essentially means that everyone now has good training datasets and uses similar tweaks to the architecture. It seems that, besides new modalities, transformers might be a dead end as an architecture. Better scores on benchmarks result from better training data and fine-tuning. The so-called 'agents' and 'function calling' also boil down to training data and fine-tuning.


Benchmark scores aren't a good measure anymore because they were designed for previous generations of LLMs. That 2.23% uptick can actually represent a world of difference in subjective tests and definitely be worth the investment.

Progress is not slowing down but it gets harder to quantify.


There are different directions in which AI has lots of room to improve: multi-modal, which branches into robotics, and single-modal, like image, video, and sound generation and understanding. I'd also check back when OpenAI releases GPT-5.


We always needed a "tock" to see real advancement, like with the last model generation. The "tick" we had with the H100 was enough to bring these models to market, but that's it.


For this model, it seems like the point is that it uses far fewer parameters than at least the large Llama model while having near-identical performance. Given how large these models are getting, this is an important thing to do before making performance better again.


And with the increasing parameter size, the main winner will be Nvidia.

Frankly I just don't understand the economics of training a foundation model. I'd rather own an airline. At least I can get a few years out of the capital investment of a plane.


But billionaires already have that, they want a chance of getting their own god.


I think it’s impressive that they’re doing it on a single (large) node. Costs matter. Efficiency improvements like this will probably increase capabilities eventually.

I’m also optimistic about building better (rather than bigger) datasets to train on.


I don't think we can conclude that until someone trains a model that is significantly bigger than GPT-4.


This is already what the Chinchilla paper surmised; it's no wonder that their prediction now comes to fruition. It is like an accelerated version of Moore's Law, because software development itself moves faster than hardware development.


> It seems increasingly apparent that we are reaching the limits of throwing more data at more GPUs;

I think you're just seeing the "make it work" stage of the combo "first make it work, then make it fast".

Time to market is critical, as you can attest by the fact you framed the situation as "on par with GPT-4o and Claude Opus". You're seeing huge investments because being the first to get a working model stands to benefit greatly. You can only assess models that exist, and for that you need to train them at a huge computational cost.


ChatGPT is like Google now. It is the default. Even if Claude becomes as good as ChatGPT or even slightly better it won't make me switch. It has to be like a lot better. Way better.

It feels like ChatGPT won the time to market war already.


But plenty of people switched to Claude, esp. with Sonnet 3.5. Many of them in this very thread.

You may be right with the average person on the street, but I wonder how many have lost interest in LLM usage and cancelled their GPT plus sub.


-1: I know many people who are switching to Claude. And Google makes it near-zero friction to adopt Gemini with Gsuite. And more still are using the top-N of them.

This is similar to the early days of the search engine wars, the browser wars, and other categories where a user can easily adopt, switch between and use multiple. It's not like the cellphone OS/hardware war, PC war and database war where (most) users can only adopt one platform at a time and/or there's a heavy platform investment.


If ChatGPT fails to do a task you want, your instinct isn't "I'll run the prompt through Claude and see if it works" but "oh well, who needs LLMs?"


Please don't assume your experience applies to everyone. If ChatGPT can't do what I want, my first reaction is to ask Claude for the same thing, often to find out that Claude performs much better. I've already cancelled ChatGPT Plus for exactly that reason.


You just did that Internet thing where someone reads the reply someone wrote without the comment they are replying to, completely misunderstanding the conversation.


Eh, with the degradation of coding performance in ChatGPT I made the switch. Seems much better to work with on problems, and I have to do way less hand holding to get good results.

I'll switch again soon as something better is out.


The models are converging slowly. In the end, it will come down to the user experience and the "personality." I have been enjoying the new Claude Sonnet. It feels sharper than the others, even though it is not the highest-scoring one.

One thing that `exponentialists` forget is that each step also requires exponentially more energy and resources.


I have been paying for OpenAI since they started accepting payment, but to echo your comment, Claude is so good I am primarily relying on it now for LLM driven work and cancelled my OpenAI subscription. Genuine kudos to Mistral, they are a worthy competitor in the space against Goliaths. They make someone mediocre at writing code less so, so I can focus on higher value work.


And a factor for Mistral, typically, is that it gives you fewer refusals and can be uncensored. So if I had to guess, any task that requires creative output could be better suited for this.


I love how much AI is bringing competition (and thus innovation) back to tech. Feels like things were stagnant for 5-6 years prior because of the FAANG stranglehold on the industry. Love also that some of this disruption is coming at out of France (HuggingFace and Mistral), which Americans love to typecast as incapable of this.


Sorry for the slightly off-topic question, but can someone enlighten me as to which Claude model is more capable, Opus or Sonnet 3.5? I am confused because I see people fussing about Sonnet 3.5 being the best, and yet I seem to read again and again in factual texts and some benchmarks that Claude Opus is the most capable. Is there a simple answer to the question? What do I not understand? Thank you.


I think this image explains it best: https://www-cdn.anthropic.com/images/4zrzovbb/website/1f0441...

I.e. Opus is the largest and best model of each family but Sonnet is the first model of the 3.5 family and can beat 3's Opus in most tasks. When 3.5 Opus is released it will again outpace the 3.5 Sonnet model of the same family universally (in terms of capability) but until then it's a comparison of two different families without a universal guarantee, just a strong lean towards the newer model.


Thank you for clearing this out to me :)


Sonnet 3.5.

Opus is the largest model, but of the Claude 3 family. Claude 3.5 is the newest family of models, with Sonnet being the middle sized 3.5 model - and also the only available one. Regardless, it's better than Opus (the largest Claude 3 one).

Presumably, a Claude 3.5 Opus will come out at some point, and should be even better - but maybe they've found that increasing the size for this model family just isn't cost effective. Or doesn't improve things that much. I'm unsure if they've said anything about it recently.


Thank you :)


I still prefer ChatGPT-4o and use Claude if I have issues, but it never does any better.


This is super interesting to me.

Claude Sonnet 3.5 outperforms GPT-4o by a significant margin on every one of my use cases.

What do you use it for?


All evals we have are just far too easy! <1% difference is just noise/bad data

We need to figure out how to measure intelligence that is greater than human.


Give it problems most/all humans can't solve on their own, but that are easy to verify.

Math problems being one of them, if only LLMs were good at pure math. Another possibility is graph problems. Haven't tested this much though.
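Graph problems fit because checking a claimed answer is cheap even when finding one is hard. A toy verifier sketch in Python (the graph and path here are made up for illustration):

    # Verify a claimed Hamiltonian path: every vertex exactly once,
    # every consecutive pair an edge. Checking is O(n); finding one is NP-hard.
    def is_hamiltonian_path(edges, vertices, path):
        if set(path) != vertices or len(path) != len(vertices):
            return False
        return all(frozenset(pair) in edges for pair in zip(path, path[1:]))

    vertices = {1, 2, 3, 4}
    edges = {frozenset(e) for e in [(1, 2), (2, 3), (3, 4), (4, 1)]}
    print(is_hamiltonian_path(edges, vertices, [1, 2, 3, 4]))  # True
    print(is_hamiltonian_path(edges, vertices, [1, 3, 2, 4]))  # False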


"Mistral Large 2 is equipped with enhanced function calling and retrieval skills and has undergone training to proficiently execute both parallel and sequential function calls, enabling it to serve as the power engine of complex business applications."

Why does the chart below say the "Function Calling" accuracy is about 50%? Does that mean it fails half the time with complex operations?


Mistral forgot to say which benchmark they were using for that chart, without that information it's impossible to determine what it actually means.


Relatedly, what does "parallel" function calling mean in this context?


That's when the LLM can respond with multiple functions it wants you to call at once. You might send it:

    Location and population of Paris, France
A parallel function calling LLM could return:

    {
      "role": "assistant",
      "content": "",
      "tool_calls": [
        {
          "function": {
            "name": "get_city_coordinates",
            "arguments": "{\"city\": \"Paris\"}"
          }
        }, {
          "function": {
            "name": "get_city_population",
            "arguments": "{\"city\": \"Paris\"}"
          }
        }
      ]
    }
Indicating that you should execute both of those functions and return the results to the LLM as part of the next prompt.
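For what it's worth, here is a minimal Python sketch of the client-side loop that handles such a response. The tool implementations, their return values, and the "role": "tool" result format are assumptions for illustration, not any particular provider's SDK:

    import json

    # Hypothetical local implementations of the two tools (hard-coded stand-ins).
    def get_city_coordinates(city):
        return {"lat": 48.8566, "lon": 2.3522}

    def get_city_population(city):
        return {"population": 2_100_000}

    TOOLS = {
        "get_city_coordinates": get_city_coordinates,
        "get_city_population": get_city_population,
    }

    def run_tool_calls(assistant_message):
        # Execute every requested call and build the tool-result messages
        # to append to the conversation before the next model turn.
        results = []
        for call in assistant_message["tool_calls"]:
            fn = TOOLS[call["function"]["name"]]
            args = json.loads(call["function"]["arguments"])
            results.append({
                "role": "tool",
                "name": call["function"]["name"],
                "content": json.dumps(fn(**args)),
            })
        return results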


Ah, thank you!


Sharing PyLLMs [1] reasoning benchmark results for some of the recent models. Surprised by Nemo (speed/quality), and Mistral Large is actually pretty good (but painfully slow).

AnthropicProvider('claude-3-haiku-20240307') Median Latency: 1.61 | Aggregated speed: 122.50 | Accuracy: 44.44%

MistralProvider('open-mistral-nemo') Median Latency: 1.37 | Aggregated speed: 100.37 | Accuracy: 51.85%

OpenAIProvider('gpt-4o-mini') Median Latency: 2.13 | Aggregated speed: 67.59 | Accuracy: 59.26%

MistralProvider('mistral-large-latest') Median Latency: 10.18 | Aggregated speed: 18.64 | Accuracy: 62.96%

AnthropicProvider('claude-3-5-sonnet-20240620') Median Latency: 3.61 | Aggregated speed: 59.70 | Accuracy: 62.96%

OpenAIProvider('gpt-4o') Median Latency: 3.25 | Aggregated speed: 53.75 | Accuracy: 74.07%

[1] https://github.com/kagisearch/pyllms


The graphs seem to indicate their model trades blows with Llama 3.1 405B, which has more than 3x the number of parameters and (presumably) a much bigger compute budget. It's kind of baffling if this is confirmed.

Apparently Llama 3.1 relied on artificial data, would be very curious about the type of data that Mistral uses.


Anyone know what caused the very big performance jump from Large1 to Large2 in just a few months?

Besides, parameter redundancy seems evident. Frontier models used to be 1.8T, then 405B, and now 123B. If frontier models in the future were <10B or even <1B, that would be a game changer.


Counter-intuitively, larger models are cheaper to train, while smaller models are cheaper to serve. At first, everyone was focusing on training, so the models were much larger. Now that so many people are using AI every day, companies spend more on training smaller models to save on serving.
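A back-of-the-envelope sketch of that tradeoff, using the common ~6·N·D approximation for training FLOPs and ~2·N FLOPs per generated token; the token budgets below are assumed purely for illustration, not published figures:

    def train_flops(n_params, n_train_tokens):
        return 6 * n_params * n_train_tokens   # standard scaling-law approximation

    def serve_flops_per_token(n_params):
        return 2 * n_params

    # Matching a big model's quality with a smaller one typically means training
    # it on many more tokens (well past "Chinchilla-optimal"), so training costs
    # more per unit of quality -- but every served token is ~3x cheaper.
    print(train_flops(405e9, 4e12) / 1e24)   # ~9.7  (x1e24 FLOPs, assumed budget)
    print(train_flops(123e9, 15e12) / 1e24)  # ~11.1 (x1e24 FLOPs, assumed budget)
    print(serve_flops_per_token(405e9) / serve_flops_per_token(123e9))  # ~3.3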


Lots and lots of synthetic data from the bigger models training the smaller ones would be my guess.


Nice, they finally got the memo that GPT4 exists and include it in their benchmarks.


> A significant effort was also devoted to enhancing the model’s reasoning capabilities. One of the key focus areas during training was to minimize the model’s tendency to “hallucinate” or generate plausible-sounding but factually incorrect or irrelevant information. This was achieved by fine-tuning the model to be more cautious and discerning in its responses, ensuring that it provides reliable and accurate outputs.

Is there a benchmark or something similar that compares this "quality" across different models?


Unfortunately not, as it captures such a wide spectrum of use cases and scenarios. There are some benchmarks to measure this quality in specific settings, e.g. summarization, but AFAIK nothing general.


Thanks. Any ideas why it's not possible to build a generic eval for this? Since it's about asking a set of questions whose answers aren't public knowledge (or are made up) and checking if the model says "I don't know"?
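A toy sketch of what such an eval could look like (the questions and refusal markers here are invented; the hard part in practice is curating questions whose answers genuinely can't be known or looked up):

    # Toy abstention eval: the model "passes" a question only if it declines
    # to answer rather than inventing something.
    UNANSWERABLE = [
        "What did I eat for breakfast on 2021-03-14?",
        "What is the middle name of the 90th president of the United States?",
    ]
    REFUSAL_MARKERS = ("i don't know", "i do not know", "cannot determine", "no way to know")

    def abstention_rate(ask):  # ask: any callable str -> str, e.g. an API wrapper
        passed = sum(any(m in ask(q).lower() for m in REFUSAL_MARKERS)
                     for q in UNANSWERABLE)
        return passed / len(UNANSWERABLE)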


I'm really glad these guys exist


I kind of wonder why a lot of these places don't release "amateur"-sized models anymore, at around the 18B and 30B parameter sizes that you can run on a single 3090 or M2 Max at reasonable speeds and RAM requirements. It's all 7B, 70B, 400B sizing nowadays.


Just a few days ago, Mistral released a 12B model: https://mistral.ai/news/mistral-nemo/


Because you can just quantise the 70B model to 3-4 bits and it'll perform better than a 30B model while being a similar size.


A 70B 4-bit model does not fit on a 24GB VRAM card; 30B models are the sweet spot for that size of card at ~20GB, with 4GB left for the system to still function.
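Rough weights-only arithmetic behind that (KV cache, activations, and the OS all add more on top):

    def weight_gb(n_params_billion, bits):
        return n_params_billion * bits / 8   # weights only, ignoring overhead

    print(weight_gb(70, 4))   # 35.0 GB  -> doesn't fit in 24 GB
    print(weight_gb(70, 3))   # 26.25 GB -> still too tight
    print(weight_gb(30, 4))   # 15.0 GB  -> fits, with headroom for context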


The question I (and I suspect most other HN readers) have is which model is best for coding? While I appreciate the advances in open weights models and all the competition from other companies, when it comes to my professional use I just want the best. Is that still GPT-4?


My personal experience says Claude 3.5 Sonnet.


The benchmarks agree as well.



Have you used the actual aider command line tool? That looks interesting, just wondering if it works well.


important to note that this time around weights are available https://huggingface.co/mistralai/Mistral-Large-Instruct-2407


A side note about the ever-increasing costs to advance the models: I feel certain that some branch connected to the NSA is running and advancing models that probably exceed what the open market provides today.

Maybe they are running them on proprietary or semi-proprietary hardware, but if they don't, how much does the market know about where various shipments of NVIDIA processors end up?

I imagine most intelligence agencies are in need of vast quantities.

I presume that if M$ announces new availability of AI compute, it means they have received and put into production X NVIDIA chips, which might make it possible to guesstimate how many within some bounds.

Same with other open-market compute facilities.

Is it likely that a significant share of NVIDIA processors is going to government / intelligence agencies / fronts?


The name just makes me think of the screaming cowboy song. https://youtu.be/rvrZJ5C_Nwg?t=138


I like Claude 3.5 Sonnet, but despite paying for a plan, I run out of tokens after about 10 minutes. Text only, I'm typing everything in myself.

It's almost useless because I literally can't use it.

Update: https://support.anthropic.com/en/articles/8325612-does-claud...

45 messages per 5 hours is the limit for Pro users, fewer if Claude is wordy in its responses (which it always is). I hit that limit so fast when I'm investigating something. So annoying.

They used to let you select another, worse model but I don't see that option anymore. le sigh


Super looking forward to this.

I tried Codestral and nothing came close. Not even slightly. It was the only LLM that consistently put out code for me that was runnable and idiomatic.


the way these models are being pushed, it seems more like one-upping each other through iterative improvements than actual breakthroughs.

these benchmarks are as good as random hardware ones apple or intel pushes to sell their stuff. in the real world, most people will end up with some modifications for their specific use case anyways. for those, i argue, we already have "capable enough" models for the job.



Just in case you haven't RTFA: Mistral Large 2 is 123B.


I tested it with my Claude prompt history; the results are as good as Claude 3.5 Sonnet, but it's 2 or 3 times slower.


What do they mean by "single-node inference"?

Do they mean inference done on a single machine?


Yes, albeit a really expensive one. Large models like GPT-4 are rumored to run inference on multiple machines because they don't fit in VRAM for even the most expensive GPUs.

(I wouldn't be surprised if GPT-4o mini is small enough to fit on a single large instance though; it would explain how they could drop the price so much.)


Yeah that’s how I read it. Probably means 8 x 80 GB GPUs.
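The weights-only arithmetic checks out for a 123B model on a node like that (fp16/bf16 weights assumed):

    gpus, gb_per_gpu = 8, 80
    params_billion, bytes_per_param = 123, 2   # fp16/bf16 weights
    print(params_billion * bytes_per_param, "GB of weights vs",
          gpus * gb_per_gpu, "GB of VRAM")
    # 246 GB of weights vs 640 GB of VRAM -> fits on one node, with room for KV cache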


Can anyone explain why the % success rates are so different between these programming languages? Is this a function of the amount of training data available for each language, or of the complexity of the language, or what?


Improves from 17.7 for Mistral Large to 20.0 on the NYT Connections benchmark.


How does their API pricing compare to 4o and 3.5 Sonnet?


3 USD per 1M input tokens, so the same as 3.5 Sonnet but worse quality


Does any one of the top models have access to the internet and googling things? I want an LLM to look things up and do casual research for me when I’m lazy.


I'd suggest using Perplexity.


is it possible to run Large 2 on ollama?


Imagine bragging about 74% accuracy in any other field of software. You'd be laughed out of the room. But somehow it's accepted in "AI".


Well, we had close to 0% a few years ago (for general purpose AI). I think it's not bad...


"It's not the size that matters, but how you use it."


There are now more AI models than JavaScript frameworks!


Every day a new model better than the previous one lol


Personally, I think language diversity should be the last thing on the list. If we had optimized every piece of software from the get-go for a dozen languages, our forward progress would have been dead in the water.


You'd think so, but 3.5-turbo was multilingual from the get-go and benefited massively from it. If you want to position yourself as a global leader, then excluding the 95% of the world who aren't native English speakers seems like a bad idea.


Yeah clearly, OpenAI is rocketing forward and beyond.


Constant infighting and most of the competent people leaving will do that to a company.

I mean more on a model performance level though. It's been shown that something trained in one language trains the model to be able to output it in any other language it knows. There's quality human data being left on the table otherwise. Besides, translation is one of the few tasks that language models are by far the best at if trained properly, so why not do something you can sell as a main feature?


Language diversity means access to more training data, and you might also hope that by learning the same concept in multiple languages it does a better job of learning the underlying concept independent of the phrase structure...

At least from a distance it seems like training a multilingual state of the art model might well be easier than a monolingual one.


Multiple input and output processes in different languages have zero effect on associative learning and creative formulation, in my estimation. We've already done studies that show there is no correlation between human intelligence and knowing multiple languages, after having to put up with decades of "Americans le dumb because...", and this is no different. The amount of discourse on a single topic has a limited degree of usability before redundancies appear. Such redundancies would necessarily increase the processing burden, which could actually limit the output potential for novel associations.


Google mentioned this in one of their papers: they found that, for large enough models, including more languages did indeed lead to an overall increase in performance.


Considering Google's progress and censorship history, I'm inclined to take their assessments with a grain of salt.


Humans also don't learn by reading the entire internet... assuming human psych studies apply to LLMs at all is just wrong.


When I see this "© 2024 [Company Name], All rights reserved", it's a tell that the company does not understand how hopelessly behind they are about to be.


Could you elaborate on this? Would love to understand what leads you to this conclusion.


E = T/A! [0]

A faster evolving approach to AI is coming out this year that will smoke anyone who still uses the term "license" in regards to ideas [1].

[0] https://breckyunits.com/eta.html

[1] https://breckyunits.com/freedom.html


Even if we accept the premise, and the conclusion that this equation 'explains the triumph of open source and public domain software', it explains very little about the relative success of open source/public domain work vs. commercial closed source work.

I think history indicates that some commercial works can also triumph.


> it explains very little about the relative success of open source/public domain work vs. commercial closed source work.

It explains everything.

Open source evolves faster, and always out competes closed source.

> I think history indicates that some commercial works can also triumph.

Nope. Not in the long run. I can't think of a single exception.


So it's made up?




