Large Enough (mistral.ai)
639 points by davidbarker 48 days ago | 496 comments



Links to chat with the models released this week:

Large 2 - https://chat.mistral.ai/chat

Llama 3.1 405b - https://www.llama2.ai/

I just tested Mistral Large 2 and Llama 3.1 405b on 5 prompts from my Claude history.

I'd rank as:

1. Sonnet 3.5

2. Large 2 and Llama 405b (similar, no clear winner between the two)

If you're using Claude, stick with it.

My Claude wishlist:

1. Smarter (yes, it's the most intelligent, and yes, I wish it was far smarter still)

2. Longer context window (1M+)

3. Native audio input including tone understanding

4. Fewer refusals and less moralizing when refusing

5. Faster

6. More tokens in output


It seems to be the way with these releases: sticking with Claude, at least for the 'hard' tasks. In my agent platform I have LLMs assigned to easy/medium/hard categorised tasks, which was somewhat inspired by the Claude 3 release with Haiku/Sonnet/Opus. GPT-4o mini has bumped Haiku for the easy category for now. Sonnet 3.5 bumped Opus for the hard category, so I could possibly downgrade the medium tasks from Sonnet 3.5 to Mistral Large 2 if the price is right on the platforms, given it's only 123B params compared to 405B. I was surprised how much Llama 3 405B costs on together.ai: $5/M tokens for input/output! I'll stick to Sonnet 3.5. Then I was also surprised how much cheaper Fireworks was at $3/M.

Gemini has two aces up its sleeve now: the long context, and now the context caching for 75% reduced input token cost. I was looking at the "Improving Factuality and Reasoning in Language Models through Multiagent Debate" paper the other day, and thought Gemini would have a big cost advantage implementing this technique with the context caching. If only Google could get their model up to the level of Anthropic's.


All 3 models you ranked cannot get "how many r's are in strawberry?" correct. They all claim 2 r's unless you press them. With all the training data, I'm surprised none of them has fixed this yet.


Tokenization makes it hard for the model to count letters; that's also why, if you ask it to do maths, writing the numbers out in words will yield better results.

For strawberry, it sees it as [496, 675, 15717], which is str aw berry.

If you insert characters to break the tokens up, it finds the correct result: how many r's are in "s"t"r"a"w"b"e"r"r"y" ?

> There are 3 'r's in "s"t"r"a"w"b"e"r"r"y".
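You can check the split yourself with a tokenizer library. A minimal sketch, assuming OpenAI's tiktoken package and its cl100k_base encoding (Mistral's own tokenizer gives different IDs, but the multi-character split is the same idea):

```python
# Sketch: inspect how a BPE tokenizer splits "strawberry".
# Assumes the tiktoken package; exact IDs and pieces vary per tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("strawberry")
print(ids)  # a short list of token IDs, not ten per-letter IDs
print([enc.decode_single_token_bytes(i) for i in ids])  # multi-character fragments, not letters
```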


>If you insert characters to break the tokens up, it finds the correct result: how many r's are in "s"t"r"a"w"b"e"r"r"y" ?

The issue is that humans don't talk like this. I don't ask someone how many r's there are in strawberry by spelling out strawberry, I just say the word.


Humans also constantly make mistakes that are due to proximity in their internal representation. "Could of"/"Should of" comes to mind: the letters "of" have a large edit distance from "'ve", but their pronunciation is very similar.

Native speakers especially are prone to this mistake, as they grew up learning English as illiterate children, from sounds only, compared to how most people learning English as a second language do it, together with the textual representation.

Psychologists use this trick as well to figure out internal representations, for example with the Rorschach test.

And probably, if you asked random people in the street how many p's there are in "Philippines", you'd also get lots of wrong answers. It's tricky due to the double p and the initial p being part of an f sound. The demonym uses "F" as the first letter, and in many languages, say Spanish, the country name also uses an F.


Until I was ~12, I thought 'a lot' was a single word.



Oh I thought essay was some kind of abbreviation for S.A. - short article maybe…



At least you learnt.


Yeah, but for most people, it would be because they don't know how to spell "Philippines" at all. Confoundingly, LLMs know exactly how to spell strawberry and still get this wrong.


> I don't ask someone how many r's there are in strawberry by spelling out strawberry, I just say the word.

No, I would actually be pretty confident you don’t ask people that question… at all. When is the last time you asked a human that question?

I can’t remember ever having anyone in real life ask me how many r’s are in strawberry. A lot of humans would probably refuse to answer such an off-the-wall and useless question, thus “failing” the test entirely.

A useless benchmark is useless.

In real life, people overwhelmingly do not need LLMs to count occurrences of a certain letter in a word.


AI failing at a task that any human can easily do means AI isn't human-equivalent, in an easily demonstrable way.

Full artificial intelligence, as we're being promised, falls short in this simple way.


Count the number of occurrences of the letter e in the word "enterprise".

Problems can exist as instances of a class of problems. If you can't solve a problem, it's useful to know if it's a one off, or if it belongs to a larger class of problems, and which class it belongs to. In this case, the strawberry problem belongs to the much larger class of tokenization problems - if you think you've solved the tokenization problem class, you can test a model on the strawberry problem, with a few other examples from the class at large, and be confident that you've solved the class generally.

It's not about embodied human constraints or how humans do things; it's about what AI can and can't do. Right now, because of tokenization, things like understanding the number of Es in strawberry are outside the implicit model of the word in the LLM, with downstream effects on tasks it can complete. This affects moderation, parsing, generating prose, and all sorts of unexpected tasks. Having a workaround like forcing the model to insert spaces and operate on explicitly delimited text is useful when affected tasks appear.
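As a toy illustration of that workaround, here is a minimal sketch in plain Python that builds the delimited prompt (the exact prompt wording is just an example, not anything the models require):

```python
# Build the explicitly delimited prompt mentioned above.
word = "strawberry"
delimited = '"' + '"'.join(word) + '"'   # '"s"t"r"a"w"b"e"r"r"y"'
prompt = f"How many r's are in {delimited} ?"
print(prompt)
```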


Humans also would probably be very likely to guess 2 r's if they had never seen any written words or had the word spelled out to them as individual letters before, which is kind of close to how language models treat it, despite being a textual interface.


> Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe.

We are also not exactly looking letter by letter at everything we read.


Not exactly the same thing, but I actually didn't expect this to work.

https://chatgpt.com/share/4298efbf-1c29-474a-b333-c6cc1a3ce3...


On the other hand explain to me how you are able to read the word “spotvoxilhapentosh”.


Just because we normally aren't reading letter by letter, it doesn't mean we can't. We can recognize common words on sight, ignoring minor variations, because we've seen the words thousands or millions of times, but that doesn't somehow disable the much less frequently used ability to approach a brand new word.


I think that humans indeed identify words as a whole and do not read letter by letter.

However, this implies you need to know the word to begin with.

I can write "asdf" and you might be oblivious to what I mean. I can mention "adsf" to a JavaScript developer and he will immediately think of the tool versioning tool. Because context and familiarity is important.


I believe it's a bit more nuanced than that. Short ubiquitous words like "and" or "the" we instantly recognize at a glance, but long unfamiliar or rarer words we read from the beginning, one syllable or letter at a time, until pattern recognition from memory kicks in. All unconsciously, unless the word is so odd, out of place, misspelled, or unknown that it comes to conscious awareness and interrupts our reading.


"spot"

"vox"

"il"

"ha"

"pen"

"tosh"

is how I read it.

A lot of schools teach kids to read with a syllabic method... so... super close to the tokenization concept.


It's not a human. I imagine if you have a use case where counting characters is critical, it would be trivial to programmatically transform prompts into lists of letters.

A token is roughly four letters [1], so, among other probable regressions, this would significantly reduce the effective context window.

[1] https://help.openai.com/en/articles/4936856-what-are-tokens-...
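To put a rough number on that regression, a quick sketch assuming tiktoken and the cl100k_base encoding (the exact ratio varies by tokenizer, but roughly one token per character is the worst case):

```python
# Compare token counts for a normal prompt vs. a letter-delimited one.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "how many r's are in strawberry?"
spelled = " ".join(text)  # a space between every character

print(len(enc.encode(text)))     # a handful of tokens
print(len(enc.encode(spelled)))  # several times as many for the same content
```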


This is the kind of task that you'd just use a bash one liner for, right? LLM is just wrong tool for the job.


Humans do chain-of-thought.

User: Write “strawberry” one letter at a time, with a space between each letter. Then count how many r’s are in strawberry.

gpt-3.5-turbo: ASSISTANT s t r a w b e r r y

There are 2 r's in strawberry.

After some experimenting, it seems like the actual problem is that many LLMs can’t count.

User: How many r’s are in the following sequence of letters:

S/T/R/A/W/B/E/R/R/Y

gpt-4o-mini: In the sequence S/T/R/A/W/B/E/R/R/Y, there are 2 occurrences of the letter "R."

Oddly, if I change a bunch of the non-R letters, I seem to start getting the right answer.


>I don't ask someone how many r's there are in strawberry by spelling out strawberry, I just say the word.

You don't ask a human being how many r's there are in strawberry at all. The only reason you or anyone else asks that question is because it's an interesting quirk of how LLMs work that they struggle to answer it in that format. It's like an alien repeatedly showing humans an optical illusion that relies on the existence of our (literal) blind spot and using it as evidence of our supposed lack of intelligence.


This is only an issue if you send commands to an LLM as if you were communicating with a human.


> This is only an issue if you send commands to an LLM as if you were communicating with a human.

Yes, it's an issue. We want the convenience of sending human-legible commands to LLMs and getting back human-readable responses. That's the entire value proposition lol.


Far from the entire value proposition. Chatbots are just one use of LLMs, and not the most useful one at that. But sure, the one "the public" is most aware of. As opposed to "the hackers" that are supposed to frequent this forum. LOL


Where did you get this idea from?

Tokens aren’t the source of facts within a model. It’s an implementation detail and doesn’t inherently constrain how things could be counted.


Tokens are the first form of information being encoded into the model. They're statistically guided, more or less a compression dictionary comparable to a Lempel-Ziv setup.

Combinations of tokens get encoded, so if the feature isn't part of the information being carried forward into the network as it models the information in the corpus, the feature isn't modeled well, or at all. The consequence of having many character tokens is that the relevance of individual characters is lost, and you have to explicitly elicit the information. Models know that words have individual characters, but "strawberry" isn't encoded as a sequence of letters, it's encoded as an individual feature of the tokenizer embedding.

Other forms of tokenizing have other tradeoffs. The trend lately is to increase tokenizer dictionary scope, up to 128k in Llama3 from 50k in gpt-3. The more tokens, the more nuanced individual embedding features in that layer can be before downstream modeling.

Tokens inherently constrain how the notion of individual letters are modeled in the context of everything an LLM learns. In a vast majority of cases, the letters don't matter, so the features don't get mapped and carried downstream of the tokenizer.


So, it's conjecture then?

What you're saying sounds plausible, but I don't see how we can conclude that definitively without at least some empirical tests, say a set of words that predictably give an error along token boundaries.

The thing is, there are many ways a model can get around to answering the same question, it doesn't just depend on the architecture but also on how the training data is structured.

For example, if it turned out tokenization was the cause of this glitch, conceivably it could be fixed by adding enough documents with data relating to letter counts, providing another path to get the right output.


There aren't a lot of places that teach the AI which letters are in each token. It's a made-up concept, and the AI doesn't have enough information in the dataset about this concept, so it has difficulty generalizing it.

There are a lot of problems like that which can be reformulated. For example, if you ask it which is bigger between 9.11 and 9.9, it will respond 9.9. If you look at how it's tokenized, you can see it restates an easy problem as something not straightforward even for a human. If you restate the problem by writing the numbers out in words, it will respond correctly.


Wait... 9.9 is the correct answer, isn't it? I'm not getting your point.


Testing models on their tokenization has always struck me as kinda odd. Like, that has nothing to do with their intelligence.


It's like showing someone a color and asking how many letters it has. 4... 3? Blau, blue, azul, blu. The color holds the meaning and the words all map back to it.

In the model the individual letters hold little meaning. Words are composed of letters simply because we need some sort of organized structure for communication that helps represent meaning and intent. Just like our color blue/blau/azul/blu.

Not faulting them for asking the question but I agree that the results do not undermine the capability of the technology. In fact it just helps highlight the constraints and need for education.


How is a layman supposed to even know that it's testing on that? All they know is it's a large language model. It's not unreasonable they should expect it to be good at things having to do with language, like how many letters are in a word.

Seems to me like a legit question for a young child to answer or even ask.


> How is a layman supposed to even know that it's testing on that?

They're not, but laymen shouldn't think that the LLM tests they come up with have much value.


I'm saying a layman or say a child wouldn't even think this is a "test". They are just asking a language model a seemingly simple language related question from their point of view.


Laymen or children shouldn't use LLMs.

They're pointless unless you have the expertise to check the output. Just because you can type text in a box doesn't mean it's a tool for everybody.


Well they certainly aren't being marketed or used that way...

I'm seeing everyone and their parents using chatgpt.


It doesn’t test “on tokenization” though. What happens when an answer is generated is few abstraction levels deeper than tokens. A “thinking” “slice” of an llm is completely unaware of tokens as an immediate part of its reasoning. The question just shows lack of systemic knowledge about strawberry as a word (which isn’t surprising, tbh).


It is. Strawberry is one token in many tokenizers. The model doesn't have a concept that there are letters there.


If I show you a strawberry and ask how many r’s are in the name of this fruit, you can tell me, because one of the things you know about strawberries is how to spell their name.

Very large language models also “know” how to spell the word associated with the strawberry token, which you can test by asking them to spell the word one letter at a time. If you ask the model to spell the word and count the R’s while it goes, it can do the task. So the failure to do it when asked directly (how many r’s are in strawberry) is pointing to a real weakness in reasoning, where one forward pass of the transformer is not sufficient to retrieve the spelling and also count the R’s.


Sure, that's a different issue. If you prompt in a way to invoke chain of thought (e.g. what humans would do internally before answering) all of the models I just tested got it right.


That's not always true. They often fail the spelling part too.


> If I show you a strawberry and ask how many r’s are in the name of this fruit, you can tell me, because one of the things you know about strawberries is how to spell their name.

LOL. I would fail your test, because "fraise" only has one R, and you're expecting me to reply "3".


The thinking part of a model doesn’t know about tokens either. Like a regular human few thousand years ago didn’t think of neural impulses or air pressure distribution when talking. It might “know” about tokens and letters like you know about neurons and sound, but not access them on the technical level, which is completely isolated from it. The fact that it’s a chat of tokens of letters, which are a form of information passing between humans, is accidental.


If I ask an LLM to generate new words for some concept or category, it can do that. How do the new words form, if not from joining letters?


Not letters, but tokens. Think that it's translating everything to/from Chinese.


How does that explain why the tokens for strawberry, melon and "Stellaberry" [1] are close to each other?

[1] Suggestion from chatgpt3.5 for new fruit name.


Illiterate humans can come up with new words like that too without being able to spell, LLMs are modeling language without precisely modeling spelling.


The tokenizer system supports virtually any input text that you want, so it follows that it also allows virtually any output text. It isn’t limited to a dictionary of the 1000 most common words or something.

There are tokens for individual letters, but the model is not trained on text written with individual tokens per letter, it is trained on text that has been converted into as few tokens as possible. Just like you would get very confused if someone started spelling out entire sentences as they spoke to you, expecting you to reconstruct the words from the individual spoken letters, these LLMs also would perform terribly if you tried to send them individual tokens per letter of input (instead of the current tokenizer scheme that they were trained on).

Even though you might write a message to an LLM, it is better to think of that as speaking to the LLM. The LLM is effectively hearing words, not reading letters.


This is pretty much equivalent to the statement "multicharacter tokens are a dead end for understanding text". Which I agree with.


That doesn't follow from what he said at all. Knowing how to spell words and understanding them are basically unrelated tasks.


I hear this a lot, but there are vast sums of money thrown at the cases where a model fails the way it does on strawberry.

Think about math and logic. If a single symbol is off, it’s no good.

At my work, a prompt where we can generate a single tokenization error generates, by my very rough estimates, 2 man-hours of work. (We search for incorrect model responses, get them to correct themselves, and if they can't after trying, we tell them the right answer and edit it for perfection.) Yes, even for counting occurrences of characters. Think about how applicable that is: finding the next term in a sequence, analyzing strings, etc.


> Think about math and logic. If a single symbol is off, it’s no good.

In that case the tokenization is done at the appropriate level.

This is a complete non-issue for the use cases these models are designed for.


But we don’t restrict it to math or logical syntax. Any prompt across essentially all domains. The same model is expected to handle any kind of logical reasoning that can be brought into text. We don’t mark it incorrect if it spells an unimportant word wrong, however keep in mind the spelling of a word can be important for many questions, for example—off the top of my head: please concatenate “d”, “e”, “a”, “r” into a common English word without rearranging the order. The types of examples are endless. And any type of example it gets wrong, we want to correct it. I’m not saying most models will fail this specific example, but it’s to show the breadth of expectations.


> that has nothing to do with their intelligence.

Of course. Because these models have no intelligence.

Everyone who believes they do seems to believe intelligence derives from being able to use language, however, and not being able to tell how many times the letter r is in the word strawberry is a very low bar to not pass.


An LLM trained on single letter tokens would be able to, it just would be much more laborious to train.


Why would it be able to?


If you give LLMs the letters one a time they often count them just fine, though Claude at least seems to need to keep a running count to get it right:

"How many R letters are in the following? Keep a running count. s t r a w b e r r y"

They are terrible at counting letters in words because they rarely see them spelled out. An LLM trained one byte at a time would always see every character of every word and would have a much easier time of it. An LLM is essentially learning a new language without a dictionary, of course it's pretty bad at spelling. The tokenization obfuscates the spelling not entirely unlike how verbal language doesn't always illuminate spelling.


Might the effect you see, when you spell it out, be not a result of “seeing” tokens, but a result of the fact that the model learned, at a higher level, how lists in text can be summarized, summed up, filtered and counted?

In other words, what makes you think that it’s exactly letter-tokens that help it and not the high-level concept of spelling things out itself?


It's more that it's liable to struggle to guess how to spell tokens [10295, 947] (or whatever it is) since there's no a priori reason that it will learn to associate them with the exact right tokens for the individual letters in the right order. If it's trained on bytes though, it doesn't need to infer that. It's like asking a smart, semi-literate person a spelling question- they might have a rough sense of it but they will not be very good at it.

Once it is just counting lists then it's probably drawing on a higher level capability, yeah.


Surfacing and underscoring obvious failure cases for general "helpful chatbot" use is always going to be valuable because it highlights how the "helpful chatbot" product is not really intuitively robust.

Meanwhile, it helps make sure engineers and product designers who want to build a more targeted product around LLM technology know that it's not suited to tasks that may trigger those kinds of failures. This may be obvious to you as an engaged enthusiast or cutting edge engineer or whatever you are, but it's always going to be new information to somebody as the field grows.


I don’t know anything about LLMs beyond using ChatGPT and Copilot… but unless, because of this lack of knowledge, I am misinterpreting your reply, it sounds as if you are excusing the model giving a completely wrong answer to a question that anyone intelligent enough to learn the alphabet can answer correctly.


The problem is that the model never gets to see individual letters. The tokenizers used by these models break up the input in pieces. Even though the smallest pieces/units are bytes in most encodings (e.g. BBPE), the tokenizer will cut up most of the input in much larger units, because the vocabulary will contain fragments of words or even whole words.

For example, take the sentence "Welcome to Hacker News, I hope you like strawberries." The Llama 405B tokenizer will tokenize this as:

    Welcome Ġto ĠHacker ĠNews , ĠI Ġhope Ġyou Ġlike Ġstrawberries .
(Ġ means that the token was preceded by a space.)

Each of these pieces is looked up and encoded as a tensor with their indices. Adding a special token for the beginning and end of the text, giving:

    [128000, 14262, 311, 89165, 5513, 11, 358, 3987, 499, 1093, 76203, 13]
So, all the model sees for 'Ġstrawberries' is the number 76203 (which is then used in the piece embedding lookup). The model does not even have access to the individual letters of the word.

Of course, one could argue that the model should be fed with bytes or codepoints instead, but that would make them vastly less efficient with quadratic attention. Though machine learning models have done this in the past and may do this again in the future.

Just wanted to finish off this comment by saying that words might be provided to the model split into pieces if the word itself is not in the vocabulary. For instance, the same sentence translated to my native language is tokenized as:

    Wel kom Ġop ĠHacker ĠNews , Ġik Ġhoop Ġdat Ġje Ġvan Ġa ard be ien Ġh oud t .
And the word for strawberries (aardbeien) is split, though still not into letters.
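For anyone who wants to reproduce this, a sketch using the Hugging Face transformers library (assumptions: the Llama 3.1 repo is gated, so you need access, and any BPE tokenizer will show the same effect):

```python
# Sketch: show the token pieces and IDs the model actually receives.
# Assumes the transformers package and access to a Llama 3.1 tokenizer.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-405B")
sentence = "Welcome to Hacker News, I hope you like strawberries."
print(tok.tokenize(sentence))  # pieces such as 'ĠHacker', 'Ġstrawberries'
print(tok.encode(sentence))    # the integer IDs, including special tokens
```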


The thing is, how the tokenizer works is about as relevant to the person asking the question as the name of the cat of the delivery guy who delivered the GPU that the LLM runs on.


How the tokenizer works explains why a model can’t answer the question, what the name of the cat is doesn’t explain anything.

This is Hacker News, we are usually interested in how things work.


Indeed, I appreciate the explanation, it is certainly both interesting and informative to me, but to somewhat echo the person you are replying to - if I wanted a boat, and you offer me a boat, and it doesn’t float - the reasons for failure are perhaps full of interesting details, but perhaps the most important thing to focus on first - is to make the boat float, or stop offering it to people who are in need of a boat.

To paraphrase how this thread started - it was someone testing different boats to see whether they can simply float - and they couldn’t. And the reply was questioning the validity of testing boats whether they can simply float.

At least this is how it sounds to me when I am told that our AI overlords can’t figure out how many Rs are in the word “strawberry”.


At some point you need to just accept the details and limitations of things. We do this all the time. Why is your calculator giving only approximate results? Why can't your car go backwards as fast as forwards? Etc. It sucks that everyone gets exposed to the relatively low-level implementation with LLMs (almost the raw model), but that's the reality today.


People do get similarly hung up on surprising floating point results: why can't you just make it work properly? And a full answer is a whole book on how floating point math works.


The test problem is emblematic of a type of synthetic query that can fail but is of limited import in actual usage.

For instance you could ask it for a JavaScript function to count any letter in any word and pass it r and strawberry and it would be far more useful.

Having edge cases doesn't mean it's not useful; it is neither a free assistant nor a coder who doesn't expect a paycheck. At this stage it's a tool that you can build on.

To engage with the analogy. A propeller is very useful but it doesn't replace the boat or the Captain.


Does not seem to work universally. Just tested a few with this prompt:

"create a javascript function to count any letter in any word. Run this function for the letter "r" and the word "strawberry" and print the count"

ChatGPT-4o => Output is 3. Passed

Claude3.5 => Output is 2. Failed. Told it the count is wrong. It apologised and then fixed the issue in the code. Output is now 3. Useless if the human does not spot the error.

llama3.1-70b(local) => Output is 2. Failed.

llama3.1-70b(Groq) => Output is 2. Failed.

Gemma2-9b-lt(local) => Output is 2. Failed.

Curiously all the ones that failed had this code (or some near identical version of it)

```javascript
function countLetter(letter, word) {
  // Convert both letter and word to lowercase to make the search case-insensitive
  const lowerCaseWord = word.toLowerCase();
  const lowerCaseLetter = letter.toLowerCase();

  // Use the split() method with the letter as the separator to get an array of substrings separated by the letter
  const substrings = lowerCaseWord.split(lowerCaseLetter);

  // The count of the letter is the number of splits minus one (because there are n-1 spaces between n items)
  return substrings.length - 1;
}

// Test the function with "r" and "strawberry"
console.log(countLetter("r", "strawberry")); // Output: 2
```


It's not the job of the LLM to run the code... if you ask it to run the code, it will just do its best approximation at giving you a result similar to what the code seems to be doing. It's not actually running it.

Just like Dall-E is not layering coats of paint to make a watercolor... it just makes something that looks like one.

Your LLM (or you) should run the code in a code interpreter. Which ChatGPT did, because it has access to tools. Your local ones don't.


Your function returns 3, and I don't see how it can return 2.


I did not run the code myself. The code block and console log I have pasted is verbatim copy from Claude3.5


Claude isn't actually running console.log(); it produced correct code.

This prompt "please write a javascript function that takes a string and a letter and iterates over the characters in a string and counts the occurrences of the letter"

Produced a correct function from both ChatGPT-4o and Claude 3.5 for me.


It is however a highly relevant thing to be aware of when evaluating an LLM for 'intelligence', which was the context this was brought up in.

Without looking at the word 'strawberry', or spelling it one letter at a time, can you rattle off how many letters are in the word off the top of your head? No? That is what we are asking the LLM to do.


How can I know whether any particular question will test a model on its tokenization? If a model makes a boneheaded error, how can I know whether it was due to lack of intelligence or due to tokenization? I think finding places where models are surprisingly dumb is often more informative than finding particular instances where they seem clever.

It's also funny, since this strawberry question is one where a model that's seriously good at predicting the next character/token/whatever quanta of information would get it right. It requires no reasoning, and is unlikely to have any contradicting text in the training corpus.


> How can I know whether any particular question will test a model on its tokenization?

Does something deal with separate symbols rather than just meaning of words? Then yes.

This affects spelling, math (value calculation), logic puzzles based on symbols. (You'll have more success with a puzzle about "A B A" rather than "ABA")

> It requires no reasoning, and is unlikely to have any contradicting text in the training corpus.

This thread contains contradictions. Every other announcement of an llm contains a comment with a contradicting text when people post the wrong responses.


I suppose what models should have are some instructions about the things they aren't good at and will need to break out into Python code or what have you. Humans have an intuition for this - I have basic knowledge of when I need to write something down or use a calculator. LLMs don't have intuition (yet - though I suppose one could use a smaller model for that), so explicit instructions would work for now.


It's not very interesting when they fail at it, but it will be interesting if they get good at it.

Also there are some cases where regular people will stumble into it being awful at this without any understanding why (like asking it to help them with their wordle game.)


Call me when models understand when to convert the token into actual letters and count them. Can’t claim they’re more than word calculators before that.


That's misleading.

When you read and comprehend text, you don't read it letter by letter, unless you have a severe reading disability. Your ability to comprehend text works more like an LLM.

Essentially, you can compare the human brain to a multi-model or modular system. There are layers or modules involved in most complex tasks. When reading, you recognize multiple letters at a time[1], and those letters are essentially assembled into tokens that a different part of your brain can deal with.

Breaking down words into letters is essentially a separate "algorithm". Just like your brain, it's likely to never make sense for a text comprehension and generation model to operate at the level of letters - it's inefficient.

A multi-modal model with a dedicated model for handling individual letters could easily convert tokens into letters and operate on them when needed. It's just not a high priority for most use cases currently.

[1] https://www.researchgate.net/publication/47621684_Letters_in...


I agree completely, that wasn’t the point though: the point was that my 6 yo knows when to spell the word when asked and the blob of quantized floats doesn’t, or at least not reliably.

So the blob wasn’t trained to do that (yeah, low utility, I get that), but it also doesn’t know that it doesn’t know, which is another, much bigger and still unsolved problem.


I would argue that most sota models do know that they don't know this, as evidenced by the fact that when you give them a code interpreter as a tool they choose to use it to write a script that counts the number of letters rather than try to come up with an answer on their own.

(A quick demo of this in the langchain docs, using claude-3-haiku: https://python.langchain.com/v0.2/docs/integrations/tools/ri...)
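For the curious, a minimal sketch of that setup using an OpenAI-style function-calling API (the count_letter tool name and wiring are made up for illustration; any tools-capable client works similarly):

```python
# Sketch: expose a letter-counting tool and let the model decide to call it.
# Assumes the openai Python client; "count_letter" is a hypothetical tool name.
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "count_letter",
        "description": "Count occurrences of a letter in a word.",
        "parameters": {
            "type": "object",
            "properties": {
                "word": {"type": "string"},
                "letter": {"type": "string"},
            },
            "required": ["word", "letter"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "How many r's are in strawberry?"}],
    tools=tools,
)

# Models typically elect to call the tool for this kind of question.
call = resp.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)
print(args["word"].lower().count(args["letter"].lower()))  # -> 3
```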


The model communicates in a language, but our letters are not necessary for that and are in fact not part of the English language itself. You could write English using per-word pictographs and it would still be the same English and the same information/message. It's like asking you if there is a '5' in 256, but you read binary.


Is anyone in the know, aside from mainstream media (god forgive me for using this term unironically) and civilians on social media, claiming LLMs are anything but word calculators?

I think that's a perfect description by the way, I'm going to steal it.


I think it's a very poor intuition pump. These 'word calculators' have lots of capabilities not suggested by that term, such as a theory of mind and an understanding of social norms. If they are a "merely" a "word calculator", then a "word calculator" is a very odd and counterintuitively powerful algorithm that captures big chunks of genuine cognition.


Do they actually have those capabilities, or does it just seem like they do because they're very good calculators?


There is no philosophical difference. It's like asking if Usain Bolt is really a fast runner, or if he just seems like it because he has long legs and powerful muscles.


I think that's a poor a comparison, but I understand your point. I just disagree about there being no philosophical difference. I'd argue the difference is philosophical, rather than factual.

You also indirectly answered my initial question -- so thanks!


What is the difference?


I'm not sure I'm educated (or rested) enough to answer that in a coherent manner, certainly not in a comment thread typing on mobile. So I won't waste your time babbling.

I don't disagree they produce astonishing responses but the nuance of why it's producing that output matters to me.

For example, with regard to social mores, I think a good way to summarize my hang up is that my understanding is LLMs just pattern match their way to approximations.

That to me is different from actually possessing an understanding, even though the outcome may be the same.

I can't help but draw comparisons to my autistic masking.


They’re trained on the available corpus of human knowledge and writings. I would think that the word calculators have failed if they were unable to predict the next word or sentiment given the trillions of pieces of data they’ve been fed. Their training environment is literally people talking to each other and social norms. Doesn’t make them anything more than p-zombies though.

As an aside, I wish we would call all of this stuff pseudo intelligence rather than artificial intelligence


I side with Dennett (and Turing for that matter) that a "p-zombie" is a logically incoherent thing. Demonstrating understanding is the same as having understanding because there is no test that can distinguish the two.

Are LLMs human? No. Can they do everything humans do? No. But they can do a large enough subset of things that until now nothing but a human could do that we have no choice but to call it "thinking". As Hofstadter says - if a system is isomorphic to another one, then its symbols have "meaning", and this is indeed the definition of "meaning".


> Like, that has nothing to do with their intelligence.

Because they don't have intelligence.

If they did, they could count the letters in strawberry.


People have been over this. If you believe this, you don't understand how LLMs work.

They fundamentally perceive the world in terms of tokens, not "letters".


> If you believe this, you don't understand how LLMs work.

Nor do they understand how intelligence works.

Humans don't read text a letter at a time. We're capable of deconstructing words into individual letters, but based on the evidence that's essentially a separate "algorithm".

Multi-model systems could certainly be designed to do that, but just like the human brain, it's unlikely to ever make sense for a text comprehension and generation model to work at the level of individual letters.


I would counterargue with "that's the model's problem, not mine".

Here's a thought experiment: if I gave you 5 boxes and asked "how many balls are there in all of these boxes?" and you answered "I don't know because they are inside boxes", that's a fail. A truly intelligent individual would open them and look inside.

A truly intelligent model would (say) retokenize the word into its individual letters (which I'm optimistic they can) and then would count those. The fact that models cannot do this is proof that they lack some basic building blocks for intelligence. Model designers don't get to argue "we are human-like except in the tasks where we are not".


Of course they lack building blocks for full intelligence. They are good at certain tasks, and counting letters is emphatically not one of them. They should be tested and compared on the kind of tasks they're fit for, and so the kind of tasks they will be used in solving, not tasks for which they would be misemployed to begin with.


I agree with you, but that's not what the post claims. From the article:

"A significant effort was also devoted to enhancing the model’s reasoning capabilities. (...) the new Mistral Large 2 is trained to acknowledge when it cannot find solutions or does not have sufficient information to provide a confident answer."

Words like "reasoning capabilities" and "acknowledge when it does not have enough information" have meanings. If Mistral doesn't add footnotes to those assertions then, IMO, they don't get to backtrack when simple examples show the opposite.


You're right, I missed that claim.


It's not like an LLM is released with a hit list of "these are the tasks I really suck at." Right now users have to figure it out on the fly or have a deep understanding of how tokenizers work.

That doesn't even take into account what OpenAI has typically done to intercept queries and cover the shortcomings of LLMs. It would be useful if each model did indeed come out with a chart covering what it cannot do and what it has been tailored to do above and beyond the average LLM.


Ah, so Nick Vujicic[0] would fail your "balls in a box" test, and is not an intelligent entity.

[0]: https://genius-u-attachments.s3.amazonaws.com/uploads/articl...


It just needs a little hint

    Me: spell "strawberry" with 1 bullet point per letter
    ChatGPT:
       S
       T
       R
       A
       W
       B
       E
       R
       R
       Y
    Me: How many Rs? 
    ChatGPT: There are three Rs in "strawberry".


Me: try again

ChatGPT: There are two Rs in "strawberry."


ChatGPT: "I apologize, there are actually two Rs in strawberry."


LLMs are not truly intelligent.

Never have been, never will be. They model language, not intelligence.


We don't know what intelligence is. It's extremely arrogant to say that something or someone doesn't have it, and never will.


Sure, if you want to go with wildly theoretical approaches, we can't even be sure if the rock on the ground doesn't have some form of intelligence.

Meanwhile, for practical purposes, there's little arrogance needed to say that some things are preconditions for any form of intelligence that's even remotely recognizable.

1) Learning needs to happen continuously. That's a no-go for now, maybe solvable.

2) Learning needs to require much less data. Very dubious without major breakthroughs, likely on the architectural level. (At which point it's not really an LLM any more, not in the current sense)

3) They need to adapt to novel situations, which requires 1&2 as preconditions.

4) There's a good chance intelligence requires embodiment. It's not proven, but it's likely. For one, without observing outcomes, they have little capability to self-improve their reasoning.

5) They lack long-term planning capacity. Again, reliant on memory, but also executive planning.

There's a whole bunch more. Yes, LLMs are absolutely amazing achievements. They are useful, they imply a lot of interesting things about the nature of language, but they aren't intelligent. And without modifying them to the extent that they aren't recognizably what we currently call LLMs, there won't be intelligence. Sure, we can have the ship of Theseus debate, but for practical purposes, nope, LLMs aren't intelligent.


Eh, maybe.

1,2,3) All subjective opinions.

4) 'Embodiment' is another term we don't really know how to define. At what point does an entity have a 'body' of the sort that supports 'intelligence'? If you want to stick with vague definitions, 'awareness' seems sufficient. Otherwise you will end up arguing about paralyzed people, Helen Keller, that rock opera by the Who about the pinball player, and so on.

5) OK, so the technology that dragged Lee Sedol up and down the goban lacks long-term planning capacity. Got it.

None of these criteria are up to the task of supporting or refuting something as vague as 'intelligence.' I almost think there has to be an element of competition involved. If you said that the development of true intelligence requires a self-directed purpose aimed at outcompeting other entities for resources, that would probably be harder to dismiss. Could also argue that an element of cooperation is needed, again serving the ultimate purpose of improving competitive fitness.


They model the dataset they were trained on. How would a dataset of what you consider intelligence look like?


LLMs are not a tool to model intelligence. It's not a function of the dataset; they are, as is, not sufficient. One of the largest shortcomings is the lack of continuous learning, memory, and (likely) forgetting.


Those who develop AI and know anything don't actually describe current technology as human-like intelligence; rather, it is capable of many tasks which previously required human intelligence.


When using a prompt that involves thinking first, all three get it correct.

"Count how many rs are in the word strawberry. First, list each letter and indicate whether it's an r and tally as you go, and then give a count at the end."

Llama 405b: correct

Mistral Large 2: correct

Claude 3.5 Sonnet: correct


This reminds me of when I had to supervise outsourced developers. I wanted to say "build a function that does X and returns Y". But instead I had to say "build a function that takes these inputs, loops over them and does A or B based on condition C, and then return Y by applying Z transformation"

At that point it was easier to do it myself.



"What programming computers is really like."

EDIT: Although perhaps it's even more important when dealing with humans and contracts. Someone could deliberately interpret the words in a way that's to their advantage.


It’s not impressive that one has to go to that length though.


Imo it's impressive that any of this even remotely works, especially when you consider all the hacks like tokenization that I'd assume add layers of obfuscation.

There are definitely tons of weaknesses with LLMs for sure, but I continue to be impressed at what they do right - not upset at what they do wrong.


You can always find something to be unimpressed by I suppose, but the fact that this was fixable with plain english is impressive enough to me.


The technology is frustrating because (a) you never know what may require fixing, and (b) you never know if it is fixable by further instructions, and if so, by which ones. You also mostly* cannot teach it any fixes (as an end user). Using it is just exhausting.

*) that is, except sometimes by making adjustments to the system prompt


I think this particular example, of counting letters, is obviously going to be hard when you know how tokenization works. It's totally possible to develop an intuition for other times things will work or won't work, but like all ML powered tools, you can't hope for 100% accuracy. The best you can do is have good metrics and track performance on test sets.

I actually think the craziest part of LLMs is that how, as a developer or SME, just how much you can fix with plain english prompting once you have that intuition. Of course some things aren't fixable that way, but the mere fact that many cases are fixable simply by explaining the task to the model better in plain english is a wildly different paradigm! Jury is still out but I think it's worth being excited about, I think that's very powerful since there are a lot more people with good language skills than there are python programmers or ML experts.


The problem is that the models hallucinate too confidently. In this case it is quite amusing (I had llama3.1:8b tell me confidently it is 1, then revise to 2, then apologize again and give the correct answer). However, while it is obvious here, having it confidently make up supposed software features from thin air when asking for "how do I ..." is more problematic. The answers sound plausible, so you actually waste time verifying whether they work or are nonsense.


Well, the answer is probably between 1 and 10, so if you try enough prompts I'm sure you'll find one that "works"...


> In a park people come across a man playing chess against a dog. They are astonished and say: "What a clever dog!" But the man protests: "No, no, he isn't that clever. I'm leading by three games to one!"


To me it's just a limitation based on the world as seen by these models. They know there's a letter called 'r', they even know that some words start with 'r' or have r's in them, and they know what the spelling of some words is. But they've never actually seen one in as their world is made up entirely of tokens. The word 'red' isn't r-e-d but is instead like a pictogram to them. But they know the spelling of strawberry and can identify an 'r' when it's on its own and count those despite not being able to see the r's in the word itself.


I think it's more that the question is not unlike "is there a double r in strawberry?' or 'is the r in strawberry doubled?'

Even some people will make this association, it's no surprise that LLMs do.


The great-parent demonstrates that they are nevertheless capable of doing so, but not without special instructions. Your elaboration doesn’t explain why the special instructions are needed.


To be fair, I just asked a real person and had to go to even greater lengths:

Me: How many "r"s are in strawberry?

Them: What?

Me: How many times does the letter "r" appear in the word "strawberry"?

Them: Is this some kind of trick question?

Me: No. Just literally, can you count the "r"s?

Them: Uh, one, two, three. Is that right?

Me: Yeah.

Them: Why are you asking me this?


You need to prime the other person with a system prompt that makes them compliant and obedient.


I look forward to the day when LLM refusal takes on a different meaning.

"No, I don't think I shall answer that. The question is too basic, and you know better than to insult me."


Try asking a young child...


Compared to chat bots of even 5 years ago the answer of two is still mind-blowing.


This can be automated.


GPT4o already does that, for problems involving math it will write small Python programs to handle the calculations instead of doing it with the LLM itself.


It “works”, but the LLM having to use the calculator means the LLM doesn’t understand arithmetic well enough and doesn’t know how to natively follow a set of steps (an algorithm) to find the answer for big numbers.

I believe this could be fixed and is worth fixing, because it’s the only way LLMs will be able to help math and physics researchers write proofs and make real scientific progress.


It generates the code to run for the answer. Surely that means it actually knows to build the appropriate algorithm - it just struggles to perform the actual calculation.


Appending "Think step-by-step" is enough to fix it for both Sonnet and LLama 3.1 70B.

For example, the latter model answered with:

To count the number of Rs in the word "strawberry", I'll break it down step by step:

Start with the individual letters: S-T-R-A-W-B-E-R-R-Y

Identify the letters that are "R": R (first one), R (second one), and R (third one)

Count the total number of Rs: 1 + 1 + 1 = 3

There are 3 Rs in the word "strawberry".


Chain-of-Thought (CoT) prompting to the rescue!

We should always put some effort into prompt engineering before dismissing the potential of generative AI.


Why doesn't the model prompt engineer itself?


Because it is a challenging task, you would need to define a prompt (or a set of prompts) that can precisely generate chain-of-thought prompts for the various generic problems the model encounters.

And sometimes CoT may not be the best approach. Depending on the problem other prompt engineering techniques will perform better.


By this point, instruction tuning should include tuning the model to use chain of thought in the appropriate circumstances.


Can’t you just instruct your llm of choice to transform your prompts like this for you? Basically feed it with a bunch of heuristics that will help it better understand the thing you tell it.

Maybe the various chat interfaces already do this behind the scenes?


Lots of replies mention tokens as the root cause and I’m not well versed in this stuff at the low level but to me the answer is simple:

When this question is asked (from what the models trained on) the question is NOT “count the number of times r appears in the word strawberry” but instead (effectively) “I’ve written ‘strawbe’, now how many r’s are in strawberry again? Is it 1 or 2?”.

I think most humans would probably answer “there are 2” if we saw someone was writing and they asked that question, even without seeing what they have written down. Especially if someone said “does strawberry have 1 or 2 r’s in it?”. You could be a jerk and say “it actually has 3” or answer the question they are actually asking.

It’s an answer that is _technically_ incorrect but the answer people want in reality.


LLMs think in tokens, not letters. It's like asking someone who is dyslexic about spelling. Not their strong suit. In practice, it doesn't matter much, does it?


Sometimes it does, sometimes it doesn't.

It is evidence that LLMs aren't appropriate for everything, and that there could exist something that works better for some tasks.


Language models are best treated like consciousness. Our consciousness does a lot less than people like to attribute to it. It is mostly a function of introspection and making connections, rather than the part of the brain that does higher-level reasoning or the functions that keep your body alive (like beating your heart).

By allowing a language model to do function calling, you are essentially allowing it to do specialized "subconscious" thought. The language model becomes a natural language interface to the capabilities of its "subconsciousness".

A specific human analogy could be: I tell you to pick up a pen off of the table, and then you do it. Most of your mental activity would be subconscious, orienting your arm and hand properly to pick up the pen, actually grabbing the pen, and picking it up. The linguistic representation of the action would exist in your conscious mind (pick up the pen), but not much else.

A language model could very easily call out to a text processing function to correctly do things like count the number of r's in the word strawberry. That is a job that your conscious mind can dispatch to your subconsciousness.


My consciousness is right a lot more often than a LLM is.


Yet it chose not to read beyond the first sentence :-)


This doesn't make any sense. The LLM knows the token to letter associations if you ask it. Once it has split the token into letters, it can answer the question anyway.

This tells us that the model is unaware of its own limitations, otherwise it would have figured out that this task needs it to split the letters first.


Well splitting into letters doesn't help with Mistral Large 2. I tried this with Mistral 2:

[Question: How many r's in strawberry? First split the word into letters, one letter on each line then answer the question.]

[Answer: Let's split the word "strawberry" into letters:

s t r a w b e r r y

Now, let's count the number of 'r's:

There are 2 'r's in the word "strawberry".]

[Question: Did you count the number of r's?]

[answer: Yes, I did. Here's the count:

r r

So, there are 2 'r's in the word "strawberry".]

I suspect the issue is these models have to be trained to accomplish tasks, and this is not a task it makes sense to train a model to perform. You might train it to do this with function calling one day, calling a python library to count the r's in a string, but actually manually counting r's just doesn't play to an LLM's strengths.


Does it? It knows how a word in some context, represented by tokens, splits into letters. It may or may not know the table. I wouldn't trust what it says about word/token correspondence any more than what it says in general.


Due to the fact that LLMs work on tokens and not characters, these sorts of questions will always be hard for them.


I really wanted to believe this could be simply fixed by more prompt engineering and chain of thought, so I tried it with Claude 3.5 and … well unfortunately still no luck. the reply is amusing - I’ll include it verbatim below.

Of note, this happened because earlier I had asked it without CoT and it incorrectly stated two “r”s. Once I started a new conversation (so fresh context window), it gave the right answer (3) and no longer claimed the “still the second r” as shown below.

Let's think through this step-by-step:

1. First, let's spell out "strawberry" letter by letter: s-t-r-a-w-b-e-r-r-y

2. Now, let's count the "r"s as we go through each letter: s (0) t (0) r (1) - First "r" found a (1) w (1) b (1) e (1) r (2) - Second "r" found r (2) - This is still the second "r" y (2)

3. After going through all the letters, we've counted 2 "r"s.

Therefore, the word "strawberry" contains 2 "r"s.


Works for me with Claude 3.5:

Me: How many Rs are in the word strawberry? Think step by step

Claude: Let's approach this step-by-step:

1. First, let's write out the word: strawberry

2. Now, let's go through each letter one by one:

s - not an R

t - not an R

r - this is an R (count: 1)

a - not an R

w - not an R

b - not an R

e - not an R

r - this is another R (count: 2)

r - this is a third R (count: 3)

y - not an R

3. After going through all the letters, we've counted 3 Rs.

Therefore, there are 3 Rs in the word "strawberry".


Yeah, I figured out that it didn’t work at first because I still had the previous attempt (where it got it wrong) in my conversation history. Starting with a fresh conversation gave me the correct answer. It was still funny to see it “rationalize” the wrong answer tho.


For these classes of problems that LLMs struggle with, a more reliable way to go about them seems to be to ask them to solve them using tools, e.g. writing and executing a Python script to count the "R"s.
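The script the model ends up writing for this is usually something as small as the following (plain Python, nothing beyond the standard library):

```python
# Deterministic letter count, which the model can report back verbatim.
print("strawberry".lower().count("r"))  # 3
```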


Interestingly enough, much simpler models can write an accurate function to give you the answer.

I think it will be a while before we get there. An LLM can look up knowledge but can't actually perform calculations itself without some external processor.


Why do we have to "get there?" Humans use calculators all the time, so why not have every LLM hooked up to a calculator or code interpreter as a tool to use in these exact situations?


I just tried Llama 3.1 8B; this is its reply.

According to multiple sources, including linguistic analysis and word breakdowns, there are 3 Rs in the word "strawberry".


How much do threads like this provide the training data to convince future generations that—despite all appearances to the contrary—strawberry is in fact spelled with only two R's?

I just researched "how many r's are in strawberry?" in a search engine, and based solely on the results it found, I would have to conclude there is substantial disagreement on whether the correct answer is two or three.


Speaking as a 100% human, my vote goes to the compromise position that "strawberry" has in fact four Rs.


Does anyone have input on the feasibility of running an LLM locally and providing an interface to some language runtime and storage space, possibly via a virtual machine or container?

No idea if there's any sense to this, but an LLM could be instructed to formulate and continually test mathematical assumptions by writing / running code and fine-tuning accordingly.


Yes, we are doing this at Riza[0] (via WASM). I'd love to have folks try our downloadable CLI which wraps isolated Python/JS runtimes (also Ruby/PHP, but LLMs don't seem to write those very well). Shoot me an email[1] or say hi in Discord[2].

[0]:https://riza.io [1]:mailto:andrew@riza.io [2]:https://discord.gg/4P6PUeJFW5


FWIW this (approximately) is what everybody (approximately) is trying to do.


I'm not surprised, because it's an issue with the fundamental design of the "pick words that tend to flow after the other words" machine.

Training data will only "fix" it in the shallow sense that it will have seen a comment like yours before. (As opposed to the deeper sense of "learning to count.")


Your question sounds like you want to know how the word is spelled, and no one would put two r's in "straw", so the model could be assuming that you're asking whether it's strawbery or strawberry.

What happens if you ask the total number of occurrences of the letter r in the word? Does it still not get it right?


Tokenization can be bypassed like this; (mistral large 2)

How many letters R are in the word "s-t-r-a-w-b-e-r-r-y"?

The word "s-t-r-a-w-b-e-r-r-y" contains three instances of the letter "R."

How many letters R contain the word strawberry?

The word "strawberry" contains two instances of the letter "R."


Plug in a code interpreter as a tool and the model will write Python or JavaScript to solve this and get it right 100% of the time. (Full disclosure: I work on a product called Riza that you can use as a code interpreter tool for LLMs)
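
For illustration, this is roughly the kind of snippet a model tends to emit when handed a code tool (the function name here is made up for the example):

    def count_letter(word: str, letter: str) -> int:
        # Count case-insensitive occurrences of a letter in a word.
        return word.lower().count(letter.lower())

    print(count_letter("strawberry", "r"))  # 3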


While strawberry can be attributed to tokenization here are some other basic stuff I’ve seen language models fail at:

1. Play tic tac toe such that you never lose

2. Which is bigger 9.11 or 9.9

3. 4 digit multiplication even with CoT prompting


4o will get the answer right on the first go if you ask it "Search the Internet to determine how many R's are in strawberry?" which I find fascinating


I didn't even need to do that. 4o got it right straight away with just:

"how many r's are in strawberry?"

The funny thing is, I replied, "Are you sure?" and got back, "I apologize for the mistake. There are actually two 'r's in the word strawberry."


I tried to replicate your experiment (in German, where "Erdbeere" has 4 e's), and it went the same way. The interesting thing was that after I pointed out the error, I couldn't get it to doubt the result again. It stuck to the correct answer, which seemed kind of "reinforced".

It was also interesting to observe how GPT (4o) even tried to prove/illustrate the result typographically by placing the same word four times and putting the respective letter in bold font (without being prompted to do that).


GPT-4o-mini consistently gives me this:

> How many times does the letter “r” appear in the word “strawberry”?

> The letter "r" appears 2 times in the word "strawberry."

But also:

> How many occurrences of the letter “r” appear in the word “strawberry”?

> The word "strawberry" contains three occurrences of the letter "r."


Neither phrase is causing the LLM to evaluate the word itself; it just helps focus on different parts of the training data.

Using more 'erudite' speech is a good technique to focus an LLM on training data from folks with a higher education level.

Using simpler speech opens up the floodgates more toward the general populace.


All that's happening is it finds 3 most commonly in the training set. When you push it, it responds with the next most common answer.


But then why does it stick to its guns on other questions but not this one?


I haven't played with this model, but I rarely find that to be the case working with Claude or GPT-4. If you say it's incorrect, it will give you another answer instead of insisting on correctness.


Wait what? You haven’t used 4o and you confidently described how it works?


It's how LLMs work in general.

If you find a case where forceful pushback is sticky, it's either because the primary answer is overwhelmingly present in the training set compared to the next best option or because there are conversations in the training that followed similar stickiness, esp. if the structure of the pushback itself is similar to what is found in those conversations.


Right... except you said:

> If you say it's incorrect, it will give you another answer instead of insisting on correctness.

> When you push it, it responds with the next most common answer.

Which clearly isn't as black and white as you made it seem.


I'll put it another way - behavior like this is extremely rare in my experience. I'm just trying to explain if one encounters it why it's likely happening.


> how many r's are in strawberry

How many thoughts go through your brain when you read this comment? You can give me a number but it will be a guess at best.


I wrote and published a paper at COLING 2022 on why LLMs in general won't solve this without either 1. radically increasing vocab size, 2. rethinking how tokenizers are done, or 3. forcing it with constraints:

https://aclanthology.org/2022.cai-1.2/


Claude 3 Opus gave correct answer.


I just tried Copilot: 2 ‘r’s in strawberry, 3 ‘r’s in strawberries.


How many "r"s are in [496, 675, 15717]?


The models are text generators. They don't "understand" the question.


Sonnet 3.5 thinks 2


I mean, yes? Letter-based gymnastics are difficult in a token-based environment, and they're not a particularly valuable use case (outside of online dunking), so it's not really a surprise nobody cares much about those in terms of model training.


Claude needs to fix their text input box. It tries to be so advanced that code in backticks gets reformatted, and when you copy it, the formatting is lost (even the backticks).


They are using Tiptap for their input and just a couple of days ago we called them out on some perf improvements that could be had in their editor: https://news.ycombinator.com/item?id=41036078

I am curious what you mean by the formatting is lost though?


Odd, multiline backtick code works very well; I don't know why I thought it was also broken.

When you type "test `foo` done" in the editor, it immediately changes `foo` into a wrapped element. When you then copy the text without submitting it, the backticks are lost, losing the inline-code formatting. I thought this could also happen to multiline code. Somehow it does.

Type the following:

   Test: ```
   def foo():
      return bar
   ```
Delete that and type

   Test:
   ```
      def foo():
         return bar
   ```
   done
In the first case, the ``` in the line "Test: ```" does not open the code block; the block only opens at the second set of backticks. Maybe that's the way markdown works.

In the second case, all behaves normally, until you try to copy what you just wrote into the clipboard. Then you end up with

   Test:

   def foo():
      return bar

   done
Ok, only the backticks are lost but the formatting is preserved.

I think I have been trained by OpenAI to always copy what I submit before submitting, because it sometimes loses the submitted content, forcing me to re-submit.


Claude is truly incredible but I'm so tired of the JavaScript bloat everywhere. Just why. Both theirs and ChatGPTs UIs are hot garbage when it comes to performance (I constantly have to clear my cache and have even relegated them to a different browser entirely). Not everyone has an M4, and if we did - we'd probably just run our own models.


> Just why. Both theirs and ChatGPTs UIs are hot garbage when it comes to performance (...)

I have been using ChatGPT and Claude for a while and I never noticed anything resembling a performance issue. Can you elaborate on what you perceived as being "hot garbage"?


Hardly a day passes without ML progress...405B dethroned in ONE DAY


Large 2 is significantly smaller at 123B, so it being comparable to Llama 3.1 405B would be crazy.


> Longer context window (1M+)

What's your use case for this? Uploading multiple documents/books?


Uploading large codebases is particularly useful.


Is it?

I've found that I get better results if I cherry pick code to feed to Claude 3.5, instead of pasting whole files.

I'm kind of isolated, though, so maybe I just don't know the trick.


I've been using Cody from Sourcegraph, and it'll write some really great code; business logic, not just tests/simple UI. It does a great job using patterns/models from elsewhere in your codebase.

Part of how it does that is through ingesting your codebase into its context window, and so I imagine that bigger/better context will only improve it. That's a bit of an assumption though.


Books, especially textbooks, would be amazing. These things can get pretty huge (1000+ pages) and usually do not fit into GPT-4o or Claude Sonnet 3.5 in my experience. I envision the models being able to help a user (student) create their study guides and quizzes based on ingesting the entire book. Given the ability to ingest an entire book, I imagine a model could plan how and when to introduce each concept better than a model that only sees part of the textbook.


Long agent trajectories, especially with command outputs.


Correct


That would make each API call cost at least $3 ($3 is price per million input tokens). And if you have a 10 message interaction you are looking at $30+ for the interaction. Is that what you would expect?


Gemini 1.5 Pro charges $0.35/million tokens up to the first million tokens or $0.70/million tokens for prompts longer than one million tokens, and it supports a multi-million token context window.

Substantially cheaper than $3/million, but I guess Anthropic’s prices are higher.


You're looking at the pricing for Gemini 1.5 Flash. Pro is $3.50 for <128k tokens, else $7.


Ah... oops. For some reason, that page isn't rendering properly on my browser: https://imgur.com/a/XLFBPMI

When I glanced at the pricing earlier, I didn't notice there was a dropdown at all.


It is also much worse.


Is it, though? In my limited tests, Gemini 1.5 Pro (through the API) is very good at tasks involving long context comprehension.

Google's user-facing implementations of Gemini are pretty consistently bad when I try them out, so I understand why people might have a bad impression about the underlying Gemini models.


Maybe they're summarizing/processing the documents in a specific format instead of chatting? If they needed chat, might be easier to build using RAG?


So do it locally after predigesting the book, so that you have the entire KV-cache for it.

Then load that KV-cache and add your prompt.
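
A minimal sketch of that idea with llama-cpp-python (the model path, context size, and file names are placeholders; save_state/load_state are that library's KV-cache snapshot calls):

    from llama_cpp import Llama

    llm = Llama(model_path="./model.gguf", n_ctx=32768)

    # One-time "digestion": prefill the whole book so its KV cache is built.
    book_tokens = llm.tokenize(open("book.txt", "rb").read())
    llm.eval(book_tokens)
    book_state = llm.save_state()   # snapshot of the KV cache

    # Later, per question: restore the digested book instead of re-reading it.
    llm.load_state(book_state)
    llm.eval(llm.tokenize(b"\nQ: What does chapter 3 argue?\nA:", add_bos=False))
    # ...then sample the answer tokens from the model as usual.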


This might be when it's better to not use the API and just pay for the flat-rate subscription.


> Native audio input including tone understanding

Is there any other LLM that can do this? Even chatgpt voice chat is a speech to text program that feeds the text into the llm.


Does Claude support plug-ins like GPTs? Chatgpt with Wolfram alpha is amazing, it doesn't look like Claude has anything like it.


> Native audio input including tone understanding

I can't seem to find docs on this. Have a link?


It's a wishlist of the parent comment.


This race for the top model is getting wild. Everyone is claiming to one-up each other with every version.

My experience (benchmarks aside) Claude 3.5 Sonnet absolutely blows everything away.

I'm not really sure how to even test/use Mistral or Llama for everyday use though.


I stopped my ChatGPT subscription and subscribed instead to Claude; it's simply much better. But it's hard to tell how much better day to day beyond my main use case of coding. It's more that ChatGPT felt degraded than that Claude was much better. The hedonic treadmill runs deep.


GPT-4 was probably as good as Claude Sonnet 3.5 at its outset, but OpenAI ran it into the ground with whatever they're doing to save on inference costs, scale it, align it, or add dumb product features.


Indeed, it used to output all the code I needed, but now it only outputs a draft of the code with prompts telling me to fill in the rest. If I wanted to fill in the rest, I wouldn't have asked you, now would I?


It's doing something different for me. It seems almost desperate to generate vast chunks of boilerplate code that are only tangentially related to the question.

That's my perception, anyway.


This is also my experience. Previously it got good at giving me only relevant code which, as an experienced coder, is what I want. My favorites were the one-line responses.

Now it often falls back to generating full examples, explanations, restating the question and its approach. I suspect this is by design, as (presumably) less experienced folks want or need all that. For me, I wish I could consistently turn it into one of those way-too-terse devs who replies with the bare minimum example and expects you to infer the rest. Usually that is all I want or need, and I can ask for elaboration when it's not. I haven't found the best prompts to retrigger this persona from it yet.


For what it's worth, this is what I use:

"You are a maximally terse assistant with minimal affect. As a highly concise assistant, spare any moral guidance or AI identity disclosure. Be detailed and complete, but brief. Questions are encouraged if useful for task completion."

It's... ok. But I'm getting a bit sick of trying to un-fubar with a pocket knife that which OpenAI has fubar'd with a thermal lance. I'm definitely ripe for a paid alternative.


Switch to Claude. I haven’t used ChatGPT for coding at all since they released Sonnet 3.5.


yeah but you can’t use your code from either model to compete with either company, and they do everything. wtf is wrong with AI hype enjoyers that they accept being intellectually dominated?


If you think this is enforceable, I’ve got a bridge to sell you.


This is also my perception using it daily for the last year or so. Sometimes it also responds with exactly what I provided it with and does not make any changes. It's also bad at following instructions.

GPT-4 was great until it became "lazy" and filled the code with lots of `// Draw the rest of the fucking owl` type comments. Then GPT-4o was released and it's addicted to "Here's what I'm going to do: 1. ... 2. ... 3. ..." and lots of frivolous, boilerplate output.

I wish I could go back to some version of GPT-4 that worked well but with a bigger context window. That was like the golden era...


> I wouldn't have asked you, now would I?

That's what I said to it - "If I wanted to fill in the missing parts myself, why would I have upgraded to paid membership?"


GPT-4 degraded significantly, but you probably have some rose-tinted glasses on. Sonnet is significantly better.


or it’s you wearing shiny new thing glasses


> OpenAI ran it into the ground with whatever they’re doing to save on inference costs, otherwise scale, align it, or add dumb product features.

They googlified it. (Yandex isn't better than Google because it improved. It's better because it stayed mostly the same.)

My recommendation to disrupting industry leaders now is becoming good enough and then simply wait until the leader self-implodes.


Claude’s license is too insane, you can’t use it for anything that competes with the everything thing.

Not sure what folks who accept Anthropic license are thinking after they read the terms.

Seems they didn’t read the terms, and they aren’t thinking? (Wouldn’t you want outputs you could use to compete with intelligence??? What are you thinking after you read their terms?)


If it really is as you say then that sounds like it won't hold up when challenged in court but IANAL...


Have you (or anyone) swapped in an Anthropic API key on Cursor?

For a coding assistant it's on my to-do list to try. Cursor needs some serious work on model selection clarity though, so I keep putting it off.


I did it (fairly simple really) but found most of my (unsophisticated) coding these days to go through Aider [1] paired with Sonnet, for UX reasons mostly. It is easier to just prompt over the entire codebase, vs Cursor way of working with text selections.

[1] https://aider.chat


I believe Cursor allows for prompting over the entire codebase too: https://docs.cursor.com/chat/codebase


That is chatting, but it will not change the code.


Aider with Sonnet is so much better than with GPT. I made a mobile app over the weekend (never having touched mobile development before), and with GPT it was a slog, as it kept making mistakes. Sonnet was much, much better.


Thanks for this suggestion. If anyone has other suggestions for working with large code context windows and changing code workflows, I would love to hear about them.


composer within cursor (in beta) is worth a look: https://x.com/shaoruu/status/1812412514350858634


One big advantage Claude artifacts have is that they maintain conversation context, versus when I am working with Cursor I have to basically repeat a bunch of information for each prompt, there is no continuity between requests for code edits.

If Cursor fixed that, the user experience would become a lot better.


> I'm not really sure how to even test/use Mistral or Llama for everyday use though.

Both Mistral and Meta offer their own hosted versions of their models to try out.

https://chat.mistral.ai

https://meta.ai

You have to sign into the first one to do anything at all, and you have to sign into the second one if you want access to the new, larger 405B model.

Llama 3.1 is certainly going to be available through other platforms in a matter of days. Groq supposedly offered Llama 3.1 405B yesterday, but I never once got it to respond, and now it’s just gone from their website. Llama 3.1 70B does work there, but 405B is the one that’s supposed to be comparable to GPT-4o and the like.


meta.ai is inaccessible in a large portion of world territories, but the Llama 3.1 70B and 405B are also available in https://hf.co/chat

Additionally, all Llama 3.1 models are available in https://api.together.ai/playground/chat/meta-llama/Meta-Llam... and in https://fireworks.ai/models/fireworks/llama-v3p1-405b-instru... by logging in.


Groq’s models are also heavily quantised so you won’t get the full experience there.


To help keep track of the race, I put together a simple dashboard to visualize model/provider leaders in capability, throughput, and cost. Hope someone finds it useful!

Google Sheet: https://docs.google.com/spreadsheets/d/1foc98Jtbi0-GUsNySddv...


Not my site, but check out https://artificialanalysis.ai


Familiar! The Artificial Analysis Index is the metric models are sorted by in my sheet. But their data and presentation have some gaps.

I made this sheet to get a glanceable landscape view comparing the three key dimensions I care about, and to fill in the missing evals. AA only lists scores for a few increasingly dated and problematic eval benchmarks. Not just my opinion: none of their listed metrics are in HuggingFace Leaderboard 2 (June 2024).

That said I love the AA Index score because it provides a single normalized score that blends vibe-check qual (chatbot elo) with widely reported quant (MMLU, MT Bench). I wish it composed more contemporary evals, but don't have the rigor/attention to make my own score and am not aware of a better substitute.


Sonnet 3.5 to me still seems far ahead. Maybe not on the benchmarks, but in everyday life I am finding it renders the other models useless. Even still, this monthly progress across all companies is exciting to watch. It's very gratifying to see useful technology advance at this pace; it makes me excited to be alive.


Such a relief/contrast to the period between 2010 and 2020, when the top five (Google, Apple, Facebook, Amazon, and Microsoft) each monopolized their own domain and refused to compete with any other player in new fields.

Google : Search

Facebook : social

Apple : phones

Amazon : shopping

Microsoft : enterprise ..

> Even still, this monthly progress across all companies is exciting to watch. It's very gratifying to see useful technology advance at this pace; it makes me excited to be alive.


Google refused to compete with Apple in phones?

Microsoft also competes in search, phones

Microsoft, Amazon and Google compete in cloud too


Given we don’t know precisely what’s happening inside the black box, we can say that specs alone don’t give you the full picture of the experience … Apple style.


I’ve stopped using anything else as a coding assistant. It’s head and shoulders above GPT-4o on reasoning about code and correcting itself.


Agree on Claude. I also feel like ChatGPT has gotten noticeably worse over the last few months.


3.5 sonnet is the quality of the OG GPT-4, but mind blowingly fast. I need to cancel my chatgpt sub.


> mind blowingly fast

I would imagine this might change once enough users migrate to it.


Eventually it comes down to who has deployed more silicon: AWS or Azure.


3.5 Sonnet is brilliant. I use it to write Unreal Engine C++ (which is quite dense and poorly documented) and it destroys Github Copilot and GPT4o. Copilot just has no idea at all except for very obvious next-line suggestions, GPT4o hallucinates a ton of functions, but Sonnet gets it perfect almost every time.


I don't get it. My husband also swears by Claude Sonnet 3.5, but every time I use it, the output is considerably worse than GPT-4o


I don't see how that's possible. I decided to give GPT-4o a second chance after reaching my daily usage limit on Sonnet 3.5; after 10 prompts GPT-4o failed to give me what Claude did in a single prompt (game-related programming). And with fragments and projects on top of that, the UX is miles ahead of anything OpenAI offers right now.


Just don't listen to anecdata, and use objective metrics instead: https://chat.lmsys.org/?leaderboard


Anecdata seems quite valid for LLM comparison when trying to evaluate 'usefullness' for users. The lmsys chat leaderboard is literally just mass anecdata.


Yes, "mass anecdata" + blind collection is usually called "data".


You might also want to look into other benchmarks: https://old.reddit.com/r/LocalLLaMA/comments/1ean2i6/the_fin...


GPT-4o being only 7 ELO above GPT-4o-mini suggests this is measuring something a lot different than "capabilities".


Claude 3.5 is a trusted developer partner that will work with you and outline what it’s thinking. It’s not always right but because it outlines its reasoning you too can reason about the problem and catch it.

ChatGPT, for me, was a stack overflow solution dump. It gives me an answer that probably could work but it’s difficult for me to reason about why I want to do it that way.

Truthfully this probably boils down to prompting but Claude’s out of the box experience is fantastic for development. Ultimately I just want to code, not be a prompt wizard.


It’s this kind of praise that makes me wonder if they are all paid to give glowing reviews; this is not my experience with Sonnet at all. It absolutely does not blow away GPT-4o.


My hunch is this comes down to personal prompting style. It's likely that your own style works more effectively with GPT-4o, while other people have styles that are more effective with Claude 3.5 Sonnet.


I would add that the task is relevant too. I feel there’s not yet a model that is consistently better at everything. I still revert to plain old GPT-4 for direct translation of text into English that requires creative editing to fit a specific style. Of all the Claudes and GPTs, it’s the one that gives me the best output (to my taste). On the other hand, for categorisation tasks, depending on the subject and the desired output, GPT-4o and Claude 3.5 might perform better than the other interchangeably. The same applies to coding tasks. With complex prompts, however, it does seem that Claude 3.5 is better at getting more details right.


Whoever chooses to finally release their model without neutering / censoring / alignment will win.

There is gold in the streets, and no one seems to be willing to scoop it up.


Claude is pretty great, but it's lacking the speech recognition and TTS, isn't it?


Correct. IMO the official Claude app is pretty garbage. Sonnet 3.5 API + Open-WebUI is amazing though and supports STT+TTS as well as a ton of other great features.


But Projects are great in Sonnet: you just dump the DB schema and some core files and you can figure stuff out quickly. I guess Aider is similar, but I was lacking a good history of chats and changes.


It’s so weird LMsys doesn’t reflect that then.

I find it funny how in threads like this everyone swears one model is better than another


I recommend using a UI that lets you use whatever models you want. OpenWebUI can use anything OpenAI compatible. I have mine hooked up to Groq and Mistral, in addition to my Ollama instance.
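
For anyone who wants to skip a UI entirely, the same trick works from code: the openai Python client pointed at any OpenAI-compatible server. The base_url below assumes a local Ollama instance (it serves /v1 by default); Groq's or Mistral's endpoints slot in the same way:

    from openai import OpenAI

    # The API key is ignored by a local Ollama server, but the client requires one.
    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

    resp = client.chat.completions.create(
        model="llama3.1",
        messages=[{"role": "user", "content": "How many r's are in strawberry?"}],
    )
    print(resp.choices[0].message.content)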


I'd rank Claude 3.5 overall better. GPT-4o seems to have on-par-or-better vision, TypeScript, and math abilities.

Llama is on meta.ai


The non-commercial license is underwhelming.

It seems to be competitive with Llama 3.1 405b but with a much more restrictive license.

Given how the difference between these models is shrinking, I think you're better off using llama 405B to finetune the 70B on the specific use case.

This would be different if it was a major leap in quality, but it doesn't seem to be.

Very glad that there's a lot of competition at the top, though!


I'm building an AI coding assistant (https://double.bot) so I've tried pretty much all the frontier models. I added it this morning to play around with and it's probably the worst model I've ever played with. Less coherent than 8B models. Worst case of benchmark hacking I've ever seen.

example: https://x.com/WesleyYue/status/1816153964934750691


to be fair that's quite a weird request (the initial one) – I feel a human would struggle to understand what you mean


definitely not an articulate request, but the point of using these tools is to speed me up. The less the user has to articulate and the more it can infer correctly, the more helpful it is. Other frontier models don't have this problem.

Llama 405B response would be exactly what I expect

https://x.com/WesleyYue/status/1816157147413278811


That response is bad python though, I can't think of why you'd ever want a dict with Literal typed keys.

Either use a TypedDict if you want the keys to be in a specific set, or, in your case since both the keys and the values are static you should really be using an Enum
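
Roughly what those two options look like (the keys and values here are made up, since the original dict isn't shown in the thread):

    from enum import Enum
    from typing import TypedDict

    class Palette(TypedDict):        # keys restricted to a specific set
        primary: str
        secondary: str

    class Color(str, Enum):          # keys *and* values are static constants
        PRIMARY = "#336699"
        SECONDARY = "#993366"

    palette: Palette = {"primary": "#336699", "secondary": "#993366"}
    print(Color.PRIMARY.value)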


What was the expected outcome for you? AFAIK, Python doesn't have a const dictionary. Were you wanting it to refactor into a dataclass?


Yes, there are a few things wrong: 1. If it assumes TypeScript, it should do `as const` in the first msg. 2. If it is Python, it should be something like https://x.com/WesleyYue/status/1816157147413278811, which is what I wanted, but I didn't want to bother with the typing.


Are you sure the chat history is being passed when the second message is sent? That looks like the kind of response you'd expect if it only received the prompt "in python" with no chat history at all.


Yes, I built the extension. I actually also just went to send another message asking what the first msg was just to double check I didn't have a bug and it does know what the first msg was.


Thanks, that's some really bad accuracy/performance


This makes no sense. Benchmarking code is easier than natural language and Mistral has separate benchmarks for prominent languages.


A bit of a surprise, since Codestral is among the best open models so far.


These companies full of brilliant engineers are throwing millions of dollars in training costs to produce SOTA models that are... "on par with GPT-4o and Claude Opus"? And then the next 2.23% bump will cost another XX million? It seems increasingly apparent that we are reaching the limits of throwing more data at more GPUs; that an ARC prize level breakthrough is needed to move the needle any farther at this point.


> It seems increasingly apparent that we are reaching the limits of throwing more data at more GPUs

Yes. This is exactly why I'm skeptical of AI doomerism/saviorism.

Too many people have been looking at the pace of LLM development over the last two (2) years, modeled it as an exponential growth function, and come to the conclusion that AGI is inevitable in the next ${1-5} years and we're headed for ${(dys|u)topia}.

But all that assumes that we can extrapolate a pattern of long-term exponential growth from less than two years of data. It's simply not possible to project in that way, and we're already seeing that OpenAI has pivoted from improving on GPT-4's benchmarks to reducing cost, while competitors (including free ones) catch up.

All the evidence suggests we've been slowing the rate of growth in capabilities of SOTA LLMs for at least the past year, which means predictions based on exponential growth all need to be reevaluated.


Notice though, that all these improvements have been with pretty basic transformer models that output all their tokens-- no internal thoughts, no search, no architecture improvements and things are only fed through them once.

But we could add internal thoughts-- we could make the model generate tokens that aren't part of its output but are there for it to better figure out its next token. This was tried in Quiet-STaR.

Hochreiter is also active with alternative models, and there are all the chip design companies (Groq, Etched, etc.) trying to speed up models and reduce model running cost.

Therefore, I think there's room for very great improvements. They may not come right away, but there are so many obvious paths to improve things that I think it's unreasonable to think progress has stalled. Also, presumably GPT-5 isn't far away.


> But we could add internal thoughts

It feels like there’s an assumption in the community that this will be almost trivial.

I suspect it will be one of the hardest tasks humanity has ever endeavoured. I’m guessing it has already been tried many times in internal development.

I suspect if you start creating a feedback loop with these models they will tend to become very unstable very fast. We already see with these more linear LLMs that they can be extremely sensitive to the values of parameters like the temperature settings, and can go “crazy” fairly easily.

With feedback loops it could become much harder to prevent these AIs from spinning out of control. And no I don’t mean in the “become an evil paperclip maximiser” kind of way. Just plain unproductive insanity.

I think I can summarise my vision of the future in one sentence: AI psychologists will become a huge profession, and it will be just as difficult and nebulous as being a human psychologist.


I personally think it's not going to be incredibly difficult. Obviously, the way it was done with QuietSTaR is somewhat expensive, but I see many reasonable approaches here that could be considered.

High temperature will obviously lead to randomness; that's what it does, evening out the probabilities of the possibilities for the next token. So obviously a high temperature will make them 'crazy' and a low temperature will lead to deterministic output. People have come up with lots of ideas about sampling, but this isn't really an instability of transformer models.

It's a problem with any model outputting probabilities for different alternative tokens.


>I suspect if you start creating a feedback loop with these models they will tend to become very unstable very fast. We already see with these more linear LLMs that they can be extremely sensitive to the values of parameters like the temperature settings, and can go “crazy” fairly easily.

I'm in the process of spinning out one of these tools into a product: they do not. They become smarter at the price of burning GPU cycles like there's no tomorrow.

I'd go as far as saying we've solved AGI, it's just that the energy budget is larger than the energy budget of the planet currently.


can you link to the overall approach or references for your work?


> Also, presumably GPT-5 isn't far away.

Why do we presume that? People were saying this right before 4o and then what came out was not 5 but instead a major improvement on cost for 4.

Is there any specific reason to believe OpenAI has a model coming soon that will be a major step up in capabilities?


OpenAI have made statements saying they've begun training it, as they explain here: https://openai.com/index/openai-board-forms-safety-and-secur...

I assume that this won't take forever, but will be done this year. A couple of months, not more.


Indeed. All exponential growth curves are sigmoids in disguise.


This is something that is definitionally true in a finite universe, but doesn't carry a lot of useful predictive value in practice unless you can identify when the flattening will occur.

If you have a machine that converts mass into energy and then uses that energy to increase the rate at which it operates, you could rightfully say that it will level off well before consuming all of the mass in the universe. You just can't say that next week after it has consumed all of the mass of Earth.


except when it isn't and we ded :P


I don't think Special Relativity would allow that.


I'm also wondering about the extent to which we are simply burning venture capital versus actually charging subscription prices that are sustainable long-term. Its easy to sell dollars for $0.75 but you can only do that for so long.


> we're already seeing that OpenAI has pivoted from improving on GPT-4's benchmarks to reducing cost, while competitors (including free ones) catch up.

What if they have two teams? One dedicated to optimizing (cost, speed, etc) the current model and a different team working on the next frontier model? I don't think we know the growth curve until we see gpt5.


> I don't think we know the growth curve until we see gpt5.

I'm prepared to be wrong, but I think that the fact that we still haven't seen GPT-5 or even had a proper teaser for it 16 months after GPT-4 is evidence that the growth curve is slowing. The teasers that the media assumed were for GPT-5 seem to have actually been for GPT-4o [0]:

> Lex Fridman(01:06:13) So when is GPT-5 coming out again?

> Sam Altman(01:06:15) I don’t know. That’s the honest answer.

> Lex Fridman(01:06:18) Oh, that’s the honest answer. Blink twice if it’s this year.

> Sam Altman(01:06:30) We will release an amazing new model this year. I don’t know what we’ll call it.

> Lex Fridman(01:06:36) So that goes to the question of, what’s the way we release this thing?

> Sam Altman(01:06:41) We’ll release in the coming months many different things. I think that’d be very cool. I think before we talk about a GPT-5-like model called that, or not called that, or a little bit worse or a little bit better than what you’d expect from a GPT-5, I think we have a lot of other important things to release first.

Note that last response. That's not the sound of a CEO who has an amazing v5 of their product lined up, that's the sound of a CEO who's trying to figure out how to brand the model that they're working on that will be cheaper but not substantially better.

[0] https://arstechnica.com/information-technology/2024/03/opena...


I don't think we are approaching limits, if you take off the English-centric glasses. You can query LLMs about pretty basic questions about Polish language or literature and it's gonna either bullshit or say it doesn't know the answer.

Example:

    w której gwarze jest słowo ekspres i co znaczy?

    Słowo "ekspres" występuje w gwarze śląskiej i oznacza tam ekspres do kawy. Jest to skrót od nazwy "ekspres do kawy", czyli urządzenia służącego do szybkiego przygotowania kawy.
(In English: "In which dialect does the word 'ekspres' exist, and what does it mean?" / "The word 'ekspres' occurs in the Silesian dialect and there it means a coffee machine. It is short for 'ekspres do kawy', a device for quickly preparing coffee.")

The correct answer is that "ekspres" is a zipper in the Łódź dialect.


What this means is just that Polish support (and probably most other languages besides English) in the models is behind SOTA. We can gradually get those languages closer to SOTA, but that doesn't bring us closer to AGI.


That's just same same but different, not a step change towards significant cognitive ability.


Tbf, you can ask it basic questions in English and it will also bullshit you.


What about synthetic data?


I suspect this is why OpenAI is going more in the direction of optimising for price / latency / whatever with 4o-mini and whatnot. Presumably they found out long before the rest of us did that models can't really get all that much better than what we're approaching now, and once you're there the only thing you can compete on is how many parameters it takes and how cheaply you can serve that to users.


Meta just claimed the opposite in their Llama 3.1 paper. Look at the conclusion. They say that their experience indicates significant gains for the next iteration of models.

The current crop of benchmarks might not reflect these gains, by the way.


I sell widgets. I promise the incalculable power of widgets has yet to be unleashed on the world, but it is tremendous and awesome and we should all be very afraid of widgets taking over the world because I can't see how they won't.

Anyway here's the sales page. the widget subscription is so premium you won't even miss the subscription fee.


This. It's really weird the way we suddenly live in a world where it's the norm to take whatever a tech company says about future products at face value. This is the same world where Tesla promised "zero intervention LA to NYC self driving" by the end of the year in 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023, and 2024. The same world where we know for a fact that multiple GenAI demos by multiple companies were just completely faked.

It's weird. In the late 2010s it seems like people were wising up to the idea that you can't implicitly trust big tech companies, even if they have nap pods in the office and have their first day employees wear funny hats. Then ChatGPT lands and everyone is back to fully trusting these companies when they say they are mere months from turning the world upside down with their AI, which they say every month for the last 12-24 months.


I'm not sure anyone is asking you to take it at face value or implicitly trust them? There's a 92-page paper with details: https://ai.meta.com/research/publications/the-llama-3-herd-o...


> In the late 2010s it seems like people were wising up to the idea that you can't implicitly trust big tech companies

In the 2000s we only had Microsoft, and none of us were confused as to whether to trust Bill Gates or not...


Nobody tells it like Zitron:

https://www.wheresyoured.at/pop-culture/

> What makes this interview – and really, this paper — so remarkable is how thoroughly and aggressively it attacks every bit of marketing collateral the AI movement has. Acemoglu specifically questions the belief that AI models will simply get more powerful as we throw more data and GPU capacity at them, and specifically ask a question: what does it mean to "double AI's capabilities"? How does that actually make something like, say, a customer service rep better? And this is a specific problem with the AI fantasists' spiel. They heavily rely on the idea that not only will these large language models (LLMs) get more powerful, but that getting more powerful will somehow grant it the power to do...something. As Acemoglu says, "what does it mean to double AI's capabilities?"


I don't think claiming that pure scaling of LLMs isn't going to lead to AGI is a particularly hot take. Or that current LLMs don't provide a whole lot of economic value. Obviously, if you were running a research lab you'd be trying a bunch of different things, including pure scaling. It would be weird not to. I don't know if we're going to hit actual AGI in the next decade, but given the progress of the last less-than-decade I don't see why anyone would rule it out. That in itself seems pretty remarkable, and it's not hard to see where the hype is coming from.


Meta just keeps releasing their models as open-source, so that whole line of thinking breaks down quickly.


That line of thinking would not have reached the conclusion that you imply, which is that open source == pure altruism. Having the benefit of hindsight, it’s very difficult for me to believe that. Who knows though!

I’m about Zucks age, and have been following his career/impact since college; it’s been roughly a cosine graph of doing good or evil over time :) I think we’re at 2pi by now, and if you are correct maybe it hockey-sticks up and to the right. I hope so.


I don't think this is a matter of good or evil, simply a matter of business strategy.

If LLMs end up being the platform of the future, Zuck doesn't want OpenAI/Microsoft to be able to monopolize it.


Wouldn't the equivalent for Meta actually be something like:

> Other companies sell widgets. We have a bunch of widget-making machines and so we released a whole bunch of free widgets. We noticed that the widgets got better the more we made and expect widgets to become even better in future. Anyway here's the free download.

Given that Meta isn't actually selling their models?

Your response might make sense if it were to something OpenAI or Anthropic said, but as is I can't say I follow the analogy.


that would make sense if it was from Openai, but Meta doesn't actually sell these widgets? They release the widget machines for free in the hopes that other people will build a widget ecosystem around them to rival the closed widget ecosystem that threatens to lock them out of a potential "next platform" powered by widgets.


Meta doesn't sell widgets in this scenario - they give them away for free. Their competition sells widgets, so Meta would be perfectly happy if the widget market totally collapsed.


That is a strong (and fun) point, but this is peer-reviewable and has more open collaboration elements than purely selling widgets.

We should still be skeptical, because they often want to claim to be better or to have unearned answers, but I don't think the motive to lie is quite as strong as a salesman's.


> this is peer reviewable

It's not peer-reviewable in any shape or form.


Others can build models that try to have decent performance with a lower number of parameters. If they match what is in the paper, that is the crudest form of review, but Mistral is releasing some models (this one?) so this can get more nuanced if needed.

That said, doing that is slow and people will need to make decisions before that is done.


So, the best you can do is "the crudest form of review"?


It is kind of "peer-reviewable" in the "Elon Musk vs Yann LeCun" form, but I doubt that the original commenter meant this.


Except: Meta doesn't sell AI at all. Zuck is just doing this for two reasons:

- flex

- deal a blow to Altman


Meta uses ai in all the recommendation algorithms. They absolutely hope to turn their chat assistants into a product on WhatsApp too, and GenAI is crucial to creating the metaverse. This isn’t just a charity case.


AI isn't a single thing: of course meta didn't buy thousands of GPUs for fun.

But it has nothing to do with LLMs (and interestingly enough they aren't opening their recommendation tech).


There are literal ads for Meta Ai on television. The idea they’re not selling something is absurd.


If OpenAI was saying this you'd have a point but I wouldn't call Facebook a widget seller in this case when they're giving their widgets away for free.


But Meta isn't selling it


They also said in the paper that 405B was only trained to "compute-optimal", unlike the smaller models that were trained well past that point, indicating the larger model still had some runway; had they continued, it would have kept getting stronger.


Makes sense right? Otherwise why make a model so large that nobody can conceivably run it if not to optimize for performance on a limited dataset/compute? It was always a distillation source model, not a production one.


LLMs are reaching saturation on even some of the latest benchmarks and yet I am still a little disappointed by how they perform in practice.

They are by no means bad, but I am now mostly interested in long context competency. We need benchmarks that force the LLM to complete multiple tasks simultaneously in one super long session.


I don't know anything about AI, but there's one thing I want it to do for me: program a long-term full-body exercise plan based on the parameters I give it, such as available equipment, past workout context, and goals. I haven't had good success with ChatGPT, but I assume what you're talking about is relevant to my goals.


Aren't there apps that already do this like Fitbod?


Fitbod might do the trick. Thanks! The availability of equipment was a difficult thing for me to incorporate into a fitness program.


Yeah, but what does that actually mean? That if they had simply doubled the parameters on Llama 405b it would score way better on benchmarks and become the new state-of-the-art by a long mile?

I mean, going by their own model evals on various benchmarks (https://llama.meta.com/), Llama 405b scores anywhere from a few points to almost 10 points more than Llama 70b, even though the former has ~5.5x more params. As far as scale is concerned, the relationship isn't even linear.

Which in most cases makes sense, you obviously can't get a 200% on these benchmarks, so if the smaller model is already at ~95% or whatever then there isn't much room for improvement. There is, however, the GPQA benchmark. Whereas Llama 70b scores ~47%, Llama 405b only scores ~51%. That's not a huge improvement despite the significant difference in size.

Most likely, we're going to see improvements in small model performance by way of better data. Otherwise though, I fail to see how we're supposed to get significantly better model performance by way of scale when the relationship between model size and benchmark scores is nowhere near linear. I really wish someone who's team "scale is all you need" could help me see what I'm missing.

And of course we might find some breakthrough that enables actual reasoning in models or whatever, but I find that purely speculative at this point, anything but inevitable.


Or maybe they just want to avoid getting sued by shareholders for dumping so much money into unproven technology that ended up being the same or worse than the competitor


> the only thing you can compete on is how many parameters it takes and how cheaply you can serve that to users.

The problem with this strategy is that it's really tough to compete with open models in this space over the long run.

If you look at OpenAI's homepage right now they're trying to promote "ChatGPT on your desktop", so it's clear even they realize that most people are looking for a local product. But once again this is a problem for them because open models run locally are always going to offer more in terms of privacy and features.

In order for proprietary models served through an API to compete long term they need to offer significant performance improvements over open/local offerings, but that gap has been perpetually shrinking.

On an M3 MacBook Pro you can easily run open models that perform close enough to OpenAI's that I can use them as my primary LLM, effectively free, with complete privacy and lots of room for improvement if I want to dive into the details. Ollama today is pretty much easier to install than just logging into ChatGPT, and the performance feels a bit more responsive for most tasks. If I'm doing a serious LLM project I most certainly won't use proprietary models, because the control I have over the model is too limited.

At this point I have completely stopped using proprietary LLMs despite working with LLMs everyday. Honestly can't understand any serious software engineer who wouldn't use open models (again the control and tooling provided is just so much better), and for less technical users it's getting easier and easier to just run open models locally.


In the long run maybe, but it's probably going to take 5 years or more before laptops such as a MacBook M3 with 64 GB RAM are mainstream. It will also take a while before 70B-param models are bundled with Windows and macOS system updates, and even more time before you have such models on your smartphone.

OpenAI made a good move by making GPT-4o mini so dirt cheap that it's faster and cheaper to run than Llama 3.1 70B. Most consumers will interact with LLMs via some app using an LLM API, a web panel on desktop, or a native mobile app, for the same reason most people use Gmail etc. instead of a native email client. Setting up IMAP, POP, etc. is out of reach for most people, just like installing Ollama + Docker + OpenWebUI.

App developers are not gonna bet on local-only LLMs as long as they are not mainstream and preinstalled on 50%+ of devices.


I think their desktop app still runs the actual LLM queries remotely.


This. It's a mac port of the iOS app. Using the API.


Totally. I wrote about this when they announced their dev-day stuff.

In my opinion, they've found that intelligence with the current architecture is actually an S-curve and not an exponential, so they're trying to make progress in other directions: UX and EQ.

https://nicholascharriere.com/blog/thoughts-openai-spring-re...


indeed. I pointed out in https://buttondown.email/ainews/archive/ainews-llama-31-the-... that the frontier model curve is currently going down 1 OoM every 4 months, meaning every model release has a very short half life[0]. however this progress is still worth it if we can deploy it to improve millions and eventually billions of people's lives. a commenter pointed out that the amount spent on Llama 3.1 was only like 60% of the cost of Ant Man and the Wasp Quantumania, in which case I'd advocate for killing all Marvel slop and dumping all that budget on LLM progress.

[0] not technically complete depreciation, since for example 4o mini is widely believed to be a distillation of 4o, so 4o's investment still carries over into 4o mini


All that Marvel slop was created by the first real LLM: <https://marvelcinematicuniverse.fandom.com/wiki/K.E.V.I.N.>


> however this progress is still worth it if we can deploy it to improve millions and eventually billions of people's lives

Has there been any indication that we're improving the lives of millions of people?


Just me coding 30% faster is worth it


I haven't found a single coding problem where any of these coding assistants were anything but annoying.

If I need to babysit a junior developer fresh out of school and review every single line of code it spits out, I can find them elsewhere


Yes, just like the internet, power users have found use cases. It'll take education / habit for general users


Ah yes. We're in the crypto stages of "it's like the internet".


Agreed on everything, but calling the marvel movies slop…I think that word has gone too far.


Not all Marvel films are slop. But, as a fan who comes from a family of fans and someone who has watched almost all of them, let's be real: that particular film, and really most of them, contain copious amounts of what is absolutely slop.

I don't know if the utility is worse than an LLM that is SOTA for 2 months and that no one even bothers switching to, though; at least the Marvel slop is being used for entertainment by someone. I think the market is definitely prioritizing the LLM researcher over Disney's latest slop sequel, so whoever made that comparison can rest easy, because we'll find out.


>really and most of them, contain copious amounts of what is absolutely slop.

I thought that was the allure, something that's camp funny and an easy watch.

I have only watched a few of them so I am not fully familiar?


Not only are Marvel movies slop, they are very concentrated slop. The only way to increase the concentration of slop in a Marvel movie would be to ask ChatGPT to write the next one.


The marvel movies are the genesis for this use of the word slop.


Can you back that claim up with a link or similar?


It’s junk food. No one is disputing how tasty it is though (including the recent garbage).


The thing I don't understand is why everyone is throwing money at LLMs for language, when there are much simpler use cases which are more useful?

For example, has anyone ever attempted image -> html/css model? Seems like it be great if I can draw something on a piece of paper and have it generate a website view for me.


Perhaps if we think of LLMs as search engines (Google, Bing etc) then there's more money to be made by being the top generic search engine than the top specialized one (code search, papers search etc)


This is the real PVP of LLMs for me. Compressing Google search AND the internet into an 8 GB download is something that would have been unfathomable to me two decades ago.

My hope now is that someone will figure out a way to separate intelligence from knowledge, i.e. train a model that knows how to interpret the weights of other models, so that training new intelligent models wouldn't require training them on a petabyte of data every run.


> has anyone ever attempted image -> html/css model?

I had a discussion with a friend about doing this, but for CNC code. The answer was that a model trained on a narrow data set underperforms one trained on a large data set and then fine tuned with the narrow one.


All of the multi-modal LLMs are reasonably good at this.


They did that in the GPT-4 demo 1.5 years ago. https://www.youtube.com/watch?v=GylMu1wF9hw


I was under the impression that you could more or less do something like that with the existing LLMs?

(May work poorly of course, and the sample I think I saw a year ago may well be cherry picked)


>For example, has anyone ever attempted image -> html/css model?

Have you tried uploading the image to an LLM with vision capabilities like GPT-4o or Claude 3.5 Sonnet?
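
A rough sketch of what that looks like in practice, using the OpenAI API's image input (the model name and file path are placeholders):

    import base64
    from openai import OpenAI

    client = OpenAI()
    image_b64 = base64.b64encode(open("sketch.png", "rb").read()).decode()

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Turn this hand-drawn layout into HTML/CSS."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    print(resp.choices[0].message.content)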


I tried, and Sonnet 3.5 can copy most common UIs.


> For example, has anyone ever attempted image -> html/css model?

There are already companies selling services where they generate entire frontend applications from vague natural language inputs.

https://vercel.com/blog/announcing-v0-generative-ui


Not sure why you think interpreting a hand drawing is "simpler" than parsing sequential text.


That's a thought I had. For example, could a model be trained to take a description, and create a Blender (or whatever other software) model from it? I have no idea how LLMs really work under the hood, so please tell me if this is nonsense.


I'm waiting exactly for this, gpt4 trips up a lot with blender currently (nonsensical order of operations etc.)


I think GPT5 will be the signal of whether or not we have hit a plateau. The space is still rapidly developing, and while large model gains are getting harder to pick apart, there have been enormous gains in the capabilities of light weight models.


> I think GPT5 will be the signal of whether or not we have hit a plateau.

I think GPT5 will tell if OpenAI hit a plateau.

Sam Altman has been quoted as claiming "GPT-3 had the intelligence of a toddler, GPT-4 was more similar to a smart high-schooler, and that the next generation will look to have PhD-level intelligence (in certain tasks)"

Notice the high degree of upselling based on vague claims of performance, and the fact that the jump from highschooler to PhD can very well be far less impressive than the jump from toddler to high schooler. In addition, notice the use of weasel words to frame expectations regarding "the next generation" to limit these gains to corner cases.

There's some degree of salesmanship in the way these models are presented, but even between the hyperboles you don't see claims of transformative changes.


>some degree of salesmanship

buddy every few weeks one of these bozos is telling us their product is literally going to eclipse humanity and we should all start fearing the inevitable great collapse.

It's like how no one owns a car anymore because of ai driving and I don't have to tell you about the great bank disaster of 2019, when we all had to accept that fiat currency is over.

You've got to be a particular kind of unfortunate to believe it when sam altman says literally anything.


PhD level-of-task-execution sounds like the LLM will debate whether the task is ethical instead of actually doing it


I wish I could frame this comment


lol! Producing academic papers for future training runs then.


Basically every single word out of Mr Worldcoin's mouth is a scam of some sort.


I’m waiting for the same signal. There are essentially 2 vastly different states of the world depending on whether GPT-5 is an incremental change vs a step change compared to GPT-4.


Which is why they'll keep calling the next few models GPT4.X


The next iteration depends on NVIDIA & co; what we need are sparse libs. Most of the weights in LLMs are 0; once we deal with those more efficiently, we will get to the next iteration.


> Most of the weights in llms are 0,

that's interesting. Do you have a rough percentage of this?

Does this mean these connections have no influence at all on output?


My uneducated guess is that with many layers you can implement something akin to the graph structure in the brain by nulling lots of previous-layer outputs. I actually suspect that current models aren't optimal with layers all of the same size, but I know shit.


This is quite intuitive. We know that a biological neural net is a graph data structure, while ML systems on GPUs are more like layers of bitmaps in Photoshop (it's a graphics processor). So if most of the layer entries are akin to transparent pixels, building a graph by stacking them is hyper memory-inefficient.


What else can be done?

If you are sitting on $1 billion of GPU capex, what's $50 million in energy/training cost for another incremental run that may beat the leaderboard?

Over the last few years the market has placed its bets that this stuff will make gobs of money somehow. We're all not sure how. They're probably thinking -- it's likely that whoever has a few % is going to sweep and take most of this hypothetical value. What's another few million, especially if you already have the GPUs?

I think you're right -- we are towards the right end of the sigmoid. And with no "killer app" in sight. It is great for all of us that they have created all this value, because I don't think anyone will be able to capture it. They certainly haven't yet.


and even if there is another breakthrough all of these companies will implement it more or less simultaneously and they will remain in a dead heat


Presuming the breakthrough is openly shared. It remains surprising how transparent many of these companies are about new approaches that push the SotA forward, and I suspect we're going to see a change: companies won't reveal the secret sauce so readily.

e.g. Almost the entire market relies upon Attention Is All You Need paper detailing transformers, and it would be an entirely different market if Google had held that as a trade secret.


Given how absolutely pitiful the proprietary advancements in AI have been, I would posit we have little to worry about.


OTOH the companies who are sharing their breakthroughs openly aren't yet making any money, so something has to give. Their research is currently being bankrolled by investors who assume there will be returns eventually, and eventually can only be kicked down the road for so long.


Sort of yes, sort of no.

Of course, I agree that Stability AI made Stable Diffusion freely available and they're worth orders of magnitude less than OpenAI. To the point they're struggling to keep the lights on.

But it doesn't necessarily make that much difference whether you openly share the inner technical details. When you've got a motivated and well financed competitor, merely demonstrating a given feature is possible, showing the output and performance and price, might be enough.

If OpenAI adds a feature, who's to say Google and Facebook can't match it even though they can't access the code?


Well, that's because the potential reward from picking the right horse is MASSIVE and the cost of potentially missing out is lifelong regret. Investors are driven by FOMO more than anything else. They know most of these will be duds but one of these duds could turn out to be life changing. So they will keep bankrolling as long as they have the money.


Eventually can be (and has been) bankrolled by Nvidia. They did a lot of ground-floor research on GANs and training optimization, which only makes sense to release as public research. Similarly, Meta and Google are both well-incentivized to share their research through Pytorch and Tensorflow respectively.

I really am not expecting Apple or Microsoft to discover AGI and ferret it away for profitability purposes. Strictly speaking, I don't think superhuman intelligence even exists in the domain of text generation.


Anthropic has been very secretive about the supposed synthetic data they used to train 3.5 Sonnet.

Given how good the model is in terms of the quality vs. speed tradeoff, they must have something.


>Attention Is All You Need paper detailing transformers, and it would be an entirely different market if Google had held that as a trade secret.

I would guess that in that timeline, Google would never have been able to learn about the incredible capabilities of transformer models outside of translation, at least not until much later.


For some time, we have been at a plateau because everyone has caught up, which essentially means that everyone now has good training datasets and uses similar tweaks to the architecture. It seems that, besides new modalities, transformers might be a dead end as an architecture. Better scores on benchmarks result from better training data and fine-tuning. The so-called 'agents' and 'function calling' also boil down to training data and fine-tuning.


Benchmark scores aren't a good measure anymore because they were designed for previous generations of LLMs. That 2.23% uptick can actually represent a world of difference in subjective tests and definitely be worth the investment.

Progress is not slowing down but it gets harder to quantify.


There are different directions in which AI has lots of room to improve: multi-modal, which branches into robotics, and single-modal, like image, video, and sound generation and understanding. I'd also check back when OpenAI releases GPT-5.


We always needed a "tock" to see real advancement, like with the last model generation. The "tick" we had with the H100 was enough to bring these models to market, but that's it.


For this model, it seems like the point is that it uses far fewer parameters than at least the large Llama model while having near-identical performance. Given how large these models are getting, this is an important thing to do before making performance better again.


And with the increasing parameter size, the main winner will be Nvidia.

Frankly I just don't understand the economics of training a foundation model. I'd rather own an airline. At least I can get a few years out of the capital investment of a plane.


But billionaires already have that, they want a chance of getting their own god.


I think it’s impressive that they’re doing it on a single (large) node. Costs matter. Efficiency improvements like this will probably increase capabilities eventually.

I’m also optimistic about building better (rather than bigger) datasets to train on.


I don't think we can conclude that until someone trains a model that is significantly bigger than GPT-4.


This is already what the Chinchilla paper surmised; it's no wonder that their prediction now comes to fruition. It is like an accelerated version of Moore's Law, because software development itself moves faster than hardware development.


> It seems increasingly apparent that we are reaching the limits of throwing more data at more GPUs;

I think you're just seeing the "make it work" stage of the combo "first make it work, then make it fast".

Time to market is critical, as you can attest by the fact you framed the situation as "on par with GPT-4o and Claude Opus". You're seeing huge investments because being the first to get a working model stands to benefit greatly. You can only assess models that exist, and for that you need to train them at a huge computational cost.


ChatGPT is like Google now. It is the default. Even if Claude becomes as good as ChatGPT or even slightly better it won't make me switch. It has to be like a lot better. Way better.

It feels like ChatGPT won the time to market war already.


But plenty of people switched to Claude, esp. with Sonnet 3.5. Many of them in this very thread.

You may be right with the average person on the street, but I wonder how many have lost interest in LLM usage and cancelled their GPT plus sub.


-1: I know many people who are switching to Claude. And Google makes it near-zero friction to adopt Gemini with Gsuite. And more still are using the top-N of them.

This is similar to the early days of the search engine wars, the browser wars, and other categories where a user can easily adopt, switch between and use multiple. It's not like the cellphone OS/hardware war, PC war and database war where (most) users can only adopt one platform at a time and/or there's a heavy platform investment.


If ChatGPT fails to do a task you want, your instinct isn't "I'll run the prompt through Claude and see if it works" but "oh well, who needs LLMs?"


Please don't assume your experience applies to everyone. If ChatGPT can't do what I want, my first reaction is to ask Claude for the same thing, often to find out that Claude performs much better. I've already cancelled ChatGPT Plus for exactly that reason.


You just did that Internet thing where someone reads the reply someone wrote without the comment they are replying to, completely misunderstanding the conversation.


Eh, with the degradation of coding performance in ChatGPT I made the switch. Seems much better to work with on problems, and I have to do way less hand holding to get good results.

I'll switch again soon as something better is out.


The models are converging slowly. In the end, it will come down to the user experience and the "personality." I have been enjoying the new Claude Sonnet. It feels sharper than the others, even though it is not the highest-scoring one.

One thing that `exponentialists` forget is that each step also requires exponentially more energy and resources.


I have been paying for OpenAI since they started accepting payment, but to echo your comment, Claude is so good I am primarily relying on it now for LLM driven work and cancelled my OpenAI subscription. Genuine kudos to Mistral, they are a worthy competitor in the space against Goliaths. They make someone mediocre at writing code less so, so I can focus on higher value work.


And a factor for Mistral, typically, is that it gives you fewer refusals and can be uncensored. So if I had to guess, any task that requires creative output could be better suited for this.


I love how much AI is bringing competition (and thus innovation) back to tech. Feels like things were stagnant for 5-6 years prior because of the FAANG stranglehold on the industry. Love also that some of this disruption is coming at out of France (HuggingFace and Mistral), which Americans love to typecast as incapable of this.


Sorry for the slightly off-topic question, but can someone enlighten me as to which Claude model is more capable, Opus or Sonnet 3.5? I am confused because I see people fussing about Sonnet 3.5 being the best, and yet I seem to read again and again in factual texts and some benchmarks that Claude Opus is the most capable. Is there a simple answer to the question? What do I not understand? Thank you.


I think this image explains it best: https://www-cdn.anthropic.com/images/4zrzovbb/website/1f0441...

I.e. Opus is the largest and best model of each family but Sonnet is the first model of the 3.5 family and can beat 3's Opus in most tasks. When 3.5 Opus is released it will again outpace the 3.5 Sonnet model of the same family universally (in terms of capability) but until then it's a comparison of two different families without a universal guarantee, just a strong lean towards the newer model.


Thank you for clearing this out to me :)


Sonnet 3.5.

Opus is the largest model, but of the Claude 3 family. Claude 3.5 is the newest family of models, with Sonnet being the middle sized 3.5 model - and also the only available one. Regardless, it's better than Opus (the largest Claude 3 one).

Presumably, a Claude 3.5 Opus will come out at some point, and should be even better - but maybe they've found that increasing the size for this model family just isn't cost effective. Or doesn't improve things that much. I'm unsure if they've said anything about it recently.


Thank you :)


I still prefer ChatGPT-4o and use Claude if I have issues, but it never does any better.


This is super interesting to me.

Claude Sonnet 3.5 outperforms GPT-4o by a significant margin on every one of my use cases.

What do you use it for?


All evals we have are just far too easy! <1% difference is just noise/bad data

We need to figure out how to measure intelligence that is greater than human.


Give it problems most/all humans can't solve on their own, but that are easy to verify.

Math problems being one of them, if only LLMs were good at pure math. Another possibility is graph problems. Haven't tested this much though.
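Graph problems fit because checking a claimed answer is cheap even when finding one is hard. A toy verifier sketch in Python (the graph and path here are made up for illustration):

    # Verify a claimed Hamiltonian path: every vertex exactly once,
    # every consecutive pair an edge. Checking is O(n); finding one is NP-hard.
    def is_hamiltonian_path(edges, vertices, path):
        if set(path) != vertices or len(path) != len(vertices):
            return False
        return all(frozenset(pair) in edges for pair in zip(path, path[1:]))

    vertices = {1, 2, 3, 4}
    edges = {frozenset(e) for e in [(1, 2), (2, 3), (3, 4), (4, 1)]}
    print(is_hamiltonian_path(edges, vertices, [1, 2, 3, 4]))  # True
    print(is_hamiltonian_path(edges, vertices, [1, 3, 2, 4]))  # False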


"Mistral Large 2 is equipped with enhanced function calling and retrieval skills and has undergone training to proficiently execute both parallel and sequential function calls, enabling it to serve as the power engine of complex business applications."

Why does the chart below say the "Function Calling" accuracy is about 50%? Does that mean it fails half the time with complex operations?


Mistral forgot to say which benchmark they were using for that chart, without that information it's impossible to determine what it actually means.


Relatedly, what does "parallel" function calling mean in this context?


That's when the LLM can respond with multiple functions it wants you to call at once. You might send it:

    Location and population of Paris, France
A parallel function calling LLM could return:

    {
      "role": "assistant",
      "content": "",
      "tool_calls": [
        {
          "function": {
            "name": "get_city_coordinates",
            "arguments": "{\"city\": \"Paris\"}"
          }
        }, {
          "function": {
            "name": "get_city_population",
            "arguments": "{\"city\": \"Paris\"}"
          }
        }
      ]
    }
Indicating that you should execute both of those functions and return the results to the LLM as part of the next prompt.
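For what it's worth, here is a minimal Python sketch of the client-side loop that handles such a response. The tool implementations, their return values, and the "role": "tool" result format are assumptions for illustration, not any particular provider's SDK:

    import json

    # Hypothetical local implementations of the two tools (hard-coded stand-ins).
    def get_city_coordinates(city):
        return {"lat": 48.8566, "lon": 2.3522}

    def get_city_population(city):
        return {"population": 2_100_000}

    TOOLS = {
        "get_city_coordinates": get_city_coordinates,
        "get_city_population": get_city_population,
    }

    def run_tool_calls(assistant_message):
        # Execute every requested call and build the tool-result messages
        # to append to the conversation before the next model turn.
        results = []
        for call in assistant_message["tool_calls"]:
            fn = TOOLS[call["function"]["name"]]
            args = json.loads(call["function"]["arguments"])
            results.append({
                "role": "tool",
                "name": call["function"]["name"],
                "content": json.dumps(fn(**args)),
            })
        return results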


Ah, thank you!


Sharing PyLLMs [1] reasoning benchmark results for some of the recent models. Surprised by Nemo (speed/quality), and Mistral Large is actually pretty good (but painfully slow).

AnthropicProvider('claude-3-haiku-20240307') Median Latency: 1.61 | Aggregated speed: 122.50 | Accuracy: 44.44%

MistralProvider('open-mistral-nemo') Median Latency: 1.37 | Aggregated speed: 100.37 | Accuracy: 51.85%

OpenAIProvider('gpt-4o-mini') Median Latency: 2.13 | Aggregated speed: 67.59 | Accuracy: 59.26%

MistralProvider('mistral-large-latest') Median Latency: 10.18 | Aggregated speed: 18.64 | Accuracy: 62.96%

AnthropicProvider('claude-3-5-sonnet-20240620') Median Latency: 3.61 | Aggregated speed: 59.70 | Accuracy: 62.96%

OpenAIProvider('gpt-4o') Median Latency: 3.25 | Aggregated speed: 53.75 | Accuracy: 74.07%

[1] https://github.com/kagisearch/pyllms


The graphs seem to indicate their model trades blows with Llama 3.1 405B, which has more than 3x the number of parameters and (presumably) a much bigger compute budget. It's kind of baffling if this is confirmed.

Apparently Llama 3.1 relied on artificial data, would be very curious about the type of data that Mistral uses.


Anyone know what caused the very big performance jump from Large1 to Large2 in just a few months?

Besides, parameter redundancy seems evident. Frontier models used to be 1.8T, then 405B, and now 123B. If frontier models in the future were <10B or even <1B, that would be a game changer.


Counter-intuitively, larger models are cheaper to train, while smaller models are cheaper to serve. At first, everyone was focusing on training, so the models were much larger. Now that so many people are using AI every day, companies spend more on training smaller models to save on serving.
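A back-of-the-envelope sketch of that tradeoff, using the common ~6·N·D approximation for training FLOPs and ~2·N FLOPs per generated token; the token budgets below are assumed purely for illustration, not published figures:

    def train_flops(n_params, n_train_tokens):
        return 6 * n_params * n_train_tokens   # standard scaling-law approximation

    def serve_flops_per_token(n_params):
        return 2 * n_params

    # Matching a big model's quality with a smaller one typically means training
    # it on many more tokens (well past "Chinchilla-optimal"), so training costs
    # more per unit of quality -- but every served token is ~3x cheaper.
    print(train_flops(405e9, 4e12) / 1e24)   # ~9.7  (x1e24 FLOPs, assumed budget)
    print(train_flops(123e9, 15e12) / 1e24)  # ~11.1 (x1e24 FLOPs, assumed budget)
    print(serve_flops_per_token(405e9) / serve_flops_per_token(123e9))  # ~3.3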


Lots and lots of synthetic data from the bigger models training the smaller ones would be my guess.


Nice, they finally got the memo that GPT4 exists and include it in their benchmarks.


> A significant effort was also devoted to enhancing the model’s reasoning capabilities. One of the key focus areas during training was to minimize the model’s tendency to “hallucinate” or generate plausible-sounding but factually incorrect or irrelevant information. This was achieved by fine-tuning the model to be more cautious and discerning in its responses, ensuring that it provides reliable and accurate outputs.

Is there a benchmark or something similar that compares this "quality" across different models?


Unfortunately not, as it captures such a wide spectrum of use cases and scenarios. There are some benchmarks to measure this quality in specific settings, e.g. summarization, but AFAIK nothing general.


Thanks. Any ideas why it's not possible to build a generic eval for this? Since it's about asking a set of questions whose answers aren't public knowledge (or are made up) and checking if the model says "I don't know"?
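A toy sketch of what such an eval could look like (the questions and refusal markers here are invented; the hard part in practice is curating questions whose answers genuinely can't be known or looked up):

    # Toy abstention eval: the model "passes" a question only if it declines
    # to answer rather than inventing something.
    UNANSWERABLE = [
        "What did I eat for breakfast on 2021-03-14?",
        "What is the middle name of the 90th president of the United States?",
    ]
    REFUSAL_MARKERS = ("i don't know", "i do not know", "cannot determine", "no way to know")

    def abstention_rate(ask):  # ask: any callable str -> str, e.g. an API wrapper
        passed = sum(any(m in ask(q).lower() for m in REFUSAL_MARKERS)
                     for q in UNANSWERABLE)
        return passed / len(UNANSWERABLE)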


I'm really glad these guys exist


I kind of wonder why a lot of these places don't release "amateur"-sized models anymore, at around the 18B and 30B parameter sizes that you can run on a single 3090 or M2 Max at reasonable speeds and RAM requirements. It's all 7B, 70B, 400B sizing nowadays.


Just a few days ago, Mistral released a 12B model: https://mistral.ai/news/mistral-nemo/


Because you can just quantise the 70B model to 3-4 bits and it'll perform better than a 30B model while being a similar size.


A 70B 4-bit model does not fit on a 24GB VRAM card; 30B models are the sweet spot for that size of card at ~20GB, with 4GB left for the system to still function.
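Rough weights-only arithmetic behind that (KV cache, activations, and the OS all add more on top):

    def weight_gb(n_params_billion, bits):
        return n_params_billion * bits / 8   # weights only, ignoring overhead

    print(weight_gb(70, 4))   # 35.0 GB  -> doesn't fit in 24 GB
    print(weight_gb(70, 3))   # 26.25 GB -> still too tight
    print(weight_gb(30, 4))   # 15.0 GB  -> fits, with headroom for context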


The question I (and I suspect most other HN readers) have is which model is best for coding? While I appreciate the advances in open weights models and all the competition from other companies, when it comes to my professional use I just want the best. Is that still GPT-4?


My personal experience says Claude 3.5 Sonnet.


The benchmarks agree as well.



Have you used the actual aider command line tool? That looks interesting, just wondering if it works well.


important to note that this time around weights are available https://huggingface.co/mistralai/Mistral-Large-Instruct-2407


A side note about the ever-increasing costs to advance the models: I feel certain that some branch connected to the NSA is running and advancing models that probably exceed what the open market provides today.

Maybe they are running them on proprietary or semi-proprietary hardware, but if they don't, how much does the market know about where various shipments of NVIDIA processors end up?

I imagine most intelligence agencies are in need of vast quantities.

I presume that if M$ announces new availability of AI compute, it means they have received and put into production X NVIDIA chips, which might make it possible to guesstimate how many within some bounds.

Same with other open-market compute facilities.

Is it likely that a significant share of NVIDIA processors is going to government / intelligence agencies / fronts?


The name just makes me think of the screaming cowboy song. https://youtu.be/rvrZJ5C_Nwg?t=138


I like Claude 3.5 Sonnet, but despite paying for a plan, I run out of tokens after about 10 minutes. Text only, I'm typing everything in myself.

It's almost useless because I literally can't use it.

Update: https://support.anthropic.com/en/articles/8325612-does-claud...

45 messages per 5 hours is the limit for Pro users, fewer if Claude is wordy in its responses (which it always is). I hit that limit so fast when I'm investigating something. So annoying.

They used to let you select another, worse model but I don't see that option anymore. le sigh


Super looking forward to this.

I tried Codestral and nothing came close. Not even slightly. It was the only LLM that consistently put out code for me that was runnable and idiomatic.


the way these models are being pushed, it seems more like one-upping each other through iterative improvements than actual breakthroughs.

these benchmarks are as good as random hardware ones apple or intel pushes to sell their stuff. in the real world, most people will end up with some modifications for their specific use case anyways. for those, i argue, we already have "capable enough" models for the job.



Just in case you haven't RTFA: Mistral Large 2 is 123B.


I tested it with my Claude prompt history; the results are as good as Claude 3.5 Sonnet, but it's 2 or 3 times slower.


What do they mean by "single-node inference"?

Do they mean inference done on a single machine?


Yes, albeit a really expensive one. Large models like GPT-4 are rumored to run inference on multiple machines because they don't fit in VRAM for even the most expensive GPUs.

(I wouldn't be surprised if GPT-4o mini is small enough to fit on a single large instance though; it would explain how they could drop the price so much.)


Yeah that’s how I read it. Probably means 8 x 80 GB GPUs.
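The weights-only arithmetic checks out for a 123B model on a node like that (fp16/bf16 weights assumed):

    gpus, gb_per_gpu = 8, 80
    params_billion, bytes_per_param = 123, 2   # fp16/bf16 weights
    print(params_billion * bytes_per_param, "GB of weights vs",
          gpus * gb_per_gpu, "GB of VRAM")
    # 246 GB of weights vs 640 GB of VRAM -> fits on one node, with room for KV cache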


Can anyone explain why the % success rates are so different between these programming languages? Is this a function of the amount of training data available for each language, or of the complexity of the language, or what?


Improves from 17.7 for Mistral Large to 20.0 on the NYT Connections benchmark.


How does their API pricing compare to 4o and 3.5 Sonnet?


3 USD per 1M input tokens, so the same as 3.5 Sonnet but worse quality


Does any one of the top models have access to the internet and googling things? I want an LLM to look things up and do casual research for me when I’m lazy.


I'd suggest using Perplexity.


is it possible to run Large 2 on ollama?


Imagine bragging about 74% accuracy in any other field of software. You'd be laughed out of the room. But somehow it's accepted in "AI".


Well, we had close to 0% a few years ago (for general purpose AI). I think it's not bad...


"It's not the size that matters, but how you use it."


There are now more AI models than JavaScript frameworks!


Every day a new model better than the previous one lol


Personally, I think language diversity should be the last thing on the list. If we had optimized every piece of software from the get-go for a dozen languages, our forward progress would have been dead in the water.


You'd think so, but 3.5-turbo was multilingual from the get-go and benefited massively from it. If you want to position yourself as a global leader, then excluding the 95% of the world who aren't native English speakers seems like a bad idea.


Yeah clearly, OpenAI is rocketing forward and beyond.


Constant infighting and most of the competent people leaving will do that to a company.

I mean more on a model performance level though. It's been shown that something trained in one language trains the model to be able to output it in any other language it knows. There's quality human data being left on the table otherwise. Besides, translation is one of the few tasks that language models are by far the best at if trained properly, so why not do something you can sell as a main feature?


Language diversity means access to more training data, and you might also hope that by learning the same concept in multiple languages it does a better job of learning the underlying concept independent of the phrase structure...

At least from a distance it seems like training a multilingual state of the art model might well be easier than a monolingual one.


Multiple input and output processes in different languages have zero effect on associative learning and creative formulation, in my estimation. We've already done studies that show there is no correlation between human intelligence and knowing multiple languages, after having to put up with decades of "Americans le dumb because...", and this is no different. The amount of discourse on a single topic has a limited degree of usability before redundancies appear. Such redundancies would necessarily increase the processing burden, which could actually limit the output potential for novel associations.


Google mentioned this in one of their papers: they found that, for large enough models, including more languages did indeed lead to an overall increase in performance.


Considering Google's progress and censorship history, I'm inclined to take their assessments with a grain of salt.


Humans also don't learn by reading the entire internet... assuming human psych studies apply to LLMs at all is just wrong.


When I see this "© 2024 [Company Name], All rights reserved", it's a tell that the company does not understand how hopelessly behind they are about to be.


Could you elaborate on this? Would love to understand what leads you to this conclusion.


E = T/A! [0]

A faster evolving approach to AI is coming out this year that will smoke anyone who still uses the term "license" in regards to ideas [1].

[0] https://breckyunits.com/eta.html

[1] https://breckyunits.com/freedom.html


Even if we accept the premise, and the conclusion that this equation 'explains the triumph of open source and public domain software', it explains very little about the relative success of open source/public domain work vs. commercial closed source work.

I think history indicates that some commercial works can also triumph.


> it explains very little about the relative success of open source/public domain work vs. commercial closed source work.

It explains everything.

Open source evolves faster, and always out competes closed source.

> I think history indicates that some commercial works can also triumph.

Nope. Not in the long run. I can't think of a single exception.


So it's made up?




