
I still feel like the difference between Sonnet and Opus is a bit unclear. Somewhere on Anthropic's website it says that Opus is the most advanced, but in other places it says Sonnet is the most advanced and also the fastest. The UI doesn't make the distinction clear either. Then Perplexity says that Opus is more advanced than Sonnet.

And finally, in the table in the blogpost, Opus isn't even included? It seems to me like Opus is the best model they have, but they don't want people to default to using it; maybe the ROI is lower on Opus or something?

When I manually tested it, I felt like Opus gave slightly better replies than Sonnet, but I'm not 100% sure it isn't just placebo.




Opus hasn't yet gotten an update from 3 to 3.5, and if you line up the benchmarks, the Sonnet "3.5 New" model seems to beat it everywhere.

I think they originally announced that Opus would get a 3.5 update, but with every product update they are doing I'm doubting it more and more. It seems like their strategy is to beat the competition on a smaller model that they can train/tune more nimbly and pair it with outside-the-model product features, and it honestly seems to be working.


> Opus hasn't yet gotten an update from 3 to 3.5, and if you line up the benchmarks, the Sonnet "3.5 New" model seems to beat it everywhere

Why isn't Anthropic clearer about Sonnet being better then? Why isn't Opus included in the benchmarks if new Sonnet beats it? Why are they so ambiguous with their language?

For example, https://www.anthropic.com/api says:

> Sonnet - Our best combination of performance and speed for efficient, high-throughput tasks.

> Opus - Our highest-performing model, which can handle complex analysis, longer tasks with many steps, and higher-order math and coding tasks.

And Opus is above/after Sonnet. That to me implies that Opus is indeed better than Sonnet.

But then you go to https://docs.anthropic.com/en/docs/about-claude/models and it says:

> Claude 3.5 Sonnet - Most intelligent model

> Claude 3 Opus - Powerful model for highly complex tasks

Does that mean Sonnet 3.5 is better than Opus even for highly complex tasks, since it's the "most intelligent model"? Or just for everything except "highly complex tasks"?

I don't understand why this seems purposefully ambiguous?


> Why isn't Anthropic clearer about Sonnet being better then?

They are clear about both: Opus > Sonnet, and 3.5 > 3.0. I don't think there is a clear universal better/worse relationship between Sonnet 3.5 and Opus 3.0; which is better is task dependent (though with Opus 3.0 being five times as expensive as Sonnet 3.5, I wouldn't use Opus 3.0 unless Sonnet 3.5 proved clearly inadequate for a task).


> I don't understand why this seems purposefully ambiguous?

I wouldn't attribute this to malice when it can also be explained by incompetence.

Sonnet 3.5 New > Opus 3 > Sonnet 3.5 is generally how they stack up against each other when looking at the total benchmarks.

"Sonnet 3.5 New" has just been announced, and they likely just haven't updated the marketing copy across the whole page yet, and maybe also haven't figured out how to graple with the fact that their new Sonnet model was ready faster than their next Opus model.

At the same time I think they want to keep their options open to either:

A) drop an Opus 3.5 soon that will bring the logic back in order again

B) potentially phase out Opus, and instead introduce new branding for what they called a "reasoning model" like OpenAI did with o1(-preview)


> I wouldn't attribute this to malice when it can also be explained by incompetence.

I don't think it's malice either, but if Opus costs them more to run and they've already set a price they cannot raise, it makes sense that they want people to use models they have a higher net return on; that's just "business sense" and not really malice.

> and they likely just haven't updated the marketing copy across the whole page yet

The API docs have been updated though, which is the second page I linked. It mentions the new model by its full name "claude-3-5-sonnet-20241022", so clearly they've gone through at least that page. Yet the wording remains ambiguous.
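For reference, pinning that dated snapshot via the official anthropic Python SDK looks roughly like this (just a sketch; it assumes ANTHROPIC_API_KEY is set in the environment, and the prompt/max_tokens are placeholders):

    import anthropic

    # Reads ANTHROPIC_API_KEY from the environment by default.
    client = anthropic.Anthropic()

    # Pinning the dated snapshot instead of a bare family name means the model
    # won't silently change underneath you when a "new Sonnet" ships.
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,  # placeholder; pick a limit for your own use case
        messages=[{"role": "user", "content": "Placeholder prompt"}],
    )
    print(message.content[0].text)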

> Sonnet 3.5 New > Opus 3 > Sonnet 3.5 is generally how they stack up against each other when looking at the total benchmarks.

Which ones are you looking at? Since the benchmark comparison in the blogpost itself doesn't include Opus at all.


> Which ones are you looking at? Since the benchmark comparison in the blogpost itself doesn't include Opus at all.

I manually compared it with the values from the benchmarks they published when they originally announced the Claude 3 model family[0].

Not all rows have a 1:1 counterpart in the current benchmarks, but I think it paints a good enough picture.

[0]: https://www.anthropic.com/news/claude-3-family


> B) potentially phase out Opus, and instead introduce new branding for what they called a "reasoning model" like OpenAI did with o1(-preview)

When should we be using the -o OpenAI models? I've not been keeping up and the official information now assumes far too much familiarity to be of much use.


I think it's important to first note that there is a huge difference between the -o models (GPT-4o; GPT-4o mini) and the o1 models (o1-preview; o1-mini).

The -o models are "just" stronger versions of their non-suffixed predecessors. They are the latest (and maybe last?) version of models in the lineage of GPT models (roughly GPT-1 -> GPT-2 -> GPT-3 -> GPT-3.5 -> GPT-4 -> GPT-4o).

The o1 models (not sure what the naming structure for upcoming models will be) are a new family of models that try to excel at deep reasoning, by allowing the models to use an internal (opaque) chain-of-thought to produce better results at the expense of higher token usage (and thus cost) and longer latency.

Personally, I think the use cases that justify the current cost and slowness of o1 are incredibly narrow (e.g. offline analysis of financial documents or deep academic paper research). In most interactive use cases I'd rather opt for GPT-4o or Sonnet 3.5 instead of o1-preview, get the faster response time, and send a follow-up message. Similarly, for non-interactive use cases I'd try to add a layer of tool calling with those faster models rather than use o1-preview.

I think the o1-like models will only really take off if prices come down and it is clearly demonstrated that more "thinking tokens" correlate with predictably better results, results that can compete with the highly tuned prompts and fine-tuned models that are currently expensive to produce in terms of development time.


Agreed with all that, and also, when used via API the o1 models don't currently support system prompts, streaming, or function calling. That rules them out for all of the uses I have.
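To make that concrete, here's a rough sketch of the kind of call that's routine with gpt-4o but (as of this writing) isn't accepted by the o1 models, since it uses a system prompt, streaming, and a tool definition all at once (the get_weather tool is a made-up example):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Hypothetical tool, purely to illustrate function calling.
    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }]

    # System prompt + streaming + tools: all fine with gpt-4o,
    # all currently unsupported by o1-preview / o1-mini.
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a terse assistant."},
            {"role": "user", "content": "What's the weather in Berlin?"},
        ],
        tools=tools,
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta
        if delta.content:
            print(delta.content, end="")
        # If the model decides to call get_weather instead, the fragments arrive
        # in delta.tool_calls; a real app would run the tool and send the result back.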


> The -o models are "just" stronger versions of their non-suffixed predecessors.

Cheaper and faster, but not notably "stronger" at real-world use.


Thank you.


Jesus, maybe they should let the AIs run the product naming.


I think the practical economics of the LLM business are becoming clearer in recent times. Huge models are expensive to train and expensive to run. As long as a smaller model meets the average user's everyday needs, it's probably much more profitable to just continue with multimodal and fine-tuning development on smaller models.


I think the main reason is that they tried training a heavyweight model that was supposed to be Opus 3.5, but it didn't yield large enough improvements over 3.5 Sonnet to justify releasing it. (They had it on their page for a while that Opus was coming soon, and now they've scrapped that.)

This theory is consistent with the other two top players, OpenAI and Google: both were expected to release a heavy model, but instead have just released multiple medium- and small-tier models. It's been a long time since Google released Gemini Ultra 1.0 (the naming clearly implying that they were planning on upgrading it to 1.5 like they did with Pro).

Seeing no one release a heavyweight model, while many small and medium-sized models are released, makes me think that improving models will be much more complicated than scaling with more compute, and that there are likely diminishing returns in that regard.


Opus 3.5 will likely be the answer to GPT-5. Same with Gemini 1.5 Ultra.


Maybe - it would make sense not to release their latest and greatest (Opus 4.0) until competition forces them to, and Amodei has previously indicated that they would rather respond to match frontier SOTA than accelerate the pace of advance themselves by releasing first.


That raises the question: why am I still paying for access to Opus 3?

Honestly I don’t know. I’ve not been using Sonnet 3.5 up to now and I’m a fairly light user so I doubt I’ll run into the free tier limits. I’ll probably cancel my subscription until Opus 3.5 comes out (if it ever does).


Opus is a larger and more expensive model. Presumably 3.5 Opus will be the best but it hasn't been released. 3.5 Sonnet is better than 3.0 Opus kind of like how a newer i5 midrange processor is faster and cheaper than an old high-end i7.


Makes me wonder if perhaps they do have 3.5 Opus trained, but that they're not releasing it because 3.5 Sonnet is already enough to beat the competition, and some combination of "don't want to contribute to an arms race" and "it has some scary capabilities they weren't sure were ready to publish yet".


Anthropic use the names Haiku/Sonnet/Opus for the small/medium/large versions of each generation of their models, so within a generation that is also their performance (and cost) order. Evidently Sonnet 3.5 outperforms Opus 3.0 on at least some tasks, but that is not a same-generation comparison.

I'm wondering at this point if they are going to release Opus 3.5 at all, or maybe skip it and go straight to 4.0. It's possible that Haiku 3.5 is a distillation of Opus 3.5.


By reputation -- I can't vouch for this personally, and I don't know if it'll still be true with this update -- Opus is still often better for things like creative writing and conversations about emotional or political topics.


Yes, (old) 3.5 Sonnet is distinctly worse at emotional intelligence, flexibility, expressiveness and poetry.


Are you also implying that new 3.5 sonnet is better at those things?


No, Opus is better. I have no experience with 3.5.new.


Opus has been stuck on 3.0, so Sonnet 3.5 is better for most things as well as cheaper.


> Opus has been stuck on 3.0, so Sonnet 3.5 is better

So for example, Perplexity is wrong here implying that Opus is better than Sonnet?

https://i.imgur.com/N58I4PC.png


I think as of this announcement that is indeed outdated information.


So Opus, which costs $15.00/$75.00 per 1M tokens (input/output), is now worse than the model that costs $3.00/$15.00?

That's according to https://docs.anthropic.com/en/docs/about-claude/models, which has "claude-3-5-sonnet-20241022" as the latest model (today's date).
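For context on just how big that price gap is, a quick back-of-the-envelope sketch (per-million-token prices from that same docs page; the per-request token counts are made up for illustration):

    # Per-million-token prices (input, output) in USD, from Anthropic's docs/pricing page.
    PRICES = {
        "claude-3-opus": (15.00, 75.00),
        "claude-3-5-sonnet": (3.00, 15.00),
    }

    def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
        """Cost in USD for a single request."""
        in_price, out_price = PRICES[model]
        return input_tokens / 1_000_000 * in_price + output_tokens / 1_000_000 * out_price

    # Made-up workload: 2,000 input tokens and 800 output tokens per request.
    for model in PRICES:
        print(model, f"${request_cost(model, 2_000, 800):.4f} per request")

    # Opus:   0.002 * 15 + 0.0008 * 75 = 0.03  + 0.06  = $0.090 per request
    # Sonnet: 0.002 * 3  + 0.0008 * 15 = 0.006 + 0.012 = $0.018 per request (5x cheaper)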


Yes, you will find similar things at essentially all other model providers.

The older/bigger GPT-4 runs at $30/$60 and performs about on par with GPT-4o mini, which costs only $0.15/$0.60.

If you are currently integrating AI models, or have been in the past ~2 years, you should definitely keep up with model capability/pricing development. If you stay on old models you are certainly overpaying or leaving performance on the table. It's essentially a tax on agility.


> The older/bigger GPT-4 runs at $30/$60 and performs about on par with GPT-4o mini, which costs only $0.15/$0.60.

I don't think GPT-4o Mini has comparable performance to GPT-4 at all; where are you finding the benchmarks claiming this?

Everywhere I look says GPT-4 is more powerful, but GPT-4o Mini is more cost-effective if you're OK with worse performance.

Even OpenAI themselves say about GPT-4o Mini:

> Our affordable and intelligent small model for fast, lightweight tasks. GPT-4o mini is cheaper and more capable than GPT-3.5 Turbo.

If it was "on par" with GPT-4 they would surely say this.

> should definitely keep up with model capability/pricing development

Yeah, I mean that's why we're both here and why we're discussing this very topic, right? :D


Just switch out gpt-4o-mini for gpt-4o and the point stands. Across the board, these foundation model companies have comparable, if not more powerful, models that are cheaper than their older models.

OpenAI's own words: "GPT-4o is our most advanced multimodal model that’s faster and cheaper than GPT-4 Turbo with stronger vision capabilities."

gpt-4o: $2.50 / 1M input tokens, $10.00 / 1M output tokens

gpt-4-turbo: $10.00 / 1M input tokens, $30.00 / 1M output tokens

gpt-4: $30.00 / 1M input tokens, $60.00 / 1M output tokens

https://openai.com/api/pricing/


I found that gpt-4-turbo beat gpt-4o pretty consistently for coding tasks, but claude-3.5-sonnet beat both of them, so it's what I have been using most of the time. gpt-4o-mini is adequate for summarizing text.


> Yeah, I mean that's why we're both here and why we're discussing this very topic, right? :D

That wasn't specifically directed at "you"; it was more a plea to everyone reading that comment ;)

I looked at a few benchmarks comparing the two, which, as in the case of Opus 3 vs Sonnet 3.5, is hard, as the benchmarks the wider community is interested in shift over time. I think this page[0] provides the best overview I can link to.

Yes, GPT-4 is better in the MMLU benchmark, but in all other benchmarks and the LMSys Chatbot Arena scores[1], GPT-4o mini comes out ahead. Overall, the margin between them is so thin that it falls under my definition of "on par". I think OpenAI is generally a bit more conservative with the messaging here (which is understandable), and they only advertise a model as "more capable" if one model beats the other in every benchmark they track, which AFAIK is the case when it comes to 4o mini vs 3.5 Turbo.

[0]: https://context.ai/compare/gpt-4o-mini/gpt-4

[1]: https://artificialanalysis.ai/models?models_selected=gpt-4o-...


Basically yeah


Big/huge models take weeks or months longer to train than the smaller ones.

That's why they release them with that skew.


I don't think that's quite it. Before this, they had it on their website that Opus 3.5 was coming soon; now they've removed that from the webpage.

Also, Gemini Ultra 1.0 was released about 8 months ago, and 1.5 Pro was released soon after, with this wording: "The first Gemini 1.5 model we’re releasing for early testing is Gemini 1.5 Pro".

Still no Ultra 1.5, despite many mid- and small-sized models being released in that time frame. This isn't just an issue of "the training time takes longer" or a "skew" in release dates. There's a better theory to explain why all SoTA LLM companies have not released a heavy model in many months.


Sonnet is better for most things. But I do prefer Opus's writing style to Sonnet's.


Opus is the biggest, slowest, and most expensive one.

Not the most advanced.


The models "3.5 Sonnet" and "3 Opus" are in my experience nearly at the same level. Once in my last 250 prompts did I run into a problem that 3 Opus was able to solve, but 3.5 Sonnet could not. (I forget the details but it was a combination of logic and trivia knowledge. It is highly likely 3.5 Sonnet would have done a better job with better prompting and richer context, but this was a problem where also I lacked the context and understanding to prompt well.)

Given that 3.5 Sonnet is cheaper and faster than 3 Opus, I default to 3.5 Sonnet, so I don't know what the number for the reverse is. How many problems does 3.5 Sonnet get right that 3 Opus does not? ¯\_(ツ)_/¯

My best guess would be that it's something in the same kind of range.


yes, it baffles me that they can't semver the shit out of them properly (Anthropic, Meta, OpenAI, lol)



