I still feel like the difference between Sonnet and Opus is a bit unclear. Somewhere on Anthropic's website it says that Opus is the most advanced, but on other parts it says Sonnet is the most advanced and also the fastest. The UI doesn't make the distinction clear either. Then on Perplexity, Perplexity says that Opus is the most advanced, compared to Sonnet.
And finally, in the table in the blogpost, Opus isn't even included? It seems to me like Opus is the best model they have, but they don't want people to default using it, maybe the ROI is lower on Opus or something?
When I manually tested it, I feel like Opus gives slightly better replies compared to Sonnet, but I'm not 100% it's just placebo.
Opus hasn't yet gotten an update from 3 to 3.5, and if you line up the benchmarks, the Sonnet "3.5 New" model seems to beat it everywhere.
I think they originally announced that Opus would get a 3.5 update, but with every product update they are doing I'm doubting it more and more. It seems like their strategy is to beat the competition on a smaller model that they can train/tune more nimbly and pair it with outside-the-model product features, and it honestly seems to be working.
> Opus hasn't yet gotten an update from 3 to 3.5, and if you line up the benchmarks, the Sonnet "3.5 New" model seems to beat it everywhere
Why isn't Anthropic clearer about Sonnet being better then? Why isn't it included in the benchmark if new Sonnet beats Opus? Why are they so ambiguous with their language?
- Claude 3 Opus - Powerful model for highly complex tasks
Does that mean Sonnet 3.5 is better than Opus for even highly complex tasks, since it's the "most intelligent model"? Or just for everything except "highly complex tasks"
I don't understand why this seems purposefully ambiguous?
> Why isn't Anthropic clearer about Sonnet being better then?
They are clear that both: Opus > Sonnet and 3.5 > 3.0. I don't think there is a clear universal better/worse relationship between Sonnet 3.5 and Opus 3.0; which is better is task dependent (though with Opus 3.0 being five times as expensive as Sonnet 3.5, I wouldn't be using Opus 3.0 unless Sonnet 3.5 proved clearly inadequate for a task.)
> I don't understand why this seems purposefully ambiguous?
I wouldn't attribute this to malice when it can also be explained by incompetence.
Sonnet 3.5 New > Opus 3 > Sonnet 3.5 is generally how they stack up against each other when looking at the total benchmarks.
"Sonnet 3.5 New" has just been announced, and they likely just haven't updated the marketing copy across the whole page yet, and maybe also haven't figured out how to graple with the fact that their new Sonnet model was ready faster than their next Opus model.
At the same time I think they want to keep their options open to either:
A) drop a Opus 3.5 soon that will bring the logic back in order again
B) potentially phase out Opus, and instead introduce new branding for what they called a "reasoning model" like OpenAI did with o1(-preview)
> I wouldn't attribute this to malice when it can also be explained by incompetence.
I don't think it's malice either, but if Opus costs more to them to run, and they've already set a price they cannot raise, it makes sense they want people to use models they have a higher net return on, that's just "business sense" and not really malice.
> and they likely just haven't updated the marketing copy across the whole page yet
The API docs have been updated though, which is the second page I linked. It mentions the new model by it's full name "claude-3-5-sonnet-20241022" so clearly they've gone through at least that page. Yet the wording remains ambiguous.
> Sonnet 3.5 New > Opus 3 > Sonnet 3.5 is generally how they stack up against each other when looking at the total benchmarks.
Which ones are you looking at? Since the benchmark comparison in the blogpost itself doesn't include Opus at all.
> B) potentially phase out Opus, and instead introduce new branding for what they called a "reasoning model" like OpenAI did with o1(-preview)
When should we be using the -o OpenAI models? I've not been keeping up and the official information now assumes far too much familiarity to be of much use.
I think it's first important to note that there is a huge difference between -o models (GPT 4o; GPT 4o mini) and the o1 models (o1-preview; o1-mini).
The -o models are "just" stronger versions of their non-suffixed predecessors. They are the latest (and maybe last?) version of models in the lineage of GPT models (roughly GPT-1 -> GPT-2 -> GPT-3 -> GPT-3.5 -> GPT-4 -> GPT-4o).
The o1 models (not sure what the naming structure for upcoming models will be) are a new family of models that try to excel at deep reasoning, by allowing the models to use an internal (opaque) chain-of-thought to produce better results at the expense of higher token usage (and thus cost) and longer latency.
Personally, I think the use cases that justify the current cost and slowness of o1 are incredibly narrow (e.g. offline analysis of financial documents or deep academic paper research). I think in most interactive use-cases I'd rather opt for GPT-4o or Sonnet 3.5 instead of o1-preview and have the faster response time and send a follow-up message. Similarly for non-interactive use-cases I'd try to add a layer of tool calling with those faster models than use o1-preview.
I think the o1-like models will only really take off, if the prices for it are coming down, and it is clearly demonstrated that more "thinking tokens" correlate to predictably better results, and results that can compete with highly tuned prompts/fine tuned models that or currently expensive to produce in terms of development time.
Agreed with all that, and also, when used via API the o1 models don't currently support system prompts, streaming, or function calling. That rules them out for all of the uses I have.
I think the practical economics of the LLM business are becoming clearer in recent times. Huge models are expensive to train and expensive to run. As long as it meets the average user's everyday needs, it's probably much more profitable to just continue with multimodal and fine-tuning development on smaller models.
I think the main reason is they tried training a heavy weight model that was supposed to be opus 3.5, but it didn't yield large enough improvements to 3.5 sonnet to justify them releasing it. (They had it on their page for a while that opus was coming soon, and now they've scrapped that.)
This theory is consistent with the other two top players, Open AI and Google, they both were expected to release a heavy model, but instead have just released multiple medium and small tier models. It's been so long since google released gemini ultimate 1.0 (the naming clearly implying that they were planning on upgrading it to 1.5 like they did with Pro)
Not seeing anyone release a heavyweight model, but at the same time releasing many small and medium sized models makes me think that improving models will be much more complicated than scaling it with more compute, and that there likely are diminishing returns with that regard.
Maybe - would make sense not to release their latest greatest (Opus 4.0) until competition forces them to, and Amodei has previously indicated that they would rather respond to match frontier SOTA than themselves accelerate the pace of advance by releasing first.
That begs the question: why am I still paying for access to Opus 3 ?
Honestly I don’t know. I’ve not been using Sonnet 3.5 up to now and I’m a fairly light user so I doubt I’ll run into the free tier limits. I’ll probably cancel my subscription until Opus 3.5 comes out (if it ever does).
Opus is a larger and more expensive model. Presumably 3.5 Opus will be the best but it hasn't been released. 3.5 Sonnet is better than 3.0 Opus kind of like how a newer i5 midrange processor is faster and cheaper than an old high-end i7.
Makes me wonder if perhaps they do have 3.5 Opus trained, but that they're not releasing it because 3.5 Sonnet is already enough to beat the competition, and some combination of "don't want to contribute to an arms race" and "it has some scary capabilities they weren't sure were ready to publish yet".
Anthropic use the names Haiku/Sonnet/Opus for the small/medium/large versions of each generation of their models, so within-generation that is also their performance (& cost) order. Evidentially Sonnet 3.5 outperforms Opus 3.0 on at least some tasks, but that is not a same-generation comparison.
I'm wondering at this point if they are going to release Opus 3.5 at all, or maybe skip it and go straight to 4.0. It's possible that Haiku 3.5 is a distillation of Opus 3.5.
By reputation -- I can't vouch for this personally, and I don't know if it'll still be true with this update -- Opus is still often better for things like creative writing and conversations about emotional or political topics.
Yes, you will find similar things at essentially all other model providers.
The older/bigger GPT4 runs at $30/$60 and peforms about on par with GPT4o-mini which costs only $0.15/$0.60.
If you are currently, or have been integrating AI models in the past ~2 years, you should definitely keep up with model capability/pricing development. If you are staying on old models you are certainly overpaying/leaving performance on the table. It's essentially a tax on agility.
Just switch out gpt-4o-mini for gpt-4o, the point stands. Across the board, these foundational model companies have comparable, if not more powerful, models that are cheaper than their older models.
OpenAI's own words: "GPT-4o is our most advanced multimodal model that’s faster and cheaper than GPT-4 Turbo with stronger vision capabilities."
I found that gpt-4-turbo beat gpt-4o pretty consistently for coding tasks, but claude-3.5-sonnet beat both of them, so it's what I have been using most of the time. gpt-4o-mini is adequate for summarizing text.
> Yeah, I mean that's why we're both here and why we're discussing this very topic, right? :D
That wasn't specifically directed at "you", but more as a plea to everyone reading that comment ;)
I looked at a few benchmarks, comparing the two, which like in the case of Opus 3 vs Sonnet 3.5 is hard, as the benchmarks the wider community is interested in shifts over time. I think this page[0] provides the best overview I can link to.
Yes, GPT4 is better in the MMLU benchmark, but in all other benchmarks and the LMSys Chatbot Arena scores[1], GPT4o-mini comes out ahead. Overall, the margin between is so thin that it falls under my definition of "on par". I think OpenAI is generally a bit more conservative with the messaging here (which is understandable), and they only advertise a model as "more capable", if one model beats the other one in every benchmark they track, which AFAIK is the case when it comes to 4o mini vs 3.5 Turbo.
I don't think that's quite it. They had it on their website before this, that opus 3.5 was coming soon, now they've removed that from the webpage.
Also, Gemini ultra 1.0, was released like 8 months ago, 1.5 pro released soon after, with this wording "The first Gemini 1.5 model we’re releasing for early testing is Gemini 1.5 Pro"
Still no ultra 1.5, despite many mid and small sized models being released in that time frame. This isn't just an issue of "the training time takes longer", or a "skew" to release dates. There's a better theory to explain why all SoTA LLM companies have not released a heavy model in many months.
The models "3.5 Sonnet" and "3 Opus" are in my experience nearly at the same level. Once in my last 250 prompts did I run into a problem that 3 Opus was able to solve, but 3.5 Sonnet could not. (I forget the details but it was a combination of logic and trivia knowledge. It is highly likely 3.5 Sonnet would have done a better job with better prompting and richer context, but this was a problem where also I lacked the context and understanding to prompt well.)
Given that 3.5 Sonnet is cheaper and faster than 3 Opus, I default to 3.5 Sonnet so I don't know what the number for the reverse is. How many problems do 3.5 Sonnet get which 3 Opus does not? ¯\_(ツ)_/¯
My best guess would be that it's something in the same kind of range.
And finally, in the table in the blogpost, Opus isn't even included? It seems to me like Opus is the best model they have, but they don't want people to default using it, maybe the ROI is lower on Opus or something?
When I manually tested it, I feel like Opus gives slightly better replies compared to Sonnet, but I'm not 100% it's just placebo.