So o1 seems like it has a real, measurable edge, crushing it in every single metric. 1673 Elo is insane, and 89th percentile is a whole different league. It doesn't look like a one-off either: it's consistently performing way better than gpt-4o across all the datasets, even the ones where gpt-4o was already doing pretty well, like MATH and MMLU; o1 just takes it to the next level. And the fact that it's not even showing up in some of the metrics, like MMMU and MathVista, makes it look even more impressive. What's going on with gpt-4o, is it just a total dud? Also, what's the deal with the preview model: is it a beta version, and is it a stepping stone to o1? Has anyone dug into what o1 is actually doing differently? Is it just more training data, or is there something more going on? And what's the plan for o1: is it going to be released to the public, or will it stay an internal tool?
If it's actually true in practice, I sincerely cannot imagine a scenario where it would be cheaper to hire actual junior or mid-tier developers (keyword: "developers", not architects or engineers).
A model at 1,673 Elo should be able to build very complex, scalable apps with some guidance.
I'm not sure how well Codeforces percentiles correlate with software engineering ability. Looking at all the data, this still isn't the game changer it's being made out to be. Key notes:
1. AlphaCode 2 was already at 1650 last year.
2. SWE-bench Verified under an agent has jumped from 33.2% to 35.8% with this model (which doesn't really matter). The full model is at 41.4%, which still isn't a game changer either.
3. It's not handling open-ended questions much better than gpt-4o.
I think you're right, actually. Initially I got excited, but now I think OpenAI pulled the hype card again to seem relevant while they struggle to be profitable.
Claude, on the other hand, has been fantastic and seems to do similar reasoning behind the scenes with RL.
Currently my workflow is: generate some code, run it, and if it doesn't work, tell the LLM what I expected. It then produces new code, and I frequently have to tell it how to reason about the problem.
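If you automated the loop I'm doing by hand, it would look roughly like this. Just a sketch: it assumes the OpenAI Python SDK with gpt-4o standing in for whatever model you actually use, and the task string, retry count, and fence-stripping are placeholder choices of mine.

```python
import re
import subprocess
import sys
import tempfile

from openai import OpenAI

client = OpenAI()  # needs OPENAI_API_KEY in the environment

def ask(prompt: str) -> str:
    """Send a prompt to the model and return the code it wrote."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # stand-in; swap for whatever you actually use
        messages=[{"role": "user", "content": prompt}],
    )
    text = resp.choices[0].message.content
    # pull the code out of a markdown fence if the model wrapped it in one
    m = re.search(r"```(?:python)?\n(.*?)```", text, re.S)
    return m.group(1) if m else text

def run_snippet(code: str) -> tuple[bool, str]:
    """Run generated Python in a subprocess and report whether it succeeded."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
    result = subprocess.run([sys.executable, f.name],
                            capture_output=True, text=True, timeout=30)
    return result.returncode == 0, result.stdout + result.stderr

task = "Write a complete Python script that prints the first 10 Fibonacci numbers."
code = ask(task)
for _ in range(5):  # give up after a few rounds
    ok, output = run_snippet(code)
    if ok:
        break
    # the step I currently do by hand: paste the output back and say what I expected
    code = ask(f"{task}\n\nYour last attempt produced:\n{output}\n"
               "That's not what I expected. Fix it and return only the corrected script.")
```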
O1 being in the 89th percentile should mean it can think at a junior-to-intermediate level with very strong consistency.
I don't think people in the comments realize the implication of this. Previously LLMs could only "pattern match", but now it's able to evaluate itself (with some guidance, of course), essentially steering the software into the depths of edge cases and reasoning about them in a way that feels natural to us.
Currently I'm copying and pasting things and telling the LLM the results, but once O1 is available it's going to significantly lower how often I have to do that.
For example, I expect it to self-evaluate the code it generates and think at a higher level.
Ex: "oooh, looks like this user shouldn't be able to escalate privileges in this case because it would be a security issue, or it could conflict with the code I generated three steps ago; I'll fix it myself."
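A toy version of that self-review step, written as an explicit critique-then-revise pass on top of the same kind of `ask` helper as in the sketch above. To be clear, this is me guessing at the behavior: o1 presumably does something like this inside its own reasoning rather than via extra API calls, and the "LGTM" convention is just my own placeholder.

```python
def generate_with_review(task: str, prior_code: str) -> str:
    """Draft code, have the model critique its own draft, then revise if needed."""
    draft = ask(f"{task}\n\nExisting code from earlier steps:\n{prior_code}")
    critique = ask(
        "Review this new code before it ships.\n\n"
        f"Earlier code in the project:\n{prior_code}\n\n"
        f"New code:\n{draft}\n\n"
        "Check specifically for: (1) ways a user could escalate privileges, "
        "(2) conflicts with the earlier code, (3) unhandled edge cases. "
        "List concrete issues, or reply with exactly 'LGTM' if there are none."
    )
    if critique.strip() == "LGTM":
        return draft
    # feed the model's own objections back to it, the way I'd do manually today
    return ask(
        "Fix these issues and return only the corrected code.\n\n"
        f"Issues:\n{critique}\n\nCode:\n{draft}"
    )
```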