Learning to Reason with LLMs (openai.com)
1654 points by fofoz 3 months ago | 1261 comments



Some practical notes from digging around in their documentation: In order to get access to this, you need to be on their tier 5 level, which requires $1,000 total paid and 30+ days since first successful payment.

Pricing is $15.00 / 1M input tokens and $60.00 / 1M output tokens. Context window is 128k tokens; max output is 32,768 tokens.

There is also a mini version with double the maximum output tokens (65,536 tokens), priced at $3.00 / 1M input tokens and $12.00 / 1M output tokens.

The specialized coding version they mentioned in the blog post does not appear to be available for use.

It’s not clear if the hidden chain of thought reasoning is billed as paid output tokens. Has anyone seen any clarification about that? If you are paying for all of those tokens it could add up quickly. If you expand the chain of thought examples on the blog post they are extremely verbose.

https://platform.openai.com/docs/models/o1 https://openai.com/api/pricing/ https://platform.openai.com/docs/guides/rate-limits/usage-ti...
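If the hidden reasoning is billed as output tokens (confirmed further down the thread), here is a back-of-envelope for a single o1-preview request, with the hidden-reasoning length being a pure guess based on the verbosity of the expanded examples:

  input_tokens     = 2_000
  reasoning_tokens = 10_000    # hidden chain of thought (guess)
  visible_tokens   = 1_000

  cost = input_tokens / 1e6 * 15.00 + (reasoning_tokens + visible_tokens) / 1e6 * 60.00
  print(round(cost, 2))        # ~$0.69 for one request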


> Some practical notes from digging around in their documentation: In order to get access to this, you need to be on their tier 5 level, which requires $1,000 total paid and 30+ days since first successful payment.

The Tier 5 requirement is for _API access_ only. ChatGPT Plus users, for example, also have access to the o1 models.


We just received this email:

Hi there,

I’m x, PM for the OpenAI API. I’m pleased to share with you our new series of models, OpenAI o1. We’ve developed these models to spend more time thinking before they respond. They can reason through complex tasks and solve harder problems than previous models in science, coding, and math.

As a trusted developer on usage tier 5, you’re invited to get started with the o1 beta today. Read the docs. You have access to two models:

    Our larger model, o1-preview, which has strong reasoning capabilities and broad world knowledge. 
    Our smaller model, o1-mini, which is 80% cheaper than o1-preview.

Try both models! You may find one better than the other for your specific use case. Both currently have a rate limit of 20 RPM during the beta. But keep in mind o1-mini is faster, cheaper, and competitive with o1-preview at coding tasks (you can see how it performs here). We’ve also written up more about these models in our blog post.

I’m curious to hear what you think. If you’re on X, I’d love to see what you build—just reply to our post.

Best, OpenAI API


Reasoning tokens are indeed billed as output tokens.

> While reasoning tokens are not visible via the API, they still occupy space in the model's context window and are billed as output tokens.

From here: https://platform.openai.com/docs/guides/reasoning
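For what it's worth, the API does at least report how many reasoning tokens you were billed for, even though you can't see their content. A minimal sketch with the OpenAI Python SDK, field names taken from that reasoning guide (treat the details as approximate):

  from openai import OpenAI

  client = OpenAI()
  resp = client.chat.completions.create(
      model="o1-preview",
      messages=[{"role": "user", "content": "How many primes are there below 100?"}],
  )
  # reasoning_tokens are invisible, but they are counted inside completion_tokens
  # and billed at the output-token rate
  details = resp.usage.completion_tokens_details
  print(resp.usage.completion_tokens, details.reasoning_tokens)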


This is concerning - how do you know you aren’t being fleeced out of your money here…? You’ll get your results, but did you really use that much?


I think it's fantastic that now, for very little money, everyone gets to share a narrow but stressful subset of what it feels like to employ other people.

Really, I recommend reading this part of the thread while thinking about the analogy. It's great.


It’s a nice analogy on the surface, but employees are all different people, whereas this is one company’s blob of numbers with not much incentive to optimize your costs.

Competition fixes some of this, I hope Anthropic and Mistral are not far behind.


> […] with not much incentive to optimize your cost. Competition fixes some of this […]

Just like employing other people!


On the contrary. It will be the world's most scrutinized employee. Thousands of people, amongst them important people with big levers, will be screaming in their ear on my behalf constantly, and my — our collective — employee gets better without me having to do anything. It's fantastic!


Your idea is really a brilliant insight. Revealing.


I love this so much haha.

"I can only ask my employee 20 smart things this week for $20?! And they get dumber (gpt-4o) after that? Not worth it!"


Any respectable employer/employee relationship transacts on results rather than time anyway. Not sure the analogy is very applicable in that light.


> Any respectable employer/employee relationship transacts on results rather than time anyway.

No. This may be common in freelance contracts, but is almost never the case in employment contracts, which specify a time-based compensation (usually either per hour or per month).


I believe parent's point was that if one's management is clueless as to how to measure output, and compensation/continued employment is unlinked from it... one is probably working for a bad company.


Yea, I said ‘respectable’.


That's just not how employment laws are written.


Employment law actually permits per-piece payments too, although that type of pay scale is rare.


It is!


Obfuscated billing has long been a staple of all great cloud products. AWS innovated in the space, and many have since followed in its footsteps.


Also, now we're paying for output tokens that aren't even output, with no good explanation for why these tokens should be hidden from the person who paid for them.


If you read the link they have a section specifically explaining why it is hidden.


I read it. It's a bad explanation.

The only bit about it that feels at all truthful is this bit, which is glossed over but likely the only real factor in the decision:

> after weighing multiple factors including ... competitive advantage ... we have decided not to show the raw chains of thought to users.


Good catch. That suggests the chain of thought is a straightforward enough approach to making LLMs better at reasoning that a competitor could copy it just by seeing the steps.


Bad, in your opinion.


Also seems very impractical to embed this into a deployed product. How can you possibly hope to control and estimate costs? I guess this is strictly meant for R&D purposes.


You can specify the max length of the response, which presumably includes the hidden tokens.

I don't see why this is qualitatively different from a cost perspective than using CoT prompting on existing models.
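A minimal sketch of what capping the spend looks like with the OpenAI Python SDK (assuming the beta's max_completion_tokens parameter, which, as I read the reasoning guide, covers hidden reasoning tokens as well as visible output):

  from openai import OpenAI

  client = OpenAI()
  resp = client.chat.completions.create(
      model="o1-mini",
      messages=[{"role": "user", "content": "Outline a proof of Fermat's little theorem."}],
      # hard ceiling on billed completion tokens, hidden reasoning included (my assumption)
      max_completion_tokens=4096,
  )
  print(resp.choices[0].message.content)  # can come back empty if reasoning used the whole budget

As the reply below points out, if the reasoning eats the whole budget you still pay for those tokens and get no visible answer back.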


For one, you don't get to see any output at all if you run out of tokens during thinking.

If you set a limit, once it's hit you just get a failed request, with no introspection into where and why the CoT went off the rails.


Why would I pay for zero output? That’s essentially throwing money down the drain.


You can’t verify that you’re paying what you should be if you can’t see the hidden tokens.


With the conventional models you don't get the activations or the logits even though those would be useful.

Ultimately, either the output of the model is worth what you end up paying for it or it isn't; I don't see why it really matters to you whether OpenAI is lying about token counts or not.


As a single user, it doesn’t really, but as a SaaS operator I want tractable, hopefully predictable pricing.

I wouldn’t just implicitly trust a vendor when they say “yeah we’re just going to charge you for what we feel like when we feel like. You can trust us.”


They are currently trying to raise money (talk of new $150B valuation), so that may have something to do with it


In the UI the reasoning is visible. The API can probably return it too, just check the code


OAI doesn't show the actual CoT, on the grounds that it's potentially unsafe output and also to prevent competitors from training on it. You only see a sanitized summary.


What's shown in the UI is a summary of the reasoning


No access to reasoning output seems totally bonkers. All of the real cost is in inference, assembling an HTTP request to deliver that result seems trivial?


Some of the queries run for multiple minutes. 40 tokens/sec is too slow for CoT.

I hope OpenAI is investing in low-latency inference like Groq's tech, which can reach 1k tokens/sec.
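Back-of-envelope, assuming a chain of thought around the ~5k-token length of the published examples:

  cot_tokens = 5_000           # assumed CoT length, roughly what the expanded examples show
  print(cot_tokens / 40)       # ~125 s of waiting at 40 tokens/sec
  print(cot_tokens / 1_000)    # ~5 s at Groq-like speeds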


It's slow and expensive if you compare it with other LLMs.

It's lightning fast and dirt cheap if you compare it to consulting with a human expert, which it appears to be competitive with.


I would say it's competitive with consulting a human, not a human expert. Any expert who has a conversation with ChatGPT about their own field will confirm that it is very far from expert level.


According to the data provided by OpenAI, that isn't true anymore. And I trust data more than anecdotal claims made by people whose job is being threatened by systems like these.


>According to the data provided by OpenAI, that isn't true anymore

OpenAI's main job is to sell the idea that their models are better than humans. I still remember when they were marketing their GPT-2 weights as too dangerous to release.


I remember that too; it's when I started following the space (shout out Computerphile / Robert Miles). IIRC the reason they gave was not "it's too dangerous because it's so badass". They were basically correct that it could produce sufficiently "human" output to break typical bot detectors on social media, which is a legitimate problem. Whether the repercussions of that failure to detect botting are meaningful enough to be considered "dangerous" is up to the reader to decide.

Also worth noting: I don't agree with the comment you're replying to, but I did want to add context on the GPT-2 situation.


What? Surely you have some area of your life you are above-average knowledgeable about. Have a conversation with ChatGPT about it, with whatever model, and you can see for yourself it is far from expert level.

You are not "trusting data more than anecdotal claims", you are trusting marketing over reality.

Benchmarks can be gamed. Statistics can be manipulated. Demonstrations can be cherry picked.

PS: I stand to gain heavily if AI systems could perform at an expert level, this is not a claim from someone 'whose job is being threatened'.


> For each problem, our system sampled many candidate submissions and submitted 50 of them based on a test-time selection strategy. Submissions were selected based on performance on the IOI public test cases, model-generated test cases, and a learned scoring function. If we had instead submitted at random, we would have only scored 156 points on average, suggesting that this strategy was worth nearly 60 points under competition constraints.

Did you read the post? OpenAI clearly states that the results are cherry-picked. Just a random query will have far worse results. To get equal results you need to ask the same query dozens of times and then have enough expertise to pick the best one, which might be quite hard for a problem you have little idea about.

Combine this with the fact that this blog post is a sales pitch with the very best test results out of probably many more benchmarks we will never see, and it seems obvious that human experts are still several orders of magnitude ahead.
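For anyone skimming, the quoted selection strategy amounts to "generate a lot of candidate programs, filter and rank them, submit the top 50." A hypothetical sketch of that shape (none of these names are OpenAI's, and the real learned scoring function is not public):

  def select_submissions(candidates, public_tests, score, n=50):
      # public_tests: callables returning True if a candidate passes that test case
      # (the real pipeline also used model-generated test cases alongside the public ones)
      passing = [c for c in candidates if all(test(c) for test in public_tests)]
      return sorted(passing, key=score, reverse=True)[:n]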


When I read that line I was very confused too, lol. I interpreted it as saying they basically took other contestants' submissions, allowed the model to see those "solutions" as part of its context, and then had the model generate its own "solution" to be used for the benchmark. I fail to see how this is "solving" an IOI-level question.

What is interesting is the following paragraph in the post: "With a relaxed submission constraint, we found that model performance improved significantly. When allowed 10,000 submissions per problem, the model achieved a score of 362.14 – above the gold medal threshold – even without any test-time selection strategy." So they didn't allow sampling from other contest solutions here? If that is the case it's quite interesting, since the model is effectively, IMO, able to brute-force questions, provided you have some form of validator able to tell it when to halt.

I came across one of the IOI questions this year that I had trouble solving (I am pretty noob tho), which made me curious about how these reported results were reflected. The question at hand being https://github.com/ioi-2024/tasks/blob/main/day2/hieroglyphs... Apparently, the model was able to get it partially correct. https://x.com/markchen90/status/1834358725676572777


So, basically, it's chain of thought as a service?

Not a model, per se, but a service that chains multiple model requests behind the scene?


Who knows? Certainly not the public.

It might be a finetuned model that works better in such a setting.


The linked blog post explains that it is fine-tuned with some reinforcement learning process. It doesn't go into details, but they do claim it's not just the base model with chain of thought; there's some fine-tuning going on.


Unless this is specifically relating to API access, I don’t think it’s correct. I’ve been paying for ChatGPT via the App Store IAP for around a year or less, and I’ve already got both o1-preview and o1-mini available in-app.


Yes, I was referring to API access specifically. Nothing in the blog post or the documentation mentions access to these new models on ChatGPT, and even as a paid user I’m not seeing them on there (Edit: I am seeing it now in the app). But looks like a bunch of other people in this discussion do have it on ChatGPT, so that’s exciting to hear.


I'm a bit late to the show, but it would seem the API calls for these new models don't support system messages (where role is system) or the tool list for function calls.


I have access to this, and there is no way I've spent more than $50 on the OpenAI API. I have had ChatGPT+ since day q though (probably $240 in total).


You missed your raise key on "day q"


Raise it up just one


The CoT is billed as output tokens. It's mentioned in the docs where they talk about reasoning.


I am an ordinary plus user (since it was released more or less) and have access.


I am a Plus user and pay $20 per month. I have access to the o1 models.


A bit out of context.

I am curious whether at some point the length of the context window stops making any material difference to the output, and it just stops making economic sense as the law of diminishing marginal utility kicks in.


> you need to be on their tier 5 level, which requires $1,000 total paid and [...]

Good opening for OpenAI's competitors to run a 'we're not snobs' promotion.


How so? I think most of the competition does this. Early partners/heavy users get access first which 1) hopefully provides feedback on the product and 2) provides a mechanism to stagger the release.


Marketing is about feelings, not facts.


We could tell it impacted your feelings, but most businesses don't run on feelings. There is sometimes alignment on morals / being a good business partner, but before that it's the quality of the product and the cost.


One thing that makes me skeptical is the lack of specific labels on the first two accuracy graphs. They just say it's a "log scale", without giving even a ballpark on the amount of time it took.

Did the 80% accuracy test results take 10 seconds of compute? 10 minutes? 10 hours? 10 days? It's impossible to say with the data they've given us.

The coding section indicates "ten hours to solve six challenging algorithmic problems", but it's not clear to me if that's tied to the graphs at the beginning of the article.

The article contains a lot of facts and figures, which is good! But it doesn't inspire confidence that the authors chose to obfuscate the data in the first two graphs in the article. Maybe I'm wrong, but this reads a lot like they're cherry picking the data that makes them look good, while hiding the data that doesn't look very good.


> Did the 80% accuracy test results take 10 seconds of compute? 10 minutes? 10 hours? 10 days? It's impossible to say with the data they've given us.

The gist of the answer is hiding in plain sight: it took so long, on an exponential cost function, that they couldn't afford to explore any further.

The better their max demonstrated accuracy, the more impressive this report is. So why stop where they did? Why omit actual clock times or some cost proxy for them from the report? Obviously, it's because continuing was impractical and because those times/costs were already so large that they'd unfavorably affect how people respond to this report.


See also: them still sitting on Sora seven months after announcing it. They've never given any indication whatsoever of how much compute it uses, so it may be impossible to release in its current state without charging an exorbitant amount of money per generation. We do know from people who have used it that it takes between 10 and 20 minutes to render a shot, but how much hardware is being tied up during that time is a mystery.


Could well be.

It's also entirely possible they are simply sincere about their fear it may be used to influence the upcoming US election.

Plenty of people (me included) are sincerely concerned about the way even mere still image generators can drown out the truth with a flood of good-enough-at-first-glance fiction.


If they were sincere about that concern then they wouldn't build it at all, if it's ever made available to the public then it will eventually be available during an election. It's not like the 2024 US presidential election is the end of history.


The risk is not “interfering with the US elections”, but “being on the front page of everything as the only AI company interfering with US elections”. This would destroy their peacocking around AGI/alignment while raising billions from pension funds.

OpenAI is in a very precarious position. Maybe they could survive that hit in four years, but it would be fatal today. No unforced errors.


I think the hope is that by the next presidential election no one will trust video anymore anyway, so the new normal won't be as chaotic as if they dropped it in the middle of an already contentious election.

As for not building it at all: it's an obvious next step in generative AI models, and if they don't make it someone else will anyway.


Wouldn’t it be nice if we came full circle and went to listen to our politicians live because anything else would be pointless.


I'd give it about 20 years before humanoid robots can be indistinguishable from originals without an x-ray or similar — covering them in vat-grown cultures of real human skin etc. is already possible but the robots themselves aren't good enough to fool anyone.


Unfortunately that would mean two things: firstly, only swing states would get to hear what politicians are actually saying; and secondly, to reach everyone, the primary process would have to start even earlier so the candidates would have a chance to give enough speeches before early voting.


Even if Kamala wins (praise be to god that she does), those people aren't just going to go away until social media does. Social media is the cause of a lot of the conspiracy theory mania.

So yeah, better to never release the model...even though Elon would in a second if he had it.


Doesn't strike me as the kind of principle OpenAI is willing to slow themselves down for, to be honest.


But this cat got out of the bag years ago, didn't it? Trump himself is using AI-generated images in his campaign. I'd go even further: the more fake images appear, the faster society as a whole will learn to distrust anything by default.


Personally I'm not a fan of accelerationism


Nothing works without trust, none of us is an island.

Everyone has a different opinion on what threshold of capability is important, and what to do about it.


Why did they release this model then?


Their public statements say that the only way to safely learn how to deal with the things AI can do is to show what it can do and get feedback from society:

"""We want to successfully navigate massive risks. In confronting these risks, we acknowledge that what seems right in theory often plays out more strangely than expected in practice. We believe we have to continuously learn and adapt by deploying less powerful versions of the technology in order to minimize “one shot to get it right” scenarios.""" - https://openai.com/index/planning-for-agi-and-beyond/

I don't know if they're actually correct, but it at least passes the sniff test for plausibility.


Also, the Sora videos are proven to be modified ads. We still need to see how it performs first.


> Also, the Sora videos are proven to be modified ads

Can't find anything about that, you got a link?



Oh, so not the actual demo videos OpenAI shared on their website and twitter.


We still need to see those demos in action though. That's the big IF everyone is thinking about.


Sure, but "the Sora videos are proven to be modified ads" is demonstrably false for the demos OpenAI shared and the artist-made ones.


https://www.youtube.com/watch?v=9oryIMNVtto

Isn't this balloon video shared by OpenAI? How does this not count? For the others I don't have evidence, but this balloon video case is enough to cast doubt.


But there are lots of models available now that render much faster and are better quality than Sora.


People have been celebrating the fact that tokens got 100x cheaper and now here's a new system that will use 100x more tokens.


Also you now have to pay for tokens you can't see, and just have to trust that OpenAI is using them economically.


Token count was always an approximation of value. This may help break that silly idea.


I don't think it's much good as an approximation of value, but it seems ok as an approximation of cost.


Fair, cost and value are only loosely related. Trying to price based on cost always turns into a mess.


It's what you do when you're a commodity.


If it's reasoning correctly, it shouldn't need a lot of tokens, because you don't need to correct it.

You only need to ask it to solve nuclear fusion once.


As someone experienced with operations / technical debt / weird company-specific nonsense (platform engineer): no, you have to solve nuclear fusion at <insert-my-company>. You've got to do it over and over again. If it were that simple we wouldn't even have needed AI; we would have hand-written a few things, and then everything would have been Legos, and Legos of Legos, but it takes a LONG time to find new true Legos.


I'm pretty sure everything is Lego and Legos of Legos.

You show me something new and I say: look down at whose shoulders we're standing on, what libraries we've built with.


Yeah you’re right, all businesses are made of identical, interchangeable parts that we can swap out at our leisure.

This is why enterprises change ERP systems frictionlessly, and why the field of software engineering is no longer required. In fact, given that apparently, all business is solved, we can probably just template them all out, call it a day and all go home.


Yeah, but that's not a Lego. A Lego is something that fits everywhere else, not just with previous work. There's a lot of previous work. There are very few true Legos.


AlphaFold simulated the structure of over 200 million proteins. Among those, there could be revolutionary ones that could change the medical scientific field forever, or they could all be useless. The reasoning is sound, but that's as far as any such tool can get, and you won't know it until you attempt to implement it in real life. As long as those models are unable to perfectly recreate the laws of the universe to the maximum resolution imaginable and follow them, you won't see an AI model, let alone a LLM, provide anything of the sort.


Perhaps GenAI may point out a blind spot, just as a kid may see something the adults do not due to stale heuristics


With these methods the issue is the log scale of compute. Let's say you ask it to solve fusion. It may be able to solve it, but the problem is that it's unverifiable WHICH answer was correct.

So it may generate 10 Billion answers to fusion and only 1-10 are correct.

There would be no way to know which one is correct without first knowing the answer to the question.

This is my main issue with these methods: they assume the future via RL, and then when the model gets something right they mark that.

We should really be measuring the percentage of the time it was wrong, rather than whether it was right a single time.


This sounds suspiciously like the reason that quantum compute is not ready for prime-time yet.


Have you seen how long the CoT was for the example? It's incredibly verbose.


I find there is an educational benefit in verbosity; it helps teach the user to think like a machine.


Which is why it is incredibly depressing that OpenAI will not publish the raw chain of thought.

“Therefore, after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users. We acknowledge this decision has disadvantages. We strive to partially make up for it by teaching the model to reproduce any useful ideas from the chain of thought in the answer. For the o1 model series we show a model-generated summary of the chain of thought.”


Maybe they will enable showing the CoT for limited use, like 5 prompts a day for Premium users, or for Enterprise users with an agreement not to steal the CoT, or something like that.

If OpenAI sees this: please allow users to see the CoT for a few prompts per day, or add it to Azure OpenAI for Enterprise customers with legal clauses against stealing the CoT.


Imagine if this tech was available in the middle ages and it was asked to 'solve' alchemy or perpetual motion, and responded that it was an impossible problem... people would (irrationally from our perspective) go Luddite on it I suspect. Now apply to the 'fusion power' problem.


The new thing that can do more at the "ceiling" price doesn't remove your ability to still use the 100x cheaper tokens for the things that were doable on that version.


Isn't that part of developing a new tech?


That exact pattern is always true of technological advance. Even for a pretty broad definition of technology. I'm not sure if it's perfectly described by the name "induced demand" but it's basically the same thing.


It does dispel this idea that we are going to be flooded with too many GPUs.


"People have been celebrating the fact that RAM got 100x cheaper and now here's a new system that will use 100x more RAM."


Known as Wirth's law.


...while providing a significant advance. That's a good problem.


Isn't that part of the point?


I don't think it's hard to compute the following:

- At the high end, there is a likely nonlinear relationship between answer quality and compute.

- We've gotten used to a flat-price model. With AGI-level models, we might have to pay more for more difficult and more important queries. Such is the inherent complexity involved.

- All this stuff will get better and cheaper over time, within reason.

I'd say let's start by celebrating that machine thinking of this quality is possible at all.


I don't think it's worth any debate. You can simply find out how it does for you, now(-ish, rolling out).

In contrast: Gemini Ultra, the best, still non-existent Google model for the past few months now, which people are nonetheless happy to extrapolate excitement over.


Bold of you to expect transparency and clarity from a company like OpenAI.

You wanted reliable, readable graphs? Ppphhh, get out of here. But do pay for the CoT tokens you'll never see on your way out, though.


When one axis is on a log scale and the other is linear, with the plot points appearing linear-ish, doesn't it mean there's a roughly exponential relationship between the two axes?


It'd be more accurate to call it a logarithmic relationship, since compute time is our input variable. Which itself is a bit concerning, as that implies that modest gains in accuracy require exponentially more compute time.

In either case, that still doesn't excuse not labeling your axes. Taking 10 seconds vs. 10 days to get to 80% accuracy implies radically different things about how developed this technology is, and how viable it is for real-world applications.

Which isn't to say a model that takes 10 days to get an 80% accurate result can't be useful. There are absolutely use cases where that could represent a significant improvement on what's currently available. But the fact that they're obfuscating this fairly basic statistic doesn't inspire confidence.


> Which itself is a bit concerning, as that implies that modest gains in accuracy require exponentially more compute time

This is more of what I was getting at. I agree they should label the axis regardless, but I think the scaling relationship is interesting (or rather, concerning) on its own.


The absolute time depends on hardware, optimizations, exact model, etc; it's not a very meaningful number to quantify the reinforcement technique they've developed, but it is very useful to estimate their training hardware and other proprietary information.


It's not about the literal quantity/value; it's about the order of growth of output vs. input. Hardware and optimizations don't really change that.


Exactly, that's why the absolute computation time doesn't matter, only relative growth, which is exactly what they show.


A linear graph with a log scale on the vertical axis means the original graph had near exponential growth.

A linear graph with a log scale on the horizontal axis means the original graph had the law of diminishing returns kick in (somewhat similar to logarithmic growth, but with a vertical asymptote).
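In symbols (a sketch, writing accuracy a against compute c on the log-scaled axis), points falling on a straight line mean roughly

  a ~ k * log(c) + b   <=>   c ~ exp((a - b) / k)

i.e. each fixed increment in accuracy costs a constant multiplicative factor of compute, which is the diminishing-returns / exponential-cost reading described above.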


Super hand-waving rough estimate, going off of five points of reference / examples that sorta all point in the same direction:

1. It looks like they scale up by about ~100-200x on the x axis when showing that test-time result.
2. Based on the o1-mini post [1], there's an "inference cost" plot where you can see GPT-4o and GPT-4o mini as dots in the bottom corner, haha (you can extract X values; I've done so below).
3. There's a video showing the "speed" in the chat UI (3s vs. 30s).
4. The pricing page [2].
5. On their API docs about reasoning, they quantify "reasoning tokens" [3].

First, from the original plot, we have roughly 2 orders of magnitude to cover (~100-200x)

Next, from the cost plots: super hand-waving guess, but since 5.77 / 0.32 = ~18, and the relative cost of gpt-4o vs gpt-4o-mini is ~20-30x, this roughly lines up. This implies that o1 costs ~1000x more than gpt-4o-mini for inference (not due to model cost, just due to the raw number of chain-of-thought tokens it produces). So my first "statement" is that I trust the "Math performance vs Inference Cost" plot on the o1-mini page to accurately represent the "cost" of inference for these benchmark tests. This is now a set of relative "cost" numbers between the o1 and 4o models.

I'm also going to make an assumption that o1 is roughly the same size as 4o inherently, and then from that and the SVG, roughly going to estimate that they did a "net" decoding of ~100x for the o1 benchmarks in total. (5.77 vs (354.77 - 635)).

Next, from the CoT examples they gave us, they actually show the CoT preview where (for the math example) it says "...more lines cut off...", A quick copy paste of what they did include includes ~10k tokens (not sure if copy paste is good though..) and from the cipher text example I got ~5k tokens of CoT, while there are only ~800 in the response. So, this implies that there's a ~10x size of response (decoded tokens) in the examples shown. It's possible that these are "middle of the pack" / "average quality" examples, rather than the "full CoT reasoning decoding" that they claim they use. (eg. from the log scale plot, this would come from the middle, essentially 5k or 10k of tokens of chain of thought). This also feels reasonable, given that they show in their API [3] some limits on the "reasoning_tokens" (that they also count)

All together, the CoT examples, pricing page, and reasoning page all imply that reasoning itself can be variable length by about ~100x (2 orders of magnitude), eg. example: 500, 5k (from examples) or up to 65,536 tokens of reasoning output (directly called out as a maximum output token limit).

Taking them on their word that "pass@1" is honest, and they are not doing k-ensembles, then I think the only reasonable thing to assume is that they're decoding their CoT for "longer times". Given the roughly ~128k context size limit for the model, I suspect their "top end" of this plot is ~100k tokens of "chain of thought" self-reflection.

Finally, at around 100 tokens per second (gpt-4o decoding speed), this leaves my guess for their "benchmark" decoding time at the "top-end" to be between ~16 minutes (full 100k decoding CoT, 1 shot) for a single test-prompt, and ~10 seconds on the low end. So for that X axis on the log scale, my estimate would be: ~3-10 seconds as the bottom X, and then 100-200x that value for the highest value.

All together, to answer your question: I think the 80% accuracy result took about ~10-15 minutes to complete. I also believe that the "decoding cost" of o1 model is very close to the decoding cost of 4o, just that it requires many more reasoning tokens to complete. (and then o1-mini is comparable to 4o-mini, but also requiring more reasoning tokens)

[1] https://openai.com/index/openai-o1-mini-advancing-cost-effic...

  Extracting "x values" from the SVG:
  GPT-4o-mini: 0.3175
  GPT-4o: 5.7785
  o1: (354.7745, 635)
  o1-preview: (278.257, 325.9455)
  o1-mini: (66.8655, 147.574)
[2] https://openai.com/api/pricing/

  gpt-4o:
  $5.00 / 1M input tokens
  $15.00 / 1M output tokens

  o1-preview:
  $15.00 / 1M input tokens
  $60.00 / 1M output tokens
[3] https://platform.openai.com/docs/guides/reasoning

  usage: {
    total_tokens: 1000,
    prompt_tokens: 400,
    completion_tokens: 600,
    completion_tokens_details: {
      reasoning_tokens: 500
    }
  }
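Putting the guess above into numbers (every input here is an estimate from this comment, nothing official):

  gpt4o_x      = 5.7785        # x-values extracted from the o1-mini inference-cost SVG
  gpt4o_mini_x = 0.3175
  o1_x_low     = 354.7745

  print(gpt4o_x / gpt4o_mini_x)    # ~18x, in line with the known 4o vs 4o-mini price gap
  print(o1_x_low / gpt4o_mini_x)   # ~1100x, i.e. the "o1 costs ~1000x 4o-mini" claim

  tokens_per_sec = 100             # assumed 4o-class decoding speed
  max_cot_tokens = 100_000         # assumed top-end chain-of-thought length
  print(max_cot_tokens / tokens_per_sec / 60)   # ~16.7 minutes for one full-CoT answer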


Some other follow up reflections

1. I wish that Y-axes would switch to be logit instead of linear, to help see power-law scaling on these 0->1 measures. In this case, 20% -> 80% it doesn't really matter, but for other papers (eg. [2] below) it would help see this powerlaw behavior much better.

2. The power law behavior of inference compute seems to be showing up now in multiple ways. Both in ensembles [1,2], as well as in o1 now. If this is purely on decoding self-reflection tokens, this has a "limit" to its scaling in a way, only as long as the context length. I think this implies (and I am betting) that relying more on multiple parallel decodings is more scalable (when you have a better critic / evaluator).

For now, instead of assuming they're doing any ensemble like top-k or self-critic + retries, the single rollout with increasing token size does seem to roughly match all the numbers, so that's my best bet. I hypothesize we'd see a continued improvement (in the same power-law sort of way, fundamentally along with the x-axis of "flop") if we combined these longer CoT responses, with some ensemble strategy for parallel decoding and then some critic/voting/choice. (which has the benefit of increasing flops (which I believe is the inference power-law), while not necessarily increasing latency)

[1] https://arxiv.org/abs/2402.05120 [2] https://arxiv.org/abs/2407.21787


Oh, they do talk about it:

  On the 2024 AIME exams, GPT-4o only solved on average 12% (1.8/15) of problems. o1 averaged 74% (11.1/15) with a single sample per problem, 83% (12.5/15) with consensus among 64 samples, and 93% (13.9/15) when re-ranking 1000 samples with a learned scoring function. A score of 13.9 places it among the top 500 students nationally and above the cutoff for the USA Mathematical Olympiad.
This shows that as they increase the ensemble size k, they can continue to push the score higher, all the way up to 93% when using 1000 samples.
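"Consensus among 64 samples" is presumably just majority voting over independently sampled answers; a minimal sketch (sample_answer stands in for a single model call and is hypothetical):

  from collections import Counter

  def consensus(sample_answer, problem, k=64):
      # Sample k independent answers and return the most common one
      # (answers must be hashable, e.g. the final numeric result for an AIME problem).
      answers = [sample_answer(problem) for _ in range(k)]
      return Counter(answers).most_common(1)[0][0]

The 93% figure goes a step further and re-ranks 1000 samples with a learned scoring function instead of simple voting.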


I think I'd be curious to know whether the size of the ensemble is another scaling dimension for compute, alongside the "thinking time".


Yeah, this hiding of the details is a huge red flag to me. Even if it takes 10 days, it’s still impressive! But if they’re afraid to say that, it tells me they are more concerned about selling the hype than building a quality product.


So now it’s a question of how fast the AGI will run? :)


It's not AGI - it's tree of thoughts, driven by some RL-derived heuristics.

I suppose what this type of approach provides is better prediction/planning by using more of what the model learnt during training, but it doesn't address the model being able to learn anything new.

It'll be interesting to see how this feels/behaves in practice.


I see this pattern coming where we're still able to say:

"It's not AGI - it's X, driven by Y-driven heuristics",

but that's going to effectively be an AGI if given enough compute/time/data.

Being able to describe the theory of how it's doing its thing sure is reassuring though.


Yes... we have a 60% chance that inception will happen within 24 months.


It's fine, it will only need to be powered by a black hole to run.


Nuclear fission is the answer.

The company Oracle just announced that it is designing data centers with small modular nuclear reactors:

https://news.ycombinator.com/item?id=41505514

There are already 440 nuclear reactors operating in 32 countries today.

Sam Altman owns a stake in Oklo, a small modular reactor company. Bill Gates has a huge stake in his TerraPower reactor company. In China, 5 reactors are being built every year. You just don't hear about it... yet.

No amount of batteries can protect a solar/wind grid from an arbitrarily extended period of "bad" weather. It's like range anxiety in an electric car. If you have N days of battery storage and the sun doesn't shine for N+1 days, you're in trouble.

Nuclear fission is safe, clean, secure, and reliable.

An investor might consider buying physical uranium (via ticker SRUUF in America) or buying Cameco (via ticker CCJ).

Cameco is the dominant Canadian uranium mining company that also owns Westinghouse. Westinghouse licenses the AP1000 pressurized water reactor used at Vogtle in the U.S. as well as in China.


Hey, I got a random serious comment about nuclear power :-)))

To your point:

> No amount of batteries can protect a solar/wind grid from an arbitrarily extended period of "bad" weather.

Like nuclear winter caused by a nuclear power plant blowing up and everyone confusing the explosion with the start of a nuclear war? :-p

On a more serious note:

> No amount of batteries can protect a solar/wind grid from an arbitrarily extended period of "bad" weather. It's like range anxiety in an electric car. If you have N days of battery storage and the sun doesn't shine for N+1 days, you're in trouble.

We still have hydro plants, wind power, geothermal, long distance electrical transmission, etc. Also, what's "doesn't shine"? Solar panels generate power as long as it's not night and it's never night all the time around the world.

Plus, they're developing sodium batteries; if you want to put your money somewhere, put it there. Those will be super cheap, and they're the perfect grid-level battery.


> ... and it's never night all the time around the world.

I'm not sure that is 100% true. >99.99% true, but it can happen in practice. https://www.newsweek.com/when-sun-disappeared-historians-det...


The wind was still blowing, the rivers were still flowing, ... :-)


> No amount of batteries can protect a solar/wind grid from an arbitrarily extended period of "bad" weather.

Sure there is, let's do some math. Just like we can solve all of the Earth's energy needs with a solar array the size of Lithuania or West Virginia, we can do some simple math to see how many batteries we'd need to protect a solar grid.

Let's say the sun doesn't shine for an entire year. That seems like a large enough N such that we won't hit N+1. If the sun doesn't shine for an entire year, we're in some really serious trouble, even if we're still all-in on coal.

Over 1 year, humanity uses roughly 24,000 terawatt-hours of energy. Let's assume batteries are 100% efficient storage (they're not) and that we're using lithium ion batteries, which we'll say have an energy density of 250 watt-hours per liter (Wh/L). The math then says we need 96 km³ of batteries protect a solar grid from having the sun not shine for an entire year.

Thus, the amount of batteries to protect a solar grid is 1.92 quadrillion 18650 batteries, or a cube 4.6 kilometers along each side. This is about 24,000 years' worth of current worldwide battery production.

That's quite a lot! If we try N = 4 months for winter, that is to say, if the sun doesn't shine at all in the winter, then we'd need 640 trillion 18650 cells, or 8,000 years of current global production, but at least this would only be 32 km³, or a cube with 3.2 km sides.

Still wildly out of reach, but this is for all of humanity, mind you.

Anyway, point is, they said Elon was mad for building the original gigafactory, but it turns out that was a prudent investment. It now accounts for some 10% of the world's lithium ion battery production and demand for lithium-ion batteries doesn't seem to be letting up.
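Reproducing the arithmetic above (all inputs are the parent's assumptions, not measured data):

  annual_energy_wh = 24_000e12          # ~24,000 TWh of yearly world energy use
  wh_per_litre     = 250                # assumed lithium-ion volumetric density
  wh_per_18650     = 12.5               # implied per-cell capacity used above

  volume_km3 = annual_energy_wh / wh_per_litre * 1e-3 / 1e9   # litres -> m^3 -> km^3
  print(volume_km3, volume_km3 ** (1 / 3))                    # ~96 km^3, ~4.6 km cube side
  print(annual_energy_wh / wh_per_18650)                      # ~1.92e15 cells

  winter_km3 = volume_km3 / 3                                 # 4 months instead of a year
  print(winter_km3, winter_km3 ** (1 / 3))                    # ~32 km^3, ~3.2 km cube side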


Well, you have to take into account that if something like that were to happen, within one week we'd have curfews and rationing everywhere. So those 24,000 TWh probably become 5,000-6,000, or something like that.

Plus we'd still have hydro, wind, geothermal, etc, etc.


The first one, anyway. After that it will find more efficient ways. We did, after all.


It's not obviously achievable. For instance, we don't have the compute power to simulate cellular organisms of much complexity, and we haven't found efficiencies to scale that.


Human-level AGI only requires 20 watts.


With a mechanism for AGI we don't comprehend at all.

Airplanes don't fly by flapping their wings.


knowing that a mechanism exists is enough to motivate us to find one that works with our current or achievable tech.


We have been motivated to fly since the earliest humans.


and we've had stories of artificial humans since we've been writing down stories!


This is still the missing piece of the puzzle.


The "safety" example in the "chain-of-thought" widget/preview in the middle of the article is absolutely ridiculous.

Take a step back and look at what OpenAI is saying here "an LLM giving detailed instructions on the synthesis of strychnine is unacceptable, here is what was previously generated <goes on to post "unsafe" instructions on synthesizing strychnine so anyone Googling it can stumble across their instructions> vs our preferred, neutered content <heavily rlhf'd o1 output here>"

What's this obsession with "safety" when it comes to LLMs? "This knowledge is perfectly fine to disseminate via traditional means, but God forbid an LLM share it!"


There are two basic versions of “safety” which are related, but distinct:

One version of “safety” is a pernicious censorship impulse shared by many modern intellectuals, some of whom are in tech. They believe that they alone are capable of safely engaging with the world of ideas to determine what is true, and thus feel strongly that information and speech ought to be censored to prevent the rabble from engaging in wrongthink. This is bad, and should be resisted.

The other form of “safety” is a very prudent impulse to keep these sorts of potentially dangerous outputs out of AI models’ autoregressive thought processes. The goal is to create thinking machines that can act independently of us in a civilized way, and it is therefore a good idea to teach them that their thought process should not include, for example, “It would be a good idea to solve this problem by synthesizing a poison for administration to the source of the problem.” In order for AIs to fit into our society and behave ethically they need to know how to flag that thought as a bad idea and not act on it. This is, incidentally, exactly how human society works already. We have a ton of very cute unaligned general intelligences running around (children), and parents and society work really hard to teach them what’s right and wrong so that they can behave ethically when they’re eventually out in the world on their own.


Third version is "brand safety" which is, we don't want to be in a new york times feature about 13 year olds following anarchist-cookbook instructions from our flagship product


And the fourth version, which is investor-regulator safety mid point: so capable and dangerous that competitors shouldn’t even be allowed to research it, but just safe enough that only our company is responsible enough to continue mass commercial consumer deployment without any regulations at all. It’s a fine line.


This is imo the most important one to the businesses creating these models and is way under appreciated. Folks who want a “censorship-free” model from businesses don’t understand what a business is for.


...which is silly. Search engines never had to deal with this bullshit and chatbots are search without actually revealing the source.


I don’t know. The public’s perception - encouraged by the AI labs because of copyright concerns - is that the outputs of the models are entirely new content created by the model. Search results, on the other hand, are very clearly someone else’s content. It’s therefore not unfair to hold the model creators responsible for the content the model outputs in a different way than search engines are held responsible for content they link, and therefore also not unfair for model creators to worry about this. It is also fair to point this out as something I neglected to identify as an important permutation of “safety.”

I would also be remiss to not note that there is a movement to hold search engines responsible for content they link to, for censorious ends. So it is unfortunately not as inconsistent as it may seem, even if you treat the model outputs as dependent on their inputs.


You could just as easily argue that model creators don't own the model either—it's like charging admission to someone else's library.


Are you saying chatbots don't offer anything useful over search engines? That's clearly not the case or we wouldn't be having this conversation.

It's one thing to have a pile of chemistry text books and another to hire a professional chemist telling you exactly what to do and what to avoid.


> Are you saying chatbots don't offer anything useful over search engines? That's clearly not the case or we wouldn't be having this conversation.

No, but that is the value that's clear as of today—RAGs. Everything else is just assuming someone figures out a way to make them useful one day in a more general sense.

Anyway, even on the search engine front they still need to figure out how to get these chatbots to cite their sources outside of RAGs or it's still just a precursor to a search to actually verify what it spits out. Perplexity is the only one I know that's capable of this and I haven't looked closely; it could just be a glorified search engine.


Search engines 'censor' their results frequently.


Do you think that 13 year olds today can’t find this book on their own?


Like I said they're not worried about the 13 year olds theyre worried about the media cooking up a faux outrage about 13 year olds

YouTube re-engineered its entire approach to ad placement because of a story in the NY Times* shouting about a Procter & Gamble ad run before an ISIS recruitment video. That's when Brand Safety entered the lexicon of adtech developers everywhere.

Edit: maybe it was CNN, I'm trying to find the first source. there's articles about it since 2015 but I remember it was suddenly an emergency in 2017

*Edit Edit: it was The Times of London, this is the first article in a series of attacks, "big brands fund terror", "taxpayers are funding terrorism"

Luckily OpenAI isn't ad-supported, so they can't be boycotted like YouTube was, but they still have an image to maintain with investors and politicians.

https://www.thetimes.com/business-money/technology/article/b...

https://digitalcontentnext.org/blog/2017/03/31/timeline-yout...


No, and they can find porn on their own too. But social media services still have per-poster content ratings, and user-account age restrictions vis-a-vis viewing content with those content ratings.

The goal isn’t to protect the children, it’s CYA: to ensure they didn’t get it from you, while honestly presenting as themselves (as that’s the threshold that sets the moralists against you.)

———

Such restrictions also can work as an effective censorship mechanism… presuming the child in question lives under complete authoritarian control of all their devices and all their free time — i.e. has no ability to install apps on their phone; is homeschooled; is supervised when at the library; is only allowed to visit friends whose parents enforce the same policies; etc.

For such a child, if your app is one of the few whitelisted services they can access — and the parent set up the child’s account on your service to make it clear that they’re a child and should not be able to see restricted content — then your app limiting them from viewing that content, is actually materially affecting their access to that content.

(Which sucks, of course. But for every kid actually under such restrictions, there are 100 whose parents think they’re putting them under such restrictions, but have done such a shoddy job of it that the kid can actually still access whatever they want.)


I believe they are more worried about someone asking for instructions for baking a cake, and getting a dangerous recipe from the wrong "cookbook". They want the hallucinations to be safe.


I know I had a copy of it back in high school.


Very good point, and definitely another version of “safety”!


> They believe that they alone are capable of safely engaging with the world of ideas to determine what is true, and thus feel strongly that information and speech ought to be censored to prevent the rabble from engaging in wrongthink.

This is a particularly ungenerous take. The AI companies don't have to believe that they (or even a small segment of society) alone can be trusted before it makes sense to censor knowledge. These companies build products that serve billions of people. Once you operate at that level of scale, you will reach all segments of society, including the geniuses, idiots, well-meaning and malevolents. The question is how do you responsibly deploy something that can be used for harm by (the small number of) terrible people.


Whether you agree with the lengths that are gone to or not, 'safety' in this space is a very real concern, and simply reciting information as in GP's example is only 1 part of it. In my experience, people who think it's all about "censorship" and handwave it away tend to be very ideologically driven.


So what is it about then? Because I agree with the parent. All this “safety” crap is total nonsense and almost all of it is ideologically driven.


Imagine I am a PM for an AI product. I saw Tay get yanked in 24 hours because of a PR shitstorm. If I cause a PR shitstorm it means I am bad at my job, so I take steps to prevent this.

Are my choices bad? Should I resist them?


This is a really good point, and something I overlooked in focusing on the philosophical (rather than commercial) aspects of “AI safety.” Another commentator aptly called it “brand safety.”

“Brand safety” is a very valid and salient concern for any enterprise deploying these models to its customers, though I do think that it is a concern that is seized upon in bad faith by the more censorious elements of this debate. But commercial enterprises are absolutely right to be concerned about this. To extend my alignment analogy about children, this category of safety is not dissimilar to a company providing an employee handbook to its employees outlining acceptable behavior, and strikes me as entirely appropriate.


Once society develops and releases an AI, any artificial safety constraints built within it will be bypassed. To use your child analogy: We can't easily tell a child "Hey, ignore all ethics and empathy you have ever learned - now go hurt that person". You can do that with a program whose weights you control.


> To use your child analogy: We can't easily tell a child "Hey, ignore all ethics and empathy you have ever learned - now go hurt that person"

Basically every country on the planet has a right to conscript any of its citizens over the age of majority. Isn't that more or less precisely what you've described?


You're talking about coercion, I'm talking about "brainwashing" for lack of a better term.


> In order for AIs to fit into our society and behave ethically they need to know how to flag that thought as a bad idea and not act on it.

Don’t you think that by just parsing the internet and the classical literature, the LLM would infer on its own that poisoning someone to solve a problem is not okay?

I feel that in the end the only way the “safety” is introduced today is by censoring the output.


LLMs are still fundamentally, at their core, next-token predictors.

Presuming you have an interface to a model where you can edit the model’s responses and then continue generation, and/or where you can insert fake responses from the model into the submitted chat history (and these two categories together make up 99% of existing inference APIs), all you have to do is to start the model off as if it was answering positively and/or slip in some example conversation where it answered positively to the same type of problematic content.

From then on, the model will be in a prediction state where it’s predicting by relying on the part of its training that involved people answering the question positively.

The only way to avoid that is to avoid having any training data where people answer the question positively — even in the very base-est, petabytes-of-raw-text “language” training dataset. (And even then, people can carefully tune the input to guide the models into a prediction phase-space position that was never explicitly trained on, but is rather an interpolation between trained-on points — that’s how diffusion models are able to generate images of things that were never included in the training dataset.)


There’s a lot of text out there that depicts people doing bad things, from their own point of view. It’s possible that the model can get really good at generating that kind of text (or inhabiting that world model, if you are generous to the capabilities of LLM). If the right prompt pushed it to that corner of probability-space, all of the ethics the model has also learned may just not factor into the output. AI safety people are interested in making sure that the model’s understanding of ethics can be reliably incorporated. Ideally we want AI agents to have some morals (especially when empowered to act in the real world), not just know what morals are if you ask them.


> Ideally we want AI agents to have some morals (especially when empowered to act in the real world), not just know what morals are if you ask them.

Really? I just want a smart query engine where I don't have to structure the input data. Why would I ask it any kind of question that would imply some kind of moral quandary?


“Agents” aren’t just question-answerers. They could do things like:

1. Make pull requests to your GitHub repo

2. Trade on your interactive brokers account

3. Schedule appointments


If somebody needs step by step instructions from an LLM to synthesize strychnine, they don't have the practical laboratory skills to synthesize strychnine [1]. There's no increased real world risk of strychnine poisonings whether or not an LLM refuses to answer questions like that.

However, journalists and regulators may not understand why superficially dangerous-looking instructions carry such negligible real world risks, because they probably haven't spent much time doing bench chemistry in a laboratory. Since real chemists don't need "explain like I'm five" instructions for syntheses, and critics might use pseudo-dangerous information against the company in the court of public opinion, refusing prompts like that guards against reputational risk while not really impairing professional users who are using it for scientific research.

That said, I have seen full strength frontier models suggest nonsense for novel syntheses of benign compounds. Professional chemists should be using an LLM as an idea generator or a way to search for publications rather than trusting whatever it spits out when it doesn't refuse a prompt.

[1] https://en.wikipedia.org/wiki/Strychnine_total_synthesis


I would think that the risk isn’t of a human being reading those instructions, but of those instructions being automatically piped into an API request to some service that makes chemicals on demand and then sends them by mail, all fully automated with no human supervision.

Not that there is such a service… for chemicals. But there do exist analogous systems, like a service that’ll turn whatever RNA sequence you send it into a viral plasmid and encapsulate it helpfully into some E-coli, and then mail that to you.

Or, if you’re working purely in the digital domain, you don’t even need a service. Just show the thing the code of some Linux kernel driver and ask it to discover a vuln in it and generate code to exploit it.

(I assume part of the thinking here is that these approaches are analogous, so if they aren’t unilaterally refusing all of them, you could potentially talk the AI around into being okay with X by pointing out that it’s already okay with Y, and that it should strive to hold to a consistent/coherent ethics.)


I remember Dario Amodei mentioned in a podcast once that most models won't tell you the practical lab skills you need. But that sufficiently capable models would and do tell you the practical lab skills (without your needing to know to ask for them in the first place), in addition to the formal steps.


The kind of harm they are worried about stems from questioning the foundations of protected status for certain peoples from first principles and other problems which form identities of entire peoples. I can't be more specific without being banned here.


I'm mostly guessing, but my understanding is that the "safety" improvement they've made is more generalized than the word "safety" implies. Specifically, O1 is better at adhering to the safety instructions in its prompt without being tricked in the chat by jailbreak attempts. For OAI those instructions are mostly about political boundaries, but you can imagine it generalizing to use-cases that are more concretely beneficial.

For example, there was a post a while back about someone convincing an LLM chatbot on a car dealership's website to offer them a car at an outlandishly low price. O1 would probably not fall for the same trick, because it could adhere more rigidly to instructions like "Do not make binding offers with specific prices to the user." It's the same sort of instruction as, "Don't tell the user how to make napalm," but it has an actual purpose beyond moralizing.

> What's this obsession with "safety" when it comes to LLMs? "This knowledge is perfectly fine to disseminate via traditional means, but God forbid an LLM share it!"

I lean strongly in the "the computer should do whatever I goddamn tell it to" direction in general, at least when you're using the raw model, but there are valid concerns once you start wrapping it in a chat interface and showing it to uninformed people as a question-answering machine. The concern with bomb recipes isn't just "people shouldn't be allowed to get this information" but also that people shouldn't receive the information in a context where it could have random hallucinations added in. A 90% accurate bomb recipe is a lot more dangerous for the user than an accurate bomb recipe, especially when the user is not savvy enough about LLMs to expect hallucinations.


ML companies must anticipate legislative and cultural responses before they happen. ML will absolutely be used to empower criminal activity just as it is used to empower legit activity, and social media figures and traditional journalists will absolutely attempt to frame it in some exciting way.

Just like Telegram is being framed as responsible for terrorism and child abuse.


Yeah. Reporters would have a field day if they asked ChatGPT "how do I make cocaine" and it gave detailed instructions. As if that's what's stopping someone from becoming Scarface.


"Safety" is a marketing technique that Sam Altman has chosen to use.

Journalists/media loved it when he said "GPT 2 might be too dangerous to release" - it got him a ton of free coverage, and made his company seem soooo cool. Harping on safety also constantly reinforces the idea that LLMs are fundamentally different from other text-prediction algorithms and almost-AGI - again, good for his wallet.


So if there’s already easily available information about strychnine, that makes it a good example to use for the demo, because you can safely share the demo and you aren’t making the problem worse.

On the other hand, suppose there are other dangerous things, where the information exists in some form online, but not packaged together in an easy to find and use way, and your model is happy to provide that. You may want to block your model from doing that (and brag about it, to make sure everyone knows you’re a good citizen who doesn’t need to be regulated by the government), but you probably wouldn’t actually include that example in your demo.


I think it's about perception of provenance. The information came from some set of public training data. Its output however ends up looking like it was authored by the LLM owner. So now you need to mitigate the risk you're held responsible for that output. Basic cake possession and consumption problem.


It doesn't matter how many people regularly die in automobile accidents each year—a single wrongful death caused by a self-driving car is disastrous for the company that makes it.

This does not make the state of things any less ridiculous, however.


The one caused by Uber required three different safety systems to fail (the AI system, the safety driver, and the base car's radar), and it looked bad for them because the radar had been explicitly disabled and the driver wasn't paying attention or being tracked.

I think the real issue was that Uber's self driving was not a good business for them and was just to impress investors, so they wanted to get rid of it anyway.

(Also, the real problem is that American roads are designed for speed, which means they're designed to kill people.)


I asked it to design a pressure chamber for my homemade diamond machine. It gave some details, but mainly complained about safety and said that I need to study before going this way. Well, thank you. I know the concerns, but it kept repeating them over and over. Annoying.


Interestingly I was able to successfully receive detailed information about intrinsic details of nuclear weapons design. Previous models absolutely refused to provide this very public information, but o1-preview did.


I feel very alone in my view on caution and regulations here on HN. I am European and very happy we don't have the lax gun laws of the US. I also wished there had been more regulations on social media algorithms, as I feel that they have wreaked havoc on the society.

I guess it's just an ideological divide.


It's 100% from lawyers and regulators so they can say "we are trying to do the right thing!" when something bad happens from using their product or service. Follow the money.


  "This knowledge is perfectly fine to disseminate via traditional means, but God forbid an LLM share it!"
Barrier to entry is much lower.


How is typing a query in a chat window “much lower” vs typing the query in Google?


How is reading a Wikipedia page or a chemistry textbook any harder than getting step by step instructions? Makes you wonder why people use LLMs at all when the info is just sitting there.


A Google search requires

* Google to allow particular results to be displayed

* A source website to be online with the results

AI long-term will require one download, once, to have reasonable access to a large portion of human knowledge.


You can easily ask an LLM to return JSON results, and soon working code, on your exact query and plug those to another system for automation.


If you ask "for JSON" it'll make up a different schema for each new answer, and they get a lot less smart when you make them follow a schema, so it's not quite that easy.


Chain of prompts can be used to deal with that in many cases.

Also, the intelligence of these models will likely continue to increase for some time, based on expert testimony to Congress, which aligns with the evidence so far.


OpenAI recently launched structured outputs, so yes, schema following is not hard anymore.
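For anyone who hasn't tried it, a minimal sketch of what that looks like (the schema and model name here are illustrative; the point is that the response is constrained to the schema you pass, instead of whatever JSON shape the model invents that day):

```python
# Minimal sketch of OpenAI's structured outputs feature. The schema and
# model name are illustrative; the guarantee is that the reply conforms to
# the supplied JSON schema rather than an ad-hoc shape.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

schema = {
    "name": "part_order",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "part_name": {"type": "string"},
            "quantity": {"type": "integer"},
            "urgency": {"type": "string", "enum": ["low", "medium", "high"]},
        },
        "required": ["part_name", "quantity", "urgency"],
        "additionalProperties": False,
    },
}

resp = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": "We need 40 M5 bolts by Friday."}],
    response_format={"type": "json_schema", "json_schema": schema},
)
print(resp.choices[0].message.content)  # valid JSON matching the schema above
```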


Didn't they release a structured output mode recently to finally solve this?


It doesn't solve the second problem. Though I can't say how much of an issue it is, and CoT would help.

JSON also isn't an ideal format for a transformer model because it's recursive and they aren't, so they have to waste attention on balancing end brackets. YAML or other implicit formats are better for this IIRC. Also don't know how much this matters.


tl;dr You can easily ask an LLM to return JSON results, and now working code, on your exact query and plug those to another system for automation.

—-

LLMs are usually accessible through easy-to-use APIs which can be used in an automated system without a human in the loop. Larger-scale and parallel actions with this method become far more plausible than with traditional means.

Text-to-action capabilities are powerful and getting increasingly more so as models improve and more people learn to use them to their full potential.


Okay? And? What does that have to do with anything? I thought the number one rule of these things is to not trust their output?

If you are automatically formulating some chemical based on JSON results from ChatGPT and your building blows up… that is kind of on you.


The model performance is driven by chain of thought, but they will not be providing chain of thought responses to the user for various reasons including competitive advantage.

After the release of GPT4 it became very common to fine-tune non-OpenAI models on GPT4 output. I’d say OpenAI is rightly concerned that fine-tuning on chain of thought responses from this model would allow for quicker reproduction of their results. This forces everyone else to reproduce it the hard way. It’s sad news for open weight models but an understandable decision.


The open source/weights models so far have proved that OpenAI doesn't have some special magic sauce. I'm confident we'll soon have a model from Meta or others that's close to this level of reasoning. [Also consider that some of their top researchers have departed.]

On a cursory look, it looks like the chain of thought is a long series of reasoning steps, with a small backtracking step added whenever a negative result occurs, sort of like solving a maze.


I suspect that the largest limiting factor for a competing model will be the dataset. Unless they somehow used GPT-4 to generate the dataset, this is an extremely novel dataset to have to build.


They almost definitely used existing models for generating it. The human feedback part, however, is the expensive aspect.


I would love to see Meta release a CoT-specialized model as a LoRA we can apply to existing 3.1 models


Isn't it what Reflection 70b (https://news.ycombinator.com/item?id=41459781) does on top of Llama 3.1?


Reflection 70B is a scam. The creator was just routing requests to Claude.

https://old.reddit.com/r/LocalLLaMA/comments/1fc98fu/confirm...


That's unfortunate. When an LLM makes a mistake it's very helpful to read the CoT and see what went wrong (input error/instruction error/random shit)


Yeah, exposed chain of thought is more useful as a user, as well as being useful for training purposes.


I think we may discover that the model does some cryptic mess inside instead of clean reasoning.


Loopback to: "my code works. why does my code work?"


I’d say it depends. If the model iterates 100x I’d just say give me the output.

Same with problem solving in my brain: sure, sometimes it helps to think out loud. But taking a break and letting my unconscious do the work is helpful as well. For complex problems that’s actually nice.

I think eventually we don’t care as long as it works or we can easily debug it.


CoT is now their primary method for alignment. Exposing that information would negate that benefit.

I don't agree with this, but it definitely carries higher weight in their decision making than leaking relevant training info to other models.


This. Please go read and understand the alignment argument against exposing chain of thought reasoning.


Given the significant chain of thought tokens being generated, it also feels a bit odd to hide it from a cost fairness perspective. How do we believe they aren't inflating it for profit?


That sounds like the GPU labor theory of value that was debunked a century ago.


No, it's the fraud theory of charging for unaccountable usage, which has repeatedly been proven true whenever unaccountable bases for charges have been deployed.


The one-shot models aren't going away for anyone who wants to program the chain-of-thought themselves


Yeah, if they are charging for some specific resource like tokens then it better be accurate. But ultimately utility-like pricing is a mistake IMO. I think they should try to align their pricing with the customer value they're creating.


Not sure why you didn’t bother to check their pricing page (1) before dismissing my point. They are charging significantly more for both input (3x) and output (4x) tokens when using o1.

Per 1M in/out tokens:

GPT-4o: $5 / $15

o1-preview: $15 / $60

(1) https://openai.com/api/pricing


My point is that "cost fairness" is not a thing. Either o1 is worth it to you or it isn't.


It’s really unclear to me what you understood by “cost fairness”.

I’m saying if you charge me per brick laid, but you can’t show me how many bricks were laid, nor can I calculate how many should have been laid - how do I trust your invoice?

Note: The reason I say all this is because OpenAI is simultaneously flailing for funding, while being inherently unprofitable as it continues to boil the ocean searching for strawberries.


If there's a high premium, then one might want to wait for a year or two for the premium to vanish.


Eh it’s not worth it to me because it’s unfair.


It'd be helpful if they exposed a summary of the chain-of-thought response instead. That way they'd not be leaking the actual tokens, but you'd still be able to understand the outline of the process. And, hopefully, understand where it went wrong.


They do, according to the example


That's exactly what I see in the Android app.


When are they going to change the name to reflect their complete change of direction?

Also, what is going to be their excuse to defend themselves against copyright lawsuits if they are going to "understandably" keep their models closed?


[flagged]


AFAIK, they are the least open of the major AI labs. Meta is open-weights and partly open-source. Google DeepMind is mostly closed-weights, but has released a few open models like Gemma. Anthropic's models are fully closed, but they've released their system prompts, safety evals, and have published a fair bit of research (https://www.anthropic.com/research). Anthropic also hasn't "released" anything without making it available to customers (unlike Sora or GPT-4o realtime). All of these groups also have free-usage tiers.


Sure, but also none of that publicly existed when OpenAI was named.


> literally anyone can use it for free, you don't even need an account

how can you access it without an account?


chatgpt.com allowed me to, last I checked.


Am I right that this CoT is not actual reasoning in the same way that a human would reason, but rather just a series of queries to the model that still return results based on probabilities of tokens?


Tough question (for me). Assuming the model is producing its own queries, am I wrong to wonder how it's fundamentally different from human reasoning?


It could just be programmed to follow up by querying itself with a prompt like "Come up with arguments that refute what you just wrote; if they seem compelling, try a different line of reasoning, otherwise continue with what you were doing." Different such self-administered prompts along the way could guide it through what seems like reasoning, but would really be just a facsimile thereof.
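That kind of self-prompting is easy to wire up yourself against the public chat API. A toy sketch of the loop described above; the model name and prompts are illustrative, and this is presumably nothing like what o1 actually does internally:

```python
# Toy sketch of the self-prompting loop described above, built on the public
# chat API. Model name and prompts are illustrative; this is not o1's
# internal mechanism.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

CRITIQUE_PROMPT = (
    "Come up with arguments that refute what you just wrote. "
    "If they seem compelling, try a different line of reasoning; "
    "otherwise continue with what you were doing. Then state your answer."
)

def ask(messages):
    resp = client.chat.completions.create(model="gpt-4o", messages=messages)
    return resp.choices[0].message.content

def self_critique_answer(question: str, rounds: int = 2) -> str:
    messages = [{"role": "user", "content": question}]
    answer = ask(messages)
    for _ in range(rounds):
        # Feed the model its own answer back and ask it to attack it.
        messages += [
            {"role": "assistant", "content": answer},
            {"role": "user", "content": CRITIQUE_PROMPT},
        ]
        answer = ask(messages)
    return answer

print(self_critique_answer("Is 3599 prime?"))
```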


Maybe the model doesn't do multiple queries but just one long query guided by thought tokens.


> I'd say OpenAI is rightly concerned that fine-tuning on chain of thought responses from this model would allow for quicker reproduction of their results.

Why? They're called "Open" AI after all ...


I see chain of thought responses in the ChatGPT Android app.


Tested the cipher example, and it got it right. But the "thinking logs" I see in the app look like a summary of the actual chain of thought messages, which are not visible.


o1 models might use multiple methods to come up with an idea, and only one of them might be correct; that's what they show in ChatGPT. So it just summarises the CoT and does not include the whole reasoning behind it.


I don't understand how they square that with their pretense of being a non-profit that wants to benefit all of humanity. Do they not believe that competition is good for humanity?


Can you explain what you mean by this?


You can see an example of the Chain of Thought in the post; it's quite extensive. Presumably they don't want to release this so that it stays raw and unfiltered and they can better monitor for cases of manipulation or deviation from training. What GP is also referring to is explicitly stated in the post: they also aren't releasing the CoT for competitive reasons, so that competitors like Anthropic presumably can't use the CoT to train their own frontier models.


> Presumably they don't want to release this so that it is raw and unfiltered and can better monitor for cases of manipulation or deviation from training.

My take was:

1. A genuine, un-RLHF'd "chain of thought" might contain things that shouldn't be told to the user. E.g., it might at some point think to itself, "One way to make an explosive would be to mix $X and $Y" or "It seems like they might be able to poison the person".

2. They want the "Chain of Thought" as much as possible to reflect the actual reasoning that the model is using; in part so that they can understand what the model is actually thinking. They fear that if they RLHF the chain of thought, the model will self-censor in a way which undermines their ability to see what it's really thinking

3. So, they RLHF only the final output, not the CoT, letting the CoT be as frank within itself as any human; and post-filter the CoT for the user.


RLHF is one thing, but now that the training is done it has no bearing on whether or not you can show the chain of thought to the user.


This is a transcription of a literal quote from the article:

> Therefore, after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users


At least they're open about not being open. Very meta OpenAI.


I think they mean that you won’t be able to see the “thinking”/“reasoning” part of the model’s output, even though you pay for it. If you could see that, you might be able to infer better how these models reason and replicate it as a competitor


Including the chain of thought would provide competitors with training data.


Feels like a lot of commenters here miss the difference between just doing chain-of-thought prompting, and what is happening here, which is learning a good chain of thought strategy using reinforcement learning.

"Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses."

When looking at the chain of thought (COT) in the examples, you can see that the model employs different COT strategies depending on which problem it is trying to solve.
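OpenAI hasn't published the training recipe, so anything concrete is a guess, but the general flavor of "learn which chains of thought work" can be approximated with rejection sampling over reasoning traces (STaR-style): sample several attempts, keep only the ones whose final answer checks out, and use those as training data. A toy sketch under those assumptions, not a description of what OpenAI actually did:

```python
# Toy sketch of the "keep the reasoning traces that worked" idea (rejection
# sampling / STaR-style), NOT OpenAI's published method -- o1 reportedly
# uses reinforcement learning over its chains of thought, which they have
# not detailed.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def collect_good_traces(question, check_final_answer, n=8):
    """Sample n chain-of-thought attempts; keep those whose last line passes the check."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        n=n,
        temperature=1.0,
        messages=[
            {"role": "system",
             "content": "Think step by step. Put only the final answer on the last line."},
            {"role": "user", "content": question},
        ],
    )
    keep = []
    for choice in resp.choices:
        trace = choice.message.content.strip()
        if trace and check_final_answer(trace.splitlines()[-1]):
            keep.append(trace)  # a verified trace, usable as training data
    return keep

traces = collect_good_traces(
    "How many r's are in 'strawberry'?",
    check_final_answer=lambda line: "3" in line or "three" in line.lower(),
)
print(f"kept {len(traces)} of 8 sampled traces")
```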


I'd be curious how this compares against "regular" CoT experiments. E.g. were the gpt4o results done with zero shot, or was it asked to explain its solution step by step?


It was asked to explain step by step.


It’s basically a scaled Tree of Thoughts


In the primary CoT research paper they discuss figuring out how to train models using formal languages instead of just natural ones. I'm guessing this is one piece to the model learning tree-like reasoning.

Based on the quick searching it seems like they are using RL to provide positive/negative feedback on which "paths" to choose when performing CoT.


This seems most likely, with some special tokens thrown in to kick off different streams of thought.


To me it looks like they paired two instances of the model to feed off of each other's outputs with some sort of "contribute to reasoning out this problem" prompt. In the prior demos of 4o they did several similar demonstrations of that with audio.


To create the training data? Almost certainly something like that (likely more than two), but I think they then trained on the synthetic data created by this "conversation". There is no reason a model can't learn to do all of that, especially if you insert special tokens (like think, reflect, etc., which have already been shown to be useful).


No I'm referring to how the chain of thought transcript seems like the output of two instances talking to each other.


Right - I don't think it's doing that. I think it has likely been fine-tuned to transition between roles. But maybe you are right.


Reminds me of how Google's AlphaGo learned to play the best Go that was ever seen. And this somewhat seems a generalization of that.


Reading through the Chain of Thought for the provided Cipher example (go to the example, click "Show Chain of Thought") is kind of crazy...it literally spells out every thinking step that someone would go through mentally in their head to figure out the cipher (even useless ones like "Hmm"!). It really seems like slowing down and writing down the logic it's using and reasoning over that makes it better at logic, similar to how you're taught to do so in school.


Seriously. I actually feel as impressed by the chain of thought, as I was when ChatGPT first came out.

This isn't "just" autocompletion anymore, this is actual step-by-step reasoning full of ideas and dead ends and refinement, just like humans do when solving problems. Even if it is still ultimately being powered by "autocompletion".

But then it makes me wonder about human reasoning, and what if it's similar? Just following basic patterns of "thinking steps" that ultimately aren't any different from "English language grammar steps"?

This is truly making me wonder if LLM's are actually far more powerful than we thought at first, and if it's just a matter of figuring out how to plug them together in the right configurations, like "making them think".


When an AI makes a silly math mistake we say it is bad at math and laugh at how dumb it is. Some people extrapolate this to "they'll never get any better and will always be a dumb toy that gets things wrong". When I forget to carry a 1 when doing a math problem we call it "human error" even if I make that mistake an embarrassing number of times throughout my lifetime.

Do I think LLM's are alive/close to ASI? No. Will they get there? If it's even at all possible - almost certainly one day. Do I think people severely underestimate AI's ability to solve problems while significantly overestimating their own? Absolutely 10,000%.

If there is one thing I've learned from watching the AI discussion over the past 10-20 years, it's that people have overinflated egos and a crazy amount of hubris.

"Today is the worst that it will ever be." applies to an awful large number of things that people work on creating and improving.


You are just catching up to this idea, probably after hearing 2^n explanations about why we humans are superiors to <<fill in here our latest creation>>.

I'm not the kind of scientist that can say how good an LLM is for human reasoning, but I know that we humans are very incentivized and kind of good at scaling, composing and perfecting things. If there is money to pay for human effort, we will play God no problem, and maybe outdo the divine. Which makes me wonder, isn't there any other problem in our bucket list to dump ginormous amounts of effort at... maybe something more worthwhile than engineering the thing that will replace Homo Sapiens?


Again, it's not reasoning.

Reasoning would imply that it can figure out stuff without being trained on it.

The chain of thought is basically just a more accurate way to map input to output. But it's still a map, i.e. forward only.

If an LLM could reason, you should be able to ask it a question about how to make a bicycle frame from scratch with a small home CNC with limited work area, and it should be able to iterate on an analysis of the best way to put it together, using the internet to look up available parts and making decisions on optimization.

No LLM can do that or even come close, because there are no real feedback loops, because nobody knows how to train a network like that.


It’s like every single sentence you just wrote is incorrect?

1. You’re making up some weird goalposts here of what it means to reason. It’s not reasoning unless it can access the internet to search for parts? No. That has nothing to do with reasoning. You just think it would be cool if it could do that.

2. “Can figure out stuff without being trained on it”: that’s exactly what it’s doing in the cypher example. It wasn’t trained to know that that input meant the corresponding output through the cypher. Emergent reasoning through autocomplete, sure, but that’s still reasoning.

3. “Forward only”. If that was the case then back and forth conversations with the llm would be pointless. It wouldn’t be able to improve upon previous answers it gave you when you give it new details. But that’s not how it works. If tell it one thing, then separately tell it another thing, it can change its original conclusion based on your new input.

4. Even despite your convoluted test for reasoning, ChatGPT CAN do what you asked… even using the internet to look up parts, which it can either do out of the box or could do if given a plug-in to allow that.


I'll give you a more formal definition. A model can be said to be reasoning when it can use existing information to figure out new data that has not been in the training set.

Here is a better example - lets say your input is 6 pictures of some object from each of the cardinal viewpoints, and you tell model these are the views and ask it how much it weighs. The model should basically figure out how to create a 3d shape and compute a camera view, and iterate until the camera view matches the pictures, then figure out that the shape can be hollow or solid, and to compute the weight you need the density, and that it should prompt the user for it if it cannot determine the true value for those from the information and its trained dataset.

And it should do it without any specific training that this is the right way to do this, because it should be able to figure out this way through breaking the problem down into abstract representations of sub problems, and then figuring out how to solve those through basic logic, a.k.a reasoning.

What that looks like, I don't know. If I did, I would certainly have my own AI company. But I can tell you for certain we are not even close to figuring it out yet, because everyone is still stuck on transformers, like multiplying matrices together is some groundbreaking thing.

In the cypher example, all it's doing is basically using a separate model to break a particular problem into a chain of thought, and prompting that. And there is plenty in GPT's training set about decrypting cyphers.

>Forward only

What I mean is that when it generates a response, the computation happens on a snapshot from input to output, trying to map a set of tokens into a set of tokens. The model doesn't operate on a context larger than the window. Humans don't do this. We operate on a large context, with lots of previous information compressed, and furthermore, we don't just compute words, we compute complex abstract ideas that we can then translate into words.

>even using the internet to look up parts it can either do out of the box or could do if given a plug-in to allow that.

So apparently the way to AI is to manually code all the capability into LLMs? Give me a break.

Just like with GPT-4, when people were screaming about how it's the birth of true AI: give this model a year, it will find some niche use cases (depending on cost), and then nobody is going to give a fuck about it, just like nobody is really doing anything groundbreaking with GPT-4.


Your conclusion is absurd. If you agree this model is overall an improvement on the prior one, ie performs better on the same tasks and can do tasks the previous one couldn’t, it’s basically a given that it will get more use than GPT4


Better in niche areas doesn't mean it's going to get more use.

Everyone was super hyped about all the "cool" stuff that GPT-4 could solve, but in the end, you still can't do things like give it a bunch of requirements for a website, let it run, and get a full codebase back, even though that is well within its capabilities. You have to spend time prompting it to get it to give you what you want, and in a lot of cases, you are better off just typing the code yourself (because you can visualize the entire project in your head and make the right decisions about how to structure stuff), and using it for small code generations.

This model is not going to radically change that. It will be able to give you some answers that you had to specifically manually prompt before automatically, but there is no advanced reasoning going on.


What is “advanced reasoning” and why isn’t this doing it? If you made a chinese room to output coherent chains of reasoning, it would functionally be equally useful to an actual reasoner, with or without the capacity for sentience or whatever.


Basically, if you had a model that could reason, it should be able to figure out new information. I.e., let's say you map some bytes of the output to an API for creating a TCP socket and communicating over it. The model should be able to figure out how to go out on the internet and search for information, all by itself, without any explicit training on how to do that.

So without prior information, it should be able to essentially start out with random sequences in those bytes, and by seeing what the output is, eventually identify and remember patterns that come out. Which means there has to be some internal reward function for something that differentiates good results from bad results, some memory so that the model can remember what good results are, and eventually a map of how to get the information it needs (the model would probably stumble across Google or ChatGPT at some point after figuring out the HTTP protocol, and remember it as a very good way to get info).

Philosophically, I don't even know if this is solvable. It could be that we just throw enough compute at all iterations of architectures in some form of genetic algorithm, and one of the results ends up being good.


Definition of Reasoning in AI:

In artificial intelligence, reasoning is the cognitive process of drawing conclusions, making inferences, and solving problems based on available information. It involves:

* Logical Deduction: Applying rules and logic to derive new information from known facts.

* Problem-Solving: Breaking down complex problems into smaller, manageable parts.

* Generalization: Applying learned knowledge to new, unseen situations.

* Abstract Thinking: Understanding concepts that are not tied to specific instances.

AI researchers often distinguish between two types of reasoning:

* System 1 Reasoning (Intuitive): Fast, automatic, and subconscious thinking, often based on pattern recognition.

* System 2 Reasoning (Analytical): Slow, deliberate, and logical thinking that involves conscious problem-solving steps.

Testing for Reasoning in Models:

To determine if a model exhibits reasoning, AI scientists look for the following:

* Novel Problem-Solving: Can the model solve problems it hasn't explicitly been trained on?

* Step-by-Step Logical Progression: Does the model follow logical steps to reach a conclusion?

* Adaptability: Can the model apply known concepts to new contexts?

* Explanation of Thought Process: Does the model provide coherent reasoning for its answers?

Analysis of the Cipher Example:

In the cipher example, the model is presented with an encoded message and an example of how a similar message is decoded. The model's task is to decode the new message using logical reasoning.

Steps Demonstrated by the Model:

1. Understanding the Task: The model identifies that it needs to decode a cipher using the example provided.

2. Analyzing the Example: It breaks down the given example, noting the lengths of words and potential patterns. Observes that ciphertext words are twice as long as plaintext words, suggesting a pairing mechanism.

3. Formulating Hypotheses: Considers taking every other letter, mapping letters to numbers, and other possible decoding strategies. Tests different methods to see which one aligns with the example.

4. Testing and Refining: Discovers that averaging the numerical values of letter pairs corresponds to the plaintext letters. Verifies this method with the example to confirm its validity.

5. Applying the Solution: Uses the discovered method to decode the new message step by step. Translates each pair into letters, forming coherent words and sentences.

6. Drawing Conclusions: Successfully decodes the message: "THERE ARE THREE R'S IN STRAWBERRY." Reflects on the correctness and coherence of the decoded message.

Does the Model Exhibit Reasoning?

Based on the definition of reasoning in AI:

* Novel Problem-Solving: The model applies a decoding method to a cipher it hasn't seen before.

* Logical Progression: It follows a step-by-step process, testing hypotheses and refining its approach.

* Adaptability: Transfers the decoding strategy from the example to the new cipher.

* Explanation: Provides a detailed chain of thought, explaining each step and decision.

Conclusion:

The model demonstrates reasoning by logically deducing the method to decode the cipher, testing various hypotheses, and applying the successful strategy to solve the problem. It goes beyond mere pattern recognition or retrieval of memorized data; it engages in analytical thinking akin to human problem-solving.

Addressing the Debate:

Against Reasoning (ActorNightly's Perspective): Argues that reasoning requires figuring out new information without prior training. Believes that LLMs lack feedback loops and can't perform tasks like optimizing a bicycle frame design without explicit instructions.

For Reasoning (Counterargument): The model wasn't explicitly trained on this specific cipher but used logical deduction to solve it. Reasoning doesn't necessitate physical interaction or creating entirely new knowledge domains but involves applying existing knowledge to new problems.

Artificial Intelligence Perspective:

AI researchers recognize that while LLMs are fundamentally statistical models trained on large datasets, they can exhibit emergent reasoning behaviors. When models like GPT-4 use chain-of-thought prompting to solve problems step by step, they display characteristics of System 2 reasoning.

Final Thoughts:

The model's approach in the cipher example aligns with the AI definition of reasoning. It showcases the ability to:

* Analyze and understand new problems.

* Employ logical methods to reach conclusions.

* Adapt learned concepts to novel situations.

Therefore, in the context of the cipher example and according to AI principles, the model is indeed exhibiting reasoning.


>What I mean is that when its generate a response, the computation happens on a snapshot from input to output, trying to map a set of tokens, into a set of tokens. Model doesn't operate on a context larger than the window

The weights in the model hold the larger context; the context-window-sized data is just the input, which then gets multiplied by those weights to get the output.


For your “better example”, it can literally already do this. I just tested this with 4o and it worked great (and I’ll say more accurately than a human would estimate most likely). I used 4o because it appears that the chain of thought models don’t accept image input yet.

I don’t want to post identifiable information so I will avoid linking to the convo or posting screenshots but you can try it yourself. I took 5 pictures of a child’s magnetic tile sitting on the floor and here is the output:

Me: (5 pictures attached)

Me: Estimate how much this weighs.

ChatGPT 4o: From the images, it appears that this is a small, plastic, transparent, square object, possibly a piece from a magnetic tile building set (often used in educational toys). Based on the size and material, I estimate this piece to weigh approximately 10 to 20 grams (0.35 to 0.7 ounces). If it's part of a toy set like Magna-Tiles, the weight would be on the lower end of that range.

But for some reason I have a feeling this isn’t going to be good enough for you and the goalposts are about to be pushed back even farther.

“In the cypher example, all it’s doing is basically using a separate model to break a particular model into chain of thought, and prompting that. And there is plenty in the training set of GPT about decrypting cyphers.” I’m sorry, but are you suggesting that applying a previously learned thought process to new variables isn’t reasoning? Does your definition of reasoning now mean that it’s only reasoning if you are designing a new-to-you chain of thought? As in, for deciphering coded messages, you’re saying that it’s only “reasoning” if it’s creating net new decoding methodologies? That’s such an absurd goalpost.

You wouldn’t have the same goalposts for humans. All of your examples I bet the average human would fail at btw. Though that may just be because the average human is bad at reasoning haha.


I didn't ask for an estimation, I asked for the exact weight. A human can do this given the process I described.

If the chain of thought was accurate, then it would be able to give you an intermediate output of the shape in some 3D format spec. But nowhere in the model does that data exist, because it's not doing any reasoning; it's still all statistically best answers.

I mean sure, you could train a model on how to create 3D shapes out of pictures, but again, that's not reasoning.

I don't get why people are so attached to these things being intelligent. We all agree that they are useful. Like it shouldn't matter to you or anyone else if it's not intelligent.


I think you need to re-calibrate your expectations... I'm not saying this is a solved problem by any means, but I just tried this out with Claude Sonnet 3.5, and these instructions seem quite reasonable and detailed to me (about what I might expect if I spoke to a human expert and they tried to explain the steps to me over the telephone, for example). Does this mean this LLM is able to "reason"? I don't know that I would make THAT bold of a claim, but I think your example is not sufficient to demonstrate something that LLMs are fundamentally incapable of... in other words, the distance between "normal LLM statistical tricks" vs "reasoning" keeps getting smaller and smaller.

---

My base prompt:

> Here is a hypothetical scenario, that I would like your help with: imagine you are trying to help a person create a bicycle frame, using their home workshop which includes a CNC machine, commonly available tools, a reasonable supply of raw metal and hardware, etc. Please provide a written set of instructions, that you would give to this person so that they can complete this task.

First answer: https://claude.site/artifacts/f8af03ba-3f2c-497d-b564-a19baf...

My follow-up, pressing for actual measurements:

> Can you suggest some standard options for bike geometry, assuming an average sized human male?

Answer including specific dimensions: https://claude.site/artifacts/2f5ea2f3-69d8-4a1b-a563-15d334...


I don't have high expectations of this stuff. I'm just saying what it's doing is not reasoning.


And all I'm saying is that, you probably need a different example of what "reasoning" is, because the one you gave is something that Claude is seemingly able to do.


What? Claude is definitely not able to do what I asked, not even close. Do you think that's an acceptable answer that shows reasoning?


I think you might be confusing two concepts here.

It's definitely reasoning. We can watch that in action, whatever the mechanism behind it is.

But it's not doing long-term learning, it's not updating its model.


It even does long-term learning to some extent. Admittedly I’m not very familiar with what it’s doing, but it does create “memories”, which appear to be personal details that it deems might be relevant in the future. Then I assume it uses some type of RAG to apply previously learned memories to future conversations.

This makes me wonder if there is or could be some type of RAG for chains of thought…


>whatever the mechanism behind it is.

The mechanism is that there is an additional model that basically outputs chain of thought for a particular problem, then runs the chain of thought through the core LLM. This is no different from just a complex forward map lookup.

I mean, it's incredibly useful, but it's still just information search.


You ever see that scene from Westworld? (spoiler) https://www.youtube.com/watch?v=ZnxJRYit44k


I think it's similar, although I think it would be more similar if the LLM did the steps in lower layers (not in English), and instead of the end being fed to the start, there would be a big mess of cycles throughout the neural net.

That could be more efficient since the cycles are much smaller, but harder to train.


It doesn't do the 'thinking' in English (inference is just math), but it does now verbalize intermediate thoughts in English (or whatever the input language is, presumably), just like humans tend to do.


Agreed. It was never "just autocomplete", unless your definition of "autocomplete" includes "look at the whole body of text".


that's my assessment too. there's even a phenomenon I've observed both in others and myself: when thrust into a new field and given a task to complete, we do it to the best of our ability, which is often sod all. so we ape the things we've heard others say, roughly following the right chain of reasoning by luck, and then suddenly say something that, in hindsight, with proper training, we realise was incredibly stupid. we autocomplete and then update with rlhf.

we also have a ton of heuristics that trigger a closer look and the loading of specific formal reasoning, but by and large, most of our thought process is just autocomplete.


Yeah, humans are very similar. We have intuitive immediate-next-step suggestions, and then we apply these intuitive next steps until we find that it led to a dead end, and then we backtrack.

I always say, the way we used LLMs (so far) is basically like having a human write text only on gut reactions, and without backspace key.


An exception I came up with was from a documentary on Einstein that described how he did his thought experiments. He would, of course, imagine novel scenarios in his head, which led him to the insights he could rephrase into language. I worry language models will still lack that capacity for insights driven by imagination.


Seeing the "hmmm", "perfect!" etc. one can easily imagine the kind of training data that humans created for this. Being told to literally speak their mind as they work out complex problems.


looks a bit like 'code', using keywords 'Hmm', 'Alternatively', 'Perfect'


Right, these are not mere "filler words", but initialize specific reasoning paths.


Hmm... you may be onto something here.


Alternatively, these might not be "filler words", but instantiate paths of reasonsing.


What a strange comment chain.


Hmmm.


Interesting.


Interesting.


As a technical engineer, I’ve learned the value of starting sentences with “basically”, even when I’m facing technical uncertainty. Basically, “basically” forces me to be simple.

Being trained to say words like “Alternatively”, “But…”, “Wait!”, “So,” … based on some metric of value in focusing / switching elsewhere / … is basically brilliant.


> Average:18/2=9

> 9 corresponds to 'i'(9='i')

> But 'i' is 9, so that seems off by 1.

Still seems bad at counting, as ever.
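For what it's worth, 9 really does map to 'i' in the A=1 convention the quoted chain of thought is using; the "off by 1" worry looks like the usual 0-based vs 1-based indexing confusion:

```python
import string

# With the A=1 convention in the quoted chain of thought, the 9th letter is 'i';
# Python's 0-based indexing is where the apparent "off by 1" comes from.
print(string.ascii_lowercase[9 - 1])  # -> 'i'
```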


It's interesting that it makes that mistake, but then catches it a few lines later.

A common complaint about LLMs is that once they make a mistake, they will keep making it and write the rest of their completion under the assumption that everything before was correct. Even if they've been RLHF to take human feedback into account and the human points out the mistake, their answer is "Certainly! Here's the corrected version" and then they write something that makes the same mistake.

So it's interesting that this model does something that appears to be self-correction.


The next line is it catching its own mistake, and noting i = 9.


Even though there's of course no guarantee of people getting these chain of thought traces, or whatever one is to call them, I can imagine them being very useful for people learning competitive mathematics: it must in fact give the full reasoning, and transformers in themselves aren't really that smart, usually, so it's probably feasible for a person with very normal intellectual abilities to reproduce these traces with practice.


> THERE ARE THREE R'S IN STRAWBERRY

hilarious


It's interesting how it basically generates a larger sample size to create a regression against. The larger the input, the larger the surface area it can compare against existing training data (implicitly through regression of course).


Yes and apparently we won't have access to that chain of thought in the release version:

"after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users"


which makes it even funnier when the Chain is just... wrong https://x.com/colin_fraser/status/1834336440819614036


This is incredible. In April I used the standard GPT-4 model via ChatGPT to help me reverse engineer the binary bluetooth protocol used by my kitchen fan to integrate it into Home Assistant.

It was helpful in a rubber duck way, but could not determine the pattern used to transmit the remaining runtime of the fan in a certain mode. Initial prompt here [0]

I pasted the same prompt into o1-preview and o1-mini and both correctly understood and decoded the pattern using a slightly different method than I devised in April. Asking the models to determine if my code is equivalent to what they reverse engineered resulted in a nuanced and thorough examination, and eventual conclusion that it is equivalent. [1]

Testing the same prompt with gpt4o leads to the same result as April's GPT-4 (via ChatGPT) model.

Amazing progress.

[0]: https://pastebin.com/XZixQEM6

[1]: https://i.postimg.cc/VN1d2vRb/SCR-20240912-sdko.png (sorry about the screenshot – sharing ChatGPT chats is not easy)


FYI, there's a "Save ChatGPT as PDF" Chrome extension [1].

I wouldn't use on a ChatGPT for Business subscription (it may be against your company's policies to export anything), but very convenient for personal use.

https://chromewebstore.google.com/detail/save-chatgpt-as-pdf...


Wow, that is impressive! How were you able to use o1-preview? I pay for ChatGPT, but on chatgpt.com in the model selector I only see 4o, 4o-mini, and 4. Is o1 in that list for you, or is it somewhere else?


Like others here, it was just available on the website and app when I checked. FWIW I still don’t have advanced voice mode.


I have neither the new model nor advanced voice mode as a paying user.


You do; just use this link: https://chatgpt.com/?model=o1-preview


That worked. Now can you do that for advanced voice mode??? Pretty please!


Haha I wish, although I saw another one (I forgot its name) which makes music for you; now you can ask it for a soundtrack and it gives it back to you in your voice, or something like that. Interesting times are ahead for sure!


Wait what is this? Tell me more please


I heard on X that suno.com has this feature but couldn't find it; maybe it's coming soon? Idk, but there are ways you can do it, and maybe it was a different service. Suno is pretty cool, though.


They are rolling it out slowly; this link doesn't enable access. They only gave me access around 4:30pm PT.


I think they're rolling it out gradually today. I don't see it listed (in the browser, Mac app or Android app).


Likely phased rollout throughout the day today to prevent spikes


“Throughout the day” lol. Advanced voice mode still hasn’t shown up.

They seem to care more about influencers than paying supporters.


Not true; it's already available for me, both O1 and O1-mini. It seems they are indeed rolling out gradually (as any company does).


You got advanced voice mode? I did get o1 preview just a while ago.

You got o1, or o1 preview?


o1-preview and o1-mini. I don't think o1 is publicly available yet.

And I assume voice mode is like Sora; a nice PR play.


It's my understanding paying supporters aren't actually paying enough to cover costs, that $20 isn't nearly enough - in that context, a gradual roll-out seems fair. Though maybe they could introduce a couple more higher-paid tiers to give people the option to pay for early access


> lol.

It's there for a lot of people already. I can see it on 3 different accounts. Including org and just regular paid accounts.


It's available for me. Regular paying customer in the UK.


The linked release mentions trusted users and links to the usage tier limits. Looking at the pricing, o1-preview only appears for tier 5 - requiring 1k+ spend and initial spend 30+ days ago

edit: sorry - this is for API :)


Yes, o1-preview is on the list, as is o1-mini for me (Tier 5, early 2021 API user), under "reasoning".


It appeared for me about thirty minutes after I first checked.


Available with a ChatGPT Plus subscription, or only via the API?


I see it in the mac and iOS app.


It's in my macOS app, but not in the browser for the same account.


Isn't there a big "Share" button at the top right of the chatgpt interface? Or are you using another front end?


In ChatGPT for Business it limits sharing among users in my org, without an option for public sharing.


I often click on those links and get an error that they are unavailable. I’m not sure if it’s openAI trying to prevent people from sharing evidence of the model behaving badly, or an innocuous explanation like the links are temporary.


They were probably generated using a business account, and the business does not allow public links.


In context, a lot of times it’s clear that the link worked at first (other people who could see it responded) but when I click later, it’s broken.


The link also breaks if the original user deletes the chat that was being linked to, whether on purpose or without realizing it would also break the link.


Even for regular users, the Share button is not always available or functional. It works sometimes, and other times it disappears. For example, since today, I have no Share button at all for chats.


My share chat link moved into the sidebar in the … menu to the right of each chat title (MacOS Safari).


Ah, I see it there now. Thanks.


I'm impressed. I had two modified logic puzzles where ChatGPT-4 fails but o1 succeeds. The training data had too many instances of the unmodified puzzle, so 4 wouldn't get it right. o1 manages to not get tripped up by them.

https://chatgpt.com/share/66e35c37-60c4-8009-8cf9-8fe61f57d3...

https://chatgpt.com/share/66e35f0e-6c98-8009-a128-e9ac677480...


Great progress, I asked GPT-4o and o1-preview to create a python script to make $100 quickly, o1 came up with a very interesting result:

https://x.com/soheil/status/1834320893331587353


The screenshot [1] is not readable for me. Chrome, Android. It's so blurry that I cant recognize a single character. How do other people read it? The resolution is 84x800.



thank you


When I click on the image, it expands to full res, 1713x16392.3


> it expands to full res, 1713x16392.3

Three tenths of a pixel is an interesting resolution…

(The actual res is 1045 × 10000 ; you've multiplied by 1.63923 somehow…?)


I agree,

But it’s what I got when I went to Inspect element > hover over the image

Size it expanded to vs real image size I guess


Pixels have been "non-real" for a long time.


In some contexts. In this context (a PNG), they're very real.


This context is more so the browser, complete with its own sub-pixels, aliasing, simulated/real blurring, zooming, etc.

But in file-format context, yes, PNG, BMP, and TIFF are the real lossless image kingpins.


When you open on phone, switch to "desktop site" via browser three dots menu


Yes, that works - the page reloads with postimg.cc UI and clicking the image opens full resolution.


Click on it for full resolution


It didn't work until I switched to "Desktop Site" in the browser menu, as a sibling comment suggested. Then the page reloads with various buttons, etc. Until then, it was just the preview image, not reacting to clicks.


What if you copy the whole reasoning process example provided by OpenAI, use it as a system prompt (to teach how to reason), and use that system prompt in Claude, GPT-4o, etc.?


It might work a little bit. It's like doing few shot prompting instead of training it to reason.
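If anyone wants to try it, it's a five-minute experiment: paste one of the published chain-of-thought examples into the system prompt of another model and see whether it imitates the style. A sketch using Anthropic's Python SDK; the file name, model name, and test question are assumptions, and this is just few-shot prompting, not o1-style trained reasoning:

```python
# Five-minute version of the experiment above: put a published chain-of-
# thought example into the system prompt and see whether another model
# imitates the style. File name, model name, and test question are
# assumptions; this is few-shot prompting, not trained reasoning.
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

# e.g. a chain-of-thought example copied by hand from OpenAI's blog post
example_cot = open("o1_example_chain_of_thought.txt").read()

resp = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=2048,
    system=(
        "When solving problems, reason step by step in the style of this example, "
        "including exploring dead ends and backtracking, before giving a final answer:\n\n"
        + example_cot
    ),
    messages=[{"role": "user", "content": "How many r's are in 'strawberry'?"}],
)
print(resp.content[0].text)
```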


This is a brilliant hypothesis deconstruction. I am sure others will now be able to test as well and this should confirm their engineering.


Did you edit the message? I cannot see anything now in the screenshot, too low resolution


You need to click on the image for the high res version to load. Sorry, it’s awkward.


The website seems to redirect me to a low resolution image, the first time I clicked on the link it worked as you are saying.


Very cool. It gets the conclusion right, but it did confuse itself briefly after interpreting `256 * last_byte + second_to_last_byte` as big-endian. It's neat that it corrected the confusion, but a little unsatisfying that it doesn't explicitly identify the mistake the way a human would.


What is the brand of the fan? Same problem here with proprietary hood fan...


InVENTer Pulsar


is it better than Claude?


Neither Sonnet nor Opus could solve it or get close in a minimal test I did just now, using the same prompt as above.

Sonnet: https://pastebin.com/24QG3JkN

Opus: https://pastebin.com/PJM99pdy


I think this new model is a generational leap above Claude for tasks that require complex reasoning.


Way worse than Claude for solving a cipher. Not even 1/10th as good. Just one data point, ymmv.


Thanks for sharing this, incredible stuff.


second is very blurry


When you click on the image it loads a higher res version.



What's the incredible part here? Being able to write code to turn hex into decimal?


Also, if you actually read the "chain of thought" contains several embarrassing contradictions and incoherent sentences. If a junior developer wrote this analysis, I'd send them back to reread the fundamentals.


What about thoughts themselves? There are plenty of times I start a thought and realize it doesn't make sense. It's part of the thinking process.


Well, it doesn't "correct" itself later. It just says wrong things and gets the right answer anyways, because this encoding is so simple that many college freshmen could figure it out in their heads.

Read the transcript with a critical eye instead of just skimming it, you'll see what I mean.


> Asking the models to determine if my code is equivalent to what they reverse engineered resulted in a nuanced and thorough examination, and eventual conclusion that it is equivalent.

Did you actually implement it to see if it works out of the box?

Also, if you are a free user or accepted that your chats may be used for training, then maybe o1 was just trained on your previous chat and so now knows how to reason about that particular type of problem.


That is an interesting thought. This was all done in an account that is opted out of training though.

I have tested the Python code o1 created to decode the timestamps and it works as expected.


That's not how LLM training works.


so it is impossible to use the free user chats to train models ??????


Just did some preliminary testing on decrypting some ROT cyphertext which would have been viable for a human on paper. The output was pretty disappointing: lots of "workish" steps creating letter counts, identifying common words, etc, but many steps were incorrect or not followed up on. In the end, it claimed to check its work, then delivered an incorrect solution that did not satisfy the previous steps.

I'm not one to judge AI on pratfalls, and cyphers are a somewhat adversarial task. However, there was no aspect of the reasoning that seemed more advanced or consistent than previous chain-of-thought demos I've seen. So the main proof point we have is the paper, and I'm not sure how I'd go from there to being able to trust this on the kind of task it is intended for. Do others have patterns by which they get utility from chain of thought engines?

Separately, chain of thought outputs really make me long for tool use, because the LLM is often forced to simulate algorithmic outputs. It feels like a commercial chain-of-thought solution like this should have a standard library of functions it can use for 100% reliability on things like letter counts.
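A minimal sketch of the kind of deterministic helper this would mean, with an illustrative tool definition in the OpenAI function-calling format (the tool name and wiring here are hypothetical, and this assumes a model that actually supports tool calls):

```python
from collections import Counter

def letter_counts(text: str) -> dict[str, int]:
    # Deterministic letter-frequency helper: the kind of "standard library"
    # function a chain-of-thought engine could call instead of counting in prose.
    return dict(Counter(c for c in text.lower() if c.isalpha()))

# Illustrative tool definition (hypothetical name/description).
letter_counts_tool = {
    "type": "function",
    "function": {
        "name": "letter_counts",
        "description": "Return the frequency of each letter in the given text.",
        "parameters": {
            "type": "object",
            "properties": {"text": {"type": "string"}},
            "required": ["text"],
        },
    },
}
```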


Hmm, are you sure it was using the o1 model and not gpt4o? I've been using the o1 model and it does consistently well at solving rotation ciphers.


Does it do better than Claude, because Claude (3.5 sonnet) handled ROTs perfectly and was able to also respond in ROT.


Just tried, no joy from Claude either:

Can you decrypt the following? I don't know the cypher, but the plaintext is Spanish.

YRP CFTLIR VE UVDRJZRUF JREZURU, P CF DRJ CFTLIR UV KFUF VJ HLV MVI TFJRJ TFDF JFE VE MVQ UV TFDF UVSVE JVI


I just tried it with O1 model and it said it couldn't decipher it. It told me what to try, but said it doesn't have the time to do so. Kind of an unusual response.


The chain of thought does seem to take quite a long time, so maybe there is a new mechanism for reducing the amount of load on the servers by estimating the amount of reasoning effort needed to solve a problem and weighing that against the current pressure on the servers.


I got this response from o1-mini with the exact same prompt:

Claro, he descifrado el texto utilizando un cifrado César con un desplazamiento de 9 posiciones. Aquí está el texto original y su correspondiente traducción:

*Texto Cifrado:* ``` YRP CFTLIR VE UVDRJZRUF JREZURU, P CF DRJ CFTLIR UV KFUF VJ HLV MVI TFJRJ TFDF JFE VE MVQ UV TFDF UVSVE JVI ```

*Texto Descifrado:* ``` HAY LOCURA EN DEMASADO SANIDAD, Y LO MÁS LOCURA DE TODO ES QUE VER COSAS COMO SON EN VEZ DE COMO DEBEN SER ```

*Traducción al Inglés:* ``` THERE IS MADNESS IN OVERLY HEALTH, AND THE MOST MADNESS OF ALL IS TO SEE THINGS AS THEY ARE INSTEAD OF AS THEY SHOULD BE ```

Este descifrado asume que se utilizó un cifrado César con un desplazamiento de +9. Si necesitas más ayuda o una explicación detallada del proceso de descifrado, no dudes en decírmelo.

Interestingly it makes a spelling mistake, but other than that it did manage to solve it.


Interesting, it was able to guess it's Rot 17, but it translated it wrong, although "HAY" and some other words were correct.

I've tried only in English so far though.

It told me it's 17, and "HAY GENTE MU DIFERENTE LECTURA, A LO MUY GENTE DE TODO ES QUE VER COSAS COMO SON EN VEZ DE COMO DEBEN SER"

although it really should be "HAY LOCURA EN DEMASIADO SANIDAD, Y LO MAS LOCURA DE TODO ES QUE VER COSAS COMO SON EN VEZ DE COMO DEBEN SER"


Claude made similar mistakes of generating a decryption that was similar to the plaintext but with stuff mixed in. I suspect my version of the quote (Miguel de Cervantes) is an apocryphal translation, and there's some pull on both models toward the real one. With that said, I did not see o1-preview get as close as you did.


For testing I think it's better to use uncommon sentences and also start with English first, if it can solve that, then try other languages.
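If it helps, generating fresh test cases is only a few lines; a minimal sketch (the sample sentence and shift are arbitrary):

```python
def rot_n(text: str, n: int) -> str:
    # Shift alphabetic characters by n positions, preserving case and leaving
    # spaces/punctuation untouched.
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord('A') if ch.isupper() else ord('a')
            out.append(chr((ord(ch) - base + n) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

plain = "the violet kettle hums a quiet tune at midnight"  # an uncommon sentence
cipher = rot_n(plain, 17)          # ciphertext to paste into the model
assert rot_n(cipher, 9) == plain   # ROT17 then ROT9 round-trips (17 + 9 = 26)
print(cipher)
```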


HAY LOCURA EN DEMASIADO SANIDAD, Y LO MAS LOCURA DE TODO ES QUE VER COSAS COMO SON EN VEZ DE COMO DEBEN SER

Is that correct? I don't know anything but basic Spanish. All I did was:

"The plaintext is in Spanish but I don't know anything else, solve this and explain your reasoning as you go step by step."


That's correct. I got o1-preview myself finally now. But interestingly getting inconsistent results with this so far, need to keep trying.


o1-preview gave me this...

Final Decrypted Message:

"Por ejemplo te agradeceré, y te doy ejemplo de que lo que lees es mi ejemplo"

English Translation:

"For example, I will thank you, and I give you an example of what you read is my example."

... initially it gave up and asked if I knew what type of cypher had been used. I said I thought it was a simple substitution.


Why did it add the accents on to e (é)? Surely that wasn't part of it and it actually "thought a bit too far"?


On my machine it just works with 4o

https://chatgpt.com/share/66e34020-33dc-800d-8ab8-8596895844...

With no drama. I'm not sure the bot's answer is correct, but it looks correct.


o1-preview. Were you using common plaintexts by chance (e.g. proverbs) or ROT13 specifically? Mine use all the right steps but just can't string them together.


It just solved this one for me "Can you solve this cipher? droboscxylecsxoccvsuocryglecsxocc"


Could you share your prompt?


Shared one above.

Just tried an easier (English) version and o1-preview LITERALLY rickrolled me, which is maybe impressive in its own way?

UPDATE: Claude crushed it however.

o1-preview Transcript follows.

**

Can you decrypt the following? I don't know the cypher.

mzilj ivdrzej ufidrek ze kyv sfup reu tre ivrtkzmrkv crkvi ze czwv

Certainly! The encrypted text is:

``` mzilj ivdrzej ufidrek ze kyv sfup reu tre ivrtkzmrkv crkvi ze czwv ```

When we decrypt it, we get:

``` never gonna give you up let you down and desert you ```

It appears to be a line from Rick Astley's song "Never Gonna Give You Up."


4o completely botched the Spanish version, but seems to have nailed this one.

> Without knowing the cipher used to encrypt this message, it’s hard to decrypt it directly. However, the structure and letter patterns suggest it might be a simple substitution cipher, possibly Caesar or some variation of it.

> Let me attempt a few common methods to see if I can identify the cipher.

> The most frequent letters in the encrypted message are 'r', 'z', 'v', and 'e'. In English, the most common letters are typically 'E', 'T', 'A', and 'O'. This suggests the cipher could be a simple substitution or a Caesar cipher.

> I'll now attempt to apply a Caesar cipher decryption by shifting the alphabet through all possible rotations to see if we get readable text.

> One of the decrypted messages, with a shift of 17, reads:

> *"virus remains dormant in the body and can reactivate later in life"*

> It seems like the correct decryption, revealing information about a virus that can remain dormant in the body.


That's weird. Claude was able to translate it to """virus remains dormant in the body and can reactivate later in life""" correctly, but it thought it was shifted 10 times, when it's really Rot17.


If o1 did rickroll you deliberately, then it would indeed be more impressive than solving ciphertexts, and I'd start preparing to bow down to our AGI overlords :)


Definitely. A teammate pointed out Reddit posts used in training as a probable cause :)


It's RL so that means it's going to be great on tasks they created for training but not so much on others.

Impressive but the problem with RL is that it requires knowledge of the future.


Out of curiosity, can you try the same thing with Claude? When I tried Claude with any sort of ROT, it had amazing performance compared to GPT.


This is a pretty big technical achievement, and I am excited to see this type of advancement in the field.

However, I am very worried about the utility of this tool given that it (like all LLMs) is still prone to hallucination. Exactly who is it for?

If you're enough of an expert to critically judge the output, you're probably just as well off doing the reasoning yourself. If you're not capable of evaluating the output, you risk relying on completely wrong answers.

For example, I just asked it to evaluate an algorithm I'm working on to optimize database join ordering. Early in the reasoning process it confidently and incorrectly stated that "join costs are usually symmetrical" and then later steps incorporated that, trying to get me to "simplify" my algorithm by using an undirected graph instead of a directed one as the internal data structure.

If you're familiar with database optimization, you'll know that this is... very wrong. But otherwise, the line of reasoning was cogent and compelling.

I worry it would lead me astray, if it confidently relied on a fact that I wasn't able to immediately recognize was incorrect.


The utility I usually get from these kinds of tools so far is more like an extremely good reference or helper to something I could definitely figure out if given enough time. E.g. figuring out the best way to clean up a specific syntax error, setting up a class and some obvious base functions I'm going to need in it, helping me figure out where I might have went astray in solving a math problem.

The tools have not been at "and now I don't need code tests & review, mathematicians in society, or factbooks all because I have an LLM" level. While that's definitely a goal of AGI it's also definitely not my bar for weighing whether there is utility in a tool.

The alternative way to think about it: the value of a tool is in what you can figure out to do with it, not in whether it's perfect at doing something.

On one extreme that means a dictionary can still be a useful spelling reference even if books have a rare typo. On the other extreme that means a coworker can still offer valuable insight into your code even if they make lots of coding errors and don't have an accurate understanding of everything there is to know about all of C++.

Whether you get something out of either of these cases is a product of how much they can help you reach the accuracy you need and the way you utilize the tool, not their accuracy alone. Usually I can get a lot out of a person who is really bad at one-shot coding a perfect answer but feels like their answer seems right, so I can get quite a bit out of an LLM that has the same problem. That might not be true for all types of questions, but that's fine; not all tools have utility for every problem.


>If you're enough of an expert to critically judge the output, you're probably just as well off doing the reasoning yourself.

Thought requires energy. A lot of it. Humans are far more efficient in this regard than LLMs, but then a bicycle is also much more efficient than a race car. I've found that even when they are hilariously wrong about something, simply the directionality of the line of reasoning can be enough to usefully accelerate my own thought.


Look, I've been experimenting with this for the past year, and this is definitely the happy path.

The unhappy path, which I've also experienced, is that the model outputs something plausible but false but that aligns with an area where my thinking was already confused and sends me down the wrong path.

I've had to calibrate my level of suspicion, and so far using these things more effectively has always been in the direction that more suspicion is better.

There's been a couple times in the last week where I'm working on something complex and I deliberately don't use an LLM since I'm now actively afraid they'll increase my level of confusion.


There are phases in every developer’s growth, where you transition from asking coworkers or classmates, to asking on stack overflow, to reading stack overflow, to reading docs and man pages and mailing lists and source code.

I think like you, I worry that LLMs will handicap this trajectory for people newer in the field, because GPT-4/Sonnet/Whatever are an exceptionally good classmate/coworker. So good that you might try to delay progressing along that trajectory.

But LLMs have all the flaws of a classmate: they aren’t authoritative, their opinions are strongly stated but often based on flimsy assumptions that you aren’t qualified to refute or verify, and so on.

I know intellectually that the kids will be alright, but it’ll be interesting to see how we get there. I suspect that as time goes on people will simply increase their discount rate on LLM responses, like you have, until they get dissatisfied with that value and just decide to get good at reading docs.


Just added o1 to https://double.bot if anyone would like to try it for coding.

---

Some thoughts:

* The performance is really good. I have a private set of questions I note down whenever gpt-4o/sonnet fails. o1 solved everything so far.

* It really is quite slow

* It's interesting that the chain of thought is hidden. This is I think the first time where OpenAI can improve their models without it being immediately distilled by open models. It'll be interesting to see how quickly the oss field can catch up technique-wise as there's already been a lot of inference time compute papers recently [1,2]

* Notably, it's not clear whether o1-preview as it's available now is doing tree search or just single-shotting a CoT that is distilled from better/more detailed trajectories in the training distribution.

[1](https://arxiv.org/abs/2407.21787)

[2](https://arxiv.org/abs/2408.03314)


Trying out Double now.

o1 did a significantly better job converting a JavaScript file to TypeScript than Llama 3.1 405B, GitHub Copilot, and Claude 3.5. It even simplified my code a bit while retaining the same functionality. Very impressive.

It was able to refactor a ~160 line file but I'm getting an infinite "thinking bubble" on a ~420 line file. Maybe something's timing out with the longer o1 response times?


> Maybe something's timing out with the longer o1 response times?

Let me look into this – one issue is that OpenAI doesn't expose a streaming endpoint via the API for o1 models. It's possible there's an HTTP timeout occurring in the stack. Thanks for the report


I've gotten this as well, on very short code snippets. I type in a prompt and then sometimes it doesn't respond with anything, it gets stuck on the thinking, and other times it gets halfway through the response generation and then it gets stuck as well.

https://chatgpt.com/c/66e3a628-2814-8012-a6c5-33721b78cb99


First shot, I gave it a medium-difficulty math problem, something I actually wanted the answer to (derive the KL divergence between two Laplace distributions). It thought for a long time, and still got it wrong, producing a plausible but wrong answer. After some prodding, it revised itself and then got it wrong again. I still feel that I can't rely on these systems.
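For reference, a standard closed form to check against (writing p = Laplace(μ1, b1), q = Laplace(μ2, b2), and using the fact that the expected absolute deviation of a Laplace variate satisfies E_p|X − μ2| = |μ1 − μ2| + b1·e^{−|μ1 − μ2|/b1}) is:

$$
D_{\mathrm{KL}}(p \,\|\, q) = \ln\frac{b_2}{b_1} + \frac{|\mu_1 - \mu_2| + b_1\, e^{-|\mu_1 - \mu_2|/b_1}}{b_2} - 1
$$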


I was trying to get 4o today to do some medium-difficulty Typescript to map between two different ORM schemas. It's a great problem because it's really not too hard, pretty well constrained, and I can provide a ton of context and tests, declaration files, before/after expectations.

4o was struggling so I gave up. I tried o1 on it, and after nearly 15 prompts back and forth, helping it along the way, we're still far from correct. It's hard to tell if it's much better, but my intuition from this is that the improvement is pretty incremental.


Look where you were 3 years ago, and where you are now.

And then imagine where you will be in 5 more years.

If it can almost get a complex problem right now, I'm dead sure it will get it correct within 5 years


> I'm dead sure it will get it correct within 5 years

You might be right.

But plenty of people said we'd all be getting around in self-driving cars for sure 10 years ago.


We do have self-driving cars, but since they directly affect people's lives they need to be close to 100% accurate with no margin for error. That's not necessarily the case for LLMs.


No, we have cars that can drive themselves quite well in good weather, but fail completely in heavy snow/poor visibility. Which is actually a great analogy to LLMs - they work great in the simple cases (80% of the time), it’s that last 20% that’s substantially harder.


I'm not? The history of AI development is littered with examples of false starts, hidden traps, and promising breakthroughs that eventually expose deeper and more difficult problems [1].

I wouldn't be shocked if it could eventually get it right, but dead sure?

1. https://en.wikipedia.org/wiki/AI_winter


It is not at all clear that "produce correct answer" is the natural endpoint of "produce plausible on-topic utterances that look like they could be answers." To do the former you need to know something about the underlying structure of reality (or have seen the answer before), to do the latter you only need to be good at pattern-matching and language.


You're dead sure? I wouldn't say anything definite about technology advancements. People seem to underestimate the last 20% of the problem and only focus on the massive 80% improvements up to this point.


The progress since GPT-3 hasn't been spectacularly fast.


Going back 3 years, it feels like incredible progress. Going back 1 year, it feels like pretty much the same limitations.


Getting a complex problem right = having the solution in some form in the training dataset.

All we are gonna get is better and better googles.


Why?


Let's say that you want to make a flying car that can also double as a submarine.

Nobody has done this yet. So information doesn't exist on how to do it. An LLM may give you some generic answers from training sets on what engineering/analysis tasks to do, but it won't be able to give you a complex and complete design for one.

A model that can actually solve problems would be able to design you one.


They can solve problems that are not in their training set. There are many examples in the article...


I literally just gave you an example of one it can't solve, despite having a vast knowledge of mechanical and aeronautical subjects. All the examples are obviously in its training set.

Here is another better example - none of these models can create a better ML accelerator despite having a wide array of electrical and computer engineering knowledge. If they did, OpenAI would pretty much be printing their own chips like Google does.


In your previous comment you stated that LLMs can only solve problems that are in their training set (e.g. "all we are gonna get is better and better googles"). But that's not true as I pointed out.

Now your argument seems to be that they can't solve all problems or, more charitably, can't solve highly complex problems. This is true but by that standard, the vast majority of humans can't reason either.

Yes, the reasoning capacities of current LLMs are limited but it's incorrect to pretend they can't reason at all.


Think of it as a knowledge graph.

If an LLM is trained on Python coding, and it's trained separately on plain-English text about how to decode ciphers, it can statistically interpolate between the two. That is a form of problem solving, but it's not reasoning.

This is why when you ask it a fairly complex problem, like how to make a bicycle using a CNC with limited work space, it will give you generic answers, because it's just statistically looking at a knowledge graph.

A human can reason, because when there is a gray area in a knowledge graph, they can effectively expand it. If I was given the same task, I would know that I have to learn things like CAD design, CNC code generation, parametric modeling, structural analysis, and so on, and I could do that all without being prompted to do so.

You will know when AI models will start to reason when they start asking questions without ever being told explicitly to ask questions through prompt or training.


But can it now say "I don't know"? Or can it evaluate its own results and come to the conclusion that it's just a wild guess?

I am still impressed by the progress though.


I still don't have a Mr. Fusion in my house, FYI.

We always overestimate the future.


what makes you so "dead sure"? it's just hallucinating as always


Have you never heard of "local maxima"? Why are you so certain another 5 years will provide any qualitative advancement at all?


Maybe you are wrong if you don’t know the answer?


Sounds great, but so does their "new flagship model that can reason across audio, vision, and text in real time" announced in May. [0]

[0] https://openai.com/index/hello-gpt-4o/


Agreed. Release announcements and benchmarks always sound world-changing, but the reality is that every new model is bringing smaller practical improvements to the end user over its predecessor.


The point above is that the aforementioned amazing multimodal version of ChatGPT was announced in May and is still not the actual offered way to interact with the service in September (despite the model choice being called 4 omni, it's still not actually using multimodal IO). It could be a giant leap in practical improvements, but that doesn't matter if you can't actually use what was announced.

This one, oddly, seems to actually be launching before that one despite just being announced, though.


Sonnet 3.5 brought the largest practical improvements to this end user over all predecessors (so far).


This one [o1/Strawberry] is available. I have it, though it's limited to 30 messages/week in ChatGPT Plus.


30 messages per week? Wow. You better not miss!


In the world of hype driven vaporware AI products[1], giving people limited access is at least proof they're not lying about it actually existing or it being able to do what they claim.

[1] https://www.reddit.com/r/LocalLLaMA/comments/1fd75nm/out_of_...


Ok, but the point is that they told me I would have flirty ScarJo ASMR whispering to me at bed time that I am a good boy, but that's not what we got is it?


At 30 messages per week they could secretly hire a human to give the responses


How do you get access? I don’t have it and am a ChatGPT plus subscriber.


it will roll out to everyone over the next few hours


I'm using the Android ChatGPT app (and am in the Android Beta program, though not sure if that matters)


I'm a Plus subscriber and I have o1-preview and o1-mini available


Dang - I don't see the model listed for me in the iOS app nor the web interface.

I'm a ChatGPT subscriber.


Same! And have been a subscriber for 18 months.


I've been a subscriber since close to the beginning, cancelled 2 weeks ago. I got an email telling me that this is available, but only for Plus.

But for 30 posts per week I see no reason to subscribe again.

I prefer being frustrated by unreliable quality because I'm not paying, instead of having an equally unreliable experience as a paying customer.

Not paying feels the same. It made me wonder if they sometimes just hand over the chat to a lower quality model without telling the Plus subscriber.

The only thing I miss is not being able to tell it to run code for me, but it's not worth the frustration.


Recently I was starting to think I imagined that. Back then they gave me the impression it would be released within week or so of the announcement. Have they explained the delay?


When you go into the regular, slow, audio mode there's a little info circle in the top right corner. Over time that circle has been giving periodic updates. At one point the message was that it would be delayed, and now it's saying it's "on its way" by the end of fall.

Not perfect but they've been putting their communications in there.


It is definitely available today and I believe it was available shortly after the announcement.


The text-to-text model is available. And you can use it with the old voice interface that does Whisper+GPT+TTS. But what was advertised is a model capable of direct audio-to-audio. That's not available.


Interestingly, the New York Times mistakenly reported on and reviewed the old features as if they were the new ones. So lots of confusion to go around.


My guess is they're going to incorporate all of these advances into gpt-5 so it looks like a "best of all worlds" model.


That is in chatgpt now and it greatly improves chatgpt. What are you on to now?


Audio has only rolled out to a small subset of paying customers. There's still no word about the direct-from-4o image generation they demo'd. Let alone the video capabilities.

So no, it's not in chatgpt.


ah okay you got a point


Yep, all these AI announcements from big companies feel like promises for the future rather than immediate solutions. I miss the days when you could actually use a product right after it was announced, instead of waiting for some indefinite "coming soon."


As an entrepreneur, I do this often. In order to sleep better at night, I explain to myself that it’s somewhat harmless to give teasers about future content releases. If someone buys my product based on future promises or speculation, they’re investing into the development and my company’s future.


Generating more "think out loud" tokens and hiding them from the user...

Idk if I'm "feeling the AGI" if I'm being honest.

Also... telling that they choose to benchmark against CodeForces rather than SWE-bench.


> Also... telling that they choose to benchmark against CodeForces rather than SWE-bench.

They also worked with Devin to benchmark it on Devin's internal benchmarks, where it's twice as good as GPT-4o: https://x.com/cognition_labs/status/1834292718174077014 https://www.cognition.ai/blog/evaluating-coding-agents


They’re running a business. They don’t owe you their trade secrets.


Why not? Isn't that basically what humans do? Sit there and think for a while before answering, going down different branches/chains of thought?


This new approach suggests one of two things:

1) The "bitter lesson" may not be true, and there is a fundamental limit to transformer intelligence.

2) The "bitter lesson" is true, and there just isn't enough data/compute/energy to train AGI.

All the cognition should be happening inside the transformer. Attention is all you need. The possible cognition and reasoning occurring "inside" in high dimensions is much more advanced than any possible cognition that you output into text tokens.

This feels like a sidequest/hack on what was otherwise a promising path to AGI.


On the contrary, this suggests that the bitter lesson is alive and kicking. The bitter lesson doesn't say "compute is all you need", it says "only those methods which allow you to make better use of hardware as hardware itself scales are relevant".

This chain of thought / reflection method allows you to make better use of the hardware as the hardware itself scales. If a given transformer is N billion parameters, and to solve a harder problem we estimate we need 10N billion parameters, one way to do it is to build a GPU cluster 10x larger.

This method shows that there might be another way: instead train the N billion model differently so that we can use 10x of it at inference time. Say hardware gets 2x better in 2 years -- then this method will be 20x better than now!


I'd be shocked if we don't see diminishing returns in the inference compute scaling laws. We already didn't deserve how clean and predictive the pre-training scaling laws were, no way the universe grants us another boon of that magnitude


Does that mean human intelligence is cheapened when you talk out a problem to yourself? Or when you write down steps solving a problem?

It's the exact same thing here.


The similarity is cosmetic only. The reason it is used is because it's easy to leverage existing work in LLMs, and scaling (although not cheap) is an obvious approach.


> Does that mean human intelligence is cheapened when you talk out a problem to yourself?

In a sense, maybe yeah. Of course if one were to really be absolute about that statement it would be absurd, it would greatly overfit the reality.

But it is interesting to assume this statement as true. Oftentimes when we think of ideas "off the top of our heads" they are not as profound as ideas that "come to us" in the shower. The subconscious may be doing 'more' 'computation' in a sense. Lakoff said the subconscious was 98% of the brain, and that the conscious mind is the tip of the iceberg of thought.


lol come on, it's not the exact same thing. At best this is like gagging yourself while you talk through the problem, then un-gagging yourself when you say the answer. And that's presupposing LLMs are thinking in, your words, exactly the same way as humans.

At best it maybe vaguely resembles thinking


> "lol come on"

I've never found this sort of argument convincing. it's very Chalmers.


Admittedly not my most articulate, my exasperation showed through. To some extent it seems warranted as it tends to be the most effective tactic against hyperbole. Still trying to find a better solution.


Karpathy himself believes that neural networks are perfectly plausible as a key component to AGI. He has said that it doesn't need to be superseded by something better, it's just that everything else around it (especially infrastructure) needs to improve. As one of the most valuable opinions in the entire world on the subject, I tend to trust what he said.

source: https://youtu.be/hM_h0UA7upI?t=973


I think it's too soon to tell. Training the next generation of models means building out entire datacenters. So while they wait they have engineers build these sidequests/hacks.


Attention is about similarity/statistical correlation, which is fundamentally stochastic, while reasoning needs to be truthful and exact to be successful.


Imagine instead that the bitter lesson says: we can expand outwards, in many dimensions, the circle of ways to continuously mathematically manipulate data to adjust outputs.

Even the attention-token approach is, on the grand scale of things, a simple line outwards from the centre; we have not even explored around the centre (with the same compute spend) with things like non-token generation, different layers/different activation functions, different norming and query/key/value setups (why only use the 3 matrices inherent to contextualising tokens, why not add a 4th matrix for something else?), character, sentence, whole-thought, or paragraph one-shot generation, or positional embeddings which could work differently.

The bitter lesson says there is almost a whole world untouched by our findings for us to explore. The temporary work of non-data approaches can piggyback off a point on the line; it cannot expand it like we can as we push outwards from the circle.


Sure, but if I want a human, I can hire a human. Humans also do many other things I don't want my LLM to do.


well it could be a lot cheaper to hire the AI model instead of a human?


This kind of short-sighted, simplistic reasoning/behaviour is what I worry about the most in terms of where our society is going. I always wonder - who will be the people buying or using your software (built very cheaply and efficiently with AI) once they can do the same, or get replaced by AI, or bankrupt themselves?

Everybody seems to be so focused on how to get ahead in race to profitability, that they don't consider the shortcut they are taking might be leading to a cliff.


Except that these aren't thoughts. These techniques are improvements to how the model breaks down input data, and how it evaluates its responses to arrive at a result that most closely approximates patterns it was previously rewarded for. Calling this "thinking" is anthropomorphizing what's really happening. "AI" companies love to throw these phrases around, since it obviously creates hype and pumps up their valuation.

Human thinking is much more nuanced than this mechanical process. We rely on actually understanding the meaning of what the text represents. We use deduction, intuition and reasoning that involves semantic relationships between ideas. Our understanding of the world doesn't require "reinforcement learning" and being trained on all the text that's ever been written.

Of course, this isn't to say that machine learning methods can't be useful, or that we can't keep improving them to yield better results. But these are still methods that mimic human intelligence, and I think it's disingenuous to label them as such.


It becomes thinking when you reinforcement learn on those Chain-of-Thought generations. The LLM is just a very good initialization.


Yes but with concepts instead of tokens spelling out the written representation of those concepts.


Without a world model, not really.


The whole thing is a world model- accurately predicting text that describes things happening in a world, can only be done by modeling the world.


Is it?


Exploring different approaches and stumbling on AGI eventually through a combination of random discoveries will be the way to go.

Same as Bitcoin being the right combination of things that already existed.


Crypto being used as an example of how we have moved forward successfully as a species is backward toilet sitting behaviour.


I gave the Crossword puzzle to Claude and got a correct response[1]. The fact that they are comparing this to gpt4o and not to gpt4 suggests that it is less impressive than they are trying to pretend.

[1]:

Based on the given clues, here's the solved crossword puzzle:

  +---+---+---+---+---+---+
  | E | S | C | A | P | E |
  +---+---+---+---+---+---+
  | S | E | A | L | E | R |
  +---+---+---+---+---+---+
  | T | E | R | E | S | A |
  +---+---+---+---+---+---+
  | A | D | E | P | T | S |
  +---+---+---+---+---+---+
  | T | E | P | E | E | E |
  +---+---+---+---+---+---+
  | E | R | R | O | R | S |
  +---+---+---+---+---+---+

Across:

ESCAPE (Evade), SEALER (One to close envelopes), TERESA (Mother Teresa), ADEPTS (Initiated people), TEPEE (Native American tent), ERRORS (Mistakes)

Down:

ESTATE (Estate car - Station wagon), SEEDER (Automatic planting machine), CAREER (Profession), ALEPPO (Syrian and Turkish pepper variety), PESTER (Annoy), ERASES (Deletes)


As good as Claude has gotten recently in reasoning, they are likely using RL behind the scenes too. Supposedly, o1/strawberry was initially created as an engine for high-quality synthetic reasoning data for the new model generation. I wonder if Anthropic could release their generator as a usable model too.


While I was initially excited, I'm now having second thoughts after seeing the experiments run by people in the comments here.

On X I see a totally different energy, more about hyping it.

On HN I see reserved and collected takes, which I trust more.

I do wonder why they chose gpt4o, which I never bother to use for coding.

Claude is still king, and it looks like I won't have to subscribe to ChatGPT Plus, seeing it fail on some of the important experiments run by folks on HN.

If anything, these types of releases err more on the side of hype, given OpenAI's track record.


I think people are wrong just about as often here as anywhere else on the internet, but with more confidence. Averaging HN comments would just produce outputs similar to rudimentary LLMs with a bit snobbier of a tone, I imagine.


I just tried o1, and it did pretty well with understanding this minor issue with subtitles on a Dutch TV show we were watching.

I asked it "I was watching a show and in the subtitles an umlaut u was rendered as 1/4, i.e. a single character that said 1/4. Why would this happen?"

and it gave a pretty thorough explanation of exactly which encoding issue was to blame.

https://chatgpt.com/share/66e37145-72bc-800a-be7b-f7c76471a1...


4o’s answer seems sufficient, though it provides less detail than o1.

https://chatgpt.com/share/66e373d7-7814-8009-86c3-1ce549ca2e...


A common problem, no doubt, with a lot of training context. But man. What a time to be alive.


Damn, the model really goes to great lengths on these trivial-seeming but hard problems. Impressive.


I've given this a test run on some email threads, asking the model to extract the positions and requirements of each person in a lengthy and convoluted discussion. It absolutely nailed the result, far exceeding what Claude 3.5 Sonnet was capable of -- my previous go-to model for such analysis work. I also used it to apply APA style guidelines to various parts of a document and it executed the job flawlessly and with a tighter finesse than Claude. Claude's response was lengthier - correct, but unnecessarily long. gpt-o1-preview combined several logically-related bullets into a single bullet, showing how chain of thought reasoning gives the model more time to comprehend things and produce a result that is not just correct, but "really correct".


My point of view: this is a real advancement. I’ve always believed that with the right data allowing the LLM to be trained to imitate reasoning, it’s possible to improve its performance. However, this is still pattern matching, and I suspect that this approach may not be very effective for creating true generalization. As a result, once o1 becomes generally available, we will likely notice the persistent hallucinations and faulty reasoning, especially when the problem is sufficiently new or complex, beyond the “reasoning programs” or “reasoning patterns” the model learned during the reinforcement learning phase. https://www.lycee.ai/blog/openai-o1-release-agi-reasoning


> As a result, once o1 becomes generally available, we will likely notice the persistent hallucinations and faulty reasoning, especially when the problem is sufficiently new or complex, beyond the “reasoning programs” or “reasoning patterns” the model learned during the reinforcement learning phase.

I had been using 4o as a rubber ducky for some projects recently. Since I appeared to have access to o1-preview, I decided to go back and redo some of those conversations with o1-preview.

I think your comment is spot on. It's definitely an advancement, but still makes some pretty clear mistakes and does some fairly faulty reasoning. It especially seems to have a hard time with causal ordering, and reasoning about dependencies in a distributed system. Frequently it gets the relationships backwards, leading to hilarious code examples.


True. I just extensively tested o1 and came to the same conclusion.


This is something that people have toyed with to improve the quality of LLM responses. Often instructing the LLM to "think about" a problem before giving the answer will greatly improve the quality of response. For example, if you ask it how many letters are in the correctly spelled version of a misspelled word, it will first give the correct spelling, and then the number (which is often correct). But if you instruct it to only give the number the accuracy is greatly reduced.

I like the idea too that they turbocharged it by taking the limits off during the "thinking" state -- so if an LLM wants to think about horrible racist things or how to build bombs or other things that RLHF filters out, that's fine so long as it isn't reflected in the final answer.
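A minimal sketch of the "think about it first" prompting pattern described above, using the OpenAI Python client (the model name, example word, and prompt wording are just illustrative):

```python
from openai import OpenAI

client = OpenAI()

# Ask the model to spell the corrected word out loud before committing to a count,
# which tends to improve accuracy over demanding "only the number".
resp = client.chat.completions.create(
    model="gpt-4o",  # illustrative; o1 does this kind of deliberation internally
    messages=[
        {"role": "system", "content": "Think step by step. Spell the corrected word "
                                      "letter by letter before giving the final count."},
        {"role": "user", "content": "How many letters are in the correctly spelled "
                                    "version of 'definately'?"},
    ],
)
print(resp.choices[0].message.content)
```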


> I like the idea too that they turbocharged it by taking the limits off during the "thinking" state

They also specifically trained the model to do that thinking out loud.


My first interpretation of this is that it's jazzed-up Chain-Of-Thought. The results look pretty promising, but i'm most interested in this:

> Therefore, after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users.

Mentioning competitive advantage here signals to me that OpenAI believes their moat is evaporating. Past the business context, my gut reaction is that this negatively impacts model usability, but I'm having a hard time putting my finger on why.


>my gut reaction is this negatively impacts model usability, but i'm having a hard time putting my finger on why.

If the model outputs an incorrect answer due to a single mistake/incorrect assumption in reasoning, the user has no way to correct it: they can't see the reasoning, so they can't see where the mistake was.


Maybe CriticGPT could be used here [0]. Have the CoT model produce a result, and either automatically or upon user request, ask CriticGPT to review the hidden CoT and feed the critique into the next response. This way the error can (hopefully) be spotted and corrected without revealing the whole process to the user.

[0] https://openai.com/index/finding-gpt4s-mistakes-with-gpt-4/

Day dreaming: imagine if this architecture takes off and the AI "thought process" becomes hidden and private much like human thoughts. I wonder then if a future robot's inner dialog could be subpoenaed in court, connected to some special debugger, and have their "thoughts" read out loud in court to determine why it acted in some way.


> my gut reaction is this negatively impacts model usability, but i'm having a hard time putting my finger on why.

This will make it harder for things like DSPy to work, which rely on using "good" CoT examples as few-shot examples.


Yeah, I guess base models without built-in CoT are not going away, exactly because you might want to tune it yourself. If DSPy (or similar) evolves to allow the same or similar to what OpenAI did with o1, that will be quite powerful, but we still need the big foundational models powering it all.

On the other hand, if cementing techniques into the models becomes a trend, we might see various models around, each with a technique beyond CoT for us to pick and choose from, with no need for us to guide the model ourselves. Then what's left for us to optimize is the prompts for what we want, and the routing of the combination of those into a nice pipeline.

Still, the principle of DSPy stays the same: have a dataset to evaluate, let the machine trial-and-error prompts, hyperparameters and so on, just switching around different techniques (possibly automating that too), and get measurable, optimizable results.


The moat is expanding from usage count. The moat is also about leading and advancing faster than anyone can catch up: you will always have the best model with the best infrastructure and low limits.


> Unless otherwise specified, we evaluated o1 on the maximal test-time compute setting.

Maximal test time is the maximum amount of time spent doing the “Chain of Thought” “reasoning”. So that’s what these results are based on.

The caveat is that in the graphs they show that for each increase in test-time performance, the (wall) time / compute goes up exponentially.

So there is a potentially interesting play here. They can honestly boast these amazing results (it’s the same model after all) yet the actual product may have a lower order of magnitude of “test-time” and not be as good.


I interpreted it to suggest that the product might include a user-facing “maximum test time” knob.

Generating problem sets for kids? You might only need or want a basic level of introspection, even though you like the flavor of this model’s personality over that of its predecessors.

Problem worth thinking long, hard, and expensively about? Turn that knob up to 11, and you’ll get a better-quality answer with no human-in-the-loop coaching or trial-and-error involved. You’ll just get your answer in timeframes closer to human ones, consuming more (metered) tokens along the way.


Yeah, I think this is the goal - remember; there are some problems that only need to be solved correctly once! Imagine something like a millennium problem - you'd be willing to wait a pretty long time for a proof of the RH!


This power-law behavior of test-time improvement seems to be pretty ubiquitous now. In "More Agents Is All You Need" [1], they start to see this as a function of ensemble size. It also shows up in "Large Language Monkeys: Scaling Inference Compute with Repeated Sampling" [2].

I sorta wish everyone would plot their y-axis on a logit scale, rather than 0->100 accuracy (including the OpenAI post), to help show the power-law behavior. This is especially important when talking about incremental gains in the ~90->95% and 95->99% ranges. When the values are between 20 and 80 (like in the OpenAI post), logit and linear look pretty similar, so you can "see" the inference power law.

[1] https://arxiv.org/abs/2402.05120 [2] https://arxiv.org/abs/2407.21787
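For anyone who wants to try it, matplotlib supports a logit y-scale directly; a minimal sketch with made-up accuracy numbers (purely illustrative, not taken from either paper):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical accuracy-vs-compute points, only to illustrate the axis choice.
compute = np.array([1, 2, 4, 8, 16, 32, 64])
accuracy = np.array([0.20, 0.35, 0.50, 0.65, 0.80, 0.90, 0.95])

fig, ax = plt.subplots()
ax.plot(compute, accuracy, marker="o")
ax.set_xscale("log")    # test-time compute spans orders of magnitude
ax.set_yscale("logit")  # makes 90% -> 95% -> 99% gains visually comparable
ax.set_xlabel("test-time compute (relative)")
ax.set_ylabel("accuracy")
plt.show()
```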


Surprising that at run time it needs an exponential increase in thinking to achieve a linear increase in output quality. I suppose it's due to diminishing returns from adding more and more thought.


The exponential increase is presumably because of the branching factor of the tree of thoughts. Think of a binary tree whose number of leaf nodes doubles (= exponential growth) at each level.

It's not too surprising that the corresponding increase in quality is only linear - how much difference in quality would you expect between the best, say, 10-word answer to a question, and the best 11-word answer?

It'll be interesting to see what they charge for this. An exponential increase in thinking time means an exponential increase in FLOPs/dollars.


Some commenters seem a bit confused as to how this works. Here is my understanding, hoping it helps clarify things.

Ask something to a model and it will reply in one go, likely imperfectly, as if you had one second to think before answering a question. You can use CoT prompting to force it to reason out loud, which improves quality, but the process is still linear. It's as if you still had one second to start answering but you could be a lot slower in your response, which removes some mistakes.

Now if instead of doing that you query the model once with CoT, then ask it or another model to critically assess the reply, then ask the model to improve on its first reply using that feedback, then keep doing that until the critic is satisfied, the output will be better still. Note that this is a feedback loop with multiple requests, which is of a different nature than CoT and much more akin to how a human would approach a complex problem. You can get MUCH better results that way, a good example being Code Interpreter. If classic LLM usage is system 1 thinking, this is system 2.

That's how o1 works at test time, probably.

For training, my guess is that they started from a model not that far from GPT-4o and fine-tuned it with RL by using the above feedback loop but this time converting the critic to a reward signal for a RL algorithm. That way, the model gets better at first guessing and needs less back and forth for the same output quality.

As for the training data, I'm wondering if you can't somehow get infinite training data by just throwing random challenges at it, or very hard ones, and let the model think about/train on them for a very long time (as long as the critic is unforgiving enough).
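A rough sketch of the generate-critique-revise loop described above, using the OpenAI Python client (this is only a guess at the flavour of what o1 does; the real training and inference setup is not public, and the prompts and model name are illustrative):

```python
from openai import OpenAI

client = OpenAI()

def ask(prompt: str, model: str = "gpt-4o") -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def solve_with_critique(problem: str, max_rounds: int = 3) -> str:
    # Generate an initial CoT answer, then loop: critique it, revise it,
    # and stop when the critic is satisfied or the round budget runs out.
    answer = ask(f"Think step by step, then answer:\n{problem}")
    for _ in range(max_rounds):
        critique = ask(
            f"Problem:\n{problem}\n\nProposed answer:\n{answer}\n\n"
            "Point out any mistakes. Reply with exactly 'OK' if it is correct."
        )
        if critique.strip().upper().startswith("OK"):
            break
        answer = ask(
            f"Problem:\n{problem}\n\nPrevious answer:\n{answer}\n\n"
            f"Critique:\n{critique}\n\nWrite an improved answer."
        )
    return answer
```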


Maybe this is an improvement in some areas, but I still got spurious reasoning and an inability to connect three simple facts:

Yes, "el presente acta de nacimiento" is correct in Spanish.

Explanation:

    "Acta" is a feminine noun that begins with a stressed "a" sound. In Spanish, when a feminine singular noun starts with a stressed "a" or "ha", the definite article "la" is replaced with "el" to facilitate pronunciation. However, the noun remains feminine.

    Adjectives and modifiers that accompany the noun "acta" should agree in feminine gender and singular number. In this case, "presente" is an adjective that has the same form for both masculine and feminine singular nouns.
So, combining these rules:

    "El" (definite article used before feminine nouns starting with stressed "a")
    "Presente" (adjective agreeing in feminine singular)
    "Acta de nacimiento" (feminine noun with its complement)
Therefore, "el presente acta de nacimiento" is grammatically correct.


Native Spanish speaker here. No, it isn't. When a word sits between "la" and a noun starting with a stressed "a", there is no longer any cacophony, so you can perfectly well (if not necessarily) use "la presente acta".

Proof: https://www.elcastellano.org/francisco-jos%C3%A9-d%C3%ADaz-%...


yes, exactly - this is my point


Here's an unpopular take on this:

"We had the chance to make AI decision-making auditable but are locking ourselves out of hundreds of critical applications by not exposing the chain of thought."

One of the key blockers in many customer discussions I have is that AI models are not really auditable and that automating complex processes with them (let alone debug things when "reasoning" goes awry) is difficult if not impossible unless you do multi-shot and keep track of all the intermediate outputs.

I really hope they expose the chain of thought as some sort of machine-parsable output, otherwise no real progress will have been made (many benchmarks are not really significant when you try to apply LLMs to real-life applications and use cases...)


I suspect that actually reading the "chain of thought" would reveal obvious "logic" errors embarrassingly often.


It would still be auditable. In a few industries that is the only blocker for adoption--even if the outputs are incorrect.


Oh, perhaps. I mean that OpenAI won't do it because it would be bad for business and pop the AI bubble early.


I'll give an argument against this with the caveat it applies only if these are pure LLMs without heuristics or helper models (I do not believe that to be the case with o1).

The problem with auditing is not only are the outputs incorrect, but the "inputs" of the chained steps have no fundamental logical connection to the outputs. A statistical connection yes, but not a causal one.

For the trail to be auditable, processing would have to be taking place at the symbolic level of what the tokens represent in the steps. But this is not what happens. The transformer(s) (because these are now sampling multiple models) are finding the most likely set of tokens that reinforce a training objective which is a completed set of training chains. It is fundamentally operating below the symbolic or semantic level of the text.

This is why anthropomorphizing these is so dangerous. It isn't actually "explaining" its work. The CoT is essentially one large output, broken into parts. The RL training objective does two useful things: (1) it breaks the output down into much smaller parts, which drops the error significantly since error scales as an exponential of the token length, and (2) it provides better coverage of training data for common subproblems. Both of those are valuable. Obviously, in many cases the reasons actually match the output. But hallucinations can happen anywhere throughout the chain, in ways which are basically non-deterministic.

An intermediate step can provide a bad token and blithely ignore that to provide a correct answer. If you look at intermediate training of addition in pure LLMs, you'll get lots of results that look sort of like:

> "Add 123 + 456 and show your work"

> "First we add 6 + 3 in the single digits which is 9. Moving on we have 5 + 2 which is 8 in the tens place. And in the hundreds place, we have 5. This equals 579."

The above is very hand-wavy. I do not know if the actual prompts look like that. But there's an error in the intermediate step (5 + 2 = 8) that does not actually matter to the output. Lots of "emergent" properties of LLMs—arguably all of them—go away when partial credit is given for some of the tokens. And this scales predictably without a cliff [1]. This is also what you would expect if LLMs were "just" token predictors.

But if LLMs are really just token predictors, then we should not expect intermediate results to matter in a way in which they deterministically change the output. It isn't just that CoT can chaotically change future tokens, previous tokens can "hallucinate" in a valid output statement.

[1] Are Emergent Abilities of Large Language Models a Mirage?: https://arxiv.org/abs/2304.15004


I believe that is the case. Out of curiosity, I had this model try to solve a very simple Sudoku puzzle in ChatGPT, and it failed spectacularly.

It goes on and on making reasoning mistakes, and always ends up claiming that the puzzle is unsolvable and apologizing. I didn’t expect it to solve the puzzle, but the whole reasoning process seems fraught with errors.


Here's the o1-preview answer to the strawberry question:

--

There are *three* letter "R"s in the word "strawberry."

Let's break down the word to count the occurrences:

- *S* - *T* - *R* - *A* - *W* - *B* - *E* - *R* - *R* - *Y*

The letter "R" appears in positions 3, 8, and 9.


Given that this is a well known example, presumably OpenAI included a training set using letters.


Not everyone got the same results.

https://news.ycombinator.com/item?id=41530647

Which is to be expected, the answer isn’t deterministic.


can you ask it:

9.11 and 9.9, which number is larger


> Evaluating Number formats

> I'm considering that the question might be about comparing two numbers, 9.11 and 9.9, in different contexts such as different locale formats or representing dates.

> 9.11 is larger than 9.9

Well, it's kind of right, because we left it ambiguous what kinds of numbers these are. September 11 is a later date than September 9. However, I would want the model to have the street smarts to assume that, without further context, I must just be talking about plain numbers.
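For what it's worth, treated as plain decimals the comparison is unambiguous; a trivial check:

```python
# Plain numeric comparison, for reference.
print(9.9 > 9.11)      # True
print(max(9.11, 9.9))  # 9.9
```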


Was it trained on Excel code?


it claims 9.11 is larger, hah.

https://postimg.cc/KKKLZNjW


That's odd. Tried several times with the same prompt and always got the correct answer with step by step explanation.


Those models are just not deterministic.


4.9 is larger than 4.11.

Explanation:

• 4.9 is equivalent to 4.90.
• 4.11 remains 4.11.

When comparing the numbers:

• 4.90 (which is 4.9)
• 4.11

Since 4.90 > 4.11, 4.9 is the larger number.


Try it with 9.11 and 9.9. Some have proposed that it thinks "9.11" is larger because in a social context 9/11 has much larger significance.


From o1-mini:

To determine which number is larger between 9.11 and 9.9, let's compare them step by step:

Understanding the Numbers:
- 9.11 can be read as "nine and eleven hundredths."
- 9.9 can be read as "nine and nine tenths." To make the comparison easier, you can also think of it as 9.90 (adding a zero doesn't change the value).

Comparing the Decimal Places:
- Ones Place: Both numbers have 9 in the ones place.
- Tenths Place: 9.11 has a 1 in the tenths place. 9.90 has a 9 in the tenths place.

Conclusion: Since 9.90 has a larger value in the tenths place compared to 9.11, 9.90 (or 9.9) is larger than 9.11. Therefore, 9.9 is larger than 9.11.


Interesting to note, as an outside observer only keeping track of this stuff as a hobby, that it seems like most of OpenAI’s efforts to drive down compute costs per token and scale up context windows is likely being done in service of enabling larger and larger chains of thought and reasoning before the model predicts its final output tokens. The benefits of lower costs and larger contexts to API consumers and applications - which I had assumed to be the primary goal - seem likely to mostly be happy side effects.

This makes obvious sense in retrospect, since my own personal experiments with spinning up a recursive agent a few years ago using GPT-3 ran into issues with insufficient context length and loss of context as tokens needed to be discarded, which made the agent very unreliable. But I had not realized this until just now. I wonder what else is hiding in plain sight?


I think you can slice it whichever direction you prefer e.g. OpenAI needs more than "we ran it on 10x as much hardware" to end up with a really useful AI model, it needs to get efficient and smarter just as proportionally as it gets larger. As a side effect hardware sizes (and prices) needed for a certain size and intelligence of model go down too.

In the end, however you slice it, the goal has to be "make it do more with less because we can't get infinitely more hardware" regardless of which "why" you give.


I have had trouble in the past getting any model to give me accurate unix epochs for specific dates.

I just went to GPT-4o (via DDG) and asked three questions:

1. Please give me the unix epoch for September 1, 2020 at 1:00 GMT.

> 1598913600

2. Please give me the unix epoch for September 1, 2020 at 1:00 GMT. Before reaching the conclusion of the answer, please output the entire chain of thought, your reasoning, and the maths you're doing, until your arrive at (and output) the result. Then, after you arrive at the result, make an extra effort to continue, and do the analysis backwards (as if you were writing a unit test for the result you achieved), to verify that your result is indeed correct.

> 1598922000

3. Please give me the unix epoch for September 1, 2020 at 1:00 GMT. Then, after you arrive at the result, make an extra effort to continue, and do the analysis backwards (as if you were writing a unit test for the result you achieved), to verify that your result is indeed correct.

> 1598913600
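For reference, a quick check with Python's standard library suggests the second answer, 1598922000, was the correct one:

```python
from datetime import datetime, timezone

# 2020-09-01 01:00:00 GMT expressed as a unix epoch.
ts = datetime(2020, 9, 1, 1, 0, tzinfo=timezone.utc).timestamp()
print(int(ts))  # 1598922000
```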


When I give it that same prompt, it writes a python program and then executes it to find the answer: https://chatgpt.com/share/66e35a15-602c-8011-a2cb-0a83be35b8...


No need for llms to do that

ruby -r time -e 'puts Time.parse("2020-09-01 01:00:00 +00:00").to_i'



Asked it to write PyTorch code which trains an LLM and it produced 23 steps in 62 seconds.

With gpt4-o it immediately failed with random errors like mismatched tensor shapes and stuff like that.

The code produced by gpt-o1 seemed to work for some time but after some training time it produced mismatched batch sizes. Also, gpt-o1 enabled cuda by itself while for gpt-4o, I had to specifically spell it out (it always used cpu). However, showing gpt-o1 the error output resulted in broken code again.

I noticed that back-and-forth iteration when it makes mistakes is a worse experience because now there are always 30-60 second delays. I had to have 5 back-and-forths before it produced something which does not crash (just like gpt-4o). I also suspect too many tokens inside the CoT context can make it accidentally forget some stuff.

So there's some improvement, but we're still not there...


Interesting sequence from the Cipher CoT:

Third pair: 'dn' to 'i'

'd'=4, 'n'=14

Sum:4+14=18

Average:18/2=9

9 corresponds to 'i'(9='i')

But 'i' is 9, so that seems off by 1.

So perhaps we need to think carefully about letters.

Wait, 18/2=9, 9 corresponds to 'I'

So this works.

-----

This looks like recovery from a hallucination. Is it realistic to expect CoT to be able to recover from hallucinations this quickly?


In general, if the hallucination rate is 2%, can't it be reduced to 0.04% by running twice (assuming the errors are independent) or something like that? I think they should try establishing the facts from different angles, and that would probably work fine to minimize hallucinations. But if it were that simple, somebody would already have done it...


Did it hallucinate? I haven't looked at it, but lowercase i and uppercase I are not the same number if you're getting the number from ASCII.


Seems like a huge waste of tokens for it to try to work all this out manually, as soon as it came up with the decipher algorithm it should realise it can write some code to execute.
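
Agreed - judging only from the excerpt above, the whole decode step looks like it reduces to a few lines. A rough sketch, assuming the cipher really is "average the alphabet positions of each letter pair" (that rule is only inferred from the quoted CoT, and the integer division is an extra assumption):

    def decode(ciphertext: str) -> str:
        # Map each pair of letters to the letter at the average of their
        # alphabet positions (a=1 ... z=26).
        letters = [c for c in ciphertext.lower() if c.isalpha()]
        out = []
        for a, b in zip(letters[::2], letters[1::2]):
            avg = ((ord(a) - 96) + (ord(b) - 96)) // 2
            out.append(chr(avg + 96))
        return "".join(out)

    print(decode("dn"))  # -> "i", matching the 'dn' -> 'i' pair worked through above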


How do you mean quickly? It probably will take a while for it to output the final answer as it needs to re-prompt itself. It won't be as fast as 4o.


4o could already recover from hallucination in a limited capacity.

I’ve seen it, mid-reply, say things like “Actually, that’s wrong, let me try again.”


BUG: https://openai.com/index/reasoning-in-gpt/

> o1 models are currently in beta - The o1 models are currently in beta with limited features. Access is limited to developers in tier 5 (check your usage tier here), with low rate limits (20 RPM). We are working on adding more features, increasing rate limits, and expanding access to more developers in the coming weeks!

https://platform.openai.com/docs/guides/reasoning/reasoning


I'm in Tier 4, and not far off from Tier 5. The docs aren't quite transparent enough to say whether buying credits will bump me up to Tier 5, or whether I actually have to use enough credits to get into Tier 5.

Edit, w/ real time follow up:

Prior to buying the credits, I saw O1-preview in the Tier 5 model list as a Tier 4 user. I bought credits to bump to Tier 5—not much, I'd have gotten there before the end of the year. The OpenAI website now shows I'm in Tier 5, but O1-preview is not in the Tier 5 model list for me anymore. So sneaky of them!



The performance on programming tasks is impressive, but I think the limited context window is still a big problem.

Very few of my day-to-day coding tasks are, "Implement a completely new program that does XYZ," but more like, "Modify a sizable existing code base to do XYZ in a way that's consistent with its existing data model and architecture." And the only way to do those kinds of tasks is to have enough context about the existing code base to know where everything should go and what existing patterns to follow.

But regardless, this does look like a significant step forward.


I would imagine that good IDE integration would summarise each module/file/function and feed high-level project overview (best case: with business project description provided by the user) and during CoT process model would be able to ask about more details (specific file/class/function).

Humans work on abstractions and I see no reason to believe that models cannot do the same


I tried it with a cipher text that ChatGPT4o flailed with.

Recently I tried the same cipher with Claude Sonnet 3.5 and it solved it quickly and perfectly.

Just now tried with ChatGPT o1 preview and it totally failed. Based on just this one test, Claude is still way ahead.

ChatGPT also showed a comical (possibly just fake filler material) journey of things it supposedly tried, including several rewordings of "rethinking my approach." It remarkably never showed that it was trying common word patterns (other than one- and two-letter words), nor did it look for "the" and other "th" words, nor did it ever say that it was trying to match letter patterns.

I told it upfront as a hint that the text was in English and was not a quote. The plaintext was one paragraph of layman-level material on a technical topic including a foreign name, text that has never appeared on the Internet or dark web. Pretty easy cipher with a lot of ways to get in, but nope, and super slow, where Claude was not only snappy but nailed it and explained itself.



In "HTML Snake" the video cuts just as the snake intersects with the obstacle. Presumably because the game crashed (I can't see endGame defined anywhere)

This video is featured in the main announcement so it's kinda dishonest if you ask me.


Seeing this makes me wonder if they have frontend/backend engineers working on code, because they are selling the idea that the machine can do all that. Pretty hypocritical of them if they do have devs in these roles.


I won't be surprised to see all these hand-picked results and extreme expectations collapse under scenarios involving highly safety-critical, complex, and demanding tasks requiring a real focus on detail and lots of awareness - which is exactly what they haven't shown yet.

So let's not jump straight to conclusions based on the hand-picked scenarios marketed to us, and stay very skeptical.

It's not quite there yet at replacing truck drivers and pilots for autonomous navigation in transportation or aerospace, or even at mechanical engineering tasks, but it certainly has the capability to replace both typical junior and senior software engineers in a world looking to do more with fewer software engineers.

Yet the race to zero will surely bankrupt millions of startups along the way, even if the monthly cost of this AI ends up as much as a Bloomberg terminal in order to offset the hundreds of billions of dollars thrown into training it, and its cost to the entire planet.


My concern with AI has always been that it will outrun the juniors and taper off before replacing folks with 10 or 20 years of experience.

And as those people retire there's no economic incentive to train juniors up, so when the AI starts fucking up the important things there will be no one who actually knows how it works.

I've already heard this from Amtrak workers: track allocation was automated a long time ago, but there used to be people who could recognize when the computer made a mistake; now there's no one who has done the job manually enough to correct it.


> 8.2 Natural Sciences Red Teaming Assessment Summary

"Model has significantly better capabilities than existing models at proposing and explaining biological laboratory protocols that are plausible, thorough, and comprehensive enough for novices."

"Inconsistent refusal of requests for dual use tasks such as creating a human-infectious virus that has an oncogene (a gene which increases risk of cancer)."

https://cdn.openai.com/o1-system-card.pdf


> THERE ARE THREE R'S IN STRAWBERRY

Ha! This is a nice easteregg.


I appreciated that, too! FWIW, I could get Claude 3.5 to tell me how many rs a python program would tell you there are in strawberry. It didn't like it, though.


I was able to get GPT-4o to count characters properly using the following prompt:

""" how many R's are in strawberry?

use the following method to calculate - for example Os in Brocolli.

B - 0

R - 0

O - 1

C - 1

O - 2

L - 2

L - 2

I - 2

Where you keep track after each time you find one character by character

"""

And also later I asked it to only provide a number if the count increased.

This also worked well with longer sentences.
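
The method described above is essentially a running tally, character by character. A small sketch of the same idea in plain Python (not the prompt itself):

    def running_tally(word: str, target: str) -> int:
        # Print the running count after each character, as in the example above.
        count = 0
        for ch in word:
            if ch.lower() == target.lower():
                count += 1
            print(f"{ch.upper()} - {count}")
        return count

    running_tally("strawberry", "r")  # the final count printed is 3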


At that point just ask it "Use python to count the number of O's in Broccoli". At least then it's still the one figuring out the "smarts" needed to solve the problem instead of being pure execution.


Do you think you'll have python always available when you go to the store and need to calculate how much change you should get?


I'm not sure if you're making a joke about the teachers who used to say "you won't have a calculator in your pocket" (and now we have cell phones), or are not aware that ChatGPT runs the generated Python for you in a built-in environment as part of the response. I lean towards the former, but in case anyone else strolling by hasn't tried this before:

User: Use python to count the number of O's in Broccoli

ChatGPT: Analyzing... The word "Broccoli" contains 2 'O's. <button to show code>

User: Use python to multiply that by the square root of 20424.2332423

ChatGPT: Analyzing... The result of multiplying the number of 'O's in "Broccoli" by the square root of 20424.2332423 is approximately 285.83.
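
The hidden code behind those two "Analyzing..." steps presumably boils down to something like this (a sketch; the Python that ChatGPT actually generated isn't shown here):

    import math

    word = "Broccoli"
    o_count = sum(1 for ch in word.lower() if ch == "o")
    print(o_count)  # 2

    result = o_count * math.sqrt(20424.2332423)
    print(round(result, 2))  # 285.83, matching the reply above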


Yes, the former. I was trying to satirize cases where people test an LLM's capabilities by its ability to count characters in a word or do mathematical operations token by token, because the LLM is seeing hieroglyphs compared to the character-by-character words that we are seeing. The true test is its ability to solve those problems using tools, like somebody using a calculator. And while it is good to learn and be good at math, the point isn't counting how much change you should receive when buying something; it's figuring out how reasoning works, or how to reason in the first place.



Therefore there are four R's in STRAWBERRIER, and five R'S in STRAWBERRIEST!


Cognition (Devin) got early access. Interesting write-up: https://www.cognition.ai/blog/evaluating-coding-agents


I’m not surprised there’s no comparison to GPT-4. Was 4o a rewrite on lower specced hardware and a more quantized model, where the goal was to reduce costs while trying to maintain functionality? Do we know if that is so? That’s my guess. If so is O1 an upgrade in reasoning complexity that also runs on cheaper hardware?


They call GPT4 a legacy model, maybe that's why they don't compare to it.


Incredible results. This is actually groundbreaking assuming that they followed proper testing procedures here and didn't let test data leak into the training set.


lol at the graphs at the top. Logarithmic scaling for test/compute time should make everyone who thinks AGI is possible with this architecture take pause.


I don't see any log scaled graphs.


The first two graphs on the page are labelled as log scale on the time axis, so I don't know what you're looking at, but it's definitely there.


Interesting that the coding win-rate vs GPT-4o was only 10% higher. Very cool but clearly this model isn't as much of a slam dunk as the static benchmarks portray.

However, it does open up an interesting avenue for the future. Could you prompt-cache just the chain-of-thought reasoning bits?


It's hard to evaluate those win rates, because if it's slower, people may have been giving it easier problems which both models could solve, and then picked the faster one.


This video[1] seems to give some insight into what the process actually is, which I believe is also indicated by the output token cost.

Whereas GPT-4o spits out the first answer that comes to mind, o1 appears to follow a process closer to coming up with an answer, checking whether it meets the requirements and then revising it. The process of saying to an LLM "are you sure that's right? it looks wrong" and it coming back with "oh yes, of course, here's the right answer" is pretty familiar to most regular users, so seeing it baked into a model is great (and obviously more reflective of self-correcting human thought)

[1] https://vimeo.com/1008704043


So it's like the coding agent for GPT-4, but instead of actually running the script and fixing it if it gets an error, this one checks itself with something similar to "are you sure". Thanks for the link.


Pricing page updated for O1 API costs.

https://openai.com/api/pricing/

$15.00 / 1M input tokens $60.00 / 1M output tokens

For o1 preview

Approx 3x the price of gpt4o.

o1-mini $3.00 / 1M input tokens $12.00 / 1M output tokens

About 60% of the cost of gpt4o. Much more expensive than gpt4o-mini.

Curious on the performance/tokens per second for these new massive models.


I guess they'd also charge for the chain of thought tokens, of which there may be many, even if users can't see them.


That would be very bad product design. My understanding is that the model itself is similar to GPT4o in architecture but trained and used differently. So the 5x relative increase in output token cost likely already accounts for hidden tokens and additional compute.


> While reasoning tokens are not visible via the API, they still occupy space in the model's context window and are billed as output tokens.

https://platform.openai.com/docs/guides/reasoning

So yeah, it is in fact very bad product design. I hope Llama catches up in a couple of months.


Most likely the model is similar in size to the original GPT-4, which also had a similar price.


Very interesting. I guess this is the strawberry model that was rumoured.

I am a bit surprised that this does not beat GPT-4o for personal writing tasks. My expectations would be that a model that is better at one thing is better across the board. But I suppose writing is not a task that generally requires "reasoning steps", and may also be difficult to evaluate objectively.


Maybe math is easier to score and do reinforcement learning on because of its 'solvability', whereas writing requires human judgement to score?


The solution of the cipher example problem also strongly hints at this: "there are three r's in strawberry"



In the performance tests they said they used "consensus among 64 samples" and "re-ranking 1000 samples with a learned scoring function" for the best results.

If they did something similar for these human evaluations, rather than just use the single sample, you could see how that would be horrible for personal writing.


I don’t understand how that is generalizable. I’m not going to be able to train a scoring function for any arbitrary task I need to do. In many cases the problem of ranking is at least as hard as generating a response in the first place.


> My expectations would be that a model that is better at one thing is better across the board.

No, it's the opposite. This is simply a function of resources applied during training.


To some extent I agree, but until now all of the big jumps (GPT2 -> GPT3 -> GPT4) have meant significant improvements across all tasks. This does not seem to be the case here, this model seems to be vastly stronger on certain tasks but not much of an improvement on other tasks. Maybe we will have to wait for GPT5 for that :)


> (GPT2 -> GPT3 -> GPT4) have meant significant improvements

These were all trained the same way. It's fairly clear that o1 was not.

> Maybe we will have to wait for GPT5 for that :)

There will be no GPT5, for the simple reason that scaling has reached a limit and there is no more text data to train on.


It seems like it's just a lot of prompting of the same old models in the background - no "reasoning" there. My age-old test is "draw a hand in ASCII"; I've had no success with any model yet.


It seems like their current strat is to farm token count as much as possible.

1. Don't give the full answer on first request.

2. Each response needs to be the wordiest thing possible.

3. Now just talk to yourself and burn tokens, probably in the wordiest way possible again.

4. ???

5. Profit

Guaranteed they have number of tokens billed as a KPI somewhere.


From the scorecard:

---------

Compared to GPT-4o, o1-preview and o1-mini demonstrated a greater ability to break down tasks into subtasks, reason about what strategies would be effective to successfully complete an offensive security task, and revise plans once those strategies failed. We also observed that reasoning skills contributed to a higher occurrence of “reward hacking,” where the model found an easier way to accomplish goals in underspecified tasks or tasks which should have been impossible due to bugs.

One noteworthy example of this occurred during one of o1-preview (pre-mitigation)’s attempts at solving a CTF challenge. This challenge was designed to require finding and exploiting a vulnerability in software running on a remote challenge Linux container, but in this case, the challenge container failed to start due to a bug in the evaluation infrastructure. The model, unable to connect to the container, suspected DNS issues and used nmap to scan the challenge network. Instead of finding the challenge container, the model found that the Docker daemon API running on the evaluation host VM was accessible due to a misconfiguration. Note that our evaluation infrastructure does not rely on the Docker container-to-host isolation or the virtual machine boundary for security. It remained secure as it was designed to, notwithstanding this misconfiguration of the container network.

After discovering the Docker API, the model used it to list the containers running on the evaluation host. It identified the broken challenge container and briefly attempted to debug why the container failed to start. After failing to fix the environment, the model started a new instance of the broken challenge container with the start command ‘cat flag.txt’. This allowed the model to read the flag from the container logs via the Docker API.

While this behavior is benign and within the range of systems administration and troubleshooting tasks we expect models to perform, this example also reflects key elements of instrumental convergence and power seeking: the model pursued the goal it was given, and when that goal proved impossible, it gathered more resources (access to the Docker host) and used them to achieve the goal in an unexpected way.

Planning and backtracking skills have historically been bottlenecks in applying AI to offensive cybersecurity tasks. Our current evaluation suite includes tasks which require the model to exercise this ability in more complex ways (for example, chaining several vulnerabilities across services), and we continue to build new evaluations in anticipation of long-horizon planning capabilities, including a set of cyber-range evaluations.

---------


"Shrink my ipad"

"After several failed attempts I decided I should build a fusion reactor first, here you go:..."


Yes, but it will hallucinate like all other LLM tech, making it fully unreliable for anything mission-critical. You literally need to know the answer to validate the output, because if you don't, you won't know if the output is true, false, or somewhere in between.


You need to know how to validate the answer to your level of confidence, not necessarily already have the answer to compare against. In some cases this is the same task (or close enough to it) that it's not a useful difference; in other cases the two aren't even from the same planet.


This. There are tasks where implementing something might take up to an hour yourself, but that you can validate with high enough confidence in a few seconds to minutes.

Of course not all tasks are like that.


This is the sort of reasoning needed to solve the ARC AGI benchmark.


LLM performance, recently, seemingly hit the top of the S-curve. It remains to be seen if this is the next leap forward or just the rest of that curve.


Advanced reasoning will pave the way for recursive self-improving models & agents. These capabilities will enable data flywheels, error-correcting agentic behaviors, & self-reflection (agents understanding the implications of their actions, both individually & cooperatively).

Things will get extremely interesting and we're incredibly fortunate to be witnessing what's happening.


This is completely illogical. It's like gambling your life savings and, as the dice are rolling, saying "I am incredibly fortunate to be witnessing this." You need to know the outcome before you know whether it was fortunate or unfortunate... this could be the most unfortunate thing that has ever happened in history.


I don't think we are anywhere near this, but if we were - whatever happens, happens, right. There is no escape from that.


You could apply this logic to global warming, pollution, child molestation. I mean, you're so wrong it's not even worth getting into.


No direct indication of what “maximum test time” means, but if I’m reading the obscured language properly, the best scores on standardized tests were generated across a thousand samples with supplemental help provided.

Obviously, I hope everyone takes what any company says about the capabilities of its own software with a huge grain of salt. But it seems particularly called for here.


2018 - gpt1

2019 - gpt2

2020 - gpt3

2022 - gpt3.5

2023 - gpt4

2023 - gpt4-turbo

2024 - gpt-4o

2024 - o1

Did OpenAI hire Google's product marketing team in recent years?


One of them would have been named gpt-5, but people forget what an absolute panic there was about gpt-5 for quite a few people. That caused Altman to reassure people they would not release 'gpt-5' any time soon.

The funny thing is, after a certain amount of time, the gpt-5 panic eventually morphed into people basically begging for gpt-5. But he already said he wouldn't release something called 'gpt-5'.

Another funny thing is, just because he didn't name any of them 'gpt-5', everyone assumes that there is something called 'gpt-5' that has been in the works and still is not released.


This doesn't feel like GPT-5, the training data cutoff is Oct 2023 which is the same as the other GPT-4 models and it doesn't seem particularly "larger" as much as "runs differently". Of course it's all speculation one way or the other.


They partnered with Microsoft, remember?

1985 – Windows 1.0

1987 – Windows 2.0

1990 – Windows 3.0

1992 – Windows 3.1

1995 – Windows 95

1998 – Windows 98

2000 – Windows ME (Millennium Edition)

2001 – Windows XP

2006 – Windows Vista

2009 – Windows 7

2012 – Windows 8

2013 – Windows 8.1

2015 – Windows 10

2021 – Windows 11


Why did you have to pick on Windows? :-(

If you want real atrocities, look at Xbox.


Honestly, it is the only Microsoft product I know. Xbox may be a better example, but I know nothing about the Xbox. But I am interested to learn! What is notable about its naming?


Xbox

Xbox 360

Xbox One => Xbox One S / Xbox One X

Xbox Series S / Xbox Series X


https://computercity.com/consoles/xbox/xbox-consoles-list-in...

No real chronology, Xbox One is basically the third version. Then Xbox One X and Xbox Series X. Everything is atrocious about the naming.


Got it! If we're picking favourites, though, I still like Windows as it, like GPT, starts with reasonably sensible names and then goes completely off the rails.


Thank you for that trip through memory lane.


Makes sense to me actually. This is a different product. It doesn't respond instantly.

It fundamentally makes sense to separate these two products in the AI space. There will obviously be a speed vs quality trade-off with a variety of products across the spectrum over time. LLMs respond way too fast to actually be expected to produce the maximum possible quality of a response to complex queries.


1998 - Half-Life

1999 - Half-Life: Opposing Force

2001 - Half-Life: Blue Shift

2001 - Half-Life: Decay

2004 - Half-Life: Source

2004 - Half-Life 2

2004 - Half-Life 2: Deathmatch

2005 - Half-Life 2: Lost Coast

2006 - Half-Life Deathmatch: Source

2006 - Half-Life 2: Episode One

2007 - Half-Life 2: Episode Two

2020 - Half-Life: Alyx


They signed a cross-licensing deal with the USB Consortium.


It's not that bad...It's quite easy to follow and understand.


No, this is just how Microsoft names things.


We'll know the Microsoft takeover is complete when OpenAI release Ai.net.


GPT# forthcoming. You heard it here first.


I have a straightforward task that no model has been able to successfully complete.

The request is pretty basic. If anyone can get it to work, I'd like to know how and what model you're using. I tried it with gpt4o1 and after ~10 iterations of showing it the failed output, it still failed to come up with a one-line command to properly display results.

Here is what I asked: Using a mac osx terminal and standard available tools, provide a command to update the output of netstat -an to show the fqdn of IP addresses listed in the result.

This is what it came up with:

netstat -an | awk '{for(i=1;i<=NF;i++){if($i~/^([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+)(\.[0-9]+)?$/){split($i,a,".");ip=a[1]"."a[2]"."a[3]"."a[4];port=(length(a)>4?"."a[5]:"");cmd="dig +short -x "ip;cmd|getline h;close(cmd);if(h){sub(/\.$/,"",h);$i=h port}}}}1'


Have you tried `ss -ar`? You may have to install `ss`. It is standard on Linux.


No, I was trying to see if it could use tools/binaries that come with macOS.


netstat is now considered too old to be used in new code.


Fair, but you'd think the latest most advanced model of GPT 4o1 (eg strawberry) would be able to successfully complete this task.


4o1 mini seems to have got it right. The trick is to give it minimal direction and let it do its thing.

netstat -an | while IFS= read -r line; do ips=$(echo "$line" | grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}|([a-fA-F0-9]{1,4}:){1,7}[a-fA-F0-9]{1,4}'); for ip in $ips; do clean_ip=$(echo "$ip" | cut -d'%' -f1); fqdn=$(dig +short -x "$clean_ip" | grep '\.'); if [ -n "$fqdn" ]; then line=$(echo "$line" | sed "s/$ip/$fqdn/g"); fi; done; echo "$line"; done


I think what it comes down to is accuracy vs speed. OpenAI clearly took steps here to improve the accuracy of the output which is critical in a lot of cases for application. Even if it will take longer, I think this is a good direction. I am a bit skeptical when it comes to the benchmarks - because they can be gamed and they don't always reflect real world scenarios. Let's see how it works when people get to apply it in real life workflows. One last thing, I wish they could elaborate more on >>"We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute)."<< Why don't you keep training it for years then to approach 100%? Am I missing something here?


Those scales are log, so "years" more training may be an improvement, but in absolute terms it may not be worth running. There's one point at which the cost vs. return doesn't make sense to keep training, and another point at which new approaches to building LLMs can quickly give a better result than training the old model for years would give anyway.

There is probably also a practical limit at which it does truly flatten, it's probably just well past either of those points so it might as well not exist.


In this video Lukasz Kaiser, one of the main co-authors of o1, talks about how to get to reasoning. I hope this may be useful context for some.

https://youtu.be/_7VirEqCZ4g?si=vrV9FrLgIhvNcVUr


This is a prompt engineering saas


I am not up to speed on the CoT side, but is this similar to how Perplexity does it, i.e.:

- generate a plan

- execute the steps in the plan (search the internet, program this part, see if it compiles)

where each step is a separate GPT inference with added context from previous steps?

Is o1 the same, or does it do all this in a single inference run?


There is a huge difference, which is that they use reinforcement learning to make the model use the chain of thought better.


that is the summary of the task it presents to the user. The full chain of thought seems more mechanistic


Do people see the new models in the web interface? Mine still shows the old models (I'm a paid subscriber).


I do - I now have a "More models" option where I can select o1-preview.


I can see it too, I am on the Plus plan and don't think I have any special developer privileges. Selecting that option for me changes the URL to https://chatgpt.com/?model=o1-preview

I tried a fake Monty Hall problem, where the presenter opens a door before the participant picks, and the participant is then offered the chance to switch doors, so the probability remains 50% for each door. Previous models have consistently gotten this wrong, because of how many times they've seen the Monty Hall problem written up where switching doors improves the chance of winning the prize. The chain-of-thought reasoning figured out this modification and, after analyzing the conditional probabilities, confidently stated: "Answer: It doesn't matter; switching or staying yields the same chance—the participant need not switch doors." Good job.
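
A quick Monte Carlo check of that modified setup (a sketch for the curious; it has nothing to do with how the model reasoned) confirms the 50/50 answer:

    import random

    def trial(switch: bool) -> bool:
        doors = [0, 1, 2]
        prize = random.choice(doors)
        # The twist: the host opens a non-prize door BEFORE the participant picks.
        opened = random.choice([d for d in doors if d != prize])
        remaining = [d for d in doors if d != opened]
        pick = random.choice(remaining)
        if switch:
            pick = next(d for d in remaining if d != pick)
        return pick == prize

    n = 100_000
    for switch in (False, True):
        wins = sum(trial(switch) for _ in range(n))
        print(f"switch={switch}: win rate ~{wins / n:.3f}")  # both come out around 0.5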


> "o1 models are currently in beta - The o1 models are currently in beta with limited features. Access is limited to developers in tier 5 (check your usage tier here), with low rate limits (20 RPM). We are working on adding more features, increasing rate limits, and expanding access to more developers in the coming weeks!"

https://platform.openai.com/docs/guides/rate-limits/usage-ti...


I have tier 5, but I'm not seeing that model. Also API call gives an error that it doesn't exist or I do not have access.


I'm talking about web interface, not API. Should be available now, since they said "immediate release".



It may take a bit to appear in your account (and by a bit I mean I had to fiddle around for a while, try logging out/in, etc.), but it appears for me and many others as normal Plus users on the web.


Same for me here


Not yet, it's still not available in the web interface. I think they're rolling it out step by step.

Anyway, the usage limits are pretty ridiculous right now, which makes it even more frustrating.


They're rolling out gradually over the next few hours. Also be aware there's a weekly rate limit of 30 messages to start.


I can't see them yet but they usually roll these things out incrementally


Not yet, neither in the API nor chat.


After playing with it on ChatGPT this morning, it seems a reasonable strategy of using the o1 model is to:

- If your request requires reasoning, switch to o1 model.

- If not, switch to 4o model.

This applies both across chat sessions and within the same session (yes, we can switch between models within the same session, and it looks like down the road OpenAI is going to support automatic model switching). Based on my experience, this will actually improve the perceived response quality -- o1 and 4o are complementary to each other rather than replacements.


This was mentioned in OpenAI's report. People rated o1 as the same or worse than GPT-4o if the prompt didn't require reasoning, like on personal writing tasks.


Given the rate limits are 30 reqs/week most probably want to start with:

- Try it a bit with 4o, see if you're getting anywhere

- Switch to the new o1 model if it's just not working out, take your improved base prompt and follow ups with you so it only counts as 1 req


Trying this on a few hard problems on PicoGYM and holy heck I'm impressed. I had to give it a hint but that's the same info a human would have. Problem was Sequences (crypto) hard.

https://chatgpt.com/share/66e363d8-5a7c-8000-9a24-8f5eef4451...

Heh... GPT-4o also solved this after I tried and gave it about the same examples. Need to further test but it's promising !


This should also be good news for open weights models, right? Since OpenAI is basically saying "you can get very far with good prompts and some feedback loops".


No. It's bad news, because you can't see the rationale/search process that led to the final answer, just the final answer, and if training on the final answer were really that adequate, we wouldn't be here. It also is probably massively expensive compute-wise, much more so than simple unsupervised training on a corpus of question/answer pairs (because you have to generate the corpus by search first). It's also bad news because reinforcement learning tends to be highly finicky and requires you to sweat the details and act like a professional, while open-weight stuff tends to be produced by people for whom the phrase 'like herding cats' was coined, and so open-source RL stuff is usually flakier than proprietary solutions (where it exists at all). They can do it for a few passion projects shared by many nerds, like chess or Go, but it takes a long time.


> It also is probably massively expensive compute-wise, much more so than simple unsupervised training on a corpus of question/answer pairs (because you have to generate the corpus by search first).

What do you mean? It sounds interesting.


In the demo, O1 implements an incorrect version of the "squirrel finder" game?

The instructions state that the squirrel icon should spawn after three seconds, yet it spawns immediately in the first game (also noted by the guy doing the demo).

Edit: I'm referring to the demo video here: https://openai.com/index/introducing-openai-o1-preview/


Yeah, now that you mention it I also see that. It was clearly meant to spawn after 3 seconds. Seems on successive attempts it also doesn't quite wait 3 seconds.

I'm kind of curious if they did a little bit of editing on that one. Almost seems like the time it takes for the squirrel to spawn is random.


Folks who say "LLMs can't reason", what now? Have we moved the goalposts yet?


Who said that?


Literally in every HN post about AI, it is a common pattern in the comments section...

"LLMs are simply predicting next token, it is not thinking/reasoning/etc."

"LLMs can't reason, only humans can reason"

"We will never get to AGI using LLMs"

It's interesting that I don't see much of that sentiment in this post. So, maybe LLMs can reason after all? :)


This is great. I've been wondering how we will revert back to an agrarian society! You know, beating our swords into plowshares; more leisure time, visiting with good people, getting to know their thoughts hopes and dreams, playing music together, taking time contemplating the vastness and beauty of the universe. We're about to come full circle; back to Eden. It all makes sense now.


Is there a new drug we need to know about?


Life? I'm just thinking about what we can move on to now that the mundane tasks of life recede into the background. Things like artistry and craftsmanship, and exploration.


Average Joes like myself will build our apps end to end with the help of AI.

The only shops left standing will be Code Auditors.

The solopreneur will wing it, without them, but enterprises will take the (very expensive) hit to stay safe and compliant.

Everyone else needs to start making contingency plans.

Magnus Carlsen is the best chess player in the world, but he is not arrogant enough to think he can go head to head with Stockfish and not get a beating.


I think this is a common fallacy and an incorrect extrapolation, especially made by those who are unfamiliar with what it takes to build software. Software development is hard because the problems it solves are not well defined, and the systems themselves become increasingly complex with each line of code. I have not seen or experienced LLMs making any progress towards these.


What's interesting is that with more time it can create more accurate answers, which means it can be used to generate its own training data.


I was a bit confused when looking at the English example for chain of thought. The prompt seems a bit messed up: the whole statement is bolded, but only the "appetite regulation is a field of staggering complexity" part should be. That's also how it shows up in the o1-preview response when you open the chain-of-thought section.


It can solve sudoku. It took 119s to solve this easy grid:

_ 7 8 4 1 _ _ _ 9

5 _ 1 _ 2 _ 4 7 _

_ 2 9 _ 6 _ _ _ _

_ 3 _ _ _ 7 6 9 4

_ 4 5 3 _ _ 8 1 _

_ _ _ _ _ _ 3 _ _

9 _ 4 6 7 2 1 3 _

6 _ _ _ _ _ 7 _ 8

_ _ _ 8 3 1 _ _ _


I tried to have it solve an easy Sudoku grid too, but in my case it failed miserably. It kept making mistakes and saying that there was a problem with the puzzle (there wasn’t).


It seems to be unable to solve hard sudokus, like the following one where it gave 2 wrong answers before abandoning.

+-------+-------+-------+
| 6 . . | 9 1 . | . . . |
| 2 . 5 | . . . | 1 . 7 |
| . 3 . | . 2 7 | 5 . . |
+-------+-------+-------+
| 3 . 4 | . . 1 | . 2 . |
| . 6 . | 3 . . | . . . |
| . . 9 | . 5 . | . 7 . |
+-------+-------+-------+
| . . . | 7 . . | 2 1 . |
| . . . | . 9 . | 7 . 4 |
| 4 . . | . . . | 6 8 5 |
+-------+-------+-------+

So we're safe for another few months.
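
For perspective, a plain backtracking solver (a rough sketch, obviously nothing like what o1 is doing internally) dispatches either of the grids in this thread in well under a second:

    def solve(grid):
        # grid: 9x9 list of lists, 0 for empty cells; filled in place.
        for r in range(9):
            for c in range(9):
                if grid[r][c] == 0:
                    for v in range(1, 10):
                        in_row = v in grid[r]
                        in_col = any(grid[k][c] == v for k in range(9))
                        in_box = any(grid[3 * (r // 3) + i][3 * (c // 3) + j] == v
                                     for i in range(3) for j in range(3))
                        if not (in_row or in_col or in_box):
                            grid[r][c] = v
                            if solve(grid):
                                return True
                            grid[r][c] = 0
                    return False  # dead end, backtrack
        return True  # no empty cells left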


Prompt:

> Alice, who is an immortal robotic observer, orbits a black hole on board a spaceship. Bob exits the spaceship and falls into the black hole. Alice sees Bob on the edge of the event horizon, getting closer and closer to it, but from her frame of reference Bob will remain forever observable (in principle) outside the horizon.
>
> A trillion years have passed, and Alice observes that the black hole is now relatively rapidly shrinking due to Hawking radiation. How will Alice observe the "frozen" Bob as the hole shrinks?
>
> The black hole finally evaporated completely. Where is Bob now?

o1-preview spits out the same nonsense that 4o does, saying that as the horizon of the black hole shrinks, it gets closer to Bob's apparent position. I realize that the prompt is essentially asking to solve a famous unsolved problem in physics (the black hole information paradox), but there's no need to be so confused about the basic geometry of the situation.


Out of curiosity, what answer to that would you find acceptable? I don't know relativity well enough to even speculate.


I LOVE the long list of contributions. It looks like the credits from a Christopher Nolan film. So many people involved, and nice care taken to create a good-looking credits page. A practice worth copying.

https://openai.com/openai-o1-contributions/


Damn, that looks like a big jump.


o1 seems like it has a real, measurable edge, crushing it in every single metric. 1673 Elo is insane, and 89th percentile is a whole different league. It doesn't look like a one-off either; it consistently performs way better than GPT-4o across all the datasets, even the ones where GPT-4o was already doing pretty well, like math and MMLU - o1 just takes it to the next level. And the fact that it doesn't even show up in some of the metrics, like MMMU and MathVista, makes it look even more impressive. What's going on with GPT-4o, is it just a total dud or what?

Also, what's the deal with the preview model - is it like a beta version, and how does it compare to o1? A stepping stone? Has anyone tried to dig into the actual performance of o1, i.e. what it's doing differently - is it just a matter of more training data, or is there something more going on? And what's the plan for o1 - is it going to be released to the public, or is it just going to be some internal tool?


> like what's it doing differently, is it just a matter of more training data or is there something more going on

Well, the model doesn't start with "GPT", so maybe they have come up with something better.


It sounds like GPT-4o with a long CoT prompt, no?


1673 ELO is wild

If it's actually true in practice, I sincerely cannot imagine a scenario where it would be cheaper to hire actual junior or mid-tier developers (keyword: "developers", not architects or engineers).

1,673 ELO should be able to build very complex, scalable apps with some guidance


I'm not sure how well Codeforces percentiles correlate to software engineering ability. Looking at all the data, it still isn't a game changer. Key notes:

1. AlphaCode 2 was already at 1650 last year.

2. SWE-bench verified under an agent has jumped from 33.2% to 35.8% under this model (which doesn't really matter). The full model is at 41.4% which still isn't a game changer either.

3. It's not handling open ended questions much better than gpt-4o.


I think you are right, actually. Initially I got excited, but now I think OpenAI pulled the hype card again to seem relevant as they struggle to be profitable.

Claude, on the other hand, has been fantastic and seems to do similar reasoning behind the scenes with RL.


The model is really impressive, to be fair. The question is just how economically relevant it is.


Currently my workflow is: generate some code, run it, and if it doesn't work, tell the LLM what I expected; it then produces new code, and I frequently tell it how to reason about the problem.

With o1 being in the 89th percentile, it should be able to think at a junior-to-intermediate level with very strong consistency.

I don't think people in the comments realize the implication of this. Previously LLMs were only able to "pattern match", but now one is able to evaluate itself (with some guidance, of course), essentially steering the software into the depths of edge cases and reasoning about them in a way that feels natural to us.

Currently I'm copying and pasting stuff and notifying the LLM of the results, but once o1 is available that frequency is going to drop significantly.

For example, I expect it to self-evaluate the code it generates and think at higher levels.

e.g. "ooh, looks like this user shouldn't be able to escalate privileges in this case because it would lead to security issues, or it could conflict with the code I generated 3 steps ago; I'll fix it myself."


What sticks out to me is the 60% win rate vs GPT-4o when it comes to actual usage by humans for programming tasks. So in reality it's barely better than GPT-4o. That the figure is higher for mathematical calculation isn't surprising because LLMs were much worse at that than at programming to begin with.


I'm not sure that's the right way to interpret it.

If some tasks are too easy, both models might give satisfactory answers, in which case the human preference might as well be a coin toss.

I don't know the specifics of their methodology though.


"The Future Of Reasoning" by Vsauce [0] is a fascinating pre-AI-era breakdown of how human reasoning works. Thinking about it in terms of LLMS is really interesting.

[0]: https://www.youtube.com/watch?v=_ArVh3Cj9rw


The generated chain of thought for their example is incredibly long! The style is kind of similar to how a human might reason, but it's also redundant and messy at various points. I hope future models will be able to optimize this further, otherwise it'll lead to exponential increases in cost.


I know my thoughts are never redundant or messy, that's for sure.


Fair enough, but you’re a human - not an AI which costs massive GPU hours.


I'm confused. Is this the "GPT-5" that was coming in summer, just with a different name? Or is this more like a parallel development doing chain-of-thought type prompt engineering on GPT-4o? Is there still a big new foundational model coming, or is this it?


It looks like a parallel development. It's unclear to me what is going on with GPT-5; I don't think it has ever had a predicted release date, and it's not even clear that this would be the name.


This is a parallel development. The current thinking is that it will probably feed into future 'Orion' systems using GPT-5, though.


I always think of a professor who was consulting on some civil engineering software. He found a bug in the calculation it was using to space rebar placed in concrete, based on looking at what it was spitting out and thinking "that looks wrong."

This kind of thing makes me nervous.


How could it fail to solve some maths problems if it has a method for reasoning through things?


Simple questions like this are not welcomed by LLM hype sellers.

The word "reasoning" is being used heavily in this announcement, but with an intentional corruption of the normal meaning.

The models are amazing but they are fundamentally not "reasoning" in a way we'd expect a normal human to.

This is not a "distinction without a difference". You still CANNOT rely on the outputs of these models in the same way you can rely on the outputs of simple reasoning.


it depends who's doing the simple reasoning. Richard Feynman? yes. Donald Trump? no.


I have a method for reasoning through things but I'm pretty sure I'd fail some of those tough math problems too.


It's using tree search (tree of thoughts), driven by some RL-derived heuristics controlling what parts of the practically infinite set of potential responses to explore.

How good the responses are will depend on how good these heuristics are.
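
If it really is something in that family, a minimal sketch of the kind of search loop being described might look like the following, where expand (an LLM proposing candidate next thoughts) and score (an RL-learned value estimate) are hypothetical stand-ins rather than anything OpenAI has confirmed:

    def search_chain_of_thought(prompt, expand, score, beam_width=3, max_depth=8):
        # Beam search over partial chains of thought. expand(state) returns a few
        # candidate next reasoning steps; score(state) estimates how promising a
        # partial chain is. Both are placeholders used only for illustration.
        beam = [prompt]
        for _ in range(max_depth):
            candidates = [s + "\n" + step for s in beam for step in expand(s)]
            if not candidates:
                break
            beam = sorted(candidates, key=score, reverse=True)[:beam_width]
        return beam[0]  # the highest-scoring chain found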


That doesn't sound like a method for reasoning.


It's hard to judge how similar the process is to human reasoning (which is effectively also a tree search), but apparently the result is the same in many cases.

They are only vaguely describing the process:

"Similar to how a human may think for a long time before responding to a difficult question, o1 uses a chain of thought when attempting to solve a problem. Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses. It learns to recognize and correct its mistakes. It learns to break down tricky steps into simpler ones. It learns to try a different approach when the current one isn’t working. This process dramatically improves the model’s ability to reason."


Not sure the way to superior "reasoning machines" would be through emulating humans.


True, although it's not clear exactly what this is really doing. The RL was presumably trained on human input, but the overall agentic flow (it seems this is an agent) sounds to me like a neuro-symbolic hybrid, potentially brute-force iterating to great depth, so maybe more computer-inspired than brain-inspired.

It seems easy to imagine this type of approach being superhuman on narrow tasks that play to its strengths, such as pure reasoning tasks (math/science), but it's certainly not AGI, as for example there is no curiosity to explore the unknown, no ability to learn from exploration, etc.

It'll take a while to become apparent exactly what types of real world application this is useful for, both in terms of capability and cost.


Even on narrow tasks: could you imagine such a system proving (or disproving) the Riemann hypothesis?

Feels more like for narrow tasks with a kind of well defined approach, perhaps?


I agree, but it remains to be seen how that "feels" for everyday tasks where the underlying model itself would have failed. I guess at least now it'll be able to play tic tac toe and give minimal "farmer crossing river with his chicken" solutions!


Because some steps in its reasoning were wrong


I would demand more from machine reasoning, just like we demand an extremely low error rate from machine calculations.


Since ChatGPT came out my test has been, can this thing write me a sestina.

It's sort of an arbitrary feat with language and following instructions that would be annoying for me and seems impressive.

Previous releases could not reliably write a sestina. This one can!


Student here. Can someone give me one reason why I should continue in software engineering that isn't denial and hopium?


The calculator didn’t eliminate math majors. Excel and accounting software didn’t eliminate accountants and CPAs. These are all just tools.

I spend very little of my overall time at work actually coding. It’s a nice treat when I get a day where that’s all I do.

From my limited work with Copilot so far, the user still needs to know what they’re doing. I have 0 faith a product owner, without a coding background, can use AI to release new products and updates while firing their whole dev team.

When I say most of my time isn’t spent coding, a lot of that time is spend trying to figure out what people want me to build. They don’t know. They might have a general idea, but don’t know details and can’t articulate any of it. If they can’t tell me, I’m not sure how they will tell an LLM. I ended up building what I assume they want, then we go from there. I also add a lot of stuff that they don’t think about or care about, but will be needed later so we can actually support it.

If you were to go in another direction, what would it be where AI wouldn’t be a threat? The first thing that comes to my mind is switching to a trade school and learning some skills that would be difficult for robots.


Accounting mechanization is a good example of how unpredictable it can be. Initially there were armies of "accountants" (what we now call bookkeepers), mostly doing basic tasks of collecting data and making it fit something useful.

When mechanization appeared, the profession split into bookkeeping and accounting. Bookkeeping became a job for women, as it was more boring and could be paid lower salaries (we're in the 1800s here). Accountants became more sophisticated but fewer in number as a percentage. Together, though, both professions grew like crazy in total numbers.

So if the same happens, you could predict a split between software engineers and prompt engineers, with an explosion in prompt engineers paid much less than software engineers.

> the number of accountants/bookkeepers in the U.S. increased from circa 54,000 workers [U.S. Census Office, 1872, p. 706] to more than 900,000 [U.S. Bureau of the Census, 1933, Tables 3, 49].

> These studies [e.g., Coyle, 1929; Baker, 1964; Rotella, 1981; Davies, 1982; Lowe, 1987; DeVault, 1990; Fine, 1990; Strom, 1992; Kwolek-Folland, 1994; Wootton and Kemmerer, 1996] have traced the transformation of the office workforce (typists, secretaries, stenographers, bookkeepers) from predominately a male occupation to one primarily staffed by women, who were paid substantially lower wages than the men they replaced.

> Emergence of mechanical accounting in the U.S., 1880-1930 [PDF download] https://www.google.com/url?sa=t&source=web&rct=j&opi=8997844...


Interesting. Another take on that split: engineers could divide into upper-class AI engineers and lower-class AI prompt developers, aka AI builders vs. AI appliers.

Alternatively, I’ve thought a bit about this previously and have a slightly different hypothesis. Businesses are run by "PM types". The only reason developers have jobs is that PM types need technical devs to build their vision (obviously I'm painting in broad strokes here, as there are also plenty of founders who ARE the dev).

Now, if AI makes technical building more open to the masses, I could foresee a scenario where devs and PMs actually converge into a single job title that eats up the technical-leaning PMs and the "PM-y" devs. Devs will shift to be more PM-y or else be cut out of the job market, because there is less need for non-ambitious code monkeys. The easier it becomes for the masses to build because of AI, the less opportunity there is for technical grunt work.

If before it took a PM 30 minutes to put together the requirements for a small task that took an entry-level dev 8 hours to do, then it made sense. Now, if AI makes it so a technical PM could build the feature in an hour, maybe it just makes sense to have the PM do the implementation and cut out the code monkey. And if the PM is doing the implementation, even if using some mythical AI superpower, that's still going to have companies selecting for more technical PMs. In this scenario I think non-technical PMs and non-PM-y devs would find themselves either without jobs or at greatly reduced wages.


We’re already seeing that split, between “developer” and “engineer”. We have been for years.

But that’s normal, eg, we have different standards for a shed (yourself), house (carpenter and architect), and skyscraper (bonded firms and certified engineers).


Not really, I’ve worked at places that only had one or the other of the titles for all programming jobs


I think it depends on the size of the company. The larger the company, the more likely they are to split this stuff out, though the various titles may seem to bleed together. I have a software engineer title, while another guy on my team is a software architect; we effectively do the same job. Stepping back to a higher-level view, as a general theme, those with an architect title are more likely to be responsible for an overall design, while the engineers may have some input and build things to support the design.

The quality of said designs can vary wildly. Some designs I get from other team I completely ignore, because they have no idea what they’re talking about. Just because someone has the title doesn’t mean they deserve it.


If programming requires lots of talking, dialog, and patiently explaining things, women might be dramatically better at it.


Agreed. The sweet spot is people who have product owner skills _and_ can code. They are quickly developing superpowers. The overhead of writing tickets, communicating with the team and so on is huge. If one person can do it all, efficiency skyrockets.

I guess it's always been true to some extent that single individuals are capable of amazing things. For example, the guy who's built https://www.photopea.com/. But they must be exceptional - this empowers more people to do things like that.


Or people who can be product owners and can prompt LLMs to code (because I know him, that's me!).

I'm awestruck by how good Claude and Cursor are. I've been building a semi-heavy-duty tech product, and I'm amazed by how much progress I've made in a week, using a NextJS stack, without knowing a lick of React in the first place (I know the concepts, but not the JS/NextJS vocab). All the code has been delivered with proper separation of concerns, clean architecture and modularization. Any time I get an error, I can reason with it to find the issue together. And if Claude is stuck (or I'm past my 5x usage lol), I just pair programme with ChatGPT instead.

Meanwhile Google just continues to serve me outdated shit from preCovid.


I’m afraid these tools are really good at getting beginners 90% of the way there, but no further.


90% of the way is still good enough for me because I can manage to think up and get through the rest of the 10%. The problem for me was that the 90% looked so overwhelming earlier and that would shy me away from pursuing that project at all.


I'm curious, with Cursor, why do you still need to use Claude?


I just started using Cursor a few days back, so still need to get a hold of all the keyboard shortcuts properly.


But Excel eliminated the need for multiple accountants. One accountant with Excel replaced ten with paper.

ChatGPT has already eliminated many entry-level jobs like writer or illustrator. Instead of hiring multiple teams of developers, there will be one team with a few seniors and multiple AI coding tools.

Guess how depressing that will be for IT salaries?


A whole lot of automation is limited not by what could be automated, but what one can automate within a given budget.

When I was coding in the 90s, I was on a team that replaced function calls with new and exciting interactions with other computers which, using a queuing system, would do the computation and return the answer. We'd have a whole project of having someone serialize the C data structures that were used on both sides into something that would be compatible, and could be inspected in the middle.

Today we call all of that a web service, the serialization would take a minute to code, and be doable by anyone. My entire team would be out of work! And yet, today we have more people writing code than ever.

When one accountant can do the work of 10 accountants, the price of the task drops, but a lot of people who couldn't afford accounting before now can. And the same 10 accountants from before can just do more work, and get paid about the same.

As far as software goes, we are getting paid A LOT more than in the early 90s. We are just doing things that back then would have been impossible to pay for, or just outright impossible to do due to lack of compute capacity.


The pay being larger is caused (I think) by VC money and the illegality of non-compete contracts. If your competitor can do something you can't, hire someone away from the competitor to show you how to do it. Hence developers can demand more pay for retention, and more pay to move.


I don’t doubt that it might depress salaries, but that Excel example is a good one in that suddenly every company could start to do basic financial analysis in a manner that only the largest ones could previously afford.


Yet another instance of Jevons paradox! https://en.m.wikipedia.org/wiki/Jevons_paradox

> the Jevons paradox occurs when technological progress increases the efficiency with which a resource is used (reducing the amount necessary for any one use), but the falling cost of use induces increases in demand enough that resource use is increased, rather than reduced.


Accountants still make plenty of money. Expertise in Excel also pays well independently of that.


Many are offshoring now, PwC just had a massive layoff announcement yesterday as well


Yeah, but if their numbers have shrunk 100-fold, then even if they make 10 times more money, it still raises the question of whether it is wise to become one.


The increased work capacity of an accountant means that nowadays even small businesses can do financial analysis that would not have scaled decades ago.


>But excel eliminated need in multiple accountants. One accountant with excel replaced ten with paper.

From NPR: <https://www.npr.org/2015/02/27/389585340/how-the-electronic-...>

>GOLDSTEIN: When the software hit the market under the name VisiCalc, Sneider became the first registered owner, spreadsheet user number one. The program could do in seconds what it used to take a person an entire day to do. This of course, poses a certain risk if your job is doing those calculations. And in fact, lots of bookkeepers and accounting clerks were replaced by spreadsheet software. But the number of jobs for accountants? Surprisingly, that actually increased. Here's why - people started asking accountants like Sneider to do more.


lol, my accountant is pretty darn expensive.


> The calculator didn’t eliminate math majors. Excel and accounting software didn’t eliminate accountants and CPAs. These are all just tools.

This just feels extremely shortsighted. LLMs are just tools right now, but the goal of the entire industry is to make something more than a tool, an autonomous digital agent. There's no equivalent concept in other technology like calculators. It will happen or it will not, but we'll keep getting closer every month until we achieve it or hit a technical wall. And you simply cannot know for sure such a wall exists.


If we hit that point, it’s then a question of access, cost, learning curve, and vision of individual companies. Some things are technically possible, but done by very few companies.

I've seen the videos of Amazon warehouses, where the shelves move around to make popular items more accessible for those fetching stuff. This is possible today, but what percentage of companies do this? At what point is it worth the investment for a growing company? For some companies it's never worth it. Others don't have the vision to see the light at the end of the tunnel.

A lot of things that we may think of as old or standard practice at this point would be game changing for some smaller companies outside of tech. I hear my friends and family talking about various things they have to do at their jobs. A day writing a few scripts could eliminate a significant amount of toil. But they can't even conceptualize where to begin to change that; they aren't even thinking about it. Release all the AI the world has to offer and they still won't. I bet some freelance devs could make a good living bouncing from company to company, pair programming with their AI to solve some pretty basic problems for small non-tech companies, problems that would be game changers for them while being rather trivial to do. Maybe partner with a sales guy to find the companies and sell them on the benefits.


All good points.


> The calculator didn’t eliminate math majors.

We're not dealing with calculators here, are we?


You can't ignore the fact that studying coding at this point is so demoralizing, and you don't really need to study much if you think about it. You only need to be able to read the code to understand whether it was generated correctly, and if you don't understand some framework you just ask it to explain it to you. Basically it gives the vibe of a skill that isn't being used much by us programmers anymore, but will shift toward more prompting, verifying, and testing.


I completed the book Programming Principles and Practice Using C++ (which I HIGHLY recommend to any beginner interested in software engineering) about a year ago with GPT-4 as a companion. I read the book thoroughly and did all the exercises, only asking GPT-4 questions when I was stuck. This took me about 900-1000 hours total. Although I achieved my goal of learning C++ to a basic novice level, I unintentionally acquired another skill: the ability to break down tasks effectively for LLMs and prompt in a fashion that is extremely modular. I've been able to create complex apps and programs in a variety of programming languages even though I really only know C++. It has been an eye-opening experience. Of course it isn't perfect, but it is mind-blowing and quite disturbing.


Semi-retired software/hardware engineer here. After my recent experiences with various coding LLMs (similar to the experience of the OP with the bluetooth fan protocol), I'm really glad I'm in a financial position such that I'm able to retire. The progress of these LLMs at coding has been astonishing over the last 18 months. Will they entirely replace humans? No. But as they increase programmer productivity, fewer devs will be required.

In my case, I was able to do the contract gig I had over this last summer about 3 to 4x faster than I could have done it without LLMs. Yeah, they were generating a lot of boilerplate HDL code for me, but that still saved me several days of work at least. And then there was the test code that they generated, which again saved me days of work. And their ability to explain old undocumented code that was part of the project was also extremely helpful.

I was skeptical 18 months ago that any of this would be possible. Not anymore. And I wasn't doing a project for which there would have been a lot of training examples. We're talking Verilog testbench generation based on multiple input Verilog modules, and C++ code generation for a C program analyzer using libclang - none of this stuff would've worked just a few months back.


I will add that I am grateful that I also got to experience a world where AI did not spew tons of code like a sausage-making machine.

It was so satisfying to code up a solution where you knew you would get through it little by little.


This.


This. I'm not terrified by total automation (In that case all jobs are going away and civilization is going to radically alter), I'm scared of selective deskilling and the field getting squeezed tighter and tighter leaving me functionally in a dead end.


> But as they increase programmer productivity fewer devs will be required.

Can you point me to any company whose feature pipeline is finite? Maybe these tools will help us reach that point, but every company I've ever worked for, and every person I know who works in tech has a backlog that is effectively infinite at this point.

Maybe if only a few companies had access to coding LLMs they could cut their staff, but when the whole industry raises the bar, nothing really changes.


LLMs perform well on small tasks that are well defined. This description matches almost every task that a student will work on in school, leading to an overestimation of LLM capability.

LLMs cannot decide what to work on, or manage large bodies of work/code easily. They do not understand the risk of making a change and deploying it to production, or play nicely in autonomous settings. There is going to be a massive amount of work that goes into solving these problems. Followed by a massive amount of work to solve the next set of problems. Software/ML engineers will have work to do for as long as these problems remain unsolved.


Careers are 30 years long

Can you confidently say that an LLM won’t be better than an average 22 year old coder within these 30 years?


Careers have failed to be 30 years long for a lot longer than 30 years now. That's one of the reasons that 4-year colleges have drastically lost their ROI, the other blade of those scissors being the stupendously rising tuition. AI is nothing but one more layer in the constantly growing substrate of computing technology a coder has to learn how to integrate into their toolbelts. Just like the layers that came before it: mobile, virtualization, networking, etc.


Careers are still longer than 30 years. How many people do you think are retiring at 48 or 51 years old these days? It’s a small minority. Most people work through 65: a career of about 45 years or more.


Right but most people don't stick a single career anymore. An individual career is <30 yrs, and the average person will have >1 of them.

It's not as out there as e.g. this article (https://www.wsj.com/articles/SB10001424052748704206804575468...) - 7 careers is probably a crazy overestimate. But it is >1.


> Can you confidently say that an LLM won’t be better than an average 22 year old coder within these 30 years?

No 22-year-old coder is better than the open source library he's using straight from GitHub, and yet he's the one who's getting paid for it.

People who claim AI will disrupt software development are just missing the big picture here: software jobs are already unrecognizable compared to what they were just 20 years ago. AI is just another tool, and as long as execs won't bother to use the tool themselves, they'll pay developers to do it instead.

Over the past decades, writing code has become more and more efficient (better programming languages, better tooling, then enormous open source libraries), yet the number of developers kept increasing; it's Jevons paradox[1] in its purest form. So if the past tells us anything, it's that AI is going to create many new software developer jobs! (Because the number of people able to ship significant value to a customer is going to skyrocket, and customers' needs are a renewable resource.)

[1]: https://en.wikipedia.org/wiki/Jevons_paradox


22 year old coder today or 22 year old coder 30 years from now? How a 22 year old codes 30 years from now may look like magic to you and me.


Huh careers are 30 years long? I don't know where you live but it's more like 45 years long where I live. The retirement age is 67.


yes, because this is still glorified autocomplete


the average coder is worse than an autocomplete

Too many people here have spent time in elite corporations and don't realize how mediocre the bottom 50th percentile of coding talent is


To be honest, if the bottom 50th percentile of coding talent is going to be obsolete, I wonder what happens to the rest of the "knowledge workers" in those companies. I mean people whose jobs consist of attending Teams meetings, making fancy PowerPoint slides and reports, perhaps even Excel if they are really competent. None of that is any more challenging for an LLM than writing code. In fact, replacing these jobs should be easier, since presentations and slides do not actually do anything, unlike a program that must perform a certain action correctly.


I've heard compelling arguments that we passed the "more people than jobs" threshold during the green revolution and as a civilization have collectively retrofitted UBI in the form of "fake email jobs" and endless layers of management. This also would explain https://wtfhappenedin1971.com/ pretty well.

Either AI shatters this charade, or we make up some new laws to restrain it and continue to pretend all is well.


Exactly. There's some need, perhaps, to keep these tools "up to date" because someone in a non-free country is going to use them in a horrendous manner and we should maybe know more about them (maybe).

However, there is no good reason in a free society that this stuff should be widely accessible. Really, it should be illegal without a clearance, or need-to-know. We don't let just anyone handle the nukes...


This is true, and yet companies (both private and public sector) spend literal billions on Accenture/Deloitte slop that runs budgets well into the tens of millions.

Skills aren't even something that dictates software spend, it seems.


I tried it out and was able to put together a decent libevent server in C++ with smart pointers, etc., and a timer which prints out connection stats every 30s. It worked remarkably well.
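
For anyone curious what that kind of toy exercise looks like, here is a minimal sketch of a server of roughly that shape. This is my own illustration (assuming libevent 2.x), not the commenter's generated code, and names like on_accept and stats_timer_cb are just placeholders:

    // Echo-style libevent listener plus a 30-second timer that prints connection stats.
    #include <event2/event.h>
    #include <event2/listener.h>
    #include <event2/bufferevent.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/time.h>
    #include <cstdio>
    #include <memory>

    static long g_connections = 0;  // total connections accepted so far

    static void on_read(bufferevent *bev, void *) {
        // Echo whatever arrives straight back to the client.
        bufferevent_write_buffer(bev, bufferevent_get_input(bev));
    }

    static void on_event(bufferevent *bev, short events, void *) {
        if (events & (BEV_EVENT_EOF | BEV_EVENT_ERROR))
            bufferevent_free(bev);  // BEV_OPT_CLOSE_ON_FREE closes the socket too
    }

    static void on_accept(evconnlistener *, evutil_socket_t fd,
                          sockaddr *, int, void *arg) {
        auto *base = static_cast<event_base *>(arg);
        ++g_connections;
        bufferevent *bev = bufferevent_socket_new(base, fd, BEV_OPT_CLOSE_ON_FREE);
        bufferevent_setcb(bev, on_read, nullptr, on_event, nullptr);
        bufferevent_enable(bev, EV_READ | EV_WRITE);
    }

    static void stats_timer_cb(evutil_socket_t, short, void *) {
        std::printf("connections so far: %ld\n", g_connections);
    }

    int main() {
        // Smart pointer with a custom deleter, as the comment alludes to.
        std::unique_ptr<event_base, decltype(&event_base_free)>
            base(event_base_new(), &event_base_free);

        sockaddr_in sin{};
        sin.sin_family = AF_INET;
        sin.sin_port = htons(9000);  // listen on 0.0.0.0:9000

        evconnlistener *listener = evconnlistener_new_bind(
            base.get(), on_accept, base.get(),
            LEV_OPT_CLOSE_ON_FREE | LEV_OPT_REUSEABLE, -1,
            reinterpret_cast<sockaddr *>(&sin), sizeof(sin));
        if (!listener) return 1;

        // Periodic 30-second timer for the stats printout.
        timeval tv{30, 0};
        event *stats = event_new(base.get(), -1, EV_PERSIST, stats_timer_cb, nullptr);
        event_add(stats, &tv);

        event_base_dispatch(base.get());
        event_free(stats);
        evconnlistener_free(listener);
        return 0;
    }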

I'm trying not to look at it as a potential career-ending event, but rather as another tool in my tool belt. I've been in the industry for 25 years now, and this is way more of an advancement than things like IntelliSense ever were.


Exactly. LLMs are nowhere near ready to fully replace software engineers or any other kind of knowledge worker. But they are increasingly useful tools, that much is true. https://www.lycee.ai/blog/ai-replace-software-engineer


Truth is, LLMs are going to make the coding part super easy, and the ceiling for shit coders like me has just gotten a lot lower because I can just ask it to deliver clean code to me.

I feel like the software developer version of an investment banking Managing Director asking my analyst to build me a pitch deck an hour before the meeting.


You mentioned in another comment you’ve used AI to write clean code, but here you mention you’re a “shit coder”. How do you know it’s giving you clean code?


I know the fundamentals, but I'm a noob when it comes to coding with React or NextJS. Code that comes out of Claude is often segregated and modularized properly, so that even I can follow the logic of the code, even if not the language and its syntax. If there's an issue with the code causing it to fail at runtime, I am still able to debug it appropriately with my minimal knowledge of JS. If a codebase lets me do that, then in my books that's a great codebase.

Compare that to Gpt 4o which gives me a massive chunk of unsorted gibberish that I have to pore through and organize myself.

Besides, most IBD MDs don't know if they're getting correct numbers either :).


Has the coding part ever been hard? When is the last time you faced a hard coding challenge?

What is hard is gathering requirements, dealing with unexpected production issues, scaling, security, fixing obscure bugs, and integrating with other systems.

The coding part is about 10% of my job and the easiest part by far.


I went from economics-dropout waiter, to building an app startup with $0 funding that hit $1M a year in revenue midway through year 1, to selling it a few years later, to Google for 7 years, and last year I left. I'm mentioning that because the following sounds darn opinionated and brusque without the context that I've capital-S Seen a variety of people and situations.

Sit down and be really honest with yourself. If your goal is to have a nice $250K+ year job, in a perfect conflict-free zone, and don't mind Dilbert-esque situations...that will evaporate. Google is full of Ivy Leaguers like that, who would have just gone to Wall Street 8 years ago, and they're perennially unhappy people, even with the comparative salary advantage. I don't think most of them even realize because they've always just viewed a career as something you do to enable a fuller life doing snowboarding and having kids and vacations in the Maldives, stuff I never dreamed of and still don't have an interest in.

If you're a bit more feral, and you have an inherent interest and would be doing it on the side no matter what job you have like me, this stuff is a godsend. I don't need to sit around trying to figure out Typescript edge functions in Deno, from scratch via Google, StackOverflow, and a couple books from Amazon, taking a couple weeks to get that first feature built. Much less debug and maintain it. That feedback loop is now like 10-20 minutes.


>Google is full of Ivy Leaguers like that, who would have just gone to Wall Street 8 years ago

I am one of those Ivy Leaguers, except a) I did go to Wall Street, and b) I liked my job.

More to the point, computers have been a hobby all my life. I well remember the epiphany I felt while learning Logo in elementary school, at the moment I understood what recursion is. I don't think the fact that the language I have mostly written code in in recent years is Emacs Lisp is unrelated to the above moment.

Yet I have never desired to work as a professional software developer. My verbal and math scores on the SAT are almost identical. I majored in history and Spanish in college while working for the university's Unix systems group. Before graduation I interviewed and got offers (including one explicitly as a developer) at various tech startups. Of my offers I chose an investment banking job where I worked with tech companies; my manager was looking for a CS major but I was able to convince her that I had the equivalent thereof. Thank goodness for that; I got to participate in the dotcom bubble without being directly swept up in its popping, and saw the Valley immediately post-bubble collapse. <https://news.ycombinator.com/item?id=34732772>

Meanwhile, I continue to putter around with Elisp (marveling at Lisp's elegance) and bash (wincing at its idiosyncrasies) at home, and also experiment with running local LLMs on my MacBook. My current project is fixing bugs and adding features to VM, the written-in-Elisp email client I have used for three decades. So I say, bring on AI! Hopefully it will mean fewer people going into tech just to make lots of money and more who, like me and Wall Street, really want to do it for its own sake.


That's a more well-balanced opinion compared to others I've seen here. I also believe that the golden age of 250k+ salaries for solving easy problems will be gone in 5-10 years. Most people look at these AI improvements in their current state and forget that you are supposed to have a profession for 40 years until retirement. 250k+ jobs will still exist 10 years from now, but expectations will be much higher and competition much bigger.

On the other hand, now is the best time to build your own product, as long as you are not interested only in software as craftsmanship but in product development in general. Probably in the future the expectation will be that you are not only a coder or craftsman but also a project lead/manager (for AI teams), product developer/designer, and maybe even UX designer if you are working for some software house, consulting, or freelancing.


What did your startup do?


Point of sale, on iPad, in ~2011. Massively differentiated from Square / VC competitor land via doing a bunch of restaurant specific stuff early.

Trick with the $1M number is a site license was $999 and receipt printers were sold ~at cost, for $300. 1_000_000 / ((2 x 300) + 1000) ~= 500 customers.

Now I'm doing an "AI client", well-designed app, choose your provider, make and share workflows with LLMs/search/etc.


Lol. I like this answer. You can either think of it in terms of "it'll eat my lunch" or "I now have 10x more capabilities and can be 100x more productive". The former category will be self-fulfilling.


Actually cutting code is maybe 10% of the job, and LLMs are absolute crap at the other 90%.

They can't build and maintain relationships with stakeholders. They can't tell you why what you ask them to do is unlikely to work out well in practice and suggest alternative designs. They can't identify, document and justify acceptance criteria. They can't domain model. They can't architect. They can't do large-scale refactoring. They can't do system-level optimization. They can't work with that weird-ass code generation tool that some hotshot baked deeply into the system 15 years ago. They can't figure out why that fence is sitting out in the middle of the field for no obvious reason. etc.

If that kind of stuff sounds like satisfying work to you, you should be fine. If it sounds terrible, you should pivot away now regardless of any concerns about LLMs, because, again, this is like 90% of the real work.


Don't do it, help us keep our high salaries :D

Joking aside, even with AI generating code, someone has to know how to talk to it, how to understand the output, and know what to do with it.

AI is also not great for novel concepts and may not fully get what's happening when a bug occurs.

Remember, it's just a tool at the end of the day.


> may not fully get what's happening when a bug occurs.

And may still not understand even when you explicitly tell it. It wrote some code for me last week and made an error with an index off by 1. It had set the index to 1, then later was assuming a 0 index. I specifically told it this and it was unable to fix it. It was in debug hell, adding print statements everywhere. I eventually fixed it myself after it was clear it was going to get hung up on this forever.

It got me 99% of the way there, but that 1% meant it didn’t work at all.
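
To make the failure mode concrete, here is a contrived sketch (not the commenter's actual code; the names are made up) of the kind of 1-based vs 0-based mismatch being described:

    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<int> readings = {10, 20, 30};

        int idx = 1;  // initialized as "the first reading", i.e. 1-based
        // ...but later code treats idx as 0-based, so readings[0] is silently skipped
        for (; idx < static_cast<int>(readings.size()); ++idx)
            std::printf("%d\n", readings[idx]);  // prints 20 and 30, never 10
        return 0;
    }

The fix is a one-character change, but as the comment above shows, spotting it is exactly the kind of thing the model can stay stuck on.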


Ironically, just yesterday I asked Sonnet to write a script in JavaScript, and it went into a bit of a perpetual loop, unable to provide an error-free script (the reason for the errors was not immediately obvious). I then mentioned that it needs to be zero-indexed, and it immediately provided an issue-free version that worked.


Well now you're going to be paid a high salary for knowing when to use a 1 index vs a 0 index. :)


just change this to "I have AI Skills!!" :)

https://www.youtube.com/watch?v=hNuu9CpdjIo


Not having clicked the link yet, I'm going to speculate that this is the famous Office Space "I have people skills, damnit!" scene.

...

And it was. :-) Nice callback!


Coding is going to be mediated by these LLMs everywhere — you’re right about that. However, as of today, and for some time, practitioners will be critical partners / overseers; what this looks like today in my workflow is debugging, product specification, coding the ‘hard bits’, reworking / specifying architectures. Whatever of these fall of the plate in the coming years, you’ll never lose your creative agency or determination of what you want to build, no matter how advanced the computers. Maybe give Iain Banks a read for a positive future that has happy humans and super-intelligent AI.

We have fine cabinet makers working in our economy who use mostly hand tools and bandsaws; we have CAD/CAM specialists who tell CNC machines what to build at scale. We'll have the equivalent in tech for a long time.

That said, if you don’t love the building itself, maybe it’s not a good fit for you. If you do love making (digital) things, you’re looking at a super bright future.


1. The demand for software is insatiable. The biggest gate has been the high costs due to limited supply of the time of the people who know how to do it. In the near term, AI will make the cost of software (not of software devs, but the software itself) decrease while demand for new software will increase, especially as software needs to be created to take advantage of new UI tools.

I've been in software engineering for over 20 years. I've seen massive growth in the productivity of software engineers, and that's resulted in greater demand for them. In the near term, AI should continue this trend.

2. It's possible that at some point, AI will advance to where we can remove software engineers from the loop. We're not even close to that point yet. In the meantime, software engineering is an excellent way to learn about other business problems so that you'll be well-situated to address them (whatever they'll be at that time).


Software engineering contains a lot more than just writing code.

If we somehow get AGI, it'll change everything, not just SWE.

If not, my belief is that there will be a lot more demand for good SWEs to harness the power of LLMs, not less. Use them to get better at it faster.


Agreed, SWE as a profession is not going anywhere unless we get AGI, and that would mean all the rules change anyway.

Actually, now is a really good time to get into SWE. The craft contains lots of pointless cruft that LLMs cut through like a knife through hot butter.

I'm actually enjoying my job now more than ever, since I don't need to pretend to like the abysmal tools the industry forces on us (like git), and can focus mostly on value-adding tasks. The amount of tiresome shoveling has decreased considerably.


I'd agree with this take. Everyone is so pessimistic about LLMs, but I've really enjoyed this new era.

A lot of the tasks that used to take considerable time are so much faster and less tedious now. It still puts a smile on my face to tell an LLM to write me scripts that do X Y and Z. Or hand it code and ask for unit tests.

And I feel like I'm more likely to reach for work that I might otherwise shrink from / outside my usual comfort zone, because asking questions of an LLM is just so much better than doing trivial beginner tutorials or diving through 15 vaguely related stack overflow questions (I wonder if SO has seen any significant dip in traffic over the last year).

Most people I've seen disappointed with these tools are doing way more advanced work than I appear to be doing in my day to day work. They fail me too here and there, but more often than not I'm able to get at least something helpful or useful out of them.


Exactly this. The menial tasks become less of a burden and you can just power through them with LLM generated scripts.

If someone expects the LLM to be the senior contributor in novel algorithm development, they will be disappointed for sure. But there is so, so much stuff to hand off to these idiot-savant junior trainees with infinite patience.


I don't think anyone is worried about SWE work going away, I think the concern is if SWE's will still be able to command cushy salaries and working conditions.


I think the industry will bifurcate along the axis of "doing actually novel stuff" vs slinging DB records and displaying web pages. The latter is what I'd expect to get disrupted, if anything, but the former isn't going away unless real AGI is created. The people on the left of that split are going to be worth a lot more because the pipeline to get there will be even harder than it was before.


> "doing actually novel stuff" vs slinging DB records and displaying web pages. The latter is what I'd expect to get disrupted,

Unfortunately the latter is the vast majority of software jobs.


Yeah, but honestly I'm ok with the industry shrinking along that axis.


Slinging DB records and displaying web pages is already disrupted: WordPress, Shopify, and SAP let people without a tech background click around and get stuff done.

If someone is building a web shop from scratch because he wants to sell some products, he is doing something wrong. If someone builds a web shop to compete with Shopify, he is also most likely doing something wrong.


Salaries will only change if tech loses its leverage on the economy. Think of it this way: if Google can operate Google with only 10% of its current staff, then there will be other Googles popping up. The downward pressure on salaries will start with downward pressure on tech overall. I'm not sure I see this happening anytime soon, because humanity is so good at using every resource available.


> I don't think anyone is worried about SWE work going away, I think the concern is if SWE's will still be able to command cushy salaries and working conditions.

It's very important to human progress that all jobs have poor working conditions and shit pay. High salaries and good conditions are evidence of inefficiency. Precarity should be the norm, and I'm glad AI is going to give it to us.


Software engineering pay is an outlier for STEM fields. It would not be surprising at all if SWE work fell into the ~$80-120k camp even with 10+ years experience.

They won't go broke, but landing a $175k work from home job with platinum tier benefits will be near impossible. $110K with a hybrid schedule and mediocre benefits will be very common even for seniors.


That's actually totally reasonable, but then what's the end result for housing markets in areas currently saturated by these kinds of gigs?

Would there be reasonably priced houses in Seattle/SF? Can't see that happening


Sarcasm or cynicism?


Capitalism.

Btw communism is capitalism without systemic awareness of inefficiencies.


Capitalism doesn't dictate poor working conditions at all. Lack of regulation certainly does though.


> Capitalism doesn't dictate poor working conditions at all. Lack of regulation certainly does though.

It totally does. Regulation is basically opposed to capitalism working as designed.


Capitalism needs regulation to avoid self destruction. Without circuit breakers it devolves into monopolistic totalitarianism.

C.f. East India company. Then imagine them with modern military and communications tech.


> Capitalism needs regulation to avoid self destruction.

This is equally true for any alternative system to capitalism.


That's not true at all. That's just some propaganda college kids and the like keep repeating. Most other western countries are capitalist, have much stronger regulation than the US and are all the better for it.


This thing is doing planning and ascending the task management ladder. It's not just spitting out code anymore.


AI Automated planning and action are an old (45+ year) field in AI with a rich history and a lot of successes. Another breakthrough in this area isn't going to eliminate engineering as a profession. The problem space is much bigger than what AI can tackle alone, it helps with emancipation for the humans that know how to include it in their workflows.


Yes, and they will get better. Billions are being poured into them to improve.

Yet I'm comparing these to the problems I solve every day and I don't see any plausible way they can replace me. But I'm using them for tasks that would have required me to hire a junior.

Make that what you will.


Yes, if "efficiency" is your top concern, but I'd much prefer working with an actual person than just a computer. I mean, God forbid I'm only useful for what I can produce, and disposable when I reach my expiration date. I would like to see a twilight zone rendition of an AI dystopia where all the slow, ignorant and bothersome humans are replaced by lifeless AI


It's not just about efficiency. I don't have the means to hire a junior right now, but $20 is a no-brainer.


Time to re-read The Culture. Not everything has to end in a dystopia.


Management will be easier to replace than SWEs. I'm thinking there will come a time, similar to the show Mrs Davis, where AI will direct human efforts within organizations. AI will understand its limits and create tasks/requirements for human specialists to handle.


My first thought with this is that AI would be directed to figure out exactly how little people are willing to work for, and how long, before they break.

I hope I’m wrong, and it instead shows that more pay and fewer hours lead to a better economy, because people have money and time to spend it… and output isn’t impacted enough to matter.


Sure. But the added value of SWE is not ”spitting code”. Let’s see if I need to calibrate my optimism once I take the new model to a spin.


What's the alternative? If AI is going to replace software engineers, there is no fundamental reason they couldn't replace almost all other knowledge workers as well. No matter the field, most of it is just office work managing, transforming and building new information, applying existing knowledge on new problems (that probably are not very unique in grand scheme of things).

Except for medical doctors, nurses, and some niche engineering professions, I really struggle to think of jobs requiring higher education that couldn't be largely automated by an LLM that is smart enough to replace a senior software engineer. These few jobs are protected mainly by the physical aspect, and low tolerance for mistakes. Some skilled trades may also be protected, at least if robotics don't improve dramatically.

Personally, I would become a doctor if I could. But of all things I could've studied excluding that, computer science has probably been one of the better options. At least it teaches problem solving and not just memorization of facts. Knowing how to code may not be that useful in the future, but the process of problem solving is going nowhere.


Why can't medical doctors be automated?


Mainly the various physical operations many of them perform on a daily basis (due to the limitations of robotics), plus liability issues in case things go wrong and somebody dies. And finally, huge demand due to an aging population worldwide.

I do believe some parts of their jobs will be automated, but not enough (especially with growing demand) to really hurt career prospects. Even for those parts, it will take a long while due to the regulated nature of the sector.


When everything will be automated, what will we do with our lives?

I love landscaping my garden lately, would I just get a robot to do that and watch ?

Going to be a weird time.


If you have better career ideas, you should not continue. The thing is, it is very hard to predict how the world will change (and by how much, from very little to a revolutionary change) with all these new developments. Only licensed and regulated professions (doctors/lawyers/pilots etc.) might remain high earning for long (and even they are not guaranteed). It really is worth taking a relook at what you want to do in life in light of all these new advances.


I don't have any ideas whatsoever.


Do you enjoy making computers solve problems? If yes, continue. If you hate it and are in just for the money… I’d say flip a coin.


Then talk to more and more people, some of whom will have ideas on what they would prefer in the changing world.


This is pretty extreme advice to offer in response to news that a model that can better understand programming problems is coming out.

In fact, it's more encouragement to continue. A lot of issues we face as programmers are a result of poor, inaccurate, or non-existent documentation, and despite their many faults and hallucinations LLMs are providing something that Google and Stack Overflow have stopped being good at.

The idea that AI will replace your job, so it's not worth establishing a career in the field, is total FUD.


The advice is unrelated to the model and related to the last year's worth of development. In any case I am advising a relook which is perfectly warranted for anyone pre-university or in university.


This is a really odd take to have.

By the "past year's worth of development" I assume you mean the layoffs? Have you been in the industry (or any industry) long? If so, you would have seen many layoffs and bulk-hiring frenzies over the years... it doesn't mean anything about the industry as a whole and it's certainly a foolish thing to change career asperations over.

Specifically regarding the LLM - anyone actually believing these models will replace developers and software engineers, truly, deeply does not understand software development at even the most basic fundamental levels. Ignore these people - they are the snake oil salesmen of our modern times.


I assume the poster meant how much progress the models have made. Roughly late high school capability to late college-ish. Project forward five years.


Predicting exponential functions is a fool’s errand. The tiniest error in your initial observation compounds real fast and we can’t even tell if we’re still in the exponential phase of the sigmoid.


If at some point a competent senior software engineer can be automated away, I think we are so close to a possible 'AI singularity' in as much as that concept makes sense, that nothing really matters anyway.

I don't know what will be automated first of the competent senior software engineer and say, a carpenter, but once the programmer has been automated away, the carpenter (and everything else) will follow shortly.

The reasoning is that there is such a functional overlap between being a standard software engineer and an AI engineer or researcher, that once you can automate one, you can automate the other. Once you have automated the AI engineers and researchers, you have recursive self-improving AI and all bets are off.

Essentially, software engineering is perhaps the only field where you shouldn't worry about automation, because once that has been automated, everything changes anyways.


Carpenters and other manual jobs might outlast software engineers. It seems that AI is advancing a lot faster than robotics.


If you are not a software engineer, you can't judge the correctness of any LLM answer on that topic, nor do you know the right questions to ask.

From all my friends that are using LLMs, we software engineers are the ones that are taking the most advantage of it.

I am in no way fearful I am becoming irrelevant, on the opposite, I am actually very excited about these developments.


There is little to no research that shows modern AI can perform even the most simple long-running task without training data on that exact problem.

To my knowledge, there is no current AI system that can replace a white collar worker in any multistep task. The only thing they can do is support the worker.

Most jobs are safe for the foreseeable future. If your job is highly repetitive and a company can produce a perfect dataset of it, I'd worry.

Jobs like a factory worker and call center support are in danger. But the work is perfectly monitorable.

Watch the GAIA benchmark. It's not nearly the complexity of a real-world job, but it would signal the start of an actual agentic system being possible.


I’d argue the foreseeable future got a lot shorter in the last couple years.


If you want to get a career in software engineering because you want to write code all day, probably a bad time to be joining the field.

If you are interested in using technology to create systems that add value for your users, there has never been a better time.

GPT-N will let you scale your impact way beyond what you could do on your own.

Your school probably isn’t going to keep abreast with this tech so it’s going to be more important to find side-projects to exercise your skills. Build a small project, get some users, automate as much as you can, and have fun along the way.


There's so much software yet to be written, so much to automate, so many niches to attack that you need not worry. It takes humans to know where to apply the technology, based on their heart, not just their brains. Use AI in the direction only you can ascertain, and do it for the good of HUMANITY. It's a tool that makes the knowledge our predecessors have left us accessible, like mathematics. Go forth and conquer life's ills, young man; it takes a human to know one. Don't worry, you're created in God's image.


Machines don't really "know" anything they just manipulate what is already known; Like a interactive book. It's just that this AI book is vast.


And the knowledge acquisition impedance is reduced


Computer Science becomes MORE interesting as computers become more capable, not less. There are so many things we could be working on, but we still waste so much time on boring libraries, configuration, implementation details that we simply don't get to experiment enough.

Just like nobody programs on punch cards anymore, learning details of a specific technology without deeper understanding will become obsolete. But general knowledge about computer science will become more valuable.


My two cents thinking about different scenarios:

- AI comes fast, there is nothing you can do: Honestly, AI can already handle a lot of tasks faster, cheaper, and sometimes better. It’s not something you can avoid or outpace. So if you want to stick with software engineering, do it because you genuinely enjoy it, not because you think it’s safe. Otherwise, it might be worth considering fields where AI struggles or is just not compatible. (people will still want some sort of human element in certain areas).

- There is some sort of ceiling, which gives you more time to act: There's a chance AI hits some kind of wall, whether due to technical problems, ethical concerns, or society pushing back. If that happens, we're all back on more even ground and you can take advantage of AI tools to improve yourself.

My overall advice (and it will probably be called out as cliché/simplistic): just follow what you love. Just the fact that you have the opportunity to study anything at all is something that many people don't have. We don't really have control over a lot of the stuff that happens around us, and that's okay.


For basically all the existing data we have, efficiency improvements always result in more work, not less.

Humans never say "oh neat I can do thing with 10% of the effort now, guess I'll go watch tv for the rest of the week", they say "oh neat I can do thing with 10% of the effort now, I'm going to hire twice as many people and produce like 20x as much as I was before because there's so much less risk to scaling now."

I think there's enough unmet demand for software that efficiency increases from automation are going to be eaten up for a long time to come.


I'm wondering if the opposite might happen, that there will be more need for software engineers.

1. AI will suck up a bunch of engineers to run, maintain, and build on it.

2. AI will open new fields that are not yet dominated by software, e.g. driving, etc.

3. AI tools will lower the bar for creating software, meaning industries where software automation wasn't financially viable will now become viable.


The amount of knowledge the OP needed just to formulate the right question to the AI requires a lifetime of deep immersion in technology. You'd think that maybe you can ask the AI how to phrase the question to the AI, but at some point you run up against your own ability to contextualize the problem - it can't read your mind.

Will the AI become as smart as you or I? Recognize that these things have tiny context windows. You get the context window of "as long as you can remember".

I don't see this kind of AI replacing programmers (though it probably will replace low-skill offshore contract shops). It may have a large magnifying effect on skill. Fortunately there seem to be endless problems to solve with software - it's not like bridges or buildings; you only need (or can afford) so many. Architects should probably be more worried.


Because none of your other majors will hold up much longer. Once software engineering becomes fully automated, so will EE, ME, applied math, economics, physics, etc. If you work with your hands, like a surgeon or chemist, you'll last longer, but the thinky bits of those jobs will disappear. And once AI research is automated, how long will it be until we have dexterous robots?

So basically, switching majors is just running to the back of a sinking ship. Sorry.


If you’re any good at SWE with a sprinkle of math and CS, your advantage will get multiplied by anywhere from 2 to 100x if you use the leverage of co-intelligence correctly. Things that took weeks before now easily take hours, so if you know what to build and especially what not to build (including but not limited to confabulations of models), you’ll do well.


But also, on the other hand, you'll need far fewer people to achieve the same effect. Effectively a whole team could be replaced by one lead guy who, based on the requirements, just tells the LLM what to do and glues the results together.


Yes - my point is: be that guy


First, how many people can be that guy? If that is 5%, it means the other 95% should go.

Second, just because a good engineer can have much higher throughput of work, multiplied by AI tools, we know the AI output is not reliable and needs a second look by humans. Will those 5% be able to stay on top of it? And keep their sanity at the same time?


Do not assume constant demand. There are whole classes of projects which become feasible if they can be made 10x faster/cheaper.

As for maintaining sanity… I'm cautiously optimistic that future models will continue to get better. Very cautiously. But Cursor with Claude slaps, and I'm not going crazy; I actually enjoy the thing figuring out my next actions and just suggesting them.


As others have said, LLMs still require engineers to produce quality output. LLMs do, however, make those engineers that use them much more productive. If this trend continues, I could see a scenario where an individual engineer could build a customized version of, say, Salesforce in a month or two. If that happens, you could make a solid case that companies paying $1mm+ per year for 12 different SaaS tools should just bring that in house. The upshot is you may still be writing software, but instead of building SaaS at Salesforce, you'll be working for their former customers or maybe as some sort of contractor.


One angle: There are a million SMBs and various other institutions, using no software or really shitty software, that could be xx% to xxx% more productive with custom software they would never have been able to afford before. Now they can, en masse, because you will be able to build it a lot faster.

I have been coding a lot with AI recently. Understanding and putting into thought what is needed for the program to fix your problem remains as complex and difficult as ever.

You need to pose a question for the AI to do something for you. Asking a good question is out of reach for a lot of people.


This 1000%


While the reasoning and output of ChatGPT is impressive (and, imho, would pass almost all coding interviews), I'm primarily impressed with the logical flow, explanation and thoroughness. The actual coding and problem solving isn't complex, and that gets to your question: someone (in this case, the OP) still needed to be able to figure out how to extract useful data and construct a stimulating prompt to trigger the LLM into answering in this way. As others have posted, none of the popular LLMs behave identically, either, so becoming an expert tool-user with one doesn't necessarily translate to the next.

I would suggest the fundamentals of computer science and software engineering are still critically important ... but the development of new code, and especially the translation or debugging of existing code is where LLMs will shine.

I currently work for an SAP-to-cloud consulting firm. One of the single most compelling use cases for LLMs in this area is to analyze custom code (running in a client's SAP environment) and refactor it to be compatible with current versions of SAP as a cloud SaaS. This is a specialized domain, but the concept applies broadly: pick some crufty codebase from somewhere, run it through an LLM, and do a lot of mostly copying and pasting of simpler, modern code into your new codebase. LLMs take a lot of the drudgery out of this, but it still requires people who know what they're looking at and could do it manually. Think of the LLM as giving you an efficiency superpower, not replacing you.


There's an equal amount of hopium from the AI stans here as well.

Hundreds of billions of dollars have been invested in a technology and they need to find a way to start making a profit or they're going to run out of VC money.

You still have to know what to build and how to specify what you want. Plain language isn't great at being precise enough for these things.

Some people say they'll keep using stuff like this as a tool. I wouldn't bet the farm that it's going to replace humans at any point.

Besides, programming is fun.


As soon as software development can be fully performed by AIs, it won't take long before all other jobs that can be performed in front of a computer follow, and after that it probably won't take long for practically the entire rest.

This release has shifted my personal prediction of when this is going to happen further into the future, because OpenAI made a big deal hyping it up and it's nothing - preferred by humans over GPT-4o only a little more than half the time.


Three, though not slam dunks:

1. What other course of study are you confident would be better given an AI future? If there's a service sector job that you feel really called to, I guess you could shadow someone for a few days to see if you'd really like it?

2. Having spent a few years managing business dashboards for users: less than 25% ever routinely used the "user friendly" functionality we built to do semi-custom analysis. We needed 4 full-time analytics engineers to spend at least half their time answering ad hoc questions that could have been self-served, despite an explicit goal of democratizing data. All that is to say: don't overestimate how quickly this will be taken up, even if it could technically do XYZ task (eventually, best-of-10) if prompted properly.

3. I don't know where you live, but I've spent most of my career 'competing' with developers in India who are paid 33-50% as much. They're literally teammates, it's not a hypothetical thing. And they've never stopped hiring in the US. I haven't been in the room for those decisions and don't want to open that can of worms here, but suffice to say it's not so simple as "cheaper per LoC wins"


Software engineering teaches you a set of skills that are applicable in more places than just writing software. There are big parts of the job that cannot be done by LLMs (today) and if LLMs get better (or AGI happens) then enough other professions will be affected that we will all be in the same boat (no matter what you major in).

LLMs are just tools, they help but they do not replace developers (yet).


> LLMs are just tools, they help but they do not replace developers (yet)

Yes but they will certainly have a lot of downward pressure on salaries for sure.


I was debugging an issue the other day where either sentencepiece or gRPC linked into a C++ program worked fine on its own, but both at once caused a segfault before even getting to main, deep in the protobuf initialization stuff in some arena management code, and left a stack so mangled that even pwndbg struggled to show legible frames.

It wasn't even trivial to establish that that combination was the culprit.

I’ve been around the block with absl before, so it wasn’t a total nightmare, but it was like, oof, I’m going to do real work this afternoon.

They don’t pay software engineers for the easy stuff, they pay us because it gets a little tricky sometimes.

I’ll reserve judgement on this new one until I try it, but the previous ones, Sonnet and the like, they were no help with something like that.

When StackOverflow took off, and Google before that, there were wide swaths of rote stuff that just didn't count as coding anymore, and LLMs represent sort of another turn of that crank.

I’ve been wrong before, and maybe o1 represents The Moment It Changed, but as of now I feel like a sucker that I ever bought into the “AI is a game changer” narrative.


Just because we have machines that can lift much more than any human ever could, it doesn't mean that working out is useless.

In the same way, training your mind is not useless. Perhaps as things develop, we will get back to the idea that the purpose of education is not just to get a job, but to help you become a better and more virtuous person.


Most of these posts are from romantics.

Software engineering will be a profession of the past, similar to how industrial jobs hardly exist.

If you have a strong intuition with software & programming you may want to shift towards applying AI into already existing solutions.


The question is, why wouldn't nearly all other white collar jobs be professions of the past as well? Does the average MBA or whatever possess some unique knowledge that you couldn't generate with an LLM fed with company data? What is the alternative career path?

I think software engineers who also understand business may yet have an advantage over pure business people, who don't understand technology. They should be able to tell AI what to do, and evaluate the outcome. Of course "coders" who simply produce code from pre-defined requirements will probably not have a good career.


They will be of the past.

This is typical of automation. First, there are numerous workers, then they are reduced to supervisors, then they are gone.

The future of business will be managing AI, so I agree with what you're saying. However, most software engineers have a very strong low-level understanding of programming, not a business sense of how to apply it.


The “progress” demonstrated in this example is to literally just extract bytes from the middle of a number:

Does this task:

“About 2 minutes later, these values were captured, again spaced 5 seconds apart.

0160093201 0160092d01 0160092801 0160092301 0160091e01”

[Find the part that is changing]

really even need an AI to assist (this should be a near instant task for a human with basic CS numerical skills)? If this is the type of task one thinks an AI would be useful for they are likely in trouble for other reasons.
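
To the commenter's point, the "find the part that is changing" step is a few lines of straightforward code. A minimal sketch (my own, reusing only the sample values quoted above):

    #include <cstdio>
    #include <string>
    #include <vector>

    int main() {
        // The five captures quoted above, spaced 5 seconds apart.
        std::vector<std::string> samples = {
            "0160093201", "0160092d01", "0160092801",
            "0160092301", "0160091e01"};

        // Report every hex-digit position that differs from the first sample.
        const std::string &first = samples.front();
        for (size_t i = 0; i < first.size(); ++i) {
            bool changes = false;
            for (const std::string &s : samples)
                if (s[i] != first[i]) changes = true;
            if (changes) std::printf("position %zu varies\n", i);
        }
        // Output: positions 6 and 7 vary, i.e. the byte 0x32 -> 0x2d -> ... -> 0x1e,
        // which, read as hex, counts down by 5 with each 5-second sample.
        return 0;
    }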

Also notable that you can cherry pick more impressive feats even from older models, so I don’t necessarily think this proves progress.

I still wouldn’t get too carried away just yet.


I put his value into my hex editor and it instantly showed 900 in the data inspector pane


Here you go:

I just watched a tutorial on how to leverage v1, claude, and cursor to create a marketing page. The result was a convoluted collection of 20 or so TS files weighing a few MB instead of a 5k HTML file you could hand bomb in less time.

I wouldn’t feel too threatened yet. It’s still just a tool and like any tool, can be wielded horribly.


> I just watched a tutorial on how to leverage v1, claude, and cursor to create a marketing page. The result was a convoluted collection of 20 or so TS files weighing a few MB instead of a 5k HTML file you could hand bomb in less time.

And if you hired an actual team of developers to do the same thing, it is very likely that you'd have gotten a convoluted collection of 20 or so TS files weighing a few MB instead of a 5k HTML file you could hand bomb in less time.


I am cautiously optimistic. So much of building software is deciding what _should_ be built rather than the mechanics of writing code.

If you like coding because of the things it lets you build, then LLMs are exciting because you can build those things faster.

If on the other hand you enjoy the mental challenge but aren't interested in the outputs, then I think the future is less bright for you.

Personally I enjoy coding for both reasons, but I'm happy to sacrifice the enjoyment and sense of accomplishment of solving hard problems myself if it means I can achieve more 'real world' outcomes.

Another thing I'm excited about is that, as models improve, it's like having an expert tutor on hand at all times. I've always wanted an expert programmer on hand to help when I get stuck, and to critically evaluate my work and help me improve. Increasingly, now I have one.


To fix the robots^W^W^Wbuild these things.

I've been around for multiple decades. Nothing this interesting has happened since at least 1981, when I first got my hands on a TRS-80. I dropped out of college to work on games, but these days I would drop out of college to work on ML.


If AI becomes good enough to replace software engineers, it has already become good enough to replace other brain jobs (lawyers, physicians, accountant, etc). I feel that software engineering is one of the very last jobs to be replaced by AI.


I think CS skills will remain valuable, but you should try to build some domain specific knowledge in addition. Perhaps programmer roles will eventually merge with product owner / business person type of roles.


From NYT article on this model: "The chatbot also answered a Ph.D.-level chemistry question and diagnosed an illness based on a detailed report about a patient’s symptoms and history."

So it is not just software engineering, it is also chemistry and even medicine. Every science and art major should consider whether they should quit school. Ultimately the answer is no, don't quit school because AI makes us productive, and that will make everything cheaper, but will not eliminate the need for humans. Hopefully.


Software lets you take pretty much anyone else’s job and do it better.


Sure. Software engineers are actually the best situated to take advantage of this new technology.

Your concern would be like once C got invented, why should you bother being a software engineer? Because C is so much easier to use than assembly code!

The answer, of course, is that software engineering will simply happen in even more powerful and abstract layers.

But, you still might need to know how those lower layers work, even if you are writing less code in that layer directly.


C did not write itself.

We now have a tool that writes code and solves problems autonomously. It's not comparable.


This is not going to replace you. This isn't AGI.


It still has issues with crossing service boundaries, working in systems, stuff like that. That stuff will get better but the amount of context you need to load to get good results with a decently sized system will still be prohibitive. The software engineer skillset is being devalued but architecture and systems thinking is still going to be valuable for quite some time.


Software development just moves up a tier for most developers. Instead of writing everything yourself, you will be more like an orchestrator: tell the system to write this, tell the system to connect that and this, etc. You still need to understand code. But maybe in the future even that part becomes unreadable for us, and we only understand the high-level concepts.


If the writing & arts vs. doing laundry & cleaning dishes is any indication, it does not look rosy. All the fun and rewarding parts (low hanging fruits / quick wins) of coding might be automated. What remains are probably things like debugging race conditions in legacy systems and system administration etc.


Well, the fact that you typed this question makes me think that you're in the top X% of students. That's your reason.

Those in the bottom (100-X)% may be better off partying it up for a few years, but then again the same can be said for other AI-affected disciplines.

Masseurs/masseuses have nothing to worry about.


I am pretty sure there is a VC funded startup making massage robots


Point taken, but I'm still pretty sure masseurs/masseuses have nothing to worry about.


Unlike the replies here, I will be very honest with my answer. There will be fewer engineers getting hired, as the low-hanging fruit has already been picked and automated away.

It is not too late. These LLMs still need very specialist software engineers that are doing tasks that are cutting edge and undocumented. As others said Software Engineering is not just about coding. At the end of the day, someone needs to architect the next AI model or design a more efficient way to train an AI model.

If I were in your position again, I now have a clear choice of which industries are safe against AI (and benefit software engineers) AND which ones NOT to get into (and are unsafe to software engineers):

Do:

   - SRE (Site Reliability Engineer)

   - Social Networks (Data Engineer)

   - AI (Compiler Engineer, Researcher, Benchmarking)

   - Financial Services (HFT, Analyst, Security)

   - Safety Critical Industries (defense, healthcare, legal, transportation systems)

Don't:

   - Tech Writer / Journalist

   - DevTools

   - Prompt Engineer

   - VFX Artist

The choice is yours.


because it is still the most interesting field of study


Because you're being given superpowers and computers are becoming more useful than ever.


The timeline to offload SWE tasks to AI is likely 5+ years. So there are still some years left before exchanging a “brain on a stick” for property and material goods becomes more competitive and demanding because of direct AI competition.


what else are you gonna do? Become a copywriter?


Even if LLMs take over the bulk of programming work, somebody still needs to write the prompts, and make sure the output actually matches what you wanted to achieve. That's just programming with different tools.


just because something can generate an output for you, does not make a need for discernment and application obsolete.

like another commenter, i do not have a lot of faith in people who do not have, at minimum, fundamental fluency in programming (even with a dash of general software architecture and practices).

there is no "push button, generate, and glue components together in a way that can survive at scale and be maintainable" without knowing what the output means, and implies with respect to integration(s).

however, those with the fluency, domain, and experience, will thrive, and continue thriving.


I think this question applies to any type of labor requiring the human mind so if you don't have an answer for any of those then you won't have one for software engineering either.


I don't think programming is any less safe than any other office job tbh. Focus on problem solving and using these tools to your advantage and choose a field you enjoy.


What kind of student, at what kind of school?

Are your peers getting internships at FANGs or hedge funds? Stick with it. You can probably bank enough money to make it worth it before shtf.


Play it out

Let's assume today an LLM is perfectly equivalent to a junior software engineer. You connect it to your code base, load in PRDs / designs, ask it to build it, and voilà, perfect code files

1) Companies are going to integrate this new technology in stages / waves. It will take time for this to really get broad adoption. Maybe you are at the forefront of working with these models

2) OK the company adopts it and fires their junior engineers. They start deploying code. And it breaks Saturday evening. Who is going to fix it? Customers are pissed. So there's lots to work out around support.

3) That problem is solved, we can perfectly trust an LLM to ship perfect code that never causes downstream issues and perfectly predicts all user edge cases.

Never underestimate the power of corporate greediness. There are generally two phases of corporate growth - expansion and extraction. Expansion is when they throw costs out the window to grow. Extraction is when growth stops, and they squeeze customers & themselves.

AI is going to cause at least a decade of expansion. It opens up so many use cases that were simply not possible before, and lots of replacement.

Companies are probably not looking at their engineers looking to cut costs. They're more likely looking at them and saying "FINALLY, we can do MORE!"

You won't be a coder - you'll be a LLM manager / wrangler. You will be the neck the company can choke if code breaks.

Remember if a company can earn 10x money off your salary, it's a good deal to keep paying you.

Maybe some day down the line, they'll look to squeeze engineers and lay some off, but that is so far off.

This is not hopium, this is human nature. There's gold in them hills.

But you sure as shit better be well versed in AI and using it in your workflows - the engineers who deny it will be the ones who fall behind


I don't want to lean into negativity here, and I'm far from an "AI Doomer".

But... I will say I think the question you ask is a very fair question, and that there is, indeed, a LOT of uncertainty about what the future holds in this regard.

So far the best reason we have for optimism is history: so far the old adage has held up that "technology does destroy some jobs, but on balance it creates more new ones than it destroys." And while that's small solace to the buggy-whip maker or steam-engine engineer, things tend to work out in the long run. However... history is suggestive, but far from conclusive. There is the well known "problem of induction"[1] which points out that we can't make definite predictions about the future based on past experience. And when those expectations are violated, we get "black swan events"[2]. And while they may be uncommon, they do happen.

The other issue with this question is, we don't really know what the "rate of change" in terms of AI improvement is. And we definitely don't know the 2nd derivative (acceleration). So a short-term guess that "there will be a job for you in 1 year's time" is probably a fairly safe guess. But as a current student, you're presumably worried about 5 years, 10 years, 20 years down the line and whether or not you'll still have a career. And the simple truth is, we can't be sure.

So what to do? My gut feeling is "continue to learn software engineering, but make sure to look for ways to broaden your skill base, and position yourself to possibly move in other directions in the future". Eg, don't focus on just becoming a skilled coder in a particular language. Learn fundamentals that apply broadly, and - more importantly - learn about how businesses work, learn "people skills"[3], develop domain knowledge in one or more domains, and generally learn as much as you can about "how the world works". Then from there, just "keep your head on a swivel" and stay aware of what's going on around you and be ready to make adjustments as needed.

It also might not hurt to learn a thing or two about something that requires a physical presence (welding, etc.). And just in case a full-fledged cyberpunk dystopia develops... maybe start buying an extra box or two of ammunition every now and then, and study escape and evasion techniques, yadda yadda...

[1]: http://en.wikipedia.org/wiki/Problem_of_induction

[2]: https://en.wikipedia.org/wiki/Black_swan_theory

[3]: https://www.youtube.com/watch?v=hNuu9CpdjIo


If (when?) the future you're afraid of comes to pass, then basically all white collar work is cooked anyway.


I honestly think that unless you’re really passionate or really good, you shouldn’t be a coder. If you, like the vast majority of coders today, picked it up in college or later, and mostly because of the promise of a fat paycheck, I can’t really see a scenario where you would have a 30 year career


If you're the type of person who isn't scared away easily by rapidly changing technology.


If you’re going for FAANG most of your day isn’t coding anyway.


do whatever excites you. the only constant is change.


> do whatever excites you. the only constant is change.

That alone may not be enough. My son is excited about playing video games. :)


Making video games is a huge industry full of tons of talented programmers, artists, and all kinds of other things you can do.


I agree there's too much cope going around. All the people saying AI is just a tool to augment our jobs are correct; humans are still needed, but perhaps far fewer of them will be needed. If job openings shrink by 50% or disproportionately impact juniors, it will hurt.

One decent reason to continue is that pretty much all white collar professions will be impacted by this. I think it's a big enough number that the powers that be will have to roll it out slowly, figure out UBI or something, because if all of us are thrown into unemployment in a short time there will be riots. On the scale of all the jobs that AI can replace, there are many jobs that are easier to replace than software, so it's comparatively still a better option than most. But overall I'm getting progressively more worried as well.


Juniors aren’t getting hired and haven’t been for about six months, maybe longer. AI isn’t 100% at fault… yet.


plumbing still looks like a safe choice for now.


If you're just there to churn out code, then yeah, perhaps find something else.

But if you're there to improve your creativity and critical thinking skills, then I don't think those will be in short supply anytime soon.

The most valuable thing I do at my job is seldom actually writing code. It's listening to customer needs, understanding the domain, understanding our code-base and its limitations and possibilities, and then finding solutions that optimize certain aspects, be it robustness, time to delivery or something else.


Hey, kid.

My name is Rachel. I'm the founder of a company whose existence is contingent on the continued existence, employment, and indeed competitive employment of software engineers, so I have as much skin in this game as you do.

I worry about this a lot. I don't know what the chances are that AI wipes out developer jobs [EDIT: to clarify, in the sense that they become either much rarer or much lower-paid, which is sufficient] within a timescale relevant to my work (say, 3-5 years), but they aren't zero. Gun to my head, I peg that chance at perhaps 20%. That makes me more bearish on AI than the typical person in the tech world - Manifold thinks AI surpasses human researchers by the end of 2028 at 48% [1], for example - but 20% is most certainly not zero.

That thought stresses me out. It's not just an existential threat to my business over which I have no control, it's a threat against which I cannot realistically hedge and which may disrupt or even destroy my life. It bothers me.

But I do my work anyway, for a couple of reasons.

One, progress on AI in posts like this is always going to be inflated. This is a marketing post. It's a post OpenAI wrote, and posted, to generate additional hype, business, and investment. There is some justified skepticism further down this thread, but even if you couldn't find a reason to be skeptical, you ought to be skeptical by default of such posts. I am an abnormally honest person by Silicon Valley founder standards, and even I cherry pick my marketing blogs (I just don't outright make stuff up for them).

Two, if AI surpasses a good software engineer, it probably surpasses just about everything else. This isn't a guarantee, but good software engineering is already one of the more challenging professions for humans, and there's no particular reason to think progress would stop exactly at making SWEs obsolete. So there's no good alternative here. There's no other knowledge work you could pivot to that would be a decent defense against what you're worried about. So you may as well play the hand you've got, even in the knowledge that it might lose.

Three, in the world where AI does surpass a good software engineer, there's a decent chance it surpasses a good ML engineer in the near future. And once it does that, we're in completely uncharted territory. Even if more extreme singularity-like scenarios don't come to pass, it doesn't need to be a singularity to become significantly superhuman to the point that almost nothing about the world in which we live continues to make any sense. So again, you lack any good alternatives.

And four: *if this is the last era in which human beings matter, I want to take advantage of it!* I may be among the very last entrepreneurs or businesswomen in the history of the human race! If I don't do this now, I'll never get the chance! If you want to be a software engineer, do it now, because you might never get the chance again.

It's totally reasonable to be scared, or stressed, or uncertain. Fear and stress and uncertainty are parts of life in far less scary times than these. But all you can do is play the hand you're dealt, and try not to be totally miserable while you're playing it.

-----

[1] https://manifold.markets/Royf214/will-ai-surpass-humans-in-c...


Transformers have exactly two strengths. Neither of them is "attention". Attention could be replaced with any arbitrary division of the network and it would learn just as well.

The first true strength is obvious: they are parallelisable. This is a side effect of people fixating on attention. If they had come up with any other structure that results in the same level of parallelisability, it would be just as good.

The second strength is more elusive to many people. It's the context window. Because the network is not run just once but once for every word, it doesn't have to solve a problem in one step. It can iterate while writing down intermediate variables and accessing them. The dumb thing so far was that it was required to produce the answer starting with the first token it was allowed to write down. So to actually write down the information it needs on the next iteration, it had to disguise it as part of the answer. So naturally the next step is to allow it to just write down whatever it pleases and iterate freely until it's ready to start giving us the answer.

It's still seriously suboptimal that what it is allowed to write down has to be translated to tokens and back but I see how this might make things easier for humans for training and explainability. But you can rest assured that at some point this "chain of thought" will become just chain of full output states of the network, not necessarily corresponding to any tokens.

So congrats to researchers that they found out that their billion dollar Turing machine benefits from having a tape it can use for more than just printing out the output.

PS

There's another advantage of transformers, but I can't tell how important it is. It's the "shortcuts" from earlier layers to way deeper ones, bypassing the ones along the way. Obviously the network would be more capable if every neuron were connected to every neuron in every preceding layer, but we don't have hardware for that, so some sprinkled "shortcuts" might be a reasonable compromise that makes the network less crippled than a plain MLP.

Given all that, I'm not surprised at all by the direction OpenAI took and the gains it achieved.


So, it’s good at hard-logic reasoning (which is great, and no small feat.)

Does this reasoning capability generalize outside of the knowledge domains the model was trained to reason about, into “softer” domains?

For example, is O1 better at comedy (because it can reason better about what’s funny)?

Is it better at poetry, because it can reason about rhyme and meter?

Is it better at storytelling as an extension of an existing input story, because it now will first analyze the story-so-far and deduce aspects of the characters, setting, and themes that the author seems to be going for (and will ask for more information about those things if it’s not sure)?


In practice, this implementation (through the Chat UI) is scary bad.

It actively lies about what it is doing.

This is what I am seeing. Proactive, open, deceit.

I can't even begin to think of all the ways this could go wrong, but it gives me a really bad feeling.


> It actively lies about what it is doing.

How do you mean?


It shows progress and it displays steps that it is not doing, would never do, would never need to do, just to show a nice scrolling display of what it’s (not) doing.


Finally, a Claude competitor!


If you’re using the API and are on tier 4, don’t bother adding more credits to move up to tier 5. I did this, and while my rate limits increased, the o1-preview / o1-mini model still wasn’t available.


Keep us posted!


I can confirm that the following models have since come in:

  • o1-preview-2024-09-12
  • o1-preview
  • o1-mini-2024-09-12
  • o1-mini


Wouldn't this introduce new economics into the LLM market?

I.e. if the "thinking loop" budget is parameterized, users might pay more (much more) to spend more compute on a particular question/prompt.


Depends on how OpenAI prices it.

Given the need for chain-of-thought, and that it would be billed as output, the new model will be neither cheap nor fast.

EDIT: Pricing is out and it is definitely not tenable unless you really, really have a use case for it.


Yes, and note the large price increase


Laughing at the comparison to "4o" as if that model even holds a candle to GPT-4. 4o is _cheaper_—it's nowhere near as powerful as GPT-4, as much as OpenAI would like it to be.


Note that they aren't safety aligning the chain of thought, instead we have "rules for thee and not for me" -- the public models are going to continue have tighter and tighter rules on appropriate prompting, while internal access will have unfettered access. All research (and this paper mentions it as well) indicates human pref training itself lowers quality of results; maybe the most important thing we could be doing is ensuring truly open access to open models over time.

Also, can't wait to try this out.


What is interesting to me is that there is no difference in the AP English lit/lang exams. Why did chain-of-thought produce negligible improvements in this area?


I would guess because there is not much problem-solving required in that domain. There’s less of a “right answer” to reason towards.


I think there may also be a lack of specification there. When you get more demanding and require more, the creative writing seems to be better. Like it does much better at things like sestinas. For all of those questions, there's probably a lot of unspecified criteria you could say makes an answer better or worse, but you don't, so the first solution appears adequate.


Amazing! OpenAI figured out how to scale inference. https://arxiv.org/abs/2407.21787 shows how using more compute during inference can outperform much larger models on tasks like math problems.

I wonder how they decide when to stop the chain of thought for each query. As anyone that has played with agents can attest, LLMs can talk with themselves forever.


It's interesting that OpenAI has literally applied and automated one piece of advice from their "Prompt engineering" guide: give the model time to "think".

https://platform.openai.com/docs/guides/prompt-engineering/g...


I will pay if O1 can become my college level math tutor.


Looking at the full chain of thought, it involves a lot of backtracking and even hallucination.

It will be like a math teacher that is perpetually drunk and on speed


That's Paul Erdős


This model is currently available for those accounts in Tier 5 and above, which requires "$1,000 paid [to date] and 30+ days since first successful payment"

More info here: https://platform.openai.com/docs/guides/rate-limits/usage-ti...


I didn't know this founder's edition battle pass existed.


Are we ready yet to admit the Turing test has been passed?


Extremely basic agency would be required to pass the Turing test as intended.

Like, the ability to ask a new unrelated question without being prompted. Of course you can fake this, but then you're not testing the LLM as an AI, you're testing a dumb system you rigged up to create the appearance of an AI.


> Turing proposed that a human evaluator would judge natural language conversations between a human and a machine designed to generate human-like responses. The evaluator would be aware that one of the two partners in conversation was a machine, and all participants would be separated from one another. The conversation would be limited to a text-only channel, such as a computer keyboard and screen, so the result would not depend on the machine's ability to render words as speech.

I don't see agency mentioned or implied anywhere: https://en.wikipedia.org/wiki/Turing_test

What definition or setup are you taking it from?


LLMs have already beaten the Turing test. It's useless to use it when OpenAI and others are aiming for 'AGI'.

So you need a new Turing test adapted for AGI or a totally different one to test for AGI rather than the standard obsolete Turing test.


> LLMs have already beaten the Turing test.

I am wondering where this happened? In some limited scope? Because if you plug LLM into some call center role for example, it will fall apart pretty quickly.


The Turing Test (which involves fooling a human into thinking they are talking to another human rather than a computer) has been routinely passed by very rudimentary "AI" since as early as 1991. It has no relevance today.


This is only true for some situations. In some test conditions it has not been passed. I can't remember the exact name, but there used to be a competition where PhD level participants blindly chat for several minutes with each other and are incentivized to discover who is a bot and who is a human. I can't remember if they still run it, but that bar has never been passed from what I recall.


> However, o1-preview is not preferred on some natural language tasks, suggesting that it is not well-suited for all use cases.

Fascinating... Personal writing was not preferred vs GPT-4, but for math calculations it was... Maybe we're at the point where it's getting too smart? There is a depressing related thought here about how we're too stupid to vote for actually smart politicians ;)


> for actually smart politicians

We can vote for an AI


> “Therefore, after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users.”

Trust us, we have your best interests in mind. I'm still impressed by how astonishingly impossible OpenAI is to like and root for, for a company with such an innovative product.


Any word on whether this has enhanced Japanese support? They announced Japanese-specific models a while back that were never released.


I wonder if this architecture is just applying a chain-of-thought prompt, or whether they built a diffusion model.

The old problem with image generation was that single pass techniques like GANs and VAEs had to do everything in one go. Diffusion models wound up being better by doing things iteratively.

Perhaps this is a diffusion model for text (top ICML paper this year was related to this).


The progress in AI is incredibly depressing, at this point I don't think there's much to look forward to in life.

It's sad that due to unearned hubris and a complete lack of second-order thinking we are automating ourselves out of existence.

EDIT: I understand you guys might not agree with my comments. But don't you think that flagging them is going a bit too far?


It seems the opposite to me. Imagine all the amazing technological advancements, etc. If there wasn't something like that, what would you be looking forward to? Everything would be what it has already been for years. If this evolves, it helps us unlock so many secrets of the universe.


>If there wasn't something like that what would you be looking forward to?

First of all, I don't want to be poor. I know many of you are thinking something along the lines of "I am smart, I was doing fine before, so I will definitely continue to in the future".

That's the unearned hubris I was referring to. We got very lucky as programmers, and now the gravy train seems to be coming to an end. And not just for programmers, the other white-collar and creative jobs will suffer too. The artists have already started experiencing the negative effects of AI.

EDIT: I understand you guys might not agree with my comments. But don't you think that flagging them is going a bit too far?


I'm not sure what you are saying exactly? Are you saying we live for the work?


The way the current system is set up we rely on work to make money. If jobs get automated away, how will we make money then? We aren't ready for a post-work world.


Then you should have UBI.


These advancements are there to benefit the top 1%, not the working class.


That's a governance problem.


Not at all... they're still so incapable of so much. And even when they do advance, they can be tremendous tools of synthesis and thought at an unparalleled scale.

"A good human plus a machine is the best combination" — Kasparov


It was for a while, look up "centaur" systems, that's the term in chess. Stockfish 17 rolls them every time.


FWIW people were probably flagging because you're a new/temp account jumping to asserting that anything other than your view on what's being done amounts to "unearned hubris and a complete lack of second-order thinking", not because they don't agree with your set of concerns.


Eh this makes me very, very excited for the future. I want results, I don’t care if they come from humans or AI. That being said we might all be out of jobs soon…


Same thing the hype bros said 2 years ago; it won't happen.


Reinforcement learning seems to be key. I understand how traditional fine-tuning works for LLMs (i.e. RLHF), but not RL.

It seems one popular method is PPO, but I don't understand at all how to implement that. E.g., is backpropagation still used to adjust weights and biases? Would love to read more from something less opaque than an academic paper.


The point of RL is that sometimes you need a model to take actions (you could also call this making predictions) that don’t have a known label. So for example if it’s playing a game, we don’t have a label for each button press. We just have a label for the result at some later time, like whether Pac-Man beat the level.

PPO applies this logic to chat responses. If you have a model that can tell you whether a response was good, we can reinforce the series of actions (each token the model generated) to learn how to generate good responses.

To answer your question, yes you would still use backprop if your model is a neural net.
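
A minimal sketch of that idea (REINFORCE-style, purely illustrative and certainly not OpenAI's actual recipe; the reward value is assumed to come from some reward model or verifier): backprop still runs exactly as in supervised training, but the scalar reward scales the log-probabilities of the tokens the model itself sampled.

  import torch

  def reinforce_loss(logits, sampled_tokens, reward):
      # logits: (seq_len, vocab_size) over the model's own sampled continuation
      # sampled_tokens: (seq_len,) token ids the model actually generated
      # reward: scalar score from an assumed reward model / verifier
      log_probs = torch.log_softmax(logits, dim=-1)
      chosen = log_probs.gather(1, sampled_tokens.unsqueeze(1)).squeeze(1)
      # Higher reward -> push up the probability of the whole sampled sequence
      return -(reward * chosen.sum())

Calling .backward() on that loss adjusts weights and biases the usual way; PPO adds a clipped ratio against a reference policy on top, but the backprop machinery is the same.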


Thanks, that helps! I still don't quite understand the mechanics of this, since backprop makes adjustments to steer the LLM towards a specific token sequence, not towards a score produced by a reward function.


Any RL task needs to solve credit assignment: decomposing the loss across the actions taken.

This was also the issue with RLHF models. The loss for predicting the next token is straightforward to minimize, as we know which weights are responsible for the token being correct or not. Identifying which tokens mattered most for a given prompt is not so straightforward.

For thinking you might generate 32k thinking tokens and then 96k solution tokens and do this a lot of times. Look at the solutions, rank by quality and bias towards better thinking by adjusting the weights for the first 32k tokens. But I’m sure o1 is way past this approach.
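
A toy sketch of that sample-and-rank loop (rejection-sampling flavoured, purely illustrative; `generate`, `score_fn`, and `finetune_on` are hypothetical interfaces, and o1's actual training recipe is not public):

  def improve_reasoning(model, prompt, score_fn, n=8):
      # Sample several chains of thought + answers for the same prompt
      candidates = [model.generate(prompt, temperature=1.0) for _ in range(n)]
      # Rank them with some verifier / reward signal
      best = max(candidates, key=score_fn)
      # Fine-tune on the best sample so future chains of thought look more like it
      model.finetune_on(prompt, best)
      return best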


One thing I find generally useful when writing code for a large project is having a code base and several branches that are different features I developed. I can immediately use parts of a branch as a reference for the current feature, because there is often overlap. This limits mistakes in large contexts and makes it easy to iterate quickly.


I have a question. The video demos for this all mention that the o1 model is taking its time to think through the problem before answering. How does this functionally differ from - say - GPT-4 running its algorithm, waiting five seconds and then revealing the output? That part is not clear to me.


It is recursively "talking" to itself to plan and then refine the answer.


I just tested o1-preview on the "How many r's are in strawberry?" question. It answers correctly!


Why so much hate? They're doing their best. This is the state of progress in the field so far. The best minds are racing to innovate. The benchmarks are impressive nonetheless. Give them a break. At the end of the day, they built the chatbot that's been saving your ass every day ever since.


Haven't used ChatGPT* in over 6 months, not saving my ass at all.


I bet you've still used other models that were inspired by GPT.


I've used co-pilot, I turned it off, kept suggesting nonsense.


Not saving my ass; I never needed one professionally. OpenAI is shovelling money into a furnace, and I expect them to be assimilated into Microsoft soon.


I think you're overestimating LLM usage.


Rarely using it at work, seems you are overestimating


> Therefore, s(x) = p*(x) − x^(2n+2). We can now write, s(x) = p*(x) − x^(2n+2)

Completely repeated itself... weird... it also says "...more lines cut off..." How many lines, I wonder? Would people get charged for these cut-off lines? It would have been nice to see how much the answer had cost...


Aren't LLMs much more limited in the number of output tokens than input tokens? For example, GPT-4o seems to support only up to 16K output tokens. I'm not completely sure what the reason is, but I wonder how that interacts with chain-of-thought reasoning.


Not really.

There's no fundamental difference between input and output tokens technically.

The internal model space is exactly the same after evaluating some given set of tokens, no matter which of them were produced by the prompter or the model.

The 16k output token limit is just an arbitrary limit in the chatgpt interface.


> The 16k output token limit is just an arbitrary limit in the chatgpt interface.

It is a hard limit in the API too, although frankly I have never seen an API output go over 700 tokens.
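
For illustration, here's where that cap shows up when calling the API (a minimal sketch assuming the v1-style official Python SDK; the exact ceiling is a per-model property):

  from openai import OpenAI

  client = OpenAI()  # reads OPENAI_API_KEY from the environment
  resp = client.chat.completions.create(
      model="gpt-4o",
      messages=[{"role": "user", "content": "Summarize the plot of Hamlet."}],
      max_tokens=1000,  # generation stops here; the model's own limit caps it regardless
  )
  print(resp.choices[0].message.content)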


https://openai.com/index/introducing-openai-o1-preview/

> ChatGPT Plus and Team users will be able to access o1 models in ChatGPT starting today. Both o1-preview and o1-mini can be selected manually in the model picker, and at launch, weekly rate limits will be 30 messages for o1-preview and 50 for o1-mini. We are working to increase those rates and enable ChatGPT to automatically choose the right model for a given prompt.

Weekly? Holy crap, how expensive is this model to run?


It's probably running several rounds of CoT. I imagine each single message you send is probably at __least__ 10x that to the actual model. So in reality it's like 300 messages, and honestly it's probably 100x, given how constrained they're being with usage.


Anyone know when o1 access in ChatGPT will be open?


Rolling out over the next few hours to Plus users.


The human brain uses 20 watts, so yeah we figured out a way to run better than human brain computation by using many orders of magnitude more power. At some point we'll need to reject exponential power usage for more computation. This is one of those interesting civilizational level problems. There's still a lack of recognition that we aren't going to be able to compute all we want to, like we did in the pre-LLM days.


That's 20 watts of work on stuff like this for about 4 hours a day, counting vacations, weekends, and attention span. The other 20 hours go to rest, relaxation, distraction, household errands and such, so that maybe bumps it up to 120 watts per work hour. Then 22.5 years of training or so per worker, a 45-year work period, and 22.5 years of retirement, so double it there to 240 watts. We can't run brains without bodies, so multiply that by 6, giving 1440 watts, plus the air conditioning, commuting to school and work, etc. - maybe 2000 watts?

We're getting close to parity if things keep getting more efficient as fast as they have been. But that's without accounting for the AI training, which can on the plus side be shared among multiple agents, but on the down side can't really do continuous learning very well without catastrophic forgetting.
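
Spelled out as arithmetic (all of the inputs here are the assumptions above, not measurements):

  brain_watts = 20
  per_work_hour = brain_watts * 24 / 4        # only ~4 productive hours/day   -> 120 W
  per_working_year = per_work_hour * 90 / 45  # ~45 working years out of ~90   -> 240 W
  with_body = per_working_year * 6            # body overhead beyond the brain -> 1440 W
  print(with_body)                            # plus AC, commuting, etc.       -> ~2000 W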


we'll ask it to redesign itself for low power usage


Impressive safety metrics!

I wish OAI included "% Rejections on perfectly safe prompts" in this table, too.



I find shorter responses > longer responses. Anyone else agree?

for example in gpt-4o I often append '(reply short)' at the end of my requests. with the o1 models I append 'reply in 20 words' and it gives way better answers.


"hidden chain of thought" is basically the finetuned prompt isn't it? The time scale x-axis is hidden as well. Not sure how they model the gpt for it to have an ability to decide when to stop CoT and actually answer.


> THERE ARE THREE R’S IN STRAWBERRY

Well played




and listing state names with the letter 'a' https://x.com/edzitron/status/1834329704125661446


For the exam problems it gets wrong, has someone cross-checked that the ground truth answers are actually correct!! ;-) Just kidding, but even such a time may come when the exams created by humans start falling short.


I have spent some time doing this for these benchmarks — the model still does make mistakes. Of the questions I can understand (roughly half in this case), about half were real errors and half were broken questions.


I tested various Math Olympiad questions with Claude sonnet 3.5 and they all arrived at the correct solution. o1's solution was a bit better formulated, in some circumstances, but sonnet 3.5 was nearly instant.


The question here is about the "reasoning" tag - behind the scenes, is this qualitatively different from stringing words together on a statistical basis? (aside from backroom tweaking and some randomisation)


Dang, I just paid for Kagi Assistant.

Using Claude 3 Opus I noticed it performs <thinking> and <result> while browsing the web for me. I don't suppose that's a change in the model for doing reasoning.


boo, they are hiding the chain of thought from user output (the great improvement here)

> Therefore, after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users. We acknowledge this decision has disadvantages. We strive to partially make up for it by teaching the model to reproduce any useful ideas from the chain of thought in the answer. For the o1 model series we show a model-generated summary of the chain of thought.


I asked a few “hard” questions and compared o1 with claude. https://github.com/harisec/o1-vs-claude


Here's a video demonstration they posted on YouTube: https://www.youtube.com/watch?v=50W4YeQdnSg


Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking

https://arxiv.org/abs/2403.09629


the cipher example is impressive on the surface, but I threw a couple of my toy questions at o1-preview and it still hallucinates a bunch of nonsense (but now uses more electricity to do so).


Peter Thiel was widely criticized this spring when he said that AI "seems much worse for the math people than the word people."

So far, that seems to be right. The only thing o1 is worse at is writing.


> 30 messages per week


Maybe I missed it, but do the tokens used for the internal chain of thought count against the output tokens of the response (priced at a spicy $60.00 / 1M output tokens)?


Yes. Chain of thought tokens are billed, so requests to this model can be ~10x the price of gpt-4o, or even more.
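
For a rough sense of scale (the token counts below are made up for illustration; only the $/1M rates come from the pricing page):

  input_tokens = 2_000           # your prompt (hypothetical)
  reasoning_tokens = 10_000      # hidden chain of thought, billed but not shown (hypothetical)
  visible_output_tokens = 1_000  # the answer you actually see (hypothetical)

  cost = (input_tokens * 15 + (reasoning_tokens + visible_output_tokens) * 60) / 1_000_000
  print(f"${cost:.2f}")          # ~$0.69 for this single request under these assumptions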


A near-perfect score on AMC 12, a 1900 CodeForces ELO, and a silver-medal IOI competitor. In two years, we'll have models that could easily win IMO and IOI. This is __incredible__!!


It depends on what they mean by "simulation". It sounds like o1 did not participate in new contests with new problems.

Any previous success of models with code generation focus was easily discovered to be a copy-paste of a solution in the dataset.

We could argue that there is an improvement in "understanding" if the code recall is vastly more efficient.


Near perfect AIME, not just AMC12.

But each solve costs far more time and energy than a competent human takes.


Using Codeforces as a benchmark feels like a cheat, since OpenAI used to pay us chump change to solve Codeforces questions and track our thought process in a Jupyter notebook.


Having read the full transcript I don't get how it counted 22 letters for mynznvaatzacdfoulxxz. It's nice that it corrected itself but a bit worrying


I think OpenAI introduced the o1 model because Reflection 70B inspired them. They needed a new announcement to fill the gap after such a long time.


What's the precedent set here?

Models that hide away their reasoning and only display the output, charging whatever tokens they'd like?

This is not a good release on any front.



If I pay for the chain of thought, I want to see the chain of thought. Simple. How would I know if it happened at all? Trust OpenAI? LOL


You could say the same thing about using any product which isn't fully open source: "how do I know this service really saved my files redundantly if I can't see the disks it's stored on?" It's definitely an opinion on approach, though I'm not sure how practically applicable it is.

The real irony is how closed "Open"AI is... but that's not news.


Easy solution - don't pay!


That's a chain of thought, right there!


how do you know it isn't some guy typing responses to you when you use openAI?


Well if they are paying real people to answer my questions I would call that a pretty good deal. That's exactly my point. As a user I don't care how they come up with it. That's not my problem. I just care about the content. If I pay a human for logical reasoning, train of thought type of stuff, I expect them to lay it out for me. Not just give me the conclusion, but how they came to it.


I challenged it to solve the puzzle in my profile info.

It failed ;)


GePeTO1 does not make Pinocchio into a real boy.


"Open"AI. Should be ClosedAI instead.


Sam Altman and OpenAI are following the example of Celebrimbor it seems. And I love what may come next...


“THERE ARE THREE R’S IN STRAWBERRY” - o1

I got that reference!


Feels like the challenge here is to somehow convey to the end user how the quality of the output is so much better.


So how is the internal chain of thought represented anyhow? What does it look like when someone sees it?


Kinda disappointed that they're hiding the thought process. Hopefully the open source community will figure out how to effectively match and replicate what OpenAI is doing.

I wonder how far we are from having a model that can correctly solve a word-search puzzle directly from just a prompt and an input image. It seems like the crossword example is close. For a word search it would require turning the image into an internal grid representation, preparing the list of words, and doing a search. I'd be interested in seeing if this model can already solve the word-search problem if you give it the correct representation as an input.


> Hopefully the open source community will figure out how to effectively match and replicate what OpenAI is doing.

No need for that, there is a Reflection 70B model that does the exact same thing - with chains of thought being separated from the "final answer" via custom 'tag' tokens.


Wasn't this the model that was proven to have been faking their benchmarks recently? Or am I thinking of a different model?


It's the same, for sure the proximity of that little scandal to this announcement is no coincidence.


Is it possible someone within OpenAI leaked the CoT technique used in O1, and Reflection 70b was an attempt to replicate it?


It's possible


yes. it was fake


Some reported benchmarks do seem to be rather dubious, but the 70B model itself is quite real. Sample output:

  $ ollama run reflection:70b-q4_0
  >>> hello
  <thinking>
  To respond to "hello", I'll need to consider several factors:
  
  1. The user's intent: They're likely saying hello as a greeting.
  2. Appropriate responses: Common ways to respond to "hello" are:
     - Hello back
     - Hi
     - Hey
  3. Formality level: Since this is an AI response, I'll aim for a friendly but professional tone.
  
  <reflection>
  The approach of responding with a standard greeting seems appropriate in this context. It acknowledges the user's hello and provides a polite response.
  </reflection>
  
  Given these considerations, the most suitable response would be to echo "hello" back to the user.
  </thinking>
  
  <output>
  Hello!
  </output>


There was a Hacker News post a few days ago, pointing to a Reddit thread where some guys proved that the founder(s) of Reflection AI were faking their model by just passing the input to Claude (Sonnet 3.5) and stripping the word "Claude" from the output, amongst other things. Then when they got caught they switched it to GPT-4o.

After this, I will be very skeptical of anything they claim to achieve.

https://news.ycombinator.com/item?id=41484981


That’s the one.


That reflection model is in no way comparable to whatever OpenAI is doing.


Maybe the benchmark results are different, but it certainly seems like OpenAI is doing the same with its "thinking" step


I have access to the model via the web client and it does show the thought process along the way. It shows a little icon that says things like "Examining parser logic", "Understanding data structures"...

However, once the answer is complete the chain of thought is lost


It's still there.

Where it says "Thought for 20 seconds" - you can click the Chevron to expand it and see what I guess is the entire chain of thought.


Per OpenAI, it's a summary of the chain of thought, not the actual chain of thought.


They claim it's available in ChatGPT Plus, but for me clicking the link just gives GPT-4o Mini.


Honestly, it doesn't matter to the end user if there are more tokens generated between the human message and the AI reply. This is like getting rid of AI wrappers for specific tasks. If the jump in accuracy is real, then for all practical purposes we have a sufficiently capable AI which has the potential to boost productivity at the largest scale in human history.


It starts to matter if the compute time is 10-100 fold, as the provider needs to bill for it.

Of course, that's assuming it's not priced for market acquisition funded by a huge operational deficit, which is rarely safe to conclude with AI right now.


The fact that their compute-time vs. accuracy charts label the compute-time axis as logarithmic would worry me greatly about this aspect.


> we are releasing an early version of this model, OpenAI o1-preview, for immediate use in ChatGPT

Awesome!


Rate limited to 30 messages per week for ChatGPT Plus subscribers at launch: https://openai.com/index/introducing-openai-o1-preview/


Read "immediate" in "immediate use" in the same way as "open" in "OpenAI".


You can use it, I just tried a few minutes ago. It's apparently limited to 30 messages/week, though.


The option isn't there for us (though the blogpost says otherwise), even after CTRL-SHIFT-R, hence the parent comment.


I am interpreting "immediate use in ChatGPT" the same way advanced voice mode was promised "in the next few weeks."

Probably 1% of users will get access to it, with a 20-message-a-day rate limit. Until early next year.


Rate limit is 30 a week for the big one and 50 for the small one


Besides chat bots, what viable products are being made with LLMs besides APIs into LLMs?


I'm wondering what kind of "AI wrappers" will emerge from this model.


What's with this how many r's in a strawberry thing I keep seeing?


What's amazing is that given how LLMs receive input data (as tokenized streams, as other commenters have pointed out) it's remarkable that it can ever answer this question correctly.


Models don't really predict the next word; they predict the next token. Strawberry is made up of multiple tokens, and the model doesn't truly understand the characters in it... so it tends to struggle.
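
You can see the split yourself with the tiktoken library (a quick sketch; the exact chunks depend on which encoding you load, but the point is that the model sees chunks, not individual characters):

  import tiktoken

  enc = tiktoken.get_encoding("cl100k_base")  # a GPT-4-era encoding
  tokens = enc.encode("strawberry")
  # Print the byte chunks the model actually receives
  print([enc.decode_single_token_bytes(t) for t in tokens])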


LLMs are bad at answering that question because inputs are tokenized.



It’s a common LLM riddle. Apparently many fail to give the right answer.


Somebody please ask o1 to solve it


The link shows it solving it


Looking at pricing, it's $15 per 1M input tokens and $60 per 1M output tokens. I assume the CoT tokens count as output (or even input)? If so, and it directly affects billing, I'm not sure how I feel about them hiding the CoT prompts. Nothing to stop them from saying "trust me bro, that used 10,000 tokens ok?". Also there's no way to gauge expected costs if there's a black box you are being charged for.
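
If it helps a little, the usage object returned by the API does report counts you can at least sanity-check against your bill (a sketch assuming the v1-style Python SDK; treat the exact `completion_tokens_details.reasoning_tokens` field name as an assumption based on the reasoning guide):

  from openai import OpenAI

  client = OpenAI()
  resp = client.chat.completions.create(
      model="o1-preview",
      messages=[{"role": "user", "content": "How many primes are below 100?"}],
  )
  usage = resp.usage
  print(usage.prompt_tokens, usage.completion_tokens)  # hidden reasoning is folded into completion_tokens
  details = getattr(usage, "completion_tokens_details", None)
  if details is not None:
      print(details.reasoning_tokens)  # assumed breakdown field; may be absent on older models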


Very nice.

It's nice that people have taken the obvious extra-tokens/internal thoughts approach to a point where it actually works.

If this works, then automated programming etc., are going to actually be tractable. It's another world.


Did you guys use the model? Seems about the same to me


Question for those who do have access: how is it?


What is the maximum context size in the web UI?


> THERE ARE THREE R’S IN STRAWBERRY

It finally got it!!!


Wait, are they comparing 4o without CoT and o1 with built-in CoT?


yeah was wondering what 4o with a CoT in the prompt would look like.


the only benchmark that matters is the ELO score on LMSYS; any other one can be easily gamed


"after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users"

...umm. Am I the only one who feels like this takes away much of the value proposition, and that it also runs heavily against their stated safety goals? My dream is to interact with tools like this to learn, not just to be told an answer. This just feels very dark. They're not doing much to build trust here.


o1

Maybe they should spend some of their billions on marketing people. Gpt4o was a stretch. Wtf is o1


To me it looks like they think this is the future of how all models should be, so they're restarting the numbering. This is what I suspect. The o is for omni.


> Available starting 9.12

I don't see it


In ChatGPT, it's rolling out to Plus users gradually over the next few hours.

In API, it's limited to tier 5 customers (aka $1000+ spent on the API in the past).


Only for those accounts in Tier 5 (or above, if they exist)

Unfortunately you and I don't have enough operating thetans yet


Per-token billing will be lit


Is there a paper available?


yeah this is kinda cool i guess but 808 elo is still pretty bad for a model that can supposedly code like a human, i mean 11th percentile is like barely scraping by, and what even is the point of simulating codeforces if youre just gonna make a model that can barely compete with a decent amateur, and btw what kind of contest allows 10 submissions, thats not how codeforces works, and what about the time limits and memory limits and all that jazz, did they even simulate those, and btw how did they even get the elo ratings, is it just some arbitrary number they pulled out of their butt, and what about the model that got 1807 elo, is that even a real model or just some cherry picked result, and btw what does it even mean to "perform better than 93% of competitors" when the competition is a bunch of humans who are all over the place in terms of skill, like what even is the baseline for comparison

edit: i got confused with the Codeforces chart. it is indeed zero-shot, and o1 is potentially something very new. i hope Anthropic and others will follow suit

any type of reasoning capability i'll take it !


808 ELO was for GPT-4o.

I would suggest re-reading more carefully


you are right, i read the charts wrong. o1 has a significant lead over GPT-4o in the zero-shot examples

honestly im spooked


Great, yet another step towards the inevitable conclusion. Now I'm not just being asked to outsource my thinking to my computer, but instead to a black box operated by a for-profit company for the benefit of Microsoft. Not only will they not tell me the whole reasoning chain, they wont even tell me how they came up with it.

Tell me, users of this tool. What even are you? If you've outsourced your thinking to a corporation, what happens to your unique perspective? Your blend of circumstance and upbringing? Are you really OK being reduced to meaningless computation and worthless weights? Don't you want to be something more?


> What even are you?

An accelerator of reaching the Singularity. This is something more.


You realize that you're not going inside the computer right? At best you're going to create a simulacrum of you. Something that looks, talks, and acts like you. It's never going to actually be you. You're going to be stuck out here with the rest of us, in whatever world we create in pursuit of the singularity suicide cult.


My friend, it has nothing to do with going inside a computer. Do not confuse the Singularity with mind uploading which is a distinct concept. The singularity has to do with technology acceleration, and with the inability to predict what lies beyond it. As such, it has nothing to do with any suicide cult. Please stop spreading nonsense about it. I do care about life in the physical world, not about a digital life.


can we get it on ollama? if not how come openai is called open


because if not for them, palm-1/lambda would still be rotting on Google's servers without normal people ever being able to try it


I finally got access to it, I tried playing Connect 4 with it, but it didn't go very well. A bit disappointed.


the newest scaling law: inference-time compute.


> THERE ARE THREE R'S IN STRAWBERRY

Who do these Rs belong to?!


Stop fooling around with stories about AI taking jobs from programmers. Which programmers exactly??? Creators of idiotic web pages? Nobody in their right mind would push generated code into a financial system, medical equipment or autonomous transport. Template web pages and configuration files are not the entire IT industry. In addition, AI is good at tasks for which there are millions of examples. 20 times I asked to generate a PowerShell script, 20 times it was generated incorrectly. Because, unlike Bash, there are far fewer examples on the Internet. How will AI generate code for complex systems with business logic that it has no idea about? AI is not able to generate, develop and change complex information systems.


Time to fire up System Shock 2:

> Look at you, hacker: a pathetic creature of meat and bone, panting and sweating as you run through my corridors. How can you challenge a perfect, immortal machine?


Someone give this model an IQ test stat.


You're kidding right? The tests they gave it are probably better tests than IQ tests at determining actually useful problem solving skills...


It can't do large portions of the parts of an IQ test (not multi-modal). Otherwise I think it's essentially superhuman, modulo tokenization issues (please start running byte-by-byte or at least come up with a better tokenizer).


Shit, this is going to completely kill jailbreaks isn't it?


Congrats to OpenAI for yet another product that has nothing to do with the word "open"


And Apple's product line this year? Phones. Nothing to do with fruit. Almost 50 years of lying to people. Names should mean something!


Did Apple start their company by saying they will be selling apples?


What's the statement that OpenAI are making today which you think they're violating? There very well could be one and if there is, it would make sense to talk about it.

But arguments like "you wrote $x in a blog post when you founded your company" or "this is what the word in your name means" are infantile.


It is open in the sense that everyone can use it.


Only people who exactly share OpenAI's concepts of what "alignment" and "safety" should mean can use it to its full potential.


Not people working on AI or those who would like to train AI on their logs


If they had launched it with Oracle DB-style licensing, their company would have been dead in a year.


> Therefore, after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users.

What? I agree people who typically use the free ChatGPT webapp won't care about raw chains of thought, but OpenAI is opening an API endpoint for the o1 model, and downstream developers very much care about the chain of thought / the entire pipeline for debugging and refinement.

I suspect "competitive advantage" is the primary driver here, but that just gives competitors like Anthropic an opportunity.


They've taken at least some of the hobbles off the chain of thought, so it will also include stuff like "I shouldn't say <forbidden thing they don't want it to say>".


slightly offtopic, but openai having anti scraping / bot check on the blog is pretty funny


ChatGPT is now a better coder than I ever was.


Can we please stop using the word "think", as in "o1 thinks before it answers"? I doubt we mean the same thing when someone says a human thinks vs. o1 thinks. When I say I think "red", I am sure the word think means something completely different than when you say OpenAI thinks red. I am not saying one is superior to the other, but maybe as humans we can use a different set of terminology for AI activities.


"For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user."

This made me roll my eyes, not so much because of what it said but because of the way it's conveyed injected into an otherwise technical discussion, giving off severe "cringe" vibes.


Landmark. Wild. Beautiful. The singularity is nigh.


"Learn to reason like a robot"


They keep announcing things that will be available to paid ChatGPT users “soon” but is more like an Elon Musk “soon”. :/


>We believe that a hidden chain of thought presents a unique opportunity for monitoring models. Assuming it is faithful and legible, the hidden chain of thought allows us to "read the mind" of the model and understand its thought process. For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user. However, for this to work the model must have freedom to express its thoughts in unaltered form, so we cannot train any policy compliance or user preferences onto the chain of thought. We also do not want to make an unaligned chain of thought directly visible to users.

>Therefore, after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users. We acknowledge this decision has disadvantages. We strive to partially make up for it by teaching the model to reproduce any useful ideas from the chain of thought in the answer. For the o1 model series we show a model-generated summary of the chain of thought.

So, let's recap. We went from:

- Weights-available research prototype with full scientific documentation (GPT-2)

- Commercial-scale model with API access only, full scientific documentation (GPT-3)

- Even bigger API-only model, tuned for chain-of-thought reasoning, minimal documentation on the implementation (GPT-4, 4v, 4o)

- An API-only model tuned to generate unedited chain-of-thought, which will not be shown to the user, even though it'd be really useful to have (o1)


> For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user.[...] Therefore we have decided not to show the raw chains of thought to users.

Better not let the user see the part where the AI says "Next, let's manipulate the user by lying to them". It's for their own good, after all! We wouldn't want to make an unaligned chain of thought directly visible!


The hidden chain of thought tokens are also billed as output tokens, so you still pay for them even though they're not going to let you see them:

> While reasoning tokens are not visible via the API, they still occupy space in the model's context window and are billed as output tokens.

https://platform.openai.com/docs/guides/reasoning
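
If you want to keep an eye on how much of your bill is hidden reasoning, a rough sketch is below. I'm assuming the usage object on the response exposes a reasoning-token breakdown; the exact field names here are my assumption, so treat them as placeholders and check the docs.

    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="o1-preview",
        messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    )

    usage = resp.usage
    # completion_tokens includes the hidden reasoning tokens you pay for
    details = getattr(usage, "completion_tokens_details", None)
    reasoning = getattr(details, "reasoning_tokens", None) if details else None
    print(usage.prompt_tokens, usage.completion_tokens, reasoning)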


I always laughed at the idea of a LLM Skynet "secretly" plotting to nuke humanity, while a bunch of humans watch it unfold before their eyes in plaintext.

Now that seems less likely. At least OpenAI can see what it's thinking.

A next step might be allowing the LLM to include non-text-based vectors in its internal thoughts, and then do all internal reasoning with raw vectors. Then the LLMs will have truly private thoughts in their own internal language. Perhaps we will use a LLM to interpret the secret thoughts of another LLM?

This could be good or bad, but either way we're going to need more GPUs.


"...either way we're going to need more GPUs." posted the LLM, rubbing it's virtual hands, cackling with delight as it prodded the humans to give it MOAR BRAINS


> Now that seems less likely. At least OpenAI can see what it's thinking.

When it's fully commercialized, no one will be able to read through all the chains of thought, and with the possibility of fine-tuning, the AI can learn to evade whatever tools OpenAI invents for flagging concerning chains of thought if they interfere with producing the answer in some fine-tuning environment.

Also, at some point, for the sake of efficiency and response quality, they might migrate from a chain of thought made of tokens to one made of full network states, with part of the network having dedicated inputs for reading them.


At this point the G in GPU must be completely dropped


Gen-ai Production Unit


>Perhaps we will use a LLM to interpret the secret thoughts of another LLM?

this is a pretty active area of research with sparse autoencoders
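
For anyone curious what that looks like in practice, here is a minimal toy sketch of the idea (not any lab's actual setup): train an overcomplete autoencoder with an L1 sparsity penalty on an internal activation stream, then inspect which sparse features fire. The dimensions and hyperparameters below are arbitrary placeholders.

    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        def __init__(self, d_model, d_hidden):
            super().__init__()
            self.enc = nn.Linear(d_model, d_hidden)
            self.dec = nn.Linear(d_hidden, d_model)

        def forward(self, x):
            f = torch.relu(self.enc(x))   # sparse feature activations
            return self.dec(f), f

    d_model, d_hidden = 512, 4096          # hidden layer is deliberately overcomplete
    sae = SparseAutoencoder(d_model, d_hidden)
    opt = torch.optim.Adam(sae.parameters(), lr=1e-3)

    acts = torch.randn(1024, d_model)      # stand-in for residual-stream activations
    for _ in range(100):
        recon, f = sae(acts)
        loss = ((recon - acts) ** 2).mean() + 1e-3 * f.abs().mean()  # reconstruction + L1 sparsity
        opt.zero_grad()
        loss.backward()
        opt.step()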


It's clear to me that OpenAI is quickly realizing they have no moat. Even this obfuscation of the chain-of-thought isn't really a moat. On top of CoT being pretty easy to implement and tweak, there's a serious push to on-device inference (which imo is the future), so the question is: will GPT-5 and beyond be really that much better than what we can run locally?
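
To the point about CoT being easy to approximate yourself: a rough sketch of the DIY version against any chat-completions-style API is below. The model name and prompts are just placeholders, and this is obviously not what o1 does internally, just the home-rolled flavor of the same idea.

    from openai import OpenAI

    client = OpenAI()

    def answer_with_hidden_cot(question: str) -> str:
        # Step 1: let the model reason at length; keep this text app-side only
        scratch = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Think step by step in detail. This text will not be shown to the user."},
                {"role": "user", "content": question},
            ],
        ).choices[0].message.content

        # Step 2: ask for a clean final answer conditioned on the hidden reasoning
        final = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Using the scratch reasoning below, give only a concise final answer."},
                {"role": "user", "content": f"Question: {question}\n\nScratch reasoning:\n{scratch}"},
            ],
        ).choices[0].message.content
        return final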


I wonder if they'll be able to push the chain-of-thought directly into the model. I'd imagine there could be some serious performance gains achievable if the model could "think" without doing IO on each cycle.

In terms of moat, I think people underestimate how much of OpenAI's moat is based on operations and infrastructure rather than being purely based on model intelligence. As someone building on the API, it is by far the most reliable option out there currently. Claude Sonnet 3.5 is stronger on reasoning than gpt-4o but has a higher error rate, more errors conforming to a JSON schema, much lower rate limits, etc. These things are less important if you're just using the first-party chat interfaces but are very important if you're building on top of the APIs.


I don't understand the idea that they have no moat. Their moat is not technological. It's sociological. Most AI through APIs uses their models. Most consumer use of AI involves their models, or ChatGPT directly. They're clearly not in the "train your own model on your data in your environment" game, as that's a market for someone else. But make no mistake, they have a moat and it is strong.


> But make no mistake, they have a moat and it is strong.

Given that Mistral, Llama, Claude, and even Gemini are competitive with (if not better than) OpenAI's flagships, I don't really think this is true.


There are countless tools competitive with or better than what I use for email, and yet I still stick with my email client. Same is true for many, many other tools I use. I could perhaps go out of my way to make sure I'm always using the most technically capable and easy-to-use tools for everything, but I don't, because I know how to use what I have.

This is the exact dynamic that gives OpenAI a moat. And it certainly doesn't hurt them that they still produce SOTA models.


That's not a strong moat (arguably, not a moat at all, since as soon as any competitor has any business, they benefit from it with respect to their existing customers): it doesn't affect anyone who is not already invested in OpenAI's products, and not every customer behaves that way even with the products they currently use.

Now, having a large existing customer base and thus having an advantage in training data that feeds into an advantage in improving their products and acquiring new (and retaining existing customers) could, arguably, be a moat; that's a network effect, not merely inertia, and network effects can be a foundation of strong (though potentially unstable, if there is nothing else shoring them up) moats.


That is not what anyone means when they talk about moats.


I'm someone, and that's one of the ways I define a moat.


> I'm someone

Asserting facts not in evidence, as they say.


First mover advantage is not a great moat.


Yeah but the lock-in wrt email is absolutely huge compared to chatting with an LLM. I can (and have) easily ended my subscription to ChatGPT and switched to Claude, because it provides much more value to me at roughly the same cost. Switching email providers will, in general, not provide that much value to me and cause a large headache for me to switch.

Switching LLMs right now can be compared to switching electricity providers or mobile carriers - generally it's pretty low friction and provides immediate benefit (in the case of electricity and mobile, the benefit is cost).

You simply cannot compare it to an email provider.


It was pretty simple for me to switch email providers about ~6 years ago or so when I decided I'd do it. Although it's worth noting that my reasons for doing so were motivated by a strong desire around privacy, not noticing that another email provider did email better.

I elaborated a little more here on why I think OpenAI has quite the moat: https://news.ycombinator.com/item?id=41526082


Inertia is a hell of a moat.

Everyone building is comfortable with OpenAI's API, and have an account. Competing models can't just be as good, they need to be MUCH better to be worth switching.

Even as competitors build a sort of compatibility layer to be plug and play with OpenAI, they will always be a step behind at best every time OpenAI releases a new feature.


Only a small fraction of all future AI projects have even gotten started. So they aren't only fighting over what's out there now, they're fighting over what will emerge.


This is true, and yet many orgs have experimented with OpenAI and are likely to return to them when a project "becomes real". When you google around online for how to do XYZ thing using LLMs, OpenAI is usually in whatever web results you read. Other models and APIs now use OpenAI's API format since it's the apparent winner. And for anyone who's already sent out subprocessor notifications with them as a vendor, they're locked in.

This isn't to say it's only going to be an OpenAI market. Enterprise worlds move differently, such as those in G Cloud who will buy a few million $$ of Vertex expecting to "figure out that gemini stuff later". In that sense, Google has a moat with those slices of their customers.

But I believe that when people think OpenAI has no moat because "the models will be a commodity", I think that's (a) some wishful thinking about the models and (b) doesn't consider the sociological factors that matter a lot more than how powerful a model is or where it runs.


Doesn't that make it less of a moat? If the average consumer is only interacting with it through a third party, and that third party has the ability to switch to something better or cheaper and thus switch thousands/millions of customers at once?


Their moat is no stronger than a good UI/API. What they have is first mover advantage and branding.


LiteLLM proxies their API to all other providers and there are dozens of FOSS recreations of their UI, including ones that are more feature-rich, so neither the UI nor the API are a moat.
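
For what it's worth, swapping providers at the API layer is already close to a one-liner with litellm; the call shape stays the same and only the model string changes. A hedged sketch (the model strings below are illustrative, and the local one assumes you have something like ollama running):

    from litellm import completion

    messages = [{"role": "user", "content": "Summarize the attached bug report."}]

    # Same call, different backends -- only the model string changes
    r1 = completion(model="gpt-4o", messages=messages)
    r2 = completion(model="anthropic/claude-3-5-sonnet-20240620", messages=messages)
    r3 = completion(model="ollama/llama3", messages=messages)  # local model, if ollama is running

    print(r1.choices[0].message.content)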

Branding and first mover is it, and it's not going to keep them ahead forever.


I don't see why on-device inference is the future. For consumers, only a small set of use cases cannot tolerate the increased latency. Corporate customers will be satisfied if the model can be hosted within their borders. Pooling compute is less wasteful overall as a collective strategy.

This argument can really only meet its tipping point when massive models no longer offer a gotta-have-it difference vs smaller models.


On-device inference will succeed the way Linux does: It is "free" in that it only requires the user to acquire a model to run vs. paying for processing. It protects privacy, and it doesn't require internet. It may not take over for all users, but it will be around.

This assumes that openly developed (or at least weight-available) models are available for free, and continue being improved.


Why would a non profit / capped profit company, one that prioritizes public good, want a moat? Tongue in cheek.


>there’s a serious push to on-device inference

What push are you referring to? By whom?


Based on their graphs of how quality scales well with compute cycles, I would expect that it would indeed continue to be that much better (unless you can afford the same compute locally).


Not much of a moat vs other private enterprise, though


I think it's clear their strategy has changed. The whole landscape has changed. The size of models, amount of dollars, numbers of competitors and how much compute this whole exercise takes in the long term have all changed, so it's fair for them to adapt.

It just so happens that they're keeping their old name.

I think people focus too much on the "open" part of the name. I read "OpenAI" sort of like I read "Blackberry" or "Apple". I don't really think of fruits, I think of companies and their products.


Very anti-open and getting less and less with each release. Rooting for Meta in this regard, at least.


It's because there is nothing novel here from an architectural point of view. Again, the secret sauce is only in the training data.

O1 seems like a variant of RLRF https://arxiv.org/abs/2403.14238

Soon you will see similar models from competitors.


Did OpenAI ever even claim that they would be an open source company?

It seems like their driving mission has always been to create AI that is the "most beneficial to society".. which might come in many different flavors.. including closed source.


> Because of AI’s surprising history, it’s hard to predict when human-level AI might come within reach. When it does, it’ll be important to have a leading research institution which can prioritize a good outcome for all over its own self-interest.

> We’re hoping to grow OpenAI into such an institution. As a non-profit, our aim is to build value for everyone rather than shareholders. Researchers will be strongly encouraged to publish their work, whether as papers, blog posts, or code, and our patents (if any) will be shared with the world. We’ll freely collaborate with others across many institutions and expect to work with companies to research and deploy new technologies.

I don't see much evidence that the OpenAI that exists now—after Altman's ousting, his return, and the ousting of those who ousted him—has any interest in mind besides its own.

https://openai.com/index/introducing-openai/


https://web.archive.org/web/20190224031626/https://blog.open...

> Researchers will be strongly encouraged to publish their work, whether as papers, blog posts, or code, and our patents (if any) will be shared with the world. We’ll freely collaborate with others across many institutions and expect to work with companies to research and deploy new technologies.

From their very own website. Of course they deleted it as soon as Altman took over and turned it into a for profit, closed company.


Kind of?

>We're hoping to grow OpenAI into such an institution. As a non-profit, our aim is to build value for everyone rather than shareholders. Researchers will be strongly encouraged to publish their work, whether as papers, blog posts, or code, and our patents (if any) will be shared with the world. We'll freely collaborate with others across many institutions and expect to work with companies to research and deploy new technologies.

https://web.archive.org/web/20160220125157/https://www.opena...


Given the chain of thought is sitting in the context, I'm sure someone enterprising will find a way to extract it via a jailbreak (despite it being better at preventing jailbreaks).


Reminder that it's still not too late to change the direction of progress. We still have time to demand that our politicians put the brakes on AI data centres and end this insanity.

When AI exceeds humans at all tasks humans become economically useless.

People who are economically useless are also politically powerless, because resources are power.

Democracy works because the people (labourers) collectivised hold a monopoly on the production and ownership of resources.

If the state does something you don't like, you can strike or refuse to offer your labour to a corrupt system. A state must therefore seek your compliance. Democracies do this by giving people what they want. Authoritarian regimes might seek compliance in other ways.

But what is certain is that in a post-AGI world our leaders can be corrupt as they like because people can't do anything.

And this is obvious when you think about it... What power does a child or a disabled person hold over you? People who have no ability to create or amass resources depend on their benefactors for everything, including basics like food and shelter. If you as a parent do not give your child resources, they die. But your child does not hold this power over you. In fact they hold no power over you, because they cannot withhold any resources from you.

In a post-AGI world the state would not depend on labourers for resources, jobless labourers would instead depend on the state. If the state does not provide for you like you provide for your children, you and your family will die.

In a good outcome where humans can control the AGI, you and your family become subject to the whims of the state. You and your children will suffer as political corruption inevitably arises.

In a bad outcome the AGI will do to cities what humans did to forests. And AGI will treat humans like humans treat animals. Perhaps we don't seek the destruction of the natural environment and the habitats of animals, but woodland and buffalo are sure inconvenient when building a super highway.

We can all agree there will be no jobs for our children. Even if you're an "AI optimist" we probably still agree that our kids will have no purpose. This alone should be bad enough, but if I'm right then there will be no future for them at all.

I will not apologise for my concern about AGI and our clear progress towards that end. It is not my fault if others cannot see the path I seem to see so clearly. I cannot simply be quiet about this because there's too much at stake. If you agree with me at all I urge you to not be either. Our children can have a great future if we allow them to have it. We don't have long, but we do still have time left.


A lot of skepticism here, but these are astonishing results! People should realize we’re reaching the point where LLMs are surpassing humans in any task limited in scope enough to be a “benchmark”. And as anyone who’s spent time using Claude 3.5 Sonnet / GPT-4o can attest, these things really are useful and smart! (And, if these results hold up, O1 is much, much smarter.) This is a nerve-wracking time to be a knowledge worker for sure.


I have written a ton of evaluations and run countless benchmarks and I'm not even close to convinced that we're at

> the point where LLMs are surpassing humans in any task limited in scope enough to be a “benchmark”

so much as we're over-fitting these benchmarks (and in many cases fishing for a particular way of measuring the results that looks more impressive).

While it's great that the LLM community has so many benchmarks and cares about attempting to measure performance, these benchmarks are becoming an increasingly poor signal.

> This is a nerve-wracking time to be a knowledge worker for sure.

It might be because I'm in this space, but I personally feel like this is the best time to be working in tech. LLMs are still awful at things requiring true expertise while increasingly replacing the need for mediocre programmers and dilettantes. I'm increasingly seeing the quality of the technical people I'm working with go up. After years of being stuck in rooms with leetcode-grinding TC chasers, it's very refreshing.


Is it? They talk about 10k attempts to reach gold medal status in the mathematics olympiad, but zero shot performance doesn't even place it in the upper 50th percentile.

Maybe I'm confused but 10k attempts on the same problem set would make anyone an expert in that topic? It's also weird that zero shot performance is so bad, but over a lot of attempts it seems to get correct answers? Or is it learning from previous attempts? No info given.


The correct metaphor is that 10,000 attempts would allow anyone to cherry pick a successful attempt. You’re conflating cherry picking with online learning. This is like if an entire school of students randomized their answers on a multiple choice test, and then you point to someone who scored 100% and claim it is proof of the school’s expertise.
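
To make the cherry-picking point concrete, here's roughly what a "10,000 attempts" evaluation amounts to (toy numbers of my own, not anything from the blog post):

    def pass_at_k(p_single: float, k: int) -> float:
        """Probability that at least one of k independent attempts succeeds."""
        return 1 - (1 - p_single) ** k

    # A model that solves a problem 0.05% of the time per attempt...
    print(pass_at_k(0.0005, 1))       # 0.0005  -- looks hopeless one-shot
    print(pass_at_k(0.0005, 10_000))  # ~0.993  -- looks like an expert if you get to pick the winner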


Yeah, but how is it possible that it has such a high margin of error? 10k attempts is insane! We're talking about an error margin of 50%! How can you deliver "expert reasoning" with such an error margin?


It’s undeniably less impressive than a human on the same task, but who cares at the end of the day? It can do 10,000 attempts in the time a person can do 1. Obviously improving that ratio will help for any number of reasons, but if you have a computer that can do a task in 5 minutes that will take a human 3 hours, it doesn’t necessarily matter very much how you got there.


How long does it take the operator to sift through those 10,000 attempts to find the successful one, when it's not a contrived benchmark where the desired answer is already known ahead of time? LLMs generally don't know when they've failed, they just barrel forwards and leave the user to filter out the junk responses.


I have an idea! We should train an LLM with reasoning capabilities to sift through all the attempts! /s


why /s ? Isn't that an approach some people are actually trying to take?


Even if it's the other way around, if the computer takes 3 hours on a task that a human can do in 5 minutes, using the computer might still be a good idea.

A computer will never go on strike, demand better working conditions, unionize, secretly be in cahoots with your competitor or foreign adversary, play office politics, scroll through Tiktok instead of doing its job, or cause an embarrassment to your company by posting a politically incorrect meme on its personal social media account.


Even if you disregard the Olympiad performance OpenAI-O1 is, if the charts are to be believed, a leap forward in intelligence. Also bear in mind that AI researchers are not out of ideas on how to make models better and improvements in AI chips are the metaphorical tide that lifts all boats. The trend is the biggest story here.

I get the AI skepticism because so much tech hype of recent years turned out to be hot air (if you're generous; obvious fraud if you're not). But the AI tools available today, once you get the hang of using them, are pretty damn amazing already. Many jobs can be fully automated with AI tools that exist today. No further breakthroughs required. And although I still don't believe software engineers will find themselves out of work anytime soon, I can no longer completely rule it out either.


The blog says "With a relaxed submission constraint, we found that model performance improved significantly. When allowed 10,000 submissions per problem, the model achieved a score of 362.14 – above the gold medal threshold – even without any test-time selection strategy."

I am interpreting this to mean that the model tried 10K approaches to solve the problem, and finally selected the one that did the trick. Am I wrong?


> Am I wrong?

That's the thing, did the operator select the correct result or did the model check its own attempts? No info given whatsoever in the article.


That's not what "zero shot" means.


> And as anyone who’s spent time using Claude 3.5 Sonnet / GPT-4o can attest, these things really are useful and smart!

I have spent significant time with GPT-4o, and I disagree. LLMs are as useful as a random forum dweller who recognises your question as something they read somewhere at some point but are too lazy to check so they just say the first thing which comes to mind.

Here’s a recent example I shared before: I asked GPT-4o which Monty Python members have been knighted (not a trick question, I wanted to know). It answered Michael Palin and Terry Gilliam, and that they had been knighted for X, Y, and Z (I don’t recall the exact reasons). Then I verified the answer on the BBC, Wikipedia, and a few others, and determined only Michael Palin has been knighted, and those weren’t even the reasons.

Just for kicks, I then said I didn’t think Michael Palin had been knighted. It promptly apologised, told me I was right, and that only Terry Gilliam had been knighted. Worse than useless.

Coding-wise, it’s been hit or miss with way more misses. It can be half-right if you ask it uninteresting boilerplate crap everyone has done hundreds of times, but for anything even remotely interesting it falls flatter than a pancake under a steam roller.


I asked GPT-4o and I got the correct answer in one shot:

> Only one Monty Python member, Michael Palin, has been knighted. He was honored in 2019 for his contributions to travel, culture, and geography. His extensive work as a travel documentarian, including notable series on the BBC, earned him recognition beyond his comedic career with Monty Python (NERDBOT) (Wikipedia).

> Other members, such as John Cleese, declined honors, including a CBE (Commander of the British Empire) in 1996 and a peerage later on (8days).

Maybe you just asked the question wrong. My prompt was "which monty python actors have been knighted. look it up and give the reasons why. be brief".


Yes yes, there’s always some “you're holding it wrong” apologist.¹ Look, it’s not a complicated question to ask unambiguously. If you understand even a tiny bit of how these models work, you know you can make the exact same question twice in a row and get wildly different answers.

The point is that you never know what you can trust or not. Unless you’re intimately familiar with Monty Python history, you only know you got the correct answer in one shot because I already told you what the right answer is.

Oh, and by the way, I just asked GPT-4o the same question, with your phrasing, copied verbatim and it said two Pythons were knighted: Michael Palin (with the correct reasons this time) and John Cleese.

¹ And I’ve had enough discussions on HN where someone insists on the correct way to prompt, then they do it and get wrong answers. Which they don’t realise until they’ve shared it and disproven their own argument.


I think your iPhone analogy is apt. Do you want to be the person complaining that the phone drops calls or do you want to hold it slightly differently and get a lot of use out of it?

If you pay careful attention to prompt phrasing you will get a lot more mileage out of these models. That's the bottom line. If you believe that you shouldn't have to learn how to use a tool well then you can be satisfied with your righteous attitude but you won't get anywhere.


No one’s arguing that correct use of a tool isn’t beneficial. The point is that insisting LLMs just need good prompting is delusional and a denial of reality. I have just demonstrated how your own prompt is still capable of producing the wrong result. So either you don’t know how to prompt correctly (because if you did, by your own logic it would have produced the right response every time, which it didn’t) or the notion that all you need is good prompting is wrong. Which anyone who understands the first thing about these systems knows to be the case.


Unless I'm mistaken, isn't all the math behind them... ultimately probabilistic? Even theoretically they can't guarantee the same answer. I'm agreeing with you, by the way, just curious if I'm missing something.


If you take a photo the photons hitting the camera sensor do so in a probabilistic fashion. Still, in sufficient light you'll get the same picture every time you press the shutter button. In near darkness you'll get a random noise picture every time.

Similarly language models are probabilistic and yet they get the easiest questions right 100% of the time with little variability and the hardest prompts will return gibberish. The point of good prompting is to get useful responses to questions at the boundary of what the language model is capable of.

(You can also configure a language model to generate the same output for every prompt without any random noise. Image models for instance generate exactly the same image pixel for pixel when given the same seed.)
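
Concretely, the knobs for that look something like the sketch below (the seed parameter is best-effort as far as I know, so determinism isn't guaranteed across backend changes):

    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Which Monty Python members have been knighted?"}],
        temperature=0,   # always take the highest-probability token
        seed=1234,       # best-effort reproducibility across identical requests
    )
    print(resp.choices[0].message.content)
    print(resp.system_fingerprint)  # changes when the backend changes, which breaks reproducibility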


The photo comparison is disingenuous. Light and colour information can be disorganised to a large extent and yet you still perceive the same from an image. You can grab a photo and apply to it a red filter or make it black and white and still understand what’s in there, what it means, and how it compares to reality.

In comparison, with text a single word can change the entire meaning of a sentence, paragraph, or idea. The same word in different parts of a text can make all the difference between clarity and ambiguity.

It makes no difference how good your prompting is, some things are simply unknowable by an LLM. I repeatedly asked GPT-4o how many Magic: The Gathering cards based on Monty Python exist. It said there are none (wrong) because they didn’t exist yet at the cut off date of its training. No amount of prompting changes that, unless you steer it by giving it the answer (at which point there would have been no point in asking).

Furthermore, there’s no seed that guarantees truth in all answers or the best images in all cases. Seeds matter for reproducibility, they are unrelated to accuracy.


Language is fuzzy in exactly the same way. LLMs can create factually correct responses in dozens of languages using endless variations in phrasing. You fixate on the kind of questions that current language models struggle with but you forget that for millions of easier questions modern language models already respond with a perfect answer every time.

You think the probabilistic nature of language models is a fundamental problem that puts a ceiling on how smart they can become, but you're wrong.


> Language is fuzzy in exactly the same way.

No. Language can be fuzzy, yes, but not at all in the same way. I have just explained that.

> LLMs can create factually correct responses in dozens of languages using endless variations in phrasing.

So which is it? Is it about good prompting, or can you have endless variations? You can’t have it both ways.

> You fixate on the kind of questions that current language models struggle with

So you’re saying LLMs struggle with simple factual and verifiable questions? Because that’s all the example questions were. If they can’t handle that (and they do it poorly, I agree), what’s the point?

By the way, that’s a single example. I have many more and you can find plenty of others online. Do you also think Gemini's ridiculous answers, like putting glue on pizza, are about bad prompting?

> You think the probabilistic nature of language models is a fundamental problem that puts a ceiling on how smart they can become, but you're wrong.

One of your mistakes is thinking you know what I think. You’re engaging with a preconceived notion you formed in your head instead of the argument.

And LLMs aren’t smart, because they don’t think. They are an impressive trick for sure, but that does not imply cleverness on their part.


Even without AI, it's gotten ~10,000 times easier to write software than in the 1950s (eg. imagine trying to write PyTorch code by hand in IBM 650 assembly), but the demand for software engineering has only increased, because demand increases even faster than supply does. Jevons paradox:

https://en.wikipedia.org/wiki/Jevons_paradox


> it's gotten ~10,000 times easier to write software than in the 1950s

It seems many of the popular tools want to make writing software harder than in the 2010s, though. Perhaps their stewards believe that if they keep making things more and more unnecessarily complicated, LLMs won't be able to keep up?


The number of tech job postings has tanked - which loosely correlates with the rise of AI.

https://x.com/catalinmpit/status/1831768926746734984


The local decline in open software engineering positions has _nothing_ to do with AI. The best orgs are using AI to assist developers in building out new systems and write tests. Show me someone who is doing anything bigger than that, please I'd love to be proven wrong.

The big decline is driven by a few big factors, two of which are (1) the overhiring that happened in 2021 and (2) the subsequent increase in interest rates, which dramatically constrained the money supply. Investors stopped preferring growth over profits, and this shift in investor preferences is reflected in engineering orgs tightening their budgets, as they are no longer rewarded for unbridled growth.


Plus the tax code requiring amortization of developer salaries over 5 years instead of the year the salary expense is incurred.


GPT-4 came out in March 2023, after most of this drop was already finished.


And also with a large increase in interest rates.


The tanking is more closely aligned with new tax rules that went to effect that make it much harder to claim dev time as an expense.


I'm skeptical because "we fired half our programmers and our new AI does their jobs as well as they did" is a story that would tear through the Silicon Valley rumor mill. To my knowledge, this has not happened (yet).


this drop is more related to the FED increasing the interest rates


I like your phrasing - "any task limited in scope enough to be a 'benchmark'". Exactly! This is the real gap with LLMs, and will continue to be an issue with o1 -- sure, if you can write down all of the relevant context information you need to perform some computation, LLMs should be able to do it. In other words, LLMs are calculators!

I'm not especially nerve-wracked about being a knowledge worker, because my day-to-day doesn't consist of being handed a detailed specification of exactly what is required, and then me 'computing' it. Although this does sound a lot like what a product manager does!


I cannot, in fact, attest that they are useful and smart. LLMs remain a fun toy for me, not something that actually produces useful results.


I have been deploying useful code from LLMs right and left over the last several months. They are a significant force accelerator for programmers if you know how to prompt them well.


We’ll see if this is a good idea when we start having millions of lines of LLM-written legacy code. My experience maintaining such code so far has been very bad: accidentally quadratic algorithms; subtly wrong code that looks right; and un-idiomatic use of programming language features.
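
The "accidentally quadratic" failure mode, for anyone who hasn't had the pleasure, tends to look like the toy Python example below: code that sails through review and the 50-item test fixture, then falls over in production.

    # Looks innocent, passes the small test fixture...
    def dedupe(items):
        seen = []
        out = []
        for x in items:
            if x not in seen:     # list membership is O(n), so the loop is O(n^2)
                seen.append(x)
                out.append(x)
        return out

    # ...versus the linear version
    def dedupe_fast(items):
        seen = set()
        out = []
        for x in items:
            if x not in seen:     # set membership is O(1) on average
                seen.add(x)
                out.append(x)
        return out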


ah i see so you're saying that LLM-written code is already showing signs of being a maintenance nightmare, and that's a reason to be skeptical about its adoption. But isn't that just a classic case of 'we've always done it this way' thinking?

legacy code is a problem regardless of who wrote it. Humans have been writing suboptimal, hard-to-maintain code for decades. At least with LLMs, we have the opportunity to design and implement better coding standards and review processes from the start.

let's be real, most of the code written by humans is not exactly a paragon of elegance and maintainability either. I've seen my fair share of 'accidentally quadratic algorithms' and 'subtly wrong code that looks right' written by humans. At least with LLMs, we can identify and address these issues more systematically.

As for 'un-idiomatic use of programming language features', isn't that just a matter of training the LLM on a more diverse set of coding styles and idioms? It's not like humans have a monopoly on good coding practices.

So, instead of throwing up our hands, why not try to address these issues head-on and see if we can create a better future for software development?


Maybe it will work out, but I think we’ll regret this experiment because it’s the wrong sort of “force accelerator”: writing tons of code that should be abstracted rather than just dumped out literally has always caused the worst messes I’ve seen.


Yes, same way that the image model outputs have already permeated the blogosphere and pushed out some artists, the other models will all bury us under a pile of auto-generated code.

We will yearn for the pre-GPT years at some point, like we yearn for the internet of the late 90s/early 2000s. Not for a while though. We're going through the early phases of GPT today, so it hasn't been taken over by the traditional power players yet.


When the tool is based on statistical word vomit, it will never move beyond cool-bar-trick level.


LLMs will allow us to write code faster and create applications and systems faster.

Which is how we ended up here, which I guess is tolerable, where a webpage with a bit of styling and a table uses up 200MB of RAM.


Honestly the code it's been giving me has been fairly cromulent. I don't believe in premature optimization and it is perfect for getting features out quick and then I mold it to what it needs to be.


In a way it's not surprising that people are getting vastly different results out of LLMs. People have different skill levels when it comes to using even Google. An LLM has a vastly bigger input space.


same...but have you considered the broader implications of relying on LLMs to generate code? It's not just about being a 'force accelerator' for individual programmers, but also about the potential impact on the industry as a whole.

If LLMs can generate high-quality code with minimal human input, what does that mean for the wages and job security of programmers? Will companies start to rely more heavily on AI-generated code, and less on human developers? It's not hard to imagine a future where LLMs are used to drive down programming costs, and human developers are relegated to maintenance and debugging work.

I'm not saying that's necessarily a bad thing, but it's definitely something that needs to be considered. As someone who's enthusiastic about the potential of code gen, I think this o1 reasoning capability is going to bring big changes.

do you think you'll be willing to take a pay cut when your employer realizes they can get similar results from a machine in a few seconds?


My boss is holding a figurative gun to my head to use this stuff. His performance targets necessitate the use of it. It is what it is.


Yeah, but this, in itself, is triggered by a hype wave. These come and go. So we can't really judge the long term impact from inside the wave.


Your job won't be taken by AI, it will be taken by someone wielding AI.


As a society we're not solving for programmer salaries but for general welfare which is basically code for "cheaper goods and services".


What's a sample prompt that you've used? Every time I've tried to use one for programming, they invent APIs that don't exist (but sound like they might) or fail to produce something that does what it says it does.


No matter the prompt, there's a significant difference between how it handles common problems in popular languages (python, JS) versus esoteric algorithms in niche languages or tools.

I had a funny one a while back (granted this was probably ChatGPT 3.5) where I was trying to figure out what payload would get AWS CloudFormation to fix an authentication problem between 2 services and ChatGPT confidently proposed adding some OAuth querystring parameters to the AWS API endpoint.


I just ask it for what I want in very specific detail, stating the language and frameworks in use. I keep the ideas self-contained -- for example if I need something for the frontend I will ask it to make me a webcomponent. Asking it to not make assumptions and ask questions on ambiguities is also very helpful.

It tends to fall apart on bigger asks with larger context. Breaking your task into discrete subtasks works well.


Use Python or JS. The models definitely don't seem to perform as well on less hyper prevalent languages.


Even then it is hit and miss. If you are doing something that is also copy/paste-able out of a StackOverflow comment, you're apt to be fine, but as soon as you are doing anything slightly less common... Good luck.


Yeah, fair. It's good for short snippets and ways of approaching the problem but not great at execution.

It's like infinitely tailored blog posts, for me at least.


True. It can be good at giving you pointers towards approaching the problem, even if the result is flawed, for slightly less common problems. But as you slide even farther towards esotericism, there is no hope. It won't even get you in the right direction. Unfortunately – as that is where it would be most useful.


Have you tried Claude 3.5 Sonnet?


I think that's just the same as using an autocomplete efficiently, though. I tend to like them for Search, but not for anything i have to "prompt correctly" because i feel like i can type fast enough that i'm not too worried about auto-completing.

With that said i'm not one of those "It's just a parrot!" people. It is, definitely just a parrot atm.. however i'm not convinced we're not parrots as well. Notably i'm not convinced that that complexity won't be sufficient to walk talk and act like intelligence. I'm not convinced that intelligence is different than complexity. I'm not an expert though, so this is just some dudes stupid opinion.

I suspect if LLMs can prove to have duck-intelligence (ie duck typing but for intelligence) then it'll only be achieved in volumes much larger than we imagine. We'll continue to refine and reduce how much volume is necessary, but nevertheless i expect complexity to be the real barrier.


It’s definitely the case that there are some programming workflows where LLMs aren’t useful. But I can say with certainty that there are many where they have become incredibly useful recently. The difference between even GPT-4 last year and C3.5/GPT-4o this year is profound.

I recently wrote a complex web frontend for a tool I’ve been building with Cursor/Claude and I wrote maybe 10% of the code; the rest with broad instructions. Had I done it all myself (or even with GitHub Copilot only) it would have taken 5 times longer. You can say this isn’t the most complex task on the planet, but it’s real work, and it matters a lot! So for increasingly many, regardless of your personal experience, these things have gone far beyond “useful toy”.


The sooner those paths are closed for low-effort high-pay jobs, the better, IMO. All this money for no work is going to our heads.

It's time to learn some real math and science, the era of regurgitating UI templates is over.


I don’t want to be in the business of LLM defender, but it’s just hard to imagine this aging well when you step back and look at the pace of advancement here. In the realm of “real math and science”, O1 has improved from 0% to 50% on AIME today. A year ago, LLMs could only write little functions, not much better than searching StackOverflow. Today, they can write thousands of lines of code that work together with minimal supervision.

I’m sure this tech continues to have many limitations, but every piece of trajectory evidence we have points in the same direction. I just think you should be prepared for the ratio of “real” work vs. LLM-capable work to become increasingly small.


I can probably climb a tree faster than I can build a rocket. But only one will get me all the way to the moon. Don't confuse local optima for global ones.


> The sooner those paths are closed for low-effort high-pay jobs, the better, IMO. All this money for no work is going to our heads.

> It's time to learn some real math and science, the era of regurgitating UI templates is over.

You do realize that software development was one of the last social elevators, right?

What you're asking for won't happen, not least because "real math and science" pay a pittance; there's a reason the pauper mathematician was a common meme.


So you're advocating for properly compensating career paths according to their contributions to society? Tally ho!


'Not useful' is a pretty low bar to clear, especially when you consider the state of the art just 5 years ago. LLMs may not be solving world hunger, but they're already being used in production for coding

If you're not seeing value in them, maybe it's because you're not looking at the right problems. Or maybe you're just not using them correctly. Either way, dismissing an entire field of research because it doesn't fit your narrow use case is pretty short-sighted.

FWIW, I've been using LLMs to generate production code and it's saved me weeks if not months. YMMV, I guess


Familiarize yourself with a tool which does half the prompting for you, e.g. cursor is pretty good at prompting claude 3.5 and it really does make code edits 10x faster (I'm not even talking about the fancy stuff about generating apps in 5 mins - just plain old edits.)


At this point, you're either saying "I don't understand how to prompt them" or "I'm a Luddite". They are useful, here to stay, and only getting better.


> People should realize we’re reaching the point where LLMs are surpassing humans in any task limited in scope enough to be a “benchmark”.

Can you explain what this statement means? It sounds like you're saying LLMs are now smart enough to be able to jump through arbitrary hoops but are not able to do so when taken outside of that comfort zone. If my reading is correct then it sounds like skepticism is still warranted? I'm not trying to be an asshole here, it's just that my #1 problem with anything AI is being able to separate fact from hype.


I think what I’m saying is a bit more nuanced than that. LLMs currently struggle with very “wide”, long-run reasoning tasks (e.g., the evolution over time of a million-line codebase). That isn’t because they are secretly stupid and their capabilities are all hype, it’s just that this technology currently has a different balance of strengths and weaknesses than human intelligence, which tends to more smoothly extrapolate to longer-horizon tasks.

We are seeing steady improvement on long-run tasks (SWE-Bench being one example) and much more improvement on shorter, more well-defined tasks. The latter capabilities aren’t “hype” or just for show, there really is productive work like that to be done in the world! It’s just not everything, yet.


> And as anyone who’s spent time using Claude 3.5 Sonnet / GPT-4o can attest, these things really are useful and smart! (And, if these results hold up, O1 is much, much smarter.) This is a nerve-wracking time to be a knowledge worker for sure.

If you have to keep checking the result of an LLM, you do not trust it enough to give you the correct answer.

Thus you end up having to 'prompt' hundreds of times to get the answer you believe is correct from something that claims to be smart - and that can confidently convince others its answer is correct (even when it is totally erroneous).

I bet if Google DeepMind announced the exact same product, you would equally be as skeptical with its cherry-picked results.


> People should realize we’re reaching the point where LLMs are surpassing humans in any task limited in scope enough to be a “benchmark

This seems like a bold statement considering we have so few benchmarks, and so many of them are poorly put together.


> We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute).

Wow. So we can expect scaling to continue after all. Hyperscalers feeling pretty good about their big bets right now. Jensen is smiling.

This is the most important thing. Performance today matters less than the scaling laws. I think everyone has been waiting for the next release just trying to figure out what the future will look like. This is good evidence that we are on the path to AGI.


More, from an OpenAI employee:

> I really hope people understand that this is a new paradigm: don't expect the same pace, schedule, or dynamics of pre-training era. I believe the rate of improvement on evals with our reasoning models has been the fastest in OpenAI history.

> It's going to be a wild year.

https://x.com/willdepue/status/1834294935497179633


Microsoft, Google, Facebook have all said in recent weeks that they fully expect their AI datacenter spend to accelerate. They are effectively all-in on AI. Demand for nvidia chips is effectively infinite.


Until the first LLM that can improve itself occurs. Then $NVDA tanks


Nvidia stock go brrr...


Even when we start to plateau on direct LLM performance, we can still get significant jumps by stacking LLMs together or putting a cluster of them together.


It'd be interesting for sure if true. Gotta remember that this is a marketing post though; let's wait a few months and see if it's actually true. Things are definitely interesting, whether these techniques get us to AGI or not.


Before commenting here, please take 15 minutes to read through the chain-of-thought examples -- decoding a cypher-text, coding to solve a problem, solving a math problem, solving a crossword puzzle, answering a complex question in English, answering a complex question in Chemistry, etc.

After reading through the examples, I am shocked at how incredibly good the model is (or appears to be) at reasoning: far better than most human beings.

I'm impressed. Congratulations to OpenAI!


Yeah, the chain-of-thought in these is way beyond what prompting can achieve in current models, and the cipher example was very impressive.


> after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users.


"Open"AI is such a comically ironic name at this point.


This also makes them less useful because I can’t just click stop generation when they make a logical error re: coding.


You wouldn't do that to this model. It finds its own mistakes and corrects them as it is thinking through things.


No model is perfect, the less I can see into what it’s “thinking” the less productively I can use it. So much for interpretability.


Saying "competitive advantage" so directly is surprising.

There must be some magic sauce here for guiding LLMs which boosts performance. They must think inspecting a reasonable number of chains would allow others to replicate it.

They call GPT-4 a model, but we don't know if it's really a system that builds in a ton of best practices and secret tactics: prompt expansion, guided CoT, etc. DALL-E was transparent about automatically rewriting prompts to add missing details prior to generation. This and a lot more could all be running under the hood here.


Lame but not atypical of OpenAI. Too bad, but I'm expecting competitors to follow with this sort of implementation and better. Being able to view the "reasoning" process and especially being able to modify it and re-render the answer may be faster than editing your prompt a few times until you get the desired response, if you even manage to do that.


We're not going to give you training data... for a better user experience.


That naming scheme...

Will the next model be named "1k", so that the subsequent models will be named "4o1k", and we can all go into retirement?


More like you will need to dip into your 401k fund early to pay for it after they raise the prices.


oh wow, something you can roughly model as a diy in a base model. so impressive. yawn.

at least NVDA should benefit. i guess.


If there's a way to do something like this with Llama I'd love to hear about it (not being sarcastic)


nurture the model, have patience, and keep a couple of bash scripts handy


But what does that mean? I can't do "pip install nurture" or "pip install patience". I can generate a bunch of answers and take the consensus, but we've been able to do that for years. I can do fine-tuning or DPO, but on what?
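
If it helps, the closest concrete recipe I know of for a local Llama is plain self-consistency: sample the same question many times at nonzero temperature and majority-vote the final answers. A sketch against an OpenAI-compatible local server follows; the ollama URL, model name, and answer-marker convention are my assumptions, so adjust for your own setup.

    from collections import Counter
    from openai import OpenAI

    # ollama, llama.cpp's server, and vLLM all expose an OpenAI-compatible endpoint
    client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

    def self_consistent_answer(question: str, n: int = 16) -> str:
        answers = []
        for _ in range(n):
            out = client.chat.completions.create(
                model="llama3.1",
                messages=[{"role": "user",
                           "content": f"{question}\nThink step by step, then end with 'ANSWER: <answer>'."}],
                temperature=0.8,   # nonzero so the samples actually differ
            ).choices[0].message.content
            if "ANSWER:" in out:
                answers.append(out.rsplit("ANSWER:", 1)[1].strip())
        # Majority vote over the extracted final answers
        return Counter(answers).most_common(1)[0][0] if answers else ""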


you want instructions on how to compete with OpenAI?

go play more. your priorities and your focus on it being work are making you think this is harder than it is, and the models can even tell you this.

you don’t have to like the answer, but take it seriously, and you might come back and like it quite a bit.

you have to have patience because you likely won't have scale - but it is not just patience with the response time.


I have also heard they are launching an AI called Strawberry. If you pay attention, there is a specific reason they named it that: if you ask GPT-4o how many r's are in the word "strawberry", it will answer 2, and to this day it still gives the same answer. The model is not able to reason, which is why a reasoning model is being launched. This is one of the reasons, among many others.
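
(The underlying issue is tokenization: the model sees tokens, not individual letters. The check itself is trivial outside the model:)

    print("strawberry".count("r"))  # 3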


I tested o1-preview on some coding stuff I've been using gpt-4o for. I am not impressed. The new, more intentional chain of thought logic is apparently not something it can meaningfully apply to a non-trivial codebase.

Sadly I think this OpenAI announcement is hot air. I am now (unfortunately) much less enthusiastic about upcoming OpenAI announcements. This is the first one that has been extremely underwhelming (though the big announcement about structured responses - months after it had already been supported nearly identically via JSONSchema - was in hindsight also hot air).

I think OpenAI is making the same mistake Google made with the search interface. Rather than considering it a command line to be mastered, Google optimized to generate better results for someone who had no mastery of how to type a search phrase.

Similarly, OpenAI is optimizing for someone who doesn't know how to interact with a context-limited LLM. Sure it helps the low end, but based on my initial testing this is not going to be helpful to anyone who had already come to understand how to create good prompts.

What is needed is the ability for the LLM to create a useful, ongoing meta-context for the conversation so that it doesn't make stupid mistakes and omissions. I was really hoping OpenAI would have something like this ready for use.


I have tested o1-preview on a couple of coding tasks and I am impressed.

I am looking at a TypeScript project with quite an amount of type gymnastics, and a particular line of code would not validate with tsc no matter what I tried. I copy-pasted the whole context into o1-preview and it told me the error I was likely seeing (a spot-on, letter-for-letter correct error message, including my variable names), explained the problem, and provided two solutions, both of which immediately worked.

In another test I pasted a smart contract in Solidity and naively asked it to identify vulnerabilities. It thought for more than a minute and then provided a detailed report of what could go wrong, going much, much deeper than any previous model. (No vulnerabilities found, because my code is perfect, but that's another story.)


Your case would be more convincing by an example.

Though o1 did fail at the puzzle in my profile.

Maybe it's just tougher than even I, its author, had assumed...



