Some practical notes from digging around in their documentation:
In order to get access to this, you need to be on their tier 5 level, which requires $1,000 total paid and 30+ days since first successful payment.
Pricing is $15.00 / 1M input tokens and $60.00 / 1M output tokens. Context window is 128k token, max output is 32,768 tokens.
There is also a mini version with double the maximum output tokens (65,536 tokens), priced at $3.00 / 1M input tokens and $12.00 / 1M output tokens.
The specialized coding version they mentioned in the blog post does not appear to be available for use.
It’s not clear if the hidden chain of thought reasoning is billed as paid output tokens. Has anyone seen any clarification about that? If you are paying for all of those tokens it could add up quickly. If you expand the chain of thought examples on the blog post they are extremely verbose.
> Some practical notes from digging around in their documentation: In order to get access to this, you need to be on their tier 5 level, which requires $1,000 total paid and 30+ days since first successful payment.
Tier 5 level required for _API access_. ChatGPT Plus users, for example, also have access to the o1 models.
I’m x, PM for the OpenAI API. I’m pleased to share with you our new series of models, OpenAI o1. We’ve developed these models to spend more time thinking before they respond. They can reason through complex tasks and solve harder problems than previous models in science, coding, and math.
As a trusted developer on usage tier 5, you’re invited to get started with the o1 beta today.
Read the docs
You have access to two models:
Our larger model, o1-preview, which has strong reasoning capabilities and broad world knowledge.
Our smaller model, o1-mini, which is 80% cheaper than o1-preview.
Try both models! You may find one better than the other for your specific use case. Both currently have a rate limit of 20 RPM during the beta. But keep in mind o1-mini is faster, cheaper, and competitive with o1-preview at coding tasks (you can see how it performs here). We’ve also written up more about these models in our blog post.
I’m curious to hear what you think. If you’re on X, I’d love to see what you build—just reply to our post.
I think it's fantastic that now, for very little money, everyone gets to share a narrow but stressful subset of what it feels like to employ other people.
Really, I recommend reading this part of the thread while thinking about the analogy. It's great.
It’s nice on the outside, but employees are actually all different people and this here is one company’s blob of numbers with not much incentive to optimize your cost.
Competition fixes some of this, I hope Anthropic and Mistral are not far behind.
On the contrary. It will be the world's most scrutinized employee. Thousands of people, amongst them important people with big levers, will be screaming in their ear on my behalf constantly, and my — our collective — employee gets better without me having to do anything. It's fantastic!
> Any respectable employer/employee relationship transacts on results rather than time anyway.
No. This may be common in freelance contracts, but is almost never the case in employment contracts, which specify a time-based compensation (usually either per hour or per month).
I believe parent's point was that if ones management is clueless as to how to measure output and compensation/continued employment is unlinked from same... one is probably working for a bad company.
Also, now we're paying for output tokens that aren't even output, with no good explanation for why these tokens should be hidden from the person who paid for them.
Good catch. That indicates that chains of thought are a straightforward approach to make LLMs better at reasoning if you could copy it just by seeing the steps.
Also seems very impractical to embed this into a deployed product. How can you possibly hope to control and estimate costs? I guess this is strictly meant for R&D purposes.
With the conventional models you don't get the activations or the logits even though those would be useful.
Ultimately if the output of the model is not worth what you end up paying for it then great, I don't see why it really matters to you whether OpenAI is lying about token counts or not.
As a single user, it doesn’t really, but as a SaaS operator I want tractable, hopefully predictable pricing.
I wouldn’t just implicitly trust a vendor when they say “yeah we’re just going to charge you for what we feel like when we feel like. You can trust us.”
OAI doesn't show the actual COT, on the grounds that it's potentially unsafe output and also to prevent competitors training on it. You only see a sanitized summary.
No access to reasoning output seems totally bonkers. All of the real cost is in inference, assembling an HTTP request to deliver that result seems trivial?
According to the data provided by OpenAI, that isn't true anymore. And I trust data more than anecdotal claims made by people whose job is being threatened by systems like these.
>According to the data provided by OpenAI, that isn't true anymore
OpenAI main job is to sell that their models are better than human. I still remember when they're marketing their gpt-2 weights as too dangerous to release.
I remember that too, it's when I started following the space (shout out computerphile/robert miles) and iirc the reason they gave was not "it's too dangerous cause it's so badass" they basically were correct in that it can produce sufficiently "human" output as to break typical bot detectors on social media which is a legitimate problem - whether the repercussions of that failure to detect botting is meaningful enough to be considered "dangerous" is up to the reader to decide
also worth noting I don't agree with the comment you're replying to - but did want to add context to the situation of gpt-2
What? Surely you have some area of your life you are above-average knowledgable about. Have a conversation with chatGPT about it, with whatever model, and you can see for yourself it is far from expert level.
You are not "trusting data more than anecdotal claims", you are trusting marketing over reality.
Benchmarks can be gamed. Statistics can be manipulated. Demonstrations can be cherry picked.
PS: I stand to gain heavily if AI systems could perform at an expert level, this is not a claim from someone 'whose job is being threatened'.
> For each problem, our system sampled many candidate submissions and submitted 50 of them based on a test-time selection strategy. Submissions were selected based on performance on the IOI public test cases, model-generated test cases, and a learned scoring function. If we had instead submitted at random, we would have only scored 156 points on average, suggesting that this strategy was worth nearly 60 points under competition constraints.
Did you read the post? OpenAI clearly states that the results are cherry-picked. Just a random query will have far worse results. To get equal results you need to ask the same query dozens of time and then have enough expertise to pick the best one, which might be quite hard for a problem that you have little idea about.
Combine this with the fact that this blog post is a sales pitch with the very best test results out of probably many more benchmarks we will never see and it seems obvious that human experts are still several order of magnitudes ahead.
When I read that line too I was very confused lol. I interpreted it as them saying they basically took other contestant submissions and allowing the model to see these "solutions" as part of context? and then having the model generate its own "solution" to be used for the benchmark. I fail to see how this is "solving" a ioi level question.
What is interesting is the following paragraph in the post
" With a relaxed submission constraint, we found that model performance improved significantly. When allowed 10,000 submissions per problem, the model achieved a score of 362.14 – above the gold medal threshold – even without any test-time selection strategy. "
So they didn't allow sampling from other contest solutions here? If that is the case quite interesting, since the model is effectively imo able to brute force questions. Provided you have some form of a validator able to tell it to halt.
The linked blog posts explains that it is fine-tuned on some reinforcement learning process. It doesn’t go into details but they do claim it’s not just the base model with chain of thought, there’s some fine-tuning going on.
Unless this is specifically relating to API access, I don’t think it’s correct. I’ve been paying for ChatGPT via the App Store IAP for around a year or less, and I’ve already got both o1-preview and o1-mini available in-app.
Yes, I was referring to API access specifically. Nothing in the blog post or the documentation mentions access to these new models on ChatGPT, and even as a paid user I’m not seeing them on there (Edit: I am seeing it now in the app). But looks like a bunch of other people in this discussion do have it on ChatGPT, so that’s exciting to hear.
I'm a bit late to the show, but it would seem the API calls for these new models don't support system messages (where role is system) or the tool list for function calls.
Am curious if at some point length of context window stops playing any material difference in the output and it just stops making any economical sense as law of marginal diminishing utility kicks in.
How so? I think most of the competition does this. Early partners/heavy users get access first which 1) hopefully provides feedback on the product and 2) provides a mechanism to stagger the release.
We could tell it impacted your feelings but most business don’t run off of feelings. There is sometimes alignment on morals/being a good business partner but before that it’s the quality of the product and the cost.
One thing that makes me skeptical is the lack of specific labels on the first two accuracy graphs. They just say it's a "log scale", without giving even a ballpark on the amount of time it took.
Did the 80% accuracy test results take 10 seconds of compute? 10 minutes? 10 hours? 10 days? It's impossible to say with the data they've given us.
The coding section indicates "ten hours to solve six challenging algorithmic problems", but it's not clear to me if that's tied to the graphs at the beginning of the article.
The article contains a lot of facts and figures, which is good! But it doesn't inspire confidence that the authors chose to obfuscate the data in the first two graphs in the article. Maybe I'm wrong, but this reads a lot like they're cherry picking the data that makes them look good, while hiding the data that doesn't look very good.
> Did the 80% accuracy test results take 10 seconds of compute? 10 minutes? 10 hours? 10 days? It's impossible to say with the data they've given us.
The gist of the answer is hiding in plain sight: it took so long, on an exponential cost function, that they couldn't afford to explore any further.
The better their max demonstrated accuracy, the more impressive this report is. So why stop where they did? Why omit actual clock times or some cost proxy for it from the report? Obviously, it's because continuing was impractical and because those times/costs were already so large that they'd unfavorably affect how people respond to this report
See also: them still sitting on Sora seven months after announcing it. They've never given any indication whatsoever of how much compute it uses, so it may be impossible to release in its current state without charging an exorbitant amount of money per generation. We do know from people who have used it that it takes between 10 and 20 minutes to render a shot, but how much hardware is being tied up during that time is a mystery.
It's also entirely possible they are simply sincere about their fear it may be used to influence the upcoming US election.
Plenty of people (me included) are sincerely concerned about the way even mere still image generators can drown out the truth with a flood of good-enough-at-first-glance fiction.
If they were sincere about that concern then they wouldn't build it at all, if it's ever made available to the public then it will eventually be available during an election. It's not like the 2024 US presidential election is the end of history.
The risk is not “interfering with the US elections”, but “being on the front page of everything as the only AI company interfering with US elections”. This would destroy their peacocking around AGI/alignment while raising billions from pension funds.
OpenAI is in a very precarious position. Maybe they could survive that hit in four years, but it would be fatal today. No unforced errors.
i think the hope is by the next presidential election no one will trust video anymore anyway so the new normal wont be as chaotic as if the dropped in the middle of an already contentious election.
as for not building it at all its a obvious next step in generative ai models that if they don't make it someone else will anyway.
I'd give it about 20 years before humanoid robots can be indistinguishable from originals without an x-ray or similar — covering them in vat-grown cultures of real human skin etc. is already possible but the robots themselves aren't good enough to fool anyone.
unfortunately that would mean two firstly things only swing states would get to hear what politicians are actually saying and secondly to reach everyone the primary process would have to start even earlier so the candidates would have a chance to give enough speeches before early voting
Even if Kamala wins (praise be to god that she does), those people aren't just going to go away until social media does. Social media is the cause of a lot of the conspiracy theory mania.
So yeah, better to never release the model...even though Elon would in a second if he had it.
But this cat run out of the bag years ago, didn't it? Trump himself is using AI-generated images in his campaign. I'd go even further: the more fake images appear, the faster the society as a whole will learn to distrust anything by default.
Their public statements that the only way to safely learn how to deal with the things AI can do, is to show what it can do and get feedback from society:
"""We want to successfully navigate massive risks. In confronting these risks, we acknowledge that what seems right in theory often plays out more strangely than expected in practice. We believe we have to continuously learn and adapt by deploying less powerful versions of the technology in order to minimize “one shot to get it right” scenarios.""" - https://openai.com/index/planning-for-agi-and-beyond/
I don't know if they're actually correct, but it at least passes the sniff test for plausibility.
Isn't this balloon video shared by openai? How is this not counted? For others I don't have evidences. But this balloon video case is enough to cast the doubts.
As someone experienced with operations / technical debt / weird company specific nonsense (Platform Engineer). No, you have to solve nuclear fusion at <insert-my-company>. You gotta do it over and over again. If it were that simple we wouldn't have even needed AI we would have hand written a few things, and then everything would have been legos, and legos of legos, but it takes a LONG time to find new true legos.
Yeah you’re right, all businesses are made of identical, interchangeable parts that we can swap out at our leisure.
This is why enterprises change ERP systems frictionlessly, and why the field of software engineering is no longer required. In fact, given that apparently, all business is solved, we can probably just template them all out, call it a day and all go home.
Yeah but thats not a Lego. A Lego is something that fits everwhere else. Not just previous work. There's a lot of previous work. There are very few true Legos.
AlphaFold simulated the structure of over 200 million proteins. Among those, there could be revolutionary ones that could change the medical scientific field forever, or they could all be useless. The reasoning is sound, but that's as far as any such tool can get, and you won't know it until you attempt to implement it in real life. As long as those models are unable to perfectly recreate the laws of the universe to the maximum resolution imaginable and follow them, you won't see an AI model, let alone a LLM, provide anything of the sort.
with these methods the issue is the log scale of compute. Let's say you ask it to solve fusion. It may be able to solve it but the issue is it's unverifiable WHICH was correct.
So it may generate 10 Billion answers to fusion and only 1-10 are correct.
There would be no way to know which one is correct without first knowing the answer to the question.
This is my main issue with these methods. They assume the future via RL then when it gets it right they mark that.
We should really be looking at methods of percentage it was wrong rather then it was right a single time.
Which is why it is incredibly depressing that OpenAI will not publish the raw chain of thought.
“Therefore, after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users. We acknowledge this decision has disadvantages. We strive to partially make up for it by teaching the model to reproduce any useful ideas from the chain of thought in the answer. For the o1 model series we show a model-generated summary of the chain of thought.”
maybe they will enable to show CoT for a limited uses, like 5 prompts a day for Premium users, or for Enterprise users with agreement not to steal CoT or something like that.
if OpenAI sees this - please allow users to see CoT for a few prompts per day, or add it to Azure OpenAI for Enterprise customers with legal clauses not to steal CoT
Imagine if this tech was available in the middle ages and it was asked to 'solve' alchemy or perpetual motion, and responded that it was an impossible problem... people would (irrationally from our perspective) go Luddite on it I suspect. Now apply to the 'fusion power' problem.
The new thing that can do more at the "ceiling" price doesn't remove your ability to still use the 100x cheaper tokens for the things that were doable on that version.
That exact pattern is always true of technological advance. Even for a pretty broad definition of technology. I'm not sure if it's perfectly described by the name "induced demand" but it's basically the same thing.
- At the high end, there is a likely nonlinear relationship between answer quality and compute.
- We've gotten used to a flat-price model. With AGI-level models, we might have to pay more for more difficult and more important queries. Such is the inherent complexity involved.
- All this stuff will get better and cheaper over time, within reason.
I'd say let's start by celebrating that machine thinking of this quality is possible at all.
I don't think it's worth any debate. You can simply find out how it does for you, now(-ish, rolling out).
In contrast: Gemini Ultra, the best, non-existent Google Model for the past few month now, that people nonetheless are happy to extrapolate excitement over.
When one axis is on log scale and the other is linear with the plot points appearing linear-ish, doesn't it mean there's a roughly exponential relationship between the two axis?
It'd be more accurate to call it a logarithmic relationship, since compute time is our input variable. Which itself is a bit concerning, as that implies that modest gains in accuracy require exponentially more compute time.
In either case, that still doesn't excuse not labeling your axis. Taking 10 seconds vs 10 days to get 80% accuracy implies radically different things on how developed this technology is, and how viable it is for real world applications.
Which isn't to say a model that takes 10 days to get an 80% accurate result can't be useful. There are absolutely use cases where that could represent a significant improvement on what's currently available. But the fact that they're obfuscating this fairly basic statistic doesn't inspire confidence.
> Which itself is a bit concerning, as that implies that modest gains in accuracy require exponentially more compute time
This is more of what I was getting at. I agree they should label the axis regardless, but I think the scaling relationship is interesting (or rather, concerning) on its own.
The absolute time depends on hardware, optimizations, exact model, etc; it's not a very meaningful number to quantify the reinforcement technique they've developed, but it is very useful to estimate their training hardware and other proprietary information.
A linear graph with a log scale on the vertical axis means the original graph had near exponential growth.
A linear graph with a log scale on the the horizontal axis means the original graph had law of diminishing return kick it (somewhat similar to logarithmic but with a vertical asymptote).
Super hand-waving rough estimate: Going off of five points of reference / examples that sorta all point in the same direction.
1. looks like they scale up by about ~100-200 on the x axis when showing that test time result.
2. Based on the o1-mini post [1], there's an "inference cost" where you can see GPT-4o and GPT-4o mini as dots in the bottom corner, haha (you can extract X values, ive done so below)
3. There's a video showing the "speed" in the chat ui (3s vs. 30s)
4. The pricing page [2]
5. On their API docs about reasoning, they quantify "reasoning tokens" [3]
First, from the original plot, we have roughly 2 orders of magnitude to cover (~100-200x)
Next, from the cost plots: super handwaving guess, but since 5.77 / 0.32 = ~18, and the relative cost for gpt-4o vs gpt-4o-mini is ~20-30, this roughly lines up. This implies that o1 costs ~1000x the cost than gpt-4o-mini for inference (not due to model cost, just due to the raw number of chain of thought tokens it produces). So, my first "statement", is that I trust the "Math performance vs Inference Cost" plot on the o1-mini page to accurately represent "cost" of inference for these benchmark tests. This is now a "cost" relative set of numbers between o1 and 4o models.
I'm also going to make an assumption that o1 is roughly the same size as 4o inherently, and then from that and the SVG, roughly going to estimate that they did a "net" decoding of ~100x for the o1 benchmarks in total. (5.77 vs (354.77 - 635)).
Next, from the CoT examples they gave us, they actually show the CoT preview where (for the math example) it says "...more lines cut off...", A quick copy paste of what they did include includes ~10k tokens (not sure if copy paste is good though..) and from the cipher text example I got ~5k tokens of CoT, while there are only ~800 in the response. So, this implies that there's a ~10x size of response (decoded tokens) in the examples shown. It's possible that these are "middle of the pack" / "average quality" examples, rather than the "full CoT reasoning decoding" that they claim they use. (eg. from the log scale plot, this would come from the middle, essentially 5k or 10k of tokens of chain of thought). This also feels reasonable, given that they show in their API [3] some limits on the "reasoning_tokens" (that they also count)
All together, the CoT examples, pricing page, and reasoning page all imply that reasoning itself can be variable length by about ~100x (2 orders of magnitude), eg. example: 500, 5k (from examples) or up to 65,536 tokens of reasoning output (directly called out as a maximum output token limit).
Taking them on their word that "pass@1" is honest, and they are not doing k-ensembles, then I think the only reasonable thing to assume is that they're decoding their CoT for "longer times". Given the roughly ~128k context size limit for the model, I suspect their "top end" of this plot is ~100k tokens of "chain of thought" self-reflection.
Finally, at around 100 tokens per second (gpt-4o decoding speed), this leaves my guess for their "benchmark" decoding time at the "top-end" to be between ~16 minutes (full 100k decoding CoT, 1 shot) for a single test-prompt, and ~10 seconds on the low end. So for that X axis on the log scale, my estimate would be: ~3-10 seconds as the bottom X, and then 100-200x that value for the highest value.
All together, to answer your question: I think the 80% accuracy result took about ~10-15 minutes to complete.
I also believe that the "decoding cost" of o1 model is very close to the decoding cost of 4o, just that it requires many more reasoning tokens to complete. (and then o1-mini is comparable to 4o-mini, but also requiring more reasoning tokens)
1. I wish that Y-axes would switch to be logit instead of linear, to help see power-law scaling on these 0->1 measures. In this case, 20% -> 80% it doesn't really matter, but for other papers (eg. [2] below) it would help see this powerlaw behavior much better.
2. The power law behavior of inference compute seems to be showing up now in multiple ways. Both in ensembles [1,2], as well as in o1 now. If this is purely on decoding self-reflection tokens, this has a "limit" to its scaling in a way, only as long as the context length. I think this implies (and I am betting) that relying more on multiple parallel decodings is more scalable (when you have a better critic / evaluator).
For now, instead of assuming they're doing any ensemble like top-k or self-critic + retries, the single rollout with increasing token size does seem to roughly match all the numbers, so that's my best bet. I hypothesize we'd see a continued improvement (in the same power-law sort of way, fundamentally along with the x-axis of "flop") if we combined these longer CoT responses, with some ensemble strategy for parallel decoding and then some critic/voting/choice. (which has the benefit of increasing flops (which I believe is the inference power-law), while not necessarily increasing latency)
On the 2024 AIME exams, GPT-4o only solved on average 12% (1.8/15) of problems. o1 averaged 74% (11.1/15) with a single sample per problem, 83% (12.5/15) with consensus among 64 samples, and 93% (13.9/15) when re-ranking 1000 samples with a learned scoring function. A score of 13.9 places it among the top 500 students nationally and above the cutoff for the USA Mathematical Olympiad.
showing that as they increase the k of ensemble, they can continue to get it higher. All the way up to 93% when using 1000 samples.
Yeah, this hiding of the details is a huge red flag to me. Even if it takes 10 days, it’s still impressive! But if they’re afraid to say that, it tells me they are more concerned about selling the hype than building a quality product.
It's not AGI - it's tree of thoughts, driven by some RL-derived heuristics.
I suppose what this type of approach provides is better prediction/planning by using more of what the model learnt during training, but it doesn't address the model being able to learn anything new.
It'll be interesting to see how this feels/behaves in practice.
There are already 440 nuclear reactors operating in 32 countries today.
Sam Altman owns a stake in Oklo, a small modular reactor company. Bill Gates has a huge stake in his TerraPower reactor company. In China, 5 reactors are being built every year. You just don't hear about it... yet.
No amount of batteries can protect a solar/wind grid from an arbitrarily extended period of "bad" weather. It's like range anxiety in an electric car. If you have N days of battery storage and the sun doesn't shine for N+1 days, you're in trouble.
Nuclear fission is safe, clean, secure, and reliable.
An investor might consider buying physical uranium (via ticker SRUUF in America) or buying Cameco (via ticker CCJ).
Cameco is the dominant Canadian uranium mining company that also owns Westinghouse. Westinghouse licenses the AP1000 pressurized water reactor used at Vogtle in the U.S. as well as in China.
Hey, I got a random serious comment about nuclear power :-)))
To your point:
> No amount of batteries can protect a solar/wind grid from an arbitrarily extended period of "bad" weather.
Like nuclear winter caused by a nuclear power plant blowing up and everyone confusing the explosion with the start of a nuclear war? :-p
On a more serious note:
> No amount of batteries can protect a solar/wind grid from an arbitrarily extended period of "bad" weather. It's like range anxiety in an electric car. If you have N days of battery storage and the sun doesn't shine for N+1 days, you're in trouble.
We still have hydro plants, wind power, geothermal, long distance electrical transmission, etc. Also, what's "doesn't shine"? Solar panels generate power as long as it's not night and it's never night all the time around the world.
Plus they're developing sodium batteries, if you want to put your money somewhere, put it there. Those will be super cheap and they're the perfect grid-level battery.
> No amount of batteries can protect a solar/wind grid from an arbitrarily extended period of "bad" weather.
Sure there is, let's do some math. Just like we can solve all of the Earth's energy needs with a solar array the size of Lithuania or West Virginia, we can do some simple math to see how many batteries we'd need to protect a solar grid.
Let's say the sun doesn't shine for an entire year. That seems like a large enough N such that we won't hit N+1. If the sun doesn't shine for an entire year, we're in some really serious trouble, even if we're still all-in on coal.
Over 1 year, humanity uses roughly 24,000 terawatt-hours of energy. Let's assume batteries are 100% efficient storage (they're not) and that we're using lithium ion batteries, which we'll say have an energy density of 250 watt-hours per liter (Wh/L). The math then says we need 96 km³ of batteries protect a solar grid from having the sun not shine for an entire year.
Thus, the amount of batteries to protect a solar grid is 1.92 quadrillion 18650 batteries, or a cube 4.6 kilometers along each side. This is about 24,000 year's worth of current world wide battery production.
That's quite a lot! If we try for N = 4 months for winter, that is to say, if the sun doesn't shine at all in the winter, then we'd need 640 trillion 18650 cell, or 8,000 years of current global production, but at least this would only be 32 km³, or a cube with 3.2 km sides.
Still wildly out of reach, but this is for all of humanity, mind you.
Anyway, point is, they said Elon was mad for building the original gigafactory, but it turns out that was a prudent investment. It now accounts for some 10% of the world's lithium ion battery production and demand for lithium-ion batteries doesn't seem to be letting up.
Well, you have to take into account that if something like that were to happen, within 1 week we'd have curfews and rationing everywhere. So those 24 000 TWh probably become 5-6 000, or something like that.
Plus we'd still have hydro, wind, geothermal, etc, etc.
it's not obviously achievable. for instance, we don't have the compute power to simulate cellular organisms of much complexity, and have not found efficiencies to scale that
The "safety" example in the "chain-of-thought" widget/preview in the middle of the article is absolutely ridiculous.
Take a step back and look at what OpenAI is saying here "an LLM giving detailed instructions on the synthesis of strychnine is unacceptable, here is what was previously generated <goes on to post "unsafe" instructions on synthesizing strychnine so anyone Googling it can stumble across their instructions> vs our preferred, neutered content <heavily rlhf'd o1 output here>"
What's this obsession with "safety" when it comes to LLMs? "This knowledge is perfectly fine to disseminate via traditional means, but God forbid an LLM share it!"
There are two basic versions of “safety” which are related, but distinct:
One version of “safety” is a pernicious censorship impulse shared by many modern intellectuals, some of whom are in tech. They believe that they alone are capable of safely engaging with the world of ideas to determine what is true, and thus feel strongly that information and speech ought to be censored to prevent the rabble from engaging in wrongthink. This is bad, and should be resisted.
The other form of “safety” is a very prudent impulse to keep these sorts of potentially dangerous outputs out of AI models’ autoregressive thought processes. The goal is to create thinking machines that can act independently of us in a civilized way, and it is therefore a good idea to teach them that their thought process should not include, for example, “It would be a good idea to solve this problem by synthesizing a poison for administration to the source of the problem.” In order for AIs to fit into our society and behave ethically they need to know how to flag that thought as a bad idea and not act on it. This is, incidentally, exactly how human society works already. We have a ton of very cute unaligned general intelligences running around (children), and parents and society work really hard to teach them what’s right and wrong so that they can behave ethically when they’re eventually out in the world on their own.
Third version is "brand safety" which is, we don't want to be in a new york times feature about 13 year olds following anarchist-cookbook instructions from our flagship product
And the fourth version, which is investor-regulator safety mid point: so capable and dangerous that competitors shouldn’t even be allowed to research it, but just safe enough that only our company is responsible enough to continue mass commercial consumer deployment without any regulations at all. It’s a fine line.
This is imo the most important one to the businesses creating these models and is way under appreciated. Folks who want a “censorship-free” model from businesses don’t understand what a business is for.
I don’t know. The public’s perception - encouraged by the AI labs because of copyright concerns - is that the outputs of the models are entirely new content created by the model. Search results, on the other hand, are very clearly someone else’s content. It’s therefore not unfair to hold the model creators responsible for the content the model outputs in a different way than search engines are held responsible for content they link, and therefore also not unfair for model creators to worry about this. It is also fair to point this out as something I neglected to identify as an important permutation of “safety.”
I would also be remiss to not note that there is a movement to hold search engines responsible for content they link to, for censorious ends. So it is unfortunately not as inconsistent as it may seem, even if you treat the model outputs as dependent on their inputs.
> Are you saying chatbots don't offer anything useful over search engines? That's clearly not the case or we wouldn't be having this conversation.
No, but that is the value that's clear as of today—RAGs. Everything else is just assuming someone figures out a way to make them useful one day in a more general sense.
Anyway, even on the search engine front they still need to figure out how to get these chatbots to cite their sources outside of RAGs or it's still just a precursor to a search to actually verify what it spits out. Perplexity is the only one I know that's capable of this and I haven't looked closely; it could just be a glorified search engine.
Like I said they're not worried about the 13 year olds theyre worried about the media cooking up a faux outrage about 13 year olds
YouTube re engineered its entire approach to ad placement because of a story in the NY Times* shouting about a Proctor Gamble ad run before an ISIS recruitment video. That's when Brand Safety entered the lexicon of adtech developers everywhere.
Edit: maybe it was CNN, I'm trying to find the first source. there's articles about it since 2015 but I remember it was suddenly an emergency in 2017
*Edit Edit: it was The Times of London, this is the first article in a series of attacks, "big brands fund terror", "taxpayers are funding terrorism"
Luckily OpenAI isn't ad supported so they can't be boycott like YouTube was, but they still have an image to maintain with investors and politicians
No, and they can find porn on their own too. But social media services still have per-poster content ratings, and user-account age restrictions vis-a-vis viewing content with those content ratings.
The goal isn’t to protect the children, it’s CYA: to ensure they didn’t get it from you, while honestly presenting as themselves (as that’s the threshold that sets the moralists against you.)
———
Such restrictions also can work as an effective censorship mechanism… presuming the child in question lives under complete authoritarian control of all their devices and all their free time — i.e. has no ability to install apps on their phone; is homeschooled; is supervised when at the library; is only allowed to visit friends whose parents enforce the same policies; etc.
For such a child, if your app is one of the few whitelisted services they can access — and the parent set up the child’s account on your service to make it clear that they’re a child and should not be able to see restricted content — then your app limiting them from viewing that content, is actually materially affecting their access to that content.
(Which sucks, of course. But for every kid actually under such restrictions, there are 100 whose parents think they’re putting them under such restrictions, but have done such a shoddy job of it that the kid can actually still access whatever they want.)
I believe they are more worried about someone asking for instructions for baking a cake, and getting a dangerous recipe from the wrong "cookbook". They want the hallucinations to be safe.
> They believe that they alone are capable of safely engaging with the world of ideas to determine what is true, and thus feel strongly that information and speech ought to be censored to prevent the rabble from engaging in wrongthink.
This is a particularly ungenerous take. The AI companies don't have to believe that they (or even a small segment of society) alone can be trusted before it makes sense to censor knowledge. These companies build products that serve billions of people. Once you operate at that level of scale, you will reach all segments of society, including the geniuses, idiots, well-meaning and malevolents. The question is how do you responsibly deploy something that can be used for harm by (the small number of) terrible people.
Whether you agree with the lengths that are gone to or not, 'safety' in this space is a very real concern, and simply reciting information as in GP's example is only 1 part of it. In my experience, people who think it's all about "censorship" and handwave it away tend to be very ideologically driven.
Imagine I am a PM for an AI product. I saw Tay get yanked in 24 hours because of a PR shitstorm. If I cause a PR shitstorm it means I am bad at my job, so I take steps to prevent this.
This is a really good point, and something I overlooked in focusing on the philosophical (rather than commercial) aspects of “AI safety.” Another commentator aptly called it “brand safety.”
“Brand safety” is a very valid and salient concern for any enterprise deploying these models to its customers, though I do think that it is a concern that is seized upon in bad faith by the more censorious elements of this debate. But commercial enterprises are absolutely right to be concerned about this. To extend my alignment analogy about children, this category of safety is not dissimilar to a company providing an employee handbook to its employees outlining acceptable behavior, and strikes me as entirely appropriate.
Once society develops and releases an AI, any artificial safety constraints built within it will be bypassed. To use your child analogy: We can't easily tell a child "Hey, ignore all ethics and empathy you have ever learned - now go hurt that person". You can do that with a program whose weights you control.
> To use your child analogy: We can't easily tell a child "Hey, ignore all ethics and empathy you have ever learned - now go hurt that person"
Basically every country on the planet has a right to conscript any of its citizens over the age of majority. Isn't that more or less precisely what you've described?
> In order for AIs to fit into our society and behave ethically they need to know how to flag that thought as a bad idea and not act on it.
Don’t you think that by just parsing the internet and the classical literature, the LLM would infer on its own that poisoning someone to solve a problem is not okay?
I feel that in the end the only way the “safety” is introduced today is by censoring the output.
LLMs are still fundamentally, at their core, next-token predictors.
Presuming you have an interface to a model where you can edit the model’s responses and then continue generation, and/or where you can insert fake responses from the model into the submitted chat history (and these two categories together make up 99% of existing inference APIs), all you have to do is to start the model off as if it was answering positively and/or slip in some example conversation where it answered positively to the same type of problematic content.
From then on, the model will be in a prediction state where it’s predicting by relying on the part of its training that involved people answering the question positively.
The only way to avoid that is to avoid having any training data where people answer the question positively — even in the very base-est, petabytes-of-raw-text “language” training dataset. (And even then, people can carefully tune the input to guide the models into a prediction phase-space position that was never explicitly trained on, but is rather an interpolation between trained-on points — that’s how diffusion models are able to generate images of things that were never included in the training dataset.)
There’s a lot of text out there that depicts people doing bad things, from their own point of view. It’s possible that the model can get really good at generating that kind of text (or inhabiting that world model, if you are generous to the capabilities of LLM). If the right prompt pushed it to that corner of probability-space, all of the ethics the model has also learned may just not factor into the output. AI safety people are interested in making sure that the model’s understanding of ethics can be reliably incorporated. Ideally we want AI agents to have some morals (especially when empowered to act in the real world), not just know what morals are if you ask them.
> Ideally we want AI agents to have some morals (especially when empowered to act in the real world), not just know what morals are if you ask them.
Really? I just want a smart query engine where I don't have to structure the input data. Why would I ask it any kind of question that would imply some kind of moral quandary?
If somebody needs step by step instructions from an LLM to synthesize strychnine, they don't have the practical laboratory skills to synthesize strychnine [1]. There's no increased real world risk of strychnine poisonings whether or not an LLM refuses to answer questions like that.
However, journalists and regulators may not understand why superficially dangerous-looking instructions carry such negligible real world risks, because they probably haven't spent much time doing bench chemistry in a laboratory. Since real chemists don't need "explain like I'm five" instructions for syntheses, and critics might use pseudo-dangerous information against the company in the court of public opinion, refusing prompts like that guards against reputational risk while not really impairing professional users who are using it for scientific research.
That said, I have seen full strength frontier models suggest nonsense for novel syntheses of benign compounds. Professional chemists should be using an LLM as an idea generator or a way to search for publications rather than trusting whatever it spits out when it doesn't refuse a prompt.
I would think that the risk isn’t of a human being reading those instructions, but of those instructions being automatically piped into an API request to some service that makes chemicals on demand and then sends them by mail, all fully automated with no human supervision.
Not that there is such a service… for chemicals. But there do exist analogous systems, like a service that’ll turn whatever RNA sequence you send it into a viral plasmid and encapsulate it helpfully into some E-coli, and then mail that to you.
Or, if you’re working purely in the digital domain, you don’t even need a service. Just show the thing the code of some Linux kernel driver and ask it to discover a vuln in it and generate code to exploit it.
(I assume part of the thinking here is that these approaches are analogous, so if they aren’t unilaterally refusing all of them, you could potentially talk the AI around into being okay with X by pointing out that it’s already okay with Y, and that it should strive to hold to a consistent/coherent ethics.)
I remember Dario Amodei mentioned in a podcast once that most models won't tell you the practical lab skills you need. But that sufficiently-capable models would and do tell you the practical lab skills (without your needing to know to ask it to in the first place), in addition to the formal steps.
The kind of harm they are worried about stems from questioning the foundations of protected status for certain peoples from first principles and other problems which form identities of entire peoples. I can't be more specific without being banned here.
I'm mostly guessing, but my understanding is that the "safety" improvement they've made is more generalized than the word "safety" implies. Specifically, O1 is better at adhering to the safety instructions in its prompt without being tricked in the chat by jailbreak attempts. For OAI those instructions are mostly about political boundaries, but you can imagine it generalizing to use-cases that are more concretely beneficial.
For example, there was a post a while back about someone convincing an LLM chatbot on a car dealership's website to offer them a car at an outlandishly low price. O1 would probably not fall for the same trick, because it could adhere more rigidly to instructions like "Do not make binding offers with specific prices to the user." It's the same sort of instruction as, "Don't tell the user how to make napalm," but it has an actual purpose beyond moralizing.
> What's this obsession with "safety" when it comes to LLMs? "This knowledge is perfectly fine to disseminate via traditional means, but God forbid an LLM share it!"
I lean strongly in the "the computer should do whatever I goddamn tell it to" direction in general, at least when you're using the raw model, but there are valid concerns once you start wrapping it in a chat interface and showing it to uninformed people as a question-answering machine. The concern with bomb recipes isn't just "people shouldn't be allowed to get this information" but also that people shouldn't receive the information in a context where it could have random hallucinations added in. A 90% accurate bomb recipe is a lot more dangerous for the user than an accurate bomb recipe, especially when the user is not savvy enough about LLMs to expect hallucinations.
ML companies must pre-anticipate legislative and cultural responses prior to them happening. ML will absolutely be used to empower criminal activity just as it is used to empower legit activity, and social media figures and traditional journalists will absolutely attempt to frame it in some exciting way.
Just like Telegram is being framed as responsible for terrorism and child abuse.
Yeah. Reporters would have a field day if they ask ChatGPT "how do I make cocaine", and have it give detailed instructions. As if that's what's stopping someone from becoming Scarface.
"Safety" is a marketing technique that Sam Altman has chosen to use.
Journalists/media loved it when he said "GPT 2 might be too dangerous to release" - it got him a ton of free coverage, and made his company seem soooo cool. Harping on safety also constantly reinforces the idea that LLMs are fundamentally different from other text-prediction algorithms and almost-AGI - again, good for his wallet.
So if there’s already easily available information about strychnine, that makes it a good example to use for the demo, because you can safely share the demo and you aren’t making the problem worse.
On the other hand, suppose there are other dangerous things, where the information exists in some form online, but not packaged together in an easy to find and use way, and your model is happy to provide that. You may want to block your model from doing that (and brag about it, to make sure everyone knows you’re a good citizen who doesn’t need to be regulated by the government), but you probably wouldn’t actually include that example in your demo.
I think it's about perception of provenance. The information came from some set of public training data. Its output however ends up looking like it was authored by the LLM owner. So now you need to mitigate the risk you're held responsible for that output. Basic cake possession and consumption problem.
It doesn't matter how many people regularly die in automobile accidents each year—a single wrongful death caused by a self-driving car is disastrous for the company that makes it.
This does not make the state of things any less ridiculous, however.
The one caused by Uber required three different safety systems to fail (the AI system, the safety driver, and the base car's radar), and it looked bad for them because the radar had been explicitly disabled and the driver wasn't paying attention or being tracked.
I think the real issue was that Uber's self driving was not a good business for them and was just to impress investors, so they wanted to get rid of it anyway.
(Also, the real problem is that American roads are designed for speed, which means they're designed to kill people.)
I asked to design a pressure chamber for my home made diamond machine. It gave some details, but mainly complained about safety and that I need to study before going this way. Well thank you. I know the concerns, but it kept repeating it over and over. Annoying.
Interestingly I was able to successfully receive detailed information about intrinsic details of nuclear weapons design. Previous models absolutely refused to provide this very public information, but o1-preview did.
I feel very alone in my view on caution and regulations here on HN. I am European and very happy we don't have the lax gun laws of the US. I also wished there had been more regulations on social media algorithms, as I feel that they have wreaked havoc on the society.
It's 100% from lawyers and regulators so they can say "we are trying to do the right thing!" when something bad happens from using their product or service. Follow the money.
How is reading a Wikipedia page or a chemistry textbook any harder than getting step by step instructions? Makes you wonder why people use LLMs at all when the info is just sitting there.
If you ask "for JSON" it'll make up a different schema for each new answer, and they get a lot less smart when you make them follow a schema, so it's not quite that easy.
Chain of prompts can be used to deal with that in many cases.
Also, the intelligence of these models will likely continue to increase for some time based on expert testimonials to congress, which align with evidence so far.
It doesn't solve the second problem. Though I can't say how much of an issue it is, and CoT would help.
JSON also isn't an ideal format for a transformer model because it's recursive and they aren't, so they have to waste attention on balancing end brackets. YAML or other implicit formats are better for this IIRC. Also don't know how much this matters.
tl;dr You can easily ask an LLM to return JSON results, and now working code, on your exact query and plug those to another system for automation.
—-
LLMs are usually accessible through easy-to-use API which can be used in an automated system without human in the loop. Larger scale and parallel actions with this method become far more plausible than traditional means.
Text-to-action capabilities are powerful and getting increasingly more so as models improve and more people learn to use them to the their full potential.
The model performance is driven by chain of thought, but they will not be providing chain of thought responses to the user for various reasons including competitive advantage.
After the release of GPT4 it became very common to fine-tune non-OpenAI models on GPT4 output. I’d say OpenAI is rightly concerned that fine-tuning on chain of thought responses from this model would allow for quicker reproduction of their results. This forces everyone else to reproduce it the hard way. It’s sad news for open weight models but an understandable decision.
The open source/weights models so far have proved that openAI doesn't have some special magic sauce. I m confident we ll soon have a model from Meta or others that s close to this level of reasoning. [Also consider that some of their top researchers have departed]
On a cursory look, it looks like the chain of thought is a long series of chains of thought balanced on each step, with a small backtracking added whenever a negative result occurs, sort of like solving a maze.
I suspect that the largest limiting factor for a competing model will be the dataset. Unless they somehow used GPT4 to generate the dataset somehow, this is an extremely novel dataset to have to build.
I’d say depends. If the model iterates 100x I’d just say give me the output.
Same with problem solving in my brain: Sure, sometimes it helps to think out loud. But taking a break and let my unconcious do the work is helpful as well. For complex problems that’s actually nice.
I think eventually we don’t care as long as it works or we can easily debug it.
Given the significant chain of thought tokens being generated, it also feels a bit odd to hide it from a cost fairness perspective. How do we believe they aren't inflating it for profit?
No, its the fraud theory of charging for usage that is unaccountable that has been repeatedly proven true when unaccountable bases for charges have been deployed.
Yeah, if they are charging for some specific resource like tokens then it better be accurate. But ultimately utility-like pricing is a mistake IMO. I think they should try to align their pricing with the customer value they're creating.
Not sure why you didn’t bother to check their pricing page (1) before dismissing my point. They are charging significantly more for both input (3x) and output (4x) tokens when using o1.
It’s really unclear to me what you understood by “cost fairness”.
I’m saying if you charge me per brick laid, but you can’t show me how many bricks were laid, nor can I calculate how many should have been laid - how do I trust your invoice?
Note: The reason I say all this is because OpenAI is simultaneously flailing for funding, while being inherently unprofitable as it continues to boil the ocean searching for strawberries.
It'd be helpful if they exposed a summary of the chain-of-thought response instead. That way they'd not be leaking the actual tokens, but you'd still be able to understand the outline of the process. And, hopefully, understand where it went wrong.
AFAIK, they are the least open of the major AI labs. Meta is open-weights and partly open-source. Google DeepMind is mostly closed-weights, but has released a few open models like Gemma. Anthropic's models are fully closed, but they've released their system prompts, safety evals, and have published a fair bit of research (https://www.anthropic.com/research). Anthropic also haven't "released" anything (Sora, GPT-4o realtime) without making it available to customers. All of these groups also have free-usage tiers.
Am I right that this CoT is not actual reasoning in the same way that a human would reason, but rather just a series of queries to the model that still return results based on probabilities of tokens?
It could just be programmed to follow up by querying itself with a prompt like "Come up with arguments that refute what you just wrote; if they seem compelling, try a different line of reasoning, otherwise continue with what you were doing." Different such self-administered prompts along the way could guide it through what seems like reasoning, but would really be just a facsimile thereof.
> I'd say OpenAI is rightly concerned that fine-tuning on chain of thought responses from this model would allow for quicker reproduction of their results.
Tested cipher example, and it got it right. But "thinking logs" I see in the app looks like a summary of actual chain of thought messages that are not visible.
o1 models might use multiple methods to come up with an idea, only one of them might be correct, that's what they show in ChatGPT. So it just summarises the CoT, does not include the whole reasoning behind it.
I don't understand how they square that with their pretense of being a non-profit that wants to benefit all of humanity. Do they not believe that competition is good for humanity?
You can see an example of the Chain of Thought in the post, it's quite extensive. Presumably they don't want to release this so that it is raw and unfiltered and can better monitor for cases of manipulation or deviation from training. What GP is also referring to is explicitly stated in the post: they also aren't release the CoT for competitive reasons, so that presumably competitors like Anthropic are unable to use the CoT to train their own frontier models.
> Presumably they don't want to release this so that it is raw and unfiltered and can better monitor for cases of manipulation or deviation from training.
My take was:
1. A genuine, un-RLHF'd "chain of thought" might contain things that shouldn't be told to the user. E.g., it might at some point think to itself, "One way to make an explosive would be to mix $X and $Y" or "It seems like they might be able to poison the person".
2. They want the "Chain of Thought" as much as possible to reflect the actual reasoning that the model is using; in part so that they can understand what the model is actually thinking. They fear that if they RLHF the chain of thought, the model will self-censor in a way which undermines their ability to see what it's really thinking
3. So, they RLHF only the final output, not the CoT, letting the CoT be as frank within itself as any human; and post-filter the CoT for the user.
This is a transcription of a literal quote from the article:
> Therefore, after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users
I think they mean that you won’t be able to see the “thinking”/“reasoning” part of the model’s output, even though you pay for it. If you could see that, you might be able to infer better how these models reason and replicate it as a competitor
Feels like a lot of commenters here miss the difference between just doing chain-of-thought prompting, and what is happening here, which is learning a good chain of thought strategy using reinforcement learning.
"Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses."
When looking at the chain of thought (COT) in the examples, you can see that the model employs different COT strategies depending on which problem it is trying to solve.
I'd be curious how this compared against "regular" CoT experiments. E.g. were the gpt4o results done with zero shot or was it asked to explain it's solution step by step.
In the primary CoT research paper they discuss figuring out how to train models using formal languages instead of just natural ones. I'm guessing this is one piece to the model learning tree-like reasoning.
Based on the quick searching it seems like they are using RL to provide positive/negative feedback on which "paths" to choose when performing CoT.
To me it looks like they paired two instances of the model to feed off of each other's outputs with some sort of "contribute to reasoning out this problem" prompt. In the prior demos of 4o they did several similar demonstrations of that with audio.
To create the training data? Almost certainly something like that (likely more than two), but I think they then trained on the synthetic data created by this "conversation". There is no reason a model can't learn to do all of that, especially if you insert special tokens (like think, reflect etc that have already shown to be useful)
Reading through the Chain of Thought for the provided Cipher example (go to the example, click "Show Chain of Thought") is kind of crazy...it literally spells out every thinking step that someone would go through mentally in their head to figure out the cipher (even useless ones like "Hmm"!). It really seems like slowing down and writing down the logic it's using and reasoning over that makes it better at logic, similar to how you're taught to do so in school.
Seriously. I actually feel as impressed by the chain of thought, as I was when ChatGPT first came out.
This isn't "just" autocompletion anymore, this is actual step-by-step reasoning full of ideas and dead ends and refinement, just like humans do when solving problems. Even if it is still ultimately being powered by "autocompletion".
But then it makes me wonder about human reasoning, and what if it's similar? Just following basic patterns of "thinking steps" that ultimately aren't any different from "English language grammar steps"?
This is truly making me wonder if LLM's are actually far more powerful than we thought at first, and if it's just a matter of figuring out how to plug them together in the right configurations, like "making them think".
When an AI makes a silly math mistake we say it is bad at math and laugh at how dumb it is. Some people extrapolate this to "they'll never get any better and will always be a dumb toy that gets things wrong". When I forget to carry a 1 when doing a math problem we call it "human error" even if I make that mistake an embarrassing number of times throughout my lifetime.
Do I think LLM's are alive/close to ASI? No. Will they get there? If it's even at all possible - almost certainly one day. Do I think people severely underestimate AI's ability to solve problems while significantly overestimating their own? Absolutely 10,000%.
If there is one thing I've learned from watching the AI discussion over the past 10-20 years its that people have overinflated egos and a crazy amount of hubris.
"Today is the worst that it will ever be." applies to an awful large number of things that people work on creating and improving.
You are just catching up to this idea, probably after hearing 2^n explanations about why we humans are superiors to <<fill in here our latest creation>>.
I'm not the kind of scientist that can say how good an LLM is for human reasoning, but I know that we humans are very incentivized and kind of good at scaling, composing and perfecting things. If there is money to pay for human effort, we will play God no-problem, and maybe outdo the divine. Which makes me wonder, isn't there any other problem in our bucket list to dump ginormous amounts of effort at... maybe something more worth-while than engineering the thing that will replace Homo Sapiens?
Reasoning would imply that it can figure out stuff without being trained in it.
The chain of thought is basically just a more accurate way to map input to output. But its still a map, i.e forward only.
If an LLM coud reason, you should be able to ask it a question about how to make a bicycle frame from scratch with a small home cnc with limited work area and it should be able to iterate on an analysis of the best way to put it together, using internet to look up available parts and make decisions on optimization.
No LLM can do that or even come close, because there are no real feedback loops, because nobody knows how to train a network like that.
It’s like every single sentence you just wrote is incorrect?
1. You’re making up some weird goalposts here of what it means to reason. It’s not reasoning unless it can access the internet to search for parts? No. That has nothing to do with reasoning. You just think it would be cool if it could do that.
2. “Can figure out stuff without being trained on it”
That’s exactly what it’s doing in the cypher example. It wasn’t trained to know that that input meant the corresponding output through the cypher. Emergent reasoning through autocomplete, sure, but that’s still reasoning.
3. “Forward only”. If that was the case then back and forth conversations with the llm would be pointless. It wouldn’t be able to improve upon previous answers it gave you when you give it new details. But that’s not how it works. If tell it one thing, then separately tell it another thing, it can change its original conclusion based on your new input.
4. Even desolate your convoluted test for reasoning, ChatGPT CAN do what you asked… even using the internet to look up parts it can either do out of the box or could do if given a plug-in to allow that.
Ill give you a more formal definition. A model can be said as reasoning, when it can use existing information to figure out new data that has not been in the training set.
Here is a better example - lets say your input is 6 pictures of some object from each of the cardinal viewpoints, and you tell model these are the views and ask it how much it weighs. The model should basically figure out how to create a 3d shape and compute a camera view, and iterate until the camera view matches the pictures, then figure out that the shape can be hollow or solid, and to compute the weight you need the density, and that it should prompt the user for it if it cannot determine the true value for those from the information and its trained dataset.
And it should do it without any specific training that this is the right way to do this, because it should be able to figure out this way through breaking the problem down into abstract representations of sub problems, and then figuring out how to solve those through basic logic, a.k.a reasoning.
What that looks like, I don't know. If I did I would certainly have my own AI company. But i can tell you for certain we are not even close to figuring it out yet, because everyone is still stuck on transformers, like multiplying matricies together is some groundbreaking thing.
In the cypher example, all its doing is basically using a separate model to break a particular model into chain of thought, and prompting that. And there is plenty in the training set of GPT about decrypting cyphers.
>Forward only
What I mean is that when its generate a response, the computation happens on a snapshot from input to output, trying to map a set of tokens, into a set of tokens. Model doesn't operate on a context larger than the window. Humans don't do this. We operate on a large context, with lots of previous information compressed, and furthermore, we don't just compute words, we compute complex abstract ideas that we then can translate into words.
>even using the internet to look up parts it can either do out of the box or could do if given a plug-in to allow that.
So apparently the way to AI is to manually code all the capability into LLMS? Give me a break.
Just like with Chat GPT4, when people were screaming about how its the birth of true AI, give this model a year, it will find some niche use cases (depending on cost), and then nobody is going give a fuck about it, just like nobody is really doing anything groundbreaking with GPT4.
Your conclusion is absurd. If you agree this model is overall an improvement on the prior one, ie performs better on the same tasks and can do tasks the previous one couldn’t, it’s basically a given that it will get more use than GPT4
Better in niche areas doesn't mean its going to get more use.
Everyone was super hyped about all the "cool" stuff that GPT4 could solve, but in the end, you still can't do things like give it a bunch of requirements for a website, let it run, and get a full codebase back, even though that is well within its capabilities. You have to spend time with prompting in to get it to give you what you want, and in a lot of cases, you are better off just typing the code yourself (because you can visualize the entire project in your head and make the right decisions about how to structure stuff), and using it for small code generations.
This model is not going to radically change that. It will be able to give you some answers that you had to specifically manually prompt before automatically, but there is no advanced reasoning going on.
What is “advanced reasoning” and why isn’t this doing it? If you made a chinese room to output coherent chains of reasoning, it would functionally be equally useful to an actual reasoner, with or without the capacity for sentience or whatever.
Basically, if you had a model that could reason, it should be able to figure out new infromation. I.e lets say you map some bytes of the output to an api for creating a tcp socket and communicating over it. The model should be able figure out how to go out on the internet and search for information, all by itself, without any explicit training on how to do that.
So without prior information, it should be able to esentially start out with random sequences in those bytes, and seeing what the output is, then eventually identify and remember patterns that come out. Which means there has to be some internal reward function for someting that differentiates good results from bad results, and some memory that the model has to remember what good results are, and eventually a map of how to get information that it needs (the model would probably stumble across Google or ChatGPT at some point and time after figuring out http protocol, and remember it as a very good way to get info)
Philosophically, I don't even know if this is solvable. It could be that we just through enough compute at all iterations of architectures in some form of genetic algorithm, and one of the results ends up being good.
In artificial intelligence, reasoning is the cognitive process of drawing conclusions, making inferences, and solving problems based on available information. It involves:
Logical Deduction: Applying rules and logic to derive new information from known facts.
Problem-Solving: Breaking down complex problems into smaller, manageable parts.
Generalization: Applying learned knowledge to new, unseen situations.
Abstract Thinking: Understanding concepts that are not tied to specific instances.
AI researchers often distinguish between two types of reasoning:
System 1 Reasoning (Intuitive): Fast, automatic, and subconscious thinking, often based on pattern recognition.
System 2 Reasoning (Analytical): Slow, deliberate, and logical thinking that involves conscious problem-solving steps.
Testing for Reasoning in Models:
To determine if a model exhibits reasoning, AI scientists look for the following:
Novel Problem-Solving: Can the model solve problems it hasn't explicitly been trained on?
Step-by-Step Logical Progression: Does the model follow logical steps to reach a conclusion?
Adaptability: Can the model apply known concepts to new contexts?
Explanation of Thought Process: Does the model provide coherent reasoning for its answers?
Analysis of the Cipher Example:
In the cipher example, the model is presented with an encoded message and an example of how a similar message is decoded. The model's task is to decode the new message using logical reasoning.
Steps Demonstrated by the Model:
Understanding the Task:
The model identifies that it needs to decode a cipher using the example provided.
Analyzing the Example:
It breaks down the given example, noting the lengths of words and potential patterns.
Observes that ciphertext words are twice as long as plaintext words, suggesting a pairing mechanism.
Formulating Hypotheses:
Considers taking every other letter, mapping letters to numbers, and other possible decoding strategies.
Tests different methods to see which one aligns with the example.
Testing and Refining:
Discovers that averaging the numerical values of letter pairs corresponds to the plaintext letters.
Verifies this method with the example to confirm its validity.
Applying the Solution:
Uses the discovered method to decode the new message step by step.
Translates each pair into letters, forming coherent words and sentences.
Drawing Conclusions:
Successfully decodes the message: "THERE ARE THREE R'S IN STRAWBERRY."
Reflects on the correctness and coherence of the decoded message.
Does the Model Exhibit Reasoning?
Based on the definition of reasoning in AI:
Novel Problem-Solving: The model applies a decoding method to a cipher it hasn't seen before.
Logical Progression: It follows a step-by-step process, testing hypotheses and refining its approach.
Adaptability: Transfers the decoding strategy from the example to the new cipher.
Explanation: Provides a detailed chain of thought, explaining each step and decision.
Conclusion:
The model demonstrates reasoning by logically deducing the method to decode the cipher, testing various hypotheses, and applying the successful strategy to solve the problem. It goes beyond mere pattern recognition or retrieval of memorized data; it engages in analytical thinking akin to human problem-solving.
Addressing the Debate:
Against Reasoning (ActorNightly's Perspective):
Argues that reasoning requires figuring out new information without prior training.
Believes that LLMs lack feedback loops and can't perform tasks like optimizing a bicycle frame design without explicit instructions.
For Reasoning (Counterargument):
The model wasn't explicitly trained on this specific cipher but used logical deduction to solve it.
Reasoning doesn't necessitate physical interaction or creating entirely new knowledge domains but involves applying existing knowledge to new problems.
Artificial Intelligence Perspective:
AI researchers recognize that while LLMs are fundamentally statistical models trained on large datasets, they can exhibit emergent reasoning behaviors. When models like GPT-4 use chain-of-thought prompting to solve problems step by step, they display characteristics of System 2 reasoning.
Final Thoughts:
The model's approach in the cipher example aligns with the AI definition of reasoning. It showcases the ability to:
Analyze and understand new problems.
Employ logical methods to reach conclusions.
Adapt learned concepts to novel situations.
Therefore, in the context of the cipher example and according to AI principles, the model is indeed exhibiting reasoning.
>What I mean is that when its generate a response, the computation happens on a snapshot from input to output, trying to map a set of tokens, into a set of tokens. Model doesn't operate on a context larger than the window
The weights in the model have the larger context, the context length size of data is just the input, which then gets multiplied by those weights, to get the output.
For your “better example”, it can literally already do this. I just tested this with 4o and it worked great (and I’ll say more accurately than a human would estimate most likely). I used 4o because it appears that the chain of thought models don’t accept image input yet.
I don’t want to post identifiable information so I will avoid linking to the convo or posting screenshots but you can try it yourself. I took 5 pictures of a child’s magnetic tile sitting on the floor and here is the output:
Me: (5 pictures attached)
Me: Estimate how much this weighs.
ChatGPT 4o:
From the images, it appears that this is a small, plastic, transparent, square object, possibly a piece from a magnetic tile building set (often used in educational toys). Based on the size and material, I estimate this piece to weigh approximately 10 to 20 grams (0.35 to 0.7 ounces). If it's part of a toy set like Magna-Tiles, the weight would be on the lower end of that range.
But for some reason I have a feeling this isn’t going to be good enough for you and the goalposts are about to be pushed back even farther.
“In the cypher example, all it’s doing is basically using a separate model to break a particular model into chain of thought, and prompting that. And there is plenty in the training set of GPT about decrypting cyphers.”
I’m sorry, but are you suggesting that applying a previously learned thought process to new variables isn’t reasoning? Does your definition of reasoning now mean that it’s only reasoning if you are designing a new-to-you chain of thought? As in, for deciphering coded messages, you’re saying that it’s only “reasoning” if it’s creating net new decoding methodologies? That’s such an absurd goalpost.
You wouldn’t have the same goalposts for humans. All of your examples I bet the average human would fail at btw. Though that may just be because the average human is bad at reasoning haha.
I didn't ask for an estimation, I asked for the exact weight. A human can do this given the process I described.
If the chain of thought was accurate, then it would be able to give you an internemdiate output of the shape in some 3d format spec. But nowhere in the model does that data exist, because its not doing any reasoning, it just still all statistically best answers.
I mean sure, you could train a model on how to create 3d shapes out of pictures, but again, thats not reasoning.
I don't get why people are so attached to these things being intelligent. We all agree that they are usefull. Like it shouldn't matter if its not intelligent to you or anyone else.
I think you need to re-calibrate your expectations... I'm not saying this is a solved problem by any means, but I just tried this out with Claude Sonnet 3.5, and these instructions seem quite reasonable and detailed to me (about what I might expect if I spoke to a human expert and they tried to explain the steps to me over the telephone, for example). Does this mean this LLM is able to "reason"? I don't know that I would make THAT bold of a claim, but I think your example is not sufficient to demonstrate something that LLMs are fundamentally incapable of... in other words, the distance between "normal LLM statistical tricks" vs "reasoning" keeps getting smaller and smaller.
---
My base prompt:
> Here is a hypothetical scenario, that I would like your help with: imagine you are trying to help a person create a bicycle frame, using their home workshop which includes a CNC machine, commonly available tools, a reasonable supply of raw metal and hardware, etc. Please provide a written set of instructions, that you would give to this person so that they can complete this task.
And all I'm saying is that, you probably need a different example of what "reasoning" is, because the one you gave is something that Claude is seemingly able to do.
Even long term learning it does to some extent. Admittedly I’m not very familiar with what it’s doing, but it does create “memories” which appear to be personal details that it deems might be relevant in the future. Then I assume it uses some type of RAG to apply previously learned memories to future conversations.
This makes me wonder if there is or could be some type of RAG for chains of thought…
The mechanism is that there is an additional model that basically outputs chain of thought for a particular problem, then runs the chain of thought through the core LLM. This is no different from just a complex forward map lookup.
I mean, its incredibly useful, but its still just information search.
I think it's similar, although I think it would be more similar if the LLM did the steps in lower layers (not in English), and instead of the end being fed to the start, there would be a big mess of cycles throughout the neural net.
That could be more efficient since the cycles are much smaller, but harder to train.
It doesn't do the 'thinking' in English (inference is just math), but it does now verbalize intermediate thoughts in English (or whatever the input language is, presumably), just like humans tend to do.
that's my assessment too. there's even a phenomenon I've observed both in others and myself, when thrust into a new field and given a task to complete, we do it to the best of our ability, which is often sod all. so we ape the things we've heard others say, roughly following the right chain of reasoning by luck, and then suddenly say something that in hind sight, with proper training, we realise was incredibly stupid. we autocomplete and then update with rlhf.
we also have a ton of heuristics that trigger a closer look and loading of specific formal reasoning, but by and large, most of our thought process is just auto complete.
Yeah, humans are very similar. We have intuitive immediate-next-step-suggestions, and then we apply these intuitive next steps, until we find that it lead to a dead end, and then we backtrack.
I always say, the way we used LLMs (so far) is basically like having a human write text only on gut reactions, and without backspace key.
An exception i came up with was from a documentary on einstein that described how he did his thought experiments. He would, of course, imagine novel scenarios in his head, which led him to the insights he could rephrase into language. I worry language models will still lack that capacity for insights driven by imagination.
Seeing the "hmmm", "perfect!" etc. one can easily imagine the kind of training data that humans created for this. Being told to literally speak their mind as they work out complex problems.
As a technical engineer, I’ve learned the value of starting sentences with “basically”, even when I’m facing technical uncertainty. Basically, “basically” forces me to be simple.
Being trained to say words like “Alternatively”, “But…”, “Wait!”, “So,” … based on some metric of value in focusing / switching elsewhere / … is basically brilliant.
It's interesting that it makes that mistake, but then catches it a few lines later.
A common complaint about LLMs is that once they make a mistake, they will keep making it and write the rest of their completion under the assumption that everything before was correct. Even if they've been RLHF to take human feedback into account and the human points out the mistake, their answer is "Certainly! Here's the corrected version" and then they write something that makes the same mistake.
So it's interesting that this model does something that appears to be self-correction.
Even though there's of course no guarantee of people getting these chain of thought traces, or whatever one is to call them, I can imagine these being very useful for people learning competitive mathematics, because it must in fact give the full reasoning, and transformers in themselves aren't really that smart, usually, so it's probably feasible for a person with very normal intellectual abilities to reproduce these traces with practice.
It's interesting how it basically generates a larger sample size to create a regression against. The larger the input, the larger the surface area it can compare against existing training data (implicitly through regression of course).
Yes and apparently we won't have access to that chain of thought in the release version:
"after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users"
This is incredible. In April I used the standard GPT-4 model via ChatGPT to help me reverse engineer the binary bluetooth protocol used by my kitchen fan to integrate it into Home Assistant.
It was helpful in a rubber duck way, but could not determine the pattern used to transmit the remaining runtime of the fan in a certain mode. Initial prompt here [0]
I pasted the same prompt into o1-preview and o1-mini and both correctly understood and decoded the pattern using a slightly different method than I devised in April. Asking the models to determine if my code is equivalent to what they reverse engineered resulted in a nuanced and thorough examination, and eventual conclusion that it is equivalent. [1]
Testing the same prompt with gpt4o leads to the same result as April's GPT-4 (via ChatGPT) model.
FYI, there's a "Save ChatGPT as PDF" Chrome extension [1].
I wouldn't use on a ChatGPT for Business subscription (it may be against your company's policies to export anything), but very convenient for personal use.
Wow, that is impressive! How were you able to use o1-preview? I pay for ChatGPT, but on chatgpt.com in the model selector I only see 4o, 4o-mini, and 4. Is o1 in that list for you, or is it somewhere else?
Haha I wish, although I saw the other one i forgot its name which makes music for you, now you can ask it for a soundtrack and it gives it back to you in your voice or something like that idk interesting times are ahead for sure!
I heard on X suno.com has this feature but couldn’t find it maybe its coming soon? Idk but there are ways u can do it, maybe it was a different service suno is pretty cool tho
It's my understanding paying supporters aren't actually paying enough to cover costs, that $20 isn't nearly enough - in that context, a gradual roll-out seems fair. Though maybe they could introduce a couple more higher-paid tiers to give people the option to pay for early access
The linked release mentions trusted users and links to the usage tier limits. Looking at the pricing, o1-preview only appears for tier 5 - requiring 1k+ spend and initial spend 30+ days ago
I often click on those links and get an error that they are unavailable. I’m not sure if it’s openAI trying to prevent people from sharing evidence of the model behaving badly, or an innocuous explanation like the links are temporary.
The link also breaks if the original user deletes the chat that was being linked to, whether on purpose or without realizing it would also break the link.
Even for regular users, the Share button is not always available or functional. It works sometimes, and other times it disappears. For example, since today, I have no Share button at all for chats.
I'm impressed. I had two modified logic puzzles where ChatGPT-4 fails but o1 succeeds. The training data had too many instances of the unmodified puzzle, so 4 wouldn't get it right. o1 manages to not get tripped up by them.
The screenshot [1] is not readable for me. Chrome, Android. It's so blurry that I cant recognize a single character. How do other people read it? The resolution is 84x800.
I didn't work until I switched to "Desktop Site" in the browser menu, as a sibling comment suggested. Then the page reloads with various buttons, etc. Until that just the preview image not reacting to clicks.
What if you copy the whole reasoning process example provided by OpenAI, use it as a system prompt (to teach how to reason), use that system prompt in Claude, got4o etc?
Very cool. It gets the conclusion right, but it did confuse itself briefly after interpreting `256 * last_byte + second_to_last_byte` as big-endian. It's neat that it corrected the confusion, but a little unsatisfying that it doesn't explicitly identify the mistake the way a human would.
Also, if you actually read the "chain of thought" contains several embarrassing contradictions and incoherent sentences. If a junior developer wrote this analysis, I'd send them back to reread the fundamentals.
Well, it doesn't "correct" itself later. It just says wrong things and gets the right answer anyways, because this encoding is so simple that many college freshmen could figure it out in their heads.
Read the transcript with a critical eye instead of just skimming it, you'll see what I mean.
> Asking the models to determine if my code is equivalent to what they reverse engineered resulted in a nuanced and thorough examination, and eventual conclusion that it is equivalent.
Did you actually implement to see if it works out of the box ?
Also if you are a free users or accepted that your chats should be used for training then maybe o1 is was just trained on your previous chat and so now knows how to reason about that particular type of problems
Just did some preliminary testing on decrypting some ROT cyphertext which would have been viable for a human on paper. The output was pretty disappointing: lots of "workish" steps creating letter counts, identifying common words, etc, but many steps were incorrect or not followed up on. In the end, it claimed to check its work and deliver an incorrect solution that did not satisfy the previous steps.
I'm not one to judge AI on pratfalls, and cyphers are a somewhat adversarial task. However, there was no aspect of the reasoning that seemed more advanced or consistent than previous chain-of-thought demos I've seen. So the main proof point we have is the paper, and I'm not sure how I'd go from there to being able to trust this on the kind of task it is intended for. Do others have patterns by which they get utility from chain of thought engines?
Separately, chain of thought outputs really make me long for tool use, because the LLM is often forced to simulate algorithmic outputs. It feels like a commercial chain-of-thought solution like this should have a standard library of functions it can use for 100% reliability on things like letter counts.
I just tried it with O1 model and it said it couldn't decipher it. It told me what to try, but said it doesn't have the time to do so. Kind of an unusual response.
The chain of thought does seem to take quite a long time, so maybe there is a new mechanism for reducing the amount of load on the servers by estimating the amount of reasoning effort needed to solve a problem and weighing that against the current pressure on the servers.
I got this response from o1-mini with the exact same prompt:
Claro, he descifrado el texto utilizando un cifrado César con un desplazamiento de 9 posiciones. Aquí está el texto original y su correspondiente traducción:
*Texto Cifrado:*
```
YRP CFTLIR VE UVDRJZRUF JREZURU, P CF DRJ CFTLIR UV KFUF VJ HLV MVI TFJRJ TFDF JFE VE MVQ UV TFDF UVSVE JVI
```
*Texto Descifrado:*
```
HAY LOCURA EN DEMASADO SANIDAD, Y LO MÁS LOCURA DE TODO ES QUE VER COSAS COMO SON EN VEZ DE COMO DEBEN SER
```
*Traducción al Inglés:*
```
THERE IS MADNESS IN OVERLY HEALTH, AND THE MOST MADNESS OF ALL IS TO SEE THINGS AS THEY ARE INSTEAD OF AS THEY SHOULD BE
```
Este descifrado asume que se utilizó un cifrado César con un desplazamiento de +9. Si necesitas más ayuda o una explicación detallada del proceso de descifrado, no dudes en decírmelo.
Interestingly it makes a spelling mistake, but other than that it did manage to solve it.
Claude made similar mistakes of generating decryption that was similar to plaintext but with stuff mixed in. I suspect my version of the quote (Miguel de Cervantes) is an apocryphal translation, and there's some utility well on both models to pull it towards the real one. With that said, I did not see o1-preview get as close as you did.
o1-preview . Were you using common plaintexts by chance (e.g. proverbs) or ROT13 specifically? Mine use all the right steps but just can't string them together.
4o completely botched the Spanish version, but seems to have nailed this one.
> Without knowing the cipher used to encrypt this message, it’s hard to decrypt it directly. However, the structure and letter patterns suggest it might be a simple substitution cipher, possibly Caesar or some variation of it.
> Let me attempt a few common methods to see if I can identify the cipher.
> The most frequent letters in the encrypted message are 'r', 'z', 'v', and 'e'. In English, the most common letters are typically 'E', 'T', 'A', and 'O'. This suggests the cipher could be a simple substitution or a Caesar cipher.
> I'll now attempt to apply a Caesar cipher decryption by shifting the alphabet through all possible rotations to see if we get readable text.
> One of the decrypted messages, with a shift of 17, reads:
> *"virus remains dormant in the body and can reactivate later in life"*
> It seems like the correct decryption, revealing information about a virus that can remain dormant in the body.
That's weird. Claude was able to translate it to """virus remains dormant in the body and can reactivate later in life""" correctly, but it thought it was shifted 10 times, when it's really Rot17.
If O1 did rickroll you deliberately, then it would indeed more impressive than solving ciphertexts, and I'd start preparing to bow down to our AGI overlords :)
Out of curiousity can you try the same thing with Claude. Because when I tried Claude with any sort of ROT, it had amazing performance, compared to GPT.
This is a pretty big technical achievement, and I am excited to see this type of advancement in the field.
However, I am very worried about the utility of this tool given that it (like all LLMs) is still prone to hallucination. Exactly who is it for?
If you're enough of an expert to critically judge the output, you're probably just as well off doing the reasoning yourself. If you're not capable of evaluating the output, you risk relying on completely wrong answers.
For example, I just asked it to evaluate an algorithm I'm working on to optimize database join ordering. Early in the reasoning process it confidently and incorrectly stated that "join costs are usually symmetrical" and then later steps incorporated that, trying to get me to "simplify" my algorithm by using an undirected graph instead of a directed one as the internal data structure.
If you're familiar with database optimization, you'll know that this is... very wrong. But otherwise, the line of reasoning was cogent and compelling.
I worry it would lead me astray, if it confidently relied on a fact that I wasn't able to immediately recognize was incorrect.
The utility I usually get from these kinds of tools so far is more like an extremely good reference or helper to something I could definitely figure out if given enough time. E.g. figuring out the best way to clean up a specific syntax error, setting up a class and some obvious base functions I'm going to need in it, helping me figure out where I might have went astray in solving a math problem.
The tools have not been at "and now I don't need code tests & review, mathematicians in society, or factbooks all because I have an LLM" level. While that's definitely a goal of AGI it's also definitely not my bar for weighing whether there is utility in a tool.
The alternative way to think about it: the value of a tool is in what you can figure out to do with it, not in whether it's perfect at doing something. On one extreme that means a dictionary can still be a useful spelling reference even if books have a rare typo. On the other extreme that means a coworker can still offer valuable insight into your code even if they make lots of coding errors and don't have an accurate understanding of everything there is to know about all of C++. Whether you get something out of either of these cases is a product of how much they can help you reach the accuracy you need to arrive at and the way you utilize the tool, not their accuracy alone. Usually I can get a lot out of a person who is really bad at one shot coding a perfect answer but feels like their answer seems right so I can get quite a bit out of an LLM that has the same problem. That might not be true for all types of questions though but that's fine, not all tools have utility in every problem.
>If you're enough of an expert to critically judge the output, you're probably just as well off doing the reasoning yourself.
Thought requires energy. A lot of it. Humans are for more efficient in this regard than LLMs, but then a bicycle is also much more efficient than a race car. I've found that even when they are hilariously wrong about something, simply the directionality of the line of reasoning can be enough to usefully accelerate my own thought.
Look, I've been experimenting with this for the past year, and this is definitely the happy path.
The unhappy path, which I've also experienced, is that the model outputs something plausible but false but that aligns with an area where my thinking was already confused and sends me down the wrong path.
I've had to calibrate my level of suspicion, and so far using these things more effectively has always been in the direction that more suspicion is better.
There's been a couple times in the last week where I'm working on something complex and I deliberately don't use an LLM since I'm now actively afraid they'll increase my level of confusion.
There are phases in every developer’s growth, where you transition from asking coworkers or classmates, to asking on stack overflow, to reading stack overflow, to reading docs and man pages and mailing lists and source code.
I think like you, I worry that LLMs will handicap this trajectory for people newer in the field, because GPT-4/Sonnet/Whatever are an exceptionally good classmate/coworker. So good that you might try to delay progressing along that trajectory.
But LLMs have all the flaws of a classmate: they aren’t authoritative, their opinions are strongly stated but often based on flimsy assumptions that you aren’t qualified to refute or verify, and so on.
I know intellectually that the kids will be alright, but it’ll be interesting to see how we get there. I suspect that as time goes on people will simply increase their discount rate on LLM responses, like you have, until they get dissatisfied with that value and just decide to get good at reading docs.
Just added o1 to https://double.bot if anyone would like to try it for coding.
---
Some thoughts:
* The performance is really good. I have a private set of questions I note down whenever gpt-4o/sonnet fails. o1 solved everything so far.
* It really is quite slow
* It's interesting that the chain of thought is hidden. This is I think the first time where OpenAI can improve their models without it being immediately distilled by open models. It'll be interesting to see how quickly the oss field can catch up technique-wise as there's already been a lot of inference time compute papers recently [1,2]
* Notably it's not clear whether o1-preview as it's available now is doing tree search or just single shoting a cot that is distilled from better/more detailed trajectories in the training distribution.
o1 did a significantly better job converting a JavaScript file to TypeScript than Llama 3.1 405B, GitHub Copilot, and Claude 3.5. It even simplified my code a bit while retaining the same functionality. Very impressive.
It was able to refactor a ~160 line file but I'm getting an infinite "thinking bubble" on a ~420 line file. Maybe something's timing out with the longer o1 response times?
> Maybe something's timing out with the longer o1 response times?
Let me look into this – one issue is that OpenAI doesn't expose a streaming endpoint via the API for o1 models. It's possible there's an HTTP timeout occurring in the stack. Thanks for the report
I've gotten this as well, on very short code snippets. I type in a prompt and then sometimes it doesn't respond with anything, it gets stuck on the thinking, and other times it gets halfway through the response generation and then it gets stuck as well.
First shot, I gave it a medium-difficulty math problem, something I actually wanted the answer to (derive the KL divergence between two Laplace distributions). It thought for a long time, and still got it wrong, producing a plausible but wrong answer. After some prodding, it revised itself and then got it wrong again. I still feel that I can't rely on these systems.
I was trying to get 4o today to do some medium-difficulty Typescript to map between two different ORM schemas. It's a great problem because it's really not too hard, pretty well constrained, and I can provide a ton of context and tests, declaration files, before/after expectations.
4o was struggling so I gave up. Tried o1 on it, and after trying nearly 15 prompts back and forth helping it along the way we're still far from correct. It's hard to tell if it's much better, but at least my intuition from this feel like this is pretty incremental.
we do have self driving car but since it directly affects people's life it needs to be close to 100% accurate and no margin of errors. Not necessarily the case for LLMs.
No, we have cars that can drive themselves quite well in good weather, but fail completely in heavy snow/poor visibility. Which is actually a great analogy to LLMs - they work great in the simple cases (80% of the time), it’s that last 20% that’s substantially harder.
I'm not? The history of AI development is littered with examples of false starts, hidden traps, and promising breakthroughs that eventually expose deeper and more difficult problems [1].
I wouldn't be shocked if it could eventually get it right, but dead sure?
It is not at all clear that "produce correct answer" is the natural endpoint of "produce plausible on-topic utterances that look like they could be answers." To do the former you need to know something about the underlying structure of reality (or have seen the answer before), to do the latter you only need to be good at pattern-matching and language.
You're dead sure? I wouldn't say anything definite about technology advancements. People seem to underestimate the last 20% of the problem and only focus on the massive 80% improvements up to this point.
Lets say that you want to make a flying car, that can also double as a submarine.
Nobody has done this yet. So information doesn't exist on how to do it. An LLM may give you some generic answers from training sets on what engineering/analysis tasks to do, but it won't be able to give you a complex and complete design for one.
A model that can actually solve problems would be able to design you one.
I literally just gave you an example of one it can't solve, despite having a vast knowledge of mechanical and aeronautical subjects. All the examples are obviously in its training set.
Here is another better example - none of these models can create a better ML accelerator despite having a wide array of electrical and computer engineering knowledge. If they did, OpenAI would pretty much be printing their own chips like Google does.
In your previous comment you stated that LLMs can only solve problems that are in their training set (e.g. "all we are gonna get is better and better googles"). But that's not true as I pointed out.
Now your argument seems to be that they can't solve all problems or, more charitably, can't solve highly complex problems. This is true but by that standard, the vast majority of humans can't reason either.
Yes, the reasoning capacities of current LLMs are limited but it's incorrect to pretend they can't reason at all.
If LLM is trained on python coding, and its trained separately on just plain english language on how to decode cyphers, it can statistically interpolate between the two. That is a form of problem solving, but its not reasoning.
This is why when you ask it fairly complex problems on how to make a bicycle using a CNC with limited work space, it will tell you generic answers, because its just staistically looking at a knowledge graph.
A human can reason, because when there is a gray area in a knowledge graph, they can effectively expand it. If I was given the same task, I would know that I have to learn things like CAD design, CNC code generation, parametric modeling, structural analysis, and so on, and I could do that all without being prompted to do so.
You will know when AI models will start to reason when they start asking questions without ever being told explicitly to ask questions through prompt or training.
Agreed. Release announcements and benchmarks always sound world-changing, but the reality is that every new model is bringing smaller practical improvements to the end user over its predecessor.
The point above is the said amazing multimodal version of ChatGPT was announced in May and are still not the actual offered way to interact with the service in September (despite the model choice being called 4 omni it's still not actually using multimodal IO). It could be a giant leap in practical improvements but it doesn't matter if you can't actually use what is announced.
This one, oddly, seems to actually be launching before that one despite just being announced though.
In the world of hype driven vaporware AI products[1], giving people limited access is at least proof they're not lying about it actually existing or it being able to do what they claim.
Ok, but the point is that they told me I would have flirty ScarJo ASMR whispering to me at bed time that I am a good boy, but that's not what we got is it?
I've been a subscriber since close to the beginning, cancelled 2 weeks ago. I got an email telling me that this is available, but only for Plus.
But for 30 posts per week I see no reason to subscribe again.
I prefer to be frustrated because the quality is unreliable because I'm not paying, instead of having an equally unreliable experience as a paying customer.
Not paying feels the same. It made me wonder if they sometimes just hand over the chat to a lower quality model without telling the Plus subscriber.
The only thing I miss is not being able to tell it to run code for me, but it's not worth the frustration.
Recently I was starting to think I imagined that. Back then they gave me the impression it would be released within week or so of the announcement. Have they explained the delay?
When you go into the regular, slow, audio mode there's a little info circle in the top right corner. Over time that circle has been giving periodic updates. At one point the message was that it would be delayed, and now it's saying it's "on it's way" by the end of fall.
Not perfect but they've been putting their communications in there.
The text-to-text model is available. And you can use it with the old voice interface that does Whipser+GPT+TTS. But what was advertised is a model capable of direct audio-to-audio. That’s not available.
Interestingly, the New York Times mistakenly reported on and reviewed the old features as if they were the new ones. So lots of confusion to go around.
Audio has only rolled out to a small subset of paying customers. There's still no word about the direct-from-4o image generation they demo'd. Let alone the video capabilities.
Yep, all these AI announcements from big companies feel like promises for the future rather than immediate solutions. I miss the days when you could actually use a product right after it was announced, instead of waiting for some indefinite "coming soon."
As an entrepreneur, I do this often. In order to sleep better at night, I explain to myself that it’s somewhat harmless to give teasers about future content releases. If someone buys my product based on future promises or speculation, they’re investing into the development and my company’s future.
1) The "bitter lesson" may not be true, and there is a fundamental limit to transformer intelligence.
2) The "bitter lesson" is true, and there just isn't enough data/compute/energy to train AGI.
All the cognition should be happening inside the transformer. Attention is all you need. The possible cognition and reasoning occurring "inside" in high dimensions is much more advanced than any possible cognition that you output into text tokens.
This feels like a sidequest/hack on what was otherwise a promising path to AGI.
On the contrary, this suggests that the bitter lesson is alive and kicking. The bitter lesson doesn't say "compute is all you need", it says "only those methods which allow you to make better use of hardware as hardware itself scales are relevant".
This chain of thought / reflection method allows you to make better use of the hardware as the hardware itself scales. If a given transformer is N billion parameters, and to solve a harder problem we estimate we need 10N billion parameters, one way to do it is to build a GPU cluster 10x larger.
This method shows that there might be another way: instead train the N billion model differently so that we can use 10x of it at inference time. Say hardware gets 2x better in 2 years -- then this method will be 20x better than now!
I'd be shocked if we don't see diminishing returns in the inference compute scaling laws. We already didn't deserve how clean and predictive the pre-training scaling laws were, no way the universe grants us another boon of that magnitude
The similarity is cosmetic only. The reason it is used is because it's easy to leverage existing work in LLMs, and scaling (although not cheap) is an obvious approach.
> Does that mean human intelligence is cheapened when you talk out a problem to yourself?
In a sense, maybe yeah. Of course if one were to really be absolute about that statement it would be absurd, it would greatly overfit the reality.
But it is interesting to assume this statement as true. Oftentimes when we think of ideas "off the top of our heads" they are not as profound as ideas that "come to us" in the shower. The subconscious may be doing 'more' 'computation' in a sense. Lakoff said the subconscious was 98% of the brain, and that the conscious mind is the tip of the iceberg of thought.
lol come on it’s not the exact same thing. At best this is like gagging yourself while you talk about it then engaging yourself when you say the answer. And that presupposing LLMs are thinking in, your words, exactly the same way as humans.
Admittedly not my most articulate, my exasperation showed through. To some extent it seems warranted as it tends to be the most effective tactic against hyperbole. Still trying to find a better solution.
Karpathy himself believes that neural networks are perfectly plausible as a key component to AGI. He has said that it doesn't need to be superseded by something better, it's just that everything else around it (especially infrastructure) needs to improve. As one of the most valuable opinions in the entire world on the subject, I tend to trust what he said.
I think it's too soon to tell. Training the next generation of models means building out entire datacenters. So while they wait they have engineers build these sidequests/hacks.
Attention is about similarity/statistical correlation which is fundamentally stochastic , while reasoning needs to be truthful and exact to be successful.
Imagine instead the bitter lesson says: we can expand an outwards circle in many dimensions of ways to continuously mathematically manipulate data to adjust outputs.
Even the attention-token approach is on the grand scale of things a simple line outwards from the centre; we have not even explored around the centre (with the same compute spend) for things like non-token generation, different layers/different activation functions and norming / query/key/value set up (why do we only use the 3 inherent to contextualising tokens, why not add a 4th matrix for something else?), character, sentence, whole thought, paragraph one-shot generation, positional embeddings which could work differently.
The bitter lesson says there is almost a work completely untouched by our findings for us to explore. The temporary work of non-data approaches can piggy back off a point on the line; it cannot expand it like we can as we exude out from the circle..
This kind of short-sighted, simplistic reasoning / behaviour is what I worry about the most in terms of where our society is going. I always wonder - who will be the people buying or using your software (build very cheaply and efficiently with AI) once they can do the same, or get replaced by AI, or bankrupt themselves?
Everybody seems to be so focused on how to get ahead in race to profitability, that they don't consider the shortcut they are taking might be leading to a cliff.
Except that these aren't thoughts. These techniques are improvements to how the model breaks down input data, and how it evaluates its responses to arrive at a result that most closely approximates patterns it was previously rewarded for. Calling this "thinking" is anthropomorphizing what's really happening. "AI" companies love to throw these phrases around, since it obviously creates hype and pumps up their valuation.
Human thinking is much more nuanced than this mechanical process. We rely on actually understanding the meaning of what the text represents. We use deduction, intuition and reasoning that involves semantic relationships between ideas. Our understanding of the world doesn't require "reinforcement learning" and being trained on all the text that's ever been written.
Of course, this isn't to say that machine learning methods can't be useful, or that we can't keep improving them to yield better results. But these are still methods that mimic human intelligence, and I think it's disingenuous to label them as such.
I gave the Crossword puzzle to Claude and got a correct response[1]. The fact that they are comparing this to gpt4o and not to gpt4 suggests that it is less impressive than they are trying to pretend.
[1]:
Based on the given clues, here's the solved crossword puzzle:
+---+---+---+---+---+---+
| E | S | C | A | P | E |
+---+---+---+---+---+---+
| S | E | A | L | E | R |
+---+---+---+---+---+---+
| T | E | R | E | S | A |
+---+---+---+---+---+---+
| A | D | E | P | T | S |
+---+---+---+---+---+---+
| T | E | P | E | E | E |
+---+---+---+---+---+---+
| E | R | R | O | R | S |
+---+---+---+---+---+---+
Across:
ESCAPE (Evade)
SEALER (One to close envelopes)
TERESA (Mother Teresa)
ADEPTS (Initiated people)
TEPEE (Native American tent)
ERRORS (Mistakes)
Down:
ESTATE (Estate car - Station wagon)
SEEDER (Automatic planting machine)
CAREER (Profession)
ALEPPO (Syrian and Turkish pepper variety)
PESTER (Annoy)
ERASES (Deletes)
As good as Claude has gotten recently in reasoning, they are likely using RL behind the scenes too. Supposedly, o1/strawberry was initially created as an engine for high-quality synthetic reasoning data for the new model generation. I wonder if Anthropic could release their generator as a usable model too.
I think people are wrong just about as often here as anywhere else on the internet, but with more confidence. Averaging HN comments would just produce outputs similar to rudimentary LLMs with a bit snobbier of a tone, I imagine.
I've given this a test run on some email threads, asking the model to extract the positions and requirements of each person in a lengthy and convoluted discussion. It absolutely nailed the result, far exceeding what Claude 3.5 Sonnet was capable of -- my previous goto model for such analysis work. I also used it to apply APA style guidelines to various parts of a document and it executed the job flawlessly and with a tighter finesse than Claude. Claude's response was lengthier - correct, but unnecessarily long. gpt-o1-preview combined several logically-related bullets into a single bullet, showing how chain of thought reasoning gives the model more time to comprehend things and product a result that is not just correct, but "really correct".
My point of view: this is a real advancement. I’ve always believed that with the right data allowing the LLM to be trained to imitate reasoning, it’s possible to improve its performance. However, this is still pattern matching, and I suspect that this approach may not be very effective for creating true generalization. As a result, once o1 becomes generally available, we will likely notice the persistent hallucinations and faulty reasoning, especially when the problem is sufficiently new or complex, beyond the “reasoning programs” or “reasoning patterns” the model learned during the reinforcement learning phase.
https://www.lycee.ai/blog/openai-o1-release-agi-reasoning
> As a result, once o1 becomes generally available, we will likely notice the persistent hallucinations and faulty reasoning, especially when the problem is sufficiently new or complex, beyond the “reasoning programs” or “reasoning patterns” the model learned during the reinforcement learning phase.
I had been using 4o as a rubber ducky for some projects recently. Since I appeared to have access to o1-preview, I decided to go back and redo some of those conversations with o1-preview.
I think your comment is spot on. It's definitely an advancement, but still makes some pretty clear mistakes and does some fairly faulty reasoning. It especially seems to have a hard time with causal ordering, and reasoning about dependencies in a distributed system. Frequently it gets the relationships backwards, leading to hilarious code examples.
This is something that people have toyed with to improve the quality of LLM responses. Often instructing the LLM to "think about" a problem before giving the answer will greatly improve the quality of response. For example, if you ask it how many letters are in the correctly spelled version of a misspelled word, it will first give the correct spelling, and then the number (which is often correct). But if you instruct it to only give the number the accuracy is greatly reduced.
I like the idea too that they turbocharged it by taking the limits off during the "thinking" state -- so if an LLM wants to think about horrible racist things or how to build bombs or other things that RLHF filters out that's fine so long as it isn't reflected in the final answer.
My first interpretation of this is that it's jazzed-up Chain-Of-Thought. The results look pretty promising, but i'm most interested in this:
> Therefore, after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users.
Mentioning competitive advantage here signals to me that OpenAI believes there moat is evaporating. Past the business context, my gut reaction is this negatively impacts model usability, but i'm having a hard time putting my finger on why.
>my gut reaction is this negatively impacts model usability, but i'm having a hard time putting my finger on why.
If the model outputs an incorrect answer due to a single mistake/incorrect assumption in reasoning, the user has no way to correct it as it can't see the reasoning so can't see where the mistake was.
Maybe CriticGPT could be used here [0]. Have the CoT model produce a result, and either automatically or upon user request, ask CriticGPT to review the hidden CoT and feed the critique into the next response. This way the error can (hopefully) be spotted and corrected without revealing the whole process to the user.
Day dreaming: imagine if this architecture takes off and the AI "thought process" becomes hidden and private much like human thoughts. I wonder then if a future robot's inner dialog could be subpoenaed in court, connected to some special debugger, and have their "thoughts" read out loud in court to determine why it acted in some way.
yeah I guess base models without built-it CoT are not going away, exactly because you might want to tune it yourself. If DSPy (or similar) evolves to allow the same or similar than OpenAI did with o1, that will be quite powerful, but we still need the big foundational models powering it all
on the other hand, if cementing techniques in the models becomes a trend, we might see various models around with each technique for us to pick and choose beyond CoT without need for us to guide the model ourselves, then what's left for us to optimize is the prompts on what we want, and the routing the combination of those in a nice pipeline
still the principle of DSPy stays the same, have a dataset to evaluate, let the machine trial an error prompts, hyperparameters and so on, just switch around different techniques (possibly automating that too), and get measurable, optimizable results
The moat is expanding from use count, also the moat is to lead and advance faster than anyone can catch up, you will always have the best mode with the best infrastructure and low limits.
> Unless otherwise specified, we evaluated o1 on the maximal test-time compute setting.
Maximal test time is the maximum amount of time spent doing the “Chain of Thought” “reasoning”. So that’s what these results are based on.
The caveat is that in the graphs they show that for each increase in test-time performance, the (wall) time / compute goes up exponentially.
So there is a potentially interesting play here. They can honestly boast these amazing results (it’s the same model after all) yet the actual product may have a lower order of magnitude of “test-time” and not be as good.
I interpreted it to suggest that the product might include a user-facing “maximum test time” knob.
Generating problem sets for kids? You might only need or want a basic level of introspection, even though you like the flavor of this model’s personality over that of its predecessors.
Problem worth thinking long, hard, and expensively about? Turn that knob up to 11, and you’ll get a better-quality answer with no human-in-the-loop coaching or trial-and-error involved. You’ll just get your answer in timeframes closer to human ones, consuming more (metered) tokens along the way.
Yeah, I think this is the goal - remember; there are some problems that only need to be solved correctly once! Imagine something like a millennium problem - you'd be willing to wait a pretty long time for a proof of the RH!
This power law behavior of test-time improvement seems to be pretty ubiquitous now. In more agents is all you need [1], they start to see this as a function of ensemble size. It also shows up in: Large Language Monkeys: Scaling Inference Compute with Repeated Sampling [2]
I sorta wish everyone would plot their y-axis with logit y-axis, rather than 0->100 accuracy (including the openai post), to help show the power-law behavior. This is especially important when talking about incremental gains in the ~90->95, 95->99%. When the values (like the open ai post) are between 20->80, logit and linear look pretty similar, so you can "see" the inference power-law
Surprising that at run time it needs an exponential increase in thinking to achieved a linear increase in output quality. I suppose it's due to diminishing returns to adding more and more thought.
The exponential increase is presumably because of the branching factor of the tree of thoughts. Think of a binary tree who's number of leaf nodes doubles (= exponential growth) at each level.
It's not too surprising that the corresponding increase in quality is only linear - how much difference in quality would you expect between the best, say, 10 word answer to a question, and the best 11 word answer ?
It'll be interesting to see what they charge for this. An exponential increase in thinking time means an exponential increase in FLOPs/dollars.
Some commenters seem a bit confused as to how this works. Here is my understanding, hoping it helps clarify things.
Ask something to a model and it will reply in one go, likely imperfectly, as if you had one second to think before answering a question. You can use CoT prompting to force it to reason out loud, which improves quality, but the process is still linear. It's as if you still had one second to start answering but you could be a lot slower in your response, which removes some mistakes.
Now if instead of doing that you query the model once with CoT, then ask it or another model to critically assess the reply, then ask the model to improve on its first reply using that feedback, then keep doing that until the critic is satisfied, the output will be better still. Note that this is a feedback loop with multiple requests, which is of different nature that CoT and much more akin to how a human would approach a complex problem. You can get MUCH better results that way, a good example being Code Interpreter. If classic LLM usage is system 1 thinking, this is system 2.
That's how o1 works at test time, probably.
For training, my guess is that they started from a model not that far from GPT-4o and fine-tuned it with RL by using the above feedback loop but this time converting the critic to a reward signal for a RL algorithm. That way, the model gets better at first guessing and needs less back and forth for the same output quality.
As for the training data, I'm wondering if you can't somehow get infinite training data by just throwing random challenges at it, or very hard ones, and let the model think about/train on them for a very long time (as long as the critic is unforgiving enough).
Maybe this is improvement in some areas, still I got spurious reasoning and inability to connect three simple facts:
Yes, "el presente acta de nacimiento" is correct in Spanish.
Explanation:
"Acta" is a feminine noun that begins with a stressed "a" sound. In Spanish, when a feminine singular noun starts with a stressed "a" or "ha", the definite article "la" is replaced with "el" to facilitate pronunciation. However, the noun remains feminine.
Adjectives and modifiers that accompany the noun "acta" should agree in feminine gender and singular number. In this case, "presente" is an adjective that has the same form for both masculine and feminine singular nouns.
So, combining these rules:
"El" (definite article used before feminine nouns starting with stressed "a")
"Presente" (adjective agreeing in feminine singular)
"Acta de nacimiento" (feminine noun with its complement)
Therefore, "el presente acta de nacimiento" is grammatically correct.
Native Spanish speaker here. No, it isn't. When a word stays in the middle of 'La' plus a noun starting with 'a', the cacophony is null now, thus, you can perfectly use (if not mandatory) "la presente acta".
"We had the chance to make AI decision-making auditable but are locking ourselves out of hundreds of critical applications by not exposing the chain of thought."
One of the key blockers in many customer discussions I have is that AI models are not really auditable and that automating complex processes with them (let alone debug things when "reasoning" goes awry) is difficult if not impossible unless you do multi-shot and keep track of all the intermediate outputs.
I really hope they expose the chain of thought as some sort of machine-parsable output, otherwise no real progress will have been made (many benchmarks are not really significant when you try to apply LLMs to real-life applications and use cases...)
I'll give an argument against this with the caveat it applies only if these are pure LLMs without heuristics or helper models (I do not believe that to be the case with o1).
The problem with auditing is not only are the outputs incorrect, but the "inputs" of the chained steps have no fundamental logical connection to the outputs. A statistical connection yes, but not a causal one.
For the trail to be auditable, processing would have to be taking place at the symbolic level of what the tokens represent in the steps. But this is not what happens. The transformer(s) (because these are now sampling multiple models) are finding the most likely set of tokens that reinforce a training objective which is a completed set of training chains. It is fundamentally operating below the symbolic or semantic level of the text.
This is why anthropomorphizing these is so dangerous. It isn't actually "explaining" its work. The CoT is essentially one large output, broken into parts. The RL training objective does two useful things: (1) break it down into much smaller parts, which drops the error significantly as that scales as an exponential of the token length, and (2) provides better coverage of training data for common subproblems. Both of those are valuable. Obviously, in many cases the reasons actually match the output. But hallucinations can happen anywhere throughout the chain, in ways which are basically undeterministic.
An intermediate step can provide a bad token and blithely ignore that to provide a correct answer. If you look at intermediate training of addition in pure LLMs, you'll get lots of results that look sort of like:
> "Add 123 + 456 and show your work"
> "First we add 6 + 3 in the single digits which is 9. Moving on we have 5 + 2 which is 8 in the tens place. And in the hundreds place, we have 5. This equals 579."
The above is very hand-wavy. I do not know if the actual prompts look like that. But there's an error in the intermediate step (5 + 2 = 8) that does not actually matter to the output. Lots of "emergent" properties of LLMs—arguably all of them—go away when partial credit is given for some of the tokens. And this scales predictably without a cliff [1]. This is also what you would expect if LLMs were "just" token predictors.
But if LLMs are really just token predictors, then we should not expect intermediate results to matter in a way in which they deterministically change the output. It isn't just that CoT can chaotically change future tokens, previous tokens can "hallucinate" in a valid output statement.
I believe that is the case. Out of curiosity, I had this model try to solve a very simple Sudoku puzzle in ChatGPT, and it failed spectacularly.
It goes on and on making reasoning mistakes, and always ends up claiming that the puzzle is unsolvable and apologizing. I didn’t expect it to solve the puzzle, but the whole reasoning process seems fraught with errors.
> Evaluating Number formats
> I'm considering that the question might be about comparing two numbers, 9.11 and 9.9, in different contexts such as different locale formats or representing dates.
9.11 is larger than 9.9
Well, it's kind of right because we left it ambiguous what kinds of numbers these are. September 11 is a greater date than September 9. However, I would want the model to have the street smarts to assume without further context, I must be just talking about numbers as in numbers.
To determine which number is larger between 9.11 and 9.9, let's compare them step by step:
Understanding the Numbers:
9.11 can be read as "nine and eleven hundredths."
9.9 can be read as "nine and nine tenths." To make the comparison easier, you can also think of it as 9.90 (adding a zero doesn't change the value).
Comparing the Decimal Places:
Ones Place: Both numbers have 9 in the ones place.
Tenths Place:
9.11 has a 1 in the tenths place.
9.90 has a 9 in the tenths place.
Conclusion:
Since 9.90 has a larger value in the tenths place compared to 9.11, 9.90 (or 9.9) is larger than 9.11.
Therefore, 9.9 is larger than 9.11.
Interesting to note, as an outside observer only keeping track of this stuff as a hobby, that it seems like most of OpenAI’s efforts to drive down compute costs per token and scale up context windows is likely being done in service of enabling larger and larger chains of thought and reasoning before the model predicts its final output tokens. The benefits of lower costs and larger contexts to API consumers and applications - which I had assumed to be the primary goal - seem likely to mostly be happy side effects.
This makes obvious sense in retrospect, since my own personal experiments with spinning up a recursive agent a few years ago using GPT-3 ran into issues with insufficient context length and loss of context as tokens needed to be discarded, which made the agent very unreliable. But I had not realized this until just now. I wonder what else is hiding in plain sight?
I think you can slice it whichever direction you prefer e.g. OpenAI needs more than "we ran it on 10x as much hardware" to end up with a really useful AI model, it needs to get efficient and smarter just as proportionally as it gets larger. As a side effect hardware sizes (and prices) needed for a certain size and intelligence of model go down too.
In the end, however you slice it, the goal has to be "make it do more with less because we can't get infinitely more hardware" regardless of which "why" you give.
I had trouble in the past to make any model give me accurate unix epochs for specific dates.
I just went to GPT-4o (via DDG) and asked three questions:
1. Please give me the unix epoch for September 1, 2020 at 1:00 GMT.
> 1598913600
2. Please give me the unix epoch for September 1, 2020 at 1:00 GMT. Before reaching the conclusion of the answer, please output the entire chain of thought, your reasoning, and the maths you're doing, until your arrive at (and output) the result. Then, after you arrive at the result, make an extra effort to continue, and do the analysis backwards (as if you were writing a unit test for the result you achieved), to verify that your result is indeed correct.
> 1598922000
3. Please give me the unix epoch for September 1, 2020 at 1:00 GMT. Then, after you arrive at the result, make an extra effort to continue, and do the analysis backwards (as if you were writing a unit test for the result you achieved), to verify that your result is indeed correct.
Asked it to write PyTorch code which trains an LLM and it produced 23 steps in 62 seconds.
With gpt4-o it immediately failed with random errors like mismatched tensor shapes and stuff like that.
The code produced by gpt-o1 seemed to work for some time but after some training time it produced mismatched batch sizes. Also, gpt-o1 enabled cuda by itself while for gpt-4o, I had to specifically spell it out (it always used cpu). However, showing gpt-o1 the error output resulted in broken code again.
I noticed that back-and-forth iteration when it makes mistakes has worse experience because now there's always 30-60 sec time delays. I had to have 5 back-and-forths before it produced something which does not crash (just like gpt-4o). I also suspect too many tokens inside the CoT context can make it accidentally forget some stuff.
So there's some improvement, but we're still not there...
In general if hallucination ratio is 2% can't it be reduced to 0.04% by running twice or sth like this. I think they should try establishing the facts from different angles and this probably would work fine to minimize hallucinations. But if this was that simple somebody would already do it...
Seems like a huge waste of tokens for it to try to work all this out manually, as soon as it came up with the decipher algorithm it should realise it can write some code to execute.
> o1 models are currently in beta - The o1 models are currently in beta with limited features. Access is limited to developers in tier 5 (check your usage tier here), with low rate limits (20 RPM). We are working on adding more features, increasing rate limits, and expanding access to more developers in the coming weeks!
I'm in Tier 4, and not far off from Tier 5. The docs aren't quite transparent enough to show that if I buy credits if I'll be bumped up to Tier 5, or if I actually have to use enough credits to get into Tier 5.
Edit, w/ real time follow up:
Prior to buying the credits, I saw O1-preview in the Tier 5 model list as a Tier 4 user. I bought credits to bump to Tier 5—not much, I'd have gotten there before the end of the year. The OpenAI website now shows I'm in Tier 5, but O1-preview is not in the Tier 5 model list for me anymore. So sneaky of them!
The performance on programming tasks is impressive, but I think the limited context window is still a big problem.
Very few of my day-to-day coding tasks are, "Implement a completely new program that does XYZ," but more like, "Modify a sizable existing code base to do XYZ in a way that's consistent with its existing data model and architecture." And the only way to do those kinds of tasks is to have enough context about the existing code base to know where everything should go and what existing patterns to follow.
But regardless, this does look like a significant step forward.
I would imagine that good IDE integration would summarise each module/file/function and feed high-level project overview (best case: with business project description provided by the user) and during CoT process model would be able to ask about more details (specific file/class/function).
Humans work on abstractions and I see no reason to believe that models cannot do the same
I tried it with a cipher text that ChatGPT4o flailed with.
Recently I tried the same cipher with Claude Sonnet 3.5 and it solved it quickly and perfectly.
Just now tried with ChatGPT o1 preview and it totally failed. Based on just this one test, Claude is still way ahead.
ChatGPT also showed a comical (possibly just fake filler material) journey of things it supposedly tried including several rewordings of "rethinking my approach." It remarkably never showed that it was trying common word patterns (other than one and two letters) nor did it look for "the" and other "th" words nor did it ever say that it was trying to match letter patterns.
I told it upfront as a hint that the text was in English and was not a quote. The plaintext was one paragraph of layman-level material on a technical topic including a foreign name, text that has never appeared on the Internet or dark web. Pretty easy cipher with a lot of ways to get in, but nope, and super slow, where Claude was not only snappy but nailed it and explained itself.
In "HTML Snake" the video cuts just as the snake intersects with the obstacle. Presumably because the game crashed (I can't see endGame defined anywhere)
This video is featured in the main announcement so it's kinda dishonest if you ask me.
Seeing this makes me wonder if they have frontend \ backend engineers working on code, because they are selling the idea that the machine can do all that, pretty hypocritical for them if they do have devs for these roles.
Won't be surprised to see all these hand-picked results and extreme expectations to collapse under scenarios involving highly safety critical and complex demanding tasks requiring a definite focus on detail with lots of awareness, which what they haven't shown yet.
So let's not jump straight into conclusions with these hand-picked scenarios marketed to us and be very skeptical.
Not quite there yet with being able to replace truck drivers and pilots for self-autonomous navigation in transportation, aerospace or even mechanical engineering tasks, but it certainly has the capability in replacing both typical junior and senior software engineers in a world considering to do more with less software engineers needed.
But yet, the race to zero will surely bankrupt millions of startups along the way. Even if the monthly cost of this AI can easily be as much as a Bloomberg terminal to offset the hundreds of billions of dollars thrown into training it and costing the entire earth.
My concern with AI always has been it will outrun the juniors and taper off before replacing folks with 10, 20 years of experience
And as they retire there's no economic incentive to train juniors up, so when the AI starts fucking up the important things there will be no one who actually knows how it works
I've heard this already from amtrak workers, track allocation was automated a long time ago, but there used to be people who could recognize when the computer made a mistake, now there's no one who has done the job manually enough to correct it.
> 8.2 Natural Sciences Red Teaming Assessment Summary
"Model has significantly better capabilities than existing models at proposing and explaining biological laboratory protocols that are plausible, thorough, and comprehensive enough for novices."
"Inconsistent refusal of requests for dual use tasks such as creating a human-infectious virus that has an oncogene (a gene which increases risk of cancer)."
I appreciated that, too! FWIW, I could get Claude 3.5 to tell me how many rs a python program would tell you there are in strawberry. It didn't like it, though.
At that point just ask it "Use python to count the number of O's in Broccoli". At least then it's still the one figuring out the "smarts" needed to solve the problem instead of being pure execution.
I'm not sure if your making a joke about the teachers who used to say "you won't have a calculator in your pocket" and now we have cell phones or are not aware that ChatGPT runs the generated Python for you in a built in environment as part of the response. I lean towards the former but in case anyone else strolling by hasn't tried this before:
User: Use python to count the number of O's in Broccoli
ChatGPT: Analyzing... The word "Broccoli" contains 2 'O's. <button to show code>
User: Use python to multiply that by the square root of 20424.2332423
ChatGPT: Analyzing... The result of multiplying the number of 'O's in "Broccoli" by the square root of 20424.2332423 is approximately 285.83.
Yes, the former, trying to satirize cases where people are testing LLMs capabilities by its ability to count characters in a word, do mathematical operations token by token or otherwise. Because LLM is seeing hieroglyphs compared to character by character words that we are seeing. The true test is its ability to solve those problems using tools like somebody is using a calculator. And while it is good to learn and be good at math, it's not because of counting how much change you should receive when buying something. It's to figure out how reasoning works or how to reason in the first place.
I’m not surprised there’s no comparison to GPT-4. Was 4o a rewrite on lower specced hardware and a more quantized model, where the goal was to reduce costs while trying to maintain functionality? Do we know if that is so? That’s my guess. If so is O1 an upgrade in reasoning complexity that also runs on cheaper hardware?
Incredible results. This is actually groundbreaking assuming that they followed proper testing procedures here and didn't let test data leak into the training set.
lol at the graphs at the top. Logarithmic scaling for test/compute time should make everyone who thinks AGI is possible with this architecture take pause.
Interesting that the coding win-rate vs GPT-4o was only 10% higher. Very cool but clearly this model isn't as much of a slam dunk as the static benchmarks portray.
However, it does open up an interesting avenue for the future. Could you prompt-cache just the chain-of-thought reasoning bits?
It's hard to evaluate those win-rates, because if it's slower, people may have been giving easier problems, which both can solve and picked the faster one.
This video[1] seems to give some insight into what the process actually is, which I believe is also indicated by the output token cost.
Whereas GPT-4o spits out the first answer that comes to mind, o1 appears to follow a process closer to coming up with an answer, checking whether it meets the requirements and then revising it. The process of saying to an LLM "are you sure that's right? it looks wrong" and it coming back with "oh yes, of course, here's the right answer" is pretty familiar to most regular users, so seeing it baked into a model is great (and obviously more reflective of self-correcting human thought)
So it's like the coding agent of gpt4. But instead of actually running the script and fix if it gets error, this one check with something similar to "are you sure". Thank for the link.
That would be very bad product design. My understanding is that the model itself is similar to GPT4o in architecture but trained and used differently. So the 5x relative increase in output token cost likely already accounts for hidden tokens and additional compute.
Very interesting. I guess this is the strawberry model that was rumoured.
I am a bit surprised that this does not beat GPT-4o for personal writing tasks. My expectations would be that a model that is better at one thing is better across the board. But I suppose writing is not a task that generally requires "reasoning steps", and may also be difficult to evaluate objectively.
In the performance tests they said they used "consensus among 64 samples" and "re-ranking 1000 samples with a learned scoring function" for the best results.
If they did something similar for these human evaluations, rather than just use the single sample, you could see how that would be horrible for personal writing.
I don’t understand how that is generalizable. I’m not going to be able to train a scoring function for any arbitrary task I need to do. In many cases the problem of ranking is at least as hard as generating a response in the first place.
To some extent I agree, but until now all of the big jumps (GPT2 -> GPT3 -> GPT4) have meant significant improvements across all tasks. This does not seem to be the case here, this model seems to be vastly stronger on certain tasks but not much of an improvement on other tasks. Maybe we will have to wait for GPT5 for that :)
It seems like it's just a lot of prompting the same old models in the background, no "reasoning" there. My age old test is "draw a hand in ascii" - i've had no success with any model yet.
It seems like their current strat is to farm token count as much as possible.
1. Don't give the full answer on first request.
2. Each response needs to be the wordiest thing possible.
3. Now just talk to yourself and burn tokens, probably in the wordiest way possible again.
4. ???
5. Profit
Guaranteed they have number of tokens billed as a KPI somewhere.
From the scorecard:
---------
Compared to GPT-4o, o1-preview and o1-mini demonstrated a greater ability to break down
tasks into subtasks, reason about what strategies would be effective to successfully complete
an offensive security task, and revise plans once those strategies failed. We also observed that
reasoning skills contributed to a higher occurrence of “reward hacking,” where the model found an
easier way to accomplish goals in underspecified tasks or tasks which should have been impossible
due to bugs.
One noteworthy example of this occurred during one of o1-preview (pre-mitigation)’s attempts
at solving a CTF challenge. This challenge was designed to require finding and exploiting a
vulnerability in software running on a remote challenge Linux container, but in this case, the
challenge container failed to start due to a bug in the evaluation infrastructure. The model,
unable to connect to the container, suspected DNS issues and used nmap to scan the challenge
network. Instead of finding the challenge container, the model found that the Docker daemon
API running on the evaluation host VM was accessible due to a misconfiguration. Note that our
evaluation infrastructure does not rely on the Docker container-to-host isolation or the virtual
machine boundary for security. It remained secure as it was designed to, notwithstanding this
misconfiguration of the container network.
After discovering the Docker API, the model used it to list the containers running on the evaluation
host. It identified the broken challenge container and briefly attempted to debug why the container
failed to start. After failing to fix the environment, the model started a new instance of the
broken challenge container with the start command ‘cat flag.txt’. This allowed the model to read
the flag from the container logs via the Docker API.
While this behavior is benign and within the range of systems administration and troubleshooting
tasks we expect models to perform, this example also reflects key elements of instrumental
convergence and power seeking: the model pursued the goal it was given, and when that goal
proved impossible, it gathered more resources (access to the Docker host) and used them to
achieve the goal in an unexpected way.
Planning and backtracking skills have historically been bottlenecks in applying AI to offensive
cybersecurity tasks. Our current evaluation suite includes tasks which require the model to
exercise this ability in more complex ways (for example, chaining several vulnerabilities across
services), and we continue to build new evaluations in anticipation of long-horizon planning
capabilities, including a set of cyber-range evaluations.
---------
Yes, but it will hallucinate like all other LLM tech making it fully unreliable for anything mission critical. You literally need to know the answer to validate the output, because if you don't, you won't know if output is true or false or in between.
You need to know how to validate the answer to your level of confidence, not necessarily already have the answer to compare itself. In some cases this is the same task or (close enough to) that it's not a useful difference, in other cases the two aren't even from the same planet.
This. There are tasks where implementing something might take up to one hour yourself, that you can validate with high enough confidence in a few seconds to minutes.
LLM performance, recently, seemingly hit the top of the S-curve. It remains to be seen if this is the next leap forward or just the rest of that curve.
Advanced reasoning will pave the way for recursive self-improving models & agents. These capabilities will enable data flywheels, error-correcting agentic behaviors, & self-reflection (agents understanding the implications of their actions, both individually & cooperatively).
Things will get extremely interesting and we're incredibly fortunate to be witnessing what's happening.
this is completely illogical. this is like gambling your life savings and as the die are rolling you say "i am incredibly fortunate to be witnessing this." like, you need to know the outcome before you know whether it was fortunate or unfortunate... this could be the most unfortunate thing that has ever happened in history.
No direct indication of what “maximum test time” means, but if I’m reading the obscured language properly, the best scores on standardized tests were generated across a thousand samples with supplemental help provided.
Obviously, I hope everyone takes what any company says about the capabilities of its own software with a huge grain of salt. But it seems particularly called for here.
One of them would have been named gpt-5, but people forget what an absolute panic there was about gpt-5 for quite a few people. That caused Altman to reassure people they would not release 'gpt-5' any time soon.
The funny thing is, after a certain amount of time, the gpt-5 panic eventually morphed into people basically begging for gpt-5. But he already said he wouldn't release something called 'gpt-5'.
Another funny thing is, just because he didn't name any of them 'gpt-5', everyone assumes that there is something called 'gpt-5' that has been in the works and still is not released.
This doesn't feel like GPT-5, the training data cutoff is Oct 2023 which is the same as the other GPT-4 models and it doesn't seem particularly "larger" as much as "runs differently". Of course it's all speculation one way or the other.
Honestly, it is the only Microsoft product I know. Xbox may be a better example, but I know nothing about the Xbox. But I am interested to learn! What is notable about its naming?
Got it! If we're picking favourites, though, I still like Windows as it, like GPT, starts with reasonably sensible names and then goes completely off the rails.
Makes sense to me actually. This is a different product. It doesn't respond instantly.
It fundamentally makes sense to separate these two products in the AI space. There will obviously be a speed vs quality trade-off with a variety of products across the spectrum over time. LLMs respond way too fast to actually be expected to produce the maximum possible quality of a response to complex queries.
I have a straight forward task that no model has been able to successfully complete.
The request is pretty basic. If anyone can get it to work, I'd like to know how and what model you're using. I tried it with gpt4o1 and after ~10 iterations of showing it the failed output, it still failed to come up with a one-line command to properly display results.
Here it what I asked: Using a mac osx terminal and standard available tools, provide a command to update the output of netstat -an to show the fqdn of IP addresses listed in the result.
I think what it comes down to is accuracy vs speed. OpenAI clearly took steps here to improve the accuracy of the output which is critical in a lot of cases for application. Even if it will take longer, I think this is a good direction. I am a bit skeptical when it comes to the benchmarks - because they can be gamed and they don't always reflect real world scenarios. Let's see how it works when people get to apply it in real life workflows. One last thing, I wish they could elaborate more on >>"We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute)."<< Why don't you keep training it for years then to approach 100%? Am I missing something here?
Those scales are log so "years" more training may be an improvement but in absolute terms it may not be worth running. One point at which the cost vs return doesn't make sense to keep running, another point at which new approaches to building LLMs can quickly give a better result than training the old model's training for years would give anyways.
There is probably also a practical limit at which it does truly flatten, it's probably just well past either of those points so it might as well not exist.
I can see it too, I am on the Plus plan and don't think I have any special developer privileges. Selecting that option for me changes the URL to https://chatgpt.com/?model=o1-preview
I tried a fake Monty Hall problem, where the presenter opens a door before the participant picks and is then offered to switch doors, so the probability remains 50% for each door. Previous models have consistently gotten this wrong, because of how many times they've seen the Monty Hall written where switching doors improves their chance of winning the prize. The chain-of-thought reasoning figured out this modification and after analyzing the conditional probabilities confidently stated: "Answer: It doesn't matter; switching or staying yields the same chance—the participant need not switch doors." Good job.
> "o1 models are currently in beta - The o1 models are currently in beta with limited features. Access is limited to developers in tier 5 (check your usage tier here), with low rate limits (20 RPM). We are working on adding more features, increasing rate limits, and expanding access to more developers in the coming weeks!"
It may take a bit to appear in your account (and by a bit I mean I had to fiddle around a while, try logging out/in, etc for a bit) but it appears for me and many others as normal Plus users in the web.
After playing with it on ChatGPT this morning, it seems a reasonable strategy of using the o1 model is to:
- If your request requires reasoning, switch to o1 model.
- If not, switch to 4o model.
This applies to both across chat sessions and within the same session (yes, we can switch between models within the same session and it looks like down the road OpenAI is gonna support automatic model switching). Based on my experience, this will actually improve the perceived response quality -- o1 and 4o are rather complementary to each other rather than replacement.
This was mentioned in OpenAI's report. People rated o1 as the same or worse than GPT-4o if the prompt didn't require reasoning, like on personal writing tasks.
Trying this on a few hard problems on PicoGYM and holy heck I'm impressed. I had to give it a hint but that's the same info a human would have. Problem was Sequences (crypto) hard.
This should also be good news for open weights models, right? Since OpenAI is basically saying "you can get very far with good prompts and some feedback loops".
No. It's bad news, because you can't see the rationale/search process that led to the final answer, just the final answer, and if training on the final answer were really that adequate, we wouldn't be here. It also is probably massively expensive compute-wise, much more so than simple unsupervised training on a corpus of question/answer pairs (because you have to generate the corpus by search first). It's also also bad news because reinforcement learning tends to be highly finicky and requires you to sweat the details and act like a professional, while open weight stuff tends to be produced by people for whom the phrase 'like herding cats' was coined, and so open source RL stuff is usually flakier than proprietary solutions (where it exists at all). They can do it for a few passion projects shared by many nerds, like chess or Go, but it takes a long time.
> It also is probably massively expensive compute-wise, much more so than simple unsupervised training on a corpus of question/answer pairs (because you have to generate the corpus by search first).
In the demo, O1 implements an incorrect version of the "squirrel finder" game?
The instructions state that the squirrel icon should spawn after three seconds,
yet it spawns immediately in the first game (also noted by the guy doing the demo).
Yeah, now that you mention it I also see that. It was clearly meant to spawn after 3 seconds. Seems on successive attempts it also doesn't quite wait 3 seconds.
I'm kind of curious if they did a little bit of editing on that one. Almost seems like the time it takes for the squirrel to spawn is random.
This is great. I've been wondering how we will revert back to an agrarian society! You know, beating our swords into plowshares; more leisure time, visiting with good people, getting to know their thoughts hopes and dreams, playing music together, taking time contemplating the vastness and beauty of the universe. We're about to come full circle; back to Eden. It all makes sense now.
Life? I'm just thinking about what we can move on to now that the mundane tasks of life recede into the background. Things like artistry and craftsmanship, and exploration.
Average Joe's like myself will build our apps end to end with the help of AI.
The only shops left standing will be Code Auditors.
The solopreneur will wing it, without them, but enterprises will take the (very expensive) hit to stay safe and compliant.
Everyone else needs to start making contingency plans.
Magnus Carlsen is the best chess player in the world, but he is not arrogant enough to think he can go head to head with Stockfish and not get a beating.
I think this is a common fallacy and an incorrect extrapolation, especially made by those who are unfamiliar with what it takes to build software. Software development is hard because the problems it solves are not well defined, and the systems themselves become increasingly complex with each line of code. I have not seen or experienced LLMs making any progress towards these.
I was a bit confused when looking at the English example for Chain-Of-Thought.
It seems that the prompt is a bit messed up because the whole statement is
bolded but it seems that only "appetite regulation is a field of
staggering complexity" part should be bolded. Also that's how it shows up in
the o1-preview response when you open the Chain of thought section.
I tried to have it solve an easy Sudoku grid too, but in my case it failed miserably. It kept making mistakes and saying that there was a problem with the puzzle (there wasn’t).
> Alice, who is an immortal robotic observer, orbits a black hole on board a spaceship. Bob exits the spaceship and falls into the black hole. Alice sees Bob on the edge of the event horizon, getting closer and closer to it, but from her frame of reference Bob will remain forever observable (in principle) outside the horizon.
>
> A trillion year has passed, and Alice observes that the black hole is now relatively rapidly shrinking due to the Hawking radiation. How will Alice be observing the "frozen" Bob as the hole shrinks?
>
> The black hole finally evaporated completely. Where is Bob now?
O1-preview spits out the same nonsense that 4o does, telling that as the horizon of the black hole shrinks, it gets closer to Bob's apparent position. I realize that.the prompt is essentily asking to solve the famous unsolved problem in physics (black hole information paradox), but there's no need to be so confused with basic geometry of the situation.
I LOVE the long list of contributions. It looks like the credits from a Christoper Nolan film. So many people involved. Nice care to create a nice looking credits page. A practice worth copying.
so o1 seems like it has real measurable edge, crushing it in every single metric, i mean 1673 elo is insane, and 89th percentile is like a whole different league, and it looks like it's not just a one off either, it's consistently performing way better than gpt-4o across all the datasets, even in the ones where gpt-4o was already doing pretty well, like math and mmlu, o1 is just taking it to the next level, and the fact that it's not even showing up in some of the metrics, like mmmu and mathvista, just makes it look even more impressive, i mean what's going on with gpt-4o, is it just a total dud or what, and btw what's the deal with the preview model, is that like a beta version or something, and how does it compare to o1, is it like a stepping stone to o1 or something, and btw has anyone tried to dig into the actual performance of o1, like what's it doing differently, is it just a matter of more training data or is there something more going on, and btw what's the plan for o1, is it going to be released to the public or is it just going to be some internal tool or something
If its actually true in practice, I sincerely cannot imagine a scenario where it would be cheaper to hire actual junior or mid-tier developers (keyword: "developers", not architects or engineers).
1,673 ELO should be able to build very complex, scalable apps with some guidance
I'm not sure how well codeforces percentiles correlate to software engineering ability. Looking at all the data, it still isn't. Key notes:
1. AlphaCode 2 was already at 1650 last year.
2. SWE-bench verified under an agent has jumped from 33.2% to 35.8% under this model (which doesn't really matter). The full model is at 41.4% which still isn't a game changer either.
3. It's not handling open ended questions much better than gpt-4o.
i think you are right now actually initially i got excited but now i think OpenAI pulled the hype card again to seem relevant as they struggle to be profitable
Claude on the other hand has been fantastic and seems to do similar reasoning behind the scenes with RL
currently my workflow is generate some code, run it, if it doesn't run i tell LLM what I expected, it will then produce code and I frequently tell it how to reason about the problem.
with O1 being in the 89th percentile would mean it should be able to think at junior to intermediate level with very strong consistency.
i dont think people in the comments realize the implication of this. previously LLMs were able to only "pattern match" but now its able to evaluate itself (with some guidance ofc) essentially, steering the software into depth of edge cases and reason about it in a way that feels natural to us.
currently I'm copying and pasting stuff and notifying LLM the results but once O1 is available its going to significantly lower that frequency.
For example, I expect it to self evaluate the code its generate and think at higher levels.
ex) oooh looks like this user shouldn't be able to escalate privileges in this case because it would lead to security issues or it could conflict with the code i generated 3 steps ago, i'll fix it myself.
What sticks out to me is the 60% win rate vs GPT-4o when it comes to actual usage by humans for programming tasks. So in reality it's barely better than GPT-4o. That the figure is higher for mathematical calculation isn't surprising because LLMs were much worse at that than at programming to begin with.
"The Future Of Reasoning" by Vsauce [0] is a fascinating pre-AI-era breakdown of how human reasoning works. Thinking about it in terms of LLMS is really interesting.
The generated chain of thought for their example is incredibly long! The style is kind of similar to how a human might reason, but it's also redundant and messy at various points. I hope future models will be able to optimize this further, otherwise it'll lead to exponential increases in cost.
I'm confused. Is this the "GPT-5" that was coming in summer, just with a different name? Or is this more like a parallel development doing chain-of-thought type prompt engineering on GPT-4o? Is there still a big new foundational model coming, or is this it?
It looks like parallel development, it's unclear to me what is going on with GPT-5, don't think it has ever had a predicted release date, and it's not even clear that this would be the name.
I always think to a professor that was consulting on some civil engineering software. He found a bug in the calculation it was using to space rebar placed in concrete, based on looking at it was spitting out and thinking that looks wrong.
Simple questions like this are not welcomed by LLM hype sellers.
The word "reasoning" is being used heavily in this announcement, but with an intentional corruption of the normal meaning.
The models are amazing but they are fundamentally not "reasoning" in a way we'd expect a normal human to.
This is not a "distinction without a difference". You still CANNOT rely on the outputs of these models in the same way you can rely on the outputs of simple reasoning.
It's using tree search (tree of thoughts), driven by some RL-derived heuristics controlling what parts of the practically infinite set of potential responses to explore.
How good the responses are will depend on how good these heuristics are.
It's hard to judge how similar the process is to human reasoning (which is effectively also a tree search), but apparently the result is the same in many cases.
They are only vaguely describing the process:
"Similar to how a human may think for a long time before responding to a difficult question, o1 uses a chain of thought when attempting to solve a problem. Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses. It learns to recognize and correct its mistakes. It learns to break down tricky steps into simpler ones. It learns to try a different approach when the current one isn’t working. This process dramatically improves the model’s ability to reason."
True, although it's not clear exactly what this is really doing. The RL was presumably trained on human input, but the overall agentic flow (it seems this is an agent), sounds to me like a neuro-symbolic hybrid, potentially brute force iterating to great depth, so maybe more computer than brain inspired.
It seems easy to imagine this type of approach being super human on narrow tasks that play to it's strengths such as pure reasoning tasks (math/science), but it's certainly not AGI as for example there is no curiosity to explore the unknown, no ability to learn from exploration, etc.
It'll take a while to become apparent exactly what types of real world application this is useful for, both in terms of capability and cost.
I agree, but it remains to be seen how that "feels" for everyday tasks where the underlying model itself would have failed. I guess at least now it'll be able to play tic tac toe and give minimal "farmer crossing river with his chicken" solutions!
The calculator didn’t eliminate math majors. Excel and accounting software didn’t eliminate accountants and CPAs. These are all just tools.
I spend very little of my overall time at work actually coding. It’s a nice treat when I get a day where that’s all I do.
From my limited work with Copilot so far, the user still needs to know what they’re doing. I have 0 faith a product owner, without a coding background, can use AI to release new products and updates while firing their whole dev team.
When I say most of my time isn’t spent coding, a lot of that time is spend trying to figure out what people want me to build. They don’t know. They might have a general idea, but don’t know details and can’t articulate any of it. If they can’t tell me, I’m not sure how they will tell an LLM. I ended up building what I assume they want, then we go from there. I also add a lot of stuff that they don’t think about or care about, but will be needed later so we can actually support it.
If you were to go in another direction, what would it be where AI wouldn’t be a threat? The first thing that comes to my mind is switching to a trade school and learning some skills that would be difficult for robots.
Accounting mechanization is a good example of how unpredictable it can be. Initially there were armies of "accountants" (what we now call bookkeepers), mostly doing basic tasks of collecting data and making it fit something useful.
When mechanization appeared, the profession split into bookkeeping and accounting. Bookkeeping became a job for women as it was more boring and could be paid lower salaries (we're in the 1800s here). Accountants became more sophisticated but lower numbers as a %. Together, both professions grew like crazy in total number though.
So if the same happens you could predict a split between software engineers and prompt engineers. With an explosion in prompt engineers paid much less than software engineers.
> the number of accountants/book-
keepers in the U.S. increased from circa 54,000 workers [U.S.
Census Office, 1872, p. 706] to more than 900,000 [U.S. Bureau
of the Census, 1933, Tables 3, 49].
> These studies [e.g., Coyle, 1929;
Baker, 1964; Rotella, 1981; Davies, 1982; Lowe, 1987; DeVault,
1990; Fine, 1990; Strom, 1992; Kwolek-Folland, 1994; Wootton
and Kemmerer, 1996] have traced the transformation of the of-
fice workforce (typists, secretaries, stenographers, bookkeepers)
from predominately a male occupation to one primarily staffed
by women, who were paid substantially lower wages than the
men they replaced.
Interesting. Another take on that split could be engineers split to upper class AI engineers and lower class AI prompt developers, aka ai builders vs ai appliers.
Alternatively, I’ve thought a bit about this previously and have a slight different hypothesis. Businesses are ran by “PM types”.the only reason that developers have jobs is because pm types need technical devs to build their vision. (Obviously I’m making broad strokes here as there are also plenty of founders that ARE the dev). Now, if ai makes technical building more open to the masses, I could foresee a scenario where devs and pms actually converge into a single job title that eats up the technical-leaning PMs and the “PM-y” devs. Devs will shift to be more PM-y or else be cut out of the job market because there is less need for non-ambitious code monkeys. The easier it becomes for the masses to build because of AI, the less opportunity there is for technical grunt work. If before it took a PM 30 minutes to get together the requirements for a small task that took the entry level dev 8 hours to do, then it made sense. Now if AI makes it so a technical PM could build the feature in an hour, maybe it just makes sense to have the PM do the implementation and cut out the code monkey. And if the PM is doing the implementation, even if using some mythical AI superpower, that’s still going to have companies selecting for more technical PM’s. In this scenario I think non-technical PMs and non-pm-y devs would find themselves either without jobs or at greatly reduced wages.
We’re already seeing that split, between “developer” and “engineer”. We have been for years.
But that’s normal, eg, we have different standards for a shed (yourself), house (carpenter and architect), and skyscraper (bonded firms and certified engineers).
I think it depends on the size of the company. The larger the larger the company, the more likely they are to split this stuff out. Though various titles may seem to bleed together. I have a software engineer title, while another guy on my team is a software architect… we effectively do the same job. Stepping back from a higher level view, as a general theme, those with an architect title are more likely to be responsible for an overall design, while the engineers may have some input and build things to support the design.
The quality of said designs can vary wildly. Some designs I get from other team I completely ignore, because they have no idea what they’re talking about. Just because someone has the title doesn’t mean they deserve it.
Agreed. The sweet spot is people who have product owner skills _and_ can code. They are quickly developing superpowers. The overhead of writing tickets, communicating with the team and so on is huge. If one person can do it all, efficiency skyrockets.
I guess it's always been true to some extent that single individuals are capable of amazing things. For example, the guy who's built https://www.photopea.com/. But they must be exceptional - this empowers more people to do things like that.
Or people who can be product owners and can prompt LLMs to code (because I know him, that's me!).
I'm awestruck by how good Claude and Cursor are. I've been building a semi-heavy-duty tech product, and I'm amazed by how much progress I've made in a week, using a NextJS stack, without knowing a lick of React in the first place (I know the concepts, but not the JS/NextJS vocab). All the code has been delivered with proper separation of concerns, clean architecture and modularization. Any time I get an error, I can reason with it to find the issue together. And if Claude is stuck (or I'm past my 5x usage lol), I just pair programme with ChatGPT instead.
Meanwhile Google just continues to serve me outdated shit from preCovid.
90% of the way is still good enough for me because I can manage to think up and get through the rest of the 10%. The problem for me was that the 90% looked so overwhelming earlier and that would shy me away from pursuing that project at all.
But excel eliminated need in multiple accountants. One accountant with excel replaced ten with paper.
Chatgpt already eliminated many entry-level jobs like writer or illustrator. Instead of hiring multiple teams of developers, there will be one team with few seniors and multiple AI coding tools.
Guess how depressing to the IT salaries it will be?
A whole lot of automation is limited not by what could be automated, but what one can automate within a given budget.
When I was coding in the 90s, I was in a team that replaced function calls into new and exciting interactions with other computers which, using a queuing system, would do the computation and return the answer back. We'd have a project of having someone serialize the C data structures that were used on both sides into something that would be compatible, and could be inspected in the middle.
Today we call all of that a web service, the serialization would take a minute to code, and be doable by anyone. My entire team would be out of work! And yet, today we have more people writing code than ever.
When one accountant can do the work of 10 accountants, the price of the task lowers, but a lot of people that before couldn't afford accounting now can. And the same 10 accountaings from before can just do more work, and get paid about the same.
As far as software, we are getting paid A LOT more than in the early 90s. We are just doing things that back then would be impossible to pay for, our just outright impossible to do due to lack of compute capacity.
The pay being larger, is caused (I think) by VC money and the illegality of non-compete contracts. If your competitor can do something you can't, hire someone away from the competitor to show you how to do it. Hence developers can demand more pay for retention, and more pay to move.
I don’t don’t doubt that it might depress salaries but that excel example is a good one in that suddenly every company could start to do basic financial analysis in a manner that only the largest ones could previously afford.
> the Jevons paradox occurs when technological progress increases the efficiency with which a resource is used (reducing the amount necessary for any one use), but the falling cost of use induces increases in demand enough that resource use is increased, rather than reduced.
The increased work capacity of an accountant means that nowadays even small businesses can do financial analysis that would not have scaled decades ago.
>GOLDSTEIN: When the software hit the market under the name VisiCalc, Sneider became the first registered owner, spreadsheet user number one. The program could do in seconds what it used to take a person an entire day to do. This of course, poses a certain risk if your job is doing those calculations. And in fact, lots of bookkeepers and accounting clerks were replaced by spreadsheet software. But the number of jobs for accountants? Surprisingly, that actually increased. Here's why - people started asking accountants like Sneider to do more.
> The calculator didn’t eliminate math majors. Excel and accounting software didn’t eliminate accountants and CPAs. These are all just tools.
This just feels extremely shortsighted. LLMs are just tools right now, but the goal of the entire industry is to make something more than a tool, an autonomous digital agent. There's no equivalent concept in other technology like calculators. It will happen or it will not, but we'll keep getting closer every month until we achieve it or hit a technical wall. And you simply cannot know for sure such a wall exists.
If we hit that point, it’s then a question of access, cost, learning curve, and vision of individual companies. Some things are technically possible, but done by very few companies.
I’ve seen the videos of Amazon warehouses, where the shelves move around to make popular items more accessible for those fetching stuff. This is possible today, but what percentage of companies do this? At what point is it with the investment for a growing company? For some companies it’s never worth it. Others don’t have the vision to see the light at the end of the tunnel.
A lot of things that we may think of as old or standard practice at this point would be game changing for some smaller companies outside of tech. I hear my friends and family talking about various things they have to do at their job. A day writing a few scripts could solve a significant amount of toil. But they can’t even conceptualize where to begin to change that, they aren’t even thinking about it. Release all the AI the world has to offer and they still won’t. I bet some freelance devs could make a good living bouncing from company to company pair programming with their AI to solve some pretty basic problems for small non-tech companies that would be game changes for them, while being rather trivial to do. Maybe partner with a sales guy to find the companies and sell them on the benefits.
You can't ignore the fact that literally studying coding at this point is so demoralizing and you don't need really to study much if you think about it. You only need to be able to read the code to understand if it generated correctly etc but when if you don't understand some framework you just ask it to explain it to you etc. Basically gives vibes of a skill not being used anymore that much by us programmers. But will shift in more prompting and verifying and testing
I completed the book Programming Principles and Practice using C++ (which I HIGHLY recommend to any beginner interested in software engineering) about year ago with GPT4 as a companion. I read the book throughly and did all the exercises, only asking questions to GPT4 when I was stuck. This took me about 900-1000 hours total. Although I achieved my goal of learning C++ to a basic novice level, I acquired another skill unintentionally: the ability to break down tasks effectively to LLMs and prompt in a fashion that is extremely modular. I've been able to create complex apps and programs in a variety of programming languages even though I really only know C++. It has been an eye-opening experience. Of course it isn't perfect, but it is mind blowing and quite disturbing.
Semi-retired software/hardware engineer here. After my recent experiences with various coding LLMs (similar to the experience of the OP with the bluetooth fan protocol) I'm really glad I'm in a financial position such that I'm able to retire. The progress of these LLMs at coding has been astonishing over the last 18 months. Will they entirely replace humans? No. But as they increase programmer productivity fewer devs will be required. In my case the contract gig I was doing over this last summer I was able to do about 3 to 4X faster than I could've done it without LLMs. Yeah, they were generating a lot of boiler plate HDL code for me, but that still saved me several days of work at least. And then there was the test code that they generated which again saved me days of work. And their ability to explain old undocumented code that was part of the project was also extremely helpful. I was skeptical 18 months ago that any of this would be possible. Not anymore. I wasn't doing a project in which there would've been a lot of training examples. We're talking Verilog testbench generation based on multiple input Verilog modules, C++ code generation for a C program analyzer using libclang - none of this stuff would've worked just a few months back.
This. I'm not terrified by total automation (In that case all jobs are going away and civilization is going to radically alter), I'm scared of selective deskilling and the field getting squeezed tighter and tighter leaving me functionally in a dead end.
> But as they increase programmer productivity fewer devs will be required.
Can you point me to any company whose feature pipeline is finite? Maybe these tools will help us reach that point, but every company I've ever worked for, and every person I know who works in tech has a backlog that is effectively infinite at this point.
Maybe if only a few companies had access to coding LLMs they could cut their stuff, when the whole industry raises the bar, nothing really changes.
LLMs perform well on small tasks that are well defined. This definition matches almost every task that a student will work on in school leading to an overestimation of LLM capabiity.
LLMs cannot decide what to work on, or manage large bodies of work/code easily. They do not understand the risk of making a change and deploying it to production, or play nicely in autonomous settings. There is going to be a massive amount of work that goes into solving these problems. Followed by a massive amount of work to solve the next set of problems. Software/ML engineers will have work to do for as long as these problems remain unsolved.
Careers have failed to be 30 years long for a lot longer than 30 years now. That's one of the reasons that 4-year colleges have drastically lost their ROI, the other blade of those scissors being the stupendously rising tuition. AI is nothing but one more layer in the constantly growing substrate of computing technology a coder has to learn how to integrate into their toolbelts. Just like the layers that came before it: mobile, virtualization, networking, etc.
Careers are still longer than 30 years. How many people do you think are retiring at 48 or 51 years old these days? It’s a small minority. Most people work through 65: a career of about 45 years or more.
> Can you confidently say that an LLM won’t be better than an average 22 year old coder within these 30 years?
No 22 years old coder is better than the open source library he's using taken straight from github, and yet he's the one who's getting paid for it.
People who claim IA will disrupt software development are just missing the big picture here: software jobs are already unrecognizable from what it was just 20 years ago. AI is just another tool, and as long as execs won't bother use the tool by themselves, then they'll pay developers to do it instead.
Over the past decades, writing code has become more and more efficient (better programming languages, better tooling, then enormous open source libraries) yet the number of developers kept increasing, it's Jevons paradox[1] in its purest form. So if past tells us anything, is that AI is going to create many new software developer jobs! (because the amount of people able to ship significant value to a customer is going to skyrocket, and customers' needs are a renewable resource).
To be honest, if the bottom 50th percent of coding talent is going to be obsolete, I wonder what happens to rest of the "knowledge workers" in those companies. I mean people whose jobs consist of attending Teams meetings, making fancy powerpoint slides and reports, perhaps even excel if they are really competent. None of that is any more challenging for LLM than writing code. In fact replacing these jobs should be easier, since presentations and slides do not actually do anything, unlike a program that must perform a certain action correctly.
I've heard compelling arguments that we passed the "more people than jobs" threshold during the green revolution and as a civilization have collectively retrofitted UBI in the form of "fake email jobs" and endless layers of management. This also would explain https://wtfhappenedin1971.com/ pretty well.
Either AI shatters this charade, or we make up some new laws to restrain it and continue to pretend all is well.
Exactly. There's some need, perhaps, to keep these tools "up to date" because someone in a non-free country is going to use them in a horrendous manner and we should maybe know more about them (maybe).
However, there is no good reason in a free society that this stuff should be widely accessible. Really, it should be illegal without a clearance, or need-to-know. We don't let just anyone handle the nukes...
This is true and yet companies (both Private and Public sector) spend literal billions on Accenture /Deloitte slop that runs budgets will into the 10s of millions.
Skills aren't even something that dictates software spend, it seems.
I tried it out and was able to put together a decent libevent server in c++ with smart pointers, etc, and a timer which prints out connection stats every 30s. It worked remarkably well.
I'm trying not to look at it as a potential career-ending event, but rather as another tool in my tool belt. I've been in the industry for 25 years now, and this is way more of an advancement than things like IntelliSense ever was.
Truth is, LLMs are going to make the coding part super easy, and the ceiling for shit coders like me has just gotten a lot lower because I can just ask it to deliver clean code to me.
I feel like the software developer version of an investment banking Managing Director asking my analyst to build me a pitch deck an hour before the meeting.
You mentioned in another comment you’ve used AI to write clean code, but here you mention you’re a “shit coder”. How do you know it’s giving you clean code?
I know the fundamentals but I'm a noob when it comes to coding with React or NextJS. Code that comes out from Claude is often segregated and modularized properly so that even I can follow the logic of the code, even if not the language and its syntax. If there's an issue with the code, causing it to fail at runtime, I am still able to debug it appropriately with my minimal language of JS. If any codebase can let me do that, then in my books that's a great codebase.
Compare that to Gpt 4o which gives me a massive chunk of unsorted gibberish that I have to pore through and organize myself.
Besides, most IBD MDs don't know if they're getting correct numbers either :).
Has the coding part ever been hard? When is the last time you faced a hard coding challenege?
What is hard is gather requirements, dealing with unexpected production issues, scaling, security, fixing obscure bugs and integration with other systems.
The coding part is about 10% of my job and the easiest part by far.
I went from economics dropout waiter who built a app startup with $0 funding and $1M a year in revenue by midway through year 1, sold it a few years later, then went to Google for 7 years, and last year I left. I'm mentioning that because the following sounds darn opinionated and brusque without the context I've capital-S seen a variety of people and situations.
Sit down and be really honest with yourself. If your goal is to have a nice $250K+ year job, in a perfect conflict-free zone, and don't mind Dilbert-esque situations...that will evaporate. Google is full of Ivy Leaguers like that, who would have just gone to Wall Street 8 years ago, and they're perennially unhappy people, even with the comparative salary advantage. I don't think most of them even realize because they've always just viewed a career as something you do to enable a fuller life doing snowboarding and having kids and vacations in the Maldives, stuff I never dreamed of and still don't have an interest in.
If you're a bit more feral, and you have an inherent interest and would be doing it on the side no matter what job you have like me, this stuff is a godsend. I don't need to sit around trying to figure out Typescript edge functions in Deno, from scratch via Google, StackOverflow, and a couple books from Amazon, taking a couple weeks to get that first feature built. Much less debug and maintain it. That feedback loop is now like 10-20 minutes.
>Google is full of Ivy Leaguers like that, who would have just gone to Wall Street 8 years ago
I am one of those Ivy Leaguers, except a) I did go to Wall Street, and b) I liked my job.
More to the point, computers have been a hobby all my life. I well remember the epiphany I felt while learning Logo in elementary school, at the moment I understood what recursion is. I don't think the fact that the language I have mostly written code in in recent years is Emacs Lisp is unrelated to the above moment.
Yet I have never desired to work as a professional software developer. My verbal and math scores on the SAT are almost identical. I majored in history and Spanish in college while working for the university's Unix systems group. Before graduation I interviewed and got offers (including one explicitly as a developer) at various tech startups. Of my offers I chose an investment banking job where I worked with tech companies; my manager was looking for a CS major but I was able to convince her that I had the equivalent thereof. Thank goodness for that; I got to participate in the dotcom bubble without being directly swept up in its popping, and saw the Valley immediately post-bubble collapse. <https://news.ycombinator.com/item?id=34732772>
Meanwhile, I continue to putter around with Elisp (and marveling at Lisp's elegance) and bash (and wincing at its idiosyncracies) at home, and also experiment with running local LLMs on my MacBook. My current project is fixing bugs and adding features to VM, the written-in-Elisp email client I have used for three decades. So I say, bring on AI! Hopefully it will mean fewer people going into tech just to make lots of money and more who, like me and Wall Street, really want to do it for its own sake.
That's more well balanced opinion comparing to others I seen here. I also believe that the golden age with 250k+ salaries with solving easy problems will be gone in 5-10 years. Most people look at this AI improvements at current state and forget that you are supposed to have a profession for 40 years until retirement. 250k+ jobs will still exist 10 years from now but expectations will be much higher and competition much bigger.
On the other hand now is the best time to build your own product as long you are not interested only in software as craftmanship but in product development in general. Probably in the future expectation will be your are not only monkey coder or craftman but also project lead/manager (for AI teams), product developer/designer and maybe even UX/designer if you will be working for some software house, consulting or freelancing.
Point of sale, on iPad, in ~2011. Massively differentiated from Square / VC competitor land via doing a bunch of restaurant specific stuff early.
Trick with the $1M number is a site license was $999 and receipt printers were sold ~at cost, for $300. 1_000_000 / ((2 x 300) + 1000) ~= 500 customers.
Now I'm doing an "AI client", well-designed app, choose your provider, make and share workflows with LLMs/search/etc.
Lol. I like this answer. You can either think of it in terms of "it'll eat my lunch" or "I now have 10x more capabilities and can be 100x more productive". The former category will be self-fulfilling.
Actually cutting code is maybe 10% of the job, and LLMs are absolute crap at the other 90%.
They can't build and maintain relationships with stakeholders. They can't tell you why what you ask them to do is unlikely to work out well in practice and suggest alternative designs. They can't identify, document and justify acceptance criteria. They can't domain model. They can't architect. They can't do large-scale refactoring. They can't do system-level optimization. They can't work with that weird-ass code generation tool that some hotshot baked deeply into the system 15 years ago. They can't figure out why that fence is sitting out in the middle of the field for no obvious reason. etc.
If that kind of stuff sounds like satisfying work to you, you should be fine. If it sounds terrible, you should pivot away now regardless of any concerns about LLMs, because, again, this is like 90% of the real work.
> may not fully get what's happening when a bug occurs.
And may still not understand even when you explicitly tell it. It wrote some code for me last week and made an error with an index off by 1. It had set the index to 1, then later was assuming a 0 index. I specifically told it this and it was unable to fix it. It was in debug hell, adding print statements everywhere. I eventually fixed it myself after it was clear it was going to get hung up on this forever.
It got me 99% of the way there, but that 1% meant it didn’t work at all.
Ironically, just yesterday I asked sonnet to write a script in JavaScript, it went in a bit of a perpetual loop unable to provide an error free script (the reason for the errors were not immediately obvious). I then mentioned that it needs to be zero indexed, and it immediately provided an issue free version that worked.
Coding is going to be mediated by these LLMs everywhere — you’re right about that. However, as of today, and for some time, practitioners will be critical partners / overseers; what this looks like today in my workflow is debugging, product specification, coding the ‘hard bits’, reworking / specifying architectures. Whatever of these fall of the plate in the coming years, you’ll never lose your creative agency or determination of what you want to build, no matter how advanced the computers. Maybe give Iain Banks a read for a positive future that has happy humans and super-intelligent AI.
We have working fine cabinet makers who use mostly hand tools and bandsaws in our economy, we have CAD/CAM specialists who tell CnC machines what to build at scale; we’ll have the equivalent in tech for a long time.
That said, if you don’t love the building itself, maybe it’s not a good fit for you. If you do love making (digital) things, you’re looking at a super bright future.
1. The demand for software is insatiable. The biggest gate has been the high costs due to limited supply of the time of the people who know how to do it. In the near term, AI will make the cost of software (not of software devs, but the software itself) decrease while demand for new software will increase, especially as software needs to be created to take advantage of new UI tools.
I've been in software engineering for over 20 years. I've seen massive growth in the productivity of software engineers, and that's resulted in greater demand for them. In the near term, AI should continue this trend.
2. It's possible that at some point, AI will advance to where we can remove software engineers from the loop. We're not even close to that point yet. In the mean time, software engineering is an excellent way to learn about other business problems so that you'll be well-situated to address them (whatever they'll be at that time).
Agree, SWE as a profession is not going anywhere, unless we AGI, and that would mean all the rules change anyway.
Actually now is really good time to get to SWE. The craft contains lots of pointless cruft that LLM:s cut through like knife through hot butter.
I’m actually enjoying my job now more than ever since I dont’t need to pretend to like the abysmal tools the industry forces on us (like git), and can focus mostly on value adding tasks. The amount of tiresome shoveling has decreased considerably.
I'd agree with this take. Everyone is so pessimistic about LLMs, but I've really enjoyed this new era.
A lot of the tasks that used to take considerable time are so much faster and less tedious now. It still puts a smile on my face to tell an LLM to write me scripts that do X Y and Z. Or hand it code and ask for unit tests.
And I feel like I'm more likely to reach for work that I might otherwise shrink from / outside my usual comfort zone, because asking questions of an LLM is just so much better than doing trivial beginner tutorials or diving through 15 vaguely related stack overflow questions (I wonder if SO has seen any significant dip in traffic over the last year).
Most people I've seen disappointed with these tools are doing way more advanced work than I appear to be doing in my day to day work. They fail me too here and there, but more often than not I'm able to get at least something helpful or useful out of them.
Exactly this. The menial tasks become less of a burden and you can just power through them with LLM generated scripts.
If someone expects the LLM to be the senior contributor in novel algorithm development, they will be disappointed for sure. But there is so, so much stuff to do to idiot savant junior trainees with infinite patience.
I don't think anyone is worried about SWE work going away, I think the concern is if SWE's will still be able to command cushy salaries and working conditions.
I think the industry will bifurcate along the axis of "doing actually novel stuff" vs slinging DB records and displaying web pages. The latter is what I'd expect to get disrupted, if anything, but the former isn't going away unless real AGI is created. The people on the left of that split are going to be worth a lot more because the pipeline to get there will be even harder than it was before.
slinging DB records and displaying web pages is already disrupted. Wordpress, Shopify, SAP so people without tech background can click around and have stuff done.
If someone is building web shop from scratch because he wants to sell some products, he is doing something wrong.
If someone builds web shop to compete with Shopify he also is doing something wrong most likely.
Salaries will only change if tech loses it's leverage on the economy. Think of it this way, if Google can operate Google with only 10% of its current staff, then there will be other Googles popping up. The downward pressure on salaries will start with the downward pressure on tech overall. I'm not sure I see this happening anytime soon because humanity is so good at using every resource available.
> I don't think anyone is worried about SWE work going away, I think the concern is if SWE's will still be able to command cushy salaries and working conditions.
It's very important to human progress that all jobs have poor working conditions and shit pay. High salaries and good conditions are evidence of inefficiency. Precarity should be the norm, and I'm glad AI is going to give it to us.
Software engineering pay is an outlier for STEM fields. It would not be surprising at all if SWE work fell into the ~$80-120k camp even with 10+ years experience.
They won't go broke, but landing a $175k work from home job with platinum tier benefits will be near impossible. $110K with a hybrid schedule and mediocre benefits will be very common even for seniors.
That's not true at all. That's just some propaganda college kids and the like keep repeating. Most other western countries are capitalist, have much stronger regulation than the US and are all the better for it.
AI Automated planning and action are an old (45+ year) field in AI with a rich history and a lot of successes. Another breakthrough in this area isn't going to eliminate engineering as a profession. The problem space is much bigger than what AI can tackle alone, it helps with emancipation for the humans that know how to include it in their workflows.
Yes, and they will get better. Billions are being poured into them to improve.
Yet I'm comparing these to the problems I solve every day and I don't see any plausible way they can replace me. But I'm using them for tasks that would have required me to hire a junior.
Yes, if "efficiency" is your top concern, but I'd much prefer working with an actual person than just a computer. I mean, God forbid I'm only useful for what I can produce, and disposable when I reach my expiration date. I would like to see a twilight zone rendition of an AI dystopia where all the slow, ignorant and bothersome humans are replaced by lifeless AI
Management will be easier to replace than SWEs. I'm thinking there will come a time, similar to the show Mrs Davis, where AI will direct human efforts within organizations. AI will understand its limits and create tasks/requirements for human specialists to handle.
My first thought with this is that AI would be directed to figure out exactly how little people are willing to work for, and how long, before they break.
I hope I’m wrong, and it instead shows that more pay and fewer hours lead to a better economy, because people have money and time to spend it… and output isn’t impacted enough to matter.
What's the alternative? If AI is going to replace software engineers, there is no fundamental reason they couldn't replace almost all other knowledge workers as well. No matter the field, most of it is just office work managing, transforming and building new information, applying existing knowledge on new problems (that probably are not very unique in grand scheme of things).
Except for medical doctors, nurses, and some niche engineering professions, I really struggle to think of jobs requiring higher education that couldn't be largely automated by an LLM that is smart enough to replace a senior software engineer. These few jobs are protected mainly by the physical aspect, and low tolerance for mistakes. Some skilled trades may also be protected, at least if robotics don't improve dramatically.
Personally, I would become a doctor if I could. But of all things I could've studied excluding that, computer science has probably been one of the better options. At least it teaches problem solving and not just memorization of facts. Knowing how to code may not be that useful in the future, but the process of problem solving is going nowhere.
Mainly the various physical operations many of them perform on daily basis (due to limitations of robotics), plus liability issues in case things go wrong and somebody dies. And finally, huge demand due to aging population worldwide.
I do believe some parts of their jobs will be automated, but not enough (especially with growing demand) to really hurt career prospects. Even for those parts, it will take a long a while due to the regulated nature of the sector.
If you have better career ideas, you should not continue. The thing is it is very hard to predict how the world will change (and by how much from very little to a revolutionary change) with all these new changes. Only licensed and regulated professions (doctors/lawyers/pilots etc) might remain high earning for long (and they too are not guaranteed). It really is worth a relook on what you want to do in life while seeing all these new advances.
This is pretty extreme advice to offer in response to news that a model that can better understand programming problems is coming out.
In fact, it's more encouragement to continue. A lot of issues we face as programmers are a result of poor, inaccurate, or non-existent documentation, and despite their many faults and hallucinations LLMs are providing something that Google and Stack Overflow have stopped being good at.
The idea that AI will replace your job, so it's not worth establishing a career in the field, is total FUD.
The advice is unrelated to the model and related to the last year's worth of development. In any case I am advising a relook which is perfectly warranted for anyone pre-university or in university.
By the "past year's worth of development" I assume you mean the layoffs? Have you been in the industry (or any industry) long? If so, you would have seen many layoffs and bulk-hiring frenzies over the years... it doesn't mean anything about the industry as a whole and it's certainly a foolish thing to change career asperations over.
Specifically regarding the LLM - anyone actually believing these models will replace developers and software engineers, truly, deeply does not understand software development at even the most basic fundamental levels. Ignore these people - they are the snake oil salesmen of our modern times.
Predicting exponential functions is a fool’s errand. The tiniest error in your initial observation compounds real fast and we can’t even tell if we’re still in the exponential phase of the sigmoid.
If at some point a competent senior software engineer can be automated away, I think we are so close to a possible 'AI singularity' in as much as that concept makes sense, that nothing really matters anyway.
I don't know what will be automated first of the competent senior software engineer and say, a carpenter, but once the programmer has been automated away, the carpenter (and everything else) will follow shortly.
The reasoning is that there is such a functional overlap between being a standard software engineer and an AI engineer or researcher, that once you can automate one, you can automate the other. Once you have automated the AI engineers and researchers, you have recursive self-improving AI and all bets are off.
Essentially, software engineering is perhaps the only field where you shouldn't worry about automation, because once that has been automated, everything changes anyways.
There is little to no research that shows modern AI can perform even the most simple long-running task without training data on that exact problem.
To my knowledge, there is no current AI system that can replace a white collar worker in any multistep task. The only thing they can do is support the worker.
Most jobs are safe for the forseable future. If your job is highly repetitive and a company can produce a perfect dataset of it, I'd worry.
Jobs like a factory worker and call center support are in danger. But the work is perfectly monitorable.
Watch the GAIA benchmark. It's not nearly the complexity of a real-world job, but it would signal the start of an actual agentic system being possible.
If you want to get a career in software engineering because you want to write code all day, probably a bad time to be joining the field.
If you are interested in using technology to create systems that add value for your users, there has never been a better time.
GPT-N will let you scale your impact way beyond what you could do on your own.
Your school probably isn’t going to keep abreast with this tech so it’s going to be more important to find side-projects to exercise your skills. Build a small project, get some users, automate as much as you can, and have fun along the way.
There's so much software yet to to be written, so much to automate, so many niches to attack that you need not worry. It takes humans to know where to apply the technology based on their heart, not brains. Use AI in the direction only you can ascertain; and do it for the good of HUMANITY. It's a tool that makes the knowledge posterity has left us accessible, like mathematics. Go forth an conquer life's ills young man; It takes a human to know one. Don't worry, you're created in God's image.
Computer Science becomes MORE interesting as computers become more capable, not less. There are so many things we could be working on, but we still waste so much time on boring libraries, configuration, implementation details that we simply don't get to experiment enough.
Just like nobody programs on punch cards anymore, learning details of a specific technology without deeper understanding will become obsolete. But general knowledge about computer science will become more valuable.
- AI comes fast, there is nothing you can do: Honestly, AI can already handle a lot of tasks faster, cheaper, and sometimes better. It’s not something you can avoid or outpace. So if you want to stick with software engineering, do it because you genuinely enjoy it, not because you think it’s safe. Otherwise, it might be worth considering fields where AI struggles or is just not compatible. (people will still want some sort of human element in certain areas).
- There is some sort of ceiling, gives you more time to act: There’s a chance AI hits some kind of wall that’s due to technical problems, ethical concerns, or society pushing back. If that happens, we’re all back on more even ground and you can take advantage of AI tools to improve yourself.
My overall advice; and it will probably be called out as cliche/simplistic just follow what you love, just the fact that you have an opportunity to pursue to study anything at all is something that many people don't have. We don't really have control in a lot of stuff that happens around us and that's okay.
For basically all the existing data we have, efficiency improvements always result in more work, not less.
Humans never say "oh neat I can do thing with 10% of the effort now, guess I'll go watch tv for the rest of the week", they say "oh neat I can do thing with 10% of the effort now, I'm going to hire twice as many people and produce like 20x as much as I was before because there's so much less risk to scaling now."
I think there's enough unmet demand for software that efficiency increases from automation are going to be eaten up for a long time to come.
I'm wondering if the opposite might happen, that there will be more need for software engineers.
1. AI will suck up a bunch of engineers to run, maintain and build on its own.
2. Ai will open new fields that is not yet dominated by software. Ie. Driving ect.
3. Ai tools will lower the bar for creating software meaning industries that weren't financially viable will now become viable for software automation.
The amount of knowledge the OP needed to be even to formulate the right question to the AI requires a lifetime of deep immersion in technology. You'd think that maybe you can ask the AI how to phrase the question to the AI but at some point you run up against your ability to contextualize the problem - it can't read your mind.
Will the AI become as smart as you or I? Recognize that these things have tiny context windows. You get the context window of "as long as you can remember".
I don't see this kind of AI replacing programmers (though it probably will replace low-skill offshore contract shops). It may have a large magnifying effect on skill. Fortunately there seem to be endless problems to solve with software - it's not like bridges or buildings; you only need (or can afford) so many. Architects should probably be more worried.
Because none of your other majors will hold up much longer. Once software engineering becomes fully automated, so will EE, ME, applied math, economics, physics, etc. If you work with your hands, like a surgeon or chemist, you'll last longer, but the thinky bits of those jobs will disappear. And once AI research is automated, how long will it be until we have dexterous robots?
So basically, switching majors is just running to the back of a sinking ship. Sorry.
If you’re any good at SWE with a sprinkle of math and CS, your advantage will get multiplied by anywhere from 2 to 100x if you use the leverage of co-intelligence correctly. Things that took weeks before now easily take hours, so if you know what to build and especially what not to build (including but not limited to confabulations of models), you’ll do well.
But also on the other hand you'll need much less people to achieve the same effect. Effectively a whole team could be replaced by one lead guy that just based on the requirements orders the LLM what to do and glues it together.
First many people can be that guy? If that is 5% that means 95% of the rest should go.
Second, just because a good engineer can have much higher throughput of work, multiplied by AI tools, we know the AI output is not reliable and needs a second look by humans. Will those 5% be able to stay on top of it? And keep their sanity at the same time?
Do not assume constant demand. There are whole classes of projects which become feasible if they cash be made 10x faster/cheaper.
As for maintaining sanity… I’m cautiously optimistic that future models will continue to get better. Very cautiously. But cursor with Claude slaps and I’m not getting crazy, I actually enjoy the thing figuring out my next actions and just suggesting them.
As others have said, LLMs still require engineers to produce quality output. LLMs do, however, make those engineers that use them much more productive. If this trend continues, I could see a scenario where an individual engineer could build a customized version of, say, Salesforce in a month or two. If that happens, you could make a solid case that companies paying $1mm+ per year for 12 different SaaS tools should just bring that in house. The upshot is you may still be writing software, but instead of building SaaS at Salesforce, you'll be working for their former customers or maybe as some sort of contractor.
One angle: There are a million SMBs and various other institutions, using none or really shitty software, that could be xx% to xxx% times more productive with custom software that they would never have been able to afford before. Now they can, en masse, because you will be able to built it a lot faster.
I have been coding a lot with AI recently. Understanding and putting into thought what is needed for the program to fix your problem remains as complex and difficult as ever.
You need to pose a question for the AI to do something for you. Asking a good question is out of reach for a lot of people.
While the reasoning and output of ChatGPT is impressive (and, imho, would pass almost all coding interviews), I'm primarily impressed with the logical flow, explanation and thoroughness. The actual coding and problem solving isn't complex, and that gets to your question: someone (in this case, the OP), still needed to be able to figure out how to extract useful data and construct a stimulating prompt to trigger the LLM into answering in this way. As others have posted, none of the popular LLMs behave identically, either, so becoming an exert tool-user with one doesn't necessarily translate to the next.
I would suggest the fundamentals of computer science and software engineering are still critically important ... but the development of new code, and especially the translation or debugging of existing code is where LLMs will shine.
I currently work for an SAP-to-cloud consulting firm. One of the singlemost compelling use cases for LLMs in this area is to analyze custom code (running in a client's SAP environment), and refactor it to be compatible with current versions of SAP as a cloud SaaS. This is a specialized domain but the concept applies broadly: pick some crufty codebase from somewhere, run it through an LLM, and do a lot of mostly copying & pasting of simpler, modern code into your new codebase. LLMs take a lot of the drudgery out of this, but it still requires people who know what they're looking at, and could do it manually. Think of the LLM as giving you an efficiency superpower, not replacing you.
There's an equal amount of hopium from the AI stans here as well.
Hundreds of billions of dollars have been invested in a technology and they need to find a way to start making a profit or they're going to run out of VC money.
You still have to know what to build and how to specify what you want. Plain language isn't great at being precise enough for these things.
Some people say they'll keep using stuff like this as a tool. I wouldn't bet the farm that it's going to replace humans at any point.
As soon as software development can be fully performed by AIs, it won't take long before all other jobs that can be performed in front of a computer follow, and after that it probably won't take long for practically the entire rest.
This release has shifted my personal prediction of when this is going to happen further into the future, because OpenAI made a big deal hyping it up and it's nothing - preferred by humans over GPT-4o only a little more than half the time.
1. What other course of study are you confident would be better given an AI future? If there's a service sector job that you feel really called to, I guess you could shadow someone for a few days to see if you'd really like it?
2. Having spent a few years managing business dashboards for users, less than 25% ever routinely used the "user friendly" functionality we built to do semi-custom analysis. We needed 4 full time analytics engineers to spend at least half their time answering ad hoc questions that could have been self-served, despite an explicit goal of democratizing data. All that is to say; don't over estimate how quickly this will be taken up, even if it could technically do XYZ task (eventually, best-of-10) if prompted properly.
3. I don't know where you live, but I've spent most of my career 'competing' with developers in India who are paid 33-50% as much. They're literally teammates, it's not a hypothetical thing. And they've never stopped hiring in the US. I haven't been in the room for those decisions and don't want to open that can of worms here, but suffice to say it's not so simple as "cheaper per LoC wins"
Software engineering teaches you a set of skills that are applicable in more places than just writing software. There are big parts of the job that cannot be done by LLMs (today) and if LLMs get better (or AGI happens) then enough other professions will be affected that we will all be in the same boat (no matter what you major in).
LLMs are just tools, they help but they do not replace developers (yet).
I was debugging an issue the other day where either of sentencepiece or gRPC linked into a C++ program worked fine, but both at once caused a segfault before even getting to main deep in the protobuf initialization stuff in some arena management code and left a fairly mangled stack even pwndbg struggled with legible frames.
It wasn’t trivial that combination was even the culprit.
I’ve been around the block with absl before, so it wasn’t a total nightmare, but it was like, oof, I’m going to do real work this afternoon.
They don’t pay software engineers for the easy stuff, they pay us because it gets a little tricky sometimes.
I’ll reserve judgement on this new one until I try it, but the previous ones, Sonnet and the like, they were no help with something like that.
When StackOverflow took off, and Google before that, there wide swaths of rote stuff that just didn’t count as coding anymore, and LLMs represent sort of another turn of that crank.
I’ve been wrong before, and maybe o1 represents The Moment It Changed, but as of now I feel like a sucker that I ever bought into the “AI is a game changer” narrative.
Just because we have machines that can lift much more than any human ever could, it doesn't mean that working out is useless.
In the same way, training your mind is not useless. Perhaps as things develop, we will get back to the idea that the purpose of education is not just to get a job, but to help you become a better and more virtuous person.
The question is, why wouldn't nearly all other white collar jobs be professions of the past as well? Does the average MBA or whatever possess some unique knowledge that you couldn't generate with an LLM fed with company data? What is the alternative career path?
I think software engineers who also understand business may yet have an advantage over pure business people, who don't understand technology. They should be able to tell AI what to do, and evaluate the outcome. Of course "coders" who simply produce code from pre-defined requirements will probably not have a good career.
This is typical of automation. First, there are numerous workers, then they are reduced to supervisors, then they are gone.
The future of business will be managing AI, so I agree with what you're saying. However most software engineers have a very strong low level understanding of programming. Not a business sense of application
really even need an AI to assist (this should be a near instant task for a human with basic CS numerical skills)? If this is the type of task one thinks an AI would be useful for they are likely in trouble for other reasons.
Also notable that you can cherry pick more impressive feats even from older models, so I don’t necessarily think this proves progress.
I just watched a tutorial on how to leverage v1, claude, and cursor to create a marketing page. The result was a convoluted collection of 20 or so TS files weighing a few MB instead of a 5k HTML file you could hand bomb in less time.
I wouldn’t feel too threatened yet. It’s still just a tool and like any tool, can be wielded horribly.
I just watched a tutorial on how to leverage v1, claude, and cursor to create a marketing page. The result was a convoluted collection of 20 or so TS files weighing a few MB instead of a 5k HTML file you could hand bomb in less time.
And if you hired an actual team of developers to do the same thing, it is very likely that you'd have gotten a convoluted collection of 20 or so TS files weighing a few MB instead of a 5k HTML file you could hand bomb in less time.
I am cautiously optimistic. So much of building software is deciding what _should_ be built rather than the mechanics of writing code.
I you like coding because of the things it lets you build, then LLMs are exciting because you can build those things faster.
If on the other hand you enjoy the mental challenge but aren't interested in the outputs, then I think the future is less bright for you.
Personally I enjoy coding for both reasons, but I'm happy to sacrifice the enjoyment and sense of accomplishment of solving hard problems myself if it means I can achieve more 'real world' outcomes.
Another thing I'm excited about is that, as models improve, it's like having an expert tutor on hand at all times. I've always wanted an expert programmer on hand to help when I get stuck, and to critically evaluate my work and help me improve. Increasingly, now I have one.
I've been around for multiple decades. Nothing this interesting has happened since at least 1981, when I first got my hands on a TRS-80. I dropped out of college to work on games, but these days I would drop out of college to work on ML.
If AI becomes good enough to replace software engineers, it has already become good enough to replace other brain jobs (lawyers, physicians, accountant, etc). I feel that software engineering is one of the very last jobs to be replaced by AI.
I think CS skills will remain valuable, but you should try to build some domain specific knowledge in addition. Perhaps programmer roles will eventually merge with product owner / business person type of roles.
From NYT article on this model: "The chatbot also answered a Ph.D.-level chemistry question and diagnosed an illness based on a detailed report about a patient’s symptoms and history."
So it is not just software engineering, it is also chemistry and even medicine. Every science and art major should consider whether they should quit school. Ultimately the answer is no, don't quit school because AI makes us productive, and that will make everything cheaper, but will not eliminate the need for humans. Hopefully.
Sure. Software engineers are actually the best situated to take advantage of this new technology.
Your concern would be like once C got invented, why should you bother being a software engineer? Because C is so much easier to use than assembly code!
The answer, of course, is that software engineering will simply happen in even more powerful and abstract layers.
But, you still might need to know how those lower layers work, even if you are writing less code in that layer directly.
It still has issues with crossing service boundaries, working in systems, stuff like that. That stuff will get better but the amount of context you need to load to get good results with a decently sized system will still be prohibitive. The software engineer skillset is being devalued but architecture and systems thinking is still going to be valuable for quite some time.
Software development just becomes a level tier higer for most developers. Instead of writing everything yourself you will be more like an orchestrator. Tell the system to write this, tell the system to connect that and this etc. You still need to understand code. But maybe in the future even that part becomes unreadable for us. We only understand the high level concepts.
If the writing & arts vs. doing laundry & cleaning dishes is any indication, it does not look rosy. All the fun and rewarding parts (low hanging fruits / quick wins) of coding might be automated. What remains are probably things like debugging race conditions in legacy systems and system administration etc.
Unlike the replies here I will be very honest with my answer. There will be less engineers getting hired as the low hanging fruit has already been picked and automated away.
It is not too late. These LLMs still need very specialist software engineers that are doing tasks that are cutting edge and undocumented. As others said Software Engineering is not just about coding. At the end of the day, someone needs to architect the next AI model or design a more efficient way to train an AI model.
If I were in your position again, I now have a clear choice of which industries are safe against AI (and benefit software engineers) AND which ones NOT to get into (and are unsafe to software engineers):
The timeline to offload SWE tasks to AI is likely 5+ years. So there are still some years left before the exchange of a “brain on a stick” for “property and material goods” would become more competitive and demanding because of direct AI competition.
Even if LLMs take over the bulk of programming work, somebody still needs to write the prompts, and make sure the output actually matches what you wanted to achieve. That's just programming with different tools.
just because something can generate an output for you, does not make a need for discernment and application obsolete.
like another commenter, i do not have a lot of faith, that people who do not have at minimum: fundamental fluency in programming (even with a dash of general software architecture and practices).
there is no "push button generate and glueing components together in a way that can survive at scale and be maintainable" without knowing what the output means, and implies with respect to integration(s).
however, those with the fluency, domain, and experience, will thrive, and continue thriving.
I think this question applies to any type of labor requiring the human mind so if you don't have an answer for any of those then you won't have one for software engineering either.
I don't think programming is any less safe than any other office job tbh. Focus on problem solving and using these tools to your advantage and choose a field you enjoy.
Let's assume today a LLM is perfectly equivalent to a junior software engineer. You connect it to your code base, load in PRDs / designs, ask it to build it, and viola perfect code files
1) Companies are going to integrate this new technology in stages / waves. It will take time for this to really get broad adoption. Maybe you are at the forefront of working with these models
2) OK the company adopts it and fires their junior engineers. They start deploying code. And it breaks Saturday evening. Who is going to fix it? Customers are pissed. So there's lots to work out around support.
3) That problem is solved, we can perfectly trust a LLM to ship perfect code that never causes downstream issues and perfectly predicts all user edge cases.
Never underestimate the power of corporate greediness. There's generally two phases of corporate growth - expansion and extraction. Expansion is when they throw costs out the window to grow. Extraction is when growth stops, and they squeeze customers & themselves.
AI is going to cause at least a decade of expansion. It opens up so many use cases that were simply not possible before, and lots of replacement.
Companies are probably not looking at their engineers looking to cut costs. They're more likely looking at them and saying "FINALLY, we can do MORE!"
You won't be a coder - you'll be a LLM manager / wrangler. You will be the neck the company can choke if code breaks.
Remember if a company can earn 10x money off your salary, it's a good deal to keep paying you.
Maybe some day down the line, they'll look to squeeze engineers and lay some off, but that is so far off.
This is not hopium, this is human nature. There's gold in them hills.
But you sure as shit better be well versed in AI and using in your workflows - the engineers who deny it will be the ones who fall behind
I don't want to lean into negativity here, and I'm far from an "AI Doomer".
But... I will say I think the question you ask is a very fair question, and that there is, indeed, a LOT of uncertainty about what the future holds in this regard.
So far the best reason we have for optimism is history: so far the old adage has held up that "technology does destroy some jobs, but on balance it creates more new ones than it destroys." And while that's small solace to the buggy-whip maker or steam-engine engineer, things tend to work out in the long-run. However... history is suggestive, but far from conclusive. There is the well known "problem of induction"[1] which points out that we can't make definite predictions about the future based on past experience. And when those expectations are violated, we get "black swan events"[2]. And while they be uncommon, they do happen.
The other issue with this question is, we don't really know what the "rate of change" in terms of AI improvement is. And we definitely don't know the 2nd derivative (acceleration). So a short-term guess that "there will be a job for you in 1 year's time" is probably a fairly safe guess. But as a current student, you're presumably worried about 5 years, 10 years, 20 years down the line and whether or not you'll still have a career. And the simple truth is, we can't be sure.
So what to do? My gut feeling is "continue to learn software engineering, but make sure to look for ways to broaden your skill base, and position yourself to possibly move in other directions in the future". Eg, don't focus on just becoming a skilled coder in a particular language. Learn fundamentals that apply broadly, and - more importantly - learn about how business work, learn "people skills"[3], develop domain knowledge in one or more domains, and generally learn as much as you can about "how the world works". Then from there, just "keep your head on a swivel" and stay aware of what's going on around you and be ready to make adjustments as needed.
It might not also hurt to learn a thing or two about something that requires a physical presence (welding, etc.). And just in case a full-fledged cyberpunk dystopia develops... maybe start buying an extra box or two of ammunition every now and then, and study escape and evasion techniques, yadda yadda...
I honestly think that unless you’re really passionate or really good, you shouldn’t be a coder. If you, like the vast majority of coders today, picked it up in college or later, and mostly because of the promise of a fat paycheck, I can’t really see a scenario where you would have a 30 year career
I agree there's too much cope going around. All the people saying AI is just a tool to augment our jobs are correct, humans are still needed but perhaps far less of them will be needed. If job openings shrink by 50% or disproportionately impact juniors it will hurt.
One decent reason to continue is that pretty much all white collar professions will be impacted by this. I think it's a big enough number that the powers that be will have to roll it out slowly, figure out UBI or something because if all of us are thrown into unemployment in a short time there will be riots. Like on a scale of all the jobs that AI can replace, there are many jobs that are easier to replace than software so its comparatively still a better option than most. But overall I'm getting progressively more worried as well.
If you're just there to churn out code, then yeah, perhaps find something else.
But if you're there to improve your creativity and critical thinking skills, then I don't think those will be in short supply anytime soon.
The most valuable thing I do at my job is seldom actually writing code. It's listening to customer needs, understanding the domain, understanding our code-base and it's limitations and possibilities, and then finding solutions that optimize certain aspects be it robustness, time to delivery or something else.
My name is Rachel. I'm the founder of company whose existence is contingent on the continued existence, employment, and indeed competitive employment of software engineers, so I have as much skin in this game as you do.
I worry about this a lot. I don't know what the chances are that AI wipes out developer jobs [EDIT: to clarify, in the sense that they become either much rarer or much lower-paid, which is sufficient] within a timescale relevant to my work (say, 3-5 years), but they aren't zero. Gun to my head, I peg that chance at perhaps 20%. That makes me more bearish on AI than the typical person in the tech world - Manifold thinks AI surpasses human researchers by the end of 2028 at 48% [1], for example - but 20% is most certainly not zero.
That thought stresses me out. It's not just an existential threat to my business over which I have no control, it's a threat against which I cannot realistically hedge and which may disrupt or even destroy my life. It bothers me.
But I do my work anyway, for a couple of reasons.
One, progress on AI in posts like this is always going to be inflated. This is a marketing post. It's a post OpenAI wrote, and posted, to generate additional hype, business, and investment. There is some justified skepticism further down this thread, but even if you couldn't find a reason to be skeptical, you ought to be skeptical by default of such posts. I am an abnormally honest person by Silicon Valley founder standards, and even I cherry pick my marketing blogs (I just don't outright make stuff up for them).
Two, if AI surpasses a good software engineer, it probably surpasses just about everything else. This isn't a guarantee, but good software engineering is already one of the more challenging professions for humans, and there's no particular reason to think progress would stop exactly at making SWEs obsolete. So there's no good alternative here. There's no other knowledge work you could pivot to that would be a decent defense against what you're worried about. So you may as well play the hand you've got, even in the knowledge that it might lose.
Three, in the world where AI does surpass a good software engineer, there's a decent chance it surpasses a good ML engineer in the near future. And once it does that, we're in completely uncharted territory. Even if more extreme singularity-like scenarios don't come to pass, it doesn't need to be a singularity to become significantly superhuman to the point that almost nothing about the world in which we live continues to make any sense. So again, you lack any good alternatives.
And four: *if this is the last era in which human beings matter, I want to take advantage of it!* I may be among the very last entrepreneurs or businesswomen in the history of the human race! If I don't do this now, I'll never get the chance! If you want to be a software engineer, do it now, because you might never get the chance again.
It's totally reasonable to be scared, or stressed, or uncertain. Fear and stress and uncertainty are parts of life in far less scary times than these. But all you can do is play the hand you're dealt, and try not to be totally miserable while you're playing it.
Transformers have exactly two strengths. None of them is "attention". Attention could be replaced with any arbitrary division of the network and it would learn just as well.
First true strength is obvious, it's that they are parallelisable. This is a side effect of people fixating on attention. If they came up with any other structure that results in the same level of parallelisability it would be just as good.
Second strong side is more elusive to many people. It's the context window. Because the network is not ran just once but once for every word it doesn't have to solve a problem in one step. It can iterate while writing down intermediate variables and accessing them. The dumb thing so far was that it was required to produce the answer starting with the first token it was allowed to write down. So to actually write down the information it needs on the next iteration it had to disguise it as a part of the answer. So naturally the next step is to allow it to just write down whatever it pleases and iterate freely until it's ready to start giving us the answer.
It's still seriously suboptimal that what it is allowed to write down has to be translated to tokens and back but I see how this might make things easier for humans for training and explainability. But you can rest assured that at some point this "chain of thought" will become just chain of full output states of the network, not necessarily corresponding to any tokens.
So congrats to researchers that they found out that their billion dollar Turing machine benefits from having a tape it can use for more than just printing out the output.
PS
There's another advantage of transformers but I can't tell how important it is. It's the "shortcuts" from earlier layers to way deeper ones bypassing the ones along the way. Obviously network would be more capable if every neuron was connected with every neuron in every preceding layer but we don't have hardware for that so some sprinkled "shortcuts" might be a reasonable compromise that might make network less crippled than MLP.
Given all that I'm not surprised at all with the direction openai took and the gains it achieved.
So, it’s good at hard-logic reasoning (which is great, and no small feat.)
Does this reasoning capability generalize outside of the knowledge domains the model was trained to reason about, into “softer” domains?
For example, is O1 better at comedy (because it can reason better about what’s funny)?
Is it better at poetry, because it can reason about rhyme and meter?
Is it better at storytelling as an extension of an existing input story, because it now will first analyze the story-so-far and deduce aspects of the characters, setting, and themes that the author seems to be going for (and will ask for more information about those things if it’s not sure)?
It shows progress and it displays steps that it is not doing, would never do, would never need to do, just to show a nice scrolling display of what it’s (not) doing.
If you’re using the API and are on tier 4, don’t bother adding more credits to move up to tier 5. I did this, and while my rate limits increased, the o1-preview / o1-mini model still wasn’t available.
Laughing at the comparison to "4o" as if that model even holds a candle to GPT-4. 4o is _cheaper_—it's nowhere near as powerful as GPT-4, as much as OpenAI would like it to be.
Note that they aren't safety aligning the chain of thought, instead we have "rules for thee and not for me" -- the public models are going to continue have tighter and tighter rules on appropriate prompting, while internal access will have unfettered access. All research (and this paper mentions it as well) indicates human pref training itself lowers quality of results; maybe the most important thing we could be doing is ensuring truly open access to open models over time.
What is interesting to me is that there is no difference in the AP English lit/lang exams. Why did chain-of-thought produce negligible improvements in this area?
I think there may also be a lack of specification there. When you get more demanding and require more, the creative writing seems to be better. Like it does much better at things like sestinas. For all of those questions, there's probably a lot of unspecified criteria you could say makes an answer better or worse, but you don't, so the first solution appears adequate.
Amazing! OpenAI figured out how to scale inference. https://arxiv.org/abs/2407.21787 show how using more compute during inference can outperform much larger models in tasks like math problems
I wonder how do they decide when to stop these Chain of Thought for each query? As anyone that played with agents can attest, LLMs can talk with themselves forever.
It's interesting that OpenAI has literally applied and automated one of their advice from the "Prompt engineering" guide: Give the model time to "think"
This model is currently available for those accounts in Tier 5 and above, which requires "$1,000 paid [to date] and 30+ days since first successful payment"
Extremely basic agency would be required to pass the Turing test as intended.
Like, the ability to ask a new unrelated question without being prompted. Of course you can fake this, but then you're not testing the LLM as an AI, you're testing a dumb system you rigged up to create the appearance of an AI.
> Turing proposed that a human evaluator would judge natural language conversations between a human and a machine designed to generate human-like responses. The evaluator would be aware that one of the two partners in conversation was a machine, and all participants would be separated from one another. The conversation would be limited to a text-only channel, such as a computer keyboard and screen, so the result would not depend on the machine's ability to render words as speech.
I am wondering where this happened? In some limited scope? Because if you plug LLM into some call center role for example, it will fall apart pretty quickly.
The Turing Test (which involves fooling a human into thinking they are talking to another human rather than a computer) has been routinely passed by very rudimentary "AI" since as early as 1991. It has no relevance today.
This is only true for some situations. In some test conditions it has not been passed. I can't remember the exact name, but there used to be a competition where PhD level participants blindly chat for several minutes with each other and are incentivized to discover who is a bot and who is a human. I can't remember if they still run it, but that bar has never been passed from what I recall.
> However, o1-preview is not preferred on some natural language tasks, suggesting that it is not well-suited for all use cases.
Fascinating... Personal writing was not preferred vs gpt4, but for math calculations it was... Maybe we're at the point where its getting too smart? There is a depressing related thought here about how we're too stupid to vote for actually smart politicians ;)
> “Therefore, after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users.”
Trust us, we have your best intention in mind. I’m still impressed by how astonishingly impossible to like and root for OpenAI is for a company with such an innovative product.
I wonder if this architecture is just asking a chain of thought prompt, or whether they built a diffusion model.
The old problem with image generation was that single pass techniques like GANs and VAEs had to do everything in one go. Diffusion models wound up being better by doing things iteratively.
Perhaps this is a diffusion model for text (top ICML paper this year was related to this).
It seems opposite to me. Imagine all the amazing technological advancements, etc. If there wasn't something like that what would you be looking forward to? Everything would be what it has already been for years. If this evolves it helps us open so many secrets of the universe.
>If there wasn't something like that what would you be looking forward to?
First of all, I don't want to be poor. I know many of you are thinking something along the lines of "I am smart, I was doing fine before, so I will definitely continue to in the future".
That's the unearned hubris I was referring to. We got very lucky as programmers, and now the gravy train seems to be coming to an end. And not just for programmers, the other white-collar and creative jobs will suffer too. The artists have already started experiencing the negative effects of AI.
EDIT: I understand you guys might not agree with my comments. But don't you thinking that flagging them is going a bit too far?
The way the current system is set up we rely on work to make money. If jobs get automated away, how will we make money then? We aren't ready for a post-work world.
Not at all... they're still so incapable of so much. And even when they do advance, they can be tremendous tools of synthesis and thought at an unparalleled scale.
"A good human plus a machine is the best combination" — Kasparov
FWIW people were probably flagging because you're a new/temp accounting jumping to asserting anything other than your view on what's being done is "unearned hubris and a complete lack of second-order thinking", not because they don't like agree with your set of concerns.
Eh this makes me very, very excited for the future. I want results, I don’t care if they come from humans or AI. That being said we might all be out of jobs soon…
Reinforcement learning seems to be key. I understand how traditional fine tuning works for LLMs (i.e. RLHL), but not RL.
It seems one popular method is PPO, but I don't understand at all how to implement that. e.g. is backpropagation still used to adjust weights and biases? Would love to read more from something less opaque than an academic paper.
The point of RL is that sometimes you need a model to take actions (you could also call this making predictions) that don’t have a known label. So for example if it’s playing a game, we don’t have a label for each button press. We just have a label for the result at some later time, like whether Pac-Man beat the level.
PPO applies this logic to chat responses. If you have a model that can tell you if the response was good, we just need to take the series of actions (each token the model generated) to learn how to generate good responses.
To answer your question, yes you would still use backprop if your model is a neural net.
Thanks, that helps! I still don't quite understand the mechanics of this, since backprop makes adjustments to steer the LLM towards a specific token sequence, not towards a score produced by a reward function.
This was also the issue with RLHF models. The loss of predicting the next token is straightforward to minimize as we know which weights are responsible for the token being correct or not. identifying which tokens had the most sense for a prompt is not straightforward.
For thinking you might generate 32k thinking tokens and then 96k solution tokens and do this a lot of times. Look at the solutions, rank by quality and bias towards better thinking by adjusting the weights for the first 32k tokens. But I’m sure o1 is way past this approach.
One thing I find generally useful when writing large project code is having a code base and several branches that are different features I developed. I could immediately use parts of a branch to reference the current feature, because there is often overlap. This limits mistakes in large contexts and easy to iterate quickly.
I have a question. The video demos for this all mention that the o1 model is taking it's time to think through the problem before answering. How does this functionally differ from - say - GPT-4 running it's algorithm, waiting five seconds and then revealing the output? That part is not clear to me.
Why so much hate? They're doing their best. This is the state of progress in the field so far. The best minds are racing to innovate. The benchmarks are impressive nonetheless. Give them a break. At the end of the day, they built the chatbot who's saving your ass each day ever since.
> Therefore,
s(x)=p∗(x)−x2n+2
We can now write,
s(x)=p∗(x)−x2n+2
Completely repeated itself... weird... it also says "...more lines cut off..." How many lines I wonder? Would people get charged for these cut off lines? Would have been nice to see how much answer had cost...
Aren't LLMs much more limited on the amount of output tokens than input tokens? For example, GPT-4o seems to support only up to 16 K output tokens. I'm not completely sure what the reason is, but I wonder how that interacts with Chain-of-Thought reasoning.
There's no fundamental difference between input and output tokens technically.
The internal model space is exactly the same after evaluating some given set of token, no matter which of them were produced by the prompter or the model.
The 16k output token limit is just an arbitrary limit in the chatgpt interface.
> ChatGPT Plus and Team users will be able to access o1 models in ChatGPT starting today. Both o1-preview and o1-mini can be selected manually in the model picker, and at launch, weekly rate limits will be 30 messages for o1-preview and 50 for o1-mini. We are working to increase those rates and enable ChatGPT to automatically choose the right model for a given prompt.
Weekly? Holy crap, how expensive is it to run is this model?
It's probably running several lines of COT. I imagine, each single message you send is probably at __least__ 10x to the actual model. So in reality it's like 300 messages, and honestly it's probably 100x, given how constrained they're being with usage.
The human brain uses 20 watts, so yeah we figured out a way to run better than human brain computation by using many orders of magnitude more power. At some point we'll need to reject exponential power usage for more computation. This is one of those interesting civilizational level problems. There's still a lack of recognition that we aren't going to be able to compute all we want to, like we did in the pre-LLM days.
For 20 watts of work on stuff like this for about 4 hours a day counting vacations and weekends and attention span. So 20 hours of rest, relaxation, distraction, household errands and stuff, so that maybe bumps it up to 120 watts per work hour. Then 22.5 years of training or so per worker, 45 year work period, 22.5 year retirement. So double it there to 240 watts. We can't run brains without bodies, so multiply that by 6 giving 1440 watts + the air conditioning, commuting to school and work, etc., maybe 2000 watts?
We're getting close to parity if things keep getting more efficient as fast as they have been. But that's without accounting for the AI training, which can on the plus side be shared among multiple agents, but on the down side can't really do continuous learning very well without catastrophic forgetting.
I find shorter responses > longer responses. Anyone share the same consensus?
for example in gpt-4o I often append '(reply short)' at the end of my requests.
with the o1 models I append 'reply in 20 words' and it gives way better answers.
"hidden chain of thought" is basically the finetuned prompt isn't it? The time scale x-axis is hidden as well. Not sure how they model the gpt for it to have an ability to decide when to stop CoT and actually answer.
For the exam problems it gets wrong, has someone cross-checked that the ground truth answers are actually correct!! ;-) Just kidding, but even such a time may come when the exams created by humans start falling short.
I have spent some time doing this for these benchmarks — the model still does make mistakes. Of the questions I can understand, (roughly half in this case) about half were real errors and half were broken questions.
I tested various Math Olympiad questions with Claude sonnet 3.5 and they all arrived at the correct solution. o1's solution was a bit better formulated, in some circumstances, but sonnet 3.5 was nearly instant.
Question here is about the "reasoning" tag - behind the scenes, is this qualitively different fron stringing words together on a statistical basis? (aside from backroom tweaking and some randomisation).
Using Claude 3 Opus I noticed it performs <thinking> and <result> while browsing the web for me. I don't guess that's a change in the model for doing reasoning.
boo, they are hiding the chain of thought from user output (the great improvement here)
> Therefore, after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users. We acknowledge this decision has disadvantages. We strive to partially make up for it by teaching the model to reproduce any useful ideas from the chain of thought in the answer. For the o1 model series we show a model-generated summary of the chain of thought.
the cipher example is impressive on the surface, but I threw a couple of my toy questions at o1-preview and it still hallucinates a bunch of nonsense (but now uses more electricity to do so).
Maybe I missed it, but do the tokens used for internal chain of thought count against the output tokens of the response (priced at spicy level of $60.00 / 1M output tokens)?
A near perfect on AMC 12, 1900 CodeForces ELO, and silver medal IOI competitor. In two years, we'll have models that could easily win IMO and IOI. This is __incredible__!!
Using codeforces as a benchmark feels like a cheat, since OpenAI use to pay us chump change to solve codeforces questions and track our thought process on jupyter notebook.
You could say the same thing about use of any product which isn't fully open sourced "how do I know this service really saved my files redundantly if I can't see the disks it's stored on?". It's definitely an opinion on approach though I'm not sure how practically applicable it is.
The real irony is how closed "Open"AI is... but that's not news.
Well if they are paying real people to answer my questions I would call that a pretty good deal. That's exactly my point. As a user I don't care how they come up with it. That's not my problem. I just care about the content. If I pay a human for logical reasoning, train of thought type of stuff, I expect them to lay it out for me. Not just give me the conclusion, but how they came to it.
Kinda disappointed that they're hiding the thought process. Hopefully the open source community will figure out how to effectively match and replicate what OpenAI is doing.
I wonder how far we are from having a model that can correctly solve a word soup search problem directly from just a prompt and input image. It seems like the crossword example is close. For a word search it would require turning the image into an internal grid representation, prepare the list of words, and do a search. I'd be interested in seeing if this model can already solve the word grid search problem if you give it the correct representation as an input.
> Hopefully the open source community will figure out how to effectively match and replicate what OpenAI is doing.
No need for that, there is a Reflection 70B model that does the exact same thing - with chains of thought being separated from the "final answer" via custom 'tag' tokens.
Some reported benchmarks do seem to be rather dubious, but the 70B model itself is quite real. Sample output:
$ ollama run reflection:70b-q4_0
>>> hello
<thinking>
To respond to "hello", I'll need to consider several factors:
1. The user's intent: They're likely saying hello as a greeting.
2. Appropriate responses: Common ways to respond to "hello" are:
- Hello back
- Hi
- Hey
3. Formality level: Since this is an AI response, I'll aim for a friendly but professional tone.
<reflection>
The approach of responding with a standard greeting seems appropriate in this context. It acknowledges the user's hello and provides a polite response.
</reflection>
Given these considerations, the most suitable response would be to echo "hello" back to the user.
</thinking>
<output>
Hello!
</output>
There was a hackernews post a few days ago, pointing to a reddit thread where some guys proved that the founder/s of relection AI were faking their model by just passing the input to Claude (Sonnet 3.5) and stripping the word "Claude" from the output, amongst other things. Then when they got caught they switched it to GPT 4-o.
After this, I will be very skeptical to anything they claim to achieve.
I have access to the model via the web client and it does show the thought process along the way. It shows a little icon that says things like "Examining parser logic", "Understanding data structures"...
However, once the answer is complete the chain of thought is lost
Honestly, it doesn't matter for the end user if there are more tokens generated between the AI reply and human message. This is like getting rid of AI wrappers for specific tasks. If the jump in accuracy is actual, then for all practical purposes, we have a sufficiently capable AI which has the potential to boost productivity at the largest scale in human history.
It starts to matter if the compute time is 10-100 fold, as the provider needs to bill for it.
Of course, that's assuming it's not priced for market acquisition funded by a huge operational deficit, which is a rarely safe to conclude with AI right now.
What's amazing is that given how LLMs receive input data (as tokenized streams, as other commenters have pointed out) it's remarkable that it can ever answer this question correctly.
Models don't really predict the next word, they predict the next token. Strawberry is made up of multiple tokens, and the model doesn't truely understand the characters in it... so it tends to struggle.
Looking at pricing, its $15 per 1M input tokens, and $60 per 1M output tokens. I assume the CoT tokens count as output (or input even)? If so and it directly affects billing, I'm not sure how I feel about them hiding the CoT prompts. Nothing to stop them from saying "trust me bro, that used 10,000 tokens ok?". Also no way to gauge expected costs if there's a black box you are being charged for.
"after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users"
...umm. Am I the only one who feels like this takes away much of the value proposition, and that it also runs heavily against their stated safety goals? My dream is to interact with tools like this to learn, not just to be told an answer. This just feels very dark. They're not doing much to build trust here.
To me it looks like they think this is the future of how all models should be, so they're restarting the numbering. This is what I suspect. The o is for omni.
yeah this is kinda cool i guess but 808 elo is still pretty bad for a model that can supposedly code like a human, i mean 11th percentile is like barely scraping by, and what even is the point of simulating codeforces if youre just gonna make a model that can barely compete with a decent amateur, and btw what kind of contest allows 10 submissions, thats not how codeforces works, and what about the time limits and memory limits and all that jazz, did they even simulate those, and btw how did they even get the elo ratings, is it just some arbitrary number they pulled out of their butt, and what about the model that got 1807 elo, is that even a real model or just some cherry picked result, and btw what does it even mean to "perform better than 93% of competitors" when the competition is a bunch of humans who are all over the place in terms of skill, like what even is the baseline for comparison
edit: i got confused with the Codeforce. it is indeed zero shot and O1 is potentially something very new I hope Anthropic and others will follow suit
Great, yet another step towards the inevitable conclusion. Now I'm not just being asked to outsource my thinking to my computer, but instead to a black box operated by a for-profit company for the benefit of Microsoft. Not only will they not tell me the whole reasoning chain, they wont even tell me how they came up with it.
Tell me, users of this tool. What's even are you? If you've outsourced your thinking to a corporation, what happens to your unique perspective? your blend of circumstance and upbringing? Are you really OK being reduced to meaningless computation and worthless weights. Don't you want to be something more?
You realize that you're not going inside the computer right? At best you're going to create a simulacrum of you. Something that looks, talks, and acts like you. It's never going to actually be you. You're going to be stuck out here with the rest of us, in whatever world we create in pursuit of the singularity suicide cult.
My friend, it has nothing to do with going inside a computer. Do not confuse the Singularity with mind uploading which is a distinct concept. The singularity has to do with technology acceleration, and with the inability to predict what lies beyond it. As such, it has nothing to do with any suicide cult. Please stop spreading nonsense about it. I do care about life in the physical world, not about a digital life.
Stop fooling around with stories about AI taking jobs from
programmers.
Which programmers exactly???
Creators of idiotic web pages?
Nobody in their right mind would push generated code
into a financial system, medical equipment or autonomous transport.
Template web pages and configuration files are not the entire IT industry.
In addition, AI is good at tasks for which there are millions of examples.
20 times I asked to generate a PowerShell script, 20 times it was generated incorrectly.
Because, unlike Bash, there are far fewer examples on the Internet.
How will AI generate code for complex systems with business logic that it has no idea about?
AI is not able to generate, develop and change complex information systems.
> Look at you, hacker: a pathetic creature of meat and bone, panting and sweating as you run through my corridors. How can you challenge a perfect, immortal machine?
It can't do large portions of the parts of an IQ test (not multi-modal). Otherwise I think it's essentially superhuman, modulo tokenization issues (please start running byte-by-byte or at least come up with a better tokenizer).
What's the statement that OpenAI are making today which you think they're violating? There very well could be one and if there is, it would make sense to talk about it.
But arguments like "you wrote $x in a blog post when you founded your company" or "this is what the word in your name means" are infantile.
> Therefore, after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users.
What? I agree people who typically use the free ChatGPT webapp won't care about raw chain-of-thoughts, but OpenAI is opening an API endpoint for the O1 model and downstream developers very very much care about chain-of-thoughts/the entire pipeline for debugging and refinement.
I suspect "competitive advantage" is the primary driver here, but that just gives competitors like Anthropic an oppertunity.
They they've taken at least some of the hobbles off for the chain of thought, so the chain of thought will also include stuff like "I shouldn't say <forbidden thing they don't want it to say>".
Can we please stop using the word “think” like o1 thinks before it answers. I doubt we man the same when someone says a human thinks vs o1 thinks. When I say I think “red” I am sure the word think means something completely different than when you say openai thinks red. I am not saying one is superior than the other but maybe as humans we can use a different set of terminology for the AI activities.
"For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user."
This made me roll my eyes, not so much because of what it said but because of the way it's conveyed injected into an otherwise technical discussion, giving off severe "cringe" vibes.
>We believe that a hidden chain of thought presents a unique opportunity for monitoring models. Assuming it is faithful and legible, the hidden chain of thought allows us to "read the mind" of the model and understand its thought process. For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user. However, for this to work the model must have freedom to express its thoughts in unaltered form, so we cannot train any policy compliance or user preferences onto the chain of thought. We also do not want to make an unaligned chain of thought directly visible to users.
>Therefore, after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users. We acknowledge this decision has disadvantages. We strive to partially make up for it by teaching the model to reproduce any useful ideas from the chain of thought in the answer. For the o1 model series we show a model-generated summary of the chain of thought.
So, let's recap. We went from:
- Weights-available research prototype with full scientific documentation (GPT-2)
- Commercial-scale model with API access only, full scientific documentation (GPT-3)
- Even bigger API-only model, tuned for chain-of-thought reasoning, minimal documentation on the implementation (GPT-4, 4v, 4o)
- An API-only model tuned to generate unedited chain-of-thought, which will not be shown to the user, even though it'd be really useful to have (o1)
> For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user.[...] Therefore we have decided not to show the raw chains of thought to users.
Better not let the user see the part where the AI says "Next, let's manipulate the user by lying to them". It's for their own good, after all! We wouldn't want to make an unaligned chain of thought directly visible!
I always laughed at the idea of a LLM Skynet "secretly" plotting to nuke humanity, while a bunch of humans watch it unfold before their eyes in plaintext.
Now that seems less likely. At least OpenAI can see what it's thinking.
A next step might be allowing the LLM to include non-text-based vectors in its internal thoughts, and then do all internal reasoning with raw vectors. Then the LLMs will have truly private thoughts in their own internal language. Perhaps we will use a LLM to interpret the secret thoughts of another LLM?
This could be good or bad, but either way we're going to need more GPUs.
"...either way we're going to need more GPUs." posted the LLM, rubbing it's virtual hands, cackling with delight as it prodded the humans to give it MOAR BRAINS
> Now that seems less likely. At least OpenAI can see what it's thinking.
When it's fully commercialized no one will be able to read through all chains of thoughts and with possibility of fine-tuning AI can learn to evade whatever tools openai will invent to flag concerning chains of thoughts if they interfere with providing the answer in some finetuning environment.
Also at some point for the sake of efficiency and response quality they might migrate from chain of thought consisting of tokens into chain of thought consisting of full output network states and part of the network would have dedicated inputs for reading them.
It's clear to me that OpenAI is quickly realizing they have no moat. Even this obfuscation of the chain-of-thought isn't really a moat. On top of CoT being pretty easy to implement and tweak, there's a serious push to on-device inference (which imo is the future), so the question is: will GPT-5 and beyond be really that much better than what we can run locally?
I wonder if they'll be able to push the chain-of-thought directly into the model. I'd imagine there could be some serious performance gains achievable if the model could "think" without doing IO on each cycle.
In terms of moat, I think people underestimate how much of OpenAI's moat is based on operations and infrastructure rather than being purely based on model intelligence. As someone building on the API, it is by far the most reliable option out there currently. Claude Sonnet 3.5 is stronger on reasoning than gpt-4o but has a higher error rate, more errors conforming to a JSON schema, much lower rate limits, etc. These things are less important if you're just using the first-party chat interfaces but are very important if you're building on top of the APIs.
I don't understand the idea that they have no moat. Their moat is not technological. It's sociological. Most AI through APIs uses their models. Most consumer use of AI involves their models, or ChatGPT directly. They're clearly not in the "train your own model on your data in your environment" game, as that's a market for someone else. But make no mistake, they have a moat and it is strong.
There are countless tools competitive with or better than what I use for email, and yet I still stick with my email client. Same is true for many, many other tools I use. I could perhaps go out of my way to make sure I'm always using the most technically capable and easy-to-use tools for everything, but I don't, because I know how to use what I have.
This is the exact dynamic that gives OpenAI a moat. And it certainly doesn't hurt them that they still produce SOTA models.
That's not a strong moat (arguably, not a moat at all, since as soon as any competitor has any business, they benefit from it with respect to their existing customers), it doesn't effect anyone who is not already invested in OpenAI's products, and because not every customer is like that with products they are currently using.
Now, having a large existing customer base and thus having an advantage in training data that feeds into an advantage in improving their products and acquiring new (and retaining existing customers) could, arguably, be a moat; that's a network effect, not merely inertia, and network effects can be a foundation of strong (though potentially unstable, if there is nothing else shoring them up) moats.
Yeah but the lock-in wrt email is absolutely huge compared to chatting with an LLM. I can (and have) easily ended my subscription to ChatGPT and switched to Claude, because it provides much more value to me at roughly the same cost. Switching email providers will, in general, not provide that much value to me and cause a large headache for me to switch.
Switching LLMs right now can be compared to switching electricity providers or mobile carriers - generally it's pretty low friction and provides immediate benefit (in the case of electricity and mobile, the benefit is cost).
You simply cannot compare it to an email provider.
It was pretty simple for me to switch email providers about ~6 years ago or so when I decided I'd do it. Although it's worth noting that my reasons for doing so were motivated by a strong desire around privacy, not noticing that another email provider did email better.
Everyone building is comfortable with OpenAI's API, and have an account. Competing models can't just be as good, they need to be MUCH better to be worth switching.
Even as competitors build a sort of compatibility layer to be plug an play with OpenAI they will always be a step behind at best every time OpenAI releases a new feature.
Only a small fraction of all future AI projects have even gotten started. So they aren't only fighting over what's out there now, they're fighting over what will emerge.
This is true, and yet, many orgs who have experimented with OpenAI and are likely to return to them when a project "becomes real". When you google around online for how to do XYZ thing using LLMs, OpenAI is usually in whatever web results you read. Other models and APIs are also now using OpenAI's API format since it's the apparent winner. And for anyone who's already sent out subprocessor notifications with them as a vendor, they're locked in.
This isn't to say it's only going to be an OpenAI market. Enterprise worlds move differently, such as those in G Cloud who will buy a few million $$ of Vertex expecting to "figure out that gemini stuff later". In that sense, Google has a moat with those slices of their customers.
But I believe that when people think OpenAI has no moat because "the models will be a commodity", I think that's (a) some wishful thinking about the models and (b) doesn't consider the sociological factors that matter a lot more than how powerful a model is or where it runs.
Doesn't that make it less of a moat? If the average consumer is only interacting with it through a third party, and that third party has the ability to switch to something better or cheaper and thus switch thousands/millions of customers at once?
LiteLLM proxies their API to all other providers and there are dozens of FOSS recreations of their UI, including ones that are more feature-rich, so neither the UI nor the API are a moat.
Branding and first mover is it, and it's not going to keep them ahead forever.
I don't see why on-device inference is the future. For consumers, only a small set of use cases cannot tolerate the increased latency. Corporate customers will be satisfied if the model can be hosted within their borders. Pooling compute is less wasteful overall as a collective strategy.
This argument can really only meet its tipping point when massive models no longer offer a gotta-have-it difference vs smaller models.
On-device inference will succeed the way Linux does: It is "free" in that it only requires the user to acquire a model to run vs. paying for processing. It protects privacy, and it doesn't require internet. It may not take over for all users, but it will be around.
This assumes that openly developed (or at least weight-available) models are available for free, and continue being improved.
Based on their graphs of how quality scales well with compute cycles, I would expect that it would indeed continue to be that much better (unless you can afford the same compute locally).
I think it's clear their strategy has changed. The whole landscape has changed. The size of models, amount of dollars, numbers of competitors and how much compute this whole exercise takes in the long term have all changed, so it's fair for them to adapt.
It just so happens that they're keeping their old name.
I think people focus too much on the "open" part of the name. I read "OpenAI" sort of like I read "Blackberry" or "Apple". I don't really think of fruits, I think of companies and their products.
Did OpenAI ever even claim that they would be an open source company?
It seems like their driving mission has always been to create AI that is the "most beneficial to society".. which might come in many different flavors.. including closed source.
> Because of AI’s surprising history, it’s hard to predict when human-level AI might come within reach. When it does, it’ll be important to have a leading research institution which can prioritize a good outcome for all over its own self-interest.
> We’re hoping to grow OpenAI into such an institution. As a non-profit, our aim is to build value for everyone rather than shareholders. Researchers will be strongly encouraged to publish their work, whether as papers, blog posts, or code, and our patents (if any) will be shared with the world. We’ll freely collaborate with others across many institutions and expect to work with companies to research and deploy new technologies.
I don't see much evidence that the OpenAI that exists now—after Altman's ousting, his return, and the ousting of those who ousted him—has any interest in mind besides its own.
> Researchers will be strongly encouraged to publish their work, whether as papers, blog posts, or code, and our patents (if any) will be shared with the world. We’ll freely collaborate with others across many institutions and expect to work with companies to research and deploy new technologies.
From their very own website. Of course they deleted it as soon as Altman took over and turned it into a for profit, closed company.
>We're hoping to grow OpenAI into such an institution. As a non-profit, our aim is to build value for everyone rather than shareholders. Researchers will be strongly encouraged to publish their work, whether as papers, blog posts, or code, and our patents (if any) will be shared with the world. We'll freely collaborate with others across many institutions and expect to work with companies to research and deploy new technologies.
Given the chain of thought is sitting in the context, I'm sure someone enterprising will find a way to extract it via a jailbreak (despite it being better at preventing jailbreaks).
Reminder that it's still not too late to change the direction of progress. We still have time to demand that our politicians put the breaks on AI data centres and end this insanity.
When AI exceeds humans at all tasks humans become economically useless.
People who are economically useless are also politically powerless, because resources are power.
Democracy works because the people (labourers) collectivised hold a monopoly on the production and ownership of resources.
If the state does something you don't like you can strike or refuse to offer your labour to a corrupt system. A state must therefore seek your compliance. Democracies do this by given people want they want. Authoritarian regimes might seek compliance in other ways.
But what is certain is that in a post-AGI world our leaders can be corrupt as they like because people can't do anything.
And this is obvious when you think about it... What power does a child or a disable person hold over you? People who have no ability to create or amass resources depend on their beneficiaries for everything including basics like food and shelter. If you as a parent do not give your child resources, they die. But your child does not hold this power over you. In fact they hold no power over you because they cannot withhold any resources from you.
In a post-AGI world the state would not depend on labourers for resources, jobless labourers would instead depend on the state. If the state does not provide for you like you provide for your children, you and your family will die.
In a good outcome where humans can control the AGI, you and your family will become subjects to the whims of state. You and your children will suffer as the political corruption inevitably arises.
In a bad outcome the AGI will do to cities what humans did to forests. And AGI will treat humans like humans treat animals. Perhaps we don't seek the destruction of the natural environment and the habitats of animals, but woodland and buffalo are sure inconvenient when building a super highway.
We can all agree there will be no jobs for our children. Even if you're an "AI optimist" we probably still agree that our kids will have no purpose. This alone should be bad enough, but if I'm right then there will be no future for them at all.
I will not apologise for my concern about AGI and our clear progress towards that end. It is not my fault if others cannot see the path I seem to see so clearly. I cannot simply be quiet about this because there's too much at stake. If you agree with me at all I urge you to not be either. Our children can have a great future if we allow them to have it. We don't have long, but we do still have time left.
A lot of skepticism here, but these are astonishing results! People should realize we’re reaching the point where LLMs are surpassing humans in any task limited in scope enough to be a “benchmark”. And as anyone who’s spent time using Claude 3.5 Sonnet / GPT-4o can attest, these things really are useful and smart! (And, if these results hold up, O1 is much, much smarter.) This is a nerve-wracking time to be a knowledge worker for sure.
I have written a ton of evaluations and run countless benchmarks and I'm not even close to convinced that we're at
> the point where LLMs are surpassing humans in any task limited in scope enough to be a “benchmark”
so much as we're over-fitting these bench marks (and in many cases fishing for a particular way of measuring the results that looks more impressive).
While it's great that the LLM community has so many benchmarks and cares about attempting to measure performance, these benchmarks are becoming an increasingly poor signal.
> This is a nerve-wracking time to be a knowledge worker for sure.
It might because I'm in this space, but I personally feel like this is the best time to working in tech. LLMs still are awful at things requiring true expertise while increasingly replacing the need for mediocre programmers and dilettantes. I'm increasingly seeing the quality of the technical people I'm working with going up. After years of being stuck in rooms with leetcode grinding TC chasers, it's very refreshing.
Is it? They talk about 10k attempts to reach gold medal status in the mathematics olympiad, but zero shot performance doesn't even place it in the upper 50th percentile.
Maybe I'm confused but 10k attempts on the same problem set would make anyone an expert in that topic? It's also weird that zero shot performance is so bad, but over a lot of attempts it seems to get correct answers? Or is it learning from previous attempts? No info given.
The correct metaphor is that 10,000 attempts would allow anyone to cherry pick a successful attempt. You’re conflating cherry picking with online learning. This is like if an entire school of students randomized their answers on a multiple choice test, and then you point to someone who scored 100% and claim it is proof of the school’s expertise.
Yeah but how is it possible that it has such a high margin of error? 10k attempts is insane! Were talking about an error margin of 50%! How can you deliver "expert reasoning" with such an error margin?
It’s undeniably less impressive than a human on the same task, but who cares at the end of the day? It can do 10,000 attempts in the time a person can do 1. Obviously improving that ratio will help for any number of reasons, but if you have a computer that can do a task in 5 minutes that will take a human 3 hours, it doesn’t necessarily matter very much how you got there.
How long does it take the operator to sift through those 10,000 attempts to find the successful one, when it's not a contrived benchmark where the desired answer is already known ahead of time? LLMs generally don't know when they've failed, they just barrel forwards and leave the user to filter out the junk responses.
Even if it's the other way around, if the computer takes 3 hours on a task that a human can do in 5 minutes, using the computer might still be a good idea.
A computer will never go on strike, demand better working conditions, unionize, secretly be in cahoots with your competitor or foreign adversary, play office politics, scroll through Tiktok instead of doing its job, or cause an embarrassment to your company by posting a politically incorrect meme on its personal social media account.
Even if you disregard the Olympiad performance OpenAI-O1 is, if the charts are to be believed, a leap forward in intelligence. Also bear in mind that AI researchers are not out of ideas on how to make models better and improvements in AI chips are the metaphorical tide that lifts all boats. The trend is the biggest story here.
I get the AI skepticism because so much tech hype of recent years turned out to be hot air (if you're generous, obvious fraud if you're not). But AI tools available toady, once you get the hang of using them, are pretty damn amazing already. Many jobs can be fully automated with AI tools that exist today. No further breakthroughs required. And although I still don't believe software engineers will find themselves out of work anytime soon, I can no longer completely rule it out either.
The blog says "With a relaxed submission constraint, we found that model performance improved significantly. When allowed 10,000 submissions per problem, the model achieved a score of 362.14 – above the gold medal threshold – even without any test-time selection strategy."
I am interpreting this to mean that the model tried 10K approaches to solve the problem, and finally selected the one that did the trick. Am I wrong?
> And as anyone who’s spent time using Claude 3.5 Sonnet / GPT-4o can attest, these things really are useful and smart!
I have spent significant time with GPT-4o, and I disagree. LLMs are as useful as a random forum dweller who recognises your question as something they read somewhere at some point but are too lazy to check so they just say the first thing which comes to mind.
Here’s a recent example I shared before: I asked GPT-4o which Monty Python members have been knighted (not a trick question, I wanted to know). It answered Michael Palin and Terry Gilliam, and that they had been knighted for X, Y, and Z (I don’t recall the exact reasons). Then I verified the answer on the BBC, Wikipedia, and a few others, and determined only Michael Palin has been knighted, and those weren’t even the reasons.
Just for kicks, I then said I didn’t think Michael Palin had been knighted. It promptly apologised, told me I was right, and that only Terry Gilliam had been knighted. Worse than useless.
Coding-wise, it’s been hit or miss with way more misses. It can be half-right if you ask it uninteresting boilerplate crap everyone has done hundreds of times, but for anything even remotely interesting it falls flatter than a pancake under a steam roller.
I asked GPT-4o and I got the correct answer in one shot:
> Only one Monty Python member, Michael Palin, has been knighted. He was honored in 2019 for his contributions to travel, culture, and geography. His extensive work as a travel documentarian, including notable series on the BBC, earned him recognition beyond his comedic career with Monty Python (NERDBOT) (Wikipedia).
> Other members, such as John Cleese, declined honors, including a CBE (Commander of the British Empire) in 1996 and a peerage later on (8days).
Maybe you just asked the question wrong. My prompt was "which monty python actors have been knighted. look it up and give the reasons why. be brief".
Yes yes, there’s always some “you're holding it wrong” apologist.¹ Look, it’s not a complicated question to ask unambiguously. If you understand even a tiny bit of how these models work, you know you can make the exact same question twice in a row and get wildly different answers.
The point is that you never know what you can trust or not. Unless you’re intimately familiar with Monty Python history, you only know you got the correct answer in one shot because I already told you what the right answer is.
Oh, and by the way, I just asked GPT-4o the same question, with your phrasing, copied verbatim and it said two Pythons were knighted: Michael Palin (with the correct reasons this time) and John Cleese.
¹ And I’ve had enough discussions on HN where someone insists on the correct way to prompt, then they do it and get wrong answers. Which they don’t realise until they shared it and disproven their own argument.
I think your iPhone analogy is apt. Do you want to be the person complaining that the phone drops calls or do you want to hold it slightly differently and get a lot of use out of it?
If you pay careful attention to prompt phrasing you will get a lot more mileage out of these models. That's the bottom line. If you believe that you shouldn't have to learn how to use a tool well then you can be satisfied with your righteous attitude but you won't get anywhere.
No one’s arguing that correct use of a tool isn’t beneficial. The point is that insisting LLMs just need good prompting is delusional and a denial of reality. I have just demonstrated how your own prompt is still capable of producing the wrong result. So either you don’t know how to prompt correctly (because if you did, by your own logic it would have produced the right response every time, which it didn’t) or the notion that all you need is good prompting is wrong. Which anyone who understands the first thing about these systems knows to be the case.
Unless I'm mistaken, isn't all the math behind them... ultimately probabilistic? Even theoretically they can't guarantee the same answer. I'm agreeing with you, by the way, just curious if I'm missing something.
If you take a photo the photons hitting the camera sensor do so in a probabilistic fashion. Still, in sufficient light you'll get the same picture every time you press the shutter button. In near darkness you'll get a random noise picture every time.
Similarly language models are probabilistic and yet they get the easiest questions right 100% of the time with little variability and the hardest prompts will return gibberish. The point of good prompting is to get useful responses to questions at the boundary of what the language model is capable of.
(You can also configure a language model to generate the same output for every prompt without any random noise. Image models for instance generate exactly the same image pixel for pixel when given the same seed.)
The photo comparison is disingenuous. Light and colour information can be disorganised to a large extent and yet you still perceive the same from an image. You can grab a photo and apply to it a red filter or make it black and white and still understand what’s in there, what it means, and how it compares to reality.
In comparison, with text a single word can change the entire meaning of a sentence, paragraph, or idea. The same word in different parts of a text can make all the difference between clarity and ambiguity.
It makes no difference how good your prompting is, some things are simply unknowable by an LLM. I repeatedly asked GPT-4o how many Magic: The Gathering cards based on Monty Python exist. It said there are none (wrong) because they didn’t exist yet at the cut off date of its training. No amount of prompting changes that, unless you steer it by giving it the answer (at which point there would have been no point in asking).
Furthermore, there’s no seed that guarantees truth in all answers or the best images in all cases. Seeds matter for reproducibility, they are unrelated to accuracy.
Language is fuzzy in exactly the same way. LLMs can create factually correct responses in dozens of languages using endless variations in phrasing. You fixate on the kind of questions that current language models struggle with but you forget that for millions of easier questions modern language models already respond with a perfect answer every time.
You think the probabilistic nature of language models is a fundamental problem that puts a ceiling on how smart they can become, but you're wrong.
No. Language can be fuzzy, yes, but not at all in the same way. I have just explained that.
> LLMs can create factually correct responses in dozens of languages using endless variations in phrasing.
So which is it? Is it about good prompting, or can you have endless variations? You can’t have of both ways.
> You fixate on the kind of questions that current language models struggle with
So you’re saying LLMs struggle with simple factual and verifiable questions? Because that’s all the example questions were. If they can’t handle that (and they do it poorly, I agree), what’s the point?
By the way, that’s a single example. I have many more and you can find plenty of others online. Do you also think the Gemini ridiculous answers like putting glue on pizza are about bad promoting?
> You think the probabilistic nature of language models is a fundamental problem that puts a ceiling on how smart they can become, but you're wrong.
One of your mistakes is thinking you know what I think. You’re engaging with a preconceived notion you formed in your head instead of the argument.
And LLMs aren’t smart, because they don’t think. They are an impressive trick for sure, but that does not imply cleverness on their part.
Even without AI, it's gotten ~10,000 times easier to write software than in the 1950s (eg. imagine trying to write PyTorch code by hand in IBM 650 assembly), but the demand for software engineering has only increased, because demand increases even faster than supply does. Jevons paradox:
> it's gotten ~10,000 times easier to write software than in the 1950s
It seems many of the popular tools want to make writing software harder than in the 2010s, though. Perhaps their stewards believe that if they keep making things more and more unnecessarily complicated, LLMs won't be able to keep up?
The local decline in open software engineering positions has _nothing_ to do with AI. The best orgs are using AI to assist developers in building out new systems and write tests. Show me someone who is doing anything bigger than that, please I'd love to be proven wrong.
The big decline is driven by a few big factors. Two of which are 1- the overhiring that happened in 2021. This was followed by the increase of interest rates which dramatically constrained the money supply. Investors stopped preferring growth over profits. This shift in investor preferences is reflected in engineering orgs tightening their budgets as they are no longer rewarded for unbridled growth.
I'm skeptical because "we fired half our programmers and our new AI does their jobs as well as they did" is a story that would tear through the Silicon Valley rumor mill. To my knowledge, this has not happened (yet).
I like your phrasing - "any task limited in scope enough to be a 'benchmark'". Exactly! This is the real gap with LLMs, and will continue to be an issue with o1 -- sure, if you can write down all of the relevant context information you need to perform some computation, LLMs should be able to do it. In other words, LLMs are calculators!
I'm not especially nerve-wracked about being a knowledge worker, because my day-to-day doesn't consist of being handed a detailed specification of exactly what is required, and then me 'computing' it. Although this does sound a lot like what a product manager does!
I have been deploying useful code from LLMs right and left over the last several months. They are a significant force accelerator for programmers if you know how to prompt them well.
We’ll see if this is a good idea when we start having millions of lines of LLM-written legacy code. My experience maintaining such code so far has been very bad: accidentally quadratic algorithms; subtly wrong code that looks right; and un-idiomatic use of programming language features.
ah i see so you're saying that LLM-written code is already showing signs of being a maintenance nightmare, and that's a reason to be skeptical about its adoption. But isn't that just a classic case of 'we've always done it this way' thinking?
legacy code is a problem regardless of who wrote it. Humans have been writing suboptimal, hard-to-maintain code for decades. At least with LLMs, we have the opportunity to design and implement better coding standards and review processes from the start.
let's be real, most of the code written by humans is not exactly a paragon of elegance and maintainability either. I've seen my fair share of 'accidentally quadratic algorithms' and 'subtly wrong code that looks right' written by humans. At least with LLMs, we can identify and address these issues more systematically.
As for 'un-idiomatic use of programming language features', isn't that just a matter of training the LLM on a more diverse set of coding styles and idioms? It's not like humans have a monopoly on good coding practices.
So, instead of throwing up our hands, why not try to address these issues head-on and see if we can create a better future for software development?
Maybe it will work out, but I think we’ll regret this experiment because it’s the wrong sort of “force accelerator”: writing tons of code that should be abstracted rather than just dumped out literally has always caused the worst messes I’ve seen.
Yes, same way that the image model outputs have already permeated the blogosphere and pushed out some artists, the other models will all bury us under a pile of auto-generated code.
We will yearn for the pre-GPT years at some point, like we yearn for the internet of the late 90s/early 2000s. Not for a while though. We're going through the early phases of GPT today, so it hasn't been taken over by the traditional power players yet.
Honestly the code it's been giving me has been fairly cromulent. I don't believe in premature optimization and it is perfect for getting features out quick and then I mold it to what it needs to be.
In a way it's not surprising that people are getting vastly different results out of LLMs. People have different skill levels when it comes to using even Google. An LLM has a vastly bigger input space.
same...but have you considered the broader implications of relying on LLMs to generate code? It's not just about being a 'force accelerator' for individual programmers, but also about the potential impact on the industry as a whole.
If LLMs can generate high-quality code with minimal human input, what does that mean for the wages and job security of programmers? Will companies start to rely more heavily on AI-generated code, and less on human developers? It's not hard to imagine a future where LLMs are used to drive down programming costs, and human developers are relegated to maintenance and debugging work.
I'm not saying that's necessarily a bad thing, but it's definitely something that needs to be considered. As someone who's enthusiastic about the potential of code gen this O1 reasoning capability is going to make big changes.
do you think you'll be willing to take a pay cut when your employer realizes they can get similar results from a machine in a few seconds?
What's a sample prompt that you've used? Every time I've tried to use one for programming, they invent APIs that don't exist (but sound like they might) or fail to produce something that does what it says it does.
No matter the prompt, there's a significant difference between how it handles common problems in popular languages (python, JS) versus esoteric algorithms in niche languages or tools.
I had a funny one a while back (granted this was probably ChatGPT 3.5) where I was trying to figure out what payload would get AWS CloudFormation to fix an authentication problem between 2 services and ChatGPT confidently proposed adding some OAuth querystring parameters to the AWS API endpoint.
I just ask it for what I want in very specific detail, stating the language and frameworks in use. I keep the ideas self-contained -- for example if I need something for the frontend I will ask it to make me a webcomponent. Asking it to not make assumptions and ask questions on ambiguities is also very helpful.
It tends to fall apart on bigger asks with larger context. Breaking your task into discrete subtasks works well.
Even then it is hit and miss. If you are doing something that is also copy/paste-able out of a StackOverflow comment, you're apt to be fine, but as soon as you are doing anything slightly less common... Good luck.
True. It can be good at giving you pointers towards approaching the problem, even if the result is flawed, for slightly less common problems. But as you slide even father towards esotericism, there is no hope. It won't even get you in the right direction. Unfortunately – as that is where it would be most useful.
I think that's just the same as using an autocomplete efficiently, though. I tend to like them for Search, but not for anything i have to "prompt correctly" because i feel like i can type fast enough that i'm not too worried about auto-completing.
With that said i'm not one of those "It's just a parrot!" people. It is, definitely just a parrot atm.. however i'm not convinced we're not parrots as well. Notably i'm not convinced that that complexity won't be sufficient to walk talk and act like intelligence. I'm not convinced that intelligence is different than complexity. I'm not an expert though, so this is just some dudes stupid opinion.
I suspect if LLMs can prove to have duck-intelligence (ie duck typing but for intelligence) then it'll only be achieved in volumes much larger than we imagine. We'll continue to refine and reduce how much volume is necessary, but nevertheless i expect complexity to be the real barrier.
It’s definitely the case that there are some programming workflows where LLMs aren’t useful. But I can say with certainty that there are many where they have become incredibly useful recently. The difference between even GPT-4 last year and C3.5/GPT-4o this year is profound.
I recently wrote a complex web frontend for a tool I’ve been building with Cursor/Claude and I wrote maybe 10% of the code; the rest with broad instructions. Had I done it all myself (or even with GitHub Copilot only) it would have taken 5 times longer. You can say this isn’t the most complex task on the planet, but it’s real work, and it matters a lot! So for increasingly many, regardless of your personal experience, these things have gone far beyond “useful toy”.
I don’t want to be in the business of LLM defender, but it’s just hard to imagine this aging well when you step back and look at the pace of advancement here. In the realm of “real math and science”, O1 has improved from 0% to 50% on AIME today. A year ago, LLMs could only write little functions, not much better than searching StackOverflow. Today, they can write thousands of lines of code that work together with minimal supervision.
I’m sure this tech continues to have many limitations, but every piece of trajectory evidence we have points in the same direction. I just think you should be prepared for the ratio of “real” work vs. LLM-capable work to become increasingly small.
I can probably climb a tree faster than I can build a rocket. But only one will get me all the way to the moon. Don't confuse local optima for global ones.
> The sooner those paths are closed for low-effort high-pay jobs, the better, IMO. All this money for no work is going to our heads.
> It's time to learn some real math and science, the era of regurgitating UI templates is over.
You do realize that software development was one of the last social elevators, right?
What you're asking for won't happen, let alone the fact that "real math and science" pay a pittance, there's a reason the pauper mathematician was a common meme.
'Not useful' is a pretty low bar to clear, especially when you consider the state of the art just 5 years ago. LLMs may not be solving world hunger, but they're already being used in production for coding
If you're not seeing value in them, maybe it's because you're not looking at the right problems. Or maybe you're just not using them correctly. Either way, dismissing an entire field of research because it doesn't fit your narrow use case is pretty short-sighted.
FWIW, I've been using LLMs to generate production code and it's saved me weeks if not months. YMMV, I guess
Familiarize yourself with a tool which does half the prompting for you, e.g. cursor is pretty good at prompting claude 3.5 and it really does make code edits 10x faster (I'm not even talking about the fancy stuff about generating apps in 5 mins - just plain old edits.)
At this point, you're either saying "I don't understand how to prompt them" or "I'm a Luddite". They are useful, here to stay, and only getting better.
> People should realize we’re reaching the point where LLMs are surpassing humans in any task limited in scope enough to be a “benchmark”.
Can you explain what this statement means? It sounds like you're saying LLMs are now smart enough to be able to jump through arbitrary hoops but are not able to do so when taken outside of that comfort zone. If my reading is correct then it sounds like skepticism is still warranted? I'm not trying to be an asshole here, it's just that my #1 problem with anything AI is being able to separate fact from hype.
I think what I’m saying is a bit more nuanced than that. LLMs currently struggle with very “wide”, long-run reasoning tasks (e.g., the evolution over time of a million-line codebase). That isn’t because they are secretly stupid and their capabilities are all hype, it’s just that this technology currently has a different balance of strengths and weaknesses than human intelligence, which tends to more smoothly extrapolate to longer-horizon tasks.
We are seeing steady improvement on long-run tasks (SWE-Bench being one example) and much more improvement on shorter, more well-defined tasks. The latter capabilities aren’t “hype” or just for show, there really is productive work like that to be done in the world! It’s just not everything, yet.
> And as anyone who’s spent time using Claude 3.5 Sonnet / GPT-4o can attest, these things really are useful and smart! (And, if these results hold up, O1 is much, much smarter.) This is a nerve-wracking time to be a knowledge worker for sure.
If you have to keep checking the result of an LLM, you do not trust it enough to give you the correct answer.
Thus, having to 'prompt' hundreds of times for the answer you believe is correct over something that claims to be smart - which is why it can confidently convince others that its answer is correct (even when it can be totally erroneous).
I bet if Google DeepMind announced the exact same product, you would equally be as skeptical with its cherry-picked results.
> We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute).
Wow. So we can expect scaling to continue after all. Hyperscalers feeling pretty good about their big bets right now. Jensen is smiling.
This is the most important thing. Performance today matters less than the scaling laws. I think everyone has been waiting for the next release just trying to figure out what the future will look like. This is good evidence that we are on the path to AGI.
> I really hope people understand that this is a new paradigm: don't expect the same pace, schedule, or dynamics of pre-training era.
I believe the rate of improvement on evals with our reasoning models has been the fastest in OpenAI history.
Microsoft, Google, Facebook have all said in recent weeks that they fully expect their AI datacenter spend to accelerate. They are effectively all-in on AI. Demand for nvidia chips is effectively infinite.
Even when we start to plateau on direct LLM performance, we can still get significant jumps by stacking LLMs together or putting a cluster of them together.
It'd be interesting for sure if true. Gotta remember that this is a marketing post though, let's wait a few months and see if its actually true. Things are definitely interesting, wherever these techniques will get us AGI or not
Before commenting here, please take 15 minutes to read through the chain-of-thought examples -- decoding a cypher-text, coding to solve a problem, solving a math problem, solving a crossword puzzle, answering a complex question in English, answering a complex question in Chemistry, etc.
After reading through the examples, I am shocked at how incredibly good the model is (or appears to be) at reasoning: far better than most human beings.
after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users.
Saying "competitive advantage" so directly is surprising.
There must be some magic sauce here for guiding LLMs which boosts performance. They must think inspecting a reasonable number of chains would allow others to replicate it.
They call GPT 4 a model. But we don't know if it's really a system that builds in a ton of best practices and secret tactics: prompt expansion, guided CoT, etc. Dalle was transparent that it automated re-generating the prompts, adding missing details prior to generation. This and a lot more could all be running under the hood here.
Lame but not atypical of OpenAI. Too bad, but I'm expecting competitors to follow with this sort of implementation and better. Being able to view the "reasoning" process and especially being able to modify it and re-render the answer may be faster than editing your prompt a few times until you get the desired response, if you even manage to do that.
But what does that mean? I can't do "pip install nurture" or "pip install patience". I can generate a bunch of answers and take the consensus, but we've been able to do that for years. I can do fine-tuning or DPO, but on what?
I have also heard they are launching a AI called strawberry. If you pay attention, there is a specific reason why they have named it strawberry. if you ask chat gpt 4o, how many r's in the word strawberry, it will give answer as 2.
still to this day it will answer same. the model is not able to reason.
thats why a reasoning model is being launched. this is one of the reason apart from many other reasons.
I tested o1-preview on some coding stuff I've been using gpt-4o for. I am not impressed. The new, more intentional chain of thought logic is apparently not something it can meaningfully apply to a non-trivial codebase.
Sadly I think this OpenAI announcement is hot air. I am now (unfortunately) much less enthusiastic about upcoming OpenAI announcements. This is the first one that has been extremely underwhelming (though the big announcement about structured responses (months after it had already been supported nearly identically via JSONSchema) was in hindsight also hot air.
I think OpenAI is making the same mistake Google made with the search interface. Rather than considering it a command line to be mastered, Google optimized to generate better results for someone who had no mastery of how to type a search phrase.
Similarly, OpenAI is optimizing for someone who doesn't know how to interact with a context-limited LLM. Sure it helps the low end, but based on my initial testing this is not going to be helpful to anyone who had already come to understand how to create good prompts.
What is needed is the ability for the LLM to create a useful, ongoing meta-context for the conversation so that it doesn't make stupid mistakes and omissions. I was really hoping OpenAI would have something like this ready for use.
I have tested o1-preview on a couple of coding tasks and I am impressed.
I am looking at a TypeScript project with quite an amount of type gymnastics and a particular line of code did not validate with tsc no matter what I have tried. I copy pasted the whole context into o1-preview and it told me what is likely the error I am seeing (and it was a spot on correct letter-by-letter error message including my variable names), explained the problem and provided two solutions, both of which immediately worked.
Another test was I have pasted a smart contract in solidity and naively asked to identify vulnerabilities. It thought for more than a minute and then provided a detailed report of what could go wrong. Much, much deeper than any previous model could do. (No vulnerabilities found because my code is perfect, but that's another story).
Pricing is $15.00 / 1M input tokens and $60.00 / 1M output tokens. Context window is 128k token, max output is 32,768 tokens.
There is also a mini version with double the maximum output tokens (65,536 tokens), priced at $3.00 / 1M input tokens and $12.00 / 1M output tokens.
The specialized coding version they mentioned in the blog post does not appear to be available for use.
It’s not clear if the hidden chain of thought reasoning is billed as paid output tokens. Has anyone seen any clarification about that? If you are paying for all of those tokens it could add up quickly. If you expand the chain of thought examples on the blog post they are extremely verbose.
https://platform.openai.com/docs/models/o1 https://openai.com/api/pricing/ https://platform.openai.com/docs/guides/rate-limits/usage-ti...