I’m not as inclined to grade Gary Marcus’ predictions leniently as Gary Marcus seems to be.
• 7-10 GPT-4 level models
He only gets to claim this is true if you count a bunch of different versions of the same base model, or if you’re willing to say that some models that outperform GPT-4 on some benchmarks count as being GPT-4 level. I don’t think Marcus was right in spirit, here.
• No massive advance (no GPT-5, or disappointing GPT-5)
Seems too unquantifiable to judge. I would call o1 a massive advance over 4o, but I’m sure Marcus would not.
• Price wars
I guess so? From what I’ve read, the frontier model companies are still profitable, and OpenAI now has a $200/mo commercial tier, hardly the action of a company setting its prices purely to undercut the competition.
• Very little moat for anyone
It still seems like the only companies who have pulled off frontier model capabilities have spent many millions of dollars doing it. I think this might become true next year but I don’t think this can be judged as correct based on what we saw in 2024 alone.
• No robust solution to hallucinations
You only use words like “robust” in a prediction like this so you have room to weasel out of it later when the hallucinations diminish greatly but don’t quite go extinct.
• Modest lasting corporate adoption
My industry is oil and gas. A pretty hidebound and conservative industry. Adoption of LLMs has been massive.
• Modest profits, split 7-10 ways
Define modest.
I score Marcus at 0/7, at best 2/7.
ChatGPT/Claude still suck at providing factual info. You can see how in some areas (like coding), massive amounts of training have taken place, and the model is almost always right when recalling existing knowledge, particularly if you ask about the major frameworks. Then there are somewhat more obscure topics, where the models are mostly right but can be convincingly wrong, and it's very hard to tell which is which.
If you ask about something that wasn't deemed to have economic value, such as book recommendations (I thought a model advertised as trained on the entirety of human literary works would be good at that), what you get is almost always obviously wrong info.
I agree with what you say, but I still don’t feel like this gives Marcus the edge, because Perplexity does a fantastic job at finding correct information and, in my experience, never hallucinates. I would call Perplexity and whatever they’re doing a pretty “robust” solution to hallucination.
So there are solutions to hallucination, but the product has to prioritize that to the extent that it reduces the model’s effectiveness as a general tool.
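The basic recipe, as I understand it, is retrieval plus a refuse-if-unsupported prompt. A minimal sketch of that idea, where `searchIndex` and `llm` are hypothetical stand-ins, not Perplexity's actual API:

    // Minimal sketch of grounding an answer in retrieved sources.
    // `searchIndex.retrieve` and `llm.complete` are hypothetical stand-ins,
    // not any specific vendor's API.
    interface Passage { url: string; text: string; }

    async function groundedAnswer(
      question: string,
      searchIndex: { retrieve: (q: string, k: number) => Promise<Passage[]> },
      llm: { complete: (prompt: string) => Promise<string> },
    ): Promise<string> {
      const passages = await searchIndex.retrieve(question, 5);
      if (passages.length === 0) return "No sources found; refusing to answer.";

      const context = passages
        .map((p, i) => `[${i + 1}] (${p.url}) ${p.text}`)
        .join("\n");

      // The prompt constrains the model to the retrieved text and demands
      // citations, which is exactly the tradeoff above: stricter grounding,
      // less usefulness as a general tool.
      return llm.complete(
        `Answer ONLY from the sources below. Cite [n] for every claim. ` +
        `If the sources don't contain the answer, say so.\n\n${context}\n\nQ: ${question}`,
      );
    }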
I was talking to a kid at our NYE party last night and he is using machine learning at his company to determine which natural gas wells will produce the most.
In 2024, AI made astonishing progress across the board. Top models have improved by roughly 130 ELO points on public leaderboards, while costs for running those same models have dropped many-fold. Smaller, low-latency models now offer near–state-of-the-art performance (see Gemini Flash, for example). Multimodal capability is increasingly the norm, and strong reasoning skills are widely expected. New “hard problem” leaderboards have emerged, and they, too, have seen major gains. Long-context models have become standard, and mainstream products like Google Search now integrate AI by default.
Overall, the field has seen tremendous progress -- whether or not most users realize just how far we’ve come. Marcus's predictions don't sound specific enough -- no GPT-5? Correct. But what does that even mean?
Personally I see a lot of benchmarks come and go with the scores on these benchmarks always creeping upwards, but my practical day to day experience of using LLMs has remained pretty lukewarm.
The things they were good at a year ago they’re still good at, and what they were bad at they’re still bad at.
The products and infrastructure around them are better. Claude Artifacts are cool, for example. o1 has some clever prompting under the hood.
I just don’t know how much stock we should put in benchmarks.
I think the jump from ChatGPT to GPT-4 was also something like 130 ELO points. That works out to roughly a 2/3 preference for the new model and a 1/3 preference for the old one. That's roughly how much top models have improved on an "average query".
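A quick sanity check on that 2/3 figure, using the standard Elo expected-score formula (nothing here is specific to LLM leaderboards; TypeScript just for concreteness):

    // Expected score (win probability) for a model rated `ratingDiff`
    // points above its opponent, per the standard Elo formula.
    function eloExpectedScore(ratingDiff: number): number {
      return 1 / (1 + Math.pow(10, -ratingDiff / 400));
    }

    console.log(eloExpectedScore(130).toFixed(3)); // ~0.679, i.e. roughly 2/3 vs 1/3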
Now, none of us enter average queries :) We all have specific use cases in mind. So any specific user's mileage will vary.
I personally feel substantial improvement. I don't feel the need to google every LLM answer 5 times to feel confident about it :)
I have been misled several times in very subtle ways when trying to ask specific questions of ChatGPT and Claude.
It actually burned me at work. Made me look bad and cost my project an extra sprint because of incorrect info about the web History API that came with working examples.
I just can’t ever trust information from these systems, unless they include sources to first party docs.
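For anyone bitten by the same thing: the first-party surface here is small enough to check directly against MDN. A minimal sketch of the documented basics (I don't know what the model got wrong in that sprint, so this is just the reference behavior):

    // Documented basics of the browser History API (MDN is the
    // first-party reference). pushState adds a history entry without
    // reloading the page, and notably does NOT fire a popstate event.
    history.pushState({ page: 2 }, "", "/page/2");

    // popstate fires on back/forward navigation (including history.back()),
    // with the state object that was saved alongside the entry.
    window.addEventListener("popstate", (event: PopStateEvent) => {
      console.log("navigated to state:", event.state);
    });

    // replaceState swaps the current entry in place instead of adding one.
    history.replaceState({ page: 2, edited: true }, "", "/page/2?edited=1");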
I agree that LLMs can be highly valuable when used appropriately. For me, their strengths lie primarily in semantic search (e.g., Perplexity.ai), rewriting, and discussing overall code architecture. However, they often prove unreliable in specialized knowledge domains. If a user isn’t skilled at writing and following up with effective prompts, or at discerning the usefulness of the responses, interactions can be hit-or-miss.
The robustness of LLMs—like the original ChatGPT or GPT-3.5—is still far from a level where domain novices can rely on them with confidence. This might change as models incorporate more first-order data (e.g., direct observations in physics, embodiment) and improve in causal reasoning and deductive logic.
I find it crucial to have critical voices like Gary Marcus to counterbalance the overwhelming hype perpetuated by social media enthusiasts and corporate PR—much of which the media tends to echo uncritically.
One of Marcus's demands is a more neuro-symbolic approach to advancing AI: progress can’t come solely from "scaling up" models.
And it seems like he's right: all the major ML companies seem to be shifting toward search-based algorithms (e.g., Q*) combined with reinforcement learning at inference time to explore the problem space, moving beyond mere "next-token prediction" training.
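To make that concrete: the simplest version of the shift is spending extra inference compute to sample several candidates and keep the best-scoring one. A toy best-of-N sketch, where `model.sample` and `scoreAnswer` are hypothetical stand-ins (nobody outside the labs knows what Q* actually is):

    // Toy best-of-N inference-time search: sample several candidate
    // answers and keep the highest-scoring one. `model.sample` and
    // `scoreAnswer` are hypothetical stand-ins; this sketches the general
    // idea, not any lab's actual algorithm.
    async function bestOfN(
      prompt: string,
      n: number,
      model: { sample: (prompt: string) => Promise<string> },
      scoreAnswer: (prompt: string, answer: string) => Promise<number>, // e.g. a reward model
    ): Promise<string> {
      const candidates = await Promise.all(
        Array.from({ length: n }, () => model.sample(prompt)),
      );
      const scores = await Promise.all(
        candidates.map((c) => scoreAnswer(prompt, c)),
      );
      // Keep the candidate the verifier/reward model likes best.
      return candidates[scores.indexOf(Math.max(...scores))];
    }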
I don't understand why he has such an axe to grind. Is there some historical baggage here?
Of course there are plenty of problems with the current state of AI and LLMs, but to hold such a preconceived pessimistic outlook that can't even acknowledge their massive, rapid adoption and usefulness in multiple domains doesn't seem intellectually honest.
Marcus is of a school of thought that holds an intelligent brain requires symbolic reasoning, not just a neural network. He also believes intelligent symbolic reasoning is attributable to a genetic origin and can’t be entirely trained. I don’t know enough to say the school is wrong, but it is in crisis given the successes of large language models.
It's kind of his schtick. He pitched himself as an AI expert but hasn't really contributed much to developing AI, yet gets a lot of coverage for saying this, that, and the other won't work. (see https://www.reddit.com/r/singularity/comments/1ajemjc/spoile...)
George HW Bush built part of his folksy persona around hating the taste of broccoli. If one day he tried eating broccoli and enjoyed it, would he tell everyone? Likely he'd choose to keep his reputation and just keep his new taste to himself.
I'd argue 1, 2, and maybe 6 are effectively already doable. 3 and 5 are good tasks that are technically possible with some RAG hackery but in general make a good benchmark for out-of-context fact retrieval. 4 and 10 might happen soon-ish with work on "agents" and proof synthesis, respectively. 7 and 8 are too subjective. 9, maybe weakened to formulating or proving novel theorems, might be a good baseline for "peak human" intelligence (I think at best o1-o3 can spot and prove some lemmas, but nothing that anyone would bother publishing).
Marcus clearly threw a lot of spaghetti, uniformly pessimistic "it sucks and won't get better" stuff, and then cherry-picked confirmations from the bits that stuck to the wall.
How about he indicates: 1. How he came to these conclusions (coin flip? Pessimism?) 2. How many predictions he missed. He's implying a very high rate of success, which is a big red flag of shenanigans for me.
This is little more than vague generalities and coin flipping with retroactive, cherry-picked "See?! I was right!" analysis.
A fortune teller at a traveling circus serves up about the same.
> No single system will solve more than 4 of the AI 2027 Marcus-Brundage tasks by the end of 2025
I had a look at the "Marcus-Brundage tasks" that he has modestly named after himself and am struck that, for an AI skeptic, he's listed things for 2027 well beyond 99.9% of humans: write 10,000 lines of bug-free code, Oscar-level screenplays, Nobel-prize discoveries, Pulitzer books, etc.