Arc Prize 2024 Winners and Technical Report (arcprize.org)
127 points by alphabetting 67 days ago | 56 comments



Author here -- six months ago we launched ARC Prize, a huge $1M experiment to test whether we need new ideas for AGI. The ARC-AGI benchmark remains unbeaten, and I think we can now definitively say "yes".

One big update since June is that progress is no longer stalled. Coming into 2024, the public consensus vibe was that pure deep learning / LLMs would continue scaling to AGI. The fundamental architecture of these systems hasn't changed since ~2019.

But this flipped late summer. AlphaProof and o1 are evidence of this new reality. All frontier AI systems are now incorporating components beyond pure deep learning like program synthesis and program search.

I believe ARC Prize played a role here too. All the winners this year are leveraging new AGI reasoning approaches like deep-learning guided program synthesis, and test-time training/fine-tuning. We'll be seeing a lot more of these in frontier AI systems in coming years.

And I'm proud to say that all the code and papers from this year's winners are now open source!

We're going to keep running this thing annually until it's defeated. And we've got ARC-AGI-2 in the works to improve on several of the v1 flaws (more here: https://arcprize.org/blog/arc-prize-2024-winners-technical-r...)

The ARC-AGI community keeps surprising me. From initial launch, through o1 testing, to the final 48 hours when the winning team jumped 10% and both winning papers dropped out of nowhere. I'm incredibly grateful to everyone and we will do our best to steward this attention towards AGI.

We'll be back in 2025!


As a rather experienced ML researcher: ARC is a great benchmark on its own, but it overreaches when it claims to be a gate (or, in the terms of this post, a "steward") towards AGI, and in my perspective and that of several researchers near me, this has watered down the value of the ARC benchmark as a test.

It is a great unit test for reasoning -- that's fantastic! And maybe it is indeed the best way to test for this -- who knows exactly. But the claim is a little grandiose for what it is; it's somewhat like saying that string parity is the One True Test of an optimizer's efficiency.

I'd heartily recommend taking the marketing vibrance down a notch and keeping things a bit more measured. It's not entirely a meme, but some of the more serious researchers don't take it as seriously as a result. And those are exactly the people you want to attract to this sort of thing!

I think there is a potentially good future for ARC! But as a result it might struggle to attract some of the talent you want working on this problem.


> I'd heartily recommend taking the marketing vibrance down a notch and keeping things a bit more measured. It's not entirely a meme, but some of the more serious researchers don't take it as seriously as a result.

This is a fair critique. ARC Prize's 2024 messaging was sharp in order to break through the noise floor -- ARC has been around since 2019, but most people only learned about it this summer. Now that it has garnered awareness, that sharpness is no longer useful, and in some cases is hurting progress, as you point out. The messaging needs to evolve and mature next year to be more neutral/academic.


I feel rather consternated that this response effectively boils down to "yes, we know we overhyped this to get people's attention, and now that we have it we can be more honest about it". Fighting for a place in the attention economy is understandable; being deceptive about it is not.

This is part of the ethical morass of why some more serious researchers aren't touching the benchmark. People are not going to take it seriously if it continues like this!


I think we agree; to clarify, sharp messaging isn't inaccurate messaging. And I believe the story is not overhyped given the evidence: the benchmark resisted a $1M prize pool for ~6 months. But I concede we did obsess about the story to give it the best chance of survival in the marketplace of ideas against the incumbent AI research meme (LLM scaling). Now that the AI research field is coming around to the idea that something beyond deep learning is needed, the story matters less, and the benchmark, and future versions, can stand on their utility as a compass towards AGI.


Mike - please know that not everyone who appreciates ARC feels the same way as the GP. I'm not an academic researcher but I am quite sensitive to hype and excessive marketing. I've never felt the ARC site was anything other than appropriately professional.

Even revisiting it now, I don't see anything wrong with being concisely clear and even a little provocative in stating your case on your own site, especially since a key value of ARC is getting more objectively grounded regarding progress toward AGI. On top of that, ARC is "A non-profit for the public advancement of open artificial general intelligence" that you guys are personally donating serious money and time to, in a field where a lot of entrepreneurs are going to make money and academics are going to advance their careers.

My perception is ARC tried it the other way for years but a lot of academics and AI pundits ignored or dismissed it without ever meaningfully engaging with it. "Sharpening" the message this year has clearly paid off in bringing attention that's shifted the conversation and is helping advance progress toward AGI in ways nothing else has. I also greatly appreciate the time and care you and Francois have put into making the ARC proposition clear enough for non-technical people to understand. That's hard to do and doesn't happen by accident.

Personally, I've found ARC valuable in the real world outside of academia and domain experts because it provides a conceptually simple starting place to discuss with non-technical people what the term AGI might even mean. My high school-aged daughter asked me about vague AGI impending doom scenarios she heard on TikTok. I had her solve a couple ARC samples and then pointed out that today's best AIs aren't yet close to doing the same. This counter-intuitive revelation got her pondering the "Why?" which led to a deep discussion about the multi-dimensional breadth of human creativity and an appreciation of the many ways artificial intelligences might differ from human intelligence.


>> My perception is ARC tried it the other way for years but a lot of academics and AI pundits ignored or dismissed it without ever meaningfully engaging with it.

Your perception is very wrong, and the likely reason is that, as you say, you're not an academic researcher. ARC made a huge splash with the original Kaggle competition a few years ago, and it drew in exactly the kind of "academic researcher" you seem to be pointing to: those in university research groups who do not have access to the data and compute that the big tech companies have, and who consequently cannot compete in the usual big-data benchmarks dominated by Google, OpenAI, Meta, and friends. ARC, with its (unfair) few-shot tasks and constantly changing private test set, is exactly the kind of dataset such researchers are looking for: something that is relatively safe from big tech's deep neural nets. Even the $1 million prize seems specially designed to be just enough to draw in that crowd of not-super-rich academics while leaving corporate research groups insufficiently motivated.

Besides which, I won't name names, but one of the principal researchers behind the winning system is just such an academic. I don't know which period you mean when you say ARC was ignored by the academic community, but that particular researcher was at a certain meeting of like-minded academics two years ago where one of the main topics of discussion was, in short, "how to beat ARC and show that our stuff works".


>> Now that the AI research field is coming around to the idea that something beyond deep learning is needed, the story matters less, and the benchmark, and future versions, can stand on their utility as a compass towards AGI.

How so? All three top systems are deep neural net systems. First place went to a system that, quoting from the "contributions" section of the paper, employed:

>> An automated data generation methodology that starts with 100-160 program solutions for ARC training tasks, and expands them to make 400k new problems paired with Python solutions

As I pointed out in another comment, the top results on ARC have been achieved by ordinary deep-learning, big-data, memorisation-based approaches. You and fchollet (in these comments) try to claim otherwise, but I don't understand why.

In fact, no, I understand why. I think fchollet wanted to place ARC as "not just a benchmark", the opposite of what tbalsam is asking for above. The motivation is solid: if we've learned anything in the last twenty or thirty years, it's that deep neural nets are very capable of beating benchmarks. For any deep neural net model that beats a benchmark, though, the question remains whether it can do anything else besides. Unfortunately, that is not a question that can be answered by beating yet another benchmark.

And here we are now, and the first place in the current ARC challenge goes to a deep neural net system trained on a synthetically augmented dataset. The right thing to do now would be to scale back the claims about the magickal AGI-IQ test with unicorns, and accept that your benchmark is just not any different than any other previous AI benchmark, that it is not any more informative than any other benchmark, and that a completely different kind of test of artificial intelligence is needed.

There is after all such a thing as scientific integrity. You make a big conjecture, you look at the data, realise that you're wrong, accept it, and move on. For example the authors of GLUE did that (as in SUPERGLUE). The authors of the Winograd Schema Challenge did that. You should follow their examples.


> realise that you're wrong, accept it, and move on

What do you think about limiting the submission size? Kaggle does this sometimes.

With a limit like 0.1-1MB (compressed), you are basically saying: "Give me sample-efficient learning algorithms, not pretrained models."
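
For concreteness, a minimal sketch of how such a cap could be checked (the 1 MB threshold and file names are hypothetical, and this is not Kaggle's actual mechanism):

  import os
  import zipfile

  MAX_COMPRESSED_BYTES = 1_000_000  # hypothetical 1 MB cap, per the 0.1-1MB range above

  def compressed_size(paths, archive="submission.zip"):
      # Zip the submission files and return the compressed size in bytes.
      with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
          for p in paths:
              zf.write(p)
      return os.path.getsize(archive)

  size = compressed_size(["solver.py", "weights.bin"])  # hypothetical file names
  if size > MAX_COMPRESSED_BYTES:
      raise SystemExit("submission too large -- bundled pretrained weights won't fit")

A gate like this effectively forces the learning to happen inside the compute budget rather than being baked into shipped weights.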


That's fine if you want to measure sample efficiency, but ARC-AGI is supposed to measure progress towards AGI.


> That's fine if you want to measure sample efficiency, but ARC-AGI is supposed to measure progress towards AGI.

On the Measure of Intelligence defines intelligence as skill-acquisition efficiency, I believe, where efficiency is with respect to whatever is the limiting factor. For each ARC task, the primary limiting factor is the number of samples in it. And the skill here is your ability to convert inputs into the correct outputs. In other words, in this context, intelligence is sample-efficiency, as I see it.


Is that what fchollet is claiming?


Not sure. But I think this follows logically from the definition of intelligence he is using. Also, see II.2.2 in the paper.


> Now that the AI research field is coming around to the idea that something beyond deep learning is needed,

I have not heard this from anyone that I work with! It would be a curious violation of info theory were this to be the case.

Certainly, some things cannot efficiently be learned from data. This is a case where some other kind of inductive bias or prior is needed (again, from info theory) -- but replacing deep learning entirely would be rather silly.

Part of the reason a number of researchers don't take the benchmark more seriously is that it is set up to cripple the results. For example, in the name of reducing brute-force search, the compute was severely limited! This turned many off to begin with. The general contention, as I understand it, was to allow a reasonable amount of compute, but that would not play well with the numbers game: if you restrict compute beyond a reasonable point, the numbers look artificially low to people who don't know what's going on behind the scenes. And this ends up biasing the results unreasonably in favor of the original messaging (i.e., "we need something other than deep learning").

If it were structured with a reasonable amount of compute, and time-accuracy gates were used for prizes instead, it would be much more open. But people do not use it because the game is rigged to begin with!

Unfortunately, that, plus the consistent goal-post moving of the benchmark, is why it generally hasn't had staying power in the research community -- the messaging changes based on what is convenient for publicity, and there has unfortunately been a history of similar things in the pedigree leading up to the ARC Prize itself.

It is not entirely unsalvageable, but there really needs to be a turnaround in how the competition and prize are managed in order to win back people's trust. Placing a thumb on the scales to confirm a prior bias/previous messaging may work for a little while, but it robs the metric of its usefulness over time as the greater research community loses trust.


I think you’re overly fixated on some minor points relative to the overall utility on offer here, and also skewing the facts a bit. For example, at one point you quote the OP using words that were never said, as far as I can see. At another point, you characterize their position as “replacing deep learning entirely” which, as far as I can tell, has never been advocated for in this comment thread or on behalf of ARC.


That is an understandable statement, and probably fair as well I feel.

Much of this comes in reference to statements from fchollet w.r.t. replacing deep learning -- around the time of the initial prize, with a lot of the much more hype-heavy marketing, this was essentially the through-line, and it left a bitter taste in a number of people's mouths. W.r.t. the misquoting, they did say we needed something "beyond" deep learning, not "other than", and that is on me.

The utility is certainly still present, if diminished, I feel, and it is probably a case of my own frustration over previous similar issues leading up to the ARC Prize.

That being said, I do agree in retrospect that my response skewed from being objective -- it is a benchmark with a mixed history, but that doesn't mean that I should get personally caught up in it.


>> If it was structured with a reasonable amount of compute, and instead, time-accuracy gates were used for prizes, it would be much more open. But people do not use it because the game is rigged to begin with!

The entire benchmark is set up so as to try and make it _artificially_ hard for deep learning: there are only three examples for each task, AND the private test set has a different distribution than the public training and validation sets (from what I can tell -- a violation of PAC-learning assumptions, so why should anyone be surprised if machine learning approaches in general can't deal with that?).

Even I (long story) find ARC to be unfair in the simplest sense of the word: it does not make for a level playing field that would allow disparate approaches to machine learning to be compared fairly. Strangely and uniquely, the unfairness is aimed at the dominant approach, deep learning, whereas every other benchmark tends to skew towards deep learning (e.g. huge, feature-based, labelled datasets).

But why's that? If ARC-AGI is a true test of AGI, or intelligence, or whatever it is supposed to be (an IQ test for AIs) then why does it have to jump through hoops just to defend itself from the dominant approach to AI? If it's a good test for AI, and the dominant approach to AI can't really do AI, then the dominant approach should not be capable of passing the test, without any shenanigans with reduced compute or few examples.

Is the purpose to demonstrate that deep neural nets can't generalise from few examples? That's machine learning 101 (although I guess there are still those who missed the lecture). Is it to encourage deep neural nets to get better at generalising from few examples? Well, first place just went to a big, deep, bad neural net with data augmentation, so that doesn't even work.


we live in a society


I don’t think ARC has particularly advanced the research. The approaches that are successful were developed elsewhere and then applied to ARC. Happy to be shown somewhere this is not the case.

In the case of TTT, I wouldn’t really describe that as a ‘new AGI reasoning approach’. People have been fine tuning deep learning models on specific tasks for a long time.

The fundamental instinct driving the creation of ARC, that ‘deep learning cannot do system 2 thinking’, is under threat of being proven wrong very soon. Attempts to define the approaches that are working as somehow not ‘traditional deep learning’ really seem like shifting the goal posts.


Correct, fine-tuning is not new. It's long been used to augment foundational LLMs with private data, e.g. private enterprise data. We do this at Zapier, for instance.

The new and surprising thing about test-time training (TTT) is how effective it is as an approach to dealing with novel abstract reasoning problems like ARC-AGI.

TTT was pioneered by Jack Cole last year and popularized this year by several teams, including this winning paper: https://ekinakyurek.github.io/papers/ttt.pdf
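
For readers new to the idea, here is a rough sketch of the generic TTT recipe -- fine-tuning a copy of the model, per task, on augmented versions of that task's demonstration pairs before predicting. The model, loss, augmentations, and hyperparameters below are placeholders, not any winning team's actual setup:

  import copy
  import torch

  def augment(pairs):
      # Expand one task's demonstration pairs with simple symmetries (rotations here;
      # real pipelines also use reflections, colour permutations, etc.).
      out = []
      for inp, tgt in pairs:
          for k in range(4):
              out.append((torch.rot90(inp, k), torch.rot90(tgt, k)))
      return out

  def test_time_train(base_model, demo_pairs, loss_fn, steps=32, lr=1e-4):
      # Fine-tune a copy of the model on this one task's demos and return the copy;
      # the base model is untouched, so every task starts from the same weights.
      model = copy.deepcopy(base_model)
      opt = torch.optim.AdamW(model.parameters(), lr=lr)
      data = augment(demo_pairs)
      for _ in range(steps):
          for inp, tgt in data:
              opt.zero_grad()
              loss = loss_fn(model(inp), tgt)
              loss.backward()
              opt.step()
      return model  # call model(test_input) to predict, then discard

The point of the per-task copy is that the adaptation is thrown away afterwards, so the same base model can be specialized to each new task at inference time.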


How is TTT anything other than a deep learning algorithm? We have a deep learning model, we generate training data based on an example, and we use stochastic gradient descent to update the model weights to improve its predictions on that training data. This is a classic DL paradigm. I just don't see why you would consider this an advance if your goal is to move “beyond” deep learning.


Congrats to you and Francois on the success of ARC-AGI 24 and thanks so much for doing it. I just finished the technical report and am encouraged! It's great to finally see some tangible progress in research that is both novel and plausibly in fruitful directions.


Mike and François,

Compute is limited during inference, and this naturally limits brute-force program search.

But this doesn't prevent one from creating a huge ARC-like dataset ahead of time, like BARC did (but bigger), and training a correspondingly huge NN on it.

Placing a limit on the submission size could foil this kind of brute-force approach though. I wonder if you are considering this for 2025?


> Coming into 2024, the public consensus vibe was that pure deep learning / LLMs would continue scaling to AGI.

Was it? What did the "public" consist of exactly?


What surprises me about this is how poorly general-purpose LLMs do. The best one is OpenAI o1-preview at 18%. This is significantly worse than purpose-built models like the ARChitects (which scored 53.5). That model used TTT to train on the ARC-AGI task specification (among other things). It seems that even if someone creates a model that can "solve" ARC, it still is not indicative of AGI, since it is not "general" anymore; it is just specialized to this particular task. Similar to how chess engines are not AGI, despite being superhuman at chess. It will be much more convincing when general models not trained specifically for ARC can still score well on it.

They do mention that some of the tasks here are susceptible to brute force and they plan to address that in ARC-AGI-2.

> nearly half (49%) of the private evaluation set was solved by at least one team during the original 2020 Kaggle competition all of which were using some variant of brute-force program search. This suggests a large fraction of ARC-AGI-1 tasks are susceptible to this kind of method and does not carry much useful signal towards general intelligence.


It is correct that the first model that will beat ARC-AGI will only be able to handle ARC-AGI tasks. However, the idea is that the architecture of that model should be able to be repurposed to arbitrary problems. That is what makes ARC-AGI a good compass towards AGI (unlike chess).

For instance, current top models use TTT, which is a completely general-purpose technique that provides the most significant boost to DL models' generalization power in recent memory.

The other category of approach that is working well is program synthesis -- if pushed to the extent that it could solve ARC-AGI, the same system could be redeployed to solve arbitrary programming tasks, as well as tasks isomorphic to programming (such as theorem proving).


"However, the idea is that the architecture of that model should be able to be repurposed to arbitrary problems"

From a mathematical perspective, this doesn't sound right. All NNs are universal approximators and in theory can all learn the same thing to equal ability. It's more about the learning algorithm than the architecture IMO.


François, have you coded and tested a solution yourself that you think will work best?


Hey, he's the visionary. You come up with the nuts and bolts.


is keras nuts and bolts enough?


Keras is a good abstraction model but poorly implemented.


> It seems that even if someone creates a model that can "solve" ARC, it still is not indicative of AGI since it is not "general" anymore

I recently explained why I like ARC to a non-technical friend this way: "When an AI solves ARC it won't be proof of AGI. It's the opposite. As long as ARC remains unsolved I'm confident we're not even close to AGI."

For the sake of being provocative, I'd even argue that ARC remaining unsolved is a sign we're not yet making meaningful progress in the right direction. AGI is the top of Everest. ARC is base camp.


in other words, solving ARC is necessary but not sufficient for AGI


Why is it necessary? Could a spider solve ARC-AGI, or could a pigeon, or a cat? And if an animal doesn't need to solve ARC-AGI to be intelligent, then why does an AGI?


Yes! That's the exact phrase I would have used with someone on HN. But that doesn't describe my non-technical friend. :-)


> What surprises me about this is how poorly general-purpose LLMs do. The best one is OpenAI o1-preview at 18%.

o1-preview doesn't even have image input, so I wonder how they used it.

Also, Ryan Greenblatt's solution basically does "best of 4000" iirc. Presumably o1-preview was single shot.


None of the models use images; they all operate on a JSON format that describes the input grids.
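
For reference, a task file looks roughly like the toy example below: a "train" list of demonstration input/output grid pairs and a "test" list of inputs to predict, with grids as lists of lists of integers 0-9 (colors). The values here are made up and much smaller than real tasks:

  import json

  # Toy task in the ARC JSON layout; real tasks have more demo pairs and larger grids.
  task = json.loads("""
  {
    "train": [
      {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
      {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]}
    ],
    "test": [
      {"input": [[3, 0], [0, 3]]}
    ]
  }
  """)

  for pair in task["train"]:
      print(pair["input"], "->", pair["output"])
  print("predict an output for:", task["test"][0]["input"])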


The first question I still have is what happened to core knowledge priors. The white paper that introduced ARC made a big to-do about how core knowledge priors are necessary to solve ARC tasks, but from what I can tell none of the best-performing (or at-all performing) systems have anything to do with core knowledge priors.

So what happened to that assumption? Is it dead?

The second question I still have is about the defenses of ARC against memorisation-based, big-data approaches. I note that the second best system is based on an LLM with "test time training" where the first two steps are:

  initial finetuning on similar tasks 
  auxiliary task format and augmentations
Which is to say, a data augmentation approach. With big data comes great responsibility and the authors of the second-best system don't disappoint: they claim that by training on more examples they achieve reasoning.

So what happened to the claim that ARC is secure against big-data approaches? Is it dead?


What all top models do is recombine at test time the knowledge they already have. So they all possess Core Knowledge priors. Techniques to acquire them vary:

* Use a pretrained LLM and hope that relevant programs will be memorized via exposure to text data (this doesn't work that well)

* Pretrain an LLM on ARC-AGI-like data

* Hardcode the priors into a DSL
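
To make that last option concrete, here is a toy sketch of hardcoding priors into a DSL and brute-force searching over compositions of its operations. The operation set is hypothetical and far smaller than anything a real entry would use:

  import itertools

  # A toy DSL of whole-grid operations (hypothetical; nothing like a winning entry's DSL).
  def rotate(g):
      return [list(r) for r in zip(*g[::-1])]   # 90-degree clockwise rotation

  def mirror(g):
      return [row[::-1] for row in g]           # horizontal flip

  def transpose(g):
      return [list(r) for r in zip(*g)]

  OPS = [rotate, mirror, transpose]

  def search(demo_pairs, max_depth=3):
      # Brute-force search for a composition of DSL ops consistent with every demo pair.
      for depth in range(1, max_depth + 1):
          for program in itertools.product(OPS, repeat=depth):
              def run(grid, prog=program):
                  for op in prog:
                      grid = op(grid)
                  return grid
              if all(run(inp) == out for inp, out in demo_pairs):
                  return program
      return None

  demos = [([[1, 0], [0, 0]], [[0, 1], [0, 0]]),   # both demos are consistent with "rotate"
           ([[2, 3], [4, 5]], [[4, 2], [5, 3]])]
  print([op.__name__ for op in search(demos)])     # -> ['rotate']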

> Which is to say, a data augmentation approach

The key bit isn't the data augmentation but the TTT. TTT is a way to lift the #1 issue with DL models: that they cannot recombine their knowledge at test time to adapt to something they haven't seen before (strong generalization). You can argue whether TTT is the right way to achieve this, but there is no doubt that TTT is a major advance in this direction.

The top ARC-AGI models perform well not because they're trained on tons of data, but because they can adapt to novelty at test time (usually via TTT). For instance, if you drop the TTT component you will see that these large models trained on millions of synthetic ARC-AGI tasks drop to <10% accuracy. This demonstrates empirically that ARC-AGI cannot be solved purely via memorization and interpolation.


>> So they all possess Core Knowledge priors.

Do you mean the ones from your white paper? The same ones that humans possess? How do you know this?

>> The key bit isn't the data augmentation but the TTT.

I haven't had the chance to read the papers carefully. Have they done ablation studies? For instance, is the following a guess or is it an empirical result?

>> For instance, if you drop the TTT component you will see that these large models trained on millions of synthetic ARC-AGI tasks drop to <10% accuracy.


>This demonstrates empirically that ARC-AGI cannot be solved purely via memorization and interpolation

Now that the current challenge is over, and a successor dataset is in the works, can we see how well the leading LLMs perform against the private test set?


I think the "semi-private" numbers here already measure that: https://arcprize.org/2024-results

For example, Claude 3.5 gets 14% in semi-private eval vs 21% in public eval. I remember reading an explanation of "semi-private" earlier but cannot find it now.


Even the strongest possible interpretation of the results wouldn't conclude "ARC-AGI is dead", because none of the submissions came especially close to human-level performance; the criterion was 85% success but the best in 2024 was 55%.

That said, I think there should be consideration via information thermodynamics: even with TTT, these program-generating systems are using an enormous number of bits compared to a human mind, a tiny portion of which solves ARC quickly and easily using causality-first principles of reasoning.

Another point: suppose a system solves ARC-AGI with 99% accuracy. Then it should be tested on "HARC-HAGI," a variant that uses hexagons instead of squares. This likely wouldn't trip up a human very much - perhaps a small decrease due to increased surface area for brain farts. But if the AI needs to be retrained on a ton of hexagonal examples, then that AI can't be an AGI candidate.


> That said, I think there should be consideration via information thermodynamics: even with TTT these program-generating systems are using an enormous amount of bits compared to a human mind, a tiny portion of which solves ARC quickly and easily using causality-first principles of reasoning.

This isn’t my area of expertise, but it seems plausible to me that what you said is completely erroneous or at the very least completely unverifiable at this point in time. How do you quantify how many bits it takes a human mind to solve one of the ARC problems?

That seems likely beyond the level of insight we have into the structure of cognition and information storage etc etc in wetware. I could of course be wrong and would love to be corrected if so! You mentioned a “tiny portion” of the human mind, but (as far as I’m aware), any given “small” part of human cognition still involves huge amounts of complexity and compute.

Maybe you are saying that the high level decision making a human goes through when solving can be represented with a relatively small number of pieces of information/logical operations (as opposed to a much lower level notion closer to the wetware of the quantity of information) but then it seems unfair to compare to the low level equivalent (weights & biases, FLOPs etc) in the ML system when there may be higher order equivalents.

I do appreciate the general notion of wanting to normalize against something though, and some notion of information seems like a reasonable choice, but practically out of our reach. Maybe something like peak power or total energy consumption would be a more reasonable choice, which we can at least get lower and upper bounds on in the human case (metabolic rates are pretty well studied, and even if we don’t have a good idea of how much energy is involved in completing cognitive tasks we can at least get bounds for running the entire system in that period of time) and close to a precise value in the ML case.


I was speaking loosely but the operative term is "information thermodynamics": comparing bits of AI output versus bits of intentional human thought, ignoring statistical/physical bits related to ANN inference or biological neuron activity. The "tiny chunk of the human mind" thing was a distraction I shouldn't have included.

These AIs output, as tokens, hundreds of potential solutions, whereas a human solving a very tricky ARC problem might need at most a few dozen cases to run through. There's a big mess of ANN linear algebra / human subconscious thought, and I agree these messes can't be compared (or even identified in the human case). But we can compare the efficiency of the solution. It is possible that subconsciously humans "generate" hundreds of solutions that are mostly discarded, but I don't think the brain is fast enough to do that at the speed of conscious thought: it's a 50bn-core processor, but each core is only 200Hz and they aren't general-purpose CPUs. It also seems inconsistent with how humans solve these problems.

I believe energy usage would be even more misleading: in terms of operations/second a human brain is comparable to a 2020s supercomputer running at 30MW, but it only consumes 300 watts. (I was thinking about this with the "tiny portion" comment but it is irrelevant.)


Thanks for the response! I was trying to allude to what you are describing with the bit (ha) I mentioned about higher order thinking but you obviously articulated it much more effectively.

I guess I’m not sure it’s obvious where the right line to draw the boundary for “intentional human thought” is? Surely there is a lot of cognition and representation going on at extraordinary speeds that exist in some hazy border region between instinct/reflex/subconscious and conscious thought. Still, having said that, I do see what you are saying about trying to compare the complexity of the formal path to the solution, or at least what the human thinks their formal path was.

I’m generally of the mind (also, ha) that we won’t really ever be able to quantify any of this in a meaningful way in the short term and if anything which qualifies as AGI does emerge, it might only be something which is an “I know it when I see it” kind of evaluation…

Where are you getting 300W from? The body only dumps 100W of heat at rest and uses like 300-400W during moderate physical activity, so I’m a little confused about what you are describing there. The typical estimates I’ve seen are like 20W or so for the brain.

Edit: I should also say that what you describe does seem like a great way to compare solutions between computational systems currently being developed and a good one to use to try to push development forward; it just seems quixotic to try to be able to use it comparatively with human cognition or to be able to meaningfully use it to define where AGI is, which might not be what you were advocating for at all, in which case, sorry for misinterpreting!


I'm unable to figure out how to solve current Daily Puzzle (Puzzle ID: 79369cc6) at https://arcprize.org/play

Either I'm really dumb or the test is getting into captcha-like territory where humans aren't really good at solving/deciphering the test anymore.


I agree some of the tests are not intuitive.


(Spoiler alert)

In https://arcprize.org/play?task=79369cc6 , the yellow 3x3 square shows you the pink pattern to look for, while allowing rotations (and ignoring the fact that the pattern may be next to other patterns)


Were there any interesting non-neural approaches? I was wondering whether there is any underlying structure in the ARC tasks that could tell us something about algorithms for "reasoning" problems in general.


The 3rd-place solution, by Agnis Liukis, solves 40 tasks. https://www.kaggle.com/code/gregkamradt/arc-prize-2024-solut...


Can we get this person to explain their inspiration?


This is a good question!!


Reasons that I can't take this benchmark seriously:

1. Existing brute force algorithms solve 40% of this "reasoning" and "generalization" test.

2. AGI must evidently fit on a single 16GB, decade-old GPU?

3. If ARC fails blind people, it's not a reasoning test. Reasoning is independent of visual acuity. So ARC is at best a vision processing then reasoning test. SotA model "failure" is meaningless. ("But what about the other format, JSON?" Yeah, I would love to see the human solve rate on that...)


Ergh. This test checks how well you can infer cellular automaton rules. Considering that CAs are Turing-complete, that might be a very good entry-level intelligence detector.

If it's so easy to brute force, why wouldn't you claim the $1M?


I'm a little surprised by the seeming enthusiasm in the report for TTT as an approach. The results speak for themselves and TTT seems like a powerful approach. But the dependence on large amounts of synthetic pre-training data seems to contradict the philosophical ideas behind the competition.



