The first question I still have is what happened to core knowledge priors. The white paper that introduced ARC made a big to-do about how core knowledge priors are necessary to solve ARC tasks, but from what I can tell none of the best-performing (or at-all-performing) systems have anything to do with core knowledge priors.

So what happened to that assumption? Is it dead?

The second question I still have is about the defenses of ARC against memorisation-based, big-data approaches. I note that the second best system is based on an LLM with "test time training" where the first two steps are:

  initial finetuning on similar tasks 
  auxiliary task format and augmentations
Which is to say, a data augmentation approach. With big data comes great responsibility, and the authors of the second-best system don't disappoint: they claim that by training on more examples they achieve reasoning.
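
For concreteness, the augmentations in question are, as far as I can tell, geometric and colour transforms of the ARC grids, applied identically to each input/output pair. A minimal sketch of the idea (my own illustration and naming, not the authors' actual pipeline):

  import numpy as np

  def augment_arc_pair(inp, out, rng=None):
      # Yield transformed copies of one (input, output) ARC grid pair:
      # the four rotations, optional mirror flips, and a random remapping
      # of the 10 ARC colours, applied identically to both grids.
      rng = rng or np.random.default_rng()
      perm = rng.permutation(10)
      for k in range(4):
          for flip in (False, True):
              a, b = np.rot90(inp, k), np.rot90(out, k)
              if flip:
                  a, b = np.fliplr(a), np.fliplr(b)
              yield perm[a], perm[b]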

So what happened to the claim that ARC is secure against big-data approaches? Is it dead?




What all top models do is recombine at test time the knowledge they already have. So they all possess Core Knowledge priors. Techniques to acquire them vary:

* Use a pretrained LLM and hope that relevant programs will be memorized via exposure to text data (this doesn't work that well)

* Pretrain an LLM on ARC-AGI-like data

* Hardcode the priors into a DSL (toy sketch below)
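
To illustrate that last option: "priors in a DSL" means the solver searches over a vocabulary of grid primitives that already encode objectness, geometry, counting, and so on. A toy sketch of the idea (my own primitives and names, not any particular solver's actual DSL):

  import numpy as np

  # A handful of grid primitives standing in for Core Knowledge priors;
  # a program-search procedure composes them into candidate programs and
  # keeps whichever program reproduces the task's demonstration pairs.
  PRIMITIVES = {
      "rotate":    lambda g: np.rot90(g),
      "mirror":    lambda g: np.fliplr(g),
      "transpose": lambda g: g.T,
      "majority":  lambda g: np.full_like(g, np.bincount(g.ravel()).argmax()),
  }

  def run_program(program, grid):
      # Apply a sequence of primitive names to a grid.
      for name in program:
          grid = PRIMITIVES[name](grid)
      return grid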

> Which is to say, a data augmentation approach

The key bit isn't the data augmentation but the TTT. TTT is a way to lift the #1 issue with DL models: that they cannot recombine their knowledge at test time to adapt to something they haven't seen before (strong generalization). You can argue whether TTT is the right way to achieve this, but there is no doubt that TTT is a major advance in this direction.

The top ARC-AGI models perform well not because they're trained on tons of data, but because they can adapt to novelty at test time (usually via TTT). For instance, if you drop the TTT component you will see that these large models trained on millions of synthetic ARC-AGI tasks drop to <10% accuracy. This demonstrates empirically that ARC-AGI cannot be solved purely via memorization and interpolation.
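
For reference, the TTT recipe is roughly: take the handful of demonstration pairs in the test task, expand them with augmentations, briefly fine-tune the model on those, and only then predict the test output. A minimal sketch of that loop (assuming a hypothetical grid-to-logits `model` and an `augment` helper; this is my own sketch, not any specific submission's code):

  import torch
  import torch.nn.functional as F

  def test_time_train(model, demo_pairs, augment, steps=20, lr=1e-4):
      # Briefly fine-tune `model` on a single task's demonstration pairs
      # (plus augmented copies) before predicting its test output.
      # `model` maps an input grid tensor to per-cell colour logits;
      # `augment` yields transformed (input, output) tensor pairs.
      opt = torch.optim.AdamW(model.parameters(), lr=lr)
      model.train()
      for _ in range(steps):
          for inp, out in demo_pairs:
              for aug_inp, aug_out in augment(inp, out):
                  logits = model(aug_inp)            # shape (H, W, 10)
                  loss = F.cross_entropy(
                      logits.reshape(-1, logits.shape[-1]),
                      aug_out.reshape(-1).long(),
                  )
                  opt.zero_grad()
                  loss.backward()
                  opt.step()
      return model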


>> So they all possess Core Knowledge priors.

Do you mean the ones from your white paper? The same ones that humans possess? How do you know this?

>> The key bit isn't the data augmentation but the TTT.

I haven't had the chance to read the papers carefully. Have they done ablation studies? For instance, is the following a guess or is it an empirical result?

>> For instance, if you drop the TTT component you will see that these large models trained on millions of synthetic ARC-AGI tasks drop to <10% accuracy.


>This demonstrates empirically that ARC-AGI cannot be solved purely via memorization and interpolation

Now that the current challenge is over, and a successor dataset is in the works, can we see how well the leading LLMs perform against the private test set?


I think the "semi-private" numbers here already measure that: https://arcprize.org/2024-results

For example, Claude 3.5 gets 14% in semi-private eval vs 21% in public eval. I remember reading an explanation of "semi-private" earlier but cannot find it now.


Even the strongest possible interpretation of the results wouldn't conclude "ARC-AGI is dead", because none of the submissions came especially close to human-level performance; the criterion was 85% success, but the best in 2024 was 55%.

That said, I think there should be some consideration of information thermodynamics: even with TTT, these program-generating systems use an enormous number of bits compared to a human mind, a tiny portion of which solves ARC quickly and easily using causality-first principles of reasoning.

Another point: suppose a system solves ARC-AGI with 99% accuracy. Then it should be tested on "HARC-HAGI," a variant that uses hexagons instead of squares. This likely wouldn't trip up a human very much - perhaps a small decrease due to increased surface area for brain farts. But if the AI needs to be retrained on a ton of hexagonal examples, then that AI can't be an AGI candidate.


> That said, I think there should be some consideration of information thermodynamics: even with TTT, these program-generating systems use an enormous number of bits compared to a human mind, a tiny portion of which solves ARC quickly and easily using causality-first principles of reasoning.

This isn’t my area of expertise, but it seems plausible to me that what you said is completely erroneous or at the very least completely unverifiable at this point in time. How do you quantify how many bits it takes a human mind to solve one of the ARC problems?

That seems likely to be beyond the level of insight we have into the structure of cognition, information storage, etc. in wetware. I could of course be wrong and would love to be corrected if so! You mentioned a “tiny portion” of the human mind, but (as far as I’m aware) any given “small” part of human cognition still involves huge amounts of complexity and compute.

Maybe you are saying that the high-level decision making a human goes through when solving can be represented with a relatively small number of pieces of information/logical operations (as opposed to a much lower-level, closer-to-the-wetware notion of the quantity of information), but then it seems unfair to compare that to the low-level equivalent (weights & biases, FLOPs, etc.) in the ML system when there may be higher-order equivalents.

I do appreciate the general notion of wanting to normalize against something, though, and some notion of information seems like a reasonable choice, but it is practically out of our reach. Maybe something like peak power or total energy consumption would be a more reasonable choice, which we can at least get lower and upper bounds on in the human case (metabolic rates are pretty well studied, and even if we don’t have a good idea of how much energy is involved in completing cognitive tasks we can at least get bounds for running the entire system in that period of time) and close to a precise value in the ML case.


I was speaking loosely, but the operative term is "information thermodynamics": comparing bits of AI output versus bits of intentional human thought, ignoring statistical/physical bits related to ANN inference or biological neuron activity. The "tiny portion of the human mind" thing was a distraction I shouldn't have included.

These AIs output hundreds of potential solutions as tokens, whereas a human solving a very tricky ARC problem might need to run through at most a few dozen cases. There's a big mess of ANN linear algebra / human subconscious thought, and I agree these messes can't be compared (or even identified, in the human case). But we can compare the efficiency of the solution. It is possible that humans subconsciously "generate" hundreds of solutions that are mostly discarded, but I don't think the brain is fast enough to do that at the speed of conscious thought: it's a 50bn-core processor, but each core runs at only 200Hz and they aren't general-purpose CPUs. It also seems inconsistent with how humans solve these problems.

I believe energy usage would be even more misleading: in terms of operations/second a human brain is comparable to a 2020s supercomputer running at 30MW, but it only consumes 300 watts. (I was thinking about this with the "tiny portion" comment but it is irrelevant.)


Thanks for the response! I was trying to allude to what you are describing with the bit (ha) I mentioned about higher order thinking but you obviously articulated it much more effectively.

I guess I’m not sure where the right place to draw the boundary for “intentional human thought” is? Surely there is a lot of cognition and representation going on at extraordinary speeds that exists in some hazy border region between instinct/reflex/subconscious and conscious thought. Still, having said that, I do see what you are saying about trying to compare the complexity of the formal path to the solution, or at least what the human thinks their formal path was.

I’m generally of the mind (also, ha) that we won’t really ever be able to quantify any of this in a meaningful way in the short term and if anything which qualifies as AGI does emerge, it might only be something which is an “I know it when I see it” kind of evaluation…

Where are you getting 300W from? The body only dumps 100W of heat at rest and uses like 300-400W during moderate physical activity, so I’m a little confused about what you are describing there. The typical estimates I’ve seen are like 20W or so for the brain.

Edit: I should also say that what you describe does seem like a great way to compare solutions between the computational systems currently being developed, and a good one to use to push development forward; it just seems quixotic to try to use it comparatively with human cognition, or to use it to meaningfully define where AGI is. That might not be what you were advocating for at all, in which case, sorry for misinterpreting!



