Well put. You can ask LLMs about ARC-like challenges and they will come up with a list of possible problem formulations before you even show them an input. The models already know to expect various object manipulations, symmetry problems, and so on. The fact that the solution costs thousands of dollars tells me the model iterates over many candidate solutions, using this implicit knowledge plus the feedback it gets from running the program. It's still impressive, but I don't think this is what the ARC prize was supposed to be about.
Basically, the solutions that did well on ARC just threw thousands of ideas at the wall and picked the ones that stuck. They were literally generating thousands of Python programs, running them, and checking whether any produced the correct output when fed the example inputs. A rough sketch of that loop is below.
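Something like this, I imagine (a minimal sketch, not any team's actual code; `sample_candidate_program` is a hypothetical stand-in for whatever LLM call generates candidates):

```python
# Generate-and-test search over candidate programs, as described above.
# All names here are illustrative assumptions, not a real solver's API.
from typing import Callable, Optional

Grid = list[list[int]]

def solves_all_examples(program: Callable[[Grid], Grid],
                        examples: list[tuple[Grid, Grid]]) -> bool:
    """Return True if the candidate reproduces every example output."""
    try:
        return all(program(inp) == out for inp, out in examples)
    except Exception:
        # Generated programs crash all the time; count a crash as a miss.
        return False

def search(examples: list[tuple[Grid, Grid]],
           sample_candidate_program: Callable[..., Callable[[Grid], Grid]],
           budget: int = 10_000) -> Optional[Callable[[Grid], Grid]]:
    """Throw candidates at the wall; keep the first one that sticks."""
    for _ in range(budget):
        program = sample_candidate_program(examples)
        if solves_all_examples(program, examples):
            return program
    return None
```

The only "intelligence" in that loop lives in the sampler; everything else is brute-force filtering against the examples.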
This o3 doesn't need to run Python. It executes programs written in tokens inside its own context window, which is wildly inefficient but gives better results and is potentially more general.
So basically it's a massively inefficient trial-and-error LeetCode solver that only works because it throws incredible amounts of compute at the problem.