Uhhhh… It was trained on ARC data? So they targeted a specific benchmark and are surprised and blown away that the LLM performed well on it? What’s that law again? Goodhart’s: once a benchmark becomes the target, it stops measuring anything useful?
Yeah, seriously. The task format is public, so some engineers at OpenAI could easily have spent a few months generating millions of permutations of grid-based puzzles and folding them into the training data. Handshakes all around, publicity for everyone.
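And the mechanics of that wouldn’t even be hard. Here’s a rough Python sketch of how cheap that kind of augmentation is (purely hypothetical, just to illustrate the point; no claim that this is what any lab actually did, and all names here are made up): rotate, mirror, and recolor each grid, and a single puzzle fans out into dozens of “new” examples.

```python
# Hypothetical sketch: one ARC-style grid can be turned into many variants
# via rotations, mirroring, and color relabeling. Illustrative only.
import random

Grid = list[list[int]]

def rotate(grid: Grid) -> Grid:
    """Rotate a grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def mirror(grid: Grid) -> Grid:
    """Mirror a grid left-to-right."""
    return [row[::-1] for row in grid]

def remap_colors(grid: Grid, mapping: dict[int, int]) -> Grid:
    """Relabel every cell according to a permutation of the palette."""
    return [[mapping[c] for c in row] for row in grid]

def augment(grid: Grid, n_color_perms: int = 10) -> list[Grid]:
    """Generate geometric and color-permuted variants of one grid."""
    variants = []
    g = grid
    for _ in range(4):                      # four rotations
        for base in (g, mirror(g)):         # with and without mirroring
            colors = sorted({c for row in base for c in row})
            for _ in range(n_color_perms):  # a few random palette relabelings
                shuffled = random.sample(colors, len(colors))
                variants.append(remap_colors(base, dict(zip(colors, shuffled))))
        g = rotate(g)
    return variants

if __name__ == "__main__":
    puzzle = [[0, 1, 0],
              [1, 2, 1],
              [0, 1, 0]]
    print(len(augment(puzzle)), "variants from a single 3x3 grid")
```

Run that over a few hundred seed puzzles and you’re into the millions without much effort.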
They’re running a business selling access to these models to enterprises and consumers. People won’t pay for stuff that doesn’t solve real problems; nobody pays just because of a benchmark score. It’d be really weird to become obsessed with gaming metrics rather than racing to build something smarter than the other guys. There’s nothing wrong with curating whatever kind of training set actually produces something useful.