>> AlphaCode ranked within the top 54% in real-world programming competitions,
an advancement that demonstrates the potential of deep learning models for tasks
that require critical thinking.
Critical thinking? Oh, wow. That sounds amazing!
Let's read further on...
>> At evaluation time, we create a massive amount of C++ and Python programs for
each problem, orders of magnitude larger than previous work. Then we filter,
cluster, and rerank those solutions to a small set of 10 candidate programs that
we submit for external assessment.
Ah. That doesn't sound like "critical thinking", or any thinking. It sounds like
massive brute-force guessing.
A quick look at the arXiv preprint linked from the article reveals that the
"massive" number of programs generated is in the millions (see Section 4.4).
These are "filtered" by testing them against the program input-output (I/O)
examples given in the problem descriptions. This "filtering" still leaves a few
thousand candidate programs, which are further reduced by clustering to "only"
10 (the ones finally submitted).
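To make the pipeline concrete, here is a minimal sketch of that filtering step
as I understand it from the preprint (my own illustration, not AlphaCode's
code; the subprocess wrapper and function names are assumptions): each
candidate is run on the handful of example I/O pairs from the problem
statement, and only candidates that reproduce the expected outputs are kept.

    import subprocess

    def passes_examples(source_path, io_examples, timeout_s=2.0):
        """Run one candidate Python program against the example I/O pairs
        given in the problem statement (typically only two or three)."""
        for stdin_text, expected in io_examples:
            try:
                result = subprocess.run(
                    ["python3", source_path],
                    input=stdin_text,
                    capture_output=True,
                    text=True,
                    timeout=timeout_s,
                )
            except subprocess.TimeoutExpired:
                return False
            if result.returncode != 0 or result.stdout.strip() != expected.strip():
                return False
        return True

    def filter_candidates(candidate_paths, io_examples):
        """Keep only candidates that reproduce the example outputs. With
        millions of candidates and only a few examples, thousands survive."""
        return [p for p in candidate_paths if passes_examples(p, io_examples)]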
So it's a generate-and-test approach rather than anything to do with reasoning
(as claimed elsewhere in the article) let alone "thinking". But why do such
massive numbers of programs need to be generated? And why are there still
thousands of candidate programs left after "filtering" on I/O examples?
The reason is that the generation step is constrained by the natural-language
problem descriptions, but those are not enough to generate appropriate
solutions, because the generating language model doesn't understand what the
problem descriptions mean; so the system must generate millions of solutions in
the hope of "getting lucky". Most of those don't pass the I/O tests and are
discarded. But there are only a handful of I/O examples for each problem, so
many programs can pass them and still not satisfy the problem spec. In the end,
clustering is needed to reduce the overwhelming number of essentially randomly
generated programs to a small set. As a method of generating programs, this is
not much more precise than drawing numbers out of a hat.
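And a correspondingly rough sketch of why clustering is needed at the end
(again my own illustration, under the assumption that, as the preprint
describes, programs are grouped by their behaviour on additional generated
inputs): programs that pass the few example tests can still disagree on other
inputs, so the survivors are fingerprinted by their outputs on extra inputs and
one representative per behaviour group is submitted.

    from collections import defaultdict

    def behaviour_signature(run_program, source_path, extra_inputs):
        """Fingerprint a candidate by its outputs on extra (generated) inputs.
        run_program(source_path, input_text) -> output_text is assumed, e.g.
        a subprocess wrapper like the one sketched above."""
        return tuple(run_program(source_path, text) for text in extra_inputs)

    def cluster_and_pick(candidate_paths, run_program, extra_inputs, k=10):
        """Group survivors by identical behaviour and pick one representative
        from each of the k largest groups for submission."""
        clusters = defaultdict(list)
        for path in candidate_paths:
            sig = behaviour_signature(run_program, path, extra_inputs)
            clusters[sig].append(path)
        biggest_first = sorted(clusters.values(), key=len, reverse=True)
        return [group[0] for group in biggest_first[:k]]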
Inevitably, the results don't seem to be particularly accurate, hence the
evaluation against programs written by participants in coding competitions,
which is not an objective measure of program correctness. Table 10 in the arXiv
preprint lists results on a more formal benchmark, the APPS dataset, where it's
clear that the results are extremely poor (the best-performing AlphaCode
variant solves 20% of the "introductory"-level problems, though it does
outperform earlier approaches).
Overall, pretty underwhelming, and a bit surprising to see such lackluster
results from DeepMind.