The article says: "We validated our performance using competitions hosted on Codeforces, a popular platform which hosts regular competitions that attract tens of thousands of participants from around the world who come to test their coding skills. We selected for evaluation 10 recent contests, each newer than our training data. AlphaCode placed at about the level of the median competitor, marking the first time an AI code generation system has reached a competitive level of performance in programming competitions."
[edit] Is "10 recent contests" a large enough sample size to prove whatever point is being made?
The comparison against human contestants doesn't tell us much on its own, because we have no objective measure of those contestants' ability: placing at "about the level of the median competitor" just means the median of some unknown distribution of skill.
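To make the sample-size question concrete, here is a minimal sketch (not anything from the paper) of the kind of uncertainty that comes with averaging over only 10 contests: a percentile bootstrap over per-contest rankings. The percentile numbers below are purely hypothetical placeholders, chosen only to illustrate how wide a confidence band 10 data points leave around an "about median" claim.

    # Illustrative only: the per-contest percentiles below are hypothetical,
    # NOT AlphaCode's actual results. The point is the width of the interval
    # you get from a sample of 10.
    import random

    random.seed(0)

    # Hypothetical percentile rank in each of 10 contests (50 = exactly median).
    percentiles = [38, 61, 45, 52, 70, 33, 58, 49, 66, 41]

    def bootstrap_ci(data, n_resamples=10_000, alpha=0.05):
        """Percentile-bootstrap confidence interval for the mean of `data`."""
        means = sorted(
            sum(random.choices(data, k=len(data))) / len(data)
            for _ in range(n_resamples)
        )
        return (means[int(alpha / 2 * n_resamples)],
                means[int((1 - alpha / 2) * n_resamples) - 1])

    low, high = bootstrap_ci(percentiles)
    print(f"mean percentile: {sum(percentiles) / len(percentiles):.1f}")
    print(f"95% bootstrap CI: ({low:.1f}, {high:.1f})")
    # With n = 10 the interval easily spans ten or more percentile points,
    # so "about the level of the median competitor" is a fairly loose claim.

With made-up inputs like these the interval runs well on either side of the median, which is exactly the worry the question raises: 10 contests is a thin basis for the headline claim.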
There are more objective measures of performance, like a good old-fashioned benchmark dataset. For such an evaluation, see Table 10 in the arXiv preprint (page 21 of the PDF), which lists results on the APPS dataset of programming tasks. The best-performing variant of AlphaCode solves 25% of the simplest ("introductory") APPS tasks and less than 10% of the intermediate ("interview") and more advanced ("competition") ones.
So it's not very good.
Note also that the article above doesn't report the APPS results, presumably because they're not that good.
[edit] Is "10 recent contests" a large enough sample size to prove whatever point is being made?