The test against human contestants doesn't tell us anything because we have no objective measure of the ability of those human coders (they're just the median in some unknown distribution of skill).
There are more objective measures of performance, like a good old-fashioned benchmark dataset. For such an evaluation, see table 10 in the arXiv preprint (page 21 of the pdf), listing the results against the APPS dataset of programming tasks. The best-performing variant of AlphaCode solves 25% of the simplest ("introductory") APPS tasks and less than 10% of the intermediate ("interview") and more advanced ("competition") ones.
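For concreteness, a "solve rate" on APPS just means the fraction of tasks in each difficulty tier where a generated program passes all the hidden tests. A minimal sketch of that bookkeeping, with made-up placeholder data (not numbers from the paper):

    from collections import defaultdict

    # Hypothetical per-task results: (difficulty tier, passed all hidden tests?)
    results = [
        ("introductory", True), ("introductory", False),
        ("interview", False), ("competition", False),
    ]

    totals, solved = defaultdict(int), defaultdict(int)
    for tier, ok in results:
        totals[tier] += 1
        solved[tier] += ok

    # Solve rate per tier, i.e. the kind of number quoted above (25%, <10%, ...)
    for tier in totals:
        print(f"{tier}: {100 * solved[tier] / totals[tier]:.1f}% solved")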
So it's not very good.
Note also that the article above doesn't report the results on APPS, presumably because they're not that good.