This is technology that simply didn't exist in any form 2 years ago. No amount of money could have bought you a program that does what this one does. Watching the growth of Transformer-based models over the past couple of years has really hammered home that as soon as we figure out how an AI can do X, X is no longer AI, or at least no longer impressive. This happens through comments like yours, and I'd really like to push back against it for once.
Also, 5 years? Assuming we have all of the future ahead of us, the thought that humans have only 5 years left at the top of programming competitions seems important, and shouldn't be dismissed with "I don't think it's quite as impressive as you make it out to be."
I don't think that's what's happening. Let's talk about this case: programming. It's not that people are saying "an AI programming" isn't impressive or isn't AI; it's that when people say "an AI programming" they aren't talking about ridiculously controlled environments like in this case.
It's like self-driving cars. A car driving itself for the first time in a controlled environment, I'm sure, was an impressive feat, and it wouldn't be inaccurate to call it a self-driving car. However, that's not what we're all waiting for when we talk about the arrival of self-driving cars.
And if AI programming were limited to completely artificial contexts you would have a point, though I'd still be concerned. We live in a world, however, where programmers routinely call on the powers of an AI to complete their real code and get real value out of it. This is based on the same technology that brought us this particular win, so clearly this technology is useful outside "ridiculously controlled environments."
That's not significantly different from how programming has worked for the last 40 years, though. We slowly push certain types of decisions and tasks down into the tools we use, and what's left over is what we call 'programming'. It's cool, no doubt, but as long as companies need to hire 'programmers', it's not the huge thing we're all looking out over the horizon waiting for.
Programmers do set up completely artificial contexts so AI can work.
None of the self-driving systems were set up by giving the AI access to sensors, a car, and the driver's handbook and saying "well, you figure it out from there." The general trend is: solve a greatly simplified problem, then a more complex one, and so on up to dealing with the real world.
By AI programming I mean the AI doing programming, not programming the AI. Though soon enough the first will be doing the second and that's where the loop really closes...
And those are only examples from neural program synthesis. Program synthesis, in general, is a field that goes way back. I'd suggest, as usual, not making big proclamations about its state of the art without being acquainted with the literature. Because if you don't know what others have done, every announcement by DeepMind, OpenAI et al. seems like a huge advance... when it really isn't.
Of course program synthesis has been a thing for years; I remember some excellent papers out of MSR 10 years ago. But which of those could read a prompt and build the program from the prompt? Setting up a whole bunch of constraints and having your optimizer spit out a program that fulfills them is program synthesis, and it's super interesting, but it's not at all what I think of when I'm told we can make the computer program for us.
For instance, RobustFill takes its optimization criteria from a bundle of pre-completed inputs and outputs showing how people want the program to behave, instead of having the problem described in natural language and creating the solution program from that description.
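To make that distinction concrete, here's a minimal sketch of that style of synthesis: the specification is just a handful of input/output pairs, and a brute-force search over a tiny string-transformation DSL returns any program consistent with them. The DSL and names here are made up for illustration and have nothing to do with RobustFill's actual operators.

```python
from itertools import product

# A toy DSL of string transformations: a program is a short pipeline of ops.
OPS = {
    "lower":   str.lower,
    "upper":   str.upper,
    "strip":   str.strip,
    "first3":  lambda s: s[:3],
    "reverse": lambda s: s[::-1],
}

def run(program, s):
    """Apply a sequence of op names to the input string."""
    for op in program:
        s = OPS[op](s)
    return s

def synthesize(examples, max_len=3):
    """Return the first pipeline consistent with all input/output examples."""
    for length in range(1, max_len + 1):
        for program in product(OPS, repeat=length):
            if all(run(program, i) == o for i, o in examples):
                return program
    return None

# The 'specification' is just I/O pairs -- no natural-language prompt involved.
examples = [("  Hello  ", "HELLO"), ("  world ", "WORLD")]
print(synthesize(examples))  # -> ('upper', 'strip')
```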
Program synthesis from natural language specifications has also existed for many years. It's not my specialty (nor am I particularly interested in it), but here's a paper I found from 2017 with a quick search:
AlphaCode is not particularly good at it, either. In the arXiv preprint, besides the subjective and pretty meaningless "evaluation" against human coders, it's also tested on a formal program synthesis benchmark, the APPS dataset. The best performing AlphaCode variant reported in the preprint solves 25% of the "introductory" APPS tasks (the least challenging ones). All AlphaCode variants tested solve less than 10% of the "interview" and "competition" (intermediate and advanced) tasks. These more objective results are not reported in the article above, I think for obvious reasons (because they are extremely poor). So it's not doing anything radically new, and it's not doing it particularly well either. Please be better informed before propagating hype.
Edit: really, from a technical point of view, AlphaCode is a brute-force, generate-and-test approach to program synthesis of the kind that was state of the art 40 years ago. It's just a big generator that spams programs hoping it will hit a good one. I have no idea who came up with this. Oriol Vinyals is the last author, and I've seen enough of that guy's work to know he knows better than to bet on such a primitive, even backwards, approach. I'm really shocked that this is DeepMind work.
I've also worked in the area and published research in it a couple of years ago. I almost worked for a company focused on neural program synthesis, but they made a large pivot a couple of years ago: they gave up, moved to much simpler problems, and decided that current research was not good enough to do well on problems like this. I had a paper accepted about 3 years ago that translated between toy programming languages. Toy here meaning roughly the complexity of a simply typed lambda calculus; I wrote the language in a couple hundred lines.
This and Copilot are tackling much harder problems than the ones being worked on a couple of years ago.
I don't agree. I don't know your work, but the approach in AlphaCode and Copilot (or the Codex model behind it) is a step backwards for neural program synthesis, and for program synthesis in general. The idea is to train a large language model to generate code. The trained language model has no way to direct its generation towards code that satisfies a specification; it can only complete source code from some initial prompt. The code generated is remarkably grammatical (in the context of a programming language grammar), which is certainly an advance for language representation, but some kind of additional mechanism is required to ensure that the generated code is relevant to the specification. In Copilot, that mechanism is the user's eyeballs. In AlphaCode, the mechanism is to test against the few I/O examples of the programming problems. In either case, the whole thing is just hit-and-miss. The language model generates mostly garbage. DeepMind brags that AlphaCode generates "orders of magnitude" more code than previous work, but that's just to say that it's even more random, and its generation misses the target even more than previous work! Even filtering on I/O examples is not enough to control the excessive over-generation, so additional measures are needed (clustering and ranking of programs, etc.).
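To caricature that filtering mechanism (this is a rough sketch, not DeepMind's actual pipeline), the control flow is essentially: sample a large batch of candidates from the model, run each one on the problem's example inputs, and keep only those whose output matches. `sample_candidates` below is a purely hypothetical stand-in for the language model, and `io_examples` is assumed to be a list of (input_text, expected_output) pairs.

```python
import subprocess

def sample_candidates(prompt, n):
    """Hypothetical stand-in for the language model: yields n candidate
    Python source strings for the problem described in `prompt`."""
    raise NotImplementedError  # plug in any generator here

def run_program(source, stdin_text, timeout=2):
    """Run one candidate program on one test input and capture its stdout."""
    try:
        result = subprocess.run(["python3", "-c", source], input=stdin_text,
                                capture_output=True, text=True, timeout=timeout)
        return result.stdout.strip()
    except subprocess.TimeoutExpired:
        return None

def generate_and_test(prompt, io_examples, n=1000):
    """Generate-and-test: keep only candidates whose output matches the
    problem's example I/O. (AlphaCode additionally clusters survivors by
    behaviour on freshly generated inputs before picking submissions.)"""
    survivors = []
    for source in sample_candidates(prompt, n):
        if all(run_program(source, inp) == expected.strip()
               for inp, expected in io_examples):
            survivors.append(source)
    return survivors
```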
All this could be done 40 years ago with a dumb DSL, or perhaps a more sophisticated system like a PCFG for programs, with a verifier bolted on [1]. It's nothing new. What's new is that it's done with a large language model trained with a Transformer, which is all the rage these days, and of course that it's done at the scale and with the amount of processing power available to DeepMind. Which I'm going to assume you didn't have back when you published your work.
Honestly, this is just an archaic, regressive approach that can only work because of very big computers and very big datasets.
___________
[1] Which, btw, is straightforward to do "by hand" and is something that people do all the time. In the AlphaCode work, the large language model simply replaces a hand-crafted program generator with a lot of data, but there is no reason to do that. This is the quintessential problem where a machine learning solution is not necessary, because a hand-crafted solution is available and easier to control.
I agree the approach itself is quite brute-force heavy. There is a lot of information that could be used and that I'd hope is helpful, like the grammar of the language, traces/simulations of behavior, etc.
But when I say AlphaCode/Copilot is good, I'm referring solely to the difficulty of the problems they are tackling. There are many papers, including mine, that worked on simpler problems and used more structure to solve them.
I expect follow-up work will actually incorporate other knowledge more heavily into the model. My work was mainly on restricting tree-like models to only make predictions that follow the grammar of the language. Does that parallelize/fit well with a Transformer? Unsure, but I would expect some language information/genuine problem constraints to be incorporated in future work.
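For what it's worth, a minimal sketch of the kind of constraint I mean (not my published method, and not anything AlphaCode does) is to mask the model's next-token distribution so that only grammar-legal continuations can be sampled. `legal_next_tokens` is a hypothetical oracle that would come from the language's grammar, e.g. a parser state.

```python
import math
import random

def legal_next_tokens(prefix_tokens, grammar):
    """Hypothetical oracle: the set of token ids that keep the partial
    program derivable under `grammar` (e.g. from an Earley/LR parser state).
    Assumed to always return at least one token."""
    raise NotImplementedError

def constrained_sample(model_logits, prefix_tokens, grammar):
    """Sample the next token, but only from grammar-legal continuations.
    `model_logits` maps token id -> unnormalised score from the model."""
    allowed = legal_next_tokens(prefix_tokens, grammar)
    # Mask out everything the grammar forbids, then renormalise and sample.
    scores = {t: math.exp(s) for t, s in model_logits.items() if t in allowed}
    total = sum(scores.values())
    r, acc = random.random() * total, 0.0
    for token, score in scores.items():
        acc += score
        if acc >= r:
            return token
    return next(iter(allowed))  # fallback if rounding leaves us short
```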
Honestly, I am pretty surprised how far pure brute force with a large model is going. I would not have expected GPT-3-level language modeling from more scale on a Transformer and little else.
Well, I'm not surprised, because I know that large language models can learn smooth approximations of natural language. They can generate very grammatical natural English, so why not grammatical source code, which is easier? Of course, once you have a generator for code, finding a program that satisfies a specification is just a matter of searching (and of the generated code actually including such a program). But it seems like that isn't really the case with AlphaCode, because its performance is very poor.
I have to say that usually I'm the one speaking out against an over-reliance on machine learning benchmarks and against expecting a new approach to beat the state of the art before it can be taken seriously, but this is not a new approach, and that's the problem I have here. It's nothing new, repackaged as something new and sold as something it isn't ("reasoning" and "critical thinking" and other nonsense like that).
I agree that future work must get smarter and incorporate some better inductive biases (knowledge, something). Or perhaps it's a matter of searching more intelligently: given that they can generate millions of programs, I'd have thought they'd be able to find more programs that approximate a solution.
Has someone tried classical program synthesis techniques on competitive programming problems? I wonder what would have been possible with tech from more than 2 years ago.
I don't know if anyone has tried it, but it's not a very objective evaluation. We have no good measure of the coding ability of the "median level competitor", so doing better or worse than that doesn't really tell us anything useful about the coding capability of an automated system.
So my hunch is that it probably hasn't been done, or hasn't been done often, because the program synthesis community would recognise it's pointless.
What you really want to look at is formal program synthesis benchmarks and how systems like AlphaCode do on them (hint: not so good).