
This is extremely impressive, but I do think it’s worth noting that these two things were provided:

- a very well defined problem. (One of the things I like about competitive programming and the like is just getting to implement a clearly articulated problem, not something I experience on most days.)

- existing test data.

This is definitely a great accomplishment, but I think those two features of competitive programming are notably different than my experience of daily programming. I don’t mean to suggest these will always be limitations of this kind of technology, though.




I don't think it's quite as impressive as you make it out to be. Median performance in a Codeforces programming competition is solving the easiest 1-2 problems out of 5-6 problems. Like all things programming, the top 1% is much, much better than the median.

There's also the open problem of verifying correctness in solutions and providing some sort of flag when the model is not confident in its correctness. I give it another 5 years in the optimistic case before AlphaCode can reliably compete at the top 1% level.


This is technology that simply didn't exist in any form 2 years ago. For no amount of money could you buy a program that did what this one does. Watching the growth of Transformer-based models for a couple of years now has really hammered home that as soon as we figure out how an AI can do X, X is no longer AI, or at least no longer impressive. How this happens is with comments like yours, and I'd really like to push back against it for once. Also, 5 years? Assuming we have all of the future ahead of us, the idea that we only have 5 years left of being the top in programming competitions seems like it's somehow important and shouldn't be dismissed with "I don't think it's quite as impressive as you make it out to be."


I don't think that's what's happening. Let's talk about this case: programming. It's not that people are saying "an AI programming" isn't impressive or isn't AI; it's that when people say "an AI programming" they aren't talking about ridiculously controlled environments like this one.

It's like self-driving cars. A car driving itself for the first time in a controlled environment, I'm sure, was an impressive feat, and it wouldn't be inaccurate to call it a self-driving car. However, that's not what we're all waiting for when we talk about the arrival of self-driving cars.


And if AI programming were limited to completely artificial contexts you would have a point, though I'd still be concerned. We live in a world, however, where programmers routinely call on the powers of an AI to complete their real code and get real value out of it. This is based on the same technology that brought us this particular win, so clearly this technology is useful outside "ridiculously controlled environments."


That's not significantly different from how programming has worked for the last 40 years though. We slowly push certain types of decisions and tasks down into the tools we use, and what's left over is what we call 'programming'. It's cool, no doubt, but as long as companies need to hire 'programmers', then it's not the huge thing we're all looking out over the horizon waiting for.


Programmers do set up completely artificial contexts so AI can work.

None of the self-driving systems were set up by giving the AI access to sensors, a car, and the driver's handbook and saying, well, figure it out from there. The general trend is: solve a greatly simplified problem, then a more complex one, and so on up to dealing with the real world.


By AI programming I mean the AI doing programming, not programming the AI. Though soon enough the first will be doing the second and that's where the loop really closes...


>> This is technology that simply didn't exist in any form 2 years ago.

A few examples of neural program synthesis from at least 2 years ago:

https://sunblaze-ucb.github.io/program-synthesis/index.html

Another example from June 2020:

DreamCoder: Growing generalizable, interpretable knowledge with wake-sleep Bayesian program learning

https://arxiv.org/abs/2006.08381

RobustFill, from 2017:

RobustFill: Neural Program Learning under Noisy I/O

https://www.microsoft.com/en-us/research/wp-content/uploads/...

I could go on.

And those are only examples from neural program synthesis. Program synthesis, in general, is a field that goes way back. I'd suggest, as usual, not making big proclamations about its state of the art without being acquainted with the literature. Because if you don't know what others have done, every announcement by DeepMind, OpenAI et al. seems like a huge advance... when it really isn't.


Of course program synthesis has been a thing for years, I remember some excellent papers out of MSR 10 years ago. But which of those could read a prompt and build the program from the prompt? Setting up a whole bunch of constraints and having your optimizer spit out a program that fulfills them is program synthesis and is super interesting, but not at all what I think of when I'm told we can make the computer program for us. For instance, RobustFill takes its optimization criteria from a bundle of pre-completed inputs and outputs of how people want the program to behave instead of having the problem described in natural language and creating the solution program.


Program synthesis from natural language specifications has existed for many years, also. It's not my specialty (neither am I particularly interested in it), but here's a paper I found from 2017, with a quick search:

https://www.semanticscholar.org/paper/Program-Synthesis-from...

AlphaCode is not particularly good at it, either. In the arxiv preprint, besides the subjective and pretty meaningless "evaluation" against human coders, it's also tested on a formal program synthesis benchmark, the APPS dataset. The best performing AlphaCode variant reported in the arxiv preprint solves 25% of the "introductory" APPS tasks (the least challenging ones). All AlphaCode variants tested solve less than 10% of the "interview" and "competition" (intermediate and advanced) tasks. These more objective results are not reported in the article above, I think for obvious reasons (because they are extremely poor).

So it's not doing anything radically new and it's not doing it particularly well either. Please be better informed before propagating hype.

Edit: really, from a technical point of view, AlphaCode is a brute-force, generate-and-test approach to program synthesis that was state-of-the-art 40 years ago. It's just a big generator that spams programs hoping it will hit a good one. I have no idea who came up with this. Oriol Vinyals is the last author and I've seen enough of that guy's work to know he knows better than to bet on such a primitive, even backwards approach. I'm really shocked that this is DeepMind work.


I've also worked in the area and published research in it a couple of years ago. I almost worked for a company focused on neural program synthesis, but a couple of years ago they did a large pivot to much simpler problems, having decided that current research was not good enough to do well on problems like this. I had a paper accepted 3ish years ago that translated between toy programming languages. Toy here meaning roughly the complexity of a simply typed lambda calculus; I wrote the language in a couple hundred lines.

This and Copilot are much better than the level of problems being tackled a couple of years ago.


I don't agree. I don't know your work, but the approach in AlphaCode and Copilot (or the Codex model behind it) is a step backwards for neural program synthesis, and for program synthesis in general. The idea is to train a large language model to generate code. The trained language model has no way to direct its generation towards code that satisfies a specification; it can only complete source code from some initial prompt. The code generated is remarkably grammatical (in the context of a programming language grammar), which is certainly an advance for language representation, but some kind of additional mechanism is required to ensure that the generated code is relevant to the specification.

In Copilot, that mechanism is the user's eyeballs. In AlphaCode, the mechanism is to test against the few I/O examples of the programming problems. In either case, the whole thing is hit-and-miss. The language model generates mostly garbage. DeepMind brags that AlphaCode generates "orders of magnitude" more code than previous work, but that's just to say that it's even more random, and its generation misses the target even more than previous work! Even filtering on I/O examples is not enough to control the excessive over-generation, so additional measures are needed (clustering and ranking of programs, etc.).
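
To make the point concrete, here's a rough toy sketch of the kind of sample-filter-cluster loop I'm describing. This is my own illustration, not AlphaCode's actual code: sample_candidate is a stub standing in for the language model, and the "clustering" is just grouping candidates by behaviour on a few extra inputs.

    import random
    from collections import defaultdict

    def sample_candidate(problem_statement):
        # Stub standing in for sampling source code from a large language model.
        pool = [
            "def solve(x):\n    return x + x",
            "def solve(x):\n    return x * 2",
            "def solve(x):\n    return x ** 2",
        ]
        return random.choice(pool)

    def passes_examples(src, examples):
        # Keep a candidate only if it reproduces the few given I/O examples.
        scope = {}
        try:
            exec(src, scope)
            return all(scope["solve"](i) == o for i, o in examples)
        except Exception:
            return False

    def generate_and_filter(problem_statement, examples, n_samples=1000):
        # Over-generate, filter on the examples, then cluster the survivors
        # by behaviour on some extra inputs and submit one per big cluster.
        clusters = defaultdict(list)
        extra_inputs = [0, 1, 5, 10]
        for _ in range(n_samples):
            src = sample_candidate(problem_statement)
            if not passes_examples(src, examples):
                continue
            scope = {}
            exec(src, scope)
            try:
                signature = tuple(scope["solve"](i) for i in extra_inputs)
            except Exception:
                continue
            clusters[signature].append(src)
        ranked = sorted(clusters.values(), key=len, reverse=True)
        return [group[0] for group in ranked]

    candidates = generate_and_filter("double the input", [(2, 4), (3, 6)])
    print(candidates[:3])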

All this could be done 40 years ago with a dumb DSL, or perhaps a more sophisticated system like a PCFG for programs, with a verifier bolted on [1]. It's nothing new. What's new is that it's done with a large language model trained with a Transformer, which is all the rage these days, and of course that it's done at the scale and with the amount of processing power available to DeepMind. Which I'm going to assume you didn't have back when you published your work.

Honestly, this is just an archaic, regressive approach, that can only work because of very big computers and very big datasets.

___________

[1] Which, btw, is straightforward to do "by hand" and is something that people do all the time. In the AlphaCode work, the large language model simply replaces a hand-crafted program generator with a lot of data, but there is no reason to do that. This is the quintessential problem where a machine learning solution is not necessary, because a hand-crafted solution is available and easier to control.
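
For example, a toy hand-crafted generator with a verifier bolted on, along the lines I mean (the grammar, depth bound and I/O examples here are all made up for illustration):

    from itertools import product

    VARIABLE = "x"
    CONSTANTS = ["1", "2"]
    OPERATORS = ["+", "*", "-"]

    def expressions(depth):
        # Depth-bounded enumeration over a tiny expression grammar.
        if depth == 0:
            yield VARIABLE
            yield from CONSTANTS
            return
        smaller = list(expressions(depth - 1))
        yield from smaller
        for left, op, right in product(smaller, OPERATORS, smaller):
            yield "(%s %s %s)" % (left, op, right)

    def synthesize(examples, max_depth=2):
        # The "verifier": evaluate each candidate on the I/O examples and
        # return the first one that matches all of them.
        for expr in expressions(max_depth):
            if all(eval(expr, {"x": i}) == o for i, o in examples):
                return expr
        return None

    # Target behaviour f(x) = 2x + 1, given only as input/output pairs.
    print(synthesize([(1, 3), (2, 5), (3, 7)]))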


I agree the approach itself is quite brute-force heavy. There is a lot of information that could be used and that I'd hope is helpful, like the grammar of the language, traces/simulations of behavior, etc.

But when I say AlphaCode/Copilot is good, I'm referring solely to the difficulty of the problems they are tackling. There are many papers, including mine, that worked on simpler problems and used more structure to work on them.

I expect follow-up work will actually incorporate other knowledge more heavily into the model. My work was mainly on restricting tree-like models to only make predictions that follow the grammar of the language. Does that parallelize/fit well with a transformer? Unsure, but I would expect some language information/genuine problem constraints to be incorporated in future work.
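
Something like this toy sketch of grammar-constrained decoding, to give the flavour (not my actual model: the scores are random stand-ins for logits, and legal_next is a tiny hand-written check I made up for the example):

    import random

    TOKENS = ["x", "1", "+", "*", "(", ")", "<eos>"]

    def legal_next(prefix):
        # Tiny hand-written legality check for infix expressions:
        # after an operand or ')', expect an operator, ')' or end;
        # after an operator or '(' (or at the start), expect an operand or '('.
        open_parens = prefix.count("(") - prefix.count(")")
        last = prefix[-1] if prefix else None
        if last in (None, "+", "*", "("):
            return {"x", "1", "("}
        allowed = {"+", "*"}
        allowed.add(")" if open_parens > 0 else "<eos>")
        return allowed

    def constrained_sample(max_len=15):
        prefix = []
        while len(prefix) < max_len:
            scores = {t: random.random() for t in TOKENS}  # stand-in for model logits
            token = max(legal_next(prefix), key=lambda t: scores[t])  # mask, then pick
            if token == "<eos>":
                break
            prefix.append(token)
        # May truncate mid-expression at max_len; a real decoder would force
        # a grammatical completion instead.
        return " ".join(prefix)

    print(constrained_sample())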

Honestly I am pretty surprised how far pure brute force with a large model is going. I would not have expected GPT-3-level language modeling from more scale on a transformer and little else.


Well, I'm not surprised, because I know that large language models can learn smooth approximations of natural language. They can generate very grammatical natural English, so why not grammatical source code, which is easier? Of course, once you have a generator for code, finding a program that satisfies a specification is just a matter of searching, assuming the generated code includes such a program. But it seems like that isn't really the case with AlphaCode, because its performance is very poor.

I have to say that usually I'm the one speaking out against an over-reliance on machine learning benchmarks and against expecting a new approach to beat the state of the art before it can be taken seriously, but this is not a new approach, and that's the problem I have here. It's nothing new, repackaged as something new and sold as something it isn't ("reasoning" and "critical thinking" and other nonsense like that).

I agree that future work must get smarter, and incorporate some better inductive biases (knowledge, something). Or perhaps it's a matter of searching more intelligently: given that they can generate millions of programs, I'd have thought they'd be able to find more programs that approximate a solution.


Has someone tried classical program synthesis techniques on competitive programming problems? I wonder what would have been possible with tech from more than 2 years ago.


I don't know if anyone has tried it, but it's not a very objective evaluation. We have no good measure of the coding ability of the "median level competitor", so doing better or worse than that doesn't really tell us anything useful about the coding capability of an automated system.

So my hunch is that it probably hasn't been done, or hasn't been done often, because the program synthesis community would recognise it's pointless.

What you really want to look at is formal program synthesis benchmarks and how systems like AlphaCode do on them (hint: not so good).


You don't think it's impressive, yet you surmise that a computer program could compete at the level of the top 1% of all humans in five years?

That's wildly overstating the promise of this technology, and I'd be very surprised if the authors of this wouldn't agree.


Agree. If an AI could code within the top 1%, every single person whose career touches code would have their lives completely upended. If that’s only 5 years out…ooof.


Top 1% competitive programming level means that it can start solving research problems; the difficulty and creativity needed go up exponentially for harder problems, and programming contests have led to research papers before. It would be cool if we got there in 5 years but I doubt it. But if we got there it would revolutionize so many things in society.


I do kinda wonder if it'd lead to as good results if you just did a standard "matches the most terms the most times" search against all of GitHub.

I have a suspicion it would - kinda like Stack Overflow, problems/solutions are not that different "in the small". It'd almost certainly have given us the fast inverse square root trick verbatim, like GitHub's AI is doing routinely.
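
Roughly this kind of thing, as a toy sketch (the corpus here is a tiny made-up stand-in for GitHub):

    import re
    from collections import Counter

    # Made-up stand-in for a snippet corpus (think: all of GitHub).
    CORPUS = {
        "fast_inverse_sqrt.c": "float q_rsqrt(float number) { /* 0x5f3759df magic */ }",
        "binary_search.py": "def binary_search(arr, target): lo, hi = 0, len(arr) - 1",
        "two_sum.py": "def two_sum(nums, target): seen = {}",
    }

    def tokenize(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    def search(query, corpus=CORPUS):
        # Score = how many times each query term appears in the snippet.
        terms = set(tokenize(query))
        def score(doc):
            counts = Counter(tokenize(doc))
            return sum(counts[t] for t in terms)
        return max(corpus, key=lambda name: score(corpus[name]))

    print(search("binary search a sorted array for a target"))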


Can't rule it out, but if AlphaCode gets to top 1% in five years, that's when it can basically do algorithms research. We can ask it to come up with new algorithms for all the famous problems and then just have to try and understand its solutions :O


100% agree. Someone (who?) had to take the time and write the detailed requirements. In real jobs you rarely get good tickets with well-defined expectations; it's one of a developer's most important jobs to transform a fuzzy requirement into a good ticket.

(Side note: I find that many people skip this step, and go straight from fuzzy-requirement-only-discussed-on-zoom-with-Bob to code; open a pull request without much context or comments; and then a code reviewer is supposed to review it properly without really knowing what problem is actually being solved, and whether the code is solving a proper problem at all).


So what happens when OpenAI releases TicketFixer 0.8, which synthesizes everything from transcripts of your meetings to the comments on the JIRA ticket to the existing codebase and spits out better tickets to feed into the programming side?


Yup, I hope that'll happen. Then engineering would just end up being done at a higher level of abstraction, closer to what designers do with wireframes and mockups.

Kind of the opposite of the way graphic design has evolved. Instead of getting more involved in the process and, in many cases, becoming front-end developers, it'll become more abstract, with humans making the decisions and reasoning about what to include/exclude, how it'll flow, etc.

Even TicketFixer wouldn't be able to do more than offer a handful of possible solutions to design-type issues.


Yeah, we need our TicketFixer to also include the No_Bob 0.2 plugin that figures out that a decent percentage of the time whatever "Bob" is asking for in that meeting is not what "Bob" thinks he is asking for or should be asking for and can squash those tickets. Without that we're gonna somehow end up with spreadsheets in everything.


Haha, yeah, there's that, but there are also things like "adding a dark mode." There are a dozen ways to accomplish that kind of thing, and every company's solution will diverge when you get down to the details.


Take my money.


Is the next step in the evolution of programming having the programmer become the specifier?

Fuzzy business requirements -> programmer specifies and writes tests -> AI codes


That's all we've ever been since we invented software.

First we specified the exact flow of the bits with punch cards.

Then we got assembly and we specified the machine instructions.

Then we got higher level languages and we specified how the memory was to be managed and what data to store where.

Now we have object oriented languages that allow us to work with domain models, and functional languages that allow us to work with data structures and algorithms.

The next level may be writing business rules and specifying how services talk to each other, who knows, but it will be no different than it is now, just at a higher level.


If it's anything like my job:

while(1) { Fuzzy business requirements -> programmer specifies and writes tests -> AI codes }


Maybe the problem transformation will be both the beginning _and_ end of the developer's role.


But it's easy to create an AI conversation that will refine the problem.


> One of the things I like about competitive programming and the like is just getting to implement a clearly articulated problem

English versions of Codeforces problems may be well-defined but they are often very badly articulated and easy to misunderstand as a human reader. I still can't understand how they got AI to be able to generate plausible solutions from these problem statements.


They used the tests. The specification being very approximate is fine, because they had a prebuilt way to "check" if their result was good.


Wait what, they cheated to get this result? Only pretests are available to competitors before submitting. If they had access to the full test suite, then they had a HUGE advantage over actual competitors, and this result is way less impressive than claimed. Can you provide a source for this claim? I don't want to read the full paper.


If AlphaCode had access to full test suite then the result is not surprising at all.

You can fit anything given enough parameters.

https://fermatslibrary.com/s/drawing-an-elephant-with-four-c...


I think there will always be limitations.

Software is, ultimately, always about humans. Software is always there to serve a human need. And the "intelligence" that designs software will always, at some level, need to be an intelligence that understands the human mind, with all its knowledge, needs, and intricacies. There are no shortcuts to this.

So, I think AI as a replacement for software development professionals is currently more like a pipe dream. I think AI will give us powerful new tools, but I do not think it will replace, or even reduce, the need for software development professionals. In total it might even increase the need for software development professionals, because it adds another level to the development stack. Another level of abstraction, and another level of complexity that needs to be understood.



