Yes, apparently they couldn't use their full approach because of "missing information in the dataset". That points to a further limitation of the approach: it works for Codeforces problems but not for APPS problems (so it's very purpose-specific).
Btw, APPS is not much of a benchmark. It evaluates code generation according to how closely it resembles code written by humans. That's standard fare for text generation benchmarks, like evaluating machine translation against some arbitrary set of human translations. There are no good benchmarks for text generation (and there are no good metrics either).
But the comparison against the average competitor on Codeforces is even more meaningless, because we have no way to know the true coding ability of that average competitor.
> Btw, APPS is not much of a benchmark. It evaluates code generation according to how closely it resembles code written by humans.
No, the metric used in this paper was the percentage of questions it could solve against the hidden tests.
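Concretely, a problem only counts as solved if the generated program passes every hidden test. Something roughly like this check (a toy sketch of mine, not the paper's actual harness; names are made up):

    import subprocess

    def passes_hidden_tests(source_path, hidden_tests):
        # hidden_tests: list of (stdin, expected stdout) pairs that are kept out of the prompt.
        for stdin_data, expected in hidden_tests:
            try:
                result = subprocess.run(
                    ["python3", source_path],
                    input=stdin_data, capture_output=True, text=True, timeout=5,
                )
            except subprocess.TimeoutExpired:
                return False
            if result.returncode != 0 or result.stdout.strip() != expected.strip():
                return False
        return True

    def solve_rate(submissions):
        # submissions: list of (source_path, hidden_tests). The reported metric is
        # simply the fraction of problems whose program passes all of its tests.
        solved = sum(passes_hidden_tests(path, tests) for path, tests in submissions)
        return solved / len(submissions)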
> That points to a further limitation of the approach: it works for Codeforces problems but not for APPS problems (so it's very purpose-specific).
This typically doesn't matter, since you'd just pretrain on whatever data works. However, "[t]he CodeContests training set has a non-empty intersection with the APPS test set, and therefore CodeContests cannot be used during training when evaluating on the APPS benchmark." This is purely an evaluation issue; leakage doesn't matter so much in production.
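The overlap check itself is mundane: drop any training problem whose statement also appears in the evaluation set. A rough sketch, with made-up field names rather than the datasets' real identifiers:

    def normalise(statement):
        # Crude key: lowercase and collapse whitespace. Real deduplication would be fuzzier.
        return " ".join(statement.lower().split())

    def drop_overlap(train_problems, eval_problems):
        # Remove training problems whose normalised statement also occurs in the eval set.
        eval_keys = {normalise(p["statement"]) for p in eval_problems}
        return [p for p in train_problems if normalise(p["statement"]) not in eval_keys]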
>> No, the metric used in this paper was the percentage of questions it could solve against the hidden tests.
Right, that's my mistake. The APPS dataset has natural language specifications and test cases for evaluation. It actually includes Codeforces problems.
What you quote in the second part of your comment is an excuse. If a large language model can complete a code generation task, that's because it's seen an example of the code it's asked to generate before. Any claims to the contrary need very strong evidence to support them and there's typically no such thing in papers like the AlphaCode one.
Your comment is an excuse, not mine: “ignore how good the advertised model is, because a different, much smaller version without all the techniques merely does pretty well on an extremely tough problem set.”
> Any claims to the contrary need very strong evidence to support them and there's typically no such thing in papers like the AlphaCode one.
This is the opposite of how burden of proof works. You are the one making a claim with certainty based on guesswork, not me. And the paper does actually have a section on this, and it finds that copying isn't pervasive outside of utility snippets and functions, which also occur in human solutions. It's a weak objection anyway; just the task of translating English prose into the general algorithm you want to apply is already an impressive feat.
>> “ignore how good the advertised model is, because a different, much smaller version without all the techniques merely does pretty well on an extremely tough problem set.”
Where is that quote from? Why are you quoting it? Am I supposed to reply to it?
>> You are the one making a claim with certainty based on guesswork, not me.
I'm not talking about you. I'm talking about the paper, the team behind it, and work on large language models trained on online data in general.
The paper indeed makes a vague claim of "guarding" against data leakage by a "strict temporal split", which means they ensured that the validation and test data used for fine-tuning was not available to the model. That of course doesn't mean much. What matters is whether the data on which the model was trained included programs like the ones the model was asked to generate. Clearly, it did, otherwise the model would not have been able to generate any programs that could be used as solutions to the test problems.
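To spell out what a "temporal split" buys you: it only filters training data by date, it says nothing about whether the training set contains programs similar to the test solutions. Something like this (illustrative field names, not their schema):

    def respects_temporal_split(train_problems, eval_problems):
        # "Strict temporal split": every training problem predates the earliest
        # evaluation problem. It is a filter on timestamps, not on content similarity.
        cutoff = min(p["published_at"] for p in eval_problems)
        return all(p["published_at"] < cutoff for p in train_problems)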
And I think you have the rule on the burden of proof a bit wrong. I don't have to prove anything that is already well-known. For instance, if I said that gravity makes things fall down, I wouldn't bear any burden of proof. Accordingly, there is no doubt that neural nets can only represent what is in their training set. That's how neural nets work: they model their training data. They can't model data that is not in their training set. It wouldn't even be fair to expect a neural net to learn to represent data that it wasn't trained on, and just to be clear, I'm not saying that there should be such an expectation, or that it is even desirable. This modelling ability of neural nets is useful. In fact, this is the real strength of neural nets: they are extremely good at modelling. I mean, duh! Why are we even discussing this?
But this is something that the deep learning community is trying to deny, to itself primarily, it seems. Which is exceedingly strange. Work like the one linked above prefers to make bizarre claims about reasoning ability that we are, presumably, expected to believe arises magickally just by training on lots of data, as if there's a threshold of volume above which data is miraculously transubstantiated into an element with quite different properties, from which reasoning or "critical thinking" (dear god) emerges even in the complete absence of anything remotely like a reasoning mechanism. This is nonsense. Why not admit that in order for a large language model to be able to generate code, it must see code "like" the one it's asked to generate? Then we can talk about what "like" means, which is the interesting question. All this pussyfooting around what those systems are really doing is so counter-productive.
Again, this is not about anything you specifically say, but a criticism of deep learning research in general. I don't presume you're a deep learning researcher.
>> It's a weak objection anyway; just the task of translating English prose into the general algorithm you want to apply is already an impressive feat.
Not as impressive as you think. The problem descriptions used on Codeforces etc. are not arbitrary English prose. They don't ask participants to write a poem about Spring (and I don't mean the old Java library). So it's not "prose" but very precise specifications. They could be represented in a Controlled Natural Language. So they are something much easier to model than arbitrary English.
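To illustrate with a made-up miniature (not a real Codeforces problem): the statement is essentially an I/O contract, and the step from "prose" to program is close to mechanical.

    STATEMENT = """
    You are given n integers a_1 ... a_n.
    Input: the first line contains n (1 <= n <= 100000); the second line contains the n integers.
    Output: print the maximum difference between any two of the integers.
    """

    def solve(numbers):
        # The statement pins down input format, constraints and output exactly,
        # so the mapping from the "natural language" to code is almost one-to-one.
        return max(numbers) - min(numbers)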
And, yet again, the performance of the model is crap.
I was paraphrasing your argument. AFAICT, it remains faithful to what you are saying.
> Accordingly, there is no doubt that neural nets can only represent what is in their training set.
This is not true. If there really were no doubt, the paper wouldn't have challenged it and claimed it false. If you assume your conclusion, well, obviously your conclusion follows trivially.
> And, yet again, the performance of the model is crap.
> Btw, APPS is not much of a benchmark. It evaluates code generation according to how closely it resembles code written by humans. That's standard fare for text generation benchmarks, like evaluating machine translation against some arbitrary set of human translations. There are no good benchmarks for text generation (and there are no good metrics either).

> But the comparison against the average competitor on Codeforces is even more meaningless, because we have no way to know the true coding ability of that average competitor.