Microsoft (presumably) did train it on their open source repositories, since those repositories are public GitHub repos. They didn't train it on anybody's private repositories.
The point is, if they're sure they won't be recycling copyrighted code wholesale, why not include their own in the training set? Surely their internal code is higher quality than the average git repo, which must be 80% abandonware (if my personal repos are anything to go by :P)
Probably because of the (very small) chance that Copilot could regurgitate something secret or embarrassing.
Which is not necessarily hypocritical. The amount of copying needed for something to be copyright infringement is not high… but it's still significantly higher than the amount needed to leak information. For that, just a few words will do, e.g.
// For Windows 12
or
// Fuck [company name]
or
uint64_t secret_key[2] = {0x1234567812345678, 0x8765432187654321};
But publicly accessible doesn't mean public domain. Microsoft has even shared some of its private code with third parties such as governments, no doubt under strict licenses which it expects to be honored. AGPL and other licenses on publicly accessible code still matter.
Microsoft's apparent legal opinion is that training an AI on the data is the same as reading it, and doesn't require a license.
That as long as they have the right to read the data, they have the right to train an AI on it. The fact that the code is available under an open source license is irrelevant to them.
As for why they didn't use their own private code to train their AI, I suspect it was more of a non-malicious "we don't need to; this public GitHub repo dataset is big enough for now."
Personally, I think Microsoft should double down on this legal stance. Train the AI on all their internal code. And train it on any code they have licensed from other companies too.
I remember when some Windows code was leaked: people explicitly skipped reading it to avoid getting sued if they were to work on the Linux kernel or Wine in the future. Reading code can most certainly lead to a copyright breach, and Microsoft of all corporations should know this.
> Microsoft's apparent legal opinion is that training an AI on the data is the same as reading it, and doesn't require a license.
How is that reconciled with the fact that a person who has read copyrighted code (not even the original source code, a mere decompiled version of it!) is forbidden from reimplementing it directly?
Clean room reimplementation is a way to prevent court cases, it's not a legal requirement.
If a company copies a competitor's product, then the chance of getting sued is very high. If they can show that, in fact, there was zero copying at all, then they can get the case dismissed and save great legal expense.
If the sample of the Metallica song is insubstantial enough, then you may well prevail in court.
It's unsurprising that copilot can reproduce the most famous subroutine of all time precisely, given that it occurs in hundreds or thousands of repos.
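(For reference, the subroutine in question is Quake III's fast inverse square root. Here's a minimal C sketch of it; I've done the type pun with memcpy rather than the original's pointer cast so it compiles cleanly without undefined behavior, but the magic constant and the single Newton-Raphson step are as in the widely copied original:)

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    /* Quake III-style fast inverse square root. */
    float Q_rsqrt(float number)
    {
        float x2 = number * 0.5f;
        float y  = number;
        uint32_t i;
        memcpy(&i, &y, sizeof i);      /* reinterpret the float's bits as an integer */
        i = 0x5f3759df - (i >> 1);     /* the infamous magic constant */
        memcpy(&y, &i, sizeof y);      /* back to float */
        y = y * (1.5f - (x2 * y * y)); /* one Newton-Raphson refinement step */
        return y;
    }

    int main(void)
    {
        printf("%f\n", Q_rsqrt(4.0f)); /* prints roughly 0.5 */
        return 0;
    }

That magic constant is so distinctive that any model trained on public GitHub will have seen it verbatim hundreds of times.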
Also, that code is not copyrightable. Pure algorithms are not copyrightable; the copyright in code arises from its literary qualities.
E.g. I can copy an algorithm out of an ISO spec and that doesn't make my code a derivative work of the spec requiring me to pay royalties to ISO.
When you strip the algorithmic elements out of fast inverse sqrt, you are left with what? Single-letter variable names. That is certainly far below the threshold for copyright.