It's not the license of the model, it's the license of the output.
As it stands, Copilot is a black-box which strips copyright from a piece of code.
I'd be fine if it were a level playing field and GitHub also trained it on private repositories - that's a signal that they don't care about copyright at all.
I'd be fine as a developer who releases GPL'ed code if the output was licensed as GPL - obviously no license violation.
I'd be fine (within a reasonable scale) if a developer contacted me and asked to use my code under MIT.
I'm not fine with Copilot allowing people to take my code, 'change the variable names', and remove the license, especially because I have no visibility into the fact that this has occurred.
You could also imagine different Copilot models, e.g. Copilot-GPL, Copilot-MIT, etc. Each would be trained only on GPL or MIT code from GitHub. Then which model gets used depends on the license of the file being written at the time.
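A minimal sketch of how that routing could work, assuming a crude SPDX-style lookup; the model names and the LICENSE-file heuristic are hypothetical, not anything GitHub ships:

    import os

    # Hypothetical per-license model routing. The model names and the
    # detection heuristic are assumptions for illustration only.
    MODELS = {
        "GPL-3.0": "copilot-gpl",        # trained only on GPL code
        "MIT": "copilot-mit",            # trained only on MIT code
        "Apache-2.0": "copilot-apache",  # trained only on Apache code
    }

    def detect_license(repo_root):
        # Crude heuristic: match a known identifier in the LICENSE file.
        try:
            with open(os.path.join(repo_root, "LICENSE")) as f:
                text = f.read().lower()
        except FileNotFoundError:
            return None
        for spdx_id, marker in [("GPL-3.0", "gnu general public license"),
                                ("MIT", "mit license"),
                                ("Apache-2.0", "apache license")]:
            if marker in text:
                return spdx_id
        return None

    def pick_model(repo_root):
        # Unknown license: fall back to a model trained only on
        # permissively licensed code rather than risk contamination.
        return MODELS.get(detect_license(repo_root), "copilot-permissive")

The interesting design question is the fallback: an unrecognised license should probably route to the most conservative model, not the most capable one.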
But Copilot doesn't take your code; at best it has learned from a fraction of a fraction of your code and synthesized it with tens of thousands of like examples, and the output may look similar to your code because it's trying to achieve the same thing. It's not like Copilot takes your entire repo, clones it, and says "we washed the onerous license requirements away for ya".
There's a minimum level of complexity and creativity below which code doesn't qualify for copyright protection. It's up to a legal professional to draw the line, but I believe it can be met by a single line of code (`i = 0x5f3759df - ( i >> 1 );`).
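For context, that's the "magic constant" step of the Quake III fast inverse square root; here's a runnable Python rendering of the surrounding algorithm (the original is C, operating directly on the float's bits):

    import struct

    def fast_inverse_sqrt(x):
        # Reinterpret the 32-bit float's bits as an unsigned integer.
        i = struct.unpack('<I', struct.pack('<f', x))[0]
        # The famous line: a shift-and-subtract initial estimate of 1/sqrt(x).
        i = 0x5f3759df - (i >> 1)
        y = struct.unpack('<f', struct.pack('<I', i))[0]
        # One Newton-Raphson iteration sharpens the estimate.
        return y * (1.5 - 0.5 * x * y * y)

    print(fast_inverse_sqrt(4.0))  # ~0.4992 (exact value: 0.5)

The constant and the shift are the creative core; everything around them is boilerplate.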
If I saw 100 LOC which were very similar to something I wrote, AND contained a log statement copied verbatim, it would be very easy to infer that the entire piece of code is a derivative work.
Let's say I write FizzBuzz:
# Copyright (c) 2022 David Allison. All rights reserved.
for num in range(1, 101):
    if num % 3 == 0 and num % 5 == 0:
        print("DA: fizzbuzz")
    elif num % 3 == 0:
        print("DA: fizz")
    elif num % 5 == 0:
        print("DA: buzz")
    else:
        print(num)
If I found the modified FizzBuzz algorithm in the wild with one line containing the "DA" prefix, it may have been learned from a fraction of a fraction of my code, but it still contains my 'unique' creativity. Is that a copyright violation?
Aside: Due to some uniquely named code I've contributed to, I strongly suspect that Copilot would output my GitHub username. I don't really want to open Pandora's box here, but I'd be curious.
Replicating copyrighted code from the training set only happens ~1% of the time; it's the exception, not the rule. And when it does happen, it's usually because the same text appears multiple times in the training set. So it will memorize boilerplate and popular code snippets, not unique stuff. Even a replicated piece of code 100 lines long is no big deal in my opinion, unless it contains some kind of unique thing never seen before, like an optimized matrix multiplication function. Certainly not FizzBuzz.
On the practical side, it is actually easy to filter sequences that are too similar to the training set out of the model's output. You just generate another snippet until it is "original" enough.
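A sketch of what such a filter could look like, assuming a simple token n-gram overlap test; the 8-token window and the model.generate interface are placeholders:

    # Hypothetical regenerate-until-original loop. The n-gram test and
    # the model interface are illustrative assumptions.
    def ngrams(tokens, n=8):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def is_original(snippet, corpus_ngrams, n=8):
        # Reject snippets sharing any n-token run with the training corpus.
        return ngrams(snippet.split(), n).isdisjoint(corpus_ngrams)

    def generate_original(model, prompt, corpus_ngrams, max_tries=5):
        for _ in range(max_tries):
            snippet = model.generate(prompt)  # assumed model API
            if is_original(snippet, corpus_ngrams):
                return snippet
        return None  # better to emit nothing than a near-verbatim copy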
I have ~400 KLOC changed on GitHub. At that scale, "1% of the time" happens multiple times a day.
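Rough arithmetic behind that, with loudly assumed usage figures; only the ~1% rate comes from the discussion above:

    suggestions_per_day = 1_000_000  # assumed global Copilot volume, not a real figure
    share_touching_my_code = 0.001   # assumed fraction drawing on my ~400 KLOC
    replication_rate = 0.01          # the ~1% figure quoted above
    print(suggestions_per_day * share_touching_my_code * replication_rate)  # -> 10.0 per day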
Pragmatically, people are already knowingly committing commercially viable copyright violations of my work. I'd rather it wasn't encouraged further by a US-based 'big tech', especially if the people using my code aren't aware that they're doing anything questionable.
Some months, my OSS income is less than 1/100th of what I'd earn in industry. I don't want people taking advantage any more than I'm comfortable with, especially for commercial purposes.
It's the wild west phase, after it settles down there will probably be ways to signal you don't want to allow training on your code. But I think it's just like taking your grain of sand from the beach so nobody else can have it. The beach is going to be just the same.
As part of the wild-west phase, it is possible that the inclusion of a single identifiable AGPL project in the training set leads to the licensing of Copilot as AGPL. Such an outcome might lead hastily to the future you imagine.
Copilot used copyrighted code for training without respecting the license, and it can generate pieces of copyrighted code verbatim without respecting the original license either.
I pay for Copilot and this is very much the truth, but let's see how the court rules.
So why, in your opinion, did Microsoft not have the courage to also train Copilot on proprietary code, or on their own proprietary code? From my perspective, I conclude that MS knows things are not so simple, so they didn't want to "upset" certain companies, while they can afford to screw over the open-source people.
Btw, I would have been behind MS if they had done one of these two things:
1. Used all the code they have access to, including MS code and private code on GitHub, because that would show they actually believe the AI works as advertised.
2. Made the model open: let people use it locally, improve it, test it for copyright issues, do whatever they want.