It's not the license of the model, it's the license of the output.
As it stands, Copilot is a black-box which strips copyright from a piece of code.
I'd be fine if it were a level playing field and GitHub also trained it on private repositories - that's a signal that they don't care about copyright at all.
I'd be fine as a developer who releases GPL'ed code if the output was licensed as GPL - obviously no license violation.
I'd be fine (within a reasonable scale) if a developer contacted me and asked to use my code under MIT.
I'm not fine with Copilot allowing people to take my code, 'change the variable names', and remove the license, especially because I have no visibility into the fact that this has occurred.
You could also imagine different Copilot models, e.g. Copilot-GPL, Copilot-MIT, etc. Each would be trained only on GPL or MIT code from GitHub. Then which model gets used depends on the license of the file being written at the time.
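A minimal sketch of how that routing could work, assuming a crude SPDX-style lookup; the model names and the LICENSE-file heuristic are hypothetical, not anything GitHub ships:

    import os

    # Hypothetical per-license model routing. The model names and the
    # detection heuristic are assumptions for illustration only.
    MODELS = {
        "GPL-3.0": "copilot-gpl",        # trained only on GPL code
        "MIT": "copilot-mit",            # trained only on MIT code
        "Apache-2.0": "copilot-apache",  # trained only on Apache code
    }

    def detect_license(repo_root):
        # Crude heuristic: match a known identifier in the LICENSE file.
        try:
            with open(os.path.join(repo_root, "LICENSE")) as f:
                text = f.read().lower()
        except FileNotFoundError:
            return None
        for spdx_id, marker in [("GPL-3.0", "gnu general public license"),
                                ("MIT", "mit license"),
                                ("Apache-2.0", "apache license")]:
            if marker in text:
                return spdx_id
        return None

    def pick_model(repo_root):
        # Unknown license: fall back to a model trained only on
        # permissively licensed code rather than risk contamination.
        return MODELS.get(detect_license(repo_root), "copilot-permissive")

The interesting design question is the fallback: an unrecognised license should probably route to the most conservative model, not the most capable one.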
But Copilot doesn't take your code; at best it has learned from a fraction of a fraction of your code and synthesized it with tens of thousands of like examples, and the output may look similar to your code because it's trying to achieve the same thing. It's not like Copilot takes your entire repo, clones it, and says "we washed the onerous license requirements away for ya".
There's a minimum level of complexity and creativity below which code doesn't qualify for copyright protection. It's up to a legal professional to draw the line, but I believe it can be met by a single line of code (`i = 0x5f3759df - ( i >> 1 );`).
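For context, that's the "magic constant" step of the Quake III fast inverse square root; here's a runnable Python rendering of the surrounding algorithm (the original is C, operating directly on the float's bits):

    import struct

    def fast_inverse_sqrt(x):
        # Reinterpret the 32-bit float's bits as an unsigned integer.
        i = struct.unpack('<I', struct.pack('<f', x))[0]
        # The famous line: a shift-and-subtract initial estimate of 1/sqrt(x).
        i = 0x5f3759df - (i >> 1)
        y = struct.unpack('<f', struct.pack('<I', i))[0]
        # One Newton-Raphson iteration sharpens the estimate.
        return y * (1.5 - 0.5 * x * y * y)

    print(fast_inverse_sqrt(4.0))  # ~0.4992 (exact value: 0.5)

The constant and the shift are the creative core; everything around them is boilerplate.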
If I saw 100 LOC which were very similar to something I wrote, AND contained a log statement copied verbatim, it would be very easy to infer that the entire piece of code is a derivative work.
Let's say I write FizzBuzz:
# Copyright (c) 2022 David Allison. All rights reserved.
for num in range(1, 101):
    if num % 3 == 0 and num % 5 == 0:
        print("DA: fizzbuzz")
    elif num % 3 == 0:
        print("DA: fizz")
    elif num % 5 == 0:
        print("DA: buzz")
    else:
        print(num)
If I found the modified FizzBuzz algorithm in the wild with one line containing the "DA" prefix, it may have been learned from a fraction of a fraction of my code, but it still contains my 'unique' creativity. Is that a copyright violation?
Aside: Due to some uniquely named code I've contributed to, I strongly suspect that Copilot would output my GitHub username. I don't really want to open Pandora's box here, but I'd be curious.
Replicating copyrighted code from the training set only happens ~1% of the time; it's the exception, not the rule. And when it does happen, it's usually because the same text appears multiple times in the training set. So it will memorize boilerplate and popular code snippets, not unique stuff. Even a replicated piece of code 100 lines long is no big deal in my opinion, unless it contains some kind of unique thing never seen before, like an optimized matrix multiplication function. Certainly not FizzBuzz.
On the practical side, it is actually easy to filter sequences that are too similar to the training set out of the model's output. You just generate another snippet until it is "original" enough.
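A sketch of what such a filter could look like, assuming a simple token n-gram overlap test; the 8-token window and the model.generate interface are placeholders:

    # Hypothetical regenerate-until-original loop. The n-gram test and
    # the model interface are illustrative assumptions.
    def ngrams(tokens, n=8):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def is_original(snippet, corpus_ngrams, n=8):
        # Reject snippets sharing any n-token run with the training corpus.
        return ngrams(snippet.split(), n).isdisjoint(corpus_ngrams)

    def generate_original(model, prompt, corpus_ngrams, max_tries=5):
        for _ in range(max_tries):
            snippet = model.generate(prompt)  # assumed model API
            if is_original(snippet, corpus_ngrams):
                return snippet
        return None  # better to emit nothing than a near-verbatim copy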
I have ~400 KLOC changed on GitHub. At that scale, "1% of the time" happens multiple times a day.
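Rough arithmetic behind that, with loudly assumed usage figures; only the ~1% rate comes from the discussion above:

    suggestions_per_day = 1_000_000  # assumed global Copilot volume, not a real figure
    share_touching_my_code = 0.001   # assumed fraction drawing on my ~400 KLOC
    replication_rate = 0.01          # the ~1% figure quoted above
    print(suggestions_per_day * share_touching_my_code * replication_rate)  # -> 10.0 per day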
Pragmatically, people are already knowingly committing commercially viable copyright violations of my work. I'd rather it wasn't encouraged further by a US-based 'big tech', especially if the people using my code aren't aware that they're doing anything questionable.
Some months, my OSS income is less than 1/100th of what I'd earn in industry. I don't want people taking advantage any more than I'm comfortable with, especially for commercial purposes.
It's the wild west phase, after it settles down there will probably be ways to signal you don't want to allow training on your code. But I think it's just like taking your grain of sand from the beach so nobody else can have it. The beach is going to be just the same.
As part of the wild-west phase, it is possible that the inclusion of a single identifiable AGPL project in the training set leads to the licensing of Copilot as AGPL. Such an outcome might lead hastily to the future you imagine.
Copilot used copyrighted code for training without respecting the license, and it can generate pieces of copyrighted code verbatim without respecting the original license either.
I pay for Copilot and this is very much the truth, but let's see how the court rules.
So why, in your opinion, did Microsoft not have the courage to also train Copilot on proprietary code, or on their own proprietary code? From my perspective, I conclude that MS knows things are not so simple, so they didn't want to "upset" certain companies, while they can afford to screw over the open-source people.
Btw, I would have been behind MS if they had done one of these two things:
1. Used all the code they have access to, including MS code and private code on GitHub, because that would show they actually believe the AI works as advertised.
2. Made the model open: let people use it locally, improve it, test it for copyright issues, do whatever they want.