> The answer is obvious: sharing the prefiltering solution we used in this analy...

dtrailin · on July 3, 2021

It's not clear why it would need to be if it runs as a online service. I would say this is analogous to a search engine index. Google for example likely has lots of GPL code in it's index in some transformed form and yet there is little dispute that this is legal.

shakna · on July 3, 2021

> I would say this is analogous to a search engine index. Google for example likely has lots of GPL code in it's index in some transformed form and yet there is little dispute that this is legal.

I'd disagree with the comparison, pretty vehemently. Copilot can generate new code. It isn't just some storage mechanism.

It can create derived code, Google Search can't.

dtrailin · on July 4, 2021

Interesting, so your argument is not about parroting but specifically about the novel code Copilot generates. My sense is that it would be fine for all licenses except for AGPL as this is an online service and is not "distributed" per say. That is if you don't buy the argument that machine learning is a transformative work so it doesn't matter what the license is, which is the position of current US caselaw.

shakna · on July 4, 2021

No, my argument is that Copilot is specifically _not_ a storage mechanism, but is producing verbatim GPL'd code, meaning that as a piece of software, it is not exempt from the GPL.

8note · on July 4, 2021

I think that means that if you memorize some GPL code, all code you write needs to be GPL'd, since you contain GPL code

macksd · on July 4, 2021

If the recitation of GPL code required Copilot to be GPL licensed, then that would be incompatible with other licenses. If the copyright is held by someone else, you're simply not the one who can ultimately decide the terms under which it is redistributed.

caconym_ · on July 4, 2021

> If Copilot is reciting pieces of GPL code (which we know it can), then not only does it need to point out where it has grabbed that code from, Copilot itself is (probably) required to be GPL-licensed.

I'm no GPL expert, but if you could trivially swap in a model trained on a different corpus, I think that's enough separation between the Copilot code and the model data that GPLing the former is not necessary. IIRC there is an "at arm's length" criterion that typically applies in cases like this.

taneq · on July 4, 2021

Zip can ‘recite’ GPL code by unzipping a source code archive. The fact that you can swap in a different zip file ‘model’ that was ‘trained’ on different data doesn’t mean the first zip file isn’t GPL’d.

caconym_ · on July 4, 2021

If a non-GPL'd zip implementation is used to unzip an archive of GPL'd code, does that mean the zip implementation is in violation of the GPL?

That's the point of my comment that you replied to, and it is a response to a specific claim (which I quoted) in its parent. I have not and will not make the (nonsensical) claim that you seem to be suggesting I've made--that the ability to change out the model somehow negates the licenses of code used to train a different version of the model.

taneq · on July 4, 2021

Ah, I think I see the disconnect. In the post you replied to, they said “copilot itself” (ie. the copilot model plus code) should be GPL’d. I missed that your reply was about whether the code specifically needs to be GPL’d. I agree with you there, but it’s also a bit of a tangent from the original point (which was that if copilot can regurgitate GPL code then it’s necessarily a derivative work.)

jonathankoren · on July 3, 2021

> I'm afraid I don't think actually comes even close to resolving the legal implications of recitation. If Copilot is reciting pieces of GPL code (which we know it can), then not only does it need to point out where it has grabbed that code from, Copilot itself is (probably) required to be GPL-licensed.

I dont follow. If the suggested code is GPLed, it’s your decision to include it in your code or not. If you accept the GPLed code into your nonGPLed code base, you violated the GPL. As a friend of mine said years ago about situations like this, “Saying ’I was just following the algorithm,’ is not a defense.”

Now let’s is Copilot a violation of the GPL? I’m going to assume that it’s codebase is not derived from GPL code. I have nothing to prove this, but most code is not, and Microsoft, GitHub, and OpenAI are reputable organizations, so assuming good faith here seems fair.

Did Copilot train on GPLed code? Absolutely. I don’t think anyone has ever suggested otherwise.

Does processing the code count as an integration? I’d say no. It certainly isn’t part of the executable code base, even in binary form, which is what the GPL was targeting. Even if it was, it wouldn’t be a GPL violation, since the copilot binary isn’t being distributed, but it would be an AGPL violation. I don’t know how popularly the AGPL is, but let’s assuming that at at least something from some AGPLed file exists inside Copilot. Again that doesn’t matter, because the code isn’t actually being executed.

So is it “distributing” the code? Sure, but that’s not a violation. If you make a binary, you have to distribute the code, but the opposite isn’t true. Anyway, just having a piece of GPL source code in a database isn’t a violation, and never has been. You might as well bring saying that because Google’s search index can return entries of the Linux kernel, all of Google is in violation of the GPL. Not even RMS would take that extreme view.

shakna · on July 4, 2021

> It certainly isn’t part of the executable code base, even in binary form, which is what the GPL was targeting.

The GPL also targets source forms, such as what is being produced verbatim. (See Clause 1 of the GPL.)

> Again that doesn’t matter, because the code isn’t actually being executed.

That's not a requirement of the GPL or AGPL. It's irrelevant.

> So is it “distributing” the code? Sure, but that’s not a violation.

It is, without the license.

> Anyway, just having a piece of GPL source code in a database isn’t a violation, and never has been. You might as well bring saying that because Google’s search index can return entries of the Linux kernel, all of Google is in violation of the GPL. Not even RMS would take that extreme view.

Storage mechanisms are not the problem here. The GPL source code is not in some database, in this case. A search index is irrelevant, because that is just a storage mechanism.

However, Copilot generates verbatim code, and it generates novel code. That is, it both contains the plain text of the original (recitation and redistribution), generates derived code (transformation).

In both these cases it doesn't attribute, so you can say with certainty that the Copilot software contains the source code, and may create derivative works, all without attributing the license.

It is the fact it contains the source code, reciting it verbatim, that makes Copilot probably need to be GPL-licensed itself. As it is not a storage mechanism.

It is that it distributes derived works without attribution that puts the end-user's codebase at risk of violating the GPL.

jonathankoren · on July 4, 2021

>However, Copilot generates verbatim code, and it generates novel code. That is, it both contains the plain text of the original (recitation and redistribution), generates derived code (transformation).

So are you saying that because the language model was trained on a GPL code that even though it spits out novel code, that code is derived?

That seems like a pretty expansive view. I’ve read some GPL code in my life, and I’m sure it has influenced me. Does that make all my code “derived”? I wouldn’t say that. To truly be derived it needs to be a nontrivial amount, otherwise every time you type “i++;” you’re in violation. This is hard to prove.

A clearer cut case is including code it suggested when that’s verbatim. That would be a GPL violation if it’s included in someone’s codebase, but that’s not what it seems you’re arguing. You seem to be arguing that Copilot is in violation simply for suggesting the code.

This means you’re asserting that somehow storing the code in a language model is somehow different from a database, but you haven’t told me why that is.

Databases have a query execution system and a database file. They are separate pieces. The query executor can work on any database file, and swapping out the database file will give different results, even though the execution code is the same.

This is exactly the same case for language generators. You have a language model, and a piece of code that makes predictions based on the given text and the language model. Swap out the language model, you get different results.

The storage formats are different but doesn’t matter. The data and the code are separate. Given this information, why — and be specific — is a language model not like a database?

shakna · on July 5, 2021

> That seems like a pretty expansive view. I’ve read some GPL code in my life, and I’m sure it has influenced me. Does that make all my code “derived”? I wouldn’t say that. To truly be derived it needs to be a nontrivial amount, otherwise every time you type “i++;” you’re in violation. This is hard to prove.

You're not a piece of software, so the areas of copyright law that are applicable are completely different. (And yes, copyright does acknowledge a minimal amount required to be copyrightable - but that minimal amount may sometimes be argued to be a single line.)

However, you can absolutely face civil charges if you reproduce too-similar code for a competitor, after absorbing the technical architecture at another workplace.

> This is exactly the same case for language generators. You have a language model, and a piece of code that makes predictions based on the given text and the language model. Swap out the language model, you get different results.

Legally speaking, Copilot isn't advertised with multiple available language models. It isn't presented that way, so it won't be treated that way. It will be treated as a singular piece of software.

> Given this information, why — and be specific — is a language model not like a database?

In the eyes of the law, and this is very specific, the model is marketed as part of the software, and so is part of the software. The underlying design architecture is utterly irrelevant, because it is presented as a package deal of "GitHub Copilot".

jonathankoren · on July 7, 2021

> You're not a piece of software, so the areas of copyright law that are applicable are completely different. (And yes, copyright does acknowledge a minimal amount required to be copyrightable - but that minimal amount may sometimes be argued to be a single line.)

Putting aside the philosophical aspects of this statement, you proved my point. I said that the ultimate person held liable for violating a license is not a tool, but the person choosing to integrate suggested changes by the tool. But now somehow you expect me to believe that the person that built an automaton, but is not directing the automaton, and certainly doesn't have final say in whether or not to incorporate the automaton's suggestions is at legally culpable, because they're being held to a stricter standard? If that was legal standard with any tool, then literally every manufacturer of every tool would be held liable for any and all misuse. Obviously, this is not the case.

> Legally speaking, Copilot isn't advertised with multiple available language models. It isn't presented that way, so it won't be treated that way. It will be treated as a singular piece of software.

Actually speaking, you're not a lawyer, and that this is INCREDIBLY controversial statement, that doesn't really standup to much scrutiny, since there is a bright line that separating the two.

Even if Github was ruled against (and they won't be), case law is filled with examples where the injunctive relief is limited to claims presented (in this case source related to a specific a work) rather than the entire system including playback device and the recording.