A very interesting interpretation of the GitHub TOS. Kate Downing is saying that users of GitHub grant a special license to GitHub, one that bypasses the original license. However, if that is true, then any upload of code that the user does not have 100% copyright control over is a copyright violation, since the user would not have the authority to grant GitHub that special license. It would be similar to a user uploading a copyrighted movie to YouTube, and Google using that as a license to use the movie in an advertisement.
I wonder if a court would think that Microsoft in this case has done its due diligence to verify that the license grant it got from users is correct and in order.
e.g. 4. [..] You grant us [..] the right to [..] parse, and display Your Content [..] as necessary to provide the Service. This license includes [...] show it to [...] other users; parse it into a search index or otherwise analyze it
As the Service now includes Copilot, publishing anything on GitHub seems to give them the right to use it in Copilot. Maybe even for private repos.
Besides the issue we're currently discussing, I also wonder about:
5. [..] you grant each User of GitHub a [..] license to use, display, and perform Your Content through the GitHub Service and to reproduce Your Content solely on GitHub as permitted through GitHub's functionality (for example, through forking).
So if you find GPLed content on github, you might be allowed to violate the GPL as long as it happens only on github. I don't know how bad this is in practice. Their CI presumably allows you to run code for other people without granting them the rights the GPL should give them, but that might be a violation of the Github TOS as this might be abuse of the CI servers.
This might also mean you violate the GPL when publishing someone else's GPLed code on github, as you now granted Microsoft and others rights not included in the GPL.
Clearly, IANAL, and I don't know how valid this reading is, but publishing anything you didn't write yourself might not be on a very stable legal basis.
> Clearly, IANAL, and I don't know how valid this reading is, but publishing anything you didn't write yourself might not be on a very stable legal basis.
Yes. This was one of the legal theories behind why Apple refused to allow the GPL in the Mac App Store. The TOS that Apple requires from developers gives Apple specific rights which the GPL does not grant, and thus any software that gets uploaded must be assumed to be provided under two separate licenses. Given that many free and open source projects have multiple authors, it is a rather large assumption that the person who uploads the software has the complete authority to provide it under multiple conflicting licenses.
It is, after all, the distributor that has to do the due diligence to confirm that they have the right to distribute.
It also falls under the aspect of "hidden surprises" which could mean that this part of the TOS wrt. this specific aspect might not be legally binding/valid. At least in the EU. Or it might.
> if that is true, then any upload of code that the user does not have 100% copyright control over is a copyright violation, since the user would not have the authority to grant GitHub that special license
That doesn't sound right. Licences can allow sublicensing, and I think all the popular open-source ones do.
Sublicensing can only add restrictions on top of the existing conditions in the license. All open source licenses require, at minimum, that distribution provides attribution and the original copyright notice. Licenses like the GPL have additional conditions.
There are also additional problems specific to sublicenses. In the United States, only exclusive licensees are assumed by statute to have a right to sublicense. The theory is that exclusive licensees are assumed to have control/authority similar to that of the author. Nonexclusive licensees are not assumed to be granted such a monopoly by the licensor.
Kate Downing here. This is an excellent question. So, just like YouTube, GitHub would likely argue that they are protected by the DMCA and that so long as they comply with DMCA take-down requests, they are not liable for copyright infringement (direct or indirect) for third party content posted to GitHub by people other than the copyright owners. Remember that the DMCA effectively shifts that due diligence you speak of away from providers of online services and onto copyright holders themselves. Without the DMCA, many businesses that rely on user-generated content just wouldn't exist because that due diligence isn't possible at scale - it's often not even possible for individual pieces of content because the publication of any copyrighted work can be very obscure and because in the US you can hold a copyright without formally registering it.
In practice, I think the entire open source world knows that people post each other's open source code on GitHub. Even projects that have very purposefully chosen to primarily use other services or self-host their source code are well aware that their code gets mirrored on GitHub and/or included in other people's repos on GitHub. Up until now, I don't think this has been controversial and I don't think GitHub gets a lot of takedown requests for this practice. I think most developers see this as a feature, not a bug. Copilot might make people rethink whether or not they want to start sending take-down requests but that'll be a tough call for a lot of people because withholding code from GitHub to avoid its usage in Copilot also effectively means making their code less easily available to the rest of the world. It may be very disruptive to other projects that include the copyright owner's code in their own projects.
I am an Open Source developer. My code is not on GitHub and never will be.
If my code was uploaded on GitHub, I would DMCA it because of Copilot, but it wouldn't matter because the information is already in the model. So the DMCA does not help here.
The only way it would help is if I could DMCA the entire model and force them to retrain without my code. As it stands, this lawsuit is the only way for GitHub to be reined in; I don't have the resources to do so on my own.
IANAL.
Also, regarding high impact: suppose Copilot has 1 million users who use it on average 10 times a day, 5 days a week. You claim that less than 1% of Copilot uses would result in copyright violation. Let's assume 0.1%. How many copyright violations would happen per day? 10,000 per day. For five days a week.
It would take a mere twenty weeks (less than six months) to reach a million violations.
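The back-of-the-envelope numbers above can be checked in a few lines (the user count, usage frequency, and violation rate are the hypothetical figures from this comment, not measured data):

```python
# Hypothetical figures from the comment above, not measured data.
users = 1_000_000
uses_per_user_per_day = 10
violation_rate = 0.001          # the assumed 0.1% of uses

violations_per_day = users * uses_per_user_per_day * violation_rate
violations_per_week = violations_per_day * 5    # 5 working days per week
weeks_to_a_million = 1_000_000 / violations_per_week

print(violations_per_day)   # 10000.0
print(weeks_to_a_million)   # 20.0
```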
A hypothetical question: imagine a filmmaker who has studied a lot of obviously copyrighted movies by famous, renowned directors. This means he has trained his neural network on their copyrighted content. Does he breach copyright when he composes and films a scene? Are visual quotes copyright theft? Homages? Did George Lucas infringe copyright when he borrowed compositions from "Triumph of the Will"?
Just because machine learning uses the word “learning” doesn’t mean it “learns” in the same way a human mind does — that analogy is doing a lot of load-bearing in your argument, and needs proving why the program’s nature of creative remixing (for lack of a better word) is the same as a human’s. Right now it seems like you’re just reusing the same word for two phenomena we don’t understand, and thereby claiming they’re equivalent.
See Marvin Minsky’s comment regarding “suitcase words”.
But effectively learning here really means the same thing: Based on the input (source code), you will adapt the synaptic weights between neurons, in a similar way for humans and for the artificial neural networks. Of course, it's not exactly the same. There are some differences in the details, and the artificial neural network is really much more simplified, and thus also less efficient at learning. But why is this relevant for the copyright question?
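As a toy illustration of what "adapting the weights based on the input" means, here is a minimal perceptron-style update rule in Python. This sketches the general principle only; it says nothing about how Copilot is actually trained or how biological synapses work.

```python
def train_step(weights, inputs, target, lr=0.1):
    """One learning step: nudge each weight toward reducing the error."""
    # Prediction is a thresholded weighted sum of the inputs.
    activation = sum(w * x for w, x in zip(weights, inputs))
    prediction = 1 if activation > 0 else 0
    # Adapt each weight in proportion to its input and the prediction error.
    error = target - prediction
    return [w + lr * error * x for w, x in zip(weights, inputs)]

# Learn a trivial pattern: output 1 exactly when the first input is 1.
weights = [0.0, 0.0]
for _ in range(10):
    weights = train_step(weights, [1, 0], 1)
    weights = train_step(weights, [0, 1], 0)
print(weights)  # the first weight has grown positive, the second stays at 0.0
```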
I can easily write a simplistic suggestion algorithm by doing the following:
Hash the content. If the content has been viewed by the user in the past, halve the numerical hash value. Then sort the list.
Doing this to a music list will produce a list that is biased toward music the user has listened to before, but will still seem random enough to look like an intelligent suggestion that the system has learned to make. It is just math that mimics learning.
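A sketch of that toy "suggestion algorithm" in Python (the item names are made up, and SHA-256 stands in for whatever hash you like):

```python
import hashlib

def suggest(items, seen):
    """Order items by hash, halving the score of anything already seen.

    Hashing gives a deterministic but random-looking ordering; halving
    the hash of previously viewed items biases them toward the front of
    the sorted list without making the bias obvious.
    """
    def score(item):
        h = int(hashlib.sha256(item.encode()).hexdigest(), 16)
        return h // 2 if item in seen else h
    return sorted(items, key=score)

playlist = ["song_a", "song_b", "song_c", "song_d"]
history = {"song_b", "song_d"}
print(suggest(playlist, history))
```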
The mathematics of artificial neural networks is math. It is only math. One can make it very complex or very simple but in the end it is just math and pointers.
With math, you can just describe everything, including the human brain.
My original argument was specifically about neural networks, that I don't really see the principal difference in how a human learns from reading code, and how Copilot has learned from reading code.
The science of understanding and mapping the human brain has come quite far, but we have yet to make a functional mathematical model where one can input a bunch of numbers and then compute what a person is thinking. People have theorized and written novels about the idea of uploading the human mind into a computer for several decades now, but we are unlikely to have covered even 1% of the road to get there. Among technology philosophers it is still highly debated whether we will ever get there.
Right now it is impossible to accurately describe the human brain in the form of math. What we can do is write simplified models that either describe or mimic behavior, which, if we apply abstractions, we can call predictive models. Their effectiveness is quite poor, but that has never stopped people from trying to use poor prediction models to predict the future.
My statement on human brain was: In principle, you can describe it with math. This doesn't mean that we know how to do that yet.
My statement on Copilot was: comparing the learning of the human brain to the learning of an artificial neural network, the two are still very similar, more similar to each other than to other learning methods. Sure, there are differences. But my point is: why are those differences relevant for the copyright question?
Since you equate the human mind with a neural network (questionable to me, but OK), let's swap this around, call them both minds, and see how it works out:
The mind that can acknowledge and appreciate your work in this scenario (Copilot) does literally nothing of its own free will except 1) take your code and 2) give it to me, possibly combined with someone else's code. This is the sole purpose of its entire existence and full range of its capabilities. Is this enough of a difference compared to a human mind when copyright is concerned?
It spares me from knowing that you exist, that you wrote a library that does this thing I need, that I can contribute to it, etc. In such a scenario, what is the motivation for you to make your library publicly available in the first place (other than generating revenue for Microsoft or whoever I pay for access to the network)? Does copyright have relevance to OSS now?
> and needs proving why the program’s nature of creative remixing (for lack of a better word
If I ask Stable Diffusion to create a picture of Elon Musk wielding lightning and riding a giant blue sparrow over a desert during a storm, the result would be more creative than what most humans could produce. I believe that counts as proof.
This is not necessarily the case. The ML model has only been trained on things produced by humans; therefore its output is derived work. We have no idea what an ML model that hadn't been trained on human art would actually produce, and whether or not it would be creative.
Also it's pretty clear you haven't worked with many artists from that statement.
> The ML model has only been trained on things produced by humans; therefore its output is derived work.
Humans are also trained only on things produced by humans. The only exception is nature, but ML model can be trained on photos of nature, too. Also, you are missing the point.
> Also it's pretty clear you haven't worked with many artists from that statement.
1. I employed quite a few artists over the past 15 years.
2. I wasn't talking about artists. I was talking about regular humans. The vast majority of them are absolutely unable to create anything resembling Elon Musk riding on a giant sparrow.
If him "composing a scene" means copy pasting clips of the movies he studied and smooth things over, then yes that would be obvious infringement.
And that is what Copilot's AI mostly does.
It doesn't "understand the concepts and reproduce something alike" in the sense a human does. It might understand some concepts here and there but it also does a lot of heavy lifting my verbatim "remembering" (i.e. copy pasting) code.
This is also why some people argue that the cases of Copilot and some of the image generation networks are different, as some of the image generation networks get much closer to "understanding and reproducing a style". (Though potentially just because it is much easier to blend over copy-pasted snippets in images to the point where they are unrecognizable.)
One of the main problems GitHub has IMHO is that anyone who has studied such generative methods knows that:
1) they are prone to copy-pasting
2) you don't know what they remembered (i.e. stored copies of in an obscure, human-unreadable encoding; i.e. just distributing such a network can be a copyright infringement)
3) you don't know when they copy-paste
4) the copy-pasted code is often a bit obscured, ironically (and coincidentally) often comparable with how someone who knowingly commits copyright theft would obscure the code to avoid automated detection
Which means GitHub knowingly accepted and continued tricking its Copilot users into committing copyright infringement, under the assumption that such infringement is usually obscured enough to evade automatic detection...
> Is eating meat fine? - maybe. Is eating all animals OK? - Hmm...
This argument is hardly less flawed than the one you are criticizing. And your statement that 'there is no equal sign ...' is also unconvincing, as we're not equating these two, but the process of learning, which is quite similar.
I've touched two things - that's why they were put in separate paragraphs. Let me spell it out in different words:
1. People have certain rights, duties and prohibitions. Equating the right of George Lucas to use ideas he saw with the rights of a machine to do the same misses the point by the same measure as asserting that MS enslaves Copilot, but in the opposite direction.
2. Scale does matter. If I'm an ordinary person, then the act of eating won't ruin the ecosystem. Now imagine a construct that operates under the same principle of eating, but its jaw, stomach, and speed of eating are many magnitudes larger. Do we apply the same limitations to both, because the principle of eating is the same?
Also, since I'm spelling things out, the fact that I'm seeing the same argument many times over, and that it is so obviously flawed, makes me think that this is a symptom of astroturfing.
> People have certain rights, duties and prohibitions. Equating the right of George Lucas to use ideas he saw with rights of a machine to do that misses the point
Hardly relevant, given that the machine has no rights, so no one is equating those with anything. The point is that the machine is doing automated learning on behalf of the developers who are training it, so what should be decided is whether those very human people have a right to train their model in that way.
The OP's hypothetical assertion was that there is no difference between a person and a machine being 'inspired' by some creative work, so if we don't pursue George Lucas, then why would we pursue Copilot?
And the thing is that the machine does not benefit from the same rights as a person, so we can't absolve MS from responsibility because "it does a similar thing to what people do".
So, to add one more point to spell out: the context matters! ;)
I'm not going to have an epistemological debate on what is knowledge or skill. Humans are more lossy and able to bring in much more varied knowledge and experience in ways a computer cannot.
Ai/ml is not artificial general intelligence. It's a mathematical model.
I don't think we need epistemology here; we can instead keep it at the level of semantics. According to the definitions I'm used to, an AI/ML system is not just a mathematical model, but rather a concrete implementation of an information processing system, which, when treated as a black box, also applies to a brain.
If an excavator is digging a trench so large that no human could dig it by hand, does that mean that what the excavator is doing can't be called digging?
We can see that the way a human digs and the way an excavator digs are similar, except for the matter of scale. We don't know if the way humans study code is the same way that Copilot learns. Learning methods aside, humans do seem to be far more sophisticated about the ways they use code (understanding subtleties of copyright, attribution and so on) compared to Copilot.
The question is not whether a person is equal to a program.
The question is whether a person is doing the same as Copilot for this particular case, i.e. reading source code to learn.
You have not really given any argument for why this is not the case. Or maybe your reference to scale? So just because Copilot has read more code than a human possibly could, that makes it different? But why exactly is reading a bit of code fine w.r.t. copyright, but reading more code suddenly violates copyright?
Note that the reason why Copilot needs more code to learn is just because the learning currently is not as efficient as for humans.
> But why exactly is reading a bit of code fine w.r.t. copyright, but reading more code suddenly violates copyright?
In all fairness, the article mentions the Fair Use doctrine. You could make the argument that reading a bit of code is allowed, but doing it at a large scale would not be covered as an exemption to copyright.
Humans are not neural networks, that's just a thesis.
Even novelists do not sit all day in a closed room reading other people's work and then make a collage of what they've read. Otherwise no books would have been written in the first place.
Cut the AI off humans' work, let it interact with the real world and see what it produces. It will be nothing.
Once (if ever?) an AI is capable of producing an actual original work, I'm fine with other AIs stealing from the first one. Please leave humans alone.
> Cut the AI off humans' work, let it interact with the real world and see what it produces. It will be nothing.
That "experiment" could just as well be done on humans, though, cut them off of any work that any human has done before and you may get simple cave paintings, if you're lucky.
A difference is that I can't just spin up a copy of George Lucas on my GPU in seconds and request it to produce something from a prompt like "a disappointing prequel".
Your magic box is not a filmmaker, and the inputs you are encoding with it are verbatim file content. Said content belongs to someone else.
Please study the series of events that unfolded in the music industry after folks began incorporating recordings made by other artists into their own work and proceeded to sell the result.
Spoiler: The deeply nuanced question of feeding a mechanical recording through a series of complex physical and mathematical apparatus and whether that constituted a transformational creative act did not come up during the proceedings or final judgements!
Just on public repositories, as far as I know, however regardless of license.
There are GPL repositories, which force you to open your code, which is one aspect; and there are "source available" repositories, which allow you to see the code but forbid everything else.
There are a lot of blurry areas here, and in my opinion, "an AI learns like a human" is not a solid basis for fair use.
On the other hand, if private repositories are crawled too, this would be very, very bad.
Yes, but humans making mistakes, knowingly or unknowingly, doesn't make what Copilot does right.
We were just talking about this with a couple of friends. I always cite what I got from where (it's just two occasions, but it's not zero), and always respect the licenses.
I'm worried about both directions of the permeation: GPL to closed, and closed to open. Open source is a widely misunderstood concept, and people (and companies) are using that misunderstanding to validate their blanket opinions. That's wrong on so many levels (legal to ethical, and everything in between).
Emulator writers are afraid to read leaked console code, because any resemblance of their code to it means the destruction of years (or decades) of reverse engineering and clean-room development done in that domain. If code licensing is that important and crucial, why is a court-tested license (e.g. the GPL) so worthless? Is this fair, again in the same cross-section (legal to ethical)?
There's a lot to be discussed, and a lot of ideas to be re-learnt here. Open Source (or precisely Free / Copylefted software) doesn't mean free for all. We need to understand that.
The emulators thing can also run afoul of patents, for example, so they're not purely copyright concerns there. It's not an easy/exact comparison to something that's LGPL licensed for example.
Microsoft decided to only use publicly available repos. They don't use their own private code or other companies' private code, for quite obvious reasons.
I like the scenario: Imagine I've hired an assistant with an eidetic memory who has read loads of books. I pay them to help me write a book and they reproduce a few paragraphs from a different book into my book.
Am I violating copyright? Yes
Imagine they change the character names in those paragraphs. Am I still violating copyright? Yes
At some point you can change enough of the text to not violate copyright. The grey area involves the courts.
It feels very simple to me so I might be missing something.
> At some point you can change enough of the text to not violate copyright. The grey area involves the courts.
> It feels very simple to me so I might be missing something.
In my opinion, you are missing something subtle:
In continental Europe, there is a different law tradition - civil law (https://en.wikipedia.org/wiki/Civil_law_(legal_system) ) - that is different from the Anglo-American common law tradition. To quote from the wikipedia article:
"The civil law system is often contrasted with the common law system, which originated in medieval England, whose intellectual framework historically came from uncodified judge-made case law, and gives precedential authority to prior court decisions. [...] Conceptually, civil law proceeds from abstractions, formulates general principles, and distinguishes substantive rules from procedural rules. It holds case law secondary and subordinate to statutory law."
So if you are attached to the civil law system, you seriously want to avoid this grey area involving the courts (which is much more accepted in common law) and instead want to codify into laws what you mean by this grey area.
The beauty of the law is that it does not take such philosophical things into consideration. The only thing that matters is the text of the law and its documented interpretation in various court cases. That's why copyright is excluded from this court case: there are a lot of documented interpretations of fair use, which also apply here.
The simple layman's version of copyright is that copyright applies to a specific form of a thing and not about the ideas behind that thing.
So, no, George Lucas was not infringing anything. Nor is hip hop music that makes use of samples infringing anything. Nor was Andy Warhol when integrating photos into his works. Nor is it illegal to paraphrase or cite other authors. And as Oracle found out by challenging it in court, trying to claim ownership over APIs to prevent third-party implementations is also not going to work.
All of that falls under fair use. Fair use is what makes copyright useful. Without it you'd have to live in fear that legal copyright holders might come after you if you apply the ideas that you might have been exposed to via their copyrighted work. Fair use exists such that you can make use of information provided to you via a copyrighted work.
All those examples you give are transformative in some way or other.
It's an interesting test of open source licensing because I'm not aware of any other area of copyright where works come with an explicit "if you use this somewhere else you must credit me as the initial author" in the implied/provided license.
Comparing music, literature, etc. to code is difficult because of both this difference and the existence of software patents. The manner in which infringement happens (and the scale) is often different as well.
Actually, it might be challenging for programmers to grasp, but it's a lot easier for a non-technical judge. Same difference and easy comparison: specific form and fair use. Nothing else matters. Two very simple concepts with a long history of being challenged in courts. There's nothing new here for a judge to consider. The specific form here is small blurbs of code that are suggested to end users by GitHub. Does that constitute a copyright violation? Answer: no, because it's a small sample that falls under fair use.
It doesn't matter whether it's music, literature, or code. Fair use is fair use. And it's been challenged so often that no judge is going to make any exceptions just because we are now dealing with software.
End of story. No basis for any copyright infringement here. Not even worth trying out in a court because you'd be laughed away. The plaintiffs in this case clearly realized that and did not bother with even trying to prove otherwise.
Software patents are not part of this court case either, for the obvious reason that the vast majority of copyright holders in this case don't actually hold any patents whatsoever. And if they did, it would not be GitHub's problem but the problem of those creating possibly infringing products without a license. GitHub just gives people access to (public) knowledge here. That's what a patent is: public knowledge. It's up to the user to decide if they are OK shipping products that include that. And it's their problem to do any due diligence.
Philosophical bullshitting aside (and it really is philosophical bullshitting), I just genuinely don't care if a human or a machine "think" or "learn" in the same way.
I don't want Github or any other megacorp-backed entity abusing the open source community in the way micro$oft is here, it's as simple as that. If they wish to train it on entirely proprietary Microsoft code, then by all means go nuts, but to take the work of open source projects and to hide behind the pretense of the mathematical model behind the A"I" learning something is simply ridiculous to me.
I find it quite curious that they're not doing that (training it on their own codebase). Perhaps they're afraid of their little intelligence spitting out proprietary code verbatim like it's been shown to do many times with licensed open source code.
The production of anime music videos is a fan activity where tiny clips from animated shows are pasted together, with a piece of music replacing the audio track. The result is typically 3-4 minutes long. The audio may or may not be original; regardless, the video content never is, barring some very very light editing.
These can be quite inventive works; nevertheless, no-one seriously argues that the video content does not breach the original animators' copyright.
The video content of an AMV is a much better analogy for what Copilot does to third parties' code than anything else I've seen in this post's discussion so far.
Hmmm. I'm interested in the GitHub ToS, which (if I understand correctly) basically says that GitHub and its affiliates (MS) can use anything you post on GitHub to improve their service.
What if I build an AGPL licenced service, using GitHub to coordinate development? According to the ToS, MS could offer a version of my service, because I posted the code on GitHub and they are using it to improve their service to me. According to my AGPL licence, they would need to share their source.
So which takes precedence: the licence or the ToS?
Consider that you can post somebody else's code to GitHub, and that may be licensed AGPL (or anything else). In that case, somebody else is the copyright holder so clearly the ToS doesn't magically give GitHub any additional rights and the licence applies.
The most they could do is transfer any liability back to you for posting it in breach of some term in their ToS. But that would be absurd since posting someone else's code, licensed under a common (eg. OSI-approved) license, is an established and normal use case for GitHub. If their ToS really did ban the posting of some AGPL code, they really ought to have pointed it out, and of course it'd render GitHub useless for hosting AGPL code.
This would only apply when posting someone else's code. But of course you could always arrange that.
The ToS do give GitHub an indemnity against the consequences of that scenario, so if the actual copyright holder complains about Copilot spitting out their code without proper attribution and license, they could indeed transfer the liability to the uploader. (That scenario could apply to GPL and MIT code, too, not just AGPL.)
FWIW I find it highly unlikely that (at least in my country, Germany) a court would agree that Microsoft/GitHub could hold you liable for uploading a vendored AGPL dependency in your public GitHub repository because Copilot used your repository as part of its training corpus and someone won a lawsuit against Copilot for reproducing the AGPL code without a license.
Just because it's in the Terms of Use doesn't mean it can be upheld in court (or more specifically: in every court). If you uploaded your repository to a service advertising itself as a version control service, the service using your uploaded code to feed a commercial code generation product would likely be ruled as "surprising", which at least in Germany has been used by courts to dismiss claims of Terms of Use violations (e.g. when WhatsApp banned users for using third party apps).
Replace "AGPL" with Old Microsoft's "public source" (i.e. proprietary code published without an open source license) for a more likely scenario.
> ...so if the actual copyright holder complains about Copilot spitting out their code without proper attribution and license, they could indeed transfer the liability to the uploader.
The defence would probably claim that GitHub effectively invite users to post AGPL code (this being a pretty fundamental part of their business model), including when they don't hold the copyright, so it is implied that the ToS indemnity cannot be interpreted to include this situation. If GitHub tried to claim otherwise, they'd have to contradict themselves and courts usually find that kind of thing unacceptable.
The indemnity would stand for other cases, of course, such as users posting code without permission or a license.
OP here. If you own the copyright to a work, you can license it in any way you like. You can offer it to some people under a commercial license and to other people under an open source license. Many entities practice dual (or tri or whatever) licensing. When you post things on GitHub, you are essentially dual licensing your work. You're providing it under a very broad license to GitHub, and you are providing it under an OSS license (or whatever you like) to other GitHub users. Neither license takes precedence. One license applies to one group of people and the other license applies to the other group of people.
This is very similar to what happens when you sign a contributor agreement before contributing code to an open source project. When you sign the contributor agreement, you're granting a very broad license to your work to the project maintainers. They can then license your work out under any license they want. But likewise, because you are not granting them an exclusive license, you're free to put your contribution out into the world under any license of your choosing, separate and apart from the project that you contributed it to.
Technically, I think the scenario you're describing with AGPL code may well be possible and legal. But practically, I think people would stop using GitHub if they felt that doing so would lead to GitHub/Microsoft undercutting their projects, stealing their customers, or essentially stripping the project of any AGPL obligations. I think that from a business perspective, they're really gambling on the idea that developers will see Copilot as a big boon rather than a value suck. Time will tell whether their gamble has paid off.
As a follow-on, what if you're mirroring code which is under an AGPL license? Are you allowed to post it on GitHub if you can't grant those rights under the ToS due to the license of the code?
An interesting thought experiment is how keen Microsoft would be to allow Copilot to be trained on the Office or Windows source code. If the output is truly free of the copyright of its training materials, why wouldn't they? And if they wouldn't allow it, why not?
The output isn't guaranteed to be free of copyright from its training materials. It just usually is. There have been clear demonstrations of it regurgitating code from the training set verbatim, which would of course still be covered by the original license.
Microsoft isn't going to train Copilot on Windows code for the same reason it didn't train it on private repos: the code is private and they don't want to risk leaking private code.
I imagine there would be no problem training it on e.g. the Unreal Engine code which is not open source but is available to read.
The big practical issue is that there's no warning when Copilot produces code that might violate copyright, so you're taking a bit of a risk if you use it to generate huge chunks of code as-is. I imagine they are working on a solution to that though. It's not difficult conceptually.
Private isn't a useful distinction though right? Even open source code often has a licence and conditions on how you can reuse or integrate it into your own code. Someone owns it.
If you sell a product which can regurgitate large parts of code whose licence doesn't allow it to be completely freely used and modified, then this kind of outcome seems a foregone conclusion.
> Private isn't a useful distinction though right?
Yes it is. If code is public you can't necessarily use or copy the code (depending on the license) but there's nothing stopping you reading it and learning how it works and any secrets contained in it (e.g. information about future products).
There's a reason most proprietary software isn't "source available".
> If you sell a product which can regurgitate large parts of code whose licence doesn't allow it to be completely freely used and modified
Yes but they don't do that. I'm not sure why people are finding this so hard to understand.
It's likely already trained on Windows source code unless they have specifically excluded the repositories of the leaked Win 2000 code. Perhaps someone who's never going to contribute to ReactOS and Wine can verify?
Why would they do that, regardless of whether the output could be restricted via Copyright or not? Also, this case isn't about copyright, as the lawyer clearly explains.
I was more talking about the general morality issue than the specifics of the case.
Why wouldn't they? They are both large, important codebases which they can do whatever they like with. If they are confident that what Copilot does is akin to learning, or at least something transformative, then it makes perfect sense.
Probably the ToS. You've granted GitHub specifically a license to use your code under the terms of the ToS, so they effectively have two licenses. They can therefore choose under which licence they want to use your code, and will choose the most permissive one, or the one they have the best understanding of: in this case the ToS.
Other parties are not granted license under the ToS, and so will have to abide by the AGPL.
That statement is dangerous. ToS often gets thrown out in business to consumer transactions where the business has a significantly better negotiation position (legal counsel,…) It’s a consumer protection mechanism. In business to business interactions, the chance of getting a ToS thrown out is much lower since it’s assumed that the playing field is much more level.
It’s definitely more than just hosting code - GitHub offers issue/PR management, lightweight project management, an online IDE for collaborative editing, and CI services at least. Arguing that GitHub provides services that aim to improve developer/development team productivity is not a stretch. And arguing that ML-assisted development support is part of that definition isn’t particularly far out either.
I think copyright itself might be on its way out. What meaning does a copyright have when I can click "Variations" on anything and get 4 suggestions in 10 seconds? Imagine how good they will be by 2030.
> Copyright was originally intended to protect the creators of a work.
No, it wasn’t. Copyright was originally intended to protect the publishers of a work. It was later transformed to nominally focus on the creators, but even this was lobbied for by publishers in their own self-interest after the old law directly protecting them was allowed to lapse, and because it still had the same net effect since realizing value meant licensing to a publisher in most practical cases, so the publishers were still major beneficiaries.
And, of course, US copyrights under the Constitution do not exist for the purpose of protecting creators; instead, a private benefit for creators is a mechanism, but the purpose is expressly to “promote the progress of science and useful arts”.
No, under the US Constitution it is for a specified public benefit as its purpose, the private benefit is a mechanism to achieve that.
Under the Statute of Anne, it was nominally for creators (but this was lobbied for by printers after the expiration of earlier laws, and they were the prime beneficiaries in practice.)
> No, under the US Constitution it is for a specified public benefit as its purpose, the private benefit is a mechanism to achieve that.
Well, that's false. The actual US Constitution in Article I Section 8 Clause 8 says, "[The Congress shall have power] To promote the progress of science and useful arts, by securing for limited times to authors and inventors the exclusive right to their respective writings and discoveries."
That could, possibly, one day provide public benefits, but it doesn't have to. If public benefits happen, they are side-effects. The purpose is for authors and inventors to have exclusive rights in their writings and discoveries. They are never required to publish or advance anything for the public. This is a private protection, not a public benefit for a "specific purpose".
> That could, possibly one-day provide public benefits, but it doesn't have to
The text you quote is explicit: the public benefit—promotion of science and useful arts—is the purpose. Providing benefits to creators is a mechanism for achieving that purpose, not the purpose itself. That’s what I said before, and it remains true, and you’ve just quoted the bit of the Constitution that says it while claiming it is false.
You are assuming that promoting the progress of science and useful arts is for the public benefit, but it can easily happen in private. That is why public benefits are not primary. That is what I pointed out to you in my earlier post, but you continue to make that same flawed assumption.
The earlier laws were explicitly against printers. They were for scribes, and indirectly for the church, states, and other powers that controlled the scribes and wanted to limit the spread of dissident views.
At least according to most histories of copyright I've seen. Wikipedia seems to agree:
> Copyright was originally intended to protect the publishers of a work.
You are talking about modern US copyright law.
But copyright laws (laws around copying) predate the existence of publishers and the declaration of independence of the United States by over a thousand years.
No, I’m talking mostly about British copyright law prior to the nominal prioritization of creators in the Statute of Anne (1710).
(Technically, it was focused on printers rather than publishers, but the separation of those functions is a more modern arrangement.)
You can tell the part you target isn't about modern US copyright law because, later in the same post, I distinguish all US copyright law under the Constitution (which includes modern US copyright law) from it.
There has never been more support for tightening and enforcing copyright than there is today. This is very unlikely to change due to megacorps like Microsoft, Disney, Apple et al. having a massive vested interest in using it to extract maximum profits.
Copyright protection for the rich and powerful, while those who cannot afford armies of lawyers get their stuff stolen by machine learning models. Sounds credible to me.
A human re-using some code according to its license is not the same as an automated machine snorting up all of the code on the platform ignoring any licenses.
Copyright becomes especially important and valuable in these circumstances. Remember, original works are how your variation suggestion engine is trained. With the remaining incentives taken away, there is no more new stuff to train on; networks get trained on their own output, and the snake eats its own tail.
And sometimes the network improves; depending on the quality or direction of the output, the client can be a valuable critic even without being an expert in the field.
You didn't make a point. Training generative models on their own output after filtering for quality is a well known technique for improving image quality for a single mode. This can happen automatically when generated images become popular on the internet.
"Excellent, let's see how your car goes faster when you push the gas pedal"
If you are saying that a hypothetical client can tell a better song/artwork/code from a worse one, boy are you in for disappointment...
My point was that insane amounts of curated, fully original works were required to get the output of these generative tools to the "occasionally impressive" level it is at now, and those original works exist precisely thanks to copyright. To say "oh, we don't need copyright now" is to saw off the branch this all rests on.
If you are into this stuff and want to see it become better you should rather promote copyright and differentiation of fully original works.
> [Github's Terms of Service] specifically identifies “GitHub” to include all of its affiliates (like Microsoft) and users of GitHub grant GitHub the right to use their content to perform and improve the “Service.” Diligent product counsel will not be surprised to learn that “Service” is defined as any services provided by “GitHub,” i.e. including all of GitHub’s affiliates.
No, the misinterpretation of the ToS is not the most interesting part. The part that clearly shows her colors is:
"It looks a lot more like trolling if an otherwise incredibly useful and productivity-boosting technology is being stymied by people who want to receive payouts for a lack of meaningless attributions."
out of curiosity, would anybody else cease to have an issue with copilot if it was an open source model?
i'm not paying for copilot right now because i'm waiting for this to shake out. but i'd be happy to pay (even their current asking price) if i knew the model was also open source and could be self hosted.
maybe this is the wrong way to ask the question, but hopefully it makes sense
It's not the license of the model, it's the license of the output.
As it stands, Copilot is a black-box which strips copyright from a piece of code.
I'd be fine if it were a level playing field and GitHub also trained it on private repositories - that's a signal that they don't care about copyright at all.
I'd be fine as a developer who releases GPL'ed code if the output was licensed as GPL - obviously no license violation.
I'd be fine (within a reasonable scale) if a developer contacted me and asked to use my code under MIT.
I'm not fine that Copilot allows people to take my code, 'change the variable names' and remove the license. Especially because I have no visibility of the fact that this has occurred.
You could also imagine different Copilot models, eg Copilot-GPL, Copilot-MIT etc. Each would be trained only on GPL or MIT code from github. Then which model gets used depends on the license of the file being written at the time.
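The per-license routing idea could be sketched roughly as follows. Everything here is hypothetical: these model names don't exist, and real license detection would need something like SPDX identifier parsing rather than a lookup table.

```python
# Hypothetical sketch of routing completions to a license-segregated model.
# The model names and the license-to-model table are illustrative only.

MODELS = {
    "GPL-3.0": "copilot-gpl",
    "AGPL-3.0": "copilot-gpl",   # copyleft models could share a pool
    "MIT": "copilot-mit",
    "Apache-2.0": "copilot-apache",
}

def pick_model(file_license: str, default: str = "copilot-permissive") -> str:
    """Choose a completion model trained only on code whose license is
    compatible with the file currently being edited."""
    return MODELS.get(file_license, default)

print(pick_model("MIT"))    # -> copilot-mit
print(pick_model("WTFPL"))  # -> copilot-permissive (no dedicated model)
```

The hard parts this sketch glosses over are detecting the license of the current file reliably and deciding which license pairs are actually compatible.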
But Copilot doesn't take your code; at best it has learned from a fraction of a fraction of your code and synthesized it with tens of thousands of like examples, and the output may look similar to your code because it's trying to achieve the same thing. It's not like Copilot takes your entire repo, clones it, and says "we washed the onerous license requirements away for ya".
There's a minimum level of complexity and creativity which constitutes a copyright violation. It's up to a legal professional to draw the line, but I believe it can be a single line of code (`i = 0x5f3759df - ( i >> 1 );`)
If I saw 100 LOC which was very similar to something which I wrote, AND contained a log statement copied verbatim, it's very easy to imply that the entire piece of code is a derivative work.
Let's say I write FizzBuzz:
  # Copyright (c) 2022 David Allison. All rights reserved.
  for num in range(100):
      if num % 3 == 0 and num % 5 == 0:
          print("DA: fizzbuzz")
      elif num % 3 == 0:
          print("DA: fizz")
      elif num % 5 == 0:
          print("DA: buzz")
      else:
          print(num)
If I found the modified FizzBuzz algorithm in the wild with one line containing the "DA" prefix, it may have been learned from a fraction of a fraction of my code but it still contains my 'unique' creativity, is that a copyright violation?
Aside: Due to some uniquely named code I've contributed to, I strongly suspect that Copilot would output my GitHub username. I don't really want to open Pandora's box here, but I'd be curious.
Replicating copyrighted code from the training set only happens 1% of the time, it's the exception not the rule. And when it happens it's usually because the same text appears multiple times in the training set. So it will memorize boilerplate and popular code snippets, not unique stuff. Even a replicated piece of code 100 lines long is no big deal in my opinion, unless it contains some kind of unique thing never seen before, like an optimized matrix multiplication function. Certainly not FizzBuzz.
On the practical side, it is actually easy to filter out sequences of words that are too similar to the training set from the output of the model. You just generate another snippet until it is "original" enough.
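A minimal sketch of such an overlap filter, assuming a simple token n-gram comparison (the 6-gram window, the 50% threshold, and all names here are illustrative assumptions, not anything Copilot actually does):

```python
# Flag generated snippets whose token n-grams overlap too heavily with the
# training set; a generator would retry until a snippet passes this check.

def ngrams(tokens, n=6):
    """Set of all contiguous n-token windows in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def looks_copied(snippet, training_ngrams, n=6, threshold=0.5):
    """True if too many of the snippet's n-grams appear verbatim in training."""
    grams = ngrams(snippet.split(), n)
    if not grams:
        return False  # too short to judge
    overlap = len(grams & training_ngrams) / len(grams)
    return overlap >= threshold

# Toy "training set": a single memorized line of code.
corpus = "for num in range ( 100 ) : print ( num )"
train = ngrams(corpus.split())

print(looks_copied("for num in range ( 100 ) : print ( num )", train))  # True
print(looks_copied("total = sum ( x * x for x in values )", train))     # False
```

A production version would need a far more scalable index (e.g. hashed n-grams over the whole corpus) and would have to catch trivial evasions like renamed variables, which exact n-gram matching does not.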
I have ~400KLOC changed on GitHub. At that scale, "1% of the time" happens multiple times a day.
Pragmatically, people are already knowingly committing commercially viable copyright violations of my work. I'd rather it wasn't encouraged further by a US-based 'big tech', especially if the people using my code aren't aware that they're doing anything questionable.
Some months, I earn over 100x less from OSS than I would in industry. I don't want people taking advantage any more than I'm comfortable with, especially for commercial purposes.
It's the wild west phase, after it settles down there will probably be ways to signal you don't want to allow training on your code. But I think it's just like taking your grain of sand from the beach so nobody else can have it. The beach is going to be just the same.
As part of the wild-west phase, it is possible that the inclusion of a single identifiable AGPL project in the training set leads to the licensing of Copilot as AGPL. Such an outcome might lead hastily to the future you imagine.
Copilot used (for training) copyrighted code without respecting the license and can generate pieces of copyrighted code verbatim without respecting the original license as well.
I pay for copilot and this is very much the truth, but let's see what the court rules.
So why, in your opinion, did Microsoft not have the courage to also train Copilot on proprietary code, or on their own proprietary code? From my perspective I conclude that MS knows things are not that simple, so they did not want to "upset" some companies while they can afford to screw over the open source people.
Btw I would have been behind MS if they had done one of these two things:
1. Used all the code they have access to, including MS code and private code on GitHub, because that would show they actually believe the AI works as advertised.
2. Made the model open: let people use it locally, improve it, test it for copyright issues, do whatever they want.
If it was a true OSS project, first it would not clearly benefit a single near-monopoly by using my code (as in, that wouldn't be its purpose), and second I'm sure its contributors would be well placed to understand the issue and from the start bake in a reliable, transparent mechanism for opting out.
As is, it's EEE applied to open source-- Microsoft's ultimate play against the ethic that brought us Linux among other things. When your brainchild gets gobbled up faster than you can blink, pushed to people who never learn about your existence, and a megacorp that you are ethically opposed to profits from the process, the need for self-actualization is no longer addressed. The fundamental incentive that pushes us to publish in the open, to have other humans acknowledge you and your work and feel pride in it, is being eliminated.
I agree - it's problematic enough that licensing information gets lost in the Copilot process, but as is we basically have developers contributing their time and expertise, for free, to the development of Microsoft's new paid proprietary product. Worse still, if Copilot is as revolutionary as some people make it out to be, those same developers are inadvertently helping Microsoft build a monopoly in a new market, with all the disastrous consequences that entails.
The project could, yes. It wouldn't necessarily change the legality of using it in non-GPL projects, though. If people were only using it in license-compatible projects and it was license-compatible with GPL, I doubt anyone would have any complaints (even though in theory it could also be picking up stuff from other incompatible licenses).
Yes, I’d be one too. I have no legal opinions about this, but morally, Copilot just doesn’t sit right with me. One of the purposes open source exists for is to be, well, open. It’s so annoying seeing this tool specifically use only open source code and then have the audacity to close-source + paywall access to it.
I used to be a little more agreeable toward Copilot, given the training costs and all, but seeing Stable Diffusion willing to put hundreds of thousands into training, and more into engineering, and thereby create an active community dedicated to improving it every day, I just can’t help but be so annoyed when one of the world’s biggest tech companies pulls such a petty move.
> That rings a bit like the Facebook memes of yesteryear promising users that if they just copy and paste these magical sentences onto their timelines, then Facebook won’t be able to do something or other with their data or accounts.
The only legal way you can use copyrighted code is due to the license attached to it by the copyright holder.
If a license specifically prohibits copying the code for a purpose, then it is a violation of the copyright to copy the code for that purpose. You have no other legal way to do it.
These aren't magic words, they are legal obligations. Ok, well maybe legal obligations are magic words. But it is magic that works :). Otherwise things like GPL could not function.
> The only legal way you can use copyrighted code is due to the license attached to it by the copyright holder.
This is wrong on so many levels.
1. Copyright grants a limited set of rights to the copyright holder. "Use" typically doesn't fall into that set. Everyone has the right to "use" copyrighted material for any purpose that isn't some kind of copy or distribution.
2. Even when we consider uses which are actually covered by copyright, a license is not the only way to legally copy the material. Fair use exists.
3. There is no such thing as "the license attached to it". Licenses do not "attach" to copyrighted works. A license is an individual agreement between the copyright holder and each and every person who wants to use the material (within the scope of copyright rights and outside of fair use). Those agreements can be different in every instance, if the licensor and licensee have so agreed.
The only thing a LICENSE file or other similar way of indicating a license on code does is make a (binding) offer to license the work under the specified terms to all comers. Once anyone actually has a license by any means, including a separately negotiated license, then the LICENSE file no longer has anything to do with them or their use of the material. In the case of github, they have separately negotiated (by making a binding offer of their own in their ToS) a license to use the material for the provision of their service; therefore the LICENSE file has nothing to do with them (unless they want to use the offered license instead of the one they negotiated, and they haven't negotiated away the right to use the license offered in the LICENSE file).
You make some fair points, although I wasn't trying to be legally accurate.
I agree that use isn't governed by copyright, copying is. However, to "use" code is to make a copy of it (multiple times usually).
As far as attachment goes, I think the common sense meaning was clear. On GitHub, you can attach a license. I wasn't claiming that "attachment" was some feature of copyright law!
The (insightful) point is that if the copyright holder is the one who uploads something to GitHub, that person has also agreed to the ToS. That was something I hadn't considered before reading the article.
That line of argument might defang any claims I might have against Copilot, as I have personally uploaded much of my public open-source code to GitHub.
You give GitHub the right to use anything you upload to improve their service. What you're trying to say is after you create your GitHub account and agree to that, you're posting magic words(OSS License) that somehow binds GitHub which already has a more permissive license that you granted to them. It's articulated in the blog post.
Oh you certainly grant github some rights necessary to operate the service. Here is a relevant part:
"This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service"
Now, I read that to mean they can't sell my content. But apparently they can if they store it in a machine learning model, with greater or lesser accuracy!
It would be a Field of Endeavor restriction so the resulting license wouldn't be open source, and I don't think (?) Copilot is trained on proprietary code.
If you upload it to GitHub, you've already granted them a license to use it to improve their service. You aren't sharing it with GitHub under your custom BSD license, you're sharing it with GitHub users under that.
> It looks a lot more like trolling if an otherwise incredibly useful and productivity-boosting technology is being stymied by people who want to receive payouts for a lack of meaningless attributions.
This one sentence threw off my entire opinion of the article as it demonstrates the author's clear bias in favor of Copilot, not just specifically in this case but in principle.
Legal opinion on Copilot and generative AI in general hinges entirely on metaphors. If the AI is understood to behave like a human being building knowledge and drawing from it for inspiration, Copilot is just another way to write code. But we've already established legal precedent that machines can not hold copyright, which suggests that they can not be deemed to be creative, which could be used to argue that they are therefore just creating an inventory of copyright works and creating mechanical mashups.
The author's dismissal also ignores that this would not JUST result in attribution. If Copilot indexed copyleft code and were required to provide attribution when using this code, the output might also be affected and this could in turn affect the entire code base. Worse yet, Copilot may output code with conflicting licenses. The author considers only the possibility that Copilot itself might have to inherit the license (and the dismissal that it would "help noone" because it runs on a server ignores both the existence of a (presumably self-hosted) enterprise service and the existence of licenses like AGPL, which would still apply) but it seems most people's concerns are with the output instead.
I also fail to understand how the argument that it doesn't reproduce the code exactly 99% of the time is helpful. If I copy code and rename the variables and run an autoformatter on it, it's still a copy of the code. It's odd to see a lawyer use what is essentially obfuscation as a defense against copyright claims. Also 1% is an incredibly large number given how Copilot is supposed to be used and how large the potential customer base is. Given the direction GitHub is heading with "Hello GitHub" (demoed at GitHub Universe yesterday) it's not unlikely that Copilot would in some cases be used to generate hundreds, thousands or tens of thousands of lines of code in a single project.
The question isn't just whether Copilot is violating the law or not, the question is why it is or isn't because that could have wide implications outside GitHub itself. But as the author points out, sadly the lawsuit doesn't try to settle this for copyright, which might be the most impactful question.
I'm actually surprised they allowed Copilot to happen, given this section:
> This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service, except that as part of the right to archive Your Content, GitHub may permit our partners to store and archive Your Content in public repositories in connection with the GitHub Arctic Code Vault and GitHub Archive Program.
One could make the argument they had no intrinsic right to use the software for Copilot except under the terms laid out under the respective softwares' licenses. This means any GPL code they copied by error is now in violation of the GPL by default. But IANAL.
In my memory, when GitHub released it, they were explicit that using data like this “is common practice in machine learning.” Though, I tried to find the quote and couldn’t, so maybe my memory is wrong and I am remembering a blog post from another organization.
edit: The exact quote was “Training machine learning models on publicly available data is considered fair use across the machine learning community” if you want to search for it.
> Frequently Asked Questions -> Training Set -> Why was GitHub Copilot trained on data from publicly available sources?
> Training machine learning models on publicly available data is now common practice across the machine learning community. The models gain insight and accuracy from the public collective intelligence. But this is a new space, and we are keen to engage in a discussion with developers on these topics and lead the industry in setting appropriate standards for training AI models.
Yes, I could claim that pirating music is the "standard" but it doesn't negate the fact that there are copyright laws that could land me with a fine or in jail. GitHub can claim whatever they want, but their policies and the laws surrounding them still stand.
You have apparently misunderstood copyright licenses as being something that attaches to a copyrighted work and now must be respected by all users of that work. But that is totally incorrect.
Licenses are individual agreements between copyright holders (or licensees who have been granted the right to re-license) and people who want to exercise one of the rights normally withheld under copyright. A LICENSE file is nothing but an offer to grant a license with specified terms to anyone who might want to use the work, without having to nag the licensor to sign an agreement. The existence of that offer doesn't have anything to do with any other agreement the licensor and a (potential) licensee might make.
In the GitHub case, GitHub has negotiated a different license with the uploader. (That negotiation happened to take the form of a ToS, which is another kind of binding offer.) The LICENSE file has nothing to do with it. It hasn't been overridden, it's just irrelevant. It doesn't add or subtract any terms from the separate and distinct license GitHub negotiated.
Or it could be that she is experienced with both software and law, and that her assessment is different than yours.
> Kate’s passion for open source began in law school, under the tutelage of Eben Moglen, long-time attorney for the Free Software Foundation, founder of the Software Freedom Law Center, and author of the GPL 3. She interned at the Electronic Frontier Foundation and helped write the first complaint against the NSA for warrantless wiretapping.
> At VMware and ServiceNow, she dedicated her time to designing, building, and testing internal compliance tools in collaboration with their respective internal tools teams. She is no stranger to writing specs, creating wireframes, and massive amounts of QA. So much so, that Kate and her husband, Steve Downing, co-founded Critterdom LLC, a software company whose Open Sorcerer product substantially cuts down the time it takes to manually review source code for licenses and create a customer-facing disclosure of that source code.
Whether or not other countries give you the right to enforce your copyright even if you haven't registered it with the government is not relevant for a class action lawsuit filed in the US.
1. Humans are not neural networks.
2. Humans are not allowed to directly copy even rather short snippets of licenced code.
3. Humans do not have the capacity to memorize the entirety of GitHub.
I can't shake the feeling that a lot of the logic around ML models having more or less the same "rights" as humans comes from misleading marketing that they, in any shape or form, resemble human intelligence. AI is a buzzword applied to any kind of algorithm for an activity that people previously thought couldn't be automated.
Back when I was young, graph pathfinding algorithms were called AI. A few decades later they are a well understood commodity and I haven't seen anyone call them AI for a while. Maybe that'll happen to LLMs too, given a few years?
An argument in favour of the legality of web scraping is: if a human can look at websites and collect data, then why shouldn't they be allowed to do the same programmatically?
This is the same but for use of open source code: if humans are allowed to use one specific (organic) neural network to read, process, and use open source code, then why shouldn't they be allowed to use some other neural network, artificial or otherwise.
This is the slippery slope argument. It's not inconsistent to allow human "webscraping" while disallowing massive machine web scraping. Most important, it's about what the owner of the website considers to be appropriate.
A neural network is closer to a database than a human brain. So this is akin to saying: I can store your personal data in my human brain (without your consent), why am I not allowed to do it in PostgreSQL?
No, the argument for scraping stems from the idea that a service placing no limits on access cannot complain that it was accessed.
With code, those limits are denoted via the license, which, when supplied with the code and especially as metadata before downloading (as is the case with GitHub), is the common means by which those limits are placed.
Humans and neural networks process information very differently and it's disingenuous to imply otherwise.
The specifics don't matter so much as the general idea that if a human can do it (anything), then why can't the human make a tool that can do it from them, thus saving them the work.
For one, an organic network (for the sake of argument I'll play along if you want to reduce a human to this) has rights, freedoms, and ethical values, is not controlled by a single entity, and has not specifically been instantiated to generate profit for one.
HN is so insanely frustrating, so many comments demonstrate that the user didn't read this article at all. Just immediately jumping into a "but what about this argument that I made?".
Please don't comment on whether someone read an article. "Did you even read the article? It mentions that" can be shortened to "The article mentions that."