The ignorance in this comment section is already giving me an aneurysm. Software licenses matter. Copyright matters. If megacorps like Microsoft can sue people into oblivion for violating their copyright terms, people can sue Microsoft into oblivion for violating theirs. I don't use MS Github, I have no skin in the game, but I hope there is at least a $1000 award to every instance of AGPL and GPL license violation because it's unfair and illegal what they're doing.
Software freedom matters, but I wouldn't expect the typical HN type to understand, since their money is made on exploiting freely-available software, putting it into proprietary little SaaS boxes, then re-selling it.
Microsoft (presumably) did train it on their open source repositories, since those repositories are public GitHub repos. They didn't train it on anybody's private repositories.
The point is, if they're sure they won't be recycling copyrighted code wholesale, why not include their own in the training set? Surely their internal code is higher quality than the average git repo, which must be 80% abandonware (if my personal repos are anything to go by :P)
Probably because of the (very small) chance that Copilot could regurgitate something secret or embarrassing.
Which is not necessarily hypocritical. The amount of copying needed for something to be copyright infringement is not high… but it's still significantly higher than the amount needed to leak information. For that, just a few words will do, e.g.
// For Windows 12
or
// Fuck [company name]
or
long secret_key[2] = {0x1234567812345678, 0x8765432187654321};
But publicly accessible doesn't mean public domain. Microsoft has shared even some of their private code with others like governments. No doubt with strict licenses which they expect to be honored. AGPL and other licenses on publicly accessible code still matter.
Microsoft's apparent legal opinion is that training an AI on the data is the same as reading it, and doesn't require a license.
That as long as they have the right to read the data, they have the right to train an AI on it. The fact that the code is available under an open source license is irrelevant to them.
As for why they didn't use their own private code to train their AI, I suspect it was more of a non-malicious: "we don't need to, this public github repo dataset is big enough for now"
Personally, I think Microsoft should double down on this legal stance. Train the AI on all their internal code. And train it on any code they have licensed from other companies too.
I remember when some Windows code was leaked, people explicitly skipped reading it to avoid getting sued if they were to work on the Linux kernel or Wine in the future. Reading code can most certainly lead to a copyright breach, and Microsoft of all corporations should know this.
> Microsoft's apparent legal opinion is that training an AI on the data is the same as reading it, and doesn't require a license.
How is that reconciled with the fact that a person who has read copyrighted code (not even the original source code, a mere decompiled version of it!) is forbidden to reimplement it directly:
Clean room reimplementation is a way to prevent court cases, it's not a legal requirement.
If a company copies a competitors product then the chance of getting sued is very high. If they can show that, in fact, there was zero copying at all, then they can get the case dismissed and save great legal expense.
If the sample of the Metallica song is insubstantial enough then you may well prevail in court.
It's unsurprising that copilot can reproduce the most famous subroutine of all time precisely, given that it occurs in hundreds or thousands of repos.
Also that code is not copyrightable. Pure algorithms are not copyrightable, copyright of code arises from its literary qualities.
E.g. I can copy an algorithm out of an ISO spec and that doesn't make my code a derivative work of the spec requiring me to pay royalties to ISO.
When you strip the algorithmic elements out of fast inverse sqrt, you are left with what? Single-letter variable names. That is certainly far below the threshold for copyright.
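To make the algorithm-versus-expression split a bit more concrete, here is a minimal Python sketch of the same bit-level technique written with descriptive names. The function name and constant name are my own choices for illustration, not taken from any particular source; the magic constant is the widely published one for 32-bit floats. Note that Python does the final refinement step in double precision, so results will differ slightly from a pure single-precision version.

```python
import struct

# Widely published magic constant for 32-bit IEEE 754 floats.
MAGIC_CONSTANT = 0x5F3759DF


def approximate_inverse_sqrt(value: float) -> float:
    """Approximate 1/sqrt(value) for a positive value that fits in a 32-bit float."""
    # Reinterpret the 32-bit float's bytes as an unsigned 32-bit integer.
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    # Halving the bit pattern and subtracting it from the constant yields
    # a rough first guess at the inverse square root.
    guess_bits = (MAGIC_CONSTANT - (bits >> 1)) & 0xFFFFFFFF
    # Reinterpret the integer back as a 32-bit float.
    guess = struct.unpack("<f", struct.pack("<I", guess_bits))[0]
    # One Newton-Raphson step refines the guess.
    return guess * (1.5 - 0.5 * value * guess * guess)
```

Calling `approximate_inverse_sqrt(4.0)` should land within a fraction of a percent of 0.5. Everything in the sketch is dictated by the math and the float layout; only the names and comments are a matter of choice.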
Software licenses have barely been tested in court, let alone how they apply to code injected and combined with other code via machine learning. You're extremely overconfident about how this will actually play out.
For one, just because your code is covered by the GPL, it doesn't mean every single line in isolation is copyrightable. It has to demonstrate creativity. That's why you don't have to worry about writing for (int i = 0; i < idx; i++) {.
You're right that code has to demonstrate creativity for copyright. But that also means that an algorithm, even a transformative algorithm, cannot change copyright because an algorithm is not creative, by definition.
This means that the output of any algorithm on copyrighted code is still under the original copyright. I mean, we still apply the copyright of the original to the output of compilers, even though compilers can be transformative with inlining and link-time optimization, to the point that it mixes disparate code in the same way Copilot does.
In fact, I wrote some software licenses [1] that codify the fact that algorithms cannot change copyright.
You sound very confident about this, whereas the copyright lawyers I've read discussing this issue seem much less confident overall, but lean toward thinking this would be fair use.
What makes you so confident that this would not be ruled fair use?
(And for people not familiar - if ruled fair use, it doesn't matter what the license is because fair use is an exception to copyright itself.)
I have a feeling you did not read the FAQ of the licenses. I don't blame you, but they explain my position.
Here's the relevant quote:
> GitHub is arguing that using FOSS code in Copilot is fair use because using data for training a machine learning algorithm has been labelled as fair use. [1]
> However, even though the training is supposedly fair use, that doesn’t mean that the distribution of the output of such algorithms is fair use.
My licenses say, basically, "Sure, training is fair use, but distributing the output is not."
The licenses specifically say that the copyright applies to any output of any algorithm that uses the source code as all or part of its input.
Now, I have not gotten a lawyer to look at my licenses yet (it's in the works), so don't use them yourself. But because everyone keeps saying that training is fair use, I'm fairly confident that only training is fair use.
Of course, it might not be, but that would take more court cases and more precedent. I wanted to poison the well now [2] to make companies nervous about using a model that was partially trained with code licensed under my licenses.
It's mildly interesting that you've decided to express your personal opinion about what is or is not fair use within your license text, but the fact is that if a use of the work is deemed to be fair use under the law then the terms of the license you're offering are completely irrelevant. Your permission is not required to make fair use of the work, so no one needs to agree to your license.
> It's mildly interesting that you've decided to express your personal opinion about what is or is not fair use within your license text, but the fact is that if a use of the work is deemed to be fair use under the law then the terms of the license you're offering are completely irrelevant. Your permission is not required to make fair use of the work, so no one needs to agree to your license.
You do not seem to get it. Yes, I understand that if fair use applies, my licenses don't matter. I get that. I promise I do get that.
The purpose of these licenses is to sow doubt that fair use applies to distributing the output of ML models.
Lawyers are usually a cautious lot. If a legal question has not been answered, they usually want to stay away from any possibility of legal risk regarding that question.
The licenses create a question: does fair use apply to the output of ML algorithms? With that question not answered, lawyers and their companies might elect to stay away from ML models trained with my code, and ML companies might stay away from training ML models on my code in the first place.
That is what I mean by "poisoning the well." The poison is doubt about the legality of distributing the output of ML models, and it is meant to put a damper on enthusiasm for code being used to train ML models, especially for my code.
It still amounts to an opinion statement in the license text which has no real bearing on the license. I was trying to be charitable, but your clarification makes it seem even more like you're just trying to spread unsubstantiated FUD in hopes of scaring people away from using your code as input to ML models even when that would be fair use. Which seems to me to be vaguely akin to fraud. Moreover, the license seems like a poor choice of venue to express your opinion since those you're most interested in dissuading (e.g. people using lots of different projects as input to their ML models, without investigating the details of each one) are also the least likely to bother reading it. In terms of raising awareness of how copyright might apply to the output of ML models you'd do better to post your opinions on a blog somewhere and leave the license text for things that can actually be affected by a license.
> The relevant part of the license is the definition of the covered work, which basically says that the output of any algorithm that uses copyrighted code as input is under the same license.
In other words, you are granting unnecessary additional permission to use the output of an ML algorithm trained on the copyrighted code under the terms of the same license, when your permission was not required if the use of that output was already covered by fair use. If the use is not considered fair use—if the output would be deemed a derivative work under copyright law—then this license is beneficial to the developers of ML systems like Copilot since it explicitly grants them permission to use the output under the same terms. In the best case it's fair use and your license is irrelevant, and in the worst case your license grants them a path to proceed anyway, with a few extra rules to follow. Under no circumstances can anything you write in the licence expand the reach of the copyright you have in the original code, no matter how "wonderfully broad and general" the license may be.
Reading through the licenses and FAQs on your site did not improve my opinion of them in the slightest. Especially the part where you attempted to equate what Copilot does with trivial processing of the source code, e.g. with an editor, to argue that classifying the use of any output from an ML algorithm trained on copyrighted inputs as fair use is equivalent to eliminating copyright on software. The reality is of course much more nuanced. Certainly if the ML algorithm merely reproduces a portion of its input from some identifiable source, including non-trivial creative elements to which copyright could reasonably be applied, then the fact that the process involved an ML algorithm does not preclude a claim of infringement, and it would be reasonable to apply something like a plagiarism checker to the ML output to protect the user from accidental copying. However, the purpose of an ML system like Copilot is synthesis, extracting common (and thus non-creative) elements from many sources and applying them in a new context, the same as any human programmer studying a variety of existing codebases and subsequently writing their own code. The reproduction of these common elements in the ML output can be fair use without impacting the copyrights on the original inputs.
The real question here is why I'm wasting my time attempting a good-faith debate with someone who thinks that "spreading FUD is not necessarily a bad thing…".
I never said that the synthesis process was creative; rather the opposite. The point of a tool like Copilot is not to come up with new, creative solutions, but rather to distill many different inputs down to their common elements ("boilerplate") to assist with the boring, repetitive, non-creative aspects of programming. When the tool is working as intended the output will bear a resemblance to many different inputs within the same problem domain and will not be identifiable as a copy of any particular source. Of course there have been certain notable exceptions where the training was over-fitted and a particularly unique prompt resulted in the ML system regurgitating an identifiable input text mostly unchanged, which is why I think it would be a good idea to add an anti-plagiarism filter on the results to prevent such accidental copying, particularly in cases where it might be less obvious to the user.
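For what it's worth, the kind of anti-plagiarism filter I have in mind could be as simple as checking how much of a suggestion overlaps, token for token, with any known source. Below is a minimal Python sketch under my own assumptions: the function names, the whitespace tokenization, the 6-token window, and the 0.5 threshold are all hypothetical choices for illustration, and I have no idea what (if anything) Copilot actually does here.

```python
def token_ngrams(text: str, n: int) -> set:
    """Return the set of n-token windows in text (whitespace tokenization)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(suggestion: str, known_source: str, n: int = 6) -> float:
    """Fraction of the suggestion's n-grams that also appear in known_source."""
    suggestion_grams = token_ngrams(suggestion, n)
    if not suggestion_grams:
        # Suggestions shorter than n tokens are too small to judge.
        return 0.0
    source_grams = token_ngrams(known_source, n)
    return len(suggestion_grams & source_grams) / len(suggestion_grams)


def looks_copied(suggestion: str, corpus: list, n: int = 6,
                 threshold: float = 0.5) -> bool:
    """Flag a suggestion if it overlaps heavily with any single known source."""
    return any(overlap_ratio(suggestion, source, n) >= threshold
               for source in corpus)
```

A suggestion that reproduces a long run of tokens from one source gets flagged, while very short snippets produce no n-grams and pass through. A real filter would need proper tokenization and an index over the corpus rather than a linear scan, but the idea is the same.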
> When the tool is working as intended the output will bear a resemblance to many different inputs within the same problem domain and will not be identifiable as a copy of any particular source.
You would have a great argument, and I would actually not be so mad at GitHub, if they had only trained Copilot on such boilerplate/non-copyrightable code. However, they trained it on all of the code in all of the public repositories. That's why we see:
> ...there have been certain notable exceptions where the training was over-fitted and a particularly unique prompt resulted in the ML system regurgitating an identifiable input text mostly unchanged...
The fact that this happens is a sign that GitHub did not train it only on boilerplate; they trained it on truly creative stuff. And they expect people to believe that the output is not under copyright. The gall blows my mind.
But even if it were to take the most repeated pieces of code and only synthesize stuff from that, would that solve the problem?
Not really because some of the best (i.e., most creative) code is forked the most, meaning that Copilot saw some of the best code over and over.
Here's an experiment you can do (if you have access to Copilot): Start a new C source file, and in a comment at the top, say something like:
// A Robin Hood open addressed map.
map_item(
And see what it gives you. I would bet that it will suggest something close to [1], which is my code. (Ignore the license header; the code is actually under the Yzena Network License [2].) Notice that there is no "ymap_item()" function in my code, so this would not be triggering Copilot's overfitting.
The reason I think so is that Copilot doesn't just suggest one line at a time, which if it did, an argument could be made for boilerplate. Instead, it suggests whole sections of code. A good percentage of the time, even maybe a majority of the time, that is not boilerplate.
Licenses can't dictate what is not allowed unless the user wants to use it in a way compliant with the rest of the license. If you decide to not follow the license at all, then it's effectively like any other copyright where you can use it without the owner's permission under fair use.
> it doesn't mean every single line in isolation is copyrightable
Microsoft did not just copy individual lines. They fed whole repositories into their model, ignoring the license (if it exists) even though they knew from the start that information generated by the model will be publicly available. Available usually out of context, but nonetheless - the scope of the input and intent are very clearly "everything" and "redistribution".
Just adding a filter/ML model to the output shouldn't matter. I dare you to build a Copilot clone trained from leaked internal Microsoft code and then trying to argue the output is a bit mixed up.
Copilot was trained on leaked internal Microsoft code that's on GitHub at the moment. Anyway, everyone seems perfectly ok with training language models on copyrighted text.
Everyone is not perfectly OK with training language models on copyrighted text. It's just that evilCorps do it anyway, and there's nothing anyone can do to stop them. I can't do anything. At best, I could get a Twitter account and complain to the ether. The copyright holders can't do anything against the might of evilCorps, but that doesn't make them okay with it. The fact you believe this is just sad, and exactly what evilCorps want from you.
This goes beyond fair use or satirical/comedic effect. They are training their models to output text in the style of the authors being absorbed. The style is exactly the artistic effect that is being copyrighted.
My explanation will not be popular here on HN, but I'm never one to shy away. Especially when asked directly.
Buying a book, an audio CD, or a DVD/Blu-ray grants the holder permission to read, listen to, or view that product as a single instance. You can lend them out, but that's all you're really allowed to do with them. The text, audio, or video is not owned by you to do with as you please. People obviously do not like that, and argue making copies/backups is their right. Maybe that's acceptable, but we can agree that posting them on torrents, or otherwise sharing a copy made from the thing you have, is not.
Saying that, training a model on someone's copyrighted text is not part of the agreement of the usage of said text, whether it's a copyrighted magazine, newspaper, or book. If the people doing the training reach out to the copyright holders and get specific permission to use their copyrighted material in such a manner, then go ahead. The fact that people feel like they can do anything without the common courtesy of asking for permission is troubling to me that we've lost something as a society. There's no acknowledgment that someone has created something by their own work so that the creator can do with it as they please. A large portion of people believe that because it was created they deserve/should be able to/etc. do whatever they want with someone else's creation, including getting paid for derivative works from the original creation.
> The fact that people feel like they can do anything without the common courtesy of asking for permission is troubling to me that we've lost something as a society.
I see this sentiment a lot in FOSS spaces but I don't really understand why. The role of judicial process _isn't_ to provide a guiding moral philosophy around social organization. Depending on the government in question that's either a role of government functions or isn't something that should be guided at all. The role of law often (and yes, not in all governments, but at least in the US) is to offer a contract between the state and the individual.
I understand the potential for abuse here in using Copilot to regurgitate licensed works without adhering to the terms of the work's license, but I'm not fluent enough in law to know if this is illegal or not. Calling out and specifically applying strict limits to this practice is certainly something I'm sympathetic to, and I'm very curious to see what the courts come up with. But swayed by a moral argument I am not.
In the realm of FOSS, I feel like it's not the same comparison. The FOSS devs created the work and released it with the express knowledge that someone else could update/modify that work. Writing/art/videos are rarely released with copyright that allows this kind of modification. That's a huge difference. There are some FOSS releases that allow personal/private use while restricting commercial use. This is closer to the books/movies type of scenario.
I mean sure, but these are both legally defined works with licenses that govern their use. The difference is in the style of license. FOSS doesn't get a special moral valence because individuals are authors and they offer their work for editing and remixing under narrow circumstances. I mean, if Jeff Bezos today were to release code he wrote by hand with GPLv3 and were to cry foul over Copilot, I doubt anyone would care (or he'd get made fun of online.) Why does FOSS get treated so differently?
> People obviously do not like that, and argue making copies/backups is their right.
In some jurisdictions this is in fact their right by law as long as they own the original (the music/film industry of course used this as an excuse to slap additional fees on every sale of any storage medium). Redistribution is different however.
> My explanation will not be popular here on HN
How is this better than ’bring on the downvotes’?
Moving on, I’ll put this to you: you claim training a ML model against copyrighted text is in violation of the ‘permission’ granted by the rights holder. However, flip this on its head for a moment – that’s basically all human brains do. Clearly, the greatest writers of our time haven’t written their works in a vacuum. Rather, that historical reading and inspiration becomes sufficiently obfuscated that we deem something adequately creative enough to be granted its own copyright.
Fundamentally, how does Copilot differ, other than perhaps being a poor implementation? Is it by not being ‘adequately creative’ enough?
Is there some future version you could envision that would be, or is it the principle you’re arguing against?
Human beings commit copyright infringement all of the time. People have been lifting riffs from music, sometimes unconsciously, forever. This is why clean room implementations are done sometimes when writing software.
Also, you're taking the machine learning metaphor literally. AI models do not "learn", they're just statistical models, they don't understand anything. There is no comparison to human learning that isn't superficial or metaphorical.
The real question is how Copilot is any different than a compiler, or lossy encoding or compression.
I don't agree with your premise. Humans can consume creative works and be influenced; this is not in question. Unless one is an impressionist, they aren't going to try to recreate exactly the works done by the artists they have been influenced by. Even if an artist does something inspired/influenced by another, they have pretty much stated that. Musicians cite prior bands, and writers, painters, etc. all credit their influences too.
I'm probably just a curmudgeon, but I don't understand the point of Copilot. So I'm probably not the best to opine about it. However, I am very opinionated about copyright in a manner that typically flows against HN groupthink.
>> My explanation will not be popular here on HN How is this better than ’bring on the downvotes’?
I totally missed the non-wrapped question.
Because I don't give a crap about down-votes/up-votes. I just know from experience my views on copyright do not gel with the majority views on HN. I was just acknowledging that fact. Conversations can be had regardless of votes. My views on Napster/MP3 trading are in the same realm (and somewhat related to copyright issues). I was a co-owner of a small music site when Napster was in its heyday, and we saw direct repercussions of people not buying music because they got it from MP3 trading. Groupthink here is all "things for free when I want it, how I want it", yet I still have conversations. I'm not afraid of a measly -4 points because my thoughts are contrary to groupthink.
At the same time, if something like this gets your goat, how is asking how something is better being better in and of itself?
> Software licenses have barely been tested in court...
OSS licenses have been litigated and upheld. Can't supply details of my own experience for confidentiality reasons but plenty of plaintiffs have prevailed in suits about violations of OSS license terms. My guess is the numbers are higher than you might think because a lot of the cases end in non-public settlements.
A confidential settlement does not mean that a licence has been “tested in court” or “litigated and upheld.” It means the parties thought the risk of losing was high enough to justify a settlement. The state of the law remains uncertain because cases are getting settled rather than litigated.
What about non-traditional-FOSS licenses? There is a lot of source-available not-OSI-compliant licensed software on GitHub like MongoDB, CockroachDB, etc., and that's clearly proprietary. If this thing is trained on that and generates what amount to snippets of that code then it's clearly violating those licenses.
Then there's private repositories. If they included those in the training data set that's even more actionable.
Personally I think this is software piracy at an absolutely unprecedented scale. Machine learning is just information transfer from the training data into weights in a model, a close relative of lossy data compression. Microsoft is now reselling all its GitHub users' code for profit.
> You're extremely overconfident about how this will actually play out.
I'd argue Microsoft, too, was/is overconfident about how this would play out. I would have expected a little more caution in selecting the training data.
While they are not tested, anything other than accepting the idea kills the idea of software completely. There is lots of room to change details, but somehow copyright and the fact that the code is copied into computer memory need to be reconciled.
I don't see how. It might kill specific ideological licensing of software code, but the idea it'd kill software as a whole is pretty unbelievable. Software is too valuable to society.
As we're seeing, there's VERY little software where the specific algorithms or ideas in the software are what's valuable. The value comes from the ability to sell a service based on the software and operate it at scale. Like you said, how much SaaS is mostly open source stuff packaged up? Android is (sort of) open source, companies pay lots of people a lot of money to contribute to the Linux kernel where they give away the code they developed with that money, etc etc.
A software license, like any license, is a permission to operate.
> it doesn't mean every single line in isolation is copyrightable
It is if you can prove reproduction apart from your own original work (fair use). Unlike patents copyright doesn’t protect uniqueness. It is only a shield from reproduction, and if reproduction is demonstrable to a court you are likely at risk.
Copyright certainly matters. It's a big deal legally and economically all over the world.
Suppose that it's just a bad idea and shouldn't exist. Does that mean that I should release my code into the public domain? I think you could make a good case that even being totally opposed to copyright morally or pragmatically or otherwise, given that it currently is enforced in many places it's worthwhile to play along. For example, some people would prefer a world without copyright, but GPL their code, because it might prevent a greater evil.
Exactly. The copyleft side of me says you can't copyright instructions on how to bake a cake, or a fast route across a city, or a beautiful way to display colored pixels in a grid, or an efficient compression scheme for video data... because it's all intellectual, and not physical, "property". But society disagrees so a nice hack on copyright that perpetually keeps any of the above from being stolen and locked down by profit seeking psychopaths just early enough to the scene to make a buck, seems like the best interim solution.
If you abolish copyright, that will only make it easier for for-profit corporations to use FOSS. There will be nothing stopping them from using FOSS, unless people stop sharing their code altogether.
While true, if you abolish copyright then there is nothing preventing me from installing Microsoft Office on as many machines as I want and never paying Microsoft a dime....
This is a common misconception: without copyright, Microsoft would still have many legal means to force you to pay for every copy of Windows, from contract law to patent licenses. Without copyright there would not be free software and copyleft as we know it.
There is zero mechanism under patent law to enforce what you are referring to.
Patent law is about selling items, not consuming them, so they could prevent me from selling a clone of Office but they cannot prevent me from installing Office.
As far as contract law goes, that would be between two parties, so if I obtained a copy of Office somewhere and did not have a contract with Microsoft, I would not be violating a contract with Microsoft. Copyright is the only mechanism they use to stop unauthorized distribution of their software.
How so? The MIT license allows you to do everything with the code. It doesn't allow you to sue the author, but that's about it. Here it is: https://opensource.org/licenses/MIT
No, it's not clear, and I guess that's up to the courts to decide.
But in my (non-lawyer) opinion - if the reproduced code is substantial/unique enough to be deemed to be covered by the license, then it's also substantial/unique enough to be subject to that license requirement.
>I don't use MS Github, I have no skin in the game
You don't have to use Github to have skin in the game.
As long as someone has access to your open source code, no matter where it's hosted, anyone is free to upload it to Github. The open source license of your code allows that.
>I hope there is at least a $1000 award to every instance of AGPL and GPL license violation
So much this. If a neural network is capable of regurgitating code verbatim (with comments!), it's not a stretch to say it's a derivative work of the GPL code used to feed it.
> Yes, he did wrong and gross things, but in the same breath he's brushed under the rug, so are his ideas.
His ideas being "brushed under the rug" had nothing to do with his public "cancellation" that happened in the past few years.
Stallman has always been an extreme purist that prioritized his ideological stance over anything else that matters to users. And his ideas were "brushed under the rug" just as much 5 years ago (before public revelations about his misdoings) as they are now. It might just feel like he has been increasingly "brushed under the rug" more recently because he has been becoming increasingly irrelevant and more of just a spokesperson.
Stallman was looking out for USERS, not developers. The problem was that the developers thought it was them Stallman wanted to protect.
GPL and Libre Software are about keeping software open from the dev to the end user. Non-copyleft "Open Source" is about keeping libraries open for devs to exploit in their closed source products...
There is a big difference, I support Free Software, not "Open Source"
RMS is not the dictator of FOSS. There are plenty of valid competing opinions of what "freedom" means and not all of those include legally compelling everyone to share. The MIT license, for example, is both older and more popular than GPL. There have always been a lot of people who do not agree with his opinions.
Most of the people who go nuts when you point these things out are FOSS zealots reacting to the idea that FOSS licenses should be adjusted to prevent billion dollar companies from co-opting it for profit.
Profit is fine. Building anti-competitive monopolies that don't share and that seek to own more and more of computing was an unanticipated side effect.
This isn't ML, it is a ripoff and is violating clear software licensing terms. https://news.ycombinator.com/item?id=27710287