
I have a feeling you did not read the FAQ for the licenses. I don't blame you, but it explains my position.

Here's the relevant quote:

> GitHub is arguing that using FOSS code in Copilot is fair use because using data for training a machine learning algorithm has been labelled as fair use. [1]

> However, even though the training is supposedly fair use, that doesn’t mean that the distribution of the output of such algorithms is fair use.

My licenses say, basically, "Sure, training is fair use, but distributing the output is not."

The licenses specifically say that the copyright applies to any output of any algorithm that uses the source code as all or part of its input.

Now, I have not gotten a lawyer to look at my licenses yet (it's in the works), so don't use them yourself. But because everyone keeps saying that training, specifically, is fair use, I'm fairly confident that training is the only part that is.

Of course, it might not be, but that would take more court cases and more precedent. I wanted to poison the well now [2] to make companies nervous about using a model that was partially trained with code licensed under my licenses.

[1]: https://valohai.com/blog/copyright-laws-and-machine-learning...

[2]: https://gavinhoward.com/2021/07/poisoning-github-copilot-and...




> My licenses say, basically, "Sure, training is fair use, but distributing the output is not."

Licenses basically by definition cannot say what is and isn't fair use...


> Licenses basically by definition cannot say what is and isn't fair use...

Yes. However, my licenses only repeat what people already say. Then the licenses go further and add, "But anything else is not allowed."

Everyone else says training is fair use. My licenses agree. But they make it clear that I don't believe that anything else is fair use.

Yes, these licenses would have to be tested in court. But they poison the well now.


It's mildly interesting that you've decided to express your personal opinion about what is or is not fair use in your license text, but the fact is that if a use of the work is deemed to be fair use under the law then the terms of the license you're offering are completely irrelevant. Your permission is not required to make fair use of the work, so no one needs to agree to your license.


> ...if a use of the work is deemed to be fair use under the law then the terms of the license you're offering are completely irrelevant. Your permission is not required to make fair use of the work, so no one needs to agree to your license.

You do not seem to get it. Yes, I understand that if fair use applies, my licenses don't matter. I get that. I promise I do get that.

The purpose of these licenses is to sow doubt that fair use applies to distributing the output of ML models.

Lawyers are usually a cautious lot. If a legal question has not been answered, they want to stay away from any possibility of legal risk regarding that question.

The licenses create a question: does fair use apply to the output of ML algorithms? With that question not answered, lawyers and their companies might elect to stay away from ML models trained with my code, and ML companies might stay away from training ML models on my code in the first place.

That is what I mean by "poisoning the well." The poison is doubt about the legality of distributing the output of ML models, and it is meant to put a damper on enthusiasm for using code to train ML models, especially my code.


It still amounts to an opinion statement in the license text which has no real bearing on the license. I was trying to be charitable, but your clarification makes it seem even more like you're just trying to spread unsubstantiated FUD in hopes of scaring people away from using your code as input to ML models even when that would be fair use. That seems to me vaguely akin to fraud. Moreover, the license seems like a poor choice of venue to express your opinion, since those you're most interested in dissuading (e.g. people using lots of different projects as input to their ML models, without investigating the details of each one) are also the least likely to bother reading it. In terms of raising awareness of how copyright might apply to the output of ML models, you'd do better to post your opinions on a blog somewhere and leave the license text for things that can actually be affected by a license.


[flagged]


> The relevant part of the license is the definition of the covered work, which basically says that the output of any algorithm that uses copyrighted code as input is under the same license.

In other words, you are granting unnecessary additional permission to use the output of an ML algorithm trained on the copyrighted code under the terms of the same license, when your permission was not required if the use of that output was already covered by fair use. If the use is not considered fair use (if the output would be deemed a derivative work under copyright law), then this license is beneficial to the developers of ML systems like Copilot since it explicitly grants them permission to use the output under the same terms. In the best case it's fair use and your license is irrelevant, and in the worst case your license grants them a path to proceed anyway, with a few extra rules to follow. Under no circumstances can anything you write in the license expand the reach of the copyright you have in the original code, no matter how "wonderfully broad and general" the license may be.

Reading through the licenses and FAQs on your site did not improve my opinion of them in the slightest. Especially the part where you attempted to equate what Copilot does with trivial processing of the source code, e.g. with an editor, to argue that classifying the use of any output from an ML algorithm trained on copyrighted inputs as fair use is equivalent to eliminating copyright on software. The reality is of course much more nuanced. Certainly if the ML algorithm merely reproduces a portion of its input from some identifiable source, including non-trivial creative elements to which copyright could reasonably be applied, then the fact that the process involved an ML algorithm does not preclude a claim of infringement, and it would be reasonable to apply something like a plagiarism checker to the ML output to protect the user from accidental copying. However, the purpose of an ML system like Copilot is synthesis, extracting common (and thus non-creative) elements from many sources and applying them in a new context, the same as any human programmer studying a variety of existing codebases and subsequently writing their own code. The reproduction of these common elements in the ML output can be fair use without impacting the copyrights on the original inputs.

The real question here is why I'm wasting my time attempting a good-faith debate with someone who thinks that "spreading FUD is not necessarily a bad thing…".


The real question I have is how you think an algorithm doing synthesis is creative.


I never said that the synthesis process was creative; rather the opposite. The point of a tool like Copilot is not to come up with new, creative solutions, but rather to distill many different inputs down to their common elements ("boilerplate") to assist with the boring, repetitive, non-creative aspects of programming. When the tool is working as intended the output will bear a resemblance to many different inputs within the same problem domain and will not be identifiable as a copy of any particular source. Of course there have been certain notable exceptions where the training was over-fitted and a particularly unique prompt resulted in the ML system regurgitating an identifiable input text mostly unchanged, which is why I think it would be a good idea to add an anti-plagiarism filter on the results to prevent such accidental copying, particularly in cases where it might be less obvious to the user.
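
(To illustrate, here's a minimal sketch in C of what such a filter could check: flag any suggestion that shares a long verbatim run with a known source file. The function name and threshold are hypothetical, and a real filter would index the whole training corpus rather than scanning file pairs like this.)

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    // Return true if `output` contains a verbatim run of at least
    // `threshold` bytes that also appears in `source`. This O(n*m)
    // scan only illustrates the idea; a real filter would index the
    // training corpus (e.g., with a suffix array) instead.
    static bool looks_copied(const char *output, const char *source,
                             size_t threshold)
    {
        size_t olen = strlen(output), slen = strlen(source);
        for (size_t i = 0; i < olen; ++i) {
            for (size_t j = 0; j < slen; ++j) {
                size_t k = 0;
                while (i + k < olen && j + k < slen &&
                       output[i + k] == source[j + k]) {
                    ++k;
                    if (k >= threshold) return true;
                }
            }
        }
        return false;
    }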


> When the tool is working as intended the output will bear a resemblance to many different inputs within the same problem domain and will not be identifiable as a copy of any particular source.

You would have a great argument, and I would actually not be so mad at GitHub, if they had only trained Copilot on such boilerplate/non-copyrightable code. However, they trained it on all of the code in all of the public repositories. That's why we see:

> ...there have been certain notable exceptions where the training was over-fitted and a particularly unique prompt resulted in the ML system regurgitating an identifiable input text mostly unchanged...

The fact that this happens is a sign that GitHub did not train it only on boilerplate; they trained it on truly creative stuff. And they expect people to believe that the output is not under copyright. The gall blows my mind.

But even if Copilot took only the most-repeated pieces of code and synthesized from those, would that solve the problem?

Not really, because some of the best (i.e., most creative) code is forked the most, meaning that Copilot saw some of the best code over and over.

Here's an experiment you can do (if you have access to Copilot): start a new C source file, write a comment like this at the top, and begin typing:

    // A Robin Hood open addressed map.
    ymap_item(
And see what it gives you. I would bet that it will suggest something close to [1], which is my code. (Ignore the license header; the code is actually under the Yzena Network License [2].) Notice that there is no "ymap_item()" function in my code, so this would not be triggering Copilot's overfitting.

The reason I think so is that Copilot doesn't just suggest one line at a time; if it did, an argument could be made that it's only producing boilerplate. Instead, it suggests whole sections of code. A good percentage of the time, maybe even a majority of the time, that is not boilerplate.
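
(For anyone unfamiliar with the technique the comment names: a Robin Hood table keeps every entry as close as possible to its ideal slot by tracking probe distances, which also lets lookups stop early. Here is a minimal sketch with hypothetical names; it is not the code from [1].)

    #include <stddef.h>
    #include <stdint.h>

    // Hypothetical slot in an open-addressed table. A zero hash
    // marks an empty slot (a simplification).
    typedef struct {
        uint64_t hash;  // 0 means empty
        size_t dist;    // distance from this entry's ideal slot
        void *value;
    } rh_slot;

    // Robin Hood lookup: probe linearly, but give up as soon as the
    // resident entry is closer to its ideal slot than we are, because
    // insertion keeps entries ordered by probe distance.
    static void *rh_find(const rh_slot *slots, size_t cap, uint64_t hash)
    {
        size_t home = (size_t)(hash % cap);
        for (size_t dist = 0; dist < cap; ++dist) {
            const rh_slot *s = &slots[(home + dist) % cap];
            if (s->hash == 0 || s->dist < dist) return NULL;
            // A real map would also compare keys, since hashes collide.
            if (s->hash == hash) return s->value;
        }
        return NULL;
    }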

[1]: https://git.yzena.com/Yzena/Yc/src/branch/master/src/map/map...

[2]: https://yzena.com/yzena-network-license/


A license can't dictate what is not allowed unless the user wants to use the work under the license's terms in the first place. If you decide not to follow the license at all, then the work is like any other copyrighted material: you can only use it without the owner's permission where fair use applies.

That doesn't usually mean you can use the code, though; see: https://news.ycombinator.com/item?id=27726343



