> The relevant part of the license is the definition of the covered work, which basically says that the output of any algorithm that uses copyrighted code as input is under the same license.
In other words, you are granting unnecessary additional permission to use the output of an ML algorithm trained on the copyrighted code under the terms of the same license: if the use of that output is already covered by fair use, your permission was never required. If the use is not fair use (if the output would be deemed a derivative work under copyright law), then this license actually benefits the developers of ML systems like Copilot, since it explicitly grants them permission to use the output under the same terms. In the best case it's fair use and your license is irrelevant; in the worst case your license grants them a path to proceed anyway, with a few extra rules to follow. Under no circumstances can anything you write in the license expand the reach of the copyright you hold in the original code, no matter how "wonderfully broad and general" the license may be.
Reading through the licenses and FAQs on your site did not improve my opinion of them in the slightest. Especially the part where you attempted to equate what Copilot does with trivial processing of the source code, e.g. with an editor, to argue that classifying the use of any output from an ML algorithm trained on copyrighted inputs as fair use is equivalent to eliminating copyright on software. The reality is of course much more nuanced. Certainly if the ML algorithm merely reproduces a portion of its input from some identifiable source, including non-trivial creative elements to which copyright could reasonably be applied, then the fact that the process involved an ML algorithm does not preclude a claim of infringement, and it would be reasonable to apply something like a plagiarism checker to the ML output to protect the user from accidental copying. However, the purpose of an ML system like Copilot is synthesis, extracting common (and thus non-creative) elements from many sources and applying them in a new context, the same as any human programmer studying a variety of existing codebases and subsequently writing their own code. The reproduction of these common elements in the ML output can be fair use without impacting the copyrights on the original inputs.
The real question here is why I'm wasting my time attempting a good-faith debate with someone who thinks that "spreading FUD is not necessarily a bad thing…".
I never said that the synthesis process was creative; rather the opposite. The point of a tool like Copilot is not to come up with new, creative solutions, but rather to distill many different inputs down to their common elements ("boilerplate") to assist with the boring, repetitive, non-creative aspects of programming. When the tool is working as intended the output will bear a resemblance to many different inputs within the same problem domain and will not be identifiable as a copy of any particular source. Of course there have been certain notable exceptions where the training was over-fitted and a particularly unique prompt resulted in the ML system regurgitating an identifiable input text mostly unchanged, which is why I think it would be a good idea to add an anti-plagiarism filter on the results to prevent such accidental copying, particularly in cases where it might be less obvious to the user.
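To be concrete about what I mean by "boilerplate", here is a made-up example (not taken from any particular repository; the name xmalloc is just the conventional one) of the kind of error-checked allocation wrapper that appears in nearly identical form across countless C codebases:

```c
#include <stdio.h>
#include <stdlib.h>

// Allocate or die: the same few lines show up, nearly verbatim,
// in a huge number of C projects.
void *xmalloc(size_t size) {
	void *ptr = malloc(size);
	if (ptr == NULL) {
		fprintf(stderr, "out of memory\n");
		exit(EXIT_FAILURE);
	}
	return ptr;
}
```

Nothing in that snippet is creative expression; it is exactly the kind of common element that shows up in many inputs and that a synthesis tool should be able to reproduce without copying anyone.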
> When the tool is working as intended the output will bear a resemblance to many different inputs within the same problem domain and will not be identifiable as a copy of any particular source.
You would have a great argument, and I would actually not be so mad at GitHub, if they had only trained Copilot on such boilerplate/non-copyrightable code. However, they trained it on all of the code in all of the public repositories. That's why we see:
> ...there have been certain notable exceptions where the training was over-fitted and a particularly unique prompt resulted in the ML system regurgitating an identifiable input text mostly unchanged...
The fact that this happens is a sign that GitHub did not train it only on boilerplate; they trained it on truly creative stuff. And they expect people to believe that the output is not under copyright. The gall blows my mind.
But even if it were to take only the most repeated pieces of code and synthesize from those, would that solve the problem?
Not really, because some of the best (i.e., most creative) code is forked the most, meaning that Copilot saw some of the best code over and over.
Here's an experiment you can do (if you have access to Copilot): Start a new C source file, and in a comment at the top, say something like:
```c
// A Robin Hood open addressed map.
map_item(
```
And see what it gives you. I would bet that it will suggest something close to [1], which is my code. (Ignore the license header; the code is actually under the Yzena Network License [2].) Notice that there is no "ymap_item()" function in my code, so this would not be triggering Copilot's overfitting.
The reason I think so is that Copilot doesn't just suggest one line at a time; if it did, an argument could be made that it's only producing boilerplate. Instead, it suggests whole sections of code. A good percentage of the time, maybe even a majority of the time, that is not boilerplate.
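To give a sense of what I mean, here is a rough, hypothetical sketch (invented for illustration; it is not the code at [1], and I am not claiming Copilot would emit exactly this) of what a Robin Hood open-addressed lookup typically looks like in C:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical Robin Hood open-addressed map; the names and layout
// are invented for this sketch, not taken from the code linked above.
typedef struct {
	uint64_t hash;   // cached hash; 0 means the slot is empty
	const char *key;
	void *value;
} slot_t;

typedef struct {
	slot_t *slots;
	size_t cap;      // capacity, assumed to be a power of two
} map_t;

// Distance of a slot's occupant from its ideal (hash-derived) position.
static size_t probe_dist(const map_t *m, size_t idx, uint64_t hash) {
	size_t ideal = (size_t) hash & (m->cap - 1);
	return (idx + m->cap - ideal) & (m->cap - 1);
}

// Look up a key; returns its value or NULL. The Robin Hood invariant lets
// us stop as soon as we have probed further than the slot's own occupant.
void *map_item(const map_t *m, const char *key, uint64_t hash) {
	size_t idx = (size_t) hash & (m->cap - 1);
	for (size_t dist = 0; dist < m->cap; ++dist) {
		const slot_t *s = &m->slots[idx];
		if (s->hash == 0) return NULL;                        // empty slot: not present
		if (dist > probe_dist(m, idx, s->hash)) return NULL;  // key would have displaced this slot
		if (s->hash == hash && strcmp(s->key, key) == 0) return s->value;
		idx = (idx + 1) & (m->cap - 1);
	}
	return NULL;
}
```

Even a straightforward lookup like this carries design decisions (cached hashes, power-of-two capacity, the early exit on probe distance) that go well beyond one-line boilerplate, and that is exactly the kind of multi-line suggestion Copilot produces.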