
> it doesn't mean every single line in isolation is copyrightable

Microsoft did not just copy individual lines. They fed whole repositories into their model, ignoring the license (if it exists) even though they knew from the start that information generated by the model will be publicly available. Available usually out of context, but nonetheless - the scope of the input and intent are very clearly "everything" and "redistribution".

Just adding a filter/ML model to the output shouldn't matter. I dare you to build a Copilot clone trained on leaked internal Microsoft code and then try to argue the output is a bit mixed up.

That is a clear violation imho.




Copilot was trained on leaked internal Microsoft code that's on GitHub at the moment. Anyway, everyone seems perfectly ok with training language models on copyrighted text.


Everyone is not perfectly OK with training language models on copyrighted text. It's just that evilCorps do it anyway, and there's nothing anyone can do to stop them. I can't do anything. At best, I could get a Twitter account and complain to the ether. The copyright holders can't do anything against the might of evilCorps, but that doesn't make them okay with it. The fact you believe this is just sad, and exactly what evilCorps want from you.

This goes beyond fair use or satirical/comedic effect. They are training their models to output text in the style of the authors being absorbed. The style is exactly the artistic effect that is being copyrighted.


Could you explain why you think training models on copyrighted text is illegal or copyright infringement or whatever else it might be?


Training the models is fine. Applying the models, which reproduces copyrighted works without proper attribution, is where it gets sticky.


My explanation will not be popular here on HN, but I'm never one to shy away. Especially when asked directly.

Buying a book, an audio CD, or a DVD/Blu-ray grants the holder permission to read, listen to, or view that product as a single instance. You can lend them out, but that's all you're really allowed to do with them. The text, audio, or video is not owned by you to do with as you please. People obviously do not like that, and argue making copies/backups is their right. Maybe that's acceptable, but we can agree that posting them on torrents, or sharing in any other manner a copy made from the thing you have, is not.

That said, training a model on someone's copyrighted text is not part of the agreement for the usage of said text, whether it's a copyrighted magazine, newspaper, or book. If the people doing the training reach out to the copyright holders and get specific permission to use their copyrighted material in such a manner, then go ahead. The fact that people feel like they can do anything without the common courtesy of asking for permission is troubling to me that we've lost something as a society. There's no acknowledgment that someone has created something by their own work so that the creator can do with it as they please. A large portion of people believe that because it was created they deserve/should be able to/etc. do whatever they want with someone else's creation. Including getting paid for derivative works from the original creation.


> The fact that people feel like they can do anything without the common courtesy of asking for permission is troubling to me that we've lost something as a society.

I see this sentiment a lot in FOSS spaces but I don't really understand why. The role of judicial process _isn't_ to provide a guiding moral philosophy around social organization. Depending on the government in question that's either a role of government functions or isn't something that should be guided at all. The role of law often (and yes, not in all governments, but at least in the US) is to offer a contract between the state and the individual.

I understand the potential for abuse here in using Copilot to regurgitate licensed works without adhering to the terms of the work's license, but I'm not fluent enough in law to know if this is illegal or not. Calling out and specifically applying strict limits to this practice is certainly something I'm sympathetic to, and I'm very curious to see what the courts come up with. But swayed by a moral argument I am not.


In the realm of FOSS, I feel like the comparison isn't the same. The FOSS devs created the work and released it with the express understanding that someone else could update/modify that work. Writing/art/videos are rarely released with a copyright that allows this kind of modification. That's a huge difference. There are some FOSS releases that allow personal/private use while restricting commercial use. This is closer to the books/movies type of scenario.


I mean sure, but these are both legally defined works with licenses that govern their use. The difference is in the style of license. FOSS doesn't get a special moral valence because individuals are authors and they offer their work for editing and remixing under narrow circumstances. I mean, if Jeff Bezos today were to release code he wrote by hand with GPLv3 and were to cry foul over Copilot, I doubt anyone would care (or he'd get made fun of online.) Why does FOSS get treated so differently?


> People obviously do not like that, and argue making copies/backups is their right.

In some jurisdictions this is in fact their right by law as long as they own the original (the music/film industry of course used this as an excuse to slap additional fees on every sale of any storage medium). Redistribution is different however.


> My explanation will not be popular here on HN How is this better than ’bring on the downvotes’?

Moving on, I’ll put this to you: you claim training an ML model against copyrighted text is in violation of the ‘permission’ granted by the rights holder. However, flip this on its head for a moment – that’s basically all human brains do. Clearly, the greatest writers of our time haven’t written their works in a vacuum. Rather, that historical reading and inspiration becomes sufficiently obfuscated that we deem something creative enough to be granted its own copyright.

Fundamentally, how does Copilot differ, other than perhaps being a poor implementation? Is it by not being ‘adequately creative’ enough? Is there some future version you could envision that would be, or is it the principle you’re arguing against?


Human beings commit copyright infringement all of the time. People have been lifting riffs from music, sometimes unconsciously, forever. This is why clean room implementations are done sometimes when writing software.

Also, you're taking the machine learning metaphor literally. AI models do not "learn", they're just statistical models, they don't understand anything. There is no comparison to human learning that isn't superficial or metaphorical.

The real question is how Copilot is any different than a compiler, or lossy encoding or compression.


I don't agree with your premise. Humans can consume creative works and be influenced; this is not in question. Unless one is an impressionist, they aren't going to try to recreate exactly the works done by the artists they have been influenced by. Even when an artist does something inspired/influenced by another, they have pretty much stated that. Musicians cite prior bands; writers, painters, etc. all credit their influences.

I'm probably just a curmudgeon, but I don't understand the point of Copilot. So I'm probably not the best person to opine about it. However, I am very opinionated about copyright in a manner that typically flows against HN groupthink.


Copilot isn't intending to copy entire code bases either.


>> My explanation will not be popular here on HN How is this better than ’bring on the downvotes’?

I totally missed the non-wrapped question.

Because I don't give a crap about down-votes/up-votes. I just know from experience my views on copyright do not gel with the majority views on HN. I was just acknowledging that fact. Conversations can be had regardless of votes. My views on Napster/MP3 trading are in the same realm (and somewhat related to copyright issues). I was a co-owner of a small music site when Napster was in its heyday, and we saw direct repercussions of people not buying music because they got it from MP3 trading. Groupthink here is all "things for free when I want it, how I want it", yet I still have conversations. I'm not afraid of a measly -4 points because my thoughts are contrary to groupthink.

At the same time, if something like this gets your goat, how is asking how something is better any better in and of itself?


If a trained language model exactly reproduces copyrighted text, is there any question about whether copyright still applies?


But then the infringement is done by the person who publishes that output, not by the text editor that copies the code.


This is a useless hypothetical; no language models do that.


And yet there are plenty of examples of Copilot reproducing copyrighted code verbatim, like it does in this example[1] that was posted on HN.

[1] https://twitter.com/mitsuhiko/status/1410886329924194309


This is precisely what Copilot does, regularly.


The search engine on GitHub also calls up entire pages of GPL-licensed code verbatim. Does it run afoul of copyright?



