
The thing to note about copyright is that you can't launder it away: infringement "reaches through" however many layers of transformation you add to the process. The question of infringement comes down to:

* Did you have access to the original work?

* Did you produce output substantially similar to the original?

* Is the substantially similar material something that's subject to copyright?

* Is the copying an act of fair use?
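The four questions above can be read as a simple decision procedure. Here's a loose sketch (not legal advice, and certainly not how courts actually work); every predicate is a hypothetical stand-in for a judgment a court makes case by case:

```python
def is_infringing(had_access: bool,
                  substantially_similar: bool,
                  similarity_is_copyrightable: bool,
                  is_fair_use: bool) -> bool:
    """Infringement requires access plus substantial similarity in
    copyrightable material, with no fair-use defense."""
    return (had_access
            and substantially_similar
            and similarity_is_copyrightable
            and not is_fair_use)

# All four prongs go the plaintiff's way -> infringement:
print(is_infringing(True, True, True, False))   # True
# Same copying, but it qualifies as fair use -> no infringement:
print(is_infringing(True, True, True, True))    # False
```

The point of framing it this way is that the test is conjunctive: failing any single prong (no access, no similarity, similarity only in uncopyrightable elements, or fair use) defeats the claim.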

To explain what happens to Model B, let's first look at Model A. It gets fed, presumably, a copyrighted data set. We expect it to produce new outputs that aren't subject to copyright. If they're actually entirely new outputs, then there's no infringement - though, thanks to a monkey named after a hyperactive ninja[0], such machine-generated output is also uncopyrightable. If the outputs aren't new - either because Model A memorized its training data, or because it memorized copyrighted characters, designs, or passages of text - then the outputs are infringing.

Model A itself - just the weights alone - could be argued to be either an infringing copy of the training data or a fair use. That's something courts haven't decided yet. But keep in mind that, because there is no copyright laundry, the fair use question is separate for each step; fair use is not transitive. So even if Model A is infringing and not fair use, the outputs might still be legally non-infringing.

If you manually picked out the non-infringing outputs of Model A and used those alone as the training set for Model B, then arguing that Model B itself is 'tainted' becomes more difficult, because there isn't anything in Model B that's just the copyrighted original. So I don't think Model B would be tainted. However, this is purely a function of there being a filtering process, not there being two models. If you had just one model trained on human-curated non-infringing data, there would be no taint there either. If you had two models but no filtering, then Model B can still copy anything Model A learned to copy. Furthermore, automating the curation would itself require a machine learning model with a basic grasp of copyrightability, plus access to the contents of the original training set to compare against.
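The curation step described above can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical `is_noninfringing` check (in practice a human reviewer, or the difficult automated classifier just mentioned); the data shapes are made up:

```python
def build_model_b_training_set(model_a_outputs, is_noninfringing):
    """Keep only Model A outputs judged non-infringing; everything
    else is dropped before Model B ever sees it."""
    return [out for out in model_a_outputs if is_noninfringing(out)]

# Toy example: flag outputs that reproduce memorized training text.
outputs = [
    {"text": "an entirely new poem", "memorized": False},
    {"text": "a verbatim chapter",   "memorized": True},
]
curated = build_model_b_training_set(outputs, lambda o: not o["memorized"])
# Only the first output survives; Model B would train on `curated`.
```

Note that the filter, not the extra model, does all the legal work here: the same function applied to any training corpus is the whole argument.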

[0] https://en.wikipedia.org/wiki/Monkey_selfie_copyright_disput...

The monkey selfie being right there on the article in full resolution is an interesting flex.

Good point. And I meant it should not be an unsolvable problem - I was thinking along the lines of Model B training on the non-infringing output from A.

All good points!
