You’re hitting on the distinction between duplication and training.
I own hundreds of paperback books. Copyright law does not limit what I can learn from them.
It may be that assembling a corpus for training is illegal, but if so, that would be true even if it was never used for training. The act of training an AI is orthogonal to the collection of the corpus.
I think I’m not necessarily talking about training. It’s the duplication before training, and maybe just the fact that training consumes the entire work, which I think requires permission under copyright law (which typically means payment).
> It may be that assembling a corpus for training is illegal, but if so, that would be true even if it was never used for training.
Yeah, exactly! You’re right that copyright law doesn’t limit what you can learn at all, and doesn’t protect ideas. But it does, I think, limit whether you’re allowed to read the copyrighted work in its entirety in the first place, if you haven’t paid for it or legally borrowed a copy or whatever. Gaining access to the material is covered under the law, right? This does mean, I suspect, that assembling a corpus of copyrighted training material is not allowed under copyright law, unless it was all paid for or licensed with permission.
If the AI companies have paid for all the material they used to train, then my question might be moot, but I’m assuming they didn’t pay for it. This is murky when there’s a lot of copyrighted material that’s available online, maybe with the intent that it would be consumed in small parts and not copied wholesale by machines for the sole purpose of making software that can replicate the content and style of what it learned.