
The problem is that AI doesn't really "reference" data. When you "train" an AI on some data, you're adjusting billions of model parameters to nudge the model's output closer to the desired output. Except you're also doing that with billions of pieces of other data, many times over, and every piece of data you train on is stepping on every other piece. In order to pay people a share of the 'profits' of AI, you need a clear value chain from licensed training data to each output, through the magical linear algebra data blender that is gradient descent. Nobody knows whether your training example helped, for the same reason nobody knows whether an LLM is lying.
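To make the "blender" part concrete, here's a minimal toy sketch (hypothetical dimensions and names, nothing like a production LLM training loop): every example's update lands in the same shared parameters, so after billions of overlapping updates there's no per-example ledger left to audit.

    import torch

    # Toy stand-in for a model with shared parameters (a real LLM has billions).
    model = torch.nn.Linear(512, 512)
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.MSELoss()

    def training_step(x, y):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()  # gradient w.r.t. ALL shared weights
        opt.step()       # this example's influence is now mixed into the same
                         # tensors as every other example's; nothing records
                         # which example moved which weight by how much

    training_step(torch.randn(8, 512), torch.randn(8, 512))  # one step of billions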

Absent that, you could pay everyone a fixed cut based on presence in the training set, but that gives you the Spotify problem of a fixed pot being shared millions of ways. For example, Adobe recently announced they were building an AI drawing tool trained exclusively on licensed sources - specifically, the work of Adobe Stock contributors[0]. Those contributors are used to being paid when someone buys their image, which gives them an incentive to produce broadly relevant stock photography. But with a fixed "AI pot" paying out, you now have an incentive to produce as much output as possible, as cheaply as possible, purely to claim a larger share of the pot. This is bad both for the stock photo market[1] AND for the AI being trained.

AI is extremely sensitive to bias in the dataset. Normally, when we talk about bias, we think about things like "oh, if I type CEO into Midjourney, all the output drawings are male"; but it goes a lot deeper. Gradient descent has no way to recognize duplicated training data; if the same feature or image shows up many times, it gets that many more chances to adjust the model. Eventually a training example is common enough that outright memorization becomes 'worth it' in terms of the parameters spent on it[2].
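Here's a toy illustration of that duplication effect (made-up numbers and helper name, just to show the mechanism): duplicating an example k times scales its contribution to the average batch gradient by k, so gradient descent pushes k times harder toward fitting it without ever knowing it's a duplicate.

    import numpy as np

    w = np.zeros(3)                      # toy linear model
    X = np.array([[1., 0., 0.],          # row 0: a unique example
                  [0., 1., 0.]])         # row 1: imagine this one duplicated 10x
    y = np.array([1., 1.])

    def mean_grad(w, X, y, counts):
        # gradient of the mean squared error, with each row weighted by how
        # many times it appears in the training set
        residual = X @ w - y
        return 2.0 * X.T @ (counts * residual) / counts.sum()

    print(mean_grad(w, X, y, counts=np.array([1., 1.])))   # both rows pull equally
    print(mean_grad(w, X, y, counts=np.array([1., 10.])))  # the duplicated row
                                                           # dominates the update ~10:1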

Ironically, that sort of memorization would actually make attribution and profit-sharing 'easier', at the expense of the model being far less capable.

[0] Who, BTW, I don't think actually have the ability to opt in to this? As far as I'm aware, this is being done through the medium of contractual roofies dropped into stock photographers' drinks.

[1] Expect more duplicates and spam

[2] This is why early Craiyon would give you existing imagery when you asked for specific famous examples and why Stable Diffusion draws the Getty Images watermark on things that look like a stock photo of a newsworthy event.




> you need a clear value chain from licensed training data to each output, through the magical linear algebra data blender that is gradient descent. Nobody knows if your training example helped

The magical linear algebra data blender that is gradient descent boils down to small additive modifications to the model parameters. We already know how to compute the effects of small additive modifications to the model parameters on the output: that's what the gradient is.

So if you want to know how much a given training sample contributed to a given output, compute the dot product between the training sample's gradient and the output's gradient.

Actually doing that for a billion-parameter model would be slightly expensive, because the gradients are also billion-dimensional, so you'd need to approximate the dot product via dimensionality reduction and use a vector database to filter for training samples with a high approximate dot product.

But I think those layers of approximations would still be better than throwing your hands up in the air and claiming you have no way to know because linear algebra is magic.
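Rough sketch of what I mean (hypothetical helper names; loosely in the spirit of TracIn-style influence scores, with a random projection standing in for the dimensionality reduction):

    import torch

    def flat_grad(loss, model):
        # concatenate the gradient of a scalar loss w.r.t. all model parameters
        grads = torch.autograd.grad(loss, list(model.parameters()))
        return torch.cat([g.reshape(-1) for g in grads])

    def influence(model, loss_fn, train_example, output_example, proj=None):
        (xt, yt), (xo, yo) = train_example, output_example
        g_train = flat_grad(loss_fn(model(xt), yt), model)
        g_out = flat_grad(loss_fn(model(xo), yo), model)
        if proj is not None:  # random projection so the vectors stay small
            g_train, g_out = proj @ g_train, proj @ g_out
        # large positive value => this training sample pushed the parameters
        # in the direction that produces this particular output
        return torch.dot(g_train, g_out).item()

    # toy usage
    model = torch.nn.Linear(4, 1)
    mse = torch.nn.MSELoss()
    score = influence(model, mse,
                      (torch.randn(1, 4), torch.randn(1, 1)),
                      (torch.randn(1, 4), torch.randn(1, 1)))

For a real model you'd precompute the (projected) training gradients once, stick them in a vector database, and at attribution time query it with the projected gradient of the output in question.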


AI could be used to decide whether a source should be included or not (or the benefit to the model could be the qualifier). That would solve your problem of people just peddling spam.

Not that this is an unsolvable problem.

The future also seems less about making the model an all-knowing oracle and more about making it smart enough to know how to look up the data it needs, so we could end up in a place where licensed data is all that's needed for training.

Lastly, what if you use model A to generate data for model B? Would B be tainted? There have been lots of examples where LLMs are used to train simpler models by synthesizing training data.


The thing to note about copyright is that you can't launder it away; infringement "reaches through" however many layers of transformation you add to the process. The question of infringement is purely:

* Did you have access to the original work?

* Did you produce output substantially similar to the original?

* Is the substantial similarity of something that's subject to copyright?

* Is the copying an act of fair use?

To explain what happens to Model B, let's first look at Model A. It gets fed, presumably, a copyrighted data set. We expect it to produce new outputs that aren't subject to copyright. If they're actually entirely new outputs, then there's no infringement - though, thanks to a monkey named after a hyperactive ninja[0], that output is also uncopyrightable. If the outputs aren't new - either because Model A remembered its training data, or because it remembered copyrighted characters, designs, or passages of text - then the outputs are infringing.

Model A itself - just the weights alone - could be argued to either be an infringing copy of the training data or a fair use. That's something courts haven't decided yet. But keep in mind that, because there is no copyright laundry, the fair use question is separate for each step; fair use is not transitive. So even if Model A is infringing and not fair use, the outputs might still be legally non-infringing.

If you manually picked out the noninfringing outputs of Model A and used those alone as the training set for Model B, then arguing that Model B itself is 'tainted' becomes more difficult, because there isn't anything in Model B that's just the copyrighted original. So I don't think Model B would be tainted. However, this is purely a function of there being a filtering step, not of there being two models. If you just had one model and human-curated noninfringing data, there would be no taint there either. If you had two models but no filtering, Model B can still copy stuff that Model A learned to copy. Furthermore, automating that curation would require a machine learning model with a basic idea of copyrightability, plus access to the contents of the original training set.

[0] https://en.wikipedia.org/wiki/Monkey_selfie_copyright_disput...

The whole monkey selfie just being on the article in full resolution is an interesting flex.


Good point. And I meant it should not be an unsolvable problem. Was thinking along the lines of Model B training off of non-infringing work from A.

All good points!



