
The problem is that AI doesn't really "reference" data. When you "train" an AI on some data, you're adjusting billions of model parameters to nudge the model's output closer to the desired output. Except you're also doing that with billions of pieces of other data, many times over, and every piece of data you train on is stepping on every other piece. In order to pay people a share of the 'profits' of AI, you need a clear value chain from licensed training data to each output, through the magical linear algebra data blender that is gradient descent. Nobody knows whether your training example helped, for the same reason nobody knows whether an LLM is lying.
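To make the "blender" part concrete, here's a minimal toy sketch (hypothetical dimensions and names, nothing like a production LLM training loop): every example's update lands in the same shared parameters, so after billions of overlapping updates there's no per-example ledger left to audit.

    import torch

    # Toy stand-in for a model with shared parameters (a real LLM has billions).
    model = torch.nn.Linear(512, 512)
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.MSELoss()

    def training_step(x, y):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()  # gradient w.r.t. ALL shared weights
        opt.step()       # this example's influence is now mixed into the same
                         # tensors as every other example's; nothing records
                         # which example moved which weight by how much

    training_step(torch.randn(8, 512), torch.randn(8, 512))  # one step of billions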

Absent that, you could pay everyone a fixed cut based on presence in the training set, but that gives you the Spotify problem of a fixed pot being shared millions of ways. For example, Adobe recently announced they were building an AI drawing tool trained exclusively on licensed sources - specifically, the work of Adobe Stock contributors[0]. Those contributors are used to being paid when someone buys their image, which gives them an incentive to produce broadly relevant stock photography. But with a fixed "AI pot" paying out, you now have an incentive to produce as much output as possible, as cheaply as possible, purely to claim a larger share of the pot. This is bad both for the stock photo market[1] AND for the AI being trained.

AI is extremely sensitive to bias in the dataset. Normally, when we talk about bias, we think about things like "oh, if I type CEO into Midjourney, all the output drawings are male"; but it goes a lot deeper. Gradient descent has no way to recognize duplicated training data; if the same feature or image shows up many times, it gets that many more chances to adjust the model. Eventually a training example is common enough that outright memorization becomes 'worth it' in terms of the parameters spent on it[2].
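Here's a toy illustration of that duplication effect (made-up numbers and helper name, just to show the mechanism): duplicating an example k times scales its contribution to the average batch gradient by k, so gradient descent pushes k times harder toward fitting it without ever knowing it's a duplicate.

    import numpy as np

    w = np.zeros(3)                      # toy linear model
    X = np.array([[1., 0., 0.],          # row 0: a unique example
                  [0., 1., 0.]])         # row 1: imagine this one duplicated 10x
    y = np.array([1., 1.])

    def mean_grad(w, X, y, counts):
        # gradient of the mean squared error, with each row weighted by how
        # many times it appears in the training set
        residual = X @ w - y
        return 2.0 * X.T @ (counts * residual) / counts.sum()

    print(mean_grad(w, X, y, counts=np.array([1., 1.])))   # both rows pull equally
    print(mean_grad(w, X, y, counts=np.array([1., 10.])))  # the duplicated row
                                                           # dominates the update ~10:1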

Ironically, that sort of memorization would actually make attribution and profit-sharing 'easier', at the expense of the model being far less capable.

[0] Who, BTW, I don't think actually have the ability to opt in to this? As far as I'm aware, this is being done through the medium of contractual roofies dropped into stock photographers' drinks.

[1] Expect more duplicates and spam

[2] This is why early Craiyon would give you existing imagery when you asked for specific famous examples and why Stable Diffusion draws the Getty Images watermark on things that look like a stock photo of a newsworthy event.




> you need a clear value chain from licensed training data to each output, through the magical linear algebra data blender that is gradient descent. Nobody knows if your training example helped

The magical linear algebra data blender that is gradient descent boils down to small additive modifications to the model parameters. We already know how to compute the effects of small additive modifications to the model parameters on the output: that's what the gradient is.

So if you want to know how much a given training sample contributed to a given output, compute the dot product between the training sample's gradient and the output's gradient.

Actually doing that for a billion-parameter model would be slightly expensive, because the gradients are also billion-dimensional, so you'd need to approximate the dot product via dimensionality reduction and use a vector database to filter for training samples with a high approximate dot product.

But I think those layers of approximations would still be better than throwing your hands up in the air and claiming you have no way to know because linear algebra is magic.
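Rough sketch of what I mean (hypothetical helper names; loosely in the spirit of TracIn-style influence scores, with a random projection standing in for the dimensionality reduction):

    import torch

    def flat_grad(loss, model):
        # concatenate the gradient of a scalar loss w.r.t. all model parameters
        grads = torch.autograd.grad(loss, list(model.parameters()))
        return torch.cat([g.reshape(-1) for g in grads])

    def influence(model, loss_fn, train_example, output_example, proj=None):
        (xt, yt), (xo, yo) = train_example, output_example
        g_train = flat_grad(loss_fn(model(xt), yt), model)
        g_out = flat_grad(loss_fn(model(xo), yo), model)
        if proj is not None:  # random projection so the vectors stay small
            g_train, g_out = proj @ g_train, proj @ g_out
        # large positive value => this training sample pushed the parameters
        # in the direction that produces this particular output
        return torch.dot(g_train, g_out).item()

    # toy usage
    model = torch.nn.Linear(4, 1)
    mse = torch.nn.MSELoss()
    score = influence(model, mse,
                      (torch.randn(1, 4), torch.randn(1, 1)),
                      (torch.randn(1, 4), torch.randn(1, 1)))

For a real model you'd precompute the (projected) training gradients once, stick them in a vector database, and at attribution time query it with the projected gradient of the output in question.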


AI could be used to decide whether a source should be included or not (or the benefit to the model could be the qualifier). That would solve your problem of people just peddling spam.

Not that this is an unsolvable problem.

The future also seems less about making the model an all-knowing oracle and more about making it smart enough to know how to look up the data it needs, so we could end up in a place where licensed data is all that's needed for training.

Lastly, what if you use model A to generate data for model B? Would B be tainted? There have been lots of examples where LLMs are used to train simpler models by synthesizing training data.


The thing to note about copyright is that you can't launder it away; infringement "reaches through" however many layers of transformation you add to the process. The question of infringement is purely:

* Did you have access to the original work?

* Did you produce output substantially similar to the original?

* Is the substantial similarity of something that's subject to copyright?

* Is the copying an act of fair use?

To explain what happens to Model B, let's first look at Model A. It gets fed, presumably, a copyrighted data set. We expect it to produce new outputs that aren't subject to copyright. If they're actually entirely new outputs, then there's no infringement - though, thanks to a monkey named after a hyperactive ninja[0], that output is also uncopyrightable. If the outputs aren't new - either because Model A remembered its training data, or because it remembered copyrighted characters, designs, or passages of text - then the outputs are infringing.

Model A itself - just the weights alone - could be argued to either be an infringing copy of the training data or a fair use. That's something courts haven't decided yet. But keep in mind that, because there is no copyright laundry, the fair use question is separate for each step; fair use is not transitive. So even if Model A is infringing and not fair use, the outputs might still be legally non-infringing.

If you manually picked out the noninfringing outputs of Model A and used those alone as the training set for Model B, then arguing that Model B itself is 'tainted' becomes more difficult, because there isn't anything in Model B that's just the copyrighted original. So I don't think Model B would be tainted. However, this is purely a function of there being a filtering step, not of there being two models. If you just had one model and human-curated noninfringing data, there would be no taint there either. If you had two models but no filtering, Model B can still copy stuff that Model A learned to copy. Furthermore, automating that curation would require a machine learning model with a basic idea of copyrightability, plus access to the contents of the original training set.

[0] https://en.wikipedia.org/wiki/Monkey_selfie_copyright_disput...

The whole monkey selfie just being on the article in full resolution is an interesting flex.


Good point. And I meant it should not be an unsolvable problem. Was thinking along the lines of Model B training off of non-infringing work from A.

All good points!



