Are those same people also going to stop speaking in public? Throughout human history, anyone can profit off of hearing something you said in public. You should be fearful of a timeline where people are compensated for whatever they post on the internet. I shudder to think what those discussions would look like.
> Throughout human history, anyone can profit off of hearing something you said in public.
I think the big difference here is that this will now be automated. This comment I'm writing right now is being "donated" to any company that wants Hacker News in its dataset, and they don't even need to go through the work of reading my comment to use it. They just feed it in along with hundreds of terabytes of random text they get from the internet and apply heuristics or other models to filter the dataset.
I feel somewhat uneasy about it, especially when writing code now. I don't think my code is good in any way, shape, or form, but there's a reason I license it under the GPL: I don't want it used without something being contributed back in some way. I write it for the greater good. License infringement was always a touchy subject, since it's quite hard to detect in proprietary software, but now that an opaque black box is involved in the process, the people who write the software might not even know they violated a license.
Just throwing ideas out there; I'm not really for or against LLMs using public datasets. On one hand, it levels the playing field among all participants. On the other hand, big tech will always have an edge in training and testing these large language models. My current stance is to just wait and see what happens, and then react accordingly.