>And if you're a data provider, are there any assurances that OpenAI isn't just scraping the output and using it as part of their RLHF training loop, baking your proprietary data into their model?
I don't think this should be a major concern for most people:
i) What assurance is there that they won't do that anyway? You have no legal recourse against them scraping your public website (see LinkedIn's failed legal battles over scraping).
ii) Most data providers update their data over time; how would ChatGPT know when its copy is stale?
iii) RLHF is almost useless for teaching a model new information, and fine-tuning to learn new data is extremely inefficient. The bigger concern is that your data ends up in the training set for the next model.
To me, the logical outcome of this is the siloization of information.
If display-ad revenue dries up as a way of monetizing knowledge and expertise, why would we assume the same quality of information will keep being published for free on the public internet?
Instead: paywalls on steroids for "vetted" content, and an increasingly hard-to-navigate mix of people sharing good info for free plus spam and misinformation (now also machine-generated!) trying to capture what's left of the search traffic and display-ad market.
The AI has to learn from something. A lot of the people feeding the internet with content today are getting paid for it one way or another, in ways that won't hold up if people stop using the web as-is.
Solving the problem of acquiring new content for AI models, and paying for it, will be interesting.
People are highly egotistical and love feeding endless streams of video and pictures online, and our next-generation models will be there to slurp it all up.