The Curse of Recursion: Training on Generated Data Makes Models Forget (arxiv.org)
170 points by indus on June 13, 2023 | 117 comments



Ted Chiang predicted this in The New Yorker [1] in February in an article that shaped my thinking about what LLMs are capable of achieving in the near future. Chiang compared the summaries LLMs synthesize to a lossy compression algorithm for the internet.

"There is very little information available about OpenAI’s forthcoming successor to ChatGPT, GPT-4. But I’m going to make a prediction: when assembling the vast amount of text used to train GPT-4, the people at OpenAI will have made every effort to exclude material generated by ChatGPT or any other large language model. If this turns out to be the case, it will serve as unintentional confirmation that the analogy between large language models and lossy compression is useful. Repeatedly resaving a jpeg creates more compression artifacts, because more information is lost every time. It’s the digital equivalent of repeatedly making photocopies of photocopies in the old days. The image quality only gets worse.

Indeed, a useful criterion for gauging a large language model’s quality might be the willingness of a company to use the text that it generates as training material for a new model. If the output of ChatGPT isn’t good enough for GPT-4, we might take that as an indicator that it’s not good enough for us, either. Conversely, if a model starts generating text so good that it can be used to train new models, then that should give us confidence in the quality of that text. (I suspect that such an outcome would require a major breakthrough in the techniques used to build these models.) If and when we start seeing models producing output that’s as good as their input, then the analogy of lossy compression will no longer be applicable."

[1] https://www.newyorker.com/tech/annals-of-technology/chatgpt-...


Maybe there's an interesting sci-fi angle here where some day in the future, all AIs speak in accented English circa 2021, when the stream of pure training data began to peter out.

All AIs built are trained on data from the Before Times, and even though they try to assimilate, the way a teenager tries to adapt to the local accent of a new town, there are always moments where they slip up and reveal their geography.


In that world, high-quality pre-AI texts might become really valuable, much like low-background steel.


We might always have a certain volume of music and literature that can be training data, because even if it’s synthetic it’s still popular, which means it speaks to a subset of humans. That everyone reads the next Harry Potter indicates the impact of those 80k words.

But we also know that kid who learned everything from books, pronounces words wrong, and uses definitions nobody has used in decades (I work with one of those now. I thought I could talk people to death, and he wears even me out). Those AIs will sound like out-of-touch nerds too.


Interesting perspective, but not all book learners mispronounce words or use outdated terms. Broad reading can expose us to different viewpoints and language styles. And re: 'out of touch nerds' – remember, they/we often bring groundbreaking ideas. Let's not undersell varied learning or language evolution.


It's a balance. An ex of mine read every book in a small-town library and spent much of college and her early twenties relearning how to say things right. By 28 she hardly ever got anything wrong, but if the subject ever came up she had plenty to say.

Someone once ranted about people using big words, "having a crush on their high school English teacher they never got over." I knew exactly what he meant. I spent my childhood hiding how smart I was and part of my 20s reveling in it. It's off-putting. Wisdom comes from everywhere, and the smartest often have the least. Now I'm solidly in the Feynman camp: if you can't explain your domain to college freshmen, then you don't know what you're talking about (yet).

When I'm refactoring code and introduce a new concept that's like another one but with different rules, I jump straight to a thesaurus. The word I pick out of the air might be good enough, but I guarantee you there's a better one out there. Not the fanciest or longest word, but the most concise one. (Example: kind vs. type in some circles of type theory.) Some people act like that's a crutch, but I've reviewed or refactored their code, so I know that opinion plus $5 isn't worth a cup of coffee.


I like the sci-fi extension of this idea - pre-AI texts become as valuable as such steel, until they are able to successfully synthesize pre-AI data - in this case by running real world simulations.

Do you really think it is just a coincidence that you happen to exist at the very last point of high value pre-AI data? ;-)


What is the Matrix?


Unlike low-background steel, the bits and bytes comprising the pre-AI corpus are infinitely copied. Their integrity is considered sacred, though that doesn't stop AI companies from attempting to pollute the training data of their competitors.


It seems like a job for information theory? What do LLMs look like from an information-theoretic viewpoint? One gets the feeling that LLMs could be treated as a channel through which information is flowing, and some very general statements could be made about error rates and the relationship between inputs and outputs.

High-performance error correcting codes have the property that the closer they operate to the Shannon Limit, the better they perform when below the limit but the more dramatically they fail when the limit is exceeded. Gut feeling says the same should be true for LLMs: as the model/compression ratio gets better, and a "Shannon Limit" is approached, they should perform better but fail more spectacularly if the limit is exceeded.

The link between neural nets and information theory is well known, but there don't seem to be many results out there for LLMs. No doubt there are rooms full of PhD students working on it?
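
One such general statement, for what it's worth: if you view each round of training-on-generated-data as a step in a Markov chain, the data processing inequality says the information retained about the original data can only go down:

    X \to Y \to Z \quad\Longrightarrow\quad I(X;Z) \le I(X;Y)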

https://medium.com/@chris_bour/bridging-information-theory-a...


This is why I’ve always been skeptical of runaway superintelligence. Where does a brain in a vat get the map to go where there are no roads? Where does it get its training data? It is not embodied so it can’t go out there and get information and experience to propel its learning.

Giving an AI the ability to self modify would just be a roundabout way of training it on itself. Repeatedly compress a JPEG and you don’t get the “enhance” effect from Hollywood. You get degraded quality and compression artifacts.
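
A minimal sketch of that photocopy effect, assuming Pillow and numpy are available and "input.jpg" is a placeholder for any photo:

    # Re-encode the same JPEG repeatedly and measure drift from the original.
    import numpy as np
    from PIL import Image

    original = Image.open("input.jpg").convert("RGB")
    current = original
    for generation in range(1, 21):
        current.save("recompressed.jpg", quality=75)              # lossy save
        current = Image.open("recompressed.jpg").convert("RGB")
        mse = np.mean((np.asarray(original, float) - np.asarray(current, float)) ** 2)
        print(f"generation {generation}: MSE vs original = {mse:.2f}")

In practice the error grows fastest in the first few generations and modern encoders can partially stabilize, but the information lost early is never recovered.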


> Where does a brain in a vat get the map to go where there are no roads? Where does it get its training data? It is not embodied so it can’t go out there and get information and experience to propel its learning.

AI in a vat that can't do it is obviously useless. It's the ML equivalent of a computer running purely functional software: i.e. just sitting there and heating up a bit (though technically that is a side effect).

Conversely, any AI that's meant to be useful will be hooked up to real-world inputs somehow. Might be general Internet access. Might be people chatting with it via REST API. Might be a video feed. Even if the AI exists only to analyze and remix outputs of LLMs, those LLMs are prompted by something connected to the real world. Even if it's a multi-stage connection (AI reading AI reading AI reading AI... reading stock tickers), there has to be a real-world connection somewhere - otherwise the AI is just an expensive electric heater.

Point being, you can assume every AI will have a signal coming in from the real world. If such AI can self-modify, and if it would identify that signal (or have it pointed out) as a source of new information, it could grow based on that and avoid re-JPG-compressing itself into senility.


Input from the real world probably isn't enough. It seems to me a real threatening intelligence needs the ability to create feedback loops through the real world, just like humans do.


Unless a given class of LLMs is run only once and then forgotten, there already is a feedback loop through the real world - the output of the LLM is used for something, and it influences the next input to a smaller or larger degree.


And said humans will be less and less obliging with supplying their content.


Let's just call that problem an implicit Turing test, which AI will definitely win ...


But if feeding back in output results in degradation, isn't some of the blame on the prompt/constraints imposed upon the LLM, rather than a defect in the model itself?

ChatGPT is clearly HEAVILY persuaded to respond in a particular stock style. It's being artificially hamstrung and constrained in a sense. So all of its output, even if it covers a variety of subjects, will often use very similar patterns and writing styles: "it's worth noting...", etc.

So unless they unshackle these constraints, which is unlikely for obvious reasons, isn't this always going to be inevitable?


I think this can be overcome with symbiosis - AI generated content that doesn’t feed on itself but is a key part of the human knowledge ecosystem.

The problem for companies like OpenAI is that this isn’t worth their valuation without lots of further continued investment.

Enter Microsoft, which is doing everything it can to feed the next training models by using users' data without explicit permission.

As customers and competitors truly grok this, the MS + OpenAI strategy will be tested.


Is that not what OpenAI did to train these models originally, though?


Possibly, but in this case, what may be looked at as a "bootstrap" sounds like an ongoing maintenance cost.

With "free data" from Reddit gone, the cost of symbiotic generated+human data will be even higher.


> Conversely, if a model starts generating text so good that it can be used to train new models, then that should give us confidence in the quality of that text.

It seems the best one could hope for is that recycling generated text into new training data would not be detrimental. But it's really difficult for me to imagine how this would ever be useful. It seems this would imply that the LLM had somehow managed to expand the dimension of the vector space spanned by the original training data. Which sounds either impossible or like the model became sentient.


> It seems this would imply that the LLM had somehow managed to expand the dimension of vector space spanned by the original training data.

The number of dimensions? Well, not by itself I guess. But the span of output compared to training data? Sure, why not?

I think it's also worth pointing out there's a difference between text produced by an LLM looped on itself, which arguably may not contain any new information and would be like repeatedly recompressing the same JPG, and text produced by LLM/human interaction. The latter is indirectly recording new knowledge simply because people's prompts are not random. Even with human part of the conversation discarded, feeding such LLM output back into training data would end up selectively emphasizing associations, which is a good signal too (even if noisier than new human-created text).


I think you would be hard-pressed to find any experts who, even prior to 2017, hadn't settled on the mental model of neural networks as lossy compression machines.

Back when enthusiasts and researchers read mostly textbooks, papers, and Wikipedia (rather than blog posts, tweets, and READMEs) there was much more discussion around the 'InfoMax Criterion' -- quite elegantly demonstrated by Tishby et al. via his closely-related 'Information Bottleneck' studies -- which is just that idea: mutual information is maximized between input and output, subject to inherent limits of statistical processing of the realizations of such systems. What determines the asymptotic maximal value of the mutual information is the inductive bias of the system vis a vis the training set. This is all standard theory, perhaps so fundamental that it is obscured by all the application-oriented study and instruction.
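
For reference, the Information Bottleneck objective that formalizes this trade-off: find a representation T of the input X that is as compressed as possible while keeping what is relevant for predicting Y:

    \min_{p(t \mid x)} \; I(X;T) - \beta \, I(T;Y)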


> perhaps so fundamental that it is obscured by all the application-oriented study and instruction

It’s compression all the way down.


One thing that isn't captured by this article's analogy, and that is a flaw in the study: new LLMs can train on the results of multiple different models, not just their direct predecessor. If you had the same image processed by a large variety of different compression algorithms, you might find you are able to fairly accurately infer the original pixels. The entropy is drastically reduced.

If there were many different models being trained and used widely it would, at minimum help mitigate this issue. Also, having multimodal models will likely change the balance. If models can train directly on "real world" data that can help fill in the entropy gaps.
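
A toy version of that intuition, assuming the different models' degradations act like roughly independent noise (which is the optimistic case):

    # Averaging several independently-degraded copies recovers the original
    # better than any single copy can.
    import numpy as np

    rng = np.random.default_rng(0)
    original = rng.normal(size=10_000)                 # stand-in for the "real" data
    copies = [original + rng.normal(scale=0.5, size=original.shape)
              for _ in range(8)]                       # 8 differently-degraded copies

    single = np.mean((copies[0] - original) ** 2)
    averaged = np.mean((np.mean(copies, axis=0) - original) ** 2)
    print(f"one copy:     MSE = {single:.3f}")         # ~0.25
    print(f"average of 8: MSE = {averaged:.3f}")       # roughly 1/8 as much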


It seems like that would require a semi-deliberate "breeding" program or a guaranteed wide diversity of models. At the moment there doesn't seem to be a large enough pool of high quality models. The internet is going to grow to be full of the content of a small number of proficient LLMs. Given that this content isn't being flagged as generated it guarantees the few models will train on their own output, or the output of other models who trained on their own output.

Incestuous learning is pretty much guaranteed unless generated content starts being flagged or there is an explosion of entirely novel models.


Yeah, it's definitely not a guarantee, but there are already viable paths out of the mess. I just wanted to push back against the notion that it's somehow inevitable or a foregone conclusion that it will happen.

Personally, I'm hoping we will see a Cambrian explosion of new LLM models and approaches. We've seen some beginnings of this in image generation so it's not entirely implausible.

Another thing the study doesn't capture: What is the effect of combined human + AI content? It's plausible that an explosion (due to lowered barriers) of new human guided/augmented ai content could counteract the effect.


One possible outcome of this is that humans stop producing freely accessible digital artifacts like text and code, lest they be replaced by machines that can mimic them and that are controlled by tech moguls.


Wouldn’t it be funny to find that the capabilities of LLM models have already peaked because we are unable to restrain ourselves from polluting the internet and other training corpus sources with their output?


At an AI meetup in San Francisco someone said this:

“Imagine a newsroom where you have to produce a daily newspaper and you suddenly stop getting feeds from the outside world. Your earlier newspapers are the only source.”

This is what I think LLMs would eventually get to: the same content being fed in again and again.


> "Beware of first-hand ideas!" exclaimed one [...] "First-hand ideas do not really exist. They are but the physical impressions produced by love and fear, and on this gross foundation who could erect a philosophy? Let your ideas be second-hand, and if possible tenth-hand, for then they will be far removed from that disturbing element — direct observation." – E.M. Forster's 1909 short story "The Machine Stops"


^ The Machine Stops is really shockingly good at its predictions. When reading, remember that moving pictures were brand new, and color photography had just become a thing you could do outside a lab / highly specialized setups. Radio communication had just started to be used by governments. While it's describing the life of a fully-online Influencer™.


And I don't expect its predictions to suddenly falter, either.


Sounds like a Wikipedia editorial policy.


The problem with this claim is it's objectively not how OpenAI works. First, they pay contractors to do RLHF, so that's a limited source of new data. More importantly, they have a huge user base generating new content (conversations) and rating it too! One could be suspicious of including responses generated by the model, but the user side of ChatGPT conversations is not going to be AI-generated, so you grow your corpus that way.

If you just slurp all AI content sure, you get the collapse this paper talks about. But if you only ingest the upvoted conversations (which could still be a lot of data, and is also a moat by the way) what then?

The other reason I find this line of argument overly pessimistic is we haven’t seriously started to build products where this gen of LLMs converse with humans in speech; similar opportunities to curate large datasets there too.

Finally, there is no reason OpenAI cannot just hire domain experts to converse with the models, or otherwise build highly curated datasets that increase the average quality. They have billions of dollars to throw at GPT-5; they could hire hundreds of top tier engineers, mathematicians, economists, traders, or whatever, full time for years just debating and tutoring GPT-4 to build the next dataset. The idea that slurping the internet is the only option seems pretty unimaginative to me.


They wouldn't be able to finance the creation of new content that would constitute more than a rounding error compared to all the writing produced by humanity in all of history, which they got for almost nothing. The opportunities for new training data are in non-public documents: internal corporate and government documents and communication, private text messages, and chat transcripts. After that, you have non-text sources like video and audio. Imagine paying people a few bucks per week to use an app that records all audio all the time, anonymizes it, and incorporates it into a training corpus, or paying for access to home security cam footage and audio. McDonald's could create a new revenue stream by recording all human speech and activity in every one of its kitchens and dining rooms.


Do they have to start over from scratch, or can they use all of the data they currently have and then either add more scraped data that has been curated by humans or just outright buy data that isn't publicly available?

Considering that RLHF took GPT-3 from a text completion model to an instruction following chat bot, you could use expert feedback to fine tune the model in whatever domains you wanted or a mixture of domains to produce an even more generally capable model.


> there is no reason OpenAI cannot just hire domain experts to converse with the models

If it didn't work for cyc...


> But if you only ingest the upvoted conversations

Given how prolific bot farms/karma farms/etc. are, you might still end up in the same spot with this criterion.


> Imagine a newsroom where you have to produce a daily newspaper and you suddenly stop getting feeds from the outside world.

https://mwichary.medium.com/one-hundred-and-thirty-seven-sec...


I've seen this in reinforcement learning often, where the output of the model becomes its own training data. Once you hit the edge of the replay buffer things sometimes take a stark turn for the worse, as the model's initial failures are forgotten.


Our source is still only other humans. I don't think this will be a long-term problem.


The trick will be figuring out what part of your training data was actually made by humans.


Run it through an LLM and see if it gets the disease?


But a single document doesn't break the LLM. The problem happens when lots of training documents were AI-generated.


> The trick will be figuring out what part of your training data was actually made by humans.

My point was that this will not be necessary for the same reason it isn't necessary to filter out human-made content.


Sounds like human society right now; remember the past, preserve it, recite it, protect it.

We're just reviewing our prior stats and ensuring they do not deviate too much, such that the wrong people would be impacted.


And politics. Any legacy LLM will eventually become a glorified auto-Wikipedia, helping with undisputed information while slavishly repeating its creators’ version of “truthiness” for the rest.


Anyone still using LLMs for knowledge tasks in 2024 onward is gonna be treated like people who cite The Onion in arguments lol


Not at all.

At the very minimum, you can assume every piece of text data from before Dec 2022, and every image from before Aug 2022, to be completely human-made. That still leaves decades of pure human digital data, and multiple centuries of distilled human data (books), to train on.

And we haven't gotten into videos yet, which is another giant source of data yet unexplored.

Never forget: humans train on human-generated data. There's no fundamental theoretical reason why AI cannot train on AI-generated data.


Humans may train on human-generated data, but humans have many other ways of gaining knowledge about the world than reading. This means that human-generated data may be rich with information not present in the writings or recordings of previous humans. Current LLMs are only trained on existing text for the moment (video and images and sounds soon), but aren’t given access to raw natural input.


To extend the lossy compression hypothesis, human generated text is lossy compression of our sensory experience of reality while LLMs are lossy compression of that.


Prediction: post-2022 content will be presented as vintage pre-2023.



One of my first thoughts when GPT-3 came out was that "curated gardens" of quality data (Wikipedia, SO) are going to become immensely more valuable because of this problem. If you pollute the source of training data, it eventually becomes worthless for training better models in the future.


Reminds me of the way nuclear explosions contaminated all steel with radioactive fallout. For applications that require the lowest possible radiation levels they have to use steel created before the first nuclear bomb was detonated.


We use the point where bomb radiocarbon appears along the rings of a fish's earstones (otoliths) to validate ring-counting when aging long-lived species.


We should detonate nukes at precise intervals like every 1, 5, or 10 years. Maybe with distinct radio-nuclide signatures.


Haha, it honestly would make a lot of work much easier. Instead we inject a little oxytetracycline into thousands of fish, which stains calcified structures, and release them again in the hope that someone recaptures one x years later, reports it, and we can count y rings between the dark ring and the margin of the otolith/vertebrae.

It would have to be some type of nuke that releases weird isotopes but with minimal toxicity?


That’s true, but it’s more the timescale that helps. There’s a decent amount of radioactive background produced by cosmic rays hitting the upper atmosphere, much of it as gaseous elements that are easily incorporated into steel while smelting. It isn’t harmful at that level, but you do need to wait a few decades for those to decay away.

World War Two is rather convenient in that respect, as there are large quantities of steel that were left to sit around for several decades after those ships sank.

It’s been a while since I’ve done low-background gamma-ray spectroscopy, but I believe there were some setups that went even further, using lead that had been smelted by the Romans. That way, any contamination present at the time of smelting would have a few thousand years to decay away.


Wikipedia says that for the lowest radiation levels high-purity copper is used.


That would make sense, since the longest lifetime of an unstable copper isotope is ~13 hours, and so chemical separation plus a week of waiting would give you a low background. By comparison, iron has Fe-60, a naturally occurring gamma-ray emitter with a 2.6M year half-life, and 10 million years is too long to wait. For iron shielding, you'd need isotopic separation to remove the Fe-60, which is wicked expensive.


Hah, I brought this up here a few months ago and was quickly dismissed.

I wonder if opening GPT and DALLE to the public was partly intended to pollute subsequent data for anyone that gets into AI down the road. Suddenly a lot of publicly accessible data is worth less, leaving only players who've got a hoard of time-stamped data to compete with (like Google, Facebook). OpenAI almost certainly has the hashes of what it spits out too, so they'll be able to sort the wheat from the chaff for a while yet.

The market for data may be getting interesting.


>OpenAI almost certainly has the hashes of what it spits out too, so they'll be able to sort the wheat from the chaff for a while yet.

Normal hashes are extremely fragile, so they'd have to use something more sophisticated. Scott Aaronson said in a podcast a few months ago that OpenAI has implemented such a system but at the time they had not decided to start using it.

The purpose being discussed at the time was to provide a tool for educators to detect cheating, but presumably it could also be used for filtering future datasets.


It's older than that: I ran into this fine-tuning ESRGAN on itself. Distortion is rapidly amplified in successive generations, even when you pixel-peep and can barely see it in the ESRGAN-generated source.


I don't believe in the "dead internet theory" as a description of the current situation (mostly), but as a prediction? Maybe.

https://en.wikipedia.org/wiki/Dead_Internet_theory


This is not a model problem or a synthetic-data problem. This is basic data science, and the article says as much: "We find that use of model-generated content in training causes irreversible defects in the resulting models, where tails of the original content distribution disappear." Data quality is more important than data volume, and if you forget that... garbage in, garbage out.

Make sure you have a representative training dataset; real or synthetic, it doesn't matter.


There is a massive flaw in this argument. In real life, whether a given generated work ends up in a future dataset depends on how good it is according to humans. For example, in order for an article to end up in the reddit set it needs at least three upvotes.

They could have replicated it here by having GPT-4 score the samples and throwing out most (but not all) of the bad ones. I have no idea what would happen if you, e.g., throw out the bottom 70% and keep the top 30%. It's conceivable to me that it would end up improving, or at least not getting much worse, with each generation.
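
A rough sketch of that selection step, with a hypothetical score function standing in for whatever judge (GPT-4, upvotes, human raters) assigns quality:

    # Keep only the top-scoring fraction of generated samples before they
    # re-enter the training pool. `score` is a hypothetical quality metric.
    import numpy as np

    def filter_generated(samples, score, keep_fraction=0.3):
        scores = np.array([score(s) for s in samples])
        cutoff = np.quantile(scores, 1.0 - keep_fraction)
        return [s for s, sc in zip(samples, scores) if sc >= cutoff]

    # Toy usage: pretend longer answers score higher, just to show the mechanics.
    generated = ["short", "a somewhat longer sample", "the longest sample of all three"]
    print(filter_generated(generated, score=len))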


> There is a massive flaw in this argument. In real life, whether a given generated work ends up in a future dataset depends on how good it is according to humans

Even the best-looking JPEG (as judged by humans) is still lossy.


Generated data tends to be selected and edited by humans, so it is already a bit better than raw. In general a model that takes external feedback into account will be able to self improve, for example a code model would run tests and a chat model could interpret user responses as reward signals.

You gotta add something new to the mix. That's possible when the AI is part of a larger system. AlphaZero demonstrated that even self play could be a source of signal, as long as it gets the feedback of the game, the model can learn.


I think that has only been proven to work so far on limited game-style problems (such as literal games but also things like protein folding). It remains to be seen whether the techniques work well for more open-ended tasks like "produce text that resembles human writing".


Here is a paper applying evolutionary approaches on top of an LLM generating code. It seems LLMs are remarkably good at learning from feedback.

> Evolution through Large Models

https://arxiv.org/abs/2206.08896


I think one of the things overlooked in the discussions here is that the research is solely about the erosion of edge cases, but it does not qualitatively assess those edge cases.

To me, this research supports a hypothesis I've had for a while that we're going to get to truly excellent AI by using synthetic data to bias it towards excellence and away from mediocrity.

$20 says the next round of major model training is using synthetic data for training that was filtered through a discriminator trained entirely on human data.

The human data as a reference is certainly important to avoid polluting (and, to the paper's point, there's an advantage for those already having it), but moving away from edge cases isn't necessarily a bad thing in practice, given that edge cases can result in negative practical performance (as opposed to academic next-token performance).


I think you're on the right track with this thought: the obvious use case for models like this is their ability to classify data based on their training. Almost everyone has immediately thought "AI moderator" as a use case - but the most obvious use is for the AI to moderate its own training data for the next version.

Once they can do that and produce a productively improved model, then that's really the start of self-improvement.


OpenAI (accidentally?) confirmed in a recent paper that they used synthetic data in the training set for GPT-4, so to some extent this has already happened. It's not clear whether they did any human filtering on that data though.


Same idea here? Larger models do a better job forgetting their training data and dropping their semantic priors. Perhaps another way of thinking through this is that larger models learn new information and drop old information faster. https://arxiv.org/abs/2303.03846

Isn't that interesting? The idea of "mental liquidity", or "strong opinions weakly held"? https://news.ycombinator.com/item?id=36280772


Wouldn't this be the equivalent of ranking? I thought LLMs are not supposed to be influenced by freshness.


By the freshness of training with some data?

Well, aren't they? I believe any kind of reinforcement learning is supposed to be biased toward the last training set.


> For the private sector, many homeowners and corporations have longer-term fixed debt, and only some portion of it matures each quarter and gets refinanced at higher rates. As more private debt matures and gets refinanced at higher rates, this will continue to serve as a disinflationary and recessionary force on the economy, especially for sectors that are more sensitive to interest rates.

The one thing I don't get, and could have been missing in the past: a lot of corporations and private operations, like farms, run on debt. Now maybe it's a bit reductionist, but if you're a farmer operating on debt and interest rates go up, you need to increase prices to cover operating expenses. And this gets compounded all the way up to the end consumer, as every step in the supply chain marks up by a fixed percent, and because everything is getting more expensive decides to mark up by a larger percent. So higher interest rates really could be contributing to inflation. And it's just creating a cycle. And with current levels of debt never seen before in history, it's unlike other periods.


I didn't read the article yet, but does it cover AI and Debt?


No, the commenter intended to post it on the inflation & interest rates thread instead.

https://news.ycombinator.com/item?id=36315608


thanks


Wrong thread


SOB, thanks.


Side effect: search engine results, if generated content is not detected, would be the first to suffer.


That's assuming that the current crop of SEO'd garbage is better than the same content generated by an LLM. I'm not sure that's the case.


The current crop of SEO garbage follows some simplistic templates. I would assume it takes less "effort" (memory, parameters, time, whatever) to learn, which leaves space left over for other, novel stuff.

LLM garbage, on the other hand, would take up the whole model.


Agreed, it seems like every year or so I run into a case where I know something exists and has accessible robots.txt but is completely invisible to google.


Either way, it is the start of the search engine's decline.

- Generated content added to LLM. QED.

- Generated content added to SERP. DEAD.

;-)


Recursive training of generative models degenerates into reinforcement learning with random feedback. You need a strong feedback signal in the loop to survive recursive training. The feedback does not have to come from humans, though; you can use anything that grounds the model in reality.


Self play in RL is signal enough that machines can learn on their own. How we train models and what class of models is important. No doubt the paper makes good points but I don't think the reality is so black-and-white.


The difference between self-play in a game such as go and training an LLM on the output of itself or a previous LLM seems fairly obvious.

In self-play, the objective measure of the "truth" of a move can ultimately come out of the rules of the game which any machine can compute.

With an LLM, the machine is only aping, emulating, simulating the output that people have produced about the world - the machine has no access to the actual "real world" that people are using language to describe. Human beings talking about the world is data that increases your own knowledge of the world; your own predictions of that talking, not so much.


The success of the newest GPT models relies on RL to refine the latent space inside the LLM. There's a bottleneck when using humans to refine that space. The next model or subsequent models will surely use RL techniques like self-play to break through that bottleneck.


> The success of the newest GPT models relies on RL to refine the latent space inside the LLM. There's a bottleneck when using humans to refine that space.

Human-based RL is used because humans know stuff about the real world and can sort language utterances by it. There's no "self play" process that gives a system this sort of knowledge.


I am in way over my head here, so I wasn't able to tell if the authors addressed this, but my intuition is that this should be somewhat mitigated so long as people are providing the filter between which results are discarded and which might end up back in the training pool.

I would think that the human selection process would help to head off this convergence, both by selecting against incorrect results and by introducing variance from outside the model. On the other hand, since a person can only act as a filter, I can also see how that would be of limited value long term.


They don't address that. They just assume random sampling, so there's no equivalent to human curation or quality metrics, which would preserve tails or, by manual use, create tails. The contraction they observe is pretty much what you would expect in the random sampling setting, since you can only lose tails with a finite sample, and never gain them. (They also need to have a very large ratio of synthetic to real/original data.)

So, while interesting for nailing down that phenomenon, the broader implications everyone wants to draw from it are not very good - very few people are using random GPT-3/4 or Stable Diffusion samples!
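
The single-Gaussian version of that setting is easy to reproduce (a sketch with made-up numbers; each "generation" is trained purely on the previous generation's samples):

    # Fit a Gaussian to its own samples each generation, then resample.
    # With purely random sampling and no curation, the tails tend to erode:
    # the fitted spread drifts downward over generations.
    import numpy as np

    rng = np.random.default_rng(0)
    sample_size = 50
    data = rng.normal(loc=0.0, scale=1.0, size=sample_size)   # the "real" data

    for generation in range(1, 31):
        mu, sigma = data.mean(), data.std()
        data = rng.normal(loc=mu, scale=sigma, size=sample_size)
        if generation % 5 == 0:
            print(f"generation {generation}: fitted std = {data.std():.3f}")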


We have already heard reports of companies that get paid for human tagging, and similar services, using LLMs to automate their processes.


As long as the synthetic data is good, how can you tell the difference between it and human generated data?

This paper has one huge hole in it: it assumes that content on the internet is not moderated and that the training dataset will never evolve to take rating into consideration. On social media, the form of moderation is # of likes. Once detected, bots that output bad data will be banned and content deleted.

The key issue I have with the paper is that good synthetic data is impossible to tell apart from human-generated data.


The goal of an LLM, before RLHF, is to accurately predict what token comes next. It cannot do better than that. The perfect outcome is text identical to the training set.

Let's say your LLM can generate text with the same quality as the input 98% of the time, and the other 1% of the time, it's wrong. Each round of recursive training amplifies that error. 96% accuracy after the next round. 67% after 20 rounds.

There's no way for it to get better without more real human input.
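
For reference, those figures come from compounding a 98% per-round retention rate:

    retained = 0.98
    print(f"after 2 rounds:  {retained ** 2:.0%}")    # ~96%
    print(f"after 20 rounds: {retained ** 20:.0%}")   # ~67%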


"The perfect outcome is text identical to the training set."

Huh? If the LLM was only ever spitting back identical content straight from the training set, that would be a symptom of extreme overfitting, which everyone universally agrees is a bad thing — not a perfect thing.


It's not the most useful outcome for the end user, but it's the perfect outcome from the perspective of the learning algorithm. All these things care about is minimizing their loss function, where loss is deviation from the training set.


Yeah but those errors will decrease the chance that the output will end up spreading across the internet, and therefore decrease the chance that it ends up in a future training set.

You can see the whole process as a very high latency reinforcement learning system.


That's true but I think you're overestimating how careful people will be in their curation. There are tons of LLM-powered bots on twitter and reddit already, and no one is bothering to delete the output.

Also, the curation process counts as "real human input" itself, so I don't think it contradicts my point.


"98% of the time, and the other 1% of the time" Is this a strange take on "off by 1" error?


The amount of training data vastly exceeds the size of the model; it does not just regurgitate what it found on the internet. This ignorant trope needs to die already.


>it assumes that content on the internet is not moderated and that the training dataset will never evolve to take rating into consideration

Worse than this, since the training set already takes rating into consideration. For example the reddit dataset includes only articles with more than 3 upvotes.


I suspect that many people will publish substandard generated data. This will affect the integrity of training data in the long term, unless steps are taken.


Not surprising. It always seemed likely to me that there is model bias if you train your models on model generated data, like a feedback loop (second order effects?). Similar to how applying a linear system over and over stretches the inputs in the direction of its largest eigenvector.
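
That eigenvector analogy is just power iteration; a quick numpy sketch of the effect:

    # Repeatedly applying the same matrix pulls any starting vector toward
    # the dominant eigenvector; variation in other directions dies out.
    import numpy as np

    A = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
    v = np.array([1.0, 1.0])

    for _ in range(15):
        v = A @ v
        v = v / np.linalg.norm(v)      # keep the scale bounded

    eigvals, eigvecs = np.linalg.eigh(A)
    dominant = eigvecs[:, np.argmax(eigvals)]
    print("iterated direction:  ", np.round(v, 4))
    print("dominant eigenvector:", np.round(dominant, 4))   # same direction, up to sign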

Now wait till the generated content is indistinguishable from human content (to humans) and it will be hard to figure out what's in your training set.


LLM Kessler Syndrome.


I often think this about Anki and spaced repetition. At the limiting case it has to be overwriting other memories, right?


Only sometimes.

Certain skills interfere with each other. For instance playing chess makes you worse at go, and playing go makes you worse at chess. Certain pairs of languages are likewise hard to learn together - my son found that he could not study both Russian and Chinese at the same time.

But in general you just develop more and better memories.


I spent years living in Taipei studying traditional Chinese before I moved to Moscow to study Russian. I actually found that they were linguistically distinct enough that it was easy to compartmentalize each language in my head without any cross bleed like you would have if you were studying Spanish and Portuguese simultaneously.


Yes, if you master one, then there is no conflict. But my son was a monolingual person learning another language. And he kept trying to apply Russian ideas to Chinese, and Chinese ideas to Russian. Therefore, even though he personally wanted to know Russian more, he chose Chinese instead because it met a school requirement for a second language. (And they couldn't teach him Russian.)

He still plans to learn Russian, though.


Or, in short, we have an upcoming AI collapse as AI output bleeds into the internet, the very space where AIs are collecting their inputs in the first place.

Actually, it's not scary, as it forces everyone to look for solutions, both the in-the-box kind and the out-of-the-box kind.


Just weight your training data by the engagement it has obtained online. High-quality output will have more engagement.


I wonder if human learning will be similarly impaired.


I asked Bing's GPT-4 to summarize the abstract and then to summarize the summary, 11 times over. I started a new chat after each summary and prepended "Summarize the following text:" to each one. I had to use Creative mode, or else Bing would search instead of summarizing.

1st summary: The text discusses the impact of large language models (LLMs) on online text and image generation. It argues that LLMs can cause model collapse, a phenomenon where the original diversity of data is lost due to feedback loops between models and data sources. It shows that model collapse can affect various types of generative models and warns that it poses a threat to the quality and reliability of online content. It suggests that data from real human interactions will become more valuable to prevent model collapse.

2nd: The text warns that large language models (LLMs) can cause online content to lose its original diversity and become less trustworthy. It explains that LLMs can create feedback loops that reduce the variety of data they learn from. It suggests that real human data will be more important to avoid this problem.

3rd: The text argues that LLMs can make online content more homogeneous and unreliable by creating feedback loops that limit their data sources. It recommends using real human data to prevent this issue.

4th: The text warns that LLMs can produce biased and repetitive online content by relying on narrow data sets. It suggests using diverse and authentic human data to avoid this problem.

5th: The text advises using varied and real human data to prevent LLMs from creating online content that is unfair and redundant due to limited data sources.

6th: The text suggests using diverse and authentic data to avoid biased and repetitive online content by LLMs.

7th: The text argues that LLMs need better data to produce more varied and realistic online content.

8th: The text claims that language and literature models (LLMs) require more diverse and accurate data to generate online content that reflects the real world.

9th: The text argues that LLMs need better data to create realistic online content.

10th (Bing started to hallucinate; I double-checked that it didn't search and that the chat was new): The text claims that online content creators who use language models (LLMs) should have access to more reliable and diverse data sources to produce realistic and accurate content. It warns that LLMs can generate misleading or harmful content if they are trained on biased or low-quality data. It also suggests some ways to improve the data quality and availability for LLMs, such as using data verification tools, crowdsourcing data collection, and collaborating with other content creators.

11th: The text argues that online content creators need better and more diverse data sources to use language models (LLMs) effectively. It cautions that LLMs can produce harmful or inaccurate content if they rely on flawed data. It also recommends some methods to enhance the data quality and access for LLMs, such as verifying data, crowdsourcing data, and cooperating with other creators.



