> it was trained using synthetic data Is this not supposed to cause Model collap...

rhdunn · 2024-12-16T10:37:26 1734345446

It depends on how you construct the synthetic data and how the model is trained on that data.

For diffusion-based image generators, training only on synthetic data across repeated generations of models can cause model collapse, as errors in the output are amplified in the trained model. It's usually the 2nd or 3rd model created this way (with the output of the previous model used as input for the next) that collapses.

It was found that using primary data alongside synthetic data avoided the model collapse. Likewise, some sort of human scoring/evaluation can help avoid artefacts.
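
To make the feedback loop concrete, here is a toy sketch, not a diffusion model or an LLM: the "model" is just a Gaussian fitted to its training set, and each generation is trained on samples from the previous fit. The function name `simulate` and the parameters `real_fraction`, `n` and `generations` are made up for this illustration, and the numbers are arbitrary.

```python
import random
import statistics

def simulate(generations, n, real_fraction, seed=0):
    """Repeatedly fit a Gaussian to samples drawn mostly from the previous fit.

    real_fraction is the share of each generation's training set drawn from
    the true distribution N(0, 1); the rest comes from the previous model.
    Returns the fitted standard deviation after each generation.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0 matches the real data
    n_real = round(n * real_fraction)
    history = []
    for _ in range(generations):
        # Fresh primary data (may be empty) plus output of the previous model.
        real = [rng.gauss(0.0, 1.0) for _ in range(n_real)]
        synthetic = [rng.gauss(mu, sigma) for _ in range(n - n_real)]
        data = real + synthetic
        # "Training" here is just re-estimating mean and spread.
        mu, sigma = statistics.mean(data), statistics.stdev(data)
        history.append(sigma)
    return history

pure = simulate(generations=300, n=10, real_fraction=0.0)
mixed = simulate(generations=300, n=10, real_fraction=0.5)
print(f"synthetic only: final sigma = {pure[-1]:.3g}")
print(f"half real data: final sigma = {mixed[-1]:.3g}")
```

In this toy setup the synthetic-only chain tends to lose its spread over many generations because small estimation errors compound, while the chain that keeps seeing fresh real samples stays close to the true spread. Real generative models are far more complex, so treat this only as an intuition pump.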

simonw · 2024-12-16T10:39:35 1734345575

This is why I don't think model collapse actually matters: people have been deliberately training LLMs on synthetic data for over a year at this point.

As far as I can tell, model collapse happens when you deliberately train LLMs on low-quality LLM-generated data so that you can write a paper about it.

ziofill · 2024-12-16T16:30:24 1734366624

I may have misunderstood, but I think that it depends a lot on the existence of a validation mechanism. Programming languages have interpreters and compilers that can provide a useful signal, while for images and natural language there isn’t such an automated mechanism, or at least it’s not that straightforward.
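
As a rough sketch of what I mean by a validation signal for code: generated Python snippets can be run through the parser and only the ones that parse are kept. The helper names `is_valid_python` and `filter_candidates` and the sample snippets are hypothetical, and a syntax check is only the weakest possible filter; a real pipeline would presumably also run tests or a type checker.

```python
import ast

def is_valid_python(source):
    """Return True if the snippet parses as Python source."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        return False
    return True

def filter_candidates(candidates):
    """Keep only the generated snippets that pass the syntax check."""
    return [c for c in candidates if is_valid_python(c)]

# Hypothetical model outputs; the second one is missing a colon.
candidates = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b) return a + b",
    "print('hello')",
]
kept = filter_candidates(candidates)
print(f"kept {len(kept)} of {len(candidates)} candidates")
```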

nxobject · 2024-12-16T13:52:38 1734357158

As someone who's a complete layman: I wonder if the results of model collapse are no worse than, say, sufficiently complex symbolic AI (modulo consistency and fidelity?)

fulafel · 2024-12-16T09:50:50 1734342650

belter · 2024-12-16T13:49:29 1734356969

Is this paper wrong? - https://arxiv.org/abs/2311.09807

simonw · 2024-12-16T16:45:25 1734367525

It shows that if you deliberately train LLMs against their own output in a loop you get problems. That's not what synthetic data training does.

belter · 2024-12-16T19:50:36 1734378636

I understand and appreciate your clarification. However, would it not be the case that some synthetic data strategies, if misapplied, can resemble the feedback-loop scenario and thus risk model collapse?