Hacker News new | past | comments | ask | show | jobs | submit login

> it was trained using synthetic data

Is this not supposed to cause Model collapse?




It depends on how you construct the synthetic data and how the model is trained on that data.

For diffusion-based image generators training only on synthetic data over repeated model training can cause model collapse as errors in the output can amplify in the trained model. It's usually the 2nd or 3rd model created this way (with output of the previous used as input for the first) for it to collapse.

It was found that using primary data along side synthetic data avoided the model collapse. Likewise, if you also have some sort of human scoring/evaluation you can help avoid artefacts.


This is why I don't think model collapse actually matters: people have been deliberately training LLMs on synthetic data for over a year at this point.

As far as I can tell model collapse happens when you deliberately train LLMs on low quality LLM-generated data so that you can write a paper about it.


I may have misunderstood, but I think that it depends a lot on the existence of a validation mechanism. Programming languages have interpreters and compilers that can provide a useful signal, while for images and natural language there isn’t such an automated mechanism, or at least its not that straightforward.


As someone who's a completely layman: I wonder if the results of model collapse are no worse than, say, sufficiently complex symbolic AI (modulo consistency and fidelity?)


No.


Is this paper wrong? - https://arxiv.org/abs/2311.09807


It shows that if you deliberately train LLMs against their own output in a loop you get problems. That's not what synthetic data training does.


I understand and appreciate your clarification. However would it not be the case some synthetic data strategies, if misapplied, can resemble the feedback loop scenario and thus risk model collapse?




Join us for AI Startup School this June 16-17 in San Francisco!

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: