
> such data will rapidly become as precious as 'low background steel'.

I'm also totally not convinced by this argument.

Synthetic data as an input to a careful training regimen will produce better outputs, not worse, because you're still subjecting the model to optimization and new information. Over time you can pull out the worst-performing training data, original and synthetic alike. That careful curation is the part that makes the difference.
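
A minimal sketch of that curation loop, assuming a hypothetical quality_score() as a stand-in for whatever filter you actually trust (a reward model, a perplexity cutoff, human ratings):

    import random

    def quality_score(example):
        # Hypothetical scorer; in practice a reward model,
        # perplexity filter, or human rating, not random numbers.
        return random.random()

    def curate(human_data, synthetic_data, keep_fraction=0.5):
        # Pool both sources, score every example, and keep only the
        # top-scoring fraction regardless of where it came from.
        pooled = list(human_data) + list(synthetic_data)
        pooled.sort(key=quality_score, reverse=True)
        return pooled[:int(len(pooled) * keep_fraction)]

The point being that the filter, not the origin of the data, does the work.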

It's like DNA in the chemical soup. Chemistry has been replicating polymers since the beginning, and in the end intelligence arose. It didn't need magical ingredients. When you climb a gradient, it typically takes you somewhere better.




> in the end intelligence arises. It didn't need magical ingredients.

That's the current prevailing hypothesis, but we don't yet understand the phenomenon of intelligence well enough to definitively rule out magical ingredients: unknown variables or characteristics of the system/inputs/data that made it possible for intelligence to emerge.

This proposed snapshot of the web, before it gets further "contaminated" by synthetic AI/LLM-generated data, might prove to be valuable or it might not. The premise could be wrong. Maybe we learn that there's nothing fundamentally special about human-generated data, compared to synthetic data derived from it.

It seems worthwhile to consider, though, in case it turns out that there is some as-yet-unknown quality of the more or less "pure" human data. In the metaphor of low-background steel, we could be entering a period of unregulated nuclear testing without being fully aware of the consequences.


I don't buy this at all. AI-generated data is now a real part of the environment. The thing to modify is the loss function, not the training data. You need to be able to evaluate text on the internet, and so do models.
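
Concretely, "modify the loss function" could mean down-weighting, rather than excluding, text an upstream classifier flags as likely AI-generated. A hedged sketch in PyTorch; the classifier and its p_synth scores are assumptions, not an existing API:

    import torch
    import torch.nn.functional as F

    def provenance_weighted_loss(logits, targets, p_synth, alpha=0.5):
        # Per-example cross-entropy, down-weighted where the assumed
        # provenance classifier says "probably synthetic".
        # p_synth: (N,) tensor of probabilities in [0, 1].
        per_example = F.cross_entropy(logits, targets, reduction="none")
        weights = 1.0 - alpha * p_synth  # human text keeps full weight
        return (per_example * weights).mean()

    # Toy call: 4 examples, 10 classes, two flagged as synthetic.
    loss = provenance_weighted_loss(
        torch.randn(4, 10),
        torch.tensor([1, 3, 0, 7]),
        p_synth=torch.tensor([0.0, 0.9, 0.1, 0.8]),
    )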

This idea of contamination by AI vs pristine human data isn't persuasive to me at all. It feels like a continuation of the wrong idea that LLMs are parrots.


"Careful curation" is the part you lose when you use synthetic data. Subjecting models to "new information" isn't useful otherwise you could just feed it random 01s and hope to carefully curate it later

(also, how much time did the soup take? Can you wait that long?)


Training AI on AI-generated data produces increasingly weird outputs. I'm sure we're already seeing the results of this in some models, and the level of hallucination is only going to increase unless some kind of checks and balances is implemented.
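
You can reproduce that weirdness in miniature: refit a distribution to its own outputs for a few generations and the tails vanish. A toy sketch with a Gaussian standing in for the model; the ±1-sigma truncation is my assumption for models favouring their own high-likelihood outputs:

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma = 0.0, 1.0  # generation 0 stands in for human data

    for gen in range(1, 6):
        samples = rng.normal(mu, sigma, size=10_000)
        # Keep only the "typical" samples before refitting -- no
        # fresh human data ever enters the loop.
        typical = samples[np.abs(samples - mu) < sigma]
        mu, sigma = typical.mean(), typical.std()
        print(f"gen {gen}: sigma={sigma:.3f}")  # roughly halves each pass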


Hallucination^2


I'm convinced just cleaning existing datasets would be more effective.



