Yes. And the cost of these synthetic datasets is very high. Nobody is sharing them. I suspect people are underestimating how much hardware OpenAI/Microsoft are using to build massive amounts of synthetic data. I doubt they are just training models over and over on Common Crawl and the like.
> the cost of these synthetic datasets is very high. Nobody is sharing
There are plenty of synthetic datasets generated from GPT-4 and other models [1]. Microsoft did create a large one, 150B tokens, but that is still two orders of magnitude smaller than the ~13T tokens reportedly used to train GPT-4.
But in the future this will be the main way to improve models: put them to work, filter for their best outputs, then retrain (a rough sketch below). Very expensive, but that is the cost of evolution. It took humans a very long time to create the culture and technology that underlies LLMs; it will take a similar effort to push them forward.
Human-generated text was the low-hanging fruit, but now that it's picked, synthetic data is the only way forward: models generating their own experience and feedback, doing exploration and combinatorial search, learning from their interactions with humans, from games, experiments and simulations.
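To make the loop concrete, here is a minimal sketch. Every name in it is a placeholder: generate stands in for sampling, score for a verifier/reward model, retrain for a fine-tuning run.

    import random

    def generate(model, prompt):
        # Stand-in for sampling a completion from the model
        return prompt + " -> model output"

    def score(text):
        # Stand-in for a verifier or reward model
        return random.random()

    def retrain(model, data):
        # Stand-in for a fine-tuning run on the filtered data
        return model

    def self_improve(model, prompts, threshold=0.9, rounds=3):
        for _ in range(rounds):
            samples = [generate(model, p) for p in prompts]
            # Keep only the outputs the verifier rates highly
            good = [s for s in samples if score(s) >= threshold]
            model = retrain(model, good)
        return model

The whole bet is that the filter is cheaper to get right than generation, so each round distills only the "good stuff" back into the model.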
But if we're talking about synthetic data, then the elephant in the room is OpenAI's own chat logs. They have 180M users; assume 10K tokens per user per month and that comes to 1.8T tokens per month, mostly AI-written but interspersed with human replies and tool-generated output. At that rate they can collect, in well under a year, about as much synthetic data as the original training set.
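Back-of-envelope, in case the numbers look off (the 10K tokens/user/month is an assumption, and 13T is the rumored GPT-4 training set size):

    users = 180e6              # reported monthly users
    tokens_per_user = 10e3     # assumed tokens per user per month
    monthly = users * tokens_per_user   # 1.8e12 = ~1.8T tokens/month
    months_to_13T = 13e12 / monthly     # ~7.2 months to match 13T
    print(f"{monthly:.1e} tokens/month, {months_to_13T:.1f} months")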
What if they train GPT-5 solely on synthetic data? That would simplify the copyright issues a lot, and give a 5x boost in efficiency.
Nobody underestimates it. It is clear that this stuff is not cheap. However, publications that don't share their datasets are garbage because you can't replicate them. Why publish at all? They're just noise.
All world-class scientists who don't cite every book they've ever read or teacher they've ever had are garbage because you can't replicate them. Why be born at all? They're just noise.
It is not the same. If you can't replicate, you can't verify. There is a difference between what you can infer from the provided information and what you can prove. Replication is a cornerstone of scientific experimentation. Thus, the argument you are using here is bullshit.