
This is why the big long-term bet for AI-assisted AI development is synthetic data. So much money and so many resources are going into synthetic data right now not just out of economic necessity, but because there have been extremely encouraging results with it (e.g. 'Textbooks Are All You Need', AlphaZero).



I wouldn't count AlphaZero, since it's reinforcement learning. With that technique you can generate high-quality data indefinitely because the rules are fixed. Not everything can be trained that way.


The chess knowledge and skills of LLMs come from ingesting a sufficient number of chess games in text format (the amount needed will be proportional to both the other data you have and the compute you have). The same goes for the ability of LLMs to play other games or solve other fixed-rule, perfect-knowledge puzzles. AlphaZero and its cousins showed that you can generate an effectively infinite quantity of extremely high-quality data in those domains. It is possible that giving an LLM e.g. one billion ~4600 Elo level games only improves its ability to play chess and does nothing for its general intelligence. Given the results many studies have reported on cross-learning with LLMs, I doubt that though. The potential is that generating a lot of extremely high-level logic and puzzle solving, and providing it to an LLM as extremely high-quality synthetic data, can improve its general reasoning and logic capabilities. That would be huge, and it is one of the promises of synthetic data.
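
To make the "fixed rules means unlimited data" point concrete, here is a toy sketch. It assumes tic-tac-toe as a stand-in for chess and uniformly random legal moves as a stand-in for a strong self-play policy, so it is nothing like AlphaZero's actual search or training loop. The function names and the text record format are made up for illustration.

```python
import random

# All eight winning lines on a 3x3 board indexed 0..8 (row-major).
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return 'X' or 'O' if that player has completed a line, else None."""
    for a, b, c in WIN_LINES:
        if board[a] is not None and board[a] == board[b] == board[c]:
            return board[a]
    return None


def self_play_game(rng):
    """Play one game with random legal moves (hypothetical placeholder policy).

    Returns (moves, result) where result is 'X', 'O', or 'draw'.
    """
    board = [None] * 9
    moves = []
    player = "X"
    while True:
        legal = [i for i, cell in enumerate(board) if cell is None]
        if not legal:
            return moves, "draw"
        move = rng.choice(legal)
        board[move] = player
        moves.append(move)
        if winner(board) == player:
            return moves, player
        player = "O" if player == "X" else "X"


def generate_dataset(n_games, seed=0):
    """Produce n_games text records; the rules alone supply the labels."""
    rng = random.Random(seed)
    records = []
    for _ in range(n_games):
        moves, result = self_play_game(rng)
        records.append(" ".join(str(m) for m in moves) + " result=" + result)
    return records
```

Calling generate_dataset(1000) gives a thousand labeled game transcripts with no human data involved. The quality of such data depends entirely on the policy generating the moves, which is where something like AlphaZero would come in.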



