
The vast knowledge trove of Google can't be overstated, even if the model is sometimes less competent at certain tasks than OpenAI's GPT models.



If there's one thing that's becoming clear in the open source LLM world, it's that the dataset really is the 'secret sauce' for LLMs. There are endless combinations of datasets, foundation models, and training approaches, and by far the biggest determinant of final model performance seems to be the dataset used.


> it's that the dataset really is the 'secret sauce'

alwayshasbeen.jpg

There have been articles about how "data is the new oil" for a couple of decades now; the earliest reference I could find is from British mathematician Clive Humby in 2006 [0]. The fact that it rings even more true in the age of LLMs is just another transformation of the same underlying data.

[0] https://en.wikipedia.org/wiki/Clive_Humby#cite_ref-10



> There have been articles about how "data is the new oil" for a couple of decades now, with the first reference I could find being from British mathematician Clive Humby in 2006

I am specifically referring to the phrase I quoted, not some more abstract sentiment.


The best answer to this is https://www.youtube.com/watch?v=ab6GyR_5N6c :)


Wasn't there a comment on HN just today saying Google had an institutional reluctance to use certain datasets like LibGen? I honestly don't think Google used everything they had to train their LLM.

https://news.ycombinator.com/item?id=38194107



