
The vast knowledge trove of Google can't be overstated, even if the model is sometimes less competent at certain tasks than OpenAI's GPT models.



If there's one thing that's becoming clear in the open source LLM world, it's that the dataset really is the 'secret sauce' for LLMs. There are endless combinations of datasets, foundation models, and training approaches, and by far the biggest determinant of final model performance seems to be the dataset used.


> it's that the dataset really is the 'secret sauce'

alwayshasbeen.jpg

There have been articles about how "data is the new oil" for a couple of decades now; the earliest reference I could find is from British mathematician Clive Humby in 2006 [0]. The fact that it rings even more true in the age of LLMs is just another transformation of the same underlying data.

[0] https://en.wikipedia.org/wiki/Clive_Humby#cite_ref-10



> There have been articles about how "data is the new oil" for a couple of decades now, with the first reference I could find being from British mathematician Clive Humby in 2006

I am specifically referring to the phrase I quoted, not some more abstract sentiment.


The best answer to this is https://www.youtube.com/watch?v=ab6GyR_5N6c :)


Wasn't there a comment on HN just today saying Google had an institutional reluctance to use certain datasets like LibGen? I honestly don't think Google used everything they had to train their LLM.

https://news.ycombinator.com/item?id=38194107



