This is most certainly true. If you look back to my comment and the discussion from the main thread, I have two quotes from the GPT-4 technical report:
> We measure cross-contamination between our evaluation dataset and the pre-training data using substring match. Both evaluation and training data are processed by removing all spaces and symbols keeping only characters (including numbers). For each evaluation example, we randomly select three substrings of 50 characters (or use the entire example if it’s less than 50 characters). A match is identified if any of the three sampled evaluation substrings is a substring of the processed training example. This yields a list of contaminated examples. We discard these and rerun to get uncontaminated scores.
> The RLHF post-training dataset is vastly smaller than the pretraining set and unlikely to have any particular question contaminated. However we did not check explicitly.
These don't do much to build confidence that OpenAI's evals are free of spoilage. Given what we know about dedupe pipelines (even as of early 2023), this is not enough to purge contamination. Exact substring matching has been the de facto method for quite some time, and for just as long we've known it has issues; it's just that 5 years ago those issues weren't as critical as they are today, because performance was much lower back then.
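To make the excerpt concrete, here is a minimal sketch of the check as the report describes it (the function names and the toy example are mine, not OpenAI's). The point is that a lightly paraphrased copy of an eval item sails straight past an exact substring match:

```python
import random
import re

def normalize(text: str) -> str:
    # Keep only alphanumeric characters, as described in the report excerpt.
    return re.sub(r"[^A-Za-z0-9]", "", text)

def sample_substrings(example: str, n: int = 3, length: int = 50) -> list[str]:
    # Use the whole example if it's shorter than the substring length.
    if len(example) <= length:
        return [example]
    starts = random.sample(range(len(example) - length + 1),
                           k=min(n, len(example) - length + 1))
    return [example[s:s + length] for s in starts]

def is_contaminated(eval_example: str, train_doc: str) -> bool:
    # Flag the eval example if ANY sampled substring appears verbatim in the training doc.
    doc = normalize(train_doc)
    return any(sub in doc for sub in sample_substrings(normalize(eval_example)))

# A reworded copy of the eval question slips past the check, which is the concern above.
eval_q = "What is the capital of France? Answer: Paris."
train_doc = "Q: what's the capital city of France?? A - Paris!"
print(is_contaminated(eval_q, train_doc))  # False, despite obvious semantic overlap
```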
I am not that versed on this topic, but I'm curious: what would be the biggest impact of leakage/spoilage on LLM performance? Is it similar to overfitting?
Yes, it'll generally lead to overfitting, and it will look a lot like memorization. And just an fyi, you can have train/test loss curves that never diverge (the usual check) and still overfit; divergence is an obvious signal, but there are many other ways for a model to overfit. As far as I'm aware, all giant LLMs and image generators show signs of overfitting. Note that sometimes this can even be helpful, and obviously these tools are still useful, so it's more about where they break than anything else.
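One way the memorization shows up in practice: feed the model the first part of a benchmark item and see how much of the held-out remainder it reproduces verbatim. A rough sketch of that probe (the `generate` callable here is a stand-in for whatever inference API you use, not a real library call, and the thresholds are arbitrary):

```python
def looks_memorized(generate, benchmark_items, prefix_frac=0.5, min_overlap=0.8):
    """Crude memorization probe: prompt with the first part of each benchmark item
    and measure how much of the held-out remainder comes back verbatim.
    `generate(prompt) -> str` is a placeholder for your inference call."""
    flagged = []
    for item in benchmark_items:
        cut = int(len(item) * prefix_frac)
        prefix, rest = item[:cut], item[cut:]
        completion = generate(prefix)[:len(rest)]
        # Character-level overlap is a blunt instrument, but enough to raise a flag.
        matches = sum(a == b for a, b in zip(completion, rest))
        if rest and matches / len(rest) >= min_overlap:
            flagged.append(item)
    return flagged
```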