
> They probably just use publicly-available resources like The Pile

I’d be very surprised if the big orgs don’t have in-house efforts that far exceed The Pile. Hell, we know Google paid Reddit a pile of money for data, and other orgs are also willing to pay.




Yeah, they absolutely do not use The Pile.


GPT-Neo and Llama were both trained on The Pile, and both of those were fairly influential releases. That's not to say they don't also use other resources, but I see no reason not to use The Pile; it's enormous.

It's also not everything there is, but for public preservation purposes I think the current archives are fine. If Google or Meta turn out to have been secretly stockpiling old training data without our knowledge, I'm not exactly sure what "we" would lose.



