> They probably just use publicly-available resources like The Pile
I’d be very surprised if the big orgs don’t have in-house efforts that far exceed The Pile. Hell, we know Google paid Reddit a pile of money for data, and other orgs are also willing to pay.
GPT-Neo was trained on The Pile, and Llama drew on parts of it (e.g., Books3); both were fairly influential releases. That's not to say those orgs don't also use other resources, but I see no reason they wouldn't use The Pile; it's enormous.
It's also not everything that exists, but for public preservation purposes I think the current archives are fine. If Google or Meta turn out to have been secretly stockpiling old training data without our knowledge, I'm not sure what "we" would actually lose.