> They probably just use publicly-available resources like The Pile
I’d be very surprised if the big orgs don’t have in-house efforts that far exceed The Pile. Hell, we know Google paid Reddit a pile of money for data, and other orgs are also willing to pay.
GPT-Neo was trained on The Pile, and Llama drew on parts of it (e.g., Books3); both were fairly influential releases. That's not to say those orgs don't also use other resources, but I see no reason they wouldn't use The Pile; it's enormous.
It's also not everything that exists, but for public preservation purposes I think the current archives are fine. If Google or Meta turn out to have been secretly stockpiling old training data without our knowledge, I'm not sure what "we" would actually lose.