
There is one trained on 600B tokens from SlimPajama [1], but that's fairly small compared to other recent releases (e.g. stablelm-3b [2], trained on 4T tokens).

> low quality data (the Pile only)

The Pile is pretty good quality-wise. It's mostly the size (300B tokens) that's limiting.

[1]: https://huggingface.co/state-spaces/mamba-2.8b-slimpj

[2]: https://huggingface.co/stabilityai/stablelm-3b-4e1t
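
If anyone wants to poke at the SlimPajama checkpoint, a rough sketch along these lines should work (assumes the mamba_ssm package and its own from_pretrained/generate helpers, and that the checkpoint reuses the GPT-NeoX tokenizer like the other state-spaces releases; exact arguments may differ by version):

    import torch
    from transformers import AutoTokenizer
    from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel

    # state-spaces Mamba checkpoints ship without a tokenizer; they reuse GPT-NeoX's
    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
    model = MambaLMHeadModel.from_pretrained(
        "state-spaces/mamba-2.8b-slimpj", device="cuda", dtype=torch.bfloat16
    )

    input_ids = tokenizer("The Pile is", return_tensors="pt").input_ids.to("cuda")
    out = model.generate(input_ids, max_length=64)  # returns generated token ids
    print(tokenizer.decode(out[0]))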




Eh, quality is subjective. There are good parts, like Books3 and arXiv, but a large chunk of it is Common Crawl, which has just about anything people put up on the internet: random IRC chat logs, HN and Reddit shitposts, YouTube subtitles that are in broken English half the time, and of course the Enron corporate email dump to make every model sound like an HR middle manager.



