
There is one trained on 600B tokens from SlimPajama [1], but that's fairly small compared to other recent releases (e.g. stablelm-3b [2], trained on 4T tokens).

> low quality data (the Pile only)

The Pile is pretty good quality-wise. It's mostly the size (300B tokens) that's limiting.

[1]: https://huggingface.co/state-spaces/mamba-2.8b-slimpj

[2]: https://huggingface.co/stabilityai/stablelm-3b-4e1t
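
If anyone wants to poke at the SlimPajama checkpoint, a rough sketch along these lines should work (assumes the mamba_ssm package and its own from_pretrained/generate helpers, and that the checkpoint reuses the GPT-NeoX tokenizer like the other state-spaces releases; exact arguments may differ by version):

    import torch
    from transformers import AutoTokenizer
    from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel

    # state-spaces Mamba checkpoints ship without a tokenizer; they reuse GPT-NeoX's
    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
    model = MambaLMHeadModel.from_pretrained(
        "state-spaces/mamba-2.8b-slimpj", device="cuda", dtype=torch.bfloat16
    )

    input_ids = tokenizer("The Pile is", return_tensors="pt").input_ids.to("cuda")
    out = model.generate(input_ids, max_length=64)  # returns generated token ids
    print(tokenizer.decode(out[0]))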




Eh, quality is subjective. There are good parts, like Books3 and arXiv, but a large chunk of it is Common Crawl, which has just about anything people put up on the internet: random IRC chat logs, HN and Reddit shitposts, YouTube subtitles that are in broken English half the time, and of course the Enron corporate email dump to make every model sound like an HR middle manager.



