
It is indeed:

We introduce new datasets derived from the following sources: PubMed Central, ArXiv, GitHub, the FreeLaw Project, Stack Exchange, the US Patent and Trademark Office, PubMed, Ubuntu IRC, HackerNews, YouTube, PhilPapers, and NIH ExPorter. We also introduce OpenWebText2 and BookCorpus2, which are extensions of the original OpenWebText (Gokaslan and Cohen, 2019) and BookCorpus (Zhu et al., 2015; Kobayashi, 2018) datasets, respectively.

From https://arxiv.org/abs/2101.00027 (The Pile: An 800GB Dataset of Diverse Text for Language Modeling)




...HackerNews...

And there's my incentive to stop posting on HN.

It's been a blast, guys. I'm going back to lurker mode.



