We introduce new datasets derived from the following sources: PubMed Central, ArXiv, GitHub, the FreeLaw Project, Stack Exchange, the US Patent and Trademark Office, PubMed, Ubuntu IRC, HackerNews, YouTube, PhilPapers, and NIH ExPorter. We also introduce OpenWebText2 and BookCorpus2, which are extensions of the original OpenWebText (Gokaslan and Cohen, 2019) and BookCorpus (Zhu et al., 2015; Kobayashi, 2018) datasets, respectively.
From https://arxiv.org/abs/2101.00027 (The Pile: An 800GB Dataset of Diverse Text for Language Modeling)