We introduce new datasets derived from the following sources: PubMed Central, ArXiv, GitHub, the FreeLaw Project, Stack Exchange, the US Patent and Trademark Office, PubMed, Ubuntu IRC, HackerNews, YouTube, PhilPapers, and NIH ExPorter. We also introduce OpenWebText2 and BookCorpus2, which are extensions of the original OpenWebText (Gokaslan and Cohen, 2019) and BookCorpus (Zhu et al., 2015; Kobayashi, 2018) datasets, respectively.
From https://arxiv.org/abs/2101.00027 (The Pile: An 800GB Dataset of Diverse Text for Language Modeling)