Interesting that they enriched the training data by asking people to point out YouTube videos in the specific languages for which they needed data:

> YT-513-U: We create an additional dataset called YT-513-U to ensure coverage of lower resource languages in our pre-training dataset. We reached out to vendors and native speakers to identify YT videos containing speech in specific long tail languages, collecting a dataset of unlabeled speech in 513 languages. [1]

1. https://arxiv.org/abs/2303.01037