Interesting that they enriched the training data by asking people to point out YouTube videos in the specific languages for which they needed data:

> YT-513-U: We create an additional dataset called YT-513-U to ensure coverage of lower resource languages in our pre-training dataset. We reached out to vendors and native speakers to identify YT videos containing speech in specific long tail languages, collecting a dataset of unlabeled speech in 513 languages. [1]

1. https://arxiv.org/abs/2303.01037



