Hacker News

Wonder where the training data for this is.



They supply their paper in the Git repo, here: https://github.com/usefulsensors/moonshine/blob/main/moonshi...

The section "3.2. Training data collection & preprocessing" covers what you're inquiring about: "We train Moonshine on a combination of 90K hours from open ASR datasets and over 100K hours from own internally-prepared dataset, totalling around 200K hours. From open datasets, we use Common Voice 16.1 (Ardila et al., 2020), the AMI corpus (Carletta et al., 2005), GigaSpeech (Chen et al., 2021), LibriSpeech (Panayotov et al., 2015), the English subset of multilingual LibriSpeech (Pratap et al., 2020), and People's Speech (Galvez et al., 2021). We then augment this training corpus with data that we collect from openly-available sources on the web. We discuss preparation methods for our self-collected data in the following."

The section continues from there with the details of that web-data preparation.



