Good article. Speech recognition for real time use cases must get a really worki...

ALittleLight · on April 17, 2020

Your point about needing a dataset made me think that a Hacker News post like this could be a good way to get data. How many people would contribute by reading a prompt if they visited a link like this and had the option to donate some data? That would bring in many distinct voices and microphones, and a range of recording conditions.

The article mentions that they used a dataset composed of 100 hours of audiobooks. A comment thread here [1] estimates 10-50k visitors from a successful Hacker News post. Call it 30k visitors. If 20% of visitors donated by reading a one-minute prompt, that's 6,000 donors and another 6,000 minutes, or, oddly, also 100 hours.
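To make the back-of-envelope estimate easy to rerun with other guesses, here is a minimal sketch in Python. The function name `donated_hours` and its parameters are just illustrative; the 20% donation rate and the one-minute prompt are the assumptions from the paragraph above, not measured figures.

```python
def donated_hours(visitors: int, donation_rate: float, minutes_per_donor: float) -> float:
    """Estimate hours of donated audio from a single post.

    visitors          -- number of people who open the link (assumed)
    donation_rate     -- fraction of visitors who record a prompt (assumed)
    minutes_per_donor -- length of the prompt each donor reads, in minutes (assumed)
    """
    return visitors * donation_rate * minutes_per_donor / 60


# Low, middle, and high ends of the 10-50k visitor estimate.
for visitors in (10_000, 30_000, 50_000):
    hours = donated_hours(visitors, donation_rate=0.20, minutes_per_donor=1.0)
    print(f"{visitors} visitors -> {hours:.1f} hours")
```

The result scales linearly with every input, so halving the donation rate to 10% would halve the total as well. The 20% figure is probably the shakiest guess of the three.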

Seems like a potentially easy way to double your dataset and make it more diverse.

1 - https://news.ycombinator.com/item?id=20612717

Isn0gud · on April 17, 2020

You might be interested in a project doing exactly that: https://voice.mozilla.org/

Audio data of people reading prompts is quite common. What is missing for robust voice recognition is plenty of data of, e.g., people screaming the same prompts across the room. There is only so much that physics simulations can do.

ALittleLight · on April 17, 2020

That is interesting. I gave my contribution!

starpilot · on April 17, 2020

There might be some sampling bias with an HN user dataset. At my company, many of our customer service calls are from older people, especially women, who call because they don't like using the internet (or they don't even have internet). Different voices and patterns of speech. This could be a really different demographic from HN users.