Your point about needing a dataset made me think that a Hacker News post like this could be a good way to collect data. How many people would contribute by reading a prompt if the link they visited gave them the option to donate a recording? That would capture many distinct voices, microphones, and recording conditions.
The article mentions that they used a dataset composed of 100 hours of audiobooks. A comment thread here [1] estimates 10-50k visitors from a successful Hacker News post. Call it 30k. If 20% of visitors donated by reading a one-minute prompt, that's 6,000 recordings at a minute each: 6,000 minutes, or, oddly enough, also 100 hours.
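Back-of-envelope in Python, if anyone wants to play with the assumptions (the 30k visitor count and 20% participation rate are guesses, not data):

    # Rough estimate of donated audio; all inputs are assumptions.
    visitors = 30_000        # middle of the 10-50k range estimated in [1]
    participation = 0.20     # assumed fraction who actually read a prompt
    prompt_minutes = 1       # length of each recording

    donated_minutes = visitors * participation * prompt_minutes
    print(f"{donated_minutes:,.0f} minutes = {donated_minutes / 60:,.0f} hours")
    # -> 6,000 minutes = 100 hours

Even at half that participation rate you'd still get 50 hours of much more varied audio.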
Seems like a potentially easy way to double your dataset and make it more diverse.
Audio data of people reading prompts is quite common; what's missing for robust voice recognition is plenty of data of, say, people shouting it across the room. There is only so much that physics simulations can do.
There might be some sampling bias in an HN user dataset. At my company, many of our customer service calls come from older people, especially women, who call because they don't like using the internet (or don't have internet at all). Different voices, different patterns of speech. That could be a very different demographic from HN users.
1 - https://news.ycombinator.com/item?id=20612717