Your point about needing a dataset made me think that a Hacker News post like this could be a good way to collect data. How many people would contribute by reading a prompt if the link they visited gave them the option to donate a recording? That would capture many distinct voices, microphones, and recording conditions.
The article mentions that they used a dataset composed of 100 hours of audiobooks. A comment thread here [1] estimates 10-50k visitors from a successful Hacker News post. Call it 30k. If 20% of visitors donated by reading a one-minute prompt, that's 6,000 recordings at a minute each: 6,000 minutes, or, oddly enough, also 100 hours.
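Back-of-envelope in Python, if anyone wants to play with the assumptions (the 30k visitor count and 20% participation rate are guesses, not data):

    # Rough estimate of donated audio; all inputs are assumptions.
    visitors = 30_000        # middle of the 10-50k range estimated in [1]
    participation = 0.20     # assumed fraction who actually read a prompt
    prompt_minutes = 1       # length of each recording

    donated_minutes = visitors * participation * prompt_minutes
    print(f"{donated_minutes:,.0f} minutes = {donated_minutes / 60:,.0f} hours")
    # -> 6,000 minutes = 100 hours

Even at half that participation rate you'd still get 50 hours of much more varied audio.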
Seems like a potentially easy way to double your dataset and make it more diverse.
Audio data of people reading prompts is quite common; what's missing for robust voice recognition is plenty of data of, say, people shouting it across the room. There is only so much that physics simulations can do.
There might be some sampling bias in an HN user dataset. At my company, many of our customer service calls come from older people, especially women, who call because they don't like using the internet (or don't have internet at all). Different voices, different patterns of speech. That could be a very different demographic from HN users.
1 - https://news.ycombinator.com/item?id=20612717