Good article. Speech recognition for real-time use cases really needs a working open source solution. I have been evaluating DeepSpeech, which is okay, but a lot of work is needed to get it close to Google's speech engine. Apart from a good deep neural network, a good speech recognition system needs two important things:
1. Tons of diverse data sets (real world)
2. A solution for noise: either de-noise and then train, or train with noise.
There are also several extra challenges that voice recognition has to solve which are not common in other deep learning problems:
1. Pitch
2. Speed of conversation
3. Accents (can be solved with more data, I think)
4. Real-time inference (low latency)
5. On the edge (i.e. offline on mobile devices)
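To make the "train with noise" option concrete, here is a rough sketch of mixing a background noise clip into a clean utterance at a chosen signal-to-noise ratio. This is not how DeepSpeech or Google does it; the function names (`power`, `mix_at_snr`) and the synthetic tone and Gaussian noise are assumptions made purely for illustration, and real training would use recorded speech and recorded noise.

```python
import math
import random

def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaling the noise to hit the requested SNR (dB)."""
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    # Loop the noise clip so it covers the whole utterance.
    looped = [noise[i % len(noise)] for i in range(len(speech))]
    noise_power = power(looped)
    if noise_power == 0:
        raise ValueError("noise clip is silent")
    # SNR_dB = 10 * log10(P_speech / P_noise), solved for the noise gain.
    target_noise_power = power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * n for s, n in zip(speech, looped)]

# Synthetic stand-ins (assumptions): a 220 Hz tone as "speech",
# Gaussian noise as "background".
random.seed(0)
sample_rate = 16000
speech = [0.5 * math.sin(2 * math.pi * 220 * i / sample_rate)
          for i in range(sample_rate)]
noise = [random.gauss(0.0, 1.0) for _ in range(sample_rate // 2)]

# One clean utterance becomes several training examples at different noise levels.
augmented = {snr: mix_at_snr(speech, noise, snr) for snr in (20, 10, 0)}
```

The same clean recording can be reused with many noise clips and SNR values, which is the cheap part of "train with noise". It does not replace real-world recordings, since simulated mixing misses effects like reverberation and microphone differences.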
Your point about needing a dataset made me think that a Hacker News post like this could be a good way to get data. How many people would contribute by reading a prompt if they visited a link like this and had the option to donate some data? That would get many distinct voices and microphones, and some different recording conditions.
The article mentions that they used a dataset composed of 100 hours of audiobooks. A comment thread here [1] estimates 10-50k visitors from a successful Hacker News post. Call it 30k visitors. If 20% of visitors donated by reading a one-minute prompt, that's 6,000 donors and another 6,000 minutes, or, oddly, also 100 hours.
Seems like a potentially easy way to double your dataset and make it more diverse.
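For anyone who wants to play with the numbers, here is the back-of-the-envelope estimate as a small script. The 20% participation rate and the one-minute prompt are guesses from the comment above, not measured values, and the function name `donated_hours` is made up for this sketch.

```python
def donated_hours(visitors, participation_percent, minutes_per_prompt):
    """Hours of audio collected if a given share of visitors each read one prompt."""
    donors = visitors * participation_percent / 100
    return donors * minutes_per_prompt / 60

existing_hours = 100  # audiobook dataset size mentioned in the article

# Low, middle, and high ends of the 10-50k visitor estimate.
for visitors in (10_000, 30_000, 50_000):
    extra = donated_hours(visitors, participation_percent=20, minutes_per_prompt=1)
    print(f"{visitors:>6} visitors -> {extra:6.1f} extra hours, "
          f"{existing_hours + extra:6.1f} hours total")
```

The result is most sensitive to the participation rate, which is the least certain input here.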
Audio data of people reading prompts is quite common. What is missing for robust voice recognition is plenty of data of, e.g., people screaming it across the room. There is only so much that physics simulations can do.
There might be some sampling bias with an HN user dataset. At my company, many of our customer service calls are from older people, especially women, who call because they don't like using the internet (or they don't even have internet). Different voices and patterns of speech. This could be a really different demographic from HN users.