Good article. Speech recognition for real-time use cases really needs a solid, working open source solution. I have been evaluating DeepSpeech, which is okay, but a lot of work is needed to bring it close to the Google Speech engine. Apart from a good deep neural network, a good speech recognition system needs two important things:
1. Tons of diverse data sets (real world)
2. Solution for Noise - Either de-noise and train OR train with noise.
There are also extra challenges that the voice recognition problem has to solve which are not common to other deep learning problems:
1. Pitch
2. Speed of conversation
3. Accents (can be solved with more data, I think)
4. Real-time inference (low latency)
5. On the edge (i.e. offline on mobile devices)
Your point about needing a dataset made me think that a post on Hacker News like this may be a good way to get data. How many people would contribute by reading a prompt if they visited a link like this and had the option to donate some data? That would capture many distinct voices, microphones, and recording conditions.
The article mentions that they used a dataset composed of 100 hours of audiobooks. A comment thread here [1] estimates 10-50k visitors from a successful Hacker News post. Call it 30k visitors. If 20% of visitors donated by reading a one-minute prompt, that's another 6,000 minutes, or, neatly, also 100 hours.
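Back-of-the-envelope, so anyone can tweak the assumptions (the visitor count and donation rate are guesses, per the above):

    visitors = 30_000        # rough estimate for a successful HN post [1]
    donation_rate = 0.20     # assumed share who read a prompt
    prompt_minutes = 1       # length of the reading prompt
    hours = visitors * donation_rate * prompt_minutes / 60
    print(hours)             # 100.0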
Seems like a potentially easy way to double your dataset and make it more diverse.
Audio data of people reading prompts is quite common; what is missing for robust voice recognition is plenty of data of, e.g., people screaming across the room. There is only so much physics simulation can do.
There might be some sampling bias with an HN user dataset. At my company, many of our customer service calls are from older people, especially women, who call because they don't like using the internet (or they don't even have internet). Different voices and patterns of speech. This could be a really different demographic from HN users.
This seems to be a CTC model. CTC is not really the best option for a good end-to-end system. Encoder-decoder-attention models or RNN-T models are both better alternatives.
There is also not really a problem with available open source code. There are countless open source projects which already have this mostly ready to use, for all the common DL frameworks: TF, PyTorch, JAX, MXNet, whatever. For anyone with a bit of ML experience, this should really not be too hard to set up.
But then, to get good performance on your own dataset, what you really need is experience. Taking some existing pipeline will probably get you a model with an OK-ish word error rate. But then you should tune it. In any case, even without tuning, encoder-decoder-attention models will probably perform better than CTC models.
Author here! CTC models perform quite well and are easy for beginners to get started with, with the added benefit of real-time streaming capabilities. RNN-Ts and encoder-decoder models like Listen-Attend-Spell are also very solid choices, and the literature points to slightly higher accuracy with them on academic datasets. RNN-Ts, being an extension of CTC, are streamable as well.
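For a feel of how easy it is to get started, the CTC loss is built into PyTorch. A minimal sketch, with random tensors standing in for a real acoustic model's output and labels:

    import torch
    import torch.nn as nn

    # T time steps, N batch items, C characters (index 0 = CTC blank), S max label length
    T, N, C, S = 50, 4, 29, 12
    log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(2)  # stand-in for model output
    targets = torch.randint(1, C, (N, S))                  # padded character labels (no blanks)
    input_lengths = torch.full((N,), T, dtype=torch.long)
    target_lengths = torch.randint(5, S + 1, (N,), dtype=torch.long)

    loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
    loss.backward()  # gradients flow back toward the (stand-in) acoustic model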
That's right - most literature does show that encoder-decoder architectures outperform CTC. I think one of the main reasons for this is that CTC assumes the label outputs are conditionally independent of each other, which is a pretty big flaw in that loss function.
The blog does mention Listen-Attend-Spell (which is an encoder-decoder architecture) as an alternative to the CTC model.
Agreed, and that's where it seems having lots of experience working with speech data helps more than trying to brute-force it with just larger CTC models and "more" data of dubious quality.
I question whether you need full attention in the acoustic model. The pronunciation of a word in the middle of a phrase does not depend much on the beginning.
You do need attention in the language model part of the pipeline.
Almost all the literature which compares CTC and encoder-decoder-attention models shows pretty well that encoder-decoder-attention performs better than CTC in the acoustic model.
See for example here as an overview (my own work, already a bit outdated, but attention has even improved much more since then): https://openreview.net/pdf?id=S1gp9v_jsm
If you need streaming, then yes, RNNT is a good option. If not, encoder-decoder-attention performs a bit better than RNN-T.
Note that there are also approaches for encoder-decoder-attention to make that streaming capable, e.g. MoChA or hard attention, etc.
Google uses RNN-T on-device. But they are researching extending it with another encoder-decoder-attention model on top, to get better performance.
This is quite an active research area, and it has not really settled. But CTC is not really that relevant anymore, as RNN-T is just a better variant.
ESPnet apparently already has implementations of versions of Transformer transducers with the RNN-T loss (though with a different network architecture). At least the paper cites results on a freely available dataset, right?
This is probably really good but the linked Colab notebook is failing on the first step with some unresolvable dependencies. This does seem to be a bit of a common theme whenever I try running example ML projects.
Edit: I think I've fixed it by changing the pip command to pin matching torch and torchaudio versions, something like:
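    # exact version pins are illustrative; the point is installing
    # matching torch and torchaudio releases
    pip install torch==1.4.0 torchaudio==0.4.0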
Hah, classic. But in all seriousness, I think it's a pretty interesting issue. A lot of ML and data science sees people coming in who do not have formal computer science and software development backgrounds. We build tools and methodologies around abstracting away some of the code development process and hope that lands us in an environment that's easy to share with others. This is unfortunately rarely the case.
It's a problem that, as an industry, I think we are in the middle of "solving" (it probably can't be solved fully, but things are getting better). I'm really excited to see what kinds of tools and tests will be developed around shipping ML projects with better practices.
Author here, thanks for pointing that out, it's fixed now! I also made sure to pin the pip versions for torchaudio and torch, so this should not be an issue anymore.
Have a look at NeMo (https://github.com/nvidia/NeMo). It comes with QuartzNet (only 19M weights and better accuracy than DeepSpeech2), pretrained on thousands of hours of speech.
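Loading the pretrained QuartzNet checkpoint is only a couple of lines with a recent NeMo release (the model name and API below are my assumption from NeMo's docs, so double-check against the repo):

    # pip install nemo_toolkit[asr]
    import nemo.collections.asr as nemo_asr

    # downloads the pretrained QuartzNet 15x5 English checkpoint
    model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="QuartzNet15x5Base-En")
    print(model.transcribe(["sample.wav"]))  # 16 kHz mono wav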
Mentioned once in the other comments here without any link, but another open source speech recognition model I heard about recently is Mozilla DeepSpeech: https://github.com/mozilla/DeepSpeech
Author here! Deep Speech is an excellent repo if you just want to pip install something. We wanted to do a comprehensive writeup to give devs the ability to build their own end-to-end model.
Dunno why (probably dataset), but open source speech recognition models perform very poorly on real-world data compared to Google Speech-to-Text or Azure Cognitive Services.
One of the main factors is probably dataset size. Commercial STT models are trained on tens of thousands of hours of real-world data. Even a decent model architecture is going to perform pretty well with that much data.
Most open source models are trained on LibriSpeech, Switchboard, etc., which are not really big or diverse enough for real-world scenarios.
But to max out results, the devil is in the details IMO (network architecture, optimizer, weight initialization, regularization, data augmentation, hyperparameter tuning, etc.), which requires a lot of experiments.
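On the data augmentation point specifically, SpecAugment-style masking is a couple of lines with torchaudio's built-in transforms (the mask sizes below are just illustrative):

    import torch
    import torchaudio.transforms as T

    spec = torch.rand(1, 128, 400)  # stand-in mel spectrogram: (channel, freq bins, time frames)
    augment = torch.nn.Sequential(
        T.FrequencyMasking(freq_mask_param=27),  # mask up to 27 consecutive mel bins
        T.TimeMasking(time_mask_param=100),      # mask up to 100 consecutive frames
    )
    augmented = augment(spec)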
There are newer, very large public English speech datasets, like Mozilla Common Voice and the National Speech Corpus, which can be combined with LibriSpeech to train large models.
The key is to fine-tune on your data. Take a publicly available pretrained model, fine-tune it on your data, and you can often get results better than Google's service or Azure Cognitive on your use case (Google and Azure ASR are great general services, but they cannot do better than a custom, say, health-center-specific model).
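A minimal PyTorch sketch of that recipe; the model class, checkpoint path, and batch here are all hypothetical stand-ins for your own pretrained model and in-domain data:

    import torch
    import torch.nn as nn

    class TinyAcousticModel(nn.Module):  # stand-in for whatever architecture was pretrained
        def __init__(self, n_feats=128, n_class=29):
            super().__init__()
            self.rnn = nn.GRU(n_feats, 256, batch_first=True, bidirectional=True)
            self.fc = nn.Linear(512, n_class)

        def forward(self, x):        # x: (N, T, n_feats)
            out, _ = self.rnn(x)
            return self.fc(out)      # (N, T, n_class)

    model = TinyAcousticModel()
    # model.load_state_dict(torch.load("pretrained.pt"))  # hypothetical checkpoint

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # much lower LR than from scratch
    criterion = nn.CTCLoss(blank=0)

    # toy in-domain batch; replace with a real DataLoader over your audio
    spec = torch.rand(4, 300, 128)
    labels = torch.randint(1, 29, (4, 20))
    in_lens = torch.full((4,), 300, dtype=torch.long)
    lab_lens = torch.full((4,), 20, dtype=torch.long)

    model.train()
    for spec, labels, in_lens, lab_lens in [(spec, labels, in_lens, lab_lens)]:
        optimizer.zero_grad()
        log_probs = model(spec).permute(1, 0, 2).log_softmax(2)  # CTC wants (T, N, C)
        loss = criterion(log_probs, labels, in_lens, lab_lens)
        loss.backward()
        optimizer.step()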