Good article. Speech recognition for real-time use cases really needs a solid, working open source solution. I have been evaluating DeepSpeech, which is okay, but a lot of work is needed to bring it close to the Google Speech engine. Apart from a good deep neural network, a good speech recognition system needs two important things:
1. Tons of diverse data sets (real world)
2. Solution for Noise - Either de-noise and train OR train with noise.
There are also extra challenges that the voice recognition problem has to solve which are not common to other deep learning problems:
1. Pitch
2. Speed of conversation
3. Accents (can be solved with more data, I think)
4. Real-time inference (low latency)
5. On the edge (i.e. offline on mobile devices)
Your point about needing a dataset made me think that a post on Hacker News like this may be a good way to get data. How many people would contribute by reading a prompt if they visited a link like this and had the option to donate some data? That would capture many distinct voices, microphones, and recording conditions.
The article mentions that they used a dataset composed of 100 hours of audiobooks. A comment thread here [1] estimates 10-50k visitors from a successful Hacker News post. Call it 30k visitors. If 20% of visitors donated by reading a one-minute prompt, that's another 6,000 minutes, or, neatly, also 100 hours.
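Back-of-the-envelope, so anyone can tweak the assumptions (the visitor count and donation rate are guesses, per the above):

    visitors = 30_000        # rough estimate for a successful HN post [1]
    donation_rate = 0.20     # assumed share who read a prompt
    prompt_minutes = 1       # length of the reading prompt
    hours = visitors * donation_rate * prompt_minutes / 60
    print(hours)             # 100.0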
Seems like a potentially easy way to double your dataset and make it more diverse.
Audio data of people reading prompts is quite common; what is missing for robust voice recognition is plenty of data of, e.g., people screaming across the room. There is only so much physics simulation can do.
There might be some sampling bias with an HN user dataset. At my company, many of our customer service calls are from older people, especially women, who call because they don't like using the internet (or they don't even have internet). Different voices and patterns of speech. This could be a really different demographic from HN users.
This seems to be a CTC model. CTC is not really the best option for a good end-to-end system. Encoder-decoder-attention models or RNN-T models are both better alternatives.
There is also not really a problem with available open source code. There are countless open source projects which already have this mostly ready to use, for all the common DL frameworks: TF, PyTorch, JAX, MXNet, whatever. For anyone with a bit of ML experience, this should really not be too hard to set up.
But then, to get good performance on your own dataset, what you really need is experience. Taking some existing pipeline will probably get you a model with an OK-ish word error rate. But then you should tune it. In any case, even without tuning, encoder-decoder-attention models will probably perform better than CTC models.
Author here! CTC models perform quite well and are easy for beginners to get started with, with the added benefit of real-time streaming capabilities. RNN-Ts and encoder-decoder models like Listen-Attend-Spell are also very solid choices, and the literature points to slightly higher accuracy with them on academic datasets. RNN-Ts, being an extension of CTC, are streamable as well.
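For a feel of how easy it is to get started, the CTC loss is built into PyTorch. A minimal sketch, with random tensors standing in for a real acoustic model's output and labels:

    import torch
    import torch.nn as nn

    # T time steps, N batch items, C characters (index 0 = CTC blank), S max label length
    T, N, C, S = 50, 4, 29, 12
    log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(2)  # stand-in for model output
    targets = torch.randint(1, C, (N, S))                  # padded character labels (no blanks)
    input_lengths = torch.full((N,), T, dtype=torch.long)
    target_lengths = torch.randint(5, S + 1, (N,), dtype=torch.long)

    loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
    loss.backward()  # gradients flow back toward the (stand-in) acoustic model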
That's right - most literature does show that encoder-decoder architectures outperform CTC. I think one of the main reasons for this is that CTC assumes the label outputs are conditionally independent of each other, which is a pretty big flaw in that loss function.
The blog does mention Listen-Attend-Spell (which is an encoder-decoder architecture) as an alternative to the CTC model.
Agreed, and that's where it seems having lots of experience working with speech data helps more than trying to brute-force it with just larger CTC models and "more" data of dubious quality.
I question whether you need full attention in the acoustic model. The pronunciation of a word in the middle of a phrase does not depend much on the beginning.
You do need attention in the language model part of the pipeline.
Almost all the literature which compares CTC and encoder-decoder-attention models shows pretty well that encoder-decoder-attention performs better than CTC in the acoustic model.
See for example here as an overview (my own work, already a bit outdated, but attention has even improved much more since then): https://openreview.net/pdf?id=S1gp9v_jsm
If you need streaming, then yes, RNNT is a good option. If not, encoder-decoder-attention performs a bit better than RNN-T.
Note that there are also approaches for encoder-decoder-attention to make that streaming capable, e.g. MoChA or hard attention, etc.
Google uses RNN-T on-device. But they are researching extending it with another encoder-decoder-attention model on top, to get better performance.
This is quite an active research area, and it has not really settled. But CTC is not really that relevant anymore, as RNN-T is just a better variant.
ESPnet apparently already has implementations of versions of Transformer transducers with the RNN-T loss (though with a different network architecture). At least the paper cites results on a freely available dataset, right?
This is probably really good but the linked Colab notebook is failing on the first step with some unresolvable dependencies. This does seem to be a bit of a common theme whenever I try running example ML projects.
Edit: I think I've fixed it by changing the pip command to pin matching torch and torchaudio versions, something like:
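    # exact version pins are illustrative; the point is installing
    # matching torch and torchaudio releases
    pip install torch==1.4.0 torchaudio==0.4.0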
Hah, classic. But in all seriousness, I think it's a pretty interesting issue. A lot of ML and data science sees people coming in who do not have formal computer science and software development backgrounds. We build tools and methodologies around abstracting away some of the code development process and hope that lands us in an environment that's easy to share with others. This is unfortunately rarely the case.
It's a problem that, as an industry, I think we are in the middle of "solving" (it probably can't be solved fully, but things are getting better). I'm really excited to see what kinds of tools and tests will be developed around shipping ML projects with better practices.
Author here, thanks for pointing that out, it's fixed now! I also made sure to pin the pip versions for torchaudio and torch, so this should not be an issue anymore.
Have a look at NeMo (https://github.com/nvidia/NeMo). It comes with QuartzNet (only 19M weights and better accuracy than DeepSpeech2), pretrained on thousands of hours of speech.
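Loading the pretrained QuartzNet checkpoint is only a couple of lines with a recent NeMo release (the model name and API below are my assumption from NeMo's docs, so double-check against the repo):

    # pip install nemo_toolkit[asr]
    import nemo.collections.asr as nemo_asr

    # downloads the pretrained QuartzNet 15x5 English checkpoint
    model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="QuartzNet15x5Base-En")
    print(model.transcribe(["sample.wav"]))  # 16 kHz mono wav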
Mentioned once in the other comments here without any link, but another open source speech recognition model I heard about recently is Mozilla DeepSpeech: https://github.com/mozilla/DeepSpeech
Author here! Deep Speech is an excellent repo if you just want to pip install something. We wanted to do a comprehensive writeup to give devs the ability to build their own end-to-end model.
Dunno why (probably dataset), but open source speech recognition models perform very poorly on real-world data compared to Google Speech-to-Text or Azure Cognitive Services.
One of the main factors is probably dataset size. Commercial STT models are trained on tens of thousands of hours of real-world data. Even a decent model architecture is going to perform pretty well with that much data.
Most open source models are trained on LibriSpeech, Switchboard, etc., which are not really big or diverse enough for real-world scenarios.
But to max out results, the devil is in the details IMO (network architecture, optimizer, weight initialization, regularization, data augmentation, hyperparameter tuning, etc.), which requires a lot of experiments.
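On the data augmentation point specifically, SpecAugment-style masking is a couple of lines with torchaudio's built-in transforms (the mask sizes below are just illustrative):

    import torch
    import torchaudio.transforms as T

    spec = torch.rand(1, 128, 400)  # stand-in mel spectrogram: (channel, freq bins, time frames)
    augment = torch.nn.Sequential(
        T.FrequencyMasking(freq_mask_param=27),  # mask up to 27 consecutive mel bins
        T.TimeMasking(time_mask_param=100),      # mask up to 100 consecutive frames
    )
    augmented = augment(spec)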
There are newer, very large public English speech datasets, like Mozilla Common Voice and the National Speech Corpus, which can be combined with LibriSpeech to train large models.
The key is to fine-tune on your data. Take a publicly available pretrained model, fine-tune it on your data, and you can often get results better than Google's service or Azure Cognitive on your use case (Google and Azure ASR are great general services, but they cannot do better than a custom, say, health-center-specific model).
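A minimal PyTorch sketch of that recipe; the model class, checkpoint path, and batch here are all hypothetical stand-ins for your own pretrained model and in-domain data:

    import torch
    import torch.nn as nn

    class TinyAcousticModel(nn.Module):  # stand-in for whatever architecture was pretrained
        def __init__(self, n_feats=128, n_class=29):
            super().__init__()
            self.rnn = nn.GRU(n_feats, 256, batch_first=True, bidirectional=True)
            self.fc = nn.Linear(512, n_class)

        def forward(self, x):        # x: (N, T, n_feats)
            out, _ = self.rnn(x)
            return self.fc(out)      # (N, T, n_class)

    model = TinyAcousticModel()
    # model.load_state_dict(torch.load("pretrained.pt"))  # hypothetical checkpoint

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # much lower LR than from scratch
    criterion = nn.CTCLoss(blank=0)

    # toy in-domain batch; replace with a real DataLoader over your audio
    spec = torch.rand(4, 300, 128)
    labels = torch.randint(1, 29, (4, 20))
    in_lens = torch.full((4,), 300, dtype=torch.long)
    lab_lens = torch.full((4,), 20, dtype=torch.long)

    model.train()
    for spec, labels, in_lens, lab_lens in [(spec, labels, in_lens, lab_lens)]:
        optimizer.zero_grad()
        log_probs = model(spec).permute(1, 0, 2).log_softmax(2)  # CTC wants (T, N, C)
        loss = criterion(log_probs, labels, in_lens, lab_lens)
        loss.backward()
        optimizer.step()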