DeepSpeech: Scaling up end-to-end speech recognition (arxiv.org)
79 points by cbcase on Dec 18, 2014 | 22 comments



Thought it best to post the arXiv link, but there's some press coverage as well:

- https://gigaom.com/2014/12/18/baidu-claims-deep-learning-bre...

- http://www.forbes.com/sites/roberthof/2014/12/18/baidu-annou...


I should add that I had the opportunity to work on this project and am happy to answer questions.


Why 5 hidden layers? Why are the first 3 non-recurrent? How did you decide how wide to make the internal layers? Are there some organizing principles behind these design decisions, or is it just trial and error?


As in many things, it's a combination of both. For example:

- We wanted no more than one recurrent layer, as it's a big bottleneck to parallelization.

- The recurrent layer should go "higher" in the network, as it propagates long-range context more effectively over the network's learned feature representation than over the raw input values.

Other decisions are guided by a combination of trial and error and intuition. We started on much smaller datasets, which can give you a feel for the bias/variance tradeoff as a function of the number of layers, the layer sizes, and other hyperparameters.
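
To make the layer layout concrete, here is a rough modern-PyTorch sketch of a network with that shape. The layer sizes, the clipped-ReLU activation, and the 29-character output alphabet are illustrative placeholders, not the exact configuration from the paper:

    import torch
    import torch.nn as nn

    class DeepSpeechSketch(nn.Module):
        def __init__(self, n_input=494, n_hidden=2048, n_chars=29):
            super().__init__()
            # Three non-recurrent layers applied to each context-windowed
            # spectrogram frame (all sizes here are placeholders).
            self.ff = nn.Sequential(
                nn.Linear(n_input, n_hidden), nn.Hardtanh(0, 20),
                nn.Linear(n_hidden, n_hidden), nn.Hardtanh(0, 20),
                nn.Linear(n_hidden, n_hidden), nn.Hardtanh(0, 20),
            )
            # A single bi-directional recurrent layer, placed "higher" in the
            # network as discussed above.
            self.birnn = nn.RNN(n_hidden, n_hidden, bidirectional=True,
                                batch_first=True)
            # One more non-recurrent layer, then per-frame character scores
            # (fed to a CTC-style loss during training).
            self.out = nn.Sequential(
                nn.Linear(2 * n_hidden, n_hidden), nn.Hardtanh(0, 20),
                nn.Linear(n_hidden, n_chars),
            )

        def forward(self, x):          # x: (batch, time, features)
            h = self.ff(x)
            h, _ = self.birnn(h)
            return self.out(h)         # (batch, time, n_chars) logits

With only one recurrent layer, everything before and after it can be computed frame-by-frame in parallel, which is what makes the parallelization point above work.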


Any chance of releasing the training data you used? Also, what are the plans for DeepSpeech? Is it just for use by Baidu, or will it be released as open source or as a developer API service?


How much latency does the system have in the best, worst, and average cases? And is your implementation public?


For a single utterance, it's fast enough that we can produce results in real time. Of course, building a production system for millions of users might require just a bit more engineering work...


To put it in perspective, my team at IBM Watson has already published better numbers (10.4% WER vs. 13.1% WER for Baidu) on the SWB dataset. We haven't run our model on the CH part, so we can't compare on the full test set. Paper here: http://www.mirlab.org/conference_papers/International_Confer....


Hi Jerome, those are great results! We got an email this morning from someone else on the Watson team pointing out that we didn't include the latest IBM number -- we'll be sure to update the results in the next version of the paper (three cheers for arXiv).

Of course, we openly say in the paper that we don't have the best result on the easy subset of Hub5'00 (we had it as 11.5%). We're more interested in advancing the state of the art on challenging, noisy, varied speech. Of course, we'll be working to push the SWB number down too :)


The team is already working on seeing what we get with CH. We'll let you know where we land. But your results are definitely impressive. We love to see new published innovation in the field. Kudos to the team!


What is the average and standard deviation of the performance level on this dataset?


For CH we get 19.1% for a combined rate of 14.75% - this is using 300 hours of training data.
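
That combined figure lines up with a simple average of the two Hub5'00 subsets, using the 10.4% SWB number from the earlier comment (equal weighting of SWB and CH is an assumption on my part):

    # Quick sanity check; equal weighting of the two subsets is assumed.
    swb_wer, ch_wer = 10.4, 19.1
    print((swb_wer + ch_wer) / 2)  # -> 14.75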


This is very fast progress from Baidu's Silicon Valley AI lab! Andrew Ng only joined Baidu in May, and (nearly?) all of the co-authors of this paper have joined him since then: http://www.technologyreview.com/news/527301/chinese-search-g...

Congrats to Carl, Sanjeev, Andrew, and the others.


Thanks for the kind words, Brandon! Been a busy couple of months :)


Very nice. I wonder if training can be simplified by training pieces of the model separately, instead of training all together. For example, the DeepSpeech model has three layers of feedforward neurons (where the inputs to the first layer are overlapping contexts of audio), followed by a bi-directional recurrent layer, followed by another feedforward layer. What would the results be if we trained the first layers (perhaps all three) on a different problem, such as autoencoding or fill-in-the-blank (as in word2vec), and then fixed those network weights to train the rest of the network?

Breaking the network up like this would reduce training time and perhaps reduce the needed training data. Since the first layers could be trained without supervision, less labeled data would be needed to train the last two layers. It would also facilitate transferring models between problems; the output of the first few layers, like a word2vec, could be fed into arbitrary other machine learning problems, e.g., translation.

If this does not work, then how about training the whole model together, but only once? The final results are reported for an ensemble of six independently trained networks. What if we started by training one network, and then fixed the first three layers to train the other networks? (Instead of fixing the first layers, you could also just give them a lower learning rate, although it isn't clear whether that would save you much.)
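
As a minimal sketch of what the "freeze the early layers" idea looks like in practice, reusing the placeholder DeepSpeechSketch class from the sketch a few comments up (the layer names ff/birnn/out and the learning rates are assumptions, not anything from the paper):

    import torch

    model = DeepSpeechSketch()

    # Suppose model.ff has been pretrained without labels (e.g. as an
    # autoencoder). Freeze it so only the recurrent and output layers
    # receive gradient updates.
    for p in model.ff.parameters():
        p.requires_grad = False

    optimizer = torch.optim.SGD(
        [p for p in model.parameters() if p.requires_grad], lr=1e-3)

    # The parenthetical alternative: keep everything trainable but give the
    # early layers a much smaller learning rate via parameter groups:
    # optimizer = torch.optim.SGD([
    #     {"params": model.ff.parameters(), "lr": 1e-5},
    #     {"params": model.birnn.parameters()},
    #     {"params": model.out.parameters()},
    # ], lr=1e-3)

Whether a front-end trained that way keeps enough information for the recurrent layer is exactly the open question raised above.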


So with 300 hours of training data it does worse on SWB than a DNN-HMM, or even a GMM-HMM system? But when they give it 2300 hours of training data, it can beat those 300-hour-trained systems?

This is still very cool, but that comparison doesn't seem fair at all.


Why not? DNN-HMM and GMM-HMM wouldn't have done any better even if trained on 2300 hours.


Mostly this, though it's not so black-and-white. The paper discusses results from a DNN-HMM system (Maas et al., using Kaldi) trained on 2k hours, and it does provide a small generalization improvement over 300 hours.

Much of the excitement about deep learning -- which we see as well in DeepSpeech -- is that these models continue to improve as we provide more training data. It's not obvious a priori that results will keep getting better after thousands of hours of speech. We're excited to keep advancing that frontier.


That was an even weirder comparison. They compare a system trained on 2000 hours of acoustic data mismatched with the testing data to their system, which was trained on 300 hours of matched data in addition to the 2000 hours of mismatched acoustic data.


Are any of these systems open source?


Both Kaldi[1] and CMU Sphinx[2] are high-quality open source speech systems. I know for a fact that Kaldi includes support for DNN acoustic models (I'm less familiar with Sphinx).

[1] http://kaldi.sourceforge.net/ [2] http://cmusphinx.sourceforge.net/


Thanks, appreciated, but my dear lord, without a PhD in AI these things are a bit beyond what most users, me included, would casually play around with. It would be great if this tech made it into a Dragon NaturallySpeaking-like end product for private use.



