
I really want real-world speech data. I think speech-to-text could open the door to lots of creative tech to help hard-of-hearing folks, but the dataset barrier is so massive :(



Have you seen the LibriSpeech ASR corpus[1]? It's large-scale (1,000 hours), Creative Commons-licensed English speech with the original book text.

The data is derived from read audiobooks from the LibriVox project and has been carefully segmented and aligned.

If you mean real world as in "with realistic noise", Baidu's DeepSpeech showed that adding noise to clean audio can produce a resilient model. In fact, having clean audio is better: you can overlay many different types of noise on the same clean recording, which acts, in effect, as regularization.
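A minimal sketch of that augmentation idea with NumPy: scale a noise signal so it mixes into a clean signal at a chosen signal-to-noise ratio. The sine-wave "speech" and Gaussian "noise" are stand-ins for real recordings; the SNR values are arbitrary examples.

```python
import numpy as np

def add_noise(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into a clean signal at a target signal-to-noise ratio (dB)."""
    # Tile or trim the noise to match the clean signal's length.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale noise so 10*log10(clean_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

# One clean utterance yields several noisy training examples:
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # stand-in for speech
noise = rng.normal(size=16000)                              # stand-in for real noise
augmented = [add_noise(clean, noise, snr) for snr in (20, 10, 5)]
```

Each pass over the clean corpus can use freshly drawn or differently scaled noise, so the model never sees exactly the same input twice.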

[1]: http://www.openslr.org/12/


I wonder why semi-supervised learning hasn't taken off more. There is so much unlabelled data out there, e.g. in podcasts, YouTube videos, and television. A small fraction of it is labelled, like captioned television programs and YouTube videos. You could use those captions to train a weak model, which could then provide labels for the unlabelled data.

You can correct many of its errors with simple language models. For instance, the phrase "wreck a nice beach" is much less probable than "recognize speech". So if the model isn't sure which one it heard, you can assume it's the more probable one. Then train on that, and it will get even better at recognizing those words.
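A toy sketch of that rescoring step: a smoothed bigram language model built from a tiny made-up corpus (standing in for billions of words) scores two acoustically confusable candidates, and the decoder keeps the more probable one. The corpus, smoothing, and candidates are all illustrative assumptions.

```python
import math
from collections import Counter

# Tiny stand-in corpus; real LMs are trained on far more text.
corpus = ("it is hard to recognize speech " * 50 +
          "they went to wreck a nice beach").split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def log_prob(sentence: str) -> float:
    """Bigram log-probability with add-one smoothing."""
    words = sentence.split()
    vocab = len(unigrams)
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
        for a, b in zip(words, words[1:])
    )

# When the acoustic model can't decide, prefer the more probable phrase:
candidates = ["recognize speech", "wreck a nice beach"]
best = max(candidates, key=log_prob)
```

In a real system the LM score would be combined with the acoustic score rather than overriding it outright.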


>You can correct many of its errors with simple language models. For instance, the phrase "wreck a nice beach" is much less probable than "recognize speech". So if the model isn't sure which one it heard, you can assume it's the more probable one. Then train on that, and it will get even better at recognizing those words.

Why not just build that heuristic into your initial supervised "weak" model? Training on data labeled by a model introduces no new information, so you're not gaining anything there.


It does introduce new information. This is how semi-supervised learning works. Many words may be missed by the weak model, but can be inferred correctly from their context. Then you have new labels to train the weak model on, to make it better at those words.

The way I'm describing it is probably not optimal, and I don't know if there is a better way. But the point is it must be possible to take advantage of the vast quantities of unlabelled data we have. Human brains somehow do something similar.

Semi-supervised learning is really cool. I saw one successful example where they labelled just a few emails as spam and not spam. Then they used their weak classifier to label thousands of unclassified emails, and used those as training data for an even stronger model. It actually works: http://matpalm.com/semi_supervised_naive_bayes/semi_supervis... https://en.wikipedia.org/wiki/Semi-supervised_learning
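A minimal self-training sketch of that loop, using a toy nearest-centroid classifier instead of the Naive Bayes from the linked example: fit on a couple of labelled points, pseudo-label a pool of unlabelled points, and refit on everything. The synthetic 2-D "email features" are an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(42)

# Two Gaussian clusters stand in for "spam" / "not spam" feature vectors.
labeled_x = np.array([[0.0, 0.0], [4.0, 4.0]])   # one labelled example per class
labeled_y = np.array([0, 1])
unlabeled = np.concatenate([
    rng.normal([0, 0], 0.5, size=(100, 2)),
    rng.normal([4, 4], 0.5, size=(100, 2)),
])

def fit_centroids(x, y):
    """A weak classifier: the mean feature vector of each class."""
    return np.stack([x[y == c].mean(axis=0) for c in (0, 1)])

def predict(centroids, x):
    d = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

# Self-training: pseudo-label the unlabelled pool, then refit on everything.
centroids = fit_centroids(labeled_x, labeled_y)
for _ in range(5):
    pseudo = predict(centroids, unlabeled)
    x = np.concatenate([labeled_x, unlabeled])
    y = np.concatenate([labeled_y, pseudo])
    centroids = fit_centroids(x, y)
```

With well-separated clusters this converges quickly; with overlapping classes, confidently pseudo-labelled errors can reinforce themselves, which is the known failure mode of self-training.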


If you make your training set the output of Google's speech recognition algo on YouTube, there will be lots of label noise.

It's better to go with movies and audiobooks, because the text "transcription" is human-generated. Aligning known text to audio is much more accurate than open-ended speech recognition.
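Forced alignment against audio uses a dynamic-programming match over acoustic frames; as a toy text-to-text stand-in, here is a Levenshtein alignment that pairs a known transcript with a rough recognizer output, so each reference word gets anchored even where the recognizer erred. The example sentences are made up.

```python
def align(ref, hyp):
    """Levenshtein alignment of a known transcript (ref) against recognizer
    output (hyp); returns (ref_word, hyp_word_or_None) pairs."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # match / substitution
    # Trace back to pair each reference word with a hypothesis word (or None).
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                dp[i][j] == dp[i - 1][j - 1] + (0 if ref[i - 1] == hyp[j - 1] else 1)):
            pairs.append((ref[i - 1], hyp[j - 1])); i -= 1; j -= 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            pairs.append((ref[i - 1], None)); i -= 1
        else:
            pairs.append((None, hyp[j - 1])); j -= 1
    return pairs[::-1]

ref = "the quick brown fox".split()
hyp = "the quack brown fox jumps".split()  # recognizer errors + an insertion
pairs = align(ref, hyp)
```

Because the reference text is trusted, mismatched pairs like ("quick", "quack") can simply be relabelled with the reference word, which is why human-transcribed sources give much cleaner training labels.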




