This project is the result of a one year long learning process in speech recognition and speech synthesis.
The original task was to automate the testing of a voice-enabled IVR system. While we started with real audio recordings, it very soon became clear that this approach was not feasible for a non-trivial app and that it would be impossible to reach satisfying test coverage. On the other hand, we had to find a way to transcribe the voice app's responses to text in order to run our automated assertions.
As cloud-based solutions were not an option (company policy), we quickly got frustrated, as there was no "get shit done" open-source stack available for doing medium-quality text-to-speech and speech-to-text conversions. We learned how to train and use Kaldi, which according to some benchmarks is the best available system out there, but it mainly targets academic users and research. We made the heavy-weight MaryTTS work to synthesize speech in reasonable quality.
And finally, we packaged all of this in a DevOps-friendly HTTP/JSON API with a Swagger definition.
As always, feedback and contributions are welcome!
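For readers who want to try it, here is a minimal sketch of calling such an HTTP/JSON API from Python. The base URL and the `/api/stt/{language}` and `/api/tts/{language}` endpoint names are assumptions for illustration; the project's Swagger definition has the authoritative routes:

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:56000"  # assumed local deployment address

def stt_endpoint(language):
    # assumed route for speech-to-text; check the Swagger definition
    return f"{BASE_URL}/api/stt/{language}"

def tts_endpoint(language, text):
    # assumed route for text-to-speech; text passed as a query parameter
    query = urllib.parse.urlencode({"text": text})
    return f"{BASE_URL}/api/tts/{language}?{query}"

# Example call against a running instance (uncomment to use):
# with open("prompt.wav", "rb") as f:
#     req = urllib.request.Request(
#         stt_endpoint("en"), data=f.read(),
#         headers={"Content-Type": "audio/wav"})
#     print(json.loads(urllib.request.urlopen(req).read()))
```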
I've built a speech synthesis system with MaryTTS before. It works, but unfortunately the quality is not very good. HMM and unit-selection speech synthesis are by now very old approaches, far from the current state of the art. You should try open-source implementations of Tacotron 2 or WaveNet; you will surely achieve better quality.
You should indicate somewhere which languages are supported. You mention Deepspeech German in the credits, but did you use it to implement support for German or just to inform your architectural choices?
good point, will add this information. in short, german and english, because those are the languages i am comfortable with. i hope to find native speakers contributing more languages.
deepspeech was evaluated, but right now kaldi provides better performance, that's why we stick with kaldi by default.
Mozilla DeepSpeech has seen significant improvements in the quality of results it provides (and RAM/CPU usage) since early December, the accuracy of transcriptions has substantially improved in our testing.
Can you provide some quantifiable results, like word error rate or something? Do you think it is better than what Kaldi offers currently? (In terms of accuracy, not in terms of computational performance.)
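For reference, word error rate is just a word-level edit distance between reference and hypothesis transcripts, normalized by the reference length. A minimal self-contained sketch:

```python
def wer(ref, hyp):
    """Word error rate: word-level Levenshtein distance between the
    reference and hypothesis, divided by the reference word count."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edits to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(r)][len(h)] / len(r)
```

For example, `wer("the cat sat on the mat", "the cat sat on mat")` counts one deletion over six reference words.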
> deepspeech was evaluated, but right now kaldi provides better performance
Was this before the streaming API was added to DeepSpeech? I recently did some testing with it, and it provides text within ~100ms of the last audio block on my PC.
edit: that is, the most significant latency I had was from having to wait a bit to detect end of speech.
with "performance" i meant the error rate, not the speed. speed was not a criterion for us, so we didn't evaluate it.
i read that deepspeech is way quicker in the training phase, as it is smarter in using gpu computing power.
we used a context-free data set, augmented with additional domain-specific sound samples, and it worked out fine, although the additional samples made nearly no difference.
I built something quite similar for my own product. Is there any interest in adding more STT/TTS backends to the software? Think of services like Lyrebird or Trint.
I could contribute towards it since I have done it before.
zamia-speech: asr training scripts for research purposes, plus several ready-trained asr models for download, based on voxforge data. zamia-speech covers the training part (very hard in terms of know-how, hardware and software requirements), which projects like botium speech processing are then based upon.
the other one is an example for packaging kaldi in a docker container.
in the past to provide speech training data, but they were not really interested. I thought it would be great to improve STT by having a really good and HUGE set of german audiobooks with the corresponding text that is publicly available... unfortunately I had no success trying to script something for this purpose (mainly for lack of time).
It basically describes the thing you mentioned - matching freely available audio books with the source text and using some tools to preprocess the data suitable for ASR training (alignment, splitting).
Can anyone in the space expand on why it's increasingly rare to see people using/building on Sphinx[0]? Do people avoid it simply because of an impression that it won't be good enough compared to deep learning driven approaches?
I've avoided Sphinx after trying to use it, because:
1. Compiling it is hit and miss. Sometimes it works, sometimes it doesn't. There is no official package in any Linux distribution, so packaging anything with it is incredibly painful. There's no easy way to cross-compile your project, so you'll end up working around the build process.
2. The documentation is woeful.
> Recent CMUSphinx code has noise cancellation featur. In sphinxbase/pocketsphinx/sphinxtrain it’s ‘remove_noise’ option. In sphinx4 it’s Denoise frontend component. So if you are using latest version you should be robust to noise in some degree already. Most modern models are trained with noise cancellation already, if you have your own model you need to retrain it.
> The algorithm impelmented is spectral subtraction in on mel filterbank. There are more advanced algorithms for sure, if needed you can extend the current implementation.
Inconsistent methods across the codebases prevent you from knowing where to look, and if something is documented, it may involve spelling errors you have to guess around (like above). There are also plenty of vague references to other documents that may or may not even exist.
for me it was exactly that, yes. we had powerful hardware and budget available, so there was no reason to stick with statistical models. the available benchmarks showed that kaldi easily outperforms cmusphinx - when starting from scratch you typically pick the leader, i would say.
I think that Sphinx is about as good as it can get. Without a really significant technical change there is no prospect of reaching the performance that would allow it to be applied successfully to the kinds of applications people want to try.
I maintain a platform which features live video events we'd like to add captioning to, and so far I can only see IBM Watson providing a WebSocket interface for near-real-time STT.
forgot to mention: when doing realtime parallel processing, the default configuration of this project is not a feasible setup. you have to run way more decoder workers, maybe distributed across several machines.
Is MaryTTS still as good as it gets for free TTS? I've been researching this topic and it seems like there are some open-source implementations of Tacotron, but the quality isn't necessarily great.
with ssml formatting it is good enough for a customer-facing ivr, though clearly recognizable as a robotic voice - if that is an issue i would not recommend it
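As an illustration, here is a small SSML sketch of the kind of markup meant here. MaryTTS supports a subset of SSML, and tag support varies by engine, so treat this as an example rather than a guaranteed feature list:

```xml
<speak>
  Welcome to the support hotline.
  <break time="500ms"/>
  Your ticket number is
  <say-as interpret-as="characters">A42</say-as>.
  <prosody rate="slow">Press one to continue.</prosody>
</speak>
```

Pauses, spell-out hints and prosody control like this are usually what lifts a robotic voice from unusable to acceptable for an IVR.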
it means: easy to install, easy to use, medium performance. no further know-how needed.
compared to the total effort of selecting, training and deploying speech recognition and speech synthesis, it provides an extremely quick boilerplate to add voice to your pipeline.
i wish there had been something like that when we started the project.
But 40 GB? I feel like that includes training data or something. A model can't just be 40 GB, or else all of the audio would have to be passed through all 40 GB of the model during inference. That seems huge.
40GB is maybe too much, but when building the docker images there is some space wasted. The image size after building is around 20GB (12GB marytts, 3GB kaldi de, 6GB kaldi en)
picotts looks cool, but it's not maintained (maybe it's perfect already?). nanotts is a fork of picotts with improvements to its CLI interface, but it's not very actively maintained either (I was not able to compile it, due to it expecting ALSA; somehow the -noalsa switch didn't help). I also discovered gtts (and the simpler google_speech), both available through pip. They're interfaces to Google's APIs and have routines for handling large texts properly.
in our tests the performance was comparable, so we had no reason to switch from kaldi to something else (german data only). after all, what matters most is the available training data.
are there ready-trained models available for wav2letter?
Oops, there I go being Anglocentric again. The datasets are out there; but idk if there are German trained models. Fortunately wav2letter++ looks like it's been built to be very friendly to train a more extensive one if there isn't. My understanding is they are still releasing parts of it.
What's your reasoning? Google's latest published research is analogously parsimonious with this approach. They are doing decoding using a simple beam search through a single neural network. See https://ai.googleblog.com/2019/03/an-all-neural-on-device-sp...
very very interesting, didn't know this one. as soon as i have some hours left i will try to run some evaluation on this. after all, for my project only performance in the german language counts.
we used existing recipes for training (tuda german), slightly adapted and with additional training data. i am sure with more knowledge we would have gained some extra percent ... performance is not really good, but good enough for our purposes, and with a satisfying cost vs quality trade-off.
Did you miss wav2letter achieving a new 2019 SOTA on librispeech? Both projects have a _lot_ of architectures and papers in play, I know it can be hard to keep up.
Aren’t some of the peak Kaldi numbers also with _huge_ language models that aren’t so practical to deploy? The best kaldi results I’ve found here both use “fglarge” LMs, but a quick search didn’t tell me how big that actually is.
The best I can find for the test set are (from main RESULTS):
Sure, one of their techniques is to semi supervised train on 60k hours of audio, so I’ll post their best numbers both with and without that step (as you said “if kaldi is trained on the same data”)
Without 60k unsupervised audio:
test-clean: 2.31% WER
test-other: 5.18% WER
With 60k audio:
test-clean: 2.03% WER
test-other: 4.11% WER
That’s so far ahead of kaldi’s librispeech RESULTS, let’s go up the list and look at their acoustic model tests that don’t even bother with a language model:
Without 60k or language model:
(test-clean / test-other WER)
3.05 / 7.01
With 60k without language model:
2.30 / 5.29
Am I missing a kaldi whitepaper here?
I found “State-of-the-Art Speech Recognition Using Multi-Stream Self-Attention With Dilated 1D Convolutions” on wer_are_we claiming 2.2 / 5.8 using Kaldi as a base, which is impressive, but they changed the network architecture and I didn’t find this in any of the open-source kaldi librispeech subdirectories.
(If arbitrary model changes count, at some point you might as well treat both wav2letter and kaldi as “tensorflow with some speech helpers” because I’m sure you can implement either framework’s best neural architecture in the other framework.)
If there’s a better paper on kaldi than 2.2 / 5.8 you should comment with it and PR it to wer_are_we, because wav2letter is currently beating every other LS result posted there.
The reason those kaldi results are noticeably worse is that they're not using RNN rescoring. The E2E models also take far more compute to train (facebook and google train their models for hundreds of epochs, kaldi for... 4!). All the papers you read game their results by adding unpruned n-gram LMs or massive NNLMs.
Point being, if you put in some more effort, I believe you can get better results with kaldi. Remember we're discussing which toolkit lets you achieve better performance (given a certain amount of effort), which is not the same as just taking the results from different teams, some of which have spent an order of magnitude more compute on gaming the results.
For SOTA results with a hybrid approach check out RWTH's stuff. The kaldi people have recently been busy rewriting their backend to work with pytorch.
In general using librispeech to gauge performance is a bad idea. It's read speech. I prefer to base my opinions on real world datasets.
Have to say I am impressed by how good their (FB) results are without using an LM though.
I agree that librispeech isn’t the best benchmark, but it’s the benchmark we have. If you have a significantly better benchmark that both wav2letter SOTA and Kaldi SOTA have participated in, I’d love to see it. Without some kind of numbers cited, it’s easier to assume you’re speculating. Especially because, as I think we both acknowledge, wav2letter requires a lot of training power to actually use, so I doubt you’ve done the training yourself in the same way you’ve worked with Kaldi to be able to talk about it authoritatively. (One of my goals with releasing w2l models is that people will be able to do e.g. transfer learning or distillation and not repeat the expensive up front training themselves)
RWTH has an impressive entry on the wer_are_we list for librispeech (2.3 / 5.0), but unfortunately I couldn’t find a way to actually use that research without a lot of work. Facebook has easy recipes posted that reproduce their results, which have been quite nice to work with. RWTH, Google, and Capio held an impressive top spot in the results for a while, which might be where you’re basing your judgement.
I don’t think the biggest benefit of wav2letter was the architecture (until their SOTA paper), I think it was the speed and simplicity. I managed to port the original wav2letter inference to numpy/pytorch in two small files (same model weights, same inference output) because it was such a simple convolutional architecture. (The only fiddly part was matching the output of their featurization code)
I agree the data and training overhead is a big aspect, which is why I’ve been training and releasing freely available wav2letter models on far more data than what Facebook has published using librispeech.
My upcoming goal is to reproduce the SOTA results but with stronger training data. My baseline data is around 4000h currently and is far more diverse than librispeech. Unfortunately I’m independent and can’t afford the LDC datasets, though I have some neat ideas for extending FB’s semi supervised work to even more data.
My models are here: https://talonvoice.com/research/
I haven’t yet posted the model I’ve been working on most recently. I’m 120 epochs into a large size model trained on all of my datasets. I also have another 1000-1500h of audio I haven’t finished prepping to train on.
so this is a really cool project. as soon as you finish your training I will be happy to add it as an option (or maybe the default setup) in the botium speech processing setup (if you want that).
Do you have any experience with online decoding in wav2letter? Is there something like a WebSocket API available somewhere?
from my experience, when trained with the same data, kaldi is slightly better and, with custom recipes, adaptable to changing conditions.
deepspeech has way better documentation and is more developer-friendly.
wav2letter seems to be the quickest.
i guess there is no real winner here ... depends on what criteria are applied.
just an example: kaldi is a weird mixture of c++, python2, python3, shell scripts, java and perl. hard to keep an overview of. deepspeech is python. wav2letter is a single executable.
I was under the impression DeepSpeech was native (C++), with bindings for Python and others. Personally I've used it with Node.js so far, and I couldn't see any dependencies on Python.
edit: I was talking client, you're talking training I guess.
yes, there are plenty of them. just google for something like "kaldi vs google".
in short: not surprisingly, the blockbuster cloud services provide better results, as they have way more training data. it's a tradeoff between price, privacy and quality.
currently included: german and english. contributions for other languages are welcome; native speakers will have better insights into the quality of the speech output and recognition model.
Cool! Thanks! I've tried both wav2letter and DeepSpeech, and now this, but I get very poor results even with short sentences (compared to Google's proprietary services). Would it be possible to also make an API for passing in training data and automatically updating the model? I'm thinking the results might get better if the models are trained with the specific audio/hardware/settings and dialect of the end user.
I just played with DeepSpeech (v0.6.1) and I found significant improvements by using a custom language model.
The language model is built from sentences and is rather quick to build (~seconds), at least for the small number of sentences I used. This can then be combined with the pre-trained neural net.
Though I hear DeepSpeech is currently fairly US-influenced when it comes to recognizing accents. So if you're not a native speaker, consider contributing to the open-source dataset over at https://voice.mozilla.org/
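The "built from sentences" part is conceptually simple: an n-gram language model starts from counts of short word sequences in the domain corpus. The real DeepSpeech pipeline builds a KenLM model on top of such statistics; this stdlib-only sketch just shows the counting idea:

```python
from collections import Counter

def bigram_counts(sentences):
    """Count word bigrams, with <s>/</s> sentence-boundary markers --
    the raw statistics an n-gram language model is estimated from."""
    counts = Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        counts.update(zip(words, words[1:]))
    return counts
```

With only a handful of domain sentences the counts heavily favor in-domain phrases, which is why a small custom language model can noticeably improve transcriptions of those phrases.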
it's always an issue with domain-specific utterances. for the freely available data they are oov and have to be handled somehow (e.g. generate pronunciations with sequitur).
it requires hundreds of hours of speech data to make any difference, at least for a context-free asr. and gpu-powered high-end hardware, and several days of training.
training an asr model is a totally different requirement from using it as a client.
> just a small server, hope that it won't crash when posting the link here
In all honesty I was just looking for some pre-processed examples.
The API itself seems intuitive enough, good job.
What really stands out to me is that open-source text-to-speech is really awful compared to commercial solutions - which surprises me. Does anybody know why this is the case? (And not just "money", I'm talking technology)
I tested the text-to-speech and it was acceptable, on par with the robotic voice many screen readers use. For me it's not that important that it sounds like a real human; it's more important that it's accurate and that you can hear what it says.
Also, pre-processed examples are not really useful, as the author will likely pick examples that turned out well. This API testing page was very useful though, as it was very easy for me to test myself.
when using the right ssml formatting, the output quality is not so bad. but it cannot compete with commercial engines, that's right.
i guess that's for the same reason that google is dominating the speech recognition world: they have tons of training data available. not smarter algorithms, just more data.