Show HN: Text-to-speech and speech-to-text open-source software stack (github.com/codeforequity-at)
438 points by ftreml on Jan 26, 2020 | 85 comments



This project is the result of a year-long learning process in speech recognition and speech synthesis.

The original task was to automate the testing of a voice-enabled IVR system. While we started with real audio recordings, it very soon became clear that this approach was not feasible for a non-trivial app and that it would be impossible to reach satisfying test coverage. On the other hand, we had to find a way to transcribe the voice app's responses to text so we could run our automated assertions.

As cloud-based solutions were not an option (company policy), we very quickly got frustrated, as there was no "get shit done" Open Source stack available for doing medium-quality text-to-speech and speech-to-text conversions. We learned how to train and use Kaldi, which according to some benchmarks is the best available system out there, but which mainly targets academic users and research. We made the heavyweight MaryTTS work to synthesize speech at reasonable quality.

And finally, we packaged all of this in a DevOps-friendly HTTP/JSON API with a Swagger definition.
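
To give an idea of what that looks like from a client, here is a minimal Python sketch; the endpoint paths and port below are illustrative assumptions only, the authoritative reference is the project's Swagger definition:

    # minimal client sketch - endpoint paths and port are assumptions,
    # check the project's Swagger definition for the real API
    import requests

    BASE = "http://localhost:56000"   # assumed port

    # speech-to-text: post a WAV file, get a JSON transcript back
    with open("prompt.wav", "rb") as f:
        r = requests.post(f"{BASE}/api/stt/en",
                          data=f.read(),
                          headers={"Content-Type": "audio/wav"})
    print(r.json())

    # text-to-speech: request a WAV rendering of a sentence
    r = requests.get(f"{BASE}/api/tts/en", params={"text": "Welcome to our hotline"})
    with open("answer.wav", "wb") as f:
        f.write(r.content)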

As always, feedback and contributions are welcome!


I've built a speech synthesis system with MaryTTS before. It works, but unfortunately the quality is not very good: HMM and unit-selection speech synthesis are very old approaches by now, far from the current state of the art. You should try open-source implementations of Tacotron 2 or WaveNet; you will surely achieve better quality.


are you ready for a pull request?


I'd be very pleased, but I can't promise anything. Good job, by the way; speech synthesis and speech recognition are not easy subjects to work on.


You should indicate somewhere which languages are supported. You mention Deepspeech German in the credits, but did you use it to implement support for German or just to inform your architectural choices?


good point, will add this information. in short, german and english, because those are the languages i am comfortable with. i hope to find native speakers contributing more languages. deepspeech was evaluated, but right now kaldi provides better performance, that's why we stick to kaldi by default.


Mozilla DeepSpeech has seen significant improvements in result quality (and in RAM/CPU usage) since early December; the accuracy of transcriptions has substantially improved in our testing.


Can you provide some quantifiable results, like word error rate or something? Do you think it is better than what Kaldi offers currently? (In terms of accuracy, not in terms of computational performance.)


Every release mentions the word error rate; it's interesting to watch the WER improve over time: https://github.com/mozilla/DeepSpeech/releases
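
For readers unfamiliar with the metric: WER is just the word-level edit distance between hypothesis and reference, divided by the number of reference words. A small illustrative sketch:

    # word error rate: word-level edit distance / number of reference words
    def wer(reference: str, hypothesis: str) -> float:
        ref, hyp = reference.split(), hypothesis.split()
        # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i
        for j in range(len(hyp) + 1):
            dp[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                               dp[i][j - 1] + 1,        # insertion
                               dp[i - 1][j - 1] + cost) # substitution
        return dp[len(ref)][len(hyp)] / len(ref)

    print(wer("switch the light on", "switch a light on"))  # 0.25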


> deepspeech was evaluated, but right now kaldi provides better performance

Was this before the streaming API was added to DeepSpeech? I recently did some testing with it, and it provides text within ~100 ms of the last audio block on my PC.

edit: that is, the most significant latency I had was from having to wait a bit to detect end of speech.
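
For illustration, a rough streaming sketch against the DeepSpeech Python bindings (names follow the 0.7-era API and may differ slightly in other releases; model and file paths are placeholders):

    # rough streaming sketch; 0.7-era Python API names, placeholder paths
    import wave
    import numpy as np
    from deepspeech import Model

    model = Model("deepspeech-models.pbmm")          # placeholder acoustic model
    model.enableExternalScorer("deepspeech.scorer")  # optional external language model

    stream = model.createStream()
    with wave.open("utterance.wav", "rb") as wav:    # expects 16 kHz, 16-bit mono
        while True:
            frames = wav.readframes(1600)            # ~100 ms per chunk
            if not frames:
                break
            stream.feedAudioContent(np.frombuffer(frames, dtype=np.int16))
            print(stream.intermediateDecode())       # partial transcript so far
    print(stream.finishStream())                     # final transcript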


with "performance" i meant the error rate, not the speed. speed was not a criteria for us, so we didnt evaluate it. i read that deepspeech is way quicker in training phase as it is smarter in using gpu computing power.


Ahh gotcha. Yes I got rather poor "hit rate" until I made my own language model.

Fortunately for me I only needed it for command recognition, so the process was quite quick, and results were very good. No need to retrain the net.


we used a context-free data set, augmented with additional domain-specific sound samples, and it worked out fine, although the additional samples made nearly no difference


What was the IVR software you were using? Was it bespoke?


no, it was a specialized call center software called vacapo.


I built something quite similar for my own product. Is there any interest in adding more STT/TTS backends to the software? Think services like Lyrebird or Trint.

I could contribute towards it since I have done it before.

Thank you for building this!


yes, absolutely! it should be a good mix of freely available packages, with a meaningful default configuration.


Here's a sample wav output from using their swagger endpoint: https://drive.google.com/file/d/15y83NSXOCrEW9v9eQVCy6oHcWJ8...

Why does the voice/pronunciation have such drastic volume spikes and dips?


"Now let's have a little taste of that old computer generated swagger." -Max Headroom

https://www.youtube.com/watch?v=WTN1WsUCyQc&t=3m26s


marytts supports a large number of voices in several languages. you could try another voice.


Could you explain the difference between this and

- https://github.com/gooofy/zamia-speech#asr-models

- https://github.com/mpuels/docker-py-kaldi-asr-and-model

with regard to speech recognition, apart from the fact that it's easier to use?


zamia-speech provides asr training scripts for research purposes and several ready-trained asr models for download, based on voxforge data. zamia-speech covers the (very hard in terms of know-how, hardware and software requirements) training part, which projects like botium speech processing are then based upon.

the other one is an example of packaging kaldi in a docker container.


Thank you :-) I tried to get in touch with

https://www.vorleser.net/

in the past to provide speech training data, but they were not really interested. I thought it would be great to improve STT with a really good and HUGE set of German audiobooks with matching text that is publicly available... unfortunately I had no success trying to script something for this purpose (mainly lack of time).


I point you to this article: https://medium.com/@klintcho/creating-an-open-speech-recogni...

It basically describes what you mentioned - matching freely available audio books with the source text and using some tools to preprocess the data into a form suitable for ASR training (alignment, splitting).
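
A rough sketch of that pipeline, assuming the aeneas and pydub Python packages are installed (file and chapter names are placeholders, not taken from the article):

    # sketch: force-align an audiobook chapter to its text with aeneas,
    # then cut per-sentence clips for ASR training
    import json
    import subprocess
    from pydub import AudioSegment

    subprocess.run([
        "python", "-m", "aeneas.tools.execute_task",
        "chapter01.mp3", "chapter01.txt",
        "task_language=deu|is_text_type=plain|os_task_file_format=json",
        "syncmap.json",
    ], check=True)

    audio = AudioSegment.from_file("chapter01.mp3")
    with open("syncmap.json") as f:
        fragments = json.load(f)["fragments"]

    for i, frag in enumerate(fragments):
        start_ms = int(float(frag["begin"]) * 1000)
        end_ms = int(float(frag["end"]) * 1000)
        clip = audio[start_ms:end_ms].set_frame_rate(16000).set_channels(1)
        clip.export(f"clip_{i:05d}.wav", format="wav")
        # the transcript for this clip is " ".join(frag["lines"])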


Can anyone in the space expand on why it's increasingly rare to see people using/building on Sphinx[0]? Do people avoid it simply because of an impression that it won't be good enough compared to deep learning driven approaches?

[0]: https://cmusphinx.github.io/


I've avoided Sphinx after trying to use it, because:

1. Compiling it is hit and miss. Sometimes it works, sometimes it doesn't. There is no official package in any Linux distribution, so packaging anything with it is incredibly painful. There's no easy way to cross-compile your project, so you'll end up working around the build process.

2. The documentation is woeful.

> Recent CMUSphinx code has noise cancellation featur. In sphinxbase/pocketsphinx/sphinxtrain it’s ‘remove_noise’ option. In sphinx4 it’s Denoise frontend component. So if you are using latest version you should be robust to noise in some degree already. Most modern models are trained with noise cancellation already, if you have your own model you need to retrain it.

> The algorithm impelmented is spectral subtraction in on mel filterbank. There are more advanced algorithms for sure, if needed you can extend the current implementation.

Inconsistent methods across the codebases prevent you from knowing where to look, and if something is documented, it may involve spelling errors which you have to guess around (like the above). There are also plenty of vague references to other documents that may or may not even exist.


for me it was exactly that, yes. we had powerful hardware and budget available, so there was no reason to stick to statistical models. the available benchmarks showed that kaldi easily outperforms cmusphinx - when starting from scratch you typically select the leader, i would say


I think that sphinx is about as good as it can get. Without really significant technical change, there is no prospect of the performance that would allow it to be applied successfully to the kinds of applications people want to try.


Is this using Google's Tacotron 2 or WaveNet anywhere? How does it compare to them?


Any recommendations for a real time solution?

I maintain a platform that features live video events to which we'd like to add captioning, and so far I can only see IBM Watson providing a WebSocket interface for near-real-time STT.


the project includes a websocket endpoint for realtime decoding. will add it to the docs.

we are already using it for a call center with around 50 parallel audio streams.


forgot to mention: when doing realtime parallel processing, the default configuration of this project is not a feasible setup. you have to run way more decoder workers, maybe distributed across several machines.
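
To illustrate how a realtime client could talk to such an endpoint, here is a rough sketch; the URL and message protocol below are assumptions for illustration, not the project's documented interface:

    # rough client sketch - assumes a hypothetical websocket URL and a protocol of
    # "send raw 16 kHz 16-bit PCM chunks, receive JSON partial/final transcripts"
    import asyncio
    import json
    import websockets

    async def stream_file(path):
        async with websockets.connect("ws://localhost:56000/api/stream/en") as ws:
            with open(path, "rb") as f:                # raw PCM, no WAV header
                while chunk := f.read(3200):           # ~100 ms of 16 kHz 16-bit mono audio
                    await ws.send(chunk)
            await ws.send(json.dumps({"eof": True}))   # hypothetical end-of-stream marker
            async for message in ws:
                print(json.loads(message))

    asyncio.run(stream_file("call.pcm"))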


Is MaryTTS still as good as it gets for free TTS? I've been researching this topic and it seems like there are some open-source implementations of Tacotron, but the quality isn't necessarily great.


The NVIDIA Tacotron 2 implementation was much better out of the box than the wav in the neighboring thread.


with ssml formatting it is good enough for a customer-facing ivr, though clearly recognizable as a robotic voice - if that is an issue i would not recommend it
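
As an illustration of the kind of SSML markup meant here, a small sketch against a local MaryTTS server (MaryTTS supports only a subset of SSML and tag support varies by voice; the URL, port and locale are assumptions based on the usual defaults):

    # sketch: render SSML through a local MaryTTS server (commonly on port 59125)
    import requests

    ssml = """<speak xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
      Welcome to our hotline. <break time="400ms"/>
      <prosody rate="90%">Please say the reason for your call after the tone.</prosody>
    </speak>"""

    r = requests.get("http://localhost:59125/process", params={
        "INPUT_TEXT": ssml,
        "INPUT_TYPE": "SSML",
        "OUTPUT_TYPE": "AUDIO",
        "AUDIO": "WAVE_FILE",
        "LOCALE": "en_US",
    })
    with open("prompt.wav", "wb") as f:
        f.write(r.content)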


That's fantastic work and the demo is very well done. Thanks for sharing it. You obviously put a lot of hard work into it. Feels super polished.


What exactly does “low-key” mean in this context?


it means: easy to install, easy to use, medium performance. no further know-how needed. compared to the total effort for selecting, training and deploying speech recognition and speech synthesis, it provides an extremely quick boilerplate to add voice to your pipeline. i wish there had been something like that when we started the project.


If marytts is so good, why are many linux distros still using https://en.wikipedia.org/wiki/Festival_Speech_Synthesis_Syst... as their default tts system?


just a guess: marytts is rather heavyweight


So why is 40gigs of free space needed?


speech recognition model files are quite big


But 40 GB? I feel like that includes training data or something. A model can't just be 40 GB, or else all of the audio would have to be passed through all 40 GB of the model during inference. That seems huge.


40 GB is maybe too much, but when building the docker images some space is wasted. The image size after building is around 20 GB (12 GB marytts, 3 GB kaldi de, 6 GB kaldi en).


Would love to see a live demo. MaryTTS demo link is broken.



Thanks for providing the test service. It's really easy to use if one is familiar with swagger UIs. First tests show that the results are good.


Thanks. Hard to use and poor result :(

Good example: https://cloud.google.com/text-to-speech/


It's an API, of course it is hard to use without any real user interface ... but as an STT/TTS API, it won't get much easier than that ...

Of course it is not a competitor to Google in any sense.


I'd like to have my laptop read out epubs or articles. Recommendations for speech synthesis (TTS) on the command line?


picotts is a command line tool. marytts is a client/server tool (both are included in botium speech processing and callable with curl).

for high quality, there are google cloud speech and amazon polly.
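
A rough sketch of scripting the former (assuming the SVOX Pico pico2wave binary is installed, e.g. from a libttspico-utils package, and that an ALSA player is available):

    # sketch: read a text file aloud with the SVOX Pico command line tool
    # (pico2wave has a fairly small text length limit, so long articles may need splitting)
    import subprocess

    with open("article.txt") as f:
        text = f.read()

    subprocess.run(["pico2wave", "--lang=en-US", "--wave=article.wav", text], check=True)
    subprocess.run(["aplay", "article.wav"], check=True)   # or any other audio player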


picotts looks cool, but it's not maintained (maybe it's perfect already?). nanotts is a fork of picotts with improvements to its cli interface, but it's not very actively maintained either (I was not able to compile it, due to it expecting ALSA; somehow the -noalsa switch didn't help). I also discovered gtts (and the simpler google_speech), both available through pip. They're interfaces to google's apis, and they have routines for handling large texts properly.


Facebook has released wav2letter++. I'd wager that will outperform kaldi by a wide margin.


in our tests the performance was comparable, so we had no reason to switch from kaldi to something else (german data only). after all, what matters most is the available training data. are there ready-trained models available for wav2letter?


Oops, there I go being Anglocentric again. The datasets are out there; but idk if there are German trained models. Fortunately wav2letter++ looks like it's been built to be very friendly to train a more extensive one if there isn't. My understanding is they are still releasing parts of it.


It is very unlikely that an E2E toolkit can outperform kaldi if they're trained on the same data.

Also I suspect these guys aren't using the latest kaldi architectures.


What's your reasoning? Google's latest published research is analogously parsimonious with this approach. They are doing decoding using a simple beam search through a single neural network. See https://ai.googleblog.com/2019/03/an-all-neural-on-device-sp...


very very interesting, didn't know this one. as soon as i have some hours left i will try to run some evaluation on it. after all, for my project only performance in german counts


we used existing recipes for training (tuda german), slightly adapted and with additional training data. i am sure that with more knowledge we would have gained some extra percent ... performance is not really good, but good enough for our purposes, and with a satisfying cost vs. quality trade-off


You should look at the latest recipes for tedlium, librispeech, swbd etc.; you'll get significantly better results with those models.


Did you miss wav2letter achieving a new 2019 SOTA on librispeech? Both projects have a _lot_ of architectures and papers in play, I know it can be hard to keep up.

Aren’t some of the peak Kaldi numbers also with _huge_ language models that aren’t so practical to deploy? The best kaldi results I’ve found here both use “fglarge” LMs, but a quick search didn’t tell me how big that actually is.

The best I can find for the test set are (from main RESULTS):

    test-clean: 4.31% WER
    test-other: 10.62% WER
From here (not sure why buried): https://github.com/kaldi-asr/kaldi/blob/master/egs/librispee...

    test-clean: 3.80% WER
    test-other: 8.76% WER
This is the 2019 wav2letter SOTA: https://github.com/facebookresearch/wav2letter/blob/master/r...

Sure, one of their techniques is to semi supervised train on 60k hours of audio, so I’ll post their best numbers both with and without that step (as you said “if kaldi is trained on the same data”)

Without 60k unsupervised audio:

    test-clean: 2.31% WER
    test-other: 5.18% WER
With 60k audio:

    test-clean: 2.03% WER
    test-other: 4.11% WER
That’s so far ahead of kaldi’s librispeech RESULTS, let’s go up the list and look at their acoustic model tests that don’t even bother with a language model:

Without 60k or language model:

    (test-clean / test-other WER)
    3.05 / 7.01
With 60k without language model:

    2.30 / 5.29
Am I missing a kaldi whitepaper here?

I found “State-of-the-Art Speech Recognition Using Multi-Stream Self-Attention With Dilated 1D Convolutions” on wer_are_we claiming 2.2 / 5.8 using Kaldi as a base, which is impressive, but they changed the network architecture and I didn’t find this in any of the open-source kaldi librispeech subdirectories. (If arbitrary model changes count, at some point you might as well treat both wav2letter and kaldi as “tensorflow with some speech helpers” because I’m sure you can implement either framework’s best neural architecture in the other framework.)

If there’s a better paper on kaldi than 2.2 / 5.8 you should comment with it and PR it to wer_are_we, because wav2letter is currently beating every other LS result posted there.


The reason those kaldi results are noticeably worse is that they're not using RNN rescoring. The E2E models also take far more compute to train (facebook and google train their models for hundreds of epochs, kaldi for... 4!). All the papers you read game results by adding unpruned ngram LMs or massive NNLMs.

Point being, if you put in some more effort, I believe you can get better results with kaldi. Remember we're discussing which toolkit can achieve better performance (given a certain amount of effort), which is not the same as just taking the results from different teams, some of which have spent an order of magnitude more compute on gaming the results.

For SOTA results with a hybrid approach check out RWTH's stuff. The kaldi people have recently been busy rewriting their backend to work with pytorch.

In general using librispeech to gauge performance is a bad idea. It's read speech. I prefer to base my opinions on real world datasets.

Have to say I am impressed by how good their (FB) results are without using an LM though.


I agree that librispeech isn’t the best benchmark, but it’s the benchmark we have. If you have a significantly better benchmark that both wav2letter SOTA and Kaldi SOTA have participated in, I’d love to see it. Without some kind of numbers cited, it’s easier to assume you’re speculating. Especially because, as I think we both acknowledge, wav2letter requires a lot of training power to actually use, so I doubt you’ve done the training yourself in the same way you’ve worked with Kaldi to be able to talk about it authoritatively. (One of my goals with releasing w2l models is that people will be able to do e.g. transfer learning or distillation and not repeat the expensive up front training themselves)

RWTH has an impressive entry on the wer_are_we list for librispeech (2.3 / 5.0), but unfortunately I couldn’t find a way to actually use that research without a lot of work. Facebook has easy recipes posted that reproduce their results, which have been quite nice to work with. RWTH, Google, and Capio held an impressive top spot in the results for a while, which might be where you’re basing your judgement.

I don’t think the biggest benefit of wav2letter was the architecture (until their SOTA paper), I think it was the speed and simplicity. I managed to port the original wav2letter inference to numpy/pytorch in two small files (same model weights, same inference output) because it was such a simple convolutional architecture. (The only fiddly part was matching the output of their featurization code)

I agree the data and training overhead is a big aspect, which is why I’ve been training and releasing freely available wav2letter models on far more data than what Facebook has published using librispeech.

My upcoming goal is to reproduce the SOTA results but with stronger training data. My baseline data is around 4000h currently and is far more diverse than librispeech. Unfortunately I’m independent and can’t afford the LDC datasets, though I have some neat ideas for extending FB’s semi supervised work to even more data.


You are right I've never tried out wav2letter on a large amount of data.

Cool project! I'll try out your models.


are there any wav2letter models ready for download?


Facebook posted some of the models listed in their SOTA paper already. I haven’t tried these yet.

https://github.com/facebookresearch/wav2letter/blob/master/r...

My models are here: https://talonvoice.com/research/ I haven’t yet posted the model I’ve been working on most recently. I’m 120 epochs into a large size model trained on all of my datasets. I also have another 1000-1500h of audio I haven’t finished prepping to train on.

Here is a web demo. It’s currently running a slightly older checkpoint of my WIP large model, and the deepspeech LM: https://web2letter-west-1.talonvoice.com/


so this is a really cool project. as soon as you have finished your training I will be happy to add it as an option (or maybe the default setup) in the botium speech processing setup (if you want that).

Do you have any experience with online decoding in wav2letter? Is there something like a Websocket API available somewhere?


How does accuracy compare to what you’re using now?


the german model is from the kaldi tuda recipe with a WER of 15%.

the english one is from the tedlium recipe with a WER of 7%.

room for improvement, but for our original purpose it was sufficient.


Any idea how it compares to Mozilla's DeepSpeech?


from my experience, when trained with the same data, kaldi is slightly better and, with custom recipes, adaptable to changing conditions. deepspeech has way better documentation and is more developer friendly. wav2letter seems to be the quickest. i guess there is no real winner here ... it depends on what criteria are applied.

just an example: kaldi is a weird mixture of c++, python2, python3, shell scripts, java and perl - hard to get an overview of. deepspeech is python. wav2letter is an exe file.


> deepspeech is python

I was under the impression DeepSpeech was native (C++), with bindings for Python and others. Personally I've used it with Node.js so far, and I couldn't see any dependencies on Python.

edit: I was talking about the client, you're talking about training, I guess.


Are there any performance metrics of this versus other offline and cloud based services?


yes, there are plenty of them. just google for something like "kaldi vs google".

in short: not surprisingly, the blockbuster cloud services provide better results, as they have way more training data. it's a tradeoff between price, privacy and quality.


Which languages are supported?


currently included are german and english. contributions for other languages are welcome; native speakers will have better insight into the quality of the speech output and the recognition model


A web demo would be cool.


https://speech.botiumbox.com

just a small server, hope that it won't crash when posting the link here


Cool! Thanks! I've tried both wav2letter and DeepSpeech, and now this, but I get very poor results even with short sentences (compared to Google's proprietary services). Would it be possible to also make an API for passing in training data and automatically updating the model? I'm thinking that the results might get better if the models are trained with the specific audio/hardware/settings and dialect of the end user.


I just played with DeepSpeech (v0.6.1) and I found significant improvements by using a custom language model.

The language model is built from sentences and is rather quick to build (~seconds), at least for the small number of sentences I used. This can then be combined with the pre-trained neural net.
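
Roughly, the steps look like the sketch below (assuming KenLM's lmplz and build_binary are on the PATH; the DeepSpeech-specific packaging step differs by release - a trie in 0.6, a scorer package in 0.7+ - and is omitted here):

    # sketch: build a small domain-specific KenLM language model from a list of sentences
    import subprocess

    with open("sentences.txt") as src, open("lm.arpa", "w") as arpa:
        subprocess.run(["lmplz", "--order", "3", "--discount_fallback"],
                       stdin=src, stdout=arpa, check=True)

    subprocess.run(["build_binary", "lm.arpa", "lm.binary"], check=True)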

Though I hear DeepSpeech is currently fairly US-influenced when it comes to recognizing accents. So if you're not a native speaker, consider contributing to the open-source dataset over at https://voice.mozilla.org/


it's always an issue with domain-specific utterances. for the freely available data sets they are oov and have to be handled somehow (e.g. generate pronunciations with sequitur)


it requires hundreds of hours of speech data to make any difference, at least for a context-free asr. and gpu-powered high-end hardware, and several days of training.

training an asr model is a totally different requirement from using it as a client


> just a small server, hope that it won't crash when posting the link here

In all honesty I was just looking for some pre-processed examples.

The API itself seems intuitive enough, good job.

What really stands out to me is that open-source text-to-speech is really awful compared to commercial solutions - which surprises me. Does anybody know why this is the case? (And not just "money", I'm talking technology)


I tested the text-to-speech and it was acceptable, on par with the robotic voice many screen readers use. For me it's not that important that it sounds like a real human; it's more important that it's accurate and that you can hear what it says. Also, pre-processed examples are not really useful, as the author will likely pick examples that turned out well. The API testing page was very useful though, as it was very easy for me to test things myself.


when using the right ssml formatting, the output quality is not so bad. but it cannot compete with commercial engines, that's right.

i guess that's for the same reason that google is dominating the speech recognition world: they have tons of training data available. not smarter algorithms, just more data.



