TTS: Text-to-Speech for All

erogol · on April 14, 2021

Check out Coqui TTS where we continue the work.

https://github.com/coqui-ai/TTS

Mozilla TTS is not maintained anymore (at least ATM).

Disclaimer: I've created both of the projects.

bravura · on April 15, 2021

How does this compare to ESPNET2? (https://colab.research.google.com/github/espnet/notebook/blo...)

Do you support multiple speakers?

Also, do you mind if I email you and ask you a few questions about Coqui TTS?

erogol · on April 15, 2021

We support multi-speaker models and are even working on multilingual models.

Come and join our gitter room.

https://gitter.im/coqui-ai/TTS

sirius87 · on April 15, 2021

I remember giving Mozilla TTS a try, but the Docker image would crash on punctuation and symbols. It seemed to require cleaning up the text and submitting it in small chunks.

Any idea if these issues still exist? Thank you.
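For anyone who still needs the workaround described above (clean the text, then submit it in small chunks), here is a minimal sketch using only the Python standard library. The function names, the set of characters kept, and the 200-character limit are all assumptions made for illustration. It does not call any TTS API, and it is not known whether the crash still occurs.

```python
import re
from typing import List


def clean_text(text: str) -> str:
    """Replace unusual symbols with spaces and collapse whitespace.

    The set of characters kept here is an arbitrary choice for this sketch.
    """
    text = re.sub(r"[^\w\s.,!?;:'-]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def chunk_sentences(text: str, max_chars: int = 200) -> List[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept as its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", clean_text(text))
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    for chunk in chunk_sentences("Hello world! This costs $5 now. Bye.", max_chars=20):
        print(chunk)
```

Each chunk could then be passed to whatever synthesis call you use, and the resulting audio files concatenated in order.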

jadbox · on April 15, 2021

The examples are really impressive! Are multiple voice tones/genders supported?

erogol · on April 15, 2021

We are working on it.

Check our latest work https://edresson.github.io/SC-GlowTTS/

You can also check other released models here

https://github.com/coqui-ai/TTS/releases

kstrauser · on April 15, 2021

That's mind blowingly amazing. It's far and away the best TTS I've ever heard in my life!

We've come a long way from SAM and my Amiga, I tell ya.

Isn0gud · on April 14, 2021

It seems like this is another dead mozilla project now, given that the people who worked on this started a new project: https://github.com/coqui-ai

kdavis · on April 14, 2021

You can see some Coqui[0] TTS examples here[1].

[0] https://coqui.ai/

[1] https://erogol.github.io/ddc-samples/

echelon · on April 14, 2021

I work on TTS (created https://vo.codes) and my impression of the Mozilla project was that it was incredibly understaffed, unrealistically so if it was ever going to lead to any kind of product or platform.

Maybe this new organization can accomplish the goal of easy and open trainable TTS. I'd really like to see it.

erogol · on April 15, 2021

understaffed -> it was only me and still only me but with a great list of contributors :)

karussell · on April 15, 2021

Thanks for doing this! I got it running immediately and I'm impressed and will try it in a project I have in mind plus spread the good word.

echelon · on April 15, 2021

I would really love to see you get funding for developing an open platform for TTS that can offer commercial options that fuel growth. I really want to see your work scale. Easy TTS from custom data needs to be a thing.

Good job with the stuff you've built, and good luck!

whatrocks · on April 15, 2021

I used one of the TTS pre-trained models to turn the Frankenstein e-book from Project Gutenberg into a podcast, and it worked pretty darn well (especially when I compared it with the terminal "say" command). Here's my write-up:

https://www.charlieharrington.com/flow-and-creative-computin...

And the podcast RSS feed:

https://whatrocks.github.io/castellan/podcastjr.xml

It's great when these ML models link to a Google Colab notebook. It makes it super easy and dare-I-say fun to try them out.

disabled · on April 15, 2021

Wow, that neural voice sounds so much better than the TTS that I use with my screenreader for reading books with my print-related disability. Thank you for the writeup! :-)

snakers41 · on April 15, 2021

You should also totally check out silero-models, which are also available in colab with 10 speakers:

- https://github.com/snakers4/silero-models#text-to-speech
- https://habr.com/ru/post/549482/

Disclaimer, this is my independent project

marcodiego · on April 14, 2021

FLOSS TTS and STT are badly needed right now. Being able to use voice recognition and speech synthesis should not be restricted to a small oligopoly.

synesthesiam · on April 14, 2021

Shameless plug for Rhasspy: https://rhasspy.readthedocs.io/en/latest/

Abishek_Muthian · on April 14, 2021

TTS tech is so accessible that product developers should consider integrating it within their products for the sake of the visually impaired, and not leave the product to the mercy of the operating system's accessibility features.

I feel the biggest casualty of online advertising is accessibility. Those with eyes (eyesight) are more valuable to the mega corps than those without, and so the Internet is full of rich graphics, making the lives of those without proper vision miserable.

suyash · on April 14, 2021

Always be 100% compatible with OS provided accessibility guidelines before adding additional support. Most disabled users are familiar with native accessibility tools and that should come first.

Abishek_Muthian · on April 15, 2021

True, I meant native TTS as an addition to current accessibility specs to all apps with content focused on reading.

e.g. News apps should have a play button on the top both for accessibility and convenience. Here's a discussion[1] regarding that on my problem validation platform with participation of someone with accessibility needs.

[1] https://needgap.com/problems/200-human-voice-summary-of-news...

synesthesiam · on April 14, 2021

Larynx TTS has a similar goal: https://rhasspy.github.io/larynx/

It was originally based on Mozilla TTS, but I've since moved to exporting models to ONNX for speed.

adkadskhj · on April 14, 2021

In an example[1], it sounds decent, but I noticed a fuzzy white noise whenever the voice is talking. Is this the algorithm, or compression? If it's the algorithm, why?

[1]: https://soundcloud.com/user-565970875/pocket-article-wavernn...

xcodevn · on April 14, 2021

This is a well-known problem. The noise is due to mu-law compression. The 16-bit audio samples are compressed to 8, 9, or 10 bits before being fed to the neural net, because predicting a categorical distribution over 2^16 values requires too many parameters. The noise was also present in samples from the famous WaveNet from DeepMind (they used 8-bit mu-law).
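To make the mu-law step concrete, here is a minimal sketch of companding and quantization in plain Python. It assumes samples are already scaled to [-1, 1] and uses mu = 2^bits - 1 (so mu = 255 for 8 bits). The function names are made up for this illustration and are not taken from any TTS codebase.

```python
import math


def mu_law_encode(x: float, bits: int = 8) -> int:
    """Map a sample in [-1, 1] to an integer class in [0, 2**bits - 1]."""
    mu = 2 ** bits - 1  # assumption: mu tied to the bit depth
    x = max(-1.0, min(1.0, x))
    # Compand: the logarithmic curve gives more resolution near zero.
    y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    # Quantize the companded value from [-1, 1] to an integer class.
    return int(round((y + 1) / 2 * mu))


def mu_law_decode(q: int, bits: int = 8) -> float:
    """Invert mu_law_encode, up to the quantization error."""
    mu = 2 ** bits - 1
    y = 2 * q / mu - 1
    return math.copysign(((1 + mu) ** abs(y) - 1) / mu, y)


def rms_error(bits: int, n: int = 2001) -> float:
    """RMS round-trip error over n evenly spaced samples in [-1, 1]."""
    total = 0.0
    for i in range(n):
        x = -1.0 + 2.0 * i / (n - 1)
        total += (mu_law_decode(mu_law_encode(x, bits), bits) - x) ** 2
    return math.sqrt(total / n)


if __name__ == "__main__":
    for bits in (8, 9, 10):
        print(bits, rms_error(bits))
```

The round-trip error is the quantization noise: it shrinks as the bit depth grows, but it never reaches zero, and in this sketch it is larger for loud samples than for quiet ones.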

There are two ways to avoid this:
1. Predict the 8 high (coarse) bits and the 8 low (fine) bits separately, as in the original WaveRNN paper.
2. Use a mixture of logistic distributions as the predictive output, as in the recent Lyra vocoder from Google.
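As a rough illustration of the coarse/fine idea, a 16-bit sample can be split into two 8-bit classes, so a model could predict two 256-way distributions instead of one 65536-way distribution. This is only a sketch of the bit split itself, with hypothetical function names. It does not show how a network conditions the fine prediction on the coarse one.

```python
from typing import Tuple


def split_sample(sample: int) -> Tuple[int, int]:
    """Split a signed 16-bit sample into (coarse, fine) classes in [0, 255]."""
    if not -32768 <= sample <= 32767:
        raise ValueError("sample must fit in a signed 16-bit integer")
    unsigned = sample + 32768  # shift to [0, 65535]
    return unsigned >> 8, unsigned & 0xFF  # high byte, low byte


def join_sample(coarse: int, fine: int) -> int:
    """Rebuild the signed 16-bit sample from its two 8-bit classes."""
    return ((coarse << 8) | fine) - 32768


if __name__ == "__main__":
    for s in (-32768, -1, 0, 12345, 32767):
        print(s, split_sample(s), join_sample(*split_sample(s)))
```

The split is lossless, so no quantization noise is introduced at this step.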

Tade0 · on April 14, 2021

How does the number of parameters scale with resolution?

Specifically, how much slower would this be if the audio was, say, 10 bits?

I recall a lab exercise in college where we were supposed to increase the resolution of a quantizer until we reached a decent tone and 10 bits were the point at which we reached satisfying quality.

xcodevn · on April 14, 2021

It is a single matrix multiplication to predict probabilities of all possible outputs. For example, with a hidden state of 1024 dimensions, and 8 bits output, it is 1024x256 parameters. 10 bits will need 1024x1024 params.
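The arithmetic above can be checked with a few lines of Python. This sketch counts only the weights of the final projection matrix and ignores biases and the rest of the network. The helper name is made up for illustration.

```python
def output_layer_weights(hidden_size: int, bits: int) -> int:
    """Weights in one matrix mapping the hidden state to 2**bits classes."""
    return hidden_size * (2 ** bits)


if __name__ == "__main__":
    for bits in (8, 9, 10, 16):
        print(f"{bits:2d} bits: {output_layer_weights(1024, bits):,} weights")
```

With a 1024-dimensional hidden state this gives 262,144 weights for 8 bits, 1,048,576 for 10 bits, and 67,108,864 for 16 bits, so the output layer doubles in size with every added bit.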

kdavis · on April 15, 2021

You should listen to newer examples[1]. It's improved a lot in the intervening time period.

[1] https://erogol.github.io/ddc-samples/

throwawaysea · on April 14, 2021

I actually don't hear the fuzzy white noise, but maybe it's because of my tinnitus. Is it during a certain part of the recording? To my ears this sounds surprisingly high fidelity and natural sounding.

adkadskhj · on April 14, 2021

It's only while the ... "person" talks, which makes it quite noticeable to me because it starts and stops. It is rather faint, so I might not even notice it if it were constant.

erogol · on April 14, 2021

It mainly reflects the quality of the training dataset, the earlier stages of the project, and some experiments.

I suggest you check the latest uploads on SoundCloud.

eddyg · on April 14, 2021

I hear it as well, even when using the speaker on my phone and not headphones (where it seems like it would be even more noticeable).

sandreas · on April 14, 2021

Maybe interesting:

https://colab.research.google.com/drive/1SPl226SwzrfMZltrVag...

https://github.com/keithito/tacotron

https://www.youtube.com/watch?v=ijhZR43TOwc

https://heartbeat.fritz.ai/a-2019-guide-to-speech-synthesis-...

gxqoz · on April 14, 2021

Hrmm the link to an example from Pocket leads me to hope that these are coming to that app. The current TTS for listening to saved articles is decent but certainly not state of the art.

banana_giraffe · on April 14, 2021

It's a cool demo. Though, to my ears it's still a bit far from my dream of having something cheap or free I can feed more niche books I like and use to create an audiobook version of them.

https://vocaroo.com/1oOjiLNCagur

Causality1 · on April 14, 2021

That's still my personal TTS dream as well. Google's Read Aloud voice blows everything else out of the water but I've found by experimentation that it will only read the first three and a half hours of web page text.

kdavis · on April 15, 2021

You should check out newer examples[1]. The quality has improved significantly.

[1] https://erogol.github.io/ddc-samples/

ancarda · on April 14, 2021

Will there be a wide choice of accents? The link in the README <https://erogol.github.io/ddc-samples/> seemed to only have a single voice

erogol · on April 14, 2021

Yep. The aim is to solve TTS for all languages one at a time.

You can check out the released models page for the other models and languages.

https://github.com/coqui-ai/TTS/releases

ftyers · on April 14, 2021

When are you going to do Chuvash? ;)

synesthesiam · on April 16, 2021

If you're willing to record a public domain dataset, I'll help train a voice :)

monkeydust · on April 14, 2021

One of my products involves providing a lot of dense data to traders, overlaid with performance measures based on proprietary models.

We are working on automatically extracting some insights for the user and using NLP to present them like news articles.

It wouldn't take a huge lift from that to use TTS to provide another way for users to digest the data.

Would make for a cool demo but wonder how sticky it would be.

cromwellian · on April 14, 2021

NVidia pimped this at GTC21 as "state of the art TTS", which is why I think it's getting renewed attention, but to my ears, it doesn't sound anywhere near WaveNet (Google), Siri, or Alexa.

kdavis · on April 15, 2021

Newer examples[0] are, to my ears, comparable to or better than WaveNet (Google), Siri, or Alexa.

Also, Mean Opinion Score (MOS)[1] tests[2] of the TTS engine have it coming out ahead of other commercial engines[3].

[0] https://erogol.github.io/ddc-samples/

[1] https://en.wikipedia.org/wiki/Mean_opinion_score

[2] https://dl.acm.org/doi/abs/10.1145/3313831.3376789

[3] https://github.com/coqui-ai/TTS#-tts-performance

swiley · on April 14, 2021

I'm personally very suspicious of any software coming from NVidia at this point.

mileycyrusXOXO · on April 14, 2021

The example is very impressive! Sounds very natural.

Raed667 · on April 14, 2021

Is there a way to get this working in Firefox?

uniqueid · on April 14, 2021

What's the plan with this? Is it to incorporate it into Firefox to improve its Web Speech API implementation?

hjek · on April 14, 2021

I hope so. The examples sound so much better than Espeak.

Edit: Oh, I see this project uses Espeak. Interesting.

ftyers · on April 15, 2021

I believe Espeak is only used for the grapheme to phoneme conversion (if at all). The rest is all new.

mnemotronic · on April 14, 2021

The acronym would be more fun if the product was Text-InTo-Speech. Yea.... -1