Hacker News
TTS: Text-to-Speech for All (github.com/mozilla)
282 points by doener on April 14, 2021 | 51 comments



Check out Coqui TTS where we continue the work.

https://github.com/coqui-ai/TTS

Mozilla TTS is not maintained anymore (at least ATM).

Disclaimer: I've created both of the projects.


How does this compare to ESPNET2? (https://colab.research.google.com/github/espnet/notebook/blo...)

Do you support multiple speakers?

Also, do you mind if I email you and ask you a few questions about Coqui TTS?


We support multi-speaker models and are even working on multilingual models.

Come and join our gitter room.

https://gitter.im/coqui-ai/TTS


I remember giving Mozilla TTS a try, but the Docker image would crash on punctuation and symbols. It seemed to require cleaning up the text and submitting it in small chunks.

Any idea if these issues still exist? Thank you.


The examples are really impressive! Are multiple voice tones/genders supported?


We are working on it.

Check our latest work https://edresson.github.io/SC-GlowTTS/

You can also check other released models here

https://github.com/coqui-ai/TTS/releases


That's mind blowingly amazing. It's far and away the best TTS I've ever heard in my life!

We've come a long way from SAM and my Amiga, I tell ya.


It seems like this is another dead Mozilla project now, given that the people who worked on it started a new project: https://github.com/coqui-ai


You can see some Coqui[0] TTS examples here[1].

[0] https://coqui.ai/

[1] https://erogol.github.io/ddc-samples/


I work on TTS (created https://vo.codes) and my impression of the Mozilla project was that it was incredibly understaffed. Unrealistically so to ever lead to any kind of product or platform.

Maybe this new organization can accomplish the goal of easy and open trainable TTS. I'd really like to see it.


understaffed -> it was only me, and it's still only me, but with a great list of contributors :)


Thanks for doing this! I got it running immediately and I'm impressed and will try it in a project I have in mind plus spread the good word.


I would really love to see you get funding for developing an open platform for TTS that can offer commercial options that fuel growth. I really want to see your work scale. Easy TTS from custom data needs to be a thing.

Good job with the stuff you've built, and good luck!


I used one of the TTS pre-trained models to turn the Frankenstein e-book from Project Gutenberg into a podcast, and it worked pretty darn well (especially when I compared it with the terminal "say" command). Here's my write-up:

https://www.charlieharrington.com/flow-and-creative-computin...

And the podcast RSS feed:

https://whatrocks.github.io/castellan/podcastjr.xml

It's great when these ML models link to a Google Colab notebook. It makes it super easy and dare-I-say fun to try them out.


Wow, that neural voice sounds so much better than the TTS that I use with my screenreader for reading books with my print-related disability. Thank you for the writeup! :-)


You should also totally check out silero-models, which are also available in colab with 10 speakers:

- https://github.com/snakers4/silero-models#text-to-speech

- https://habr.com/ru/post/549482/

Disclaimer, this is my independent project


FLOSS TTS and STT are badly needed right now. Being able to use voice recognition and speech synthesis should not be restricted to a small oligopoly.


Shameless plug for Rhasspy: https://rhasspy.readthedocs.io/en/latest/


TTS tech is so accessible that product developers should consider integrating it into their products for the sake of the visually impaired, rather than leaving the product at the mercy of the operating system's accessibility features.

I feel the biggest casualty of online advertising is accessibility. Those with eyesight are more valuable to the mega corps than those without, and so the Internet is full of rich graphics, making life miserable for those without proper vision.


Always be 100% compatible with OS provided accessibility guidelines before adding additional support. Most disabled users are familiar with native accessibility tools and that should come first.


True, I meant native TTS as an addition to the current accessibility specs for all apps whose content is focused on reading.

E.g., news apps should have a play button at the top, both for accessibility and convenience. Here's a discussion[1] about that on my problem validation platform, with participation from someone with accessibility needs.

[1] https://needgap.com/problems/200-human-voice-summary-of-news...


Larynx TTS has a similar goal: https://rhasspy.github.io/larynx/

It was originally based on Mozilla TTS, but I've since moved to exporting models to ONNX for speed.


In an example[1], it sounds decent, but I noticed a fuzzy white noise whenever the voice is talking. Is this the algorithm, or compression? If it's the algorithm, why?

[1]: https://soundcloud.com/user-565970875/pocket-article-wavernn...


This is a well-known problem. The noise is due to mu-law compression: the 16-bit audio samples are compressed to 8, 9, or 10 bits before being fed to the neural net. The reason is that predicting a categorical distribution over 2^16 values requires too many parameters. The noise was also present in samples from the famous WaveNet from DeepMind (they used 8-bit mu-law).
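To make the compression step concrete, here is a minimal sketch of mu-law companding in plain Python. The function names and the [-1, 1] sample range are assumptions for illustration; this is not code taken from Mozilla TTS or WaveNet.

```python
import math


def mulaw_encode(x: float, bits: int = 8) -> int:
    """Map a sample in [-1, 1] to an integer class in [0, 2**bits - 1]."""
    mu = 2 ** bits - 1
    x = max(-1.0, min(1.0, x))  # clip to the valid range
    # Logarithmic companding: fine resolution near zero, coarse near full scale.
    y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    # Shift [-1, 1] to [0, mu] and round to the nearest class index.
    return int(round((y + 1) / 2 * mu))


def mulaw_decode(q: int, bits: int = 8) -> float:
    """Invert mulaw_encode, up to the quantization error."""
    mu = 2 ** bits - 1
    y = 2 * q / mu - 1
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


if __name__ == "__main__":
    for x in (0.01, 0.5, 0.97):
        q = mulaw_encode(x)
        print(x, q, mulaw_decode(q))
```

With 8 bits, the round-trip error near full scale is roughly 2% of the range, which is consistent with the quantization noise described above; raising `bits` shrinks that error at the cost of a larger output layer.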

There are two ways to avoid this:

1. Predict the 8 high (coarse) bits and the 8 low (fine) bits separately, as in the original WaveRNN paper.

2. Use a mixture of logistic distributions as the predictive output, as in the recent Lyra vocoder from Google.
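The first option can be sketched as simple bit manipulation. This is only an illustration of the coarse/fine split, assuming signed 16-bit samples; the function names are made up here, and the sketch says nothing about how the network itself conditions the fine bits on the coarse ones.

```python
from typing import Tuple


def split_coarse_fine(sample: int) -> Tuple[int, int]:
    """Split a signed 16-bit sample into (coarse, fine) 8-bit halves."""
    unsigned = sample + 32768          # shift [-32768, 32767] to [0, 65535]
    return unsigned >> 8, unsigned & 0xFF


def join_coarse_fine(coarse: int, fine: int) -> int:
    """Rebuild the signed 16-bit sample from its two 8-bit halves."""
    return ((coarse << 8) | fine) - 32768


if __name__ == "__main__":
    coarse, fine = split_coarse_fine(12345)
    print(coarse, fine, join_coarse_fine(coarse, fine))
```

Each half takes one of only 256 values, so the model can predict two 256-way distributions instead of a single 65536-way one, and no information is lost in the split.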


How does the number of parameters scale with resolution?

Specifically, how much slower would this be if the audio was, say, 10 bits?

I recall a lab exercise in college where we were supposed to increase the resolution of a quantizer until we reached a decent tone and 10 bits were the point at which we reached satisfying quality.


It is a single matrix multiplication to predict probabilities of all possible outputs. For example, with a hidden state of 1024 dimensions, and 8 bits output, it is 1024x256 parameters. 10 bits will need 1024x1024 params.
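The arithmetic above is easy to check. The helper below is a hypothetical illustration that counts only the weights of the final projection (no bias terms, no other layers), using the 1024-dimensional hidden state from the example.

```python
def output_layer_params(hidden_dim: int, bits: int) -> int:
    """Weights in a hidden_dim x 2**bits projection to class logits."""
    return hidden_dim * 2 ** bits


if __name__ == "__main__":
    for bits in (8, 10, 16):
        print(bits, output_layer_params(1024, bits))
    # 8  -> 262144
    # 10 -> 1048576
    # 16 -> 67108864
```

So the output layer grows exponentially with bit depth: 10 bits needs 4x the weights of 8 bits, and a direct 16-bit softmax would need 256x.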


You should listen to newer examples[1]. It's improved a lot in the intervening time period.

[1] https://erogol.github.io/ddc-samples/


I actually don't hear the fuzzy white noise, but maybe it's because of my tinnitus. Is it during a certain part of the recording? To my ears this sounds surprisingly high fidelity and natural sounding.


It's only while the .. "person" talks. That makes it quite noticeable to me, because it starts and stops. It is rather faint, so I might not even notice it if it was consistent.


It mainly reflects the quality of the training dataset, the earlier stages of the project, and some experiments.

I suggest you check the latest uploads on SoundCloud.


I hear it as well, even when using the speaker on my phone and not headphones (where it seems like it would be even more noticeable).




Hrmm the link to an example from Pocket leads me to hope that these are coming to that app. The current TTS for listening to saved articles is decent but certainly not state of the art.


It's a cool demo. Though, to my ears, it's still a bit far from my dream of having something cheap or free that I can feed the more niche books I like and use to create audiobook versions of them.

https://vocaroo.com/1oOjiLNCagur


That's still my personal TTS dream as well. Google's Read Aloud voice blows everything else out of the water but I've found by experimentation that it will only read the first three and a half hours of web page text.


You should check out newer examples[1]. The quality has improved significantly.

[1] https://erogol.github.io/ddc-samples/


Will there be a wide choice of accents? The link in the README <https://erogol.github.io/ddc-samples/> seemed to only have a single voice


Yep. The aim is to solve TTS for all languages one at a time.

You can check out the released models page for the other models and languages.

https://github.com/coqui-ai/TTS/releases


When are you going to do Chuvash ? ;)


If you're willing to record a public domain dataset, I'll help train a voice :)


One of my products involves providing a lot of dense data to traders, overlaid with performance measures based on proprietary models.

We are working on automatically extracting some insights for the user and using NLP to present them like news articles.

It wouldn't take a huge lift from that to use TTS to provide another way for users to digest the data.

Would make for a cool demo but wonder how sticky it would be.


NVidia pimped this at GTC21 as "state of the art TTS", which is why I think it's getting renewed attention, but to my ears, it doesn't sound anywhere near WaveNet (Google), Siri, or Alexa.


Newer examples[0] are, to my ears, comparable to or better than WaveNet (Google), Siri, or Alexa.

Also, Mean Opinion Score (MOS)[1] tests[2] of the TTS engine have it coming out ahead of other commercial engines[3].

[0] https://erogol.github.io/ddc-samples/

[1] https://en.wikipedia.org/wiki/Mean_opinion_score

[2] https://dl.acm.org/doi/abs/10.1145/3313831.3376789

[3] https://github.com/coqui-ai/TTS#-tts-performance


I'm personally very suspicious of any software coming from NVidia at this point.


The example is very impressive! Sounds very natural.


Is there a way to get this working in Firefox?


What's the plan with this? Is it to incorporate it into Firefox to improve its Web Speech API implementation?


I hope so. The examples sound so much better than Espeak.

Edit: Oh, I see this project uses Espeak. Interesting.


I believe Espeak is only used for the grapheme to phoneme conversion (if at all). The rest is all new.


The acronym would be more fun if the product was Text-InTo-Speech. Yea.... -1



