I remember giving Mozilla TTS a try, but the Docker image would crash on punctuation and symbols. It seemed to require cleaning up the text and submitting it in small chunks.
I work on TTS (created https://vo.codes) and my impression of the Mozilla project was that it was incredibly understaffed, too much so to realistically lead to any kind of product or platform.
Maybe this new organization can accomplish the goal of easy and open trainable TTS. I'd really like to see it.
I would really love to see you get funding for developing an open platform for TTS that can offer commercial options that fuel growth. I really want to see your work scale. Easy TTS from custom data needs to be a thing.
Good job with the stuff you've built, and good luck!
I used one of the TTS pre-trained models to turn the Frankenstein e-book from Project Gutenberg into a podcast, and it worked pretty darn well (especially when I compared it with the terminal "say" command). Here's my write-up:
Wow, that neural voice sounds so much better than the TTS that I use with my screenreader for reading books with my print-related disability. Thank you for the writeup! :-)
TTS tech is so accessible that product developers should consider integrating it into their products for the sake of the visually impaired, rather than leaving the product at the mercy of the operating system's accessibility features.
I feel the biggest casualty of online advertising is accessibility. Those with eyesight are more valuable to the mega corps than those without, so the Internet is full of rich graphics, making the lives of those without proper vision miserable.
Always be 100% compatible with OS-provided accessibility guidelines before adding additional support. Most disabled users are familiar with native accessibility tools, and those should come first.
True, I meant native TTS as an addition to current accessibility specs, for all apps whose content is focused on reading.
E.g., news apps should have a play button at the top, both for accessibility and convenience. Here's a discussion[1] regarding that on my problem validation platform, with participation from someone with accessibility needs.
In an example[1], it sounds decent, but I noticed a fuzzy white noise whenever the voice is talking. Is this the algorithm, or compression? If it's the algorithm, why?
This is a well-known problem. The noise is due to mu-law compression. The 16-bit audio samples are compressed to 8, 9, or 10 bits before being fed to the neural net. The reason is that predicting a categorical distribution over 2^16 values requires too many parameters. The noise was also present in samples from the famous WaveNet from DeepMind (they used 8-bit mu-law).
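To make the source of that hiss concrete, here is a minimal sketch of mu-law companding in plain Python. It is not code from any of the projects mentioned; the function names, the choice of mu = 2^bits - 1, and the test ramp in `max_roundtrip_error` are all assumptions for illustration. It quantizes a sample in [-1, 1] to a class index and back, so you can see how much error each bit depth leaves behind.

```python
import math

def mu_law_encode(x, bits=8):
    """Map a sample in [-1, 1] to an integer class in [0, 2**bits - 1]."""
    mu = 2 ** bits - 1                      # assumed convention: mu = levels - 1
    x = max(-1.0, min(1.0, x))              # clamp out-of-range input
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int(round((compressed + 1) / 2 * mu))

def mu_law_decode(index, bits=8):
    """Map a class index back to an approximate sample in [-1, 1]."""
    mu = 2 ** bits - 1
    compressed = 2 * index / mu - 1
    return math.copysign(math.expm1(abs(compressed) * math.log1p(mu)) / mu, compressed)

def max_roundtrip_error(bits):
    """Worst-case encode/decode error over a ramp of 2001 test samples."""
    samples = [i / 1000 for i in range(-1000, 1001)]
    return max(abs(s - mu_law_decode(mu_law_encode(s, bits), bits)) for s in samples)

if __name__ == "__main__":
    for bits in (8, 9, 10):
        print(bits, "bits: max round-trip error", max_roundtrip_error(bits))
```

The round-trip error is the quantization noise. It shrinks as the bit depth grows, which is why 9 or 10 bits sound cleaner than 8, at the cost of a larger output layer.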
There are two ways to avoid this:
1. Predict the 8 high (coarse) bits and the 8 low (fine) bits separately, as in the original WaveRNN paper.
2. Use a mixture of logistic distributions as the predictive output, as in the recent Lyra vocoder from Google.
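For the first option, the bit split itself is simple. The sketch below only shows how a signed 16-bit sample can be broken into a coarse and a fine 8-bit class and put back together without loss; the function names are made up for illustration, and the actual network that predicts the two halves is not shown.

```python
def split_coarse_fine(sample):
    """Split a signed 16-bit sample into (coarse, fine) 8-bit classes."""
    if not -32768 <= sample <= 32767:
        raise ValueError("sample must fit in signed 16 bits")
    unsigned = sample + 32768          # shift [-32768, 32767] to [0, 65535]
    coarse = unsigned >> 8             # high 8 bits
    fine = unsigned & 0xFF             # low 8 bits
    return coarse, fine

def join_coarse_fine(coarse, fine):
    """Rebuild the signed 16-bit sample from its two 8-bit classes."""
    return ((coarse << 8) | fine) - 32768

if __name__ == "__main__":
    for s in (-32768, -1, 0, 1, 12345, 32767):
        c, f = split_coarse_fine(s)
        print(s, "->", (c, f), "->", join_coarse_fine(c, f))
```

Each half is a 256-way classification, so the model keeps full 16-bit resolution without ever predicting over 2^16 classes at once.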
How does the number of parameters scale with resolution?
Specifically, how much slower would this be if the audio was, say, 10 bits?
I recall a lab exercise in college where we were supposed to increase the resolution of a quantizer until we reached a decent tone, and 10 bits was the point at which we reached satisfying quality.
It is a single matrix multiplication to predict the probabilities of all possible outputs. For example, with a hidden state of 1024 dimensions and 8-bit output, it is 1024x256 parameters. 10 bits will need 1024x1024 parameters, so the output layer doubles in size with every extra bit.
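The arithmetic above can be tabulated with a few lines of Python. This counts only the weights of the final projection (hidden size times number of classes) and ignores biases and the rest of the network; the helper name is hypothetical. The dual 8-bit figure is a simplification that treats the coarse and fine outputs as two independent 256-way layers.

```python
def output_layer_params(hidden_size, bits):
    """Weights in a hidden_size x 2**bits projection (biases ignored)."""
    return hidden_size * 2 ** bits

if __name__ == "__main__":
    hidden = 1024
    for bits in (8, 9, 10, 16):
        print(f"{bits:2d} bits: {output_layer_params(hidden, bits):,} weights")
    # Simplified coarse/fine alternative: two separate 8-bit output layers.
    print(f"2 x 8 bits: {2 * output_layer_params(hidden, 8):,} weights")
```

With a 1024-dimensional hidden state this gives 262,144 weights at 8 bits, 1,048,576 at 10 bits, and 67,108,864 at 16 bits, versus 524,288 for two 8-bit layers.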
I actually don't hear the fuzzy white noise, but maybe it's because of my tinnitus. Is it during a certain part of the recording? To my ears this sounds surprisingly high fidelity and natural sounding.
It's only while the... "person" talks, which makes it quite noticeable to me because it starts and stops. It is rather faint, so I might not even notice it if it were constant.
Hrmm the link to an example from Pocket leads me to hope that these are coming to that app. The current TTS for listening to saved articles is decent but certainly not state of the art.
It's a cool demo. Though, to my ears it's still a bit far from my dream of having something cheap or free I can feed more niche books I like and use to create an audiobook version of them.
That's still my personal TTS dream as well. Google's Read Aloud voice blows everything else out of the water but I've found by experimentation that it will only read the first three and a half hours of web page text.
NVIDIA pimped this at GTC21 as "state of the art TTS", which is why I think it's getting renewed attention, but to my ears it doesn't sound anywhere near WaveNet (Google), Siri, or Alexa.
https://github.com/coqui-ai/TTS
Mozilla TTS is not maintained anymore (at least ATM).
Disclaimer: I've created both of the projects.