ESpeak-ng: speech synthesizer with more than one hundred languages and accents (github.com/espeak-ng)
256 points by nateb2022 15 days ago | 107 comments



Classic speech synthesis is interesting in that relatively simple approaches produce useful results. Formant synthesis takes relatively simple sounds and modifies them according to the various distinctions the human vocal tract can make. The basic vowel quality can be modelled as two sine waves that change over time. (Nothing more complex than what's needed to generate touch-tone dialing tones, basically.) Add a few types of buzzing or clicking noises before or after that for consonants, and you're halfway there. The technique predates computers; it's basically the same technique used by the original Voder [1], just under computer control.
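
As a rough illustration in Python (the formant targets are rough textbook values and the pitch-rate gating is deliberately crude), gliding just two sine waves between vowel targets already sounds recognizably vowel-like:

    import numpy as np, wave

    SR, DUR, F0 = 16000, 0.8, 120          # sample rate, duration (s), rough pitch (Hz)
    t = np.arange(int(SR * DUR)) / SR

    # Glide the first two formants from roughly /a/ (700, 1200 Hz) to /i/ (300, 2300 Hz).
    f1 = np.linspace(700, 300, t.size)
    f2 = np.linspace(1200, 2300, t.size)

    # "Two sine waves that change over time": integrate the instantaneous frequencies
    # so the glide is smooth, then gate the result on and off at the pitch rate.
    phase1 = 2 * np.pi * np.cumsum(f1) / SR
    phase2 = 2 * np.pi * np.cumsum(f2) / SR
    buzz = 0.5 * (1 + np.sign(np.sin(2 * np.pi * F0 * t)))   # crude voicing gate
    audio = (0.6 * np.sin(phase1) + 0.4 * np.sin(phase2)) * buzz

    pcm = (audio / np.abs(audio).max() * 32767).astype(np.int16)
    with wave.open("vowel_glide.wav", "wb") as w:
        w.setnchannels(1); w.setsampwidth(2); w.setframerate(SR)
        w.writeframes(pcm.tobytes())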

Join that with algorithms which can translate English into phonetic tokens with relatively high accuracy, and you have speech synthesis. Make the dictionary big enough, add enough finesse and a few hundred rules about transitioning from phoneme to phoneme, and it produces relatively understandable speech.
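
A toy version of the dictionary-plus-rules step might look like this (lexicon entries, rules, and phoneme notation are all invented for illustration; real rule sets run to thousands of entries per language):

    # Dictionary-first, rules-as-fallback text-to-phoneme lookup (all data invented).
    LEXICON = {
        "hello": "h @ l OU",
        "world": "w 3: l d",
    }

    # Ordered letter-to-sound rules: longest match wins.
    RULES = [
        ("sh", "S"), ("th", "T"), ("oo", "u:"), ("ee", "i:"),
        ("a", "{"), ("e", "E"), ("i", "I"), ("o", "Q"), ("u", "V"),
        ("b", "b"), ("d", "d"), ("h", "h"), ("l", "l"), ("m", "m"),
        ("n", "n"), ("p", "p"), ("r", "r"), ("s", "s"), ("t", "t"), ("w", "w"),
    ]

    def to_phonemes(word: str) -> str:
        word = word.lower()
        if word in LEXICON:              # irregular words come straight from the dictionary
            return LEXICON[word]
        out, i = [], 0
        while i < len(word):             # otherwise apply the rules left to right
            for graph, phon in RULES:
                if word.startswith(graph, i):
                    out.append(phon)
                    i += len(graph)
                    break
            else:
                i += 1                   # no rule covers this letter; skip it
        return " ".join(out)

    print(to_phonemes("hello"), "|", to_phonemes("sheet"))   # h @ l OU | S i: t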

Part of me feels that we are losing something, moving away from these classic approaches to AI. It used to be that, to teach a machine how to speak or translate, the designer of the system had to understand how language worked. Sometimes these models percolated back into broader thinking about language. Formant synthesis ended up being an inspiration for some ideas about how the brain recognizes phonemes. (Or maybe that worked in both directions.) It was thought that further advances would come from better theories about language, better abstractions. Deep learning has produced far better systems than the classic approach, but they offer little in terms of understanding or simplification.

[1] https://en.wikipedia.org/wiki/Voder


I feel the same way. Relatively simple formant models get "close enough" that it feels like you should be able to do well with very little.

One of the things I've long wanted to do, but haven't found time for, is to take a few different variants of formant synths and try to train a simple TTS model to control one instead of producing "raw" output. It's amazing what TTS models can do with raw output, but we know our brains aren't producing raw, unconstrained digital audio, so I think there's a lot of potential for understanding more and simplifying if you train models constrained to produce outputs we know ought to be sufficient, and push their size as far down as we can.


Too late to edit, but for anyone who needs "convincing" of the flexibility of a formant synthesizer: 1) play with Pink Trombone [1], a JavaScript formant synthesizer with a UI that lets you graphically manipulate a vocal tract, and 2) have a look at this programmable version of it [2].

[1] https://dood.al/pinktrombone/

[2] https://github.com/zakaton/Pink-Trombone


Thanks for those links, that's superb. Sounds surprisingly like the "Oh long Johnson" cat - https://www.youtube.com/watch?v=kkwiQmGWK4c


That reminds me of Google's blob opera: https://artsandculture.google.com/experiment/AAHWrq360NcGbw

What a fun toy! But I mostly get the sounds of a drunk man trying to say something and failing.


Check out the videos in the second link - it gives some better examples of what you can do with it.


Funny that the Pink Trombone website does not work with screen readers.


It's entirely visual: a graphic of a vocal tract where you can move the tongue and shape the mouth. In that form, trying to make it do anything with screen readers would mean basically doing what the second link does: creating an API for it and hooking the backend up to an entirely different UI.

The videos on the other link do show that the guy who made that API has done work on the UI to allow interfacing with it in non-graphical ways, but sadly I don't see any online demos of those alternative user interfaces anywhere, which is a great shame. It doesn't look like the videos are very accessible either, as they're mostly demos with no commentary about what is going on on the screen, without which they're mostly just random sounds.


Yeah, Pink Trombone is awesome[0]. :D

Thanks for the link to the programmable version--I don't think I'd been aware of that previously...

[0] And, from personal experience, also rather difficult to "safely" search for if you don't quite remember its name exactly. :D


It is possible to synthesize an English voice with a 1.5MB model using http://cmuflite.org/ or some of the Apple VoiceOver voices, which is just crazy to me. Most of the model is diphone samples for pairs of phonemes.

I don't know of any way to go smaller than that with software. I tried, but it seems like a fundamental limit for English.


> I don't know of any way to go smaller than that with software. I tried, but it seems like a fundamental limit for English.

If you include "robotic" speech, then there's https://en.wikipedia.org/wiki/Software_Automatic_Mouth in a few tens of KB, and the demoscene has done similar in around 1/10th that. All formant synths, of course, not the sample-based ones that you're referring to.


SAM is awful, but at the same time so tantalisingly close (one of the demos of it apparently draws on a great reverse-engineered version by Sebastian Macke and refactoring efforts by a couple of others, including me; I spent too many hours listening to SAM output...), especially when compared to the still-awful Festival/Flite models, that I keep wanting to see what a better generic formant synth used as a constraint on an ML model would produce.

That is, instead of allowing a generic machine learning model to output unconstrained audio, train it to produce low-bitrate control values for a formant synth, and see just how small you can push the model.
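
As a very rough sketch of that shape (in PyTorch; the per-frame control layout of pitch, two formant frequencies and a voicing amount is invented for illustration, as are all the sizes):

    import torch
    import torch.nn as nn

    # Hypothetical control-frame layout (one row per 10 ms frame), invented for illustration:
    #   [f0_hz, formant1_hz, formant2_hz, voiced_amount]
    N_CONTROLS = 4

    class PhonemeToControls(nn.Module):
        """Small model mapping phoneme IDs to per-frame synth controls, not samples."""
        def __init__(self, n_phonemes=64, hidden=128, frames_per_phoneme=8):
            super().__init__()
            self.frames = frames_per_phoneme
            self.embed = nn.Embedding(n_phonemes, hidden)
            self.rnn = nn.GRU(hidden, hidden, batch_first=True)
            self.head = nn.Linear(hidden, frames_per_phoneme * N_CONTROLS)

        def forward(self, phoneme_ids):                  # (batch, seq)
            h, _ = self.rnn(self.embed(phoneme_ids))     # (batch, seq, hidden)
            ctrl = self.head(h)                          # (batch, seq, frames * N_CONTROLS)
            return ctrl.view(*phoneme_ids.shape, self.frames, N_CONTROLS)

    model = PhonemeToControls()
    controls = model(torch.randint(0, 64, (1, 12)))      # 12 phonemes -> 96 control frames
    print(controls.shape)                                 # torch.Size([1, 12, 8, 4])
    # Training would render audio from `controls` with a (differentiable) formant synth
    # and compare that against recorded speech, instead of predicting samples directly.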


As a blind dude who listens to a synth all the time, I don't care about the size of the thing, I just want good sound quality.


That's fair, but those two are in constant tension. We can do very good speech if we ship gigs of samples. Better quality smaller models make it easier to get better quality speech in more places.


You probably could go smaller with what I call the "eastern-European method": record one wave period of each phoneme (perhaps two for plosives), downsample to 8 or 11 kHz at 8 bits, and repeat that recording on the fly enough times to make the right sound. If you're thinking "mod file", you're on the right track.

For phonetically simple languages, such a system can easily fit on a microcontroller with kilobytes of RAM and a slow CPU. English might require a little more at the text-to-phoneme stage, but you can definitely go far below 1 MB.
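
A sketch of the looping idea, with made-up single-period waveforms standing in for the stored recordings:

    import numpy as np, wave

    SR = 8000

    def loop_period(period, dur, pitch_hz):
        """Repeat one stored waveform period at the requested pitch, mod-file style."""
        n = int(SR / pitch_hz)
        # resample the stored period to the target pitch by linear interpolation
        one = np.interp(np.linspace(0, len(period) - 1, n), np.arange(len(period)), period)
        reps = int(dur * SR / n) + 1
        return np.tile(one, reps)[: int(dur * SR)]

    # Invented stand-ins for stored single-period recordings of two vowels.
    t = np.linspace(0, 1, 64, endpoint=False)
    period_a = np.sin(2 * np.pi * t) + 0.3 * np.sin(6 * np.pi * t)   # brighter timbre
    period_o = np.sin(2 * np.pi * t)                                  # duller timbre

    audio = np.concatenate([loop_period(period_a, 0.3, 110),
                            loop_period(period_o, 0.3, 110)])
    pcm = (audio / np.abs(audio).max() * 127 + 128).astype(np.uint8)  # 8-bit, as described

    with wave.open("loops.wav", "wb") as w:
        w.setnchannels(1); w.setsampwidth(1); w.setframerate(SR)
        w.writeframes(pcm.tobytes())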


That's effectively what eSpeak-ng is doing.

For the CMU Flite voices, the data is represented as LPC (linear predictive coding) coefficients with a residual remainder (residual-excited LPC). The HTS models use simple neural networks to predict the waveforms; IIRC, these are similar to RNNs.

The MBROLA models use OLA (overlap-add) to join small waveform samples. They also use diphone samples taken from phoneme midpoint to midpoint in order to create better phoneme transitions.
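
A minimal illustration of the overlap-add part, with synthetic snippets standing in for real diphone recordings:

    import numpy as np

    SR = 16000

    def overlap_add(snippets, overlap=0.010):
        """Join snippets, cross-fading each seam over `overlap` seconds (the basic OLA move)."""
        n = int(overlap * SR)
        fade_out = 0.5 * (1 + np.cos(np.linspace(0, np.pi, n)))   # 1 -> 0
        fade_in = 1.0 - fade_out                                  # 0 -> 1
        out = snippets[0].copy()
        for snip in snippets[1:]:
            out[-n:] = out[-n:] * fade_out + snip[:n] * fade_in
            out = np.concatenate([out, snip[n:]])
        return out

    # Synthetic stand-ins for diphone recordings; real diphones are cut mid-phoneme to
    # mid-phoneme so the joins land in the stable part of each sound.
    t = np.arange(int(0.1 * SR)) / SR
    audio = overlap_add([np.sin(2 * np.pi * 220 * t), np.sin(2 * np.pi * 330 * t)])
    print(audio.shape)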


Most of the links there are dead.

That's a descendant of Festival Singer, which was well respected in its day.

What's a current practical text-to-speech system that's open source, local, and not huge?


Depending on your definition of "huge", you might find Piper TTS fits your requirements: https://github.com/rhasspy/piper

The size of the associated voice files varies but there are options that are under 100MB: https://huggingface.co/rhasspy/piper-voices/tree/main/en


Flite.


That's another spinoff of Festival Singer. What's good from the LLM era?


The speak & spell did it with a 32kB [edit, previously incorrectly wrote 16kB] ROM and a TMS0280.


I always wanted something like Flite but for Spanish.


Now try it with a trumpet! Herb Alpert's "The Trolley Song" is an underrated masterpiece of control over the sound a trumpet makes. No synthesized trumpet sound has ever done anything like this.

https://www.youtube.com/watch?v=mqr9E9Q-P5o


Haven't heard that one, it's great, thank you.


> Part of me feels that we are losing something, moving away from these classic approaches to AI

Absolutely. It seems a large number of software developers have moved on from trying to understand how things work in order to solve the problem; now they're essentially just throwing shit at a magical wall until something sticks long enough.


> It used to be that, to teach a machine how to [X], the designer of the system had to understand how [X] worked.

It does feel like we're rapidly losing this relationship in general. I think it's going to be a good thing overall for productivity and the advancement of mankind, but it definitely takes a lot of the humanity out of our collective accomplishments. I feel warm and fuzzy when a person goes on a quest to deeply understand a subject and then shares the fruits of their efforts with everyone else, but I don't feel like that when someone points at a subject and says "hey computer, become good at that" with similar end results.


> I think it's going to be a good thing overall for productivity and the advancement of mankind, but it definitely takes a lot of the humanity out of our collective accomplishments.

I think AI will only cause us to become stuck in another local maximum, since not understanding how something works can only lead to imitation at best, and not inspiration.


I'm not convinced, because I think there will be a drive to distill down models and constrain them, and try to train models with access to "premade blocks" of functionality we know should help.

E.g. we know human voices can be produced well with formant synthesis because we know how the human vocal tract is shaped. So you can "give" a model a formant synth, and try to train smaller models outputting to it.

I think there's going to be a whole lot of research possibilities in placing constraints and training smaller models, and even training ensembles of models constrained in how they're interacting and their relative sizes to try to "force" extraction of functionality.

E.g. we have reasonable estimates of the lowest bitrate of raw audio that still produces passable voice. Now consider training two models A and B, where A => B => audio, the "channel" between A and B is constrained to a small fraction of the bitrate that would let A do all the work, and the size of B is set at a level where you've previously struggled to get passable TTS output.

Try to squeeze the bitrate and/or the size of B down and see if you can get something to emerge where analysing what happens in B is doable.
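
Concretely, something like this (all the sizes and the quantization scheme are invented, sketched with PyTorch; the only point is the deliberately narrow, quantized channel between A and B):

    import torch
    import torch.nn as nn

    class ModelA(nn.Module):
        """Text side: squeezes its output through a deliberately narrow, quantized channel."""
        def __init__(self, n_tokens=64, channel_dim=8, levels=16):
            super().__init__()
            self.levels = levels
            self.embed = nn.Embedding(n_tokens, 64)
            self.rnn = nn.GRU(64, 64, batch_first=True)
            self.to_channel = nn.Linear(64, channel_dim)

        def forward(self, tokens):                        # (batch, seq)
            h, _ = self.rnn(self.embed(tokens))
            code = torch.sigmoid(self.to_channel(h))      # (batch, seq, channel_dim)
            q = torch.round(code * (self.levels - 1)) / (self.levels - 1)
            return code + (q - code).detach()             # straight-through quantization

    class ModelB(nn.Module):
        """Audio side: must reconstruct waveform frames from the narrow code alone."""
        def __init__(self, channel_dim=8, frame=160):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(channel_dim, 64), nn.Tanh(),
                                     nn.Linear(64, frame))

        def forward(self, code):
            return self.net(code)

    a, b = ModelA(), ModelB()
    frames = b(a(torch.randint(0, 64, (1, 20))))          # (1, 20, 160) waveform frames
    print(frames.shape)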


I've said this before on HN, but neural networks are not a tabula rasa. Their structure is created by people. With better domain understanding we can make better structures for AI models. It's not an either-or situation.


Isn't the ultimate point of AI, though, that instead of the traditional situation where machines are the tools of humans, we become the (optional) tools for the AI? The AGI will do the understanding for us, and we'll profit by getting what we want from the black box, like a dog receiving industrially manufactured treats it cannot comprehend from its owner.


Quality of life improvements are much much more important than understandable models of speech, so we should live with, appreciate, and work to interpret and improve the current generation of complex neural TTS models.

I depend on TTS to overcome dyslexia, but I also struggle with an auditory processing disorder that causes me to misunderstand words. As a result, classical TTS does not help me read faster or more accurately than struggling through my dyslexia. It causes me to rapidly fatigue, zone out, and rewind often, in a way that is more severe than when I sight read.

On the other hand, modern neural TTS is a huge enabler. My error rate, rewind rate, and fatigue are much better thanks to the natural tone, articulation, and prosody. I'm able to read for hours this way, and my productivity is higher than sight reading alone. This unlocks long and complex readings that I would never complete by sight reading alone, like papers in history, philosophy, and law. Previously I was limited to reading math, computer science, and engineering work, where I heavily depended on diagrams and math formulas to help me gloss over dense text readings.

The old tech had no impact on my life, given my combination of reading and listening difficulty, since it was not comparatively better than sight reading. But my life changed about 6 years ago with neural TTS. The improvement has been massive, and has helped me work with many non-technical readings that I would previously give up on.

The main issue I see now is not that neural models are hard to understand. For better or worse, we're able to improve the models just by throwing capital and ML PhDs at the problem. The problem I see is that the resulting technology is proprietary and not freely available to the people whose lives it would change.

We should work towards a future where people can depend on useful and free TTS that improves their quality of life. I don't think simple synthetic models will be enough. We must work to seize control of models that can provide the same quality of life improvements that new proprietary models can provide. And we must make these models free for everyone to use!


It's not at all a given that these two things are in conflict. The best path towards free TTS might well turn out to be identifying ways of making smaller models that are cheaper to train and improve on, if/when we can split out the things we know how to do (be it with separate neural models or other methods) and train models to "fill the gaps" instead of handling the entire end-to-end process.

There are also plenty of places where the current modern "neural" models are too compute intensive / costly to run, and so picking just the current big models isn't an option for all uses.


> The technique predates computers; it's basically the same technique used by the original voder [1] just under computer control.

Something similar from the 800s is the Euphonia talking machine ( https://en.m.wikipedia.org/wiki/Euphonia_(device) ).


* 1800s

Clicked that thinking someone had made a talking machine in the Middle Ages :)


Oops :)


Do you have any good resources on this?

I took a few stabs at understanding Klatt, but I feel like I had far too little DSP, math and linguistic intuitions back then to fully comprehend it, perhaps I should take another one now.


Blind person here, ESpeak-ng is literally what I use on all of my devices for most of my day, every day.

I switched to it in early childhood, at a time when human-sounding synthesizers were notoriously slow and noticeably unresponsive, and just haven't found anything better since. I used Vocalizer for a while, which is what iOS and macOS ship with, but then third-party synthesizer support was added and I switched right back.


How fast do you set speech playback speed/rate?

I tried a bunch of speech synthesis, with speed and intelligibility in mind.

ESpeak-ng is barely intelligible past ~500 words per minute, and just generally unpleasant to listen to. Maybe my brain just can't acclimatize to it.

Microsoft Zira Mobile (unlocked on a Win11 desktop via a registry edit) sounds much more natural and intelligible at the max Windows SAPI speech rate, which I estimate is around ~600 wpm and equivalent to most conversational/casual spoken word at 2x speed. I wish Windows could increase playback even further; my brain can process 900-1200 words per minute, or 3x-4x normal playback speed.

On Android, Google's "United States - 1" sounds a little awkward but also intelligible at 3x-4x speed.


Similar to OP: if the information is low density, like a legal contract, I can do 1200 wpm after a few hours of getting used to it. My daily normal is 600 wpm; if the text is heavy going enough I have to drop down to 100 wpm and put it on loop.

As usual, the limit isn't how fast human I/O is but how fast human processing works.


Yeah, 600 wpm is passive listening. 900-1200 wpm is listening to a lecture on YouTube at 3-4x speed: skim listening for content I'm familiar with, active listening for things I just want to speed through. It's context dependent; I find I can ramp up from 600 to 1200 and get into a flow state of listening.

>text is heavy going enough I have to drop it down to 100 wpm

What is heavy text for you? Like very dense technical text?

>put it on loop

I find this very helpful as well, but for content I consume that isn't very technical I listen at ~600 wpm and loop it multiple times. It's like listening to a song to death: engrain it on a vocal / storytelling level.

Edit: this is a semi-related reply to a now-deleted comment about processing speed that I can no longer respond to. Posting here because it's related.

Some speech synthesizers are much more intelligible at higher speeds, which aids processing at higher wpm. What I've been trying to find is the most intelligible speech synthesis voice for my upper limit of concentrated/burst listening, which for me is around 1200 wpm / 4x speed; many have weird audio artefacts past 3x. There are synthesis engines whose high-speed intelligibility improves if the text is processed with SSML markup to add longer pauses after punctuation. Just little tweaks that make processing easier. It doesn't apply to all content or all contexts, but I think some consumption is suitable for it, and it's something that can be trained like many mental tasks; dedicated speech synthesis, like fancy sports equipment, improves top-end performance.
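
The SSML tweak I mean is basically just inserting explicit breaks after sentence punctuation before handing the text over. A minimal sketch (the 300 ms figure is arbitrary, and the engine has to be told the input is SSML, e.g. espeak-ng's -m markup option as far as I recall):

    import re

    def add_pauses(text: str, pause_ms: int = 300) -> str:
        """Wrap plain text in SSML and add an explicit break after sentence punctuation."""
        marked = re.sub(r"([.!?])\s+", rf'\1 <break time="{pause_ms}ms"/> ', text)
        return f"<speak>{marked}</speak>"

    print(add_pauses("First point. Second point! Third?"))
    # <speak>First point. <break time="300ms"/> Second point! <break time="300ms"/> Third?</speak>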

IMO this is also something a neural model could be tuned for. There are some podcasters/audiobook narrators who are "easy" listening at 3x speed versus others, because they just have better enunciation/cadence at the same word density. Most voices out there, from traditional SAPI models to neural, are... very mid as fast "narrators". I think we need to bundle speech synthesis with content awareness: AI to filter content, then synthesize speech that emphasizes/slows down on significant information and breezes past filler; just present information more efficiently for consumption.


Thanks for the heads-up. May I ask if you know of websites/articles that explain daily setups for blind people? I had issues that required me not to rely on sight and I couldn't find much.


No example output? Here's a YouTube video where someone plays with this software:

https://www.youtube.com/watch?v=493xbPIQBSU


With timestamp, as there's hardly any TTS speech in that video: https://youtu.be/493xbPIQBSU?t=605


Oh man, it sounds awful, like 15-year-old tech.

I've been spoiled by modern AI generated voices that sound indistinguishable from humans to me.


It is 30-year-old tech.


When speaking Chinese, it says the tone number in English after each character. So "你好" is pronounced "ni three hao three". Am I using this wrong? I'm running `espeak-ng -v cmn "你好"`.

If this is just how it is, the "more than one hundred languages" claim is a bit suspect.


After some brief research, it seems the issue you're seeing may be a known bug in at least some versions/releases of espeak-ng.

Here are some potentially related links if you'd like to dig deeper:

* "questions about mandarin data packet #1044": https://github.com/espeak-ng/espeak-ng/issues/1044

* "ESpeak NJ-1.51’s Mandarin pronunciation is corrupted #12952": https://github.com/nvaccess/nvda/issues/12952

* "The pronunciation of Mandarin Chinese using ESpeak NJ in NVDA is not normal #1028": https://github.com/espeak-ng/espeak-ng/issues/1028

* "When espeak-ng translates Chinese (cmn), IPA tone symbols are not output correctly #305": https://github.com/rhasspy/piper/issues/305

* "Please default ESpeak NG's voice role to 'Chinese (Mandarin, latin as Pinyin)' for Chinese to fix #12952 #13572": https://github.com/nvaccess/nvda/issues/13572

* "Cmn voice not correctly translated #1370": https://github.com/espeak-ng/espeak-ng/issues/1370


I was curious: 300 minority languages are spoken in China, spread across 55 minority ethnic groups. https://en.wikipedia.org/wiki/Languages_of_China


Beyond China, countries like Singapore and Malaysia have their own blends of Mandarin/Hokkien/Cantonese/English/Malay/etc. The particular blend may differ from family to family, even within the same village.

I used it on Android and it seems to be one of very few apps that can replace the default Google services text-to-speech engine.

However, I wasn't satisfied with the speech quality, so now I'm using RHVoice. RHVoice seems to produce more natural/human-sounding output to me.


Depending on context, I cycle between espeak-ng with mbrola-en or RHVoice, but even plain espeak shouldn't be discarded.

RHVoice sounds slightly more natural in some cases, but one advantage of espeak-ng is that the text parsing logic is cleaner, by default.

For example, RHVoice likes to spell out a lot of regular text formatting. One example would be spelling " -- " as "dash dash" instead of pausing between sentences. So while the text sounds a little more natural, it's actually harder to understand in context unless the text is clean to begin with.

I don't know if speech-dispatcher does this for you, but I'm using a shell script and some regex rules to make the text cleaner for TTS which I don't need when using espeak-ng.
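
Roughly, the script boils down to something like this (the regex rules are simplified examples of the kind of rules I mean, and the espeak-ng flags shown are just the standard voice and speed options; adapt to taste):

    import re, subprocess

    CLEANUP = [
        (re.compile(r"\s--\s"), ". "),     # read " -- " as a sentence break, not "dash dash"
        (re.compile(r"[*_#>|]+"), " "),    # markdown-ish decoration
        (re.compile(r"\s{2,}"), " "),      # collapse runs of whitespace
    ]

    def clean(text: str) -> str:
        for pattern, repl in CLEANUP:
            text = pattern.sub(repl, text)
        return text.strip()

    def speak(text: str, voice: str = "en", wpm: int = 300) -> None:
        # espeak-ng reads the text from stdin when none is given on the command line;
        # -v selects the voice, -s the speed in words per minute.
        subprocess.run(["espeak-ng", "-v", voice, "-s", str(wpm)],
                       input=clean(text), text=True, check=True)

    speak("Some *heavily* formatted text -- with decoration the voice would spell out.")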

Another tradeoff: espeak-ng with MBROLA doesn't offer all the inflection customization options you have with the "robotic-sounding" voices. When accelerating speech, those options make a qualitative difference in my experience.

I can see why each of these can have its place.


On Android I use ETI-Eloquence, but you can't get a legal copy. Google it and look on the Blind Help website; there is an APK.


I always feel sympathy for the devs on this project, as they get so many issues raised by people who are largely lazy (since the solution is documented and/or they left out obvious detail) or plain wrong. I suspect it's a side effect of espeak-ng being behind various other tools, and in particular being critical to many screen readers, so you can see why the individuals need help even if they struggle to ask for it effectively.


Anyone know why the default voice is set to be so bad?


Why specifically do you consider it to be bad? Espeak-ng is primarily an accessibility tool, used as the voice synthesizer for screen readers. Clarity at high speed is more important than realism.


That can't be a serious question. Go listen to the accessibility voice for Windows or Mac and then compare the way it sounds. Both of those are more human-like, with better pronunciation.


To you.

The default voice sounds robotic for several reasons. It has a low sample rate to conserve space. It is built using a mix of techniques that make it difficult to reconstruct the original waveform exactly. And it uses things like artificial noise for the plosives, etc.

The default voice is optimized for space and speed instead of quality of the generated audio.


I'll suggest that's the wrong optimization to make for an accessibility tool. Modern CPUs are more than capable of handling its speed requirements by several orders of magnitude (they can decode H.265 in real time, for god's sake, without HW acceleration). And the same goes for size.

It’s simply the wrong tuning tradeoff.


But today disk space is not an issue.


As I've learned over time (and as other people in these comments have clarified), it turns out that evaluating the "quality" of text-to-speech output is somewhat dependent on the domain in which the audio is being used (obviously with overlaps), broadly:

* accessibility

* non-accessibility (e.g. voice interfaces; narration; voice over)

The qualities of the generated speech which are favoured may differ significantly between the two domains, e.g. AIUI non-accessibility focused TTS often prioritises "realism" & "naturalness" while more accessibility focussed TTS often prioritizes clarity at high words-per-minute speech rates (which often sounds distinctly non-"realistic").

And, AIUI espeak-ng has historically been more focused on the accessibility domain.


I don't have any disabilities, so I don't know if espeak-ng is better on the pure accessibility axis. But given that macOS tends to be received quite well by the accessibility crowd, and that it's definitely a focus from what I observed internally, and given that macOS has much higher realism and naturalness out of the box, I'm going to posit that it's not the linear tradeoff argument you've made and that espeak-ng's defaults aren't tuned well out of the box.


I think it would be good if they provided some samples in the README; for example, if their list of languages/accents could be sampled [1].

[1] https://github.com/espeak-ng/espeak-ng/blob/master/docs/lang...

> eSpeak NG uses a "formant synthesis" method. This allows many languages to be provided in a small size. The speech is clear, and can be used at high speeds, but is not as natural or smooth as larger synthesizers which are based on human speech recordings. It also supports Klatt formant synthesis, and the ability to use MBROLA as backend speech synthesizer.

I've been using eSpeak for many years now. It's superb for resource constrained systems.

I always wondered whether it would be possible to have a semi-context aware, but not neural network, approach.

I quite like the sound of Mimic 3, but it seems to be mostly abandoned: https://github.com/MycroftAI/mimic3


FYI re: Mimic 3: the main developer Michael Hansen (a.k.a synesthesiam) (who also previously developed Larynx TTS) now develops Piper TTS (https://github.com/rhasspy/piper) which is essentially a "successor" to the earlier projects.

IIUC ongoing development of Piper TTS is now financially supported by the recently announced Open Home Foundation (which is great news as IMO synesthesiam has almost single-handed revolutionized the quality level--in terms of naturalness/realism--of FLOSS TTS over the past few years and it would be a real loss if financial considerations stalled continued development): https://www.openhomefoundation.org/projects/ (Ok, on re-reading OHF is more generally funding development of Rhasspy of which Piper TTS is one component.)


I hoped "-ng" would be standing for Nigeria - which would have been most fitting, considering Nigeria's linguistic diversity !


Can I get my map navigation prompts in the voice of Yoda please?

"At the roundabout, the second exit take."

"At your destination, arrived have you."


Quenya is an option, so (assuming you speak it) you could get your map navigation prompts in the voice of Galadriel...

"A star shall shine on the hour of our taking the second exit."

"You have reached your Destination, fair as the Sea and the Sun and the Snow upon the Mountain!"


You mean the voice, or the grammar? The grammar part is outside of the scope of a synthesizer. That's completely up to the user.

Or you want a model which translates normal English into Yoda-English (on text level) and then attach a speech synthesizer on that?

Or I guess an end-to-end speech synthesizer, a big neural network which operates on the whole sentence at once, could also internally learn to do that grammar transformation.



I'm quite surprised to find this on HN; synthesizers like espeak and Eloquence (IBM TTS) have fallen out of favor these days. I'm a blind person who uses espeak on all my devices except my MacBook, where unfortunately I can't install the speech synthesizer because it apparently only supports macOS 13 (installing the library itself works fine though).

Most times I try to use modern "natural-sounding" voices, they take a while to initialize, and when you speed them up past a certain point the words mix together into meaningless noise, while at the same rate Eloquence and espeak handle it just great. Well, for me at least.

I was thinking about this a few days back while trying out piper-tts [0]: how supposedly "more advanced" synthesizers powered by AI use up more RAM and CPU and disk space to deliver a voice which doesn't sound much better than something like RHVoice and gets things like inflection wrong. And that's the English voice; the voice for my language (Serbian) makes espeak sound human, and according to piper-tts it's "medium".

Funny story about synthesizers taking a while to initialize: there's a local IT company here that specializes in speech synthesis, and their voices take so long to load that they had to say "<company> Mary is initializing..." whenever you start your screen reader or such. Was annoying, but in a fun way. Their newer Serbian voices also have this "feature" where they try to properly pronounce some English words they come across. They also have another "feature" where they try to correctly pronounce words that were spelled without accent marks, and like with most of these kinds of "features", they combine badly and hilariously. For example, if you asked them to pronounce "topic" it would come out as "topich", which was fun while browsing forums and such.

[0] https://github.com/rhasspy/piper


Anyone interested in formants and speech synthesis should have a look at Praat[0], a marvellous piece of free software that can do all kinds of speech analysis, synthesis, and manipulation.

https://www.fon.hum.uva.nl/praat/


Is this better than the classical espeak which is available in opensource repositories?

I would be very glad if there were a truly open source, locally hosted text-to-speech program that delivers good human-sounding speech in female/male voices for German/English/French/Spanish/Russian/Arabic...


When you install espeak with a distro package manager, you're quite likely to get espeak-ng.


Based on your description of your requirements Piper TTS might be of interest to you: https://github.com/rhasspy/piper


I listen to ebooks with TTS. On Android via F-Droid, the speech packs in this software are extremely robotic.

There aren't many options for degoogled Android users. In the end I settled for Google Speech Services, disabled its network access, and used the default voice. GSS has its issues and voices don't download properly, but the default voice is tolerable in this situation.


You can get ETI-Eloquence for Android.


Thanks for the tip, but it didn't work for me. It's unselectable in the TTS engine options screen and the app crashes for some reason.

Another project falls victim to the tragic “ng” relative naming, leaving it without options for future generations


They can name the next iteration ESpeak-DS9 ;)


I actually have seen that done at a former employer, a very large agribusiness. I bet there are more examples of that very specific, not so intended versioning system out there.


Now I just want DECTalk ported to MacOS. The original Stephen Hawking voice.

I have an Emic2 board I use (through UART so my ESP32 can send commands to it) and I use Home Assistant to send notifications to it. My family are science nerds like me, so when the voice of Stephen Hawking tells us there is someone at the door, it brings a lot of joy to us.




Awesome, thank you!


Why is the quality of open source TTS so horribly, horribly, horribly behind the commercial neural ones? This is nowhere near the quality of Google, Microsoft, or Amazon TTS, yet for image generation and LLMs almost everything outside of OpenAI seems to be open-sourced.


I'm glad that it doesn't. A lot of us use these voices as an accessibility tool in our screen readers. They need to perform well and be understandable at a very high rate, and they need to be very responsive. ESpeak is one of the most responsive speech synths out there, so for a screen reader the latency from key press to speech output is extremely low. Adding AI would just make this a lot slower and less predictable, and unusable for daily work, at least right now. This is anecdotal, but part of what makes a synth work well at high speech rates is predictability. I know exactly how a speech synth is going to say something. This lets me put more focus on the thing I'm doing rather than on trying to decipher what the synth is saying. Neural TTS always has differences in how it says a thing, and at times those differences can be large enough to trip me up. Then I'm focusing on the speech again and not on what I'm doing. But ESpeak is very predictable, so I can let my brain do the pattern matching and focus actively on something else.


The quality also depends on the type of model. I'm not really sure what eSpeak-ng actually uses. The classical TTS approaches often use some statistical model (e.g. an HMM) plus a vocoder. You can get to intelligible speech pretty easily, but the quality is bad (w.r.t. how natural it sounds).

There are better open source TTS models. E.g. check https://github.com/neonbjb/tortoise-tts or https://github.com/NVIDIA/tacotron2. Or here for more: https://www.reddit.com/r/MachineLearning/comments/12kjof5/d_...


Almost like there's a few billion dollars' difference in their budgets.


Sure, but it's been 15 years and the quality of the espeak command is equally horrible. I would have expected some changes... especially considering that even the free TTS inside Google Chrome is actually pretty decent; that could just be extracted and packaged up as a new version of espeak.


Festival is nicer; Flite would run on a toaster, and MBROLA can work with eSpeak, but its data is restricted for commercial usage.


DeepSpeech from Mozilla is open source. Did you know about that one?

From the samples I listened to, it sounds great to me.


As I understand it DeepSpeech is no longer actively maintained by Mozilla: https://github.com/mozilla/DeepSpeech/issues/3693

For Text To Speech, I've found Piper TTS useful (for situations where "quality"=="realistic"/"natual"): https://github.com/rhasspy/piper

For Speech to Text (which AIUI DeepSpeech provided), I've had some success with Vosk: https://github.com/alphacep/vosk-api


I have to try this, thanks! Unfortunately I couldn't find samples in their git repo and it looks like it isn't apt-gettable. Maybe that's part of the reason.

They should make it so that I can do

    sudo apt-get install deepspeech
    sudo ln -s /usr/bin/deepspeech /usr/bin/espeak
Anything more than that is an impediment to mass adoption.

Seems they need some new product management ...


ESpeak is pretty great, and now that Piper is using it, hopefully strange issues, like it saying "nineteen hundred eighty four" for the year 1984, can be fixed.


As the sibling comment mentions, the next version of Piper will no longer use espeak-ng to avoid potential GPL licensing issues.


Yeah, it would be nice if the financial backing behind Rhasspy/Piper led to improvements in espeak-ng too but based on my own development-related experience with the espeak-ng code base (related elsewhere in the thread) I suspect it would be significantly easier to extract the specific required text to phonemes functionality or (to a certain degree) reimplement it (or use a different project as a base[3]) than to more closely/fully integrate changes with espeak-ng itself[4]. :/

It seems Piper currently abstracts its phonemize-related functionality with a library[0] that currently makes use of a espeak-ng fork[1].

Unfortunately it also seems license-related issues may have an impact[2] on whether Piper continues to make use of espeak-ng.

For your specific example of handling 1984 as a year, my understanding is that espeak-ng can handle situations like that via parameters/configuration but in my experience there can be unexpected interactions between different configuration/API options[6].

[0] https://github.com/rhasspy/piper-phonemize

[1] https://github.com/rhasspy/espeak-ng

[2] https://github.com/rhasspy/piper-phonemize/issues/30#issueco...

[3] Previously I've made note of some potential options here: https://gitlab.com/RancidBacon/notes_public/-/blob/main/note...

[4] For example, as I note here[5] there's currently at least four different ways to access espeak-ng's phoneme-related functionality--and it seems that they all differ in their output, sometimes consistently and other times dependent on configuration (e.g. audio output mode, spoken punctuation) and probably also input. :/

[5] https://gitlab.com/RancidBacon/floss-various-contribs/-/blob...

[6] For example, see my test cases for some other numeric-related configuration options here: https://gitlab.com/RancidBacon/floss-various-contribs/-/blob...


Just used it a few days ago; the quality is honestly subpar.

I use chrome's extension 'read aloud', which is as natural as you can get.


It's been mentioned elsewhere in the comments but espeak-ng has historically prioritized accessibility use cases which is a domain where "quality" doesn't necessarily correlate with "naturalness" (e.g. there is a preference for clarity at high words-per-minute rates of speech where the speech doesn't sound "natural" but is still understandable, for people who have acclimatized to it through daily use, at least :) ).


SORA AI should integrate this into their LLM.


Is it an LLM? What base model does it use?


eSpeak uses what is known as formant synthesis, and no LLM as far as I know.


Definitely no LLM! Espeak dates from at least 10 years before LLMs appeared and was based on the approach used on Acorn computers in the 80s and 90s.


hugging face?


Based on my own recent experience[0] with espeak-ng, IMO the project is currently in a really tough situation[3]:

* the project seems to provide real value to a huge number of people who rely on it for reasons of accessibility (even more so for non-English languages); and,

* the project is a valuable trove of knowledge about multiple languages--collected & refined over multiple decades by both linguistic specialists and everyday speakers/readers; but...

* the project's code base is very much of "a different era" reflecting its mid-90s origins (on RISC OS, no less :) ) and a somewhat piecemeal development process over the following decades--due in part to a complex Venn diagram of skills, knowledge & familiarity required to make modifications to it.

Perhaps the prime example of the last point is that `espeak-ng` has a hand-rolled XML parser--which attempts to handle both valid & invalid SSML markup--and markup parsing is interleaved with internal language-related parsing in the code. And this is implemented in C.

[Aside: Due to this I would strongly caution against feeding "untrusted" input to espeak-ng in its current state but unfortunately that's what most people who rely on espeak-ng for accessibility purposes inevitably do while browsing the web.]

[TL;DR: More detail/repros/observations on espeak-ng issues here:

* https://gitlab.com/RancidBacon/floss-various-contribs/-/blob...

* https://gitlab.com/RancidBacon/floss-various-contribs/-/blob...

* https://gitlab.com/RancidBacon/notes_public/-/blob/main/note...

]

Contributors to the project are not unaware of the issues with the code base (which are exacerbated by the difficulty of even tracing the execution flow in order to understand how the library operates) nor that it would benefit from a significant refactoring effort.

However, as is typical with such projects, which greatly benefit individual humans but don't offer an opportunity to generate significant corporate financial return, a lack of developers with sufficient skill/knowledge/time to devote to a significant refactoring means a "quick workaround" for a specific individual issue is often all that can be managed.

This is often exacerbated by outdated/unclear/missing documentation.

IMO there are two contribution approaches that could help the project moving forward while requiring the least amount of specialist knowledge/experience:

* Improve visibility into the code by adding logging/tracing to make it easier to see why a particular code path gets taken.

* Integrate an existing XML parser as a "pre-processor" to ensure that only valid/"sanitized"/cleaned-up XML is passed through to the SSML parsing code--this would increase robustness/safety and facilitate future removal of XML parsing-specific workarounds from the code base (leading to less tangled control flow) and potentially future removal/replacement of the entire bespoke XML parser. (A rough sketch of this idea follows below.)
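
As a sketch of that second suggestion, assuming Python's standard XML machinery purely to show the shape of the approach (the actual integration would live in C, e.g. on top of a library like libxml2):

    # Shape of the "clean the XML before espeak-ng ever sees it" idea, using Python's
    # standard parser; the real integration would live in C (e.g. on top of libxml2).
    import xml.etree.ElementTree as ET
    from xml.sax.saxutils import escape

    def sanitize_ssml(markup: str) -> str:
        try:
            root = ET.fromstring(markup)                  # well-formed: re-serialize it
            return ET.tostring(root, encoding="unicode")
        except ET.ParseError:
            # Not well-formed: treat the input as plain text so the downstream
            # hand-rolled SSML code only ever receives valid markup.
            return "<speak>" + escape(markup) + "</speak>"

    print(sanitize_ssml("<speak>ok <break time='200ms'/></speak>"))
    print(sanitize_ssml("<speak>broken <break"))          # falls back to escaped text

Falling back to escaped plain text means the existing hand-rolled parsing code only ever sees well-formed markup, which is what would allow the XML-specific workarounds to be removed over time.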

Of course, the project is not short on ideas/suggestions for how to improve the situation; what it's short on is direct developer contributions, so... shrug

In light of this, last year, when I was developing a personal project [0] which made use of a dependency that in turn used espeak-ng, I wanted to try to contribute something more tangible than just "ideas", so I began to write up and create reproductions for some of the issues I encountered while using espeak-ng, to at least document the current behaviour.

Unfortunately while doing so I kept encountering new issues which would lead to the start of yet another round of debugging to try to understand what was happening in the new case.

Perhaps inevitably this effort eventually stalled--due to a combination of available time, a need to attempt to prioritize income generation opportunities and the downsides of living with ADHD--before I was able to share the fruits of my research. (Unfortunately I seem to be way better at discovering & root-causing bugs than I am at writing up the results...)

However, I've now used the espeak-ng project being mentioned on HN as a catalyst to at least upload some of my notes/repros to a public repo (see links in the TL;DR section above), in the hope that maybe they will be useful to someone who has the time/inclination to make a more direct code contribution to the project. (Or, you know, prompt someone to offer to fund my further efforts in this area... :) )

[0] A personal project to "port" my "Dialogue Tool for Larynx Text To Speech" project[1] to use the more recent Piper TTS[2] system which makes use of espeak-ng for transforming text to phonemes.

[1] https://rancidbacon.itch.io/dialogue-tool-for-larynx-text-to... & https://gitlab.com/RancidBacon/larynx-dialogue/-/tree/featur...

[2] https://github.com/rhasspy/piper

[3] Very much no shade toward the project intended.


"More than hundred"


FWIW, in many languages that's correct. Coming from the Dutch "meer dan honderd", being taught to say "one hundred" is like teaching an English speaker to say "more than one ten" for values > 10.



