Coqui.ai TTS: A Deep Learning Toolkit for Text-to-Speech

modeless · 2024-06-11T16:48:06 1718124486

XTTSv2 is only slightly behind StyleTTS 2 near the top of the TTS Arena leaderboard, though they are both far behind Eleven Labs: https://huggingface.co/spaces/TTS-AGI/TTS-Arena

Personally I prefer StyleTTS 2, and it has a better license. But XTTSv2 has a streaming mode with pretty low latency which is nice. I did run into hallucination issues though. It will hallucinate nonsense words or insert extra syllables in words, pretty frequently.

As others mentioned they shut down so there won't be any updates to XTTS.

eginhard · 2024-06-11T18:34:23 1718130863

They just shared the paper for XTTS, which got accepted to Interspeech and might be the reason for this being posted now: https://arxiv.org/abs/2406.04904

jsemrau · 2024-06-11T19:39:21 1718134761

Interesting. I got quite good results for my longform substack by combining xTTS2 with Nvidia's Nemo.

WhitneyLand · 2024-06-11T21:39:49 1718141989

Anyone have a sense for how these compare to OpenAI’s TTS?

jonahx · 2024-06-12T01:25:52 1718155552

Somewhat unrelated, but given that anyone can vote anonymously, how is the TTS-Arena protecting itself against bots or even rings of humans gaming the system?

modeless · 2024-06-12T02:09:25 1718158165

Low stakes, I guess

Grimblewald · 2024-06-12T06:06:55 1718172415

problem is that low stakes divided by low cost of bots is still an acceptable return.

vessenes · 2024-06-11T17:46:37 1718127997

NB: Coqui is no longer actively maintained. I’m not sure what the team is up to now. The open market is definitely in need of an upgraded TTS offering; eleven labs is far ahead at the moment.

eginhard · 2024-06-11T18:49:45 1718131785

We do maintain a fork, mostly with bug fixes for now: https://github.com/idiap/coqui-ai-TTS PRs welcome :)

dlx · 2024-06-11T22:06:38 1718143598

Any progress on the license situation? I'd love to work more on it, but worried about it being a bit of a dead end due to uncertainty about the future of the license and not being able to use it in any commercial projects.

eginhard · 2024-06-12T06:07:47 1718172467

The licenses of the code (MPL 2.0, allowing commercial use) and the available pretrained models (https://github.com/idiap/coqui-ai-TTS/blob/dev/TTS/.models.j...) are all clearly stated and won't change unless the model owners decide to do so. So the XTTS model is still under CPML, which doesn't allow commercial use.

CaptainOfCoit · 2024-06-11T23:51:27 1718149887

> Any progress on the license situation?

What is the situation exactly? Seems to be licensed MPL at a glance, so you're able to use it in commercial projects.

woodson · 2024-06-12T03:04:44 1718161484

The pretrained models aren’t MPL licensed, though.

eginhard · 2024-06-12T05:59:38 1718171978

Many of them still allow commercial use. The question is most likely about the XTTS model, which doesn't, but its license is up to the original Coqui team.

personjerry · 2024-06-11T22:26:05 1718144765

Not surprising. When I was researching options for a client I tried a few companies including ElevenLabs and Play.ht, each seemed happy to talk to us... except Coqui. I think I went as far as reporting bugs to them, just to have them aggressively ignore me. I guess they're more of a research team than a business?

jokethrowaway · 2024-06-12T00:11:25 1718151085

They were very friendly and welcoming.

The main problem is quality, Eleven labs is so far ahead, even though their API is not very flexible.

Meta's Voicebox is the only other option that feels realistic - but it's for research only for now.

nmfisher · 2024-06-12T01:39:28 1718156368

Check out Sonic (cartesia.ai). Great quality, very fast - but with a few kinks to work out (going off the rails on long utterances, random sounds, etc).

phyce · 2024-06-11T22:10:24 1718143824

Coqui is great, but another fantastic tool for TTS I recommend checking out is Piper. The voice quality is great, it's extremely lightweight, and it's fast enough to generate TTS in realtime https://github.com/rhasspy/piper

dv35z · 2024-06-12T18:12:13 1718215933

Can you suggest (1) How to get it working on a Mac, (2) alternatively, how to get it running in a Docker container (on a mac)?

mlboss · 2024-06-13T01:57:25 1718243845

Works with rhel9 docker image and compiled binary link

dv35z · 2024-06-15T23:15:46 1718493346

Thank you!

huskyr · 2024-06-12T15:10:06 1718205006

Piper seems very interesting, but unfortunately the last time i tried it on macOS it didn't seem to work (anymore).

nishithfolly · 2024-06-11T16:40:45 1718124045

This was a great team. Sad to see they had to shut down.

ks2048 · 2024-06-12T00:27:11 1718152031

I don't know anything about the startup/VC world, but does anyone have insight on why this failed? It seemed to be one of the highest profile TTS projects and I thought money was just pouring into AI startups.

eginhard · 2024-06-12T06:12:52 1718172772

Some insights from one former lead: https://erogol.com/2024/01/09/goodsandbadsofopensource

TLDR: Making money from open-source is hard.

satvikpendem · 2024-06-12T01:26:15 1718155575

How does it compare to this recent Show HN, MARS5 [0]? Coqui is not maintained anymore so I'd be interested in what the SOTA is for open source TTS.

[0] https://news.ycombinator.com/item?id=40616438

SubiculumCode · 2024-06-12T02:19:07 1718158747

I have a pet ML project that I am doing for fun. I am trying to build a custom transcription and diarizer model for a friend's podcast[1]. My initial solution involved a straight forward implementation using Whisper medium for transcription, and Nemo for diarizing, based on [1]. The results are not bad generally, but since my application involves a fixed set of five known speakers, I thought surely I could fine tune the nemo (or pyannote) diarizer model on their voices to improve accuracy.

Audio samples are easily obtained from their podcast, but manual data labeling is painful for a hobby activity. Further, from what I understand, the real difficulty in performant diarizer models is not speaker recognition generally, but specifically speaker recognition while there is overlapping speech between multiple speakers. I am not even sure how to best implement a labeling procedure for segments with overlapping speech.

I started to wonder whether I might bootstrap a decent sample by leveraging TTS vocal cloning models to simulate the five speakers in dialogues with overlapping speech segments. So I ask HN, is this hopelessly naive, or potentially useful technique? Also, any other advice?

[1] https://www.3d6downtheline.com/ [2] https://github.com/MahmoudAshraf97/whisper-diarization/

tarasglek · 2024-06-12T08:48:52 1718182132

Unclear from docs, does your solution support inferring number of speakers from audio? Found it a bit frustrating that this wasn't automatic in diarization algos I tried last year

SubiculumCode · 2024-06-12T15:43:00 1718206980

The solution that this GitHub provided automatically determines the number of speaker labels, but will often create extra speaker classes for a few exerpts in the stream. You can prespecify the number of speakers I believe for better performance.

ackprakhack · 2024-06-12T06:01:28 1718172088

We've just opensourced MARS5 and are bullish about it's ability to capture very hard prosody -- hopefully you can validate the results and grow alongside its community.

We tend to agree, the time for just one company to be seriously doing speech is over. It needs to be more diverse, and needs to be opensource https://github.com/Camb-ai/MARS5-TTS

BenRacicot · 2024-06-13T21:30:46 1718314246

If we could run this locally (Win and Mac) it could reset the standard for accessibility.

vijucat · 2024-06-12T07:39:50 1718177990

I absolutely love how good the voices are in the VCTK-VIS dataset (109 of them!). I found it easy to install Coqui on WSL and it is able to use CUDA + the GPU quite effectively. p236 male and p237 female are my choices, but holy cow, 109 quality voices still blows my mind. Crazy how you had to pay for a good TTS just a year ago, but now, it's commoditized. Hope you find this useful:

    CUDA_VISIBLE_DEVICES="0" python TTS/server/server.py --model_name tts_models/en/vctk/vits --use_cuda True


 def play_sound(response):
     #learning : you have to use a semaphore to serialize calls to winsound.PlaySound(), which freaks out with "Failed to play sound" if you try to play 2 clips at once
     semaphore.acquire()
     try:
         winsound.PlaySound(response.content, winsound.SND_MEMORY | winsound.SND_NOSTOP)
     finally:
         # Always release the permit, even if PlaySound raises an exception
         semaphore.release()

ritonlajoie · 2024-06-11T18:50:27 1718131827

Are there any project which would make TTS with my own voice with some training on my voice ?

probably_wrong · 2024-06-12T12:09:09 1718194149

While the other commenters provided several voice cloning projects, I would like to point out that I haven't been able to find one that works well for South-American Spanish.

eginhard · 2024-06-11T19:41:29 1718134889

Yes, you can train/fine-tune models on your own voice with Coqui

willwade · 2024-06-11T22:39:25 1718145565

Elevenlabs, coqui, piper, Microsoft, Google, Apple. Seriously. They all can these days. Don’t forget acapela or nuance.

mttpgn · 2024-06-11T19:36:26 1718134586

Yes, elevenlabs can.

roskoez · 2024-06-11T23:52:57 1718149977

Anyone knows a modern TTS program for Windows? Something you can feed a text file to and have it read while it's on screen?

I've been using Dimio's Speech for a decade now, but it seems silly now that much better voices exist.

robotburrito · 2024-06-12T05:21:33 1718169693

I like this project. I used it to create a website that lets me turn a list of articles on the web into a podcast I can subscribe to via my phone.

nextworddev · 2024-06-12T00:58:25 1718153905

Is there anything that’s self hostable that’s on par with Elevenlabs?

spacemanspiff01 · 2024-06-11T16:38:20 1718123900

I believe the company behind this shit down at the end of 2023

giancarlostoro · 2024-06-11T17:40:38 1718127638

One of my favorite typos. ;) Also coqui is a frog in Puerto Rico (that wound up in Hawaii, sneaking into someone's luggage or something to that effect), when you hear them at night, what you are hearing is their mating call if I remember correctly.

Kerbonut · 2024-06-11T20:03:25 1718136205

I really like Parler TTS on the TTS Arena.

Jayakumark · 2024-06-11T16:39:35 1718123975

Its good except for license.

sa-code · 2024-06-11T16:53:25 1718124805

Is the license still relevant if the company has shut down?

cal85 · 2024-06-11T17:06:25 1718125585

marcooliv · 2024-06-11T21:41:32 1718142092

dlx · 2024-06-11T22:07:25 1718143645

The license forbids commercial use unless you buy a license. The problem is, no one seems to be selling one ;)

pabs3 · 2024-06-12T14:52:53 1718203973

The license is the MPL, which allows commercial use?

https://github.com/coqui-ai/TTS/blob/dev/LICENSE.txt

eginhard · 2024-06-12T18:45:01 1718217901

This is just the code license. Parents are referring to the XTTS model (their best one).

2Gkashmiri · 2024-06-12T02:04:39 1718157879

So if you violate the license, who will try to sue you?

Bilal_io · 2024-06-12T03:50:31 1718164231

You never know, and you don't want to find out