Silero V3: fast high-quality text-to-speech in 20 languages with 173 voices (github.com/snakers4)
256 points by TheRealAicantar on June 20, 2022 | 97 comments



"Enterprise-grade STT" and then a 19% word error rate on CommonVoice?

Scribosermo from 2020 was at 7% error rate. State of the art is around 3%.

Let me just illustrate fuck you how eerie ta thing and annoying it is if the eh aye gets every force word wrong.

Let me just illustrate for you how irritating and annoying it is if the AI gets every fourth word wrong.

(20 words, 4 mistakes => 20% word error rate)
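For concreteness, WER is just word-level Levenshtein distance divided by the number of reference words. A minimal sketch (my own illustration, not Silero's benchmark code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming Levenshtein over words.
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        diag, row[0] = row[0], i
        for j, h in enumerate(hyp, 1):
            diag, row[j] = row[j], min(row[j] + 1,        # deletion
                                       row[j - 1] + 1,    # insertion
                                       diag + (r != h))   # substitution
    return row[-1] / len(ref)
```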

EDIT: Just to clarify, I'm only criticizing their "speech to text" quality. Their "text to speech" quality is top notch and close to state of the art.


Can you elaborate further? I am not familiar with the field, but their benchmarks here seem to show quality similar to Google: https://github.com/snakers4/silero-models/wiki/Quality-Bench...

The only trick I can see being played is that Google was benchmarked on September 2020, so likely has already improved and they don't want to show that. Is CommonVoice a better standard to use when comparing these tools?


CommonVoice is roughly the sound quality you expect when random people talk into their headset microphones. So that's the quality you need to work with to build a phone system for the general public.

LibriSpeech, on the other hand, is audiobooks read in a silent setting with a good microphone. And some speakers are even professional narrators. So that's the dataset to compare against for an office worker using it for dictation with high-quality equipment.

Also, Google is kinda famous for having the worst speech recognition of the enterprise offerings. (Microsoft's Azure STT has roughly half the error rate) Plus they tested Google's old model.

But the main point I'm trying to make is that even if it is "as good as Google", that a 20% word error rate is still pretty much unusable in practice.


CommonVoice is actually not a very good test set. Its texts are very specific (mostly Wikipedia and such), and the texts also overlap between train and test, which leads to overfitting in most transformer models. If you test on a variety of domains, you'll see a totally different picture.


> Also, Google is kinda famous for having the worst speech recognition of the enterprise offerings.

Not in my experience. I tested basically all commercial speech recognition APIs a number of years ago and Google was significantly ahead of everyone else.

It was some time ago and I haven't tested since, but my casual use of speech recognition systems (e.g. via Alexa or Google Assistant) suggests that it's only gotten better since then.


Google's gotten worse for professional use than they once were, in my opinion. Maybe it's because they're targeting a wider variety of dialects and accents but that's just a hypothesis. It used to be that if you spoke in the "dictation voice" where you enunciated clearly and bit your consonants Google would nail every word except true homophones but that isn't the case anymore.


Ok, thank you for the clarification.

I see your point, although "as good as Google" qualifies as "enterprise-level" in my book.


I was curious and poked at the TTS Colab to switch it from Russian to English (language = 'en', model_id = 'v3_en' under "V3", speaker = 'en_XXX' under "Text"). Short "this is a test"s sound great, so then I tried feeding it the nearest bit of interesting conversational text I had to hand - this thread.

Here's your comment through the two apparently-most-developed models:

en_116: https://vocaroo.com/1axxFRHCs4YF

en_117: https://vocaroo.com/1983M4jVGMdR

Uuhhhh. It has a bit of a way to go to get to where Google et al are at, IMHO. It sounds vaguely like someone put DeepDream and GPT-3 into a blender and selected the "Transmutate into TTS model" option. On the one hand it's undeniably up there in terms of not sounding like the previous generation of TTS, buuuut yeah it has a bit of a way to go.

To be clear it should take just about anyone under a minute to switch the Colab to English, this is just for whoever doesn't feel like fiddling (and/or is on mobile).


I remember we had systems with a 15% error rate successfully deployed in multiple solutions. Yes, looking at the number you would think it is really bad, but in practice errors are more likely to happen on short words (such as 'for', for example); those words are less meaningful, and the error would be 'four' instead of 'fuck'. And we were working with studio-recording-quality data XD


For STT, one should use Vosk, right?


I am so disappointed. What a harmful project.

I apologize for the somewhat blunt language, but no matter how worthy, all software licensed under CC-NC is worthless. It's for fun side projects that you want to die, not for anything of social value. Democratizing STT/TTS has a very clear social value, but CC-NC is a dangerous trap for anything that touches it. Anyone even slightly inclined to take some money to develop it - even still in the open - MUST avoid touching NC code with a ten-foot pole, or abandon their plans to publish anything. The better the project, the more damage CC-NC does.

"CC-NC considered harmful" http://esr.ibiblio.org/?p=4559 - by https://en.wikipedia.org/wiki/Eric_S._Raymond, founder of the OSI, the organization that defines the rules for what constitutes "open source" [1][2].

[1] https://en.wikipedia.org/wiki/Open_Source_Initiative [2] https://en.wikipedia.org/wiki/The_Open_Source_Definition


Oh, the same rhetoric is used at length about the GNU AGPL licenses as well. And it is so nice to read the opinions of people explaining why corporation X is not breaking your AGPL license and can use everything for free.

The reality is much simpler: in the real world hardly anyone cares about licenses. Companies and corporations steal all the time; it is only a matter of how much money you want to invest in litigation.

The only real silver lining is that all popular software licenses basically prohibit authors from defending their rights ... most prominent FOSS projects are financed by corporations as a means of competition ... and of course you (i.e. me) should use a license that deprives me of any possible rights.


A lot of these comments are speaking on behalf of small projects and small companies who cannot afford to invest any money in litigation. So

> anyone hardly cares about licenses

is patently not true. If we are convinced that corporations steal and therefore CC-NC is acceptable, then is the intent of this project to only service corporations? Because it seems quite dangerous for individual projects.


> then is the intent of this project to only service corporations?

One of the intents is NOT to service corporations for free or promote such services.


As you said, corporations steal all the time, so they're able to use this "for free". But since I'm not willing to steal, I can't use this for any sort of project because commercial activity is not well-defined as the root comment points out. Ironically, that intent doesn't seem to line up with the result.


Yeah, and instead of contacting the authors directly, you are making a public case about how unfair the NC license is.


> Oh, the same rhetoric used in depth about GNU AGPL licenses as well

Well, no, not really. When the literal author of "The Cathedral and the Bazaar" denounces NC as a "dangerous trap" (direct quote) and explains that the Open Source Initiative forbids any such restriction in any open-source licenses, that's meaningful.

I think the AGPL does a reasonable job of promoting its goals. It's a patch over the GPL to clarify the technical vagueness of "derivative work" that the GPL almost doesn't address, except by convention and common understanding that developed before web services were as common as they are. Where the GPL was vague and exploited, the AGPL clarifies and closes the loophole. The OSI arguably agrees with this, as the AGPL is advertised as OSI-approved.

NC, by comparison, opens a huge, gaping loophole that is entirely separate from ideology. It makes the code socially, and especially legally, radioactive to anyone who might want to work on something similar.

To me, that pushes it far to the other end of ethical. Whereas the AGPL stands for principles, NC is not just significantly less ethical, it is actually /unethical/, even more so than proprietary software, in that as far as some legal departments are concerned, it bans interested people who develop expertise from working on it or related projects.

In other words, NC actively selects for people or corporations for whom either ethics or enforcement are a distant concern.

I have a friend who develops software to help people communicate through STT and TTS, and several more who actively make use of it, and they would do anything to have an alternative that didn't involve paying copious amounts of money to megacorporations. Unfortunately, this means that they have to structure their software as a service model, despite wanting it to be as freely available as is viable.

So I can't express in words how excited I was to find an open-source STT when this was posted, and how immediately crushed I felt that the author actually just threw up two huge middle fingers to open-source. I know how stupid or disingenuous it might sound to claim, but it truly brings me close to tears at the cruelty. Like, WHY??? They put all this time and effort into making a viable alternative to the for-pay bullshit, and they make sure that nobody can use it except corporations without morals? What the fuck?

I wish they could feel the pain they inflicted, but I get the feeling that they don't take the licensing even half as seriously as they should.


> Where the GPL was vague and exploited, the AGPL clarifies and closes the loophole

It can easily be circumvented. Just create a simple wrapper and publish it. And voila, you can use everything for free again.

All of these FOSS licenses are just beautiful constructs, not related to how the world really works.

A simple question: how do the people building true FOSS libraries make ends meet, if everything is 100% free?

> ethics or enforcement are a distant concern.

The main thing that you are missing is that ethics and capitalism are very distant concerns.

> they make sure that nobody can use it except corporations without morals? What the fuck?

We live in different worlds, man.

The summary of your complaints is that you want to use NC software or artefacts for profit (basically to resell it in some form or another) … and you cannot because you are so moral.

But … just use it not for profit, or pay the authors, if you use it for profit. Simple, right?

> even more so than proprietary software

As an experiment, try to stop using any non-100%-FOSS software (or any software you did not pay for) for a day and report the results.


Well no, the wrapper could count as a derivative work. Even if the wrapper were permissively licensed rather than copyleft, the assembled whole of a product making use of the wrapper would arguably be covered. One of the glaring issues with the GPL is that copyright law imperfectly defines "derivative work"; though this works to the GPL's advantage as the ambiguity ensures the risk is taken by whomever is attempting to make use of the library or tool.

The GPL has been litigated against fairly deep-pocketed companies, and while there hasn't been much in the way of precedent-setting court decisions, there have been multiple victories in the sense that these huge companies settled rather than set that precedent.

The GPL is explicitly constructed to exist in the real world; it was founded on the premise that the infectious nature of copyright was toxic and should be twisted to more equitable means so that software would be shared as it should be, rather than hoarded and exploited.

NC is just another form of hoarding and exploitation.

> The summary of your complaints is that you want to use NC software or artefacts for profit (basically to resell it in some form or another) … and you cannot because you are so moral.

Yes, minus the profit. I want to use the software. Improve it, redistribute it, help people make it work for them. Because unlike you or the author, I actually give a shit about ethics, and I live in the real world where software has external costs to hosting and maintenance, which involve transacting in commerce to integrate or make use of tools. I'm pretty far left and anti-capitalist, but even full-blown communism still involves transacting in commerce!

I can't even look at the fucking source code, goddammit. Using it at all would create a master-servant relationship if I wanted to, for example, take a bounty to implement an open-source feature.

It's so fucking hypocritical and stupid and selfish. They might as well take it closed-source and distribute it for free if they're going to pull this shit. Would be less harmful.


Also, a bit more detailed info here with voice samples:

- https://www.reddit.com/r/MachineLearning/comments/v9rigf/p_s...


"Attribution Non-Commercial Sharealike"

Oops. Non-Commercial. Thus not open source or free software.

I'll look elsewhere.


https://github.com/snakers4/silero-models/tree/941f911858f51...

Looks like it was previously GPL until 2 months ago, you could use the old version


Anyone with more knowledge of the subject matter able to evaluate how much someone would lose by forking from just before this GPL license change. I’m always considering TTS technology for various things but I’m also put off by the non commercial clause.


We already had our issues with local corporations neglecting the license (and being in general disrespectful towards the community), so we had to change it to CC BY-NC-SA to avoid this in future.


If you can't enforce your license how is choosing a different license helping you?

After a quick glance I thought I might use this for a project of mine but NC-SA definitely killed my interest in this. I'm not a local corporation trying to get rich, but "NonCommercial" is an absolute minefield, left vague on purpose[1].

I've seen this sort of licensing on free game assets like Warsow and I think they work there (after all, the main use is to scare people away from re-using them at all) and I really wonder why something like AGPL wouldn't work better in this case.

1. https://wiki.creativecommons.org/wiki/NonCommercial_interpre...


You may consider the Open Data Commons Open Database License (ODbL) used by OpenStreetMap. It will be much more acceptable by the community.

Details: https://wiki.osmfoundation.org/wiki/Licence/About_The_Licenc...


What does CC BY-NC-SA actually mean when applied to the model? Are the .wav files generated by it considered CC BY-NC-SA as well or only software derived from it? Would I be allowed to use the output in say a Youtube video?

PS: This still points to the old AGPL license: https://github.com/snakers4/silero-models/wiki/Licensing-and...


Would you be open to AGPLv3? This mandates allowing the user to replace the code with their own, and requires publishing source code to users even if it's "hidden in the cloud". (well IANAL, so don't take my understanding as a fact)


We used to have AGPLv3 or similar, but we decided to abandon it for the reasons I explained in this (or above) thread.


Your licensing is confused.

Companies do steal software. Universities steal software. Non-profits steal software. Individuals do too. Harvard Medical School is pirating software I wrote. I raised it with them through multiple channels, and they simply didn't respond to emails. It's not worth a law suit against a $40B entity. It doesn't matter what license I used. They stole it.

CC BY-NC-SA guarantees the only entities using your software will be ones who don't mind breaking laws. "Non-commercial" is legally ill-defined, and virtually any use can appear related to commerce in some way. It's a liability hole. If it's being used internally or on a server, you also can't enforce the SA provision.

AGPLv3 is the license you want. No commercial entity working on anything proprietary will realistically touch that with a 10-foot pole, unless they're willing to break laws (but non-commercial use is okay). You can enforce SA, and get changes back. It's designed for exactly this purpose.

The reasons you explained make no sense. Your logic is at the level of: "My computer was getting hot, so I got a new hard drive." "I thought my computer might have a virus, so I swapped out the RAM."


> Companies do steal software.

They do not. They do not care about licenses either.

> Harvard Medical School is pirating software I wrote.

> It's not worth a law suit against a $40B entity.

> It doesn't matter what license I used. They stole it.

> You can enforce SA, and get changes back. It's designed for exactly this purpose.

It can be enforced, yet you cannot enforce it. By your own logic, in real life licenses hardly matter at all.


Licenses matter. There are two types of parties:

- Ones who respect licenses

- Ones who do not

90% of the time, if someone violates my license, and I send a polite email, it is followed from there on. 10% of the time -- as in the Harvard Medical School case -- there's a wilful violation.

For the 90% of parties who do follow licenses, they lay out a sort of constitution or a set of rules everyone in a commons plays by.

For the 10% who don't care, you can enforce them, but it will eat your life. Litigation sucks. Or you can ignore it. I generally do the latter. The most I do is name names in public forums, and only once it's abundantly clear that it's wilful, as I did with Harvard Medical School.


From what I understand of German copyright law, at least, it is equally important to actually have damages (at least that is what a lawyer at a FOSS-specialized company explained to me, based on some ordinary court cases). That means it might be easier if you have a dual license (and have successfully sold it) and can prove that someone did not pay the regular license fee, so you can be eligible for double the fee. I guess if the money at stake is big enough you will find a lawyer. However, the case needs to be clear enough, because the legal fees will go up as well in case of failure.


What I hear is that the affluent "customer" does not respect licenses and you have no resources to enforce your license.

Well ... maybe there is a correlation?


Correlation between what? Between affluence and lack of respect for licenses? A weak one at best. My experience is that there are sleazeballs from all backgrounds, and good people too.

I wouldn't enforce my license against a less affluent party either.

My experience is that I can spend my life fighting the good fight, or having fun. I'd rather have fun. Perhaps that's selfish of me -- Harvard will keep stealing -- but having taken both routes, dealing with crooks is a lot of stress and pain.

Building stuff and dealing with honest people is fun.

I'm excited about the Free Software Conservancy's copyright assignment. It feels like a V0, and I'm not ready to send my code over to them quite yet, but having someone else do the fighting on my behalf (and collect any gains too) would let me focus on having fun, while going after bad players: https://sfconservancy.org/assignment/

Come to think of it, it's an approach you might consider. Depending on your goals, they might be a good fit.


That's a very vague response. Can you be more specific?


Are there any good alternatives? I looked at Tortoise TTS but the performance is too slow. Silero is fast enough but the non-commercial license is a huge turn off. Shame since it seems like it would've been a nice project to contribute to.


Yeah +1. This is impressive and I would have used it in a heartbeat for a new project I'm involved in, but the inability to even consider commercialization means going this route is dead in the water. Will be looking elsewhere.


honest question: why do you consider non-commercial not open source or free software but not attribution or sharealike ?


That has been the most common definition at least for the last 25 years. The Wikipedia article about open source is a good place to start reading. https://en.m.wikipedia.org/wiki/Open-source_license


Historically, not having "non-commercial" was extremely important for Free Software, as it meant people were allowed to distribute tapes, floppies and CD-ROMs for money with the software on it. We would never have had all the Linux distributions without that and nobody would have been able to get the software, as Internet for the masses was still a decade away.

The modern Internet has drastically lowered that barrier and "non-commercial" is slightly less of an issue, as you no longer need a middle man handling the shipping. However servers still don't run for free and non-commercial creates the problem of defining what commercial use actually is. Is having ads on your page enough to make it commercial use? What if the ads aren't under your control, but automatically inserted by the hoster (Instagram, Youtube, Reddit, etc.)? Donations to keep the servers running can count as commercial and all that. It gets really tricky to draw the line and most people will just avoid anything with a "non-commercial" clause.

Meanwhile attribution and sharealike are relatively harmless. Many Free Software licenses have required similar. They tell you exactly what rules you have to follow and don't hinder the distribution or use of it.


https://www.gnu.org/philosophy/free-sw.html#four-freedoms

I'd more generally point to RMS' writings from the nineties and before. They're prophetic. He gets dismissed, but it's almost scary how much he predicted correctly 30-40 years ago, and how closely open source (which started as a push-back against free software) eventually converged to what he wrote back then.


The OSI definition of Open source says "The license shall not restrict any party from selling or giving away the software as a component of an aggregate software distribution containing programs from several different sources. The license shall not require a royalty or other fee for such sale."

NC obviously violates the definition of Open Source. As far as I can tell it is a less well specified version of the Commons Clause license.


Both Open Source (as defined by OSI) and Free Software (as defined by FSF) do require freedom of use, by anyone, for any purpose.


What would it take for me to hear a sample output of this project? I see pages and pages of numbers that do not mean a thing to me in the "quality" section, while I really want to have some kind of idea of what it sounds like.



All we needed was one clip longer than 2 seconds.


You can generate any sentence you want with the provided Colab examples.


How does it compare to Tortoise TTS? If they are comparable, why does Tortoise TTS require a GPU whereas this one does not?


Aside from voice quality (Tortoise is undefeated) and speed (Tortoise is vastly slower), there is a difference in consistency and control.

Tortoise is a probabilistic system that uses one model to randomly generate 80+ different possible outputs and then another model to pick the best one. A consequence of this approach is that sometimes none of the generated outputs are good, the selected 'best' choice is weird and you have to run it again to get something you like.

(That's not necessarily a bad thing. It's actually incredibly cool to be able to say "that's not quite what I want, can you say the same thing but differently?" and have the computer just keep saying it in different ways until you're satisfied. But, it's probably not the right choice for something like Siri where you want consistently decent output 100% of the time, without human review.)

Tortoise also currently lacks any direct control over pronunciation or speaking rate, which is a deal-breaker for many applications.
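The generate-and-rank loop described above can be sketched generically. Note the stand-ins: `synthesize` and `score` here are placeholder callables representing Tortoise's autoregressive generator and its ranking model, not Tortoise's actual API:

```python
import random

def best_of_n(text, synthesize, score, n=80):
    """Generate n candidate outputs and keep the one the ranker scores highest."""
    candidates = [synthesize(text) for _ in range(n)]
    return max(candidates, key=score)

# Toy demonstration: "synthesis" returns the text plus a random noise level,
# and the "ranker" prefers the least noisy candidate.
random.seed(0)
noisy_tts = lambda text: (text, random.random())
best = best_of_n("hello", noisy_tts, score=lambda c: -c[1], n=10)
```

The failure mode the parent describes falls out of this structure: if all n samples happen to be bad, the ranker still has to pick one of them.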


This not only does not require a GPU, but also works on 1-4 CPU threads (!):

- 8 kHz: 1 thread 15-25, 4 threads 30-60

- 24 kHz: 1 thread 10, 4 threads 15-20

- 48 kHz: 1 thread 5, 4 threads 10

the numbers are seconds of audio generated per second
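That metric (seconds of audio per second of wall-clock time) is a real-time factor, and measuring it yourself takes only a few lines. Here `synthesize` is a stand-in for whatever TTS call you use; I have not hard-coded Silero's API:

```python
import time

def realtime_factor(synthesize, text, sample_rate=48000):
    """Seconds of audio generated per second of wall-clock time.

    `synthesize` is assumed to return a sequence of audio samples
    at `sample_rate`.
    """
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    return (len(samples) / sample_rate) / elapsed
```

A value above 1 means faster than real time; the table above reports factors of 5-60 on CPU.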


So, I've been playing with this a little bit now and here are some comments:

- it is very fast and scales quite nicely on CPU with 4 threads (~ twice the speed), but not further (I tried it on a 64 cores box). Not sure why since they seem to be using torch's native threading support.

- surprisingly, it is not that much faster when run on a GPU

- the quality (as in: what I'm hearing, not a formally measured metric) is good but (YMMV) not as good as turtle.

- it breaks with strange error messages if the text you feed it is too long

- there is mention of "a model for text repunctuation and recapitalization", which I wonder if it could be used to break a very long text (eg a book) into pieces that can be digested by the tts engine (link: https://habr.com/ru/post/581960/)

- as mentioned in another post, if you want a simple 'convert_this_text_file' CLI utility, you'll have to roll your own or use https://github.com/Grumbel/silero-test/blob/master/silero-te...

Altogether very nice work, especially the speed.


Many thanks for a detailed and thoughtful comment.

> it is very fast and scales quite nicely on CPU with 4 threads (~ twice the speed), but not further (I tried it on a 64 cores box).

Well, practically it does NOT scale even past 6 threads. 64 cores are just overkill, and most likely it will only hurt performance.

> Not sure why since they seem to be using torch's native threading support.

> surprisingly, it is not that much faster when run on a GPU

Probably for the same reason; you can only speed up the NN so much. Realistically it can still be made 2-3x faster. Also, we have currently abandoned batching, so GPUs are not really required at all.

> the quality (as in: what I'm hearing, not a formally measured metric) is good but (YMMV) not as good as turtle.

I believe the compute required during training and inference … may differ by 3 or 4 orders of magnitude (!).

Also note that some speakers and languages just sound better, due to the high quality of the source material and the amount of work and polish invested.

> it breaks with strange error messages if the text you feed it is too long

Well, there should be a warning somewhere, but it only works with text no longer than 512-1024 symbols.

> there is mention of "a model for text repunctuation and recapitalization", which I wonder if it could be used to break a very long text (eg a book) into pieces that can be digested by the tts engine

This model only restores some punctuation marks and capital letters.

There are libraries like razdel for this - https://github.com/natasha/razdel
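Since the model rejects inputs past roughly 512-1024 symbols, a naive way to feed it a book is to split on sentence boundaries first. A sketch: the regex splitter and the 500-character cap are my assumptions (a library like razdel would do the sentence splitting properly):

```python
import re

def chunk_text(text: str, max_len: int = 500):
    """Split text into chunks no longer than max_len, breaking at
    sentence ends. A single sentence longer than max_len is kept whole."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + 1 + len(s) > max_len:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed to the TTS call separately and the resulting audio concatenated.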


The title is inaccurate according to https://pytorch.org/hub/snakers4_silero-models_tts and the YML file in the repo.

Supported Languages and Formats As of this page update, the speakers of the following languages are supported both in 8 kHz and 16 kHz:

Russian (6 speakers) English (1 speaker) German (1 speaker) Spanish (1 speaker) French (1 speaker)

I don't see how we get 173 voices and 20 languages from that.


HN removes the HTML anchors. The link (which I also copied below in the comments, b/c it was noticed after publishing) should be:

- https://github.com/snakers4/silero-models#text-to-speech

This link leads to the TTS section, which contains the full list of all of the speakers and languages.


Loosely related, but what's the state of the art in natural text to speech ML/AI models?


text to WAV:

You predict mel spectrograms with a transformer architecture (so word embeddings + an attention decoder) and then convert them into audio signals with Parallel WaveGAN or HiFi-GAN.

FastSpeech2 (Microsoft) is extremely good.

WAV to text:

You detect wave shapes with convolutions to generate an embedding, then use attention layers to turn it into an encoding, then convert that to logits. The logits go into a language model, which predicts the final sentence with a beam-search decoder.

wav2vec 2.0 (Facebook) is amazing.


Tortoise-TTS (stylised as "TorToiSe") is pretty amazing: https://github.com/neonbjb/tortoise-tts


Silero TTS works fast even on one CPU thread; this is the point.


Nice to see this here - Silero is also the engine that powers the "dataset builder" for Voice-Cloning-App (https://github.com/BenAAndrew/Voice-Cloning-App), a GUI TTS system that modifies Tacotron2 slightly.

Just sharing the links in case others are new to the space and keen to tinker on some solid open-source offerings.


So, I'm not sure I get the "embarrassingly simple" part.

As in here is a file some_txt.txt, how do I convert it to a .wav ?

As in: what do I type in my shell to convert some_txt.txt to some_txt.wav

That would be something that would deserve the moniker "embarrassingly simple" in my book.


Exactly! Why do so many ML tools not provide a static single-binary CLI? "That" is embarrassingly simple.


It's hard not to see it as researchers doing typical ivory tower gatekeeping. If it were only a few of them that neglected to provide simple tools and installation instructions for common people, I would chalk it up to laziness. But this is the standard way with TTS projects, every time it takes me ten minutes of head scratching until I get the thing to work. It's been this way for years.

To make it twice as frustrating, these new systems never seem to trickle down to FOSS software users; they get incorporated into commercial products but the average linux desktop user with vision problems is still left suffering with Festival. Any such user who wants to use these new models is left to figure out how to integrate these systems themselves.

Anyway, I do appreciate them publishing this. I got it working now and it will suit my needs well; it's the best CPU-based TTS that I've managed to get running thus-far, and I think the quality will be good enough for narrating ebooks. Good enough for now, I'll figure out how to get Firefox using this later.


> what do I type in my shell to convert some_txt.txt to some_txt.wav

Doesn't look like they have any command line tool included, but they have a simple Python example in the README. I used that, added some argparse around it and built a Nix package out of it. In case you are running Nix with flakes enabled, you can just type:

   nix run github:grumbel/silero-test -- -t "Hello World" --speaker en_0 -o /tmp/out.wav
Source is pretty much "embarrassingly simple":

   https://github.com/Grumbel/silero-test/blob/master/silero-test
It currently hard-codes the English language model; other models can be found at https://models.silero.ai/models/tts/

PS: This is probably reinventing the wheel, haven't looked around if there is anything ready to use already.


Super, this is very useful, thank you!


Their Colab examples work pretty much like that.


Colab is running software on other people's computer.

The moment you try to reproduce in local env, you'll be greeted with many "non-existent and unmatched dependency" errors.

Also, "examples" do not cut it.


I am not sure what can be simpler than a one-line invocation + minimal imports.

It is true that the model is based on PyTorch + python, but the majority of complexity (like SSML parsing) is tucked inside of the model.

Theoretically one can make a simplified model without any of those features in plain PyTorch or ONNX, but so far we did not have proper motivation to do so.

As for CLI, this also seems simple enough, but out of scope for us.


In such situations it could be useful to provide a container image or nix or guix shell setup, to make sure people have the dependencies they need.


> I am not sure, what can be more simple than 1 LOC invocation + minimal imports.

Let me make it "embarrassingly simple" for you:

    /bin/bash text_to_speech.sh file.txt file.wav
Also, I'm not entirely sure what "out of scope" means?

Do you mean you run your software on computers that can't run bash?

Do you develop machine learning algorithms on your phone?


> Do you mean you run your software on computers that can't run bash?

It is explicitly stated that PyTorch is the only real requirement. Bash is not required, i.e. models can be run on Windows or ARM with PyTorch.

> Also, I'm not entirely sure what "out of scope" mean?

There was no tangible benefit in making a bash CLI for us.


From my experience with similar projects, it doesn't get any simpler than creating a virtual environment, running requirements.txt and using a simple function to get what you want. Did you have a problem when you tried running that? Colab in this case is just abstracting that part for the user.


Not criticising this project in particular but I frequently find that Colab is just a way for people to get/be very very bad at managing build/deployment of their code. It allows hand rolling a bunch of adjustments to an environment that may only be barely understood and then simply cloning that poorly understood environment.

3/4 times I try to make/rebuild a Colab-based demo from scratch in a suitable non-Colab environment, the setup instructions are varying degrees of wrong: from little mistakes like under-specified requirements that are now broken due to transitive dependency changes, to completely wrong because everything has changed, to the absolute worst version of all, never even written down.

I find Colab is a subtle form of lock-in by providing useful crutches. By leaning on Colab to handle all this hard dependency and environment management stuff, you never need to learn how to do it any better than necessary to function on Colab. To draw a somewhat nasty analogy using terminology from the DevOps world: good dependency and build tools make a folder full of code like cattle, you can blow it away and rebuild it when you want, but Colab lets you raise a pet by hand and then just magically clones it whenever you or someone else needs a copy.


Yes, that has been my exact experience with folks who work within Colab and other Jupyter-like things:

    1. They assume everyone has access to the same environment they do

    2. They often don't understand anything about the infrastructure that's running their stuff

    3. They produce very interesting work (such as this particular TTS work)

    4. They drop 90% of their potential audience within 5 minutes because the bloody thing lives in a weird cloud-only environment, or requires a nightmarish stack of dependencies to run on a local machine, and basically can't be simply integrated into a larger pipeline (e.g. a simple shell script).
My experience has been that getting ML researchers to get their head out of colab's ass and learn to type things like "ls" and "cd" is really hard.


Jupyter has similar issues with bad environments but it’s usually much closer to “didn’t get my dependency versions right” or “this could theoretically run with less junk” and things like that.

Colab is far worse. They say “don’t worry about it, you can just clone” and it’s just been a toxic spill, rotting away the level of understanding in the ML community. Colab lets you basically never put any effort into management of setup, dependencies, or data, and consequently it’s both amazing and fucking horrible the moment you want to avoid using it, because everyone just builds their project “leaning on” the capabilities of Colab… it’s built an entire shanty town of poorly managed ML projects leaning precariously against the supports provided by Colab.

I’m just glad it hasn’t sucked too much air out of Jupyter in the ML community because at least stock Jupyter based tools are easy enough to take apart and reverse engineer since it’s a normal Python ecosystem, no magic Google drive data links, no custom Google tensor unit specific libraries, no push button magic clones of entirely hand crafted environments.


> it doesn't get any simpler than creating a virtual environment, running requirements.txt and using a simple function to get what you want.

Can't tell if serious or sarcasm.


Can you point out any ML project that works any simpler than this? Other than running Colab of course, which I mentioned.


> Can you point out any ML project that works any simpler than this? Other than running Colab of course, which I mentioned.

https://bellard.org/nncp/


That's certainly very nice, but I'm sure you can appreciate the complexity involved here. I would need to compile this and have CUDA properly configured on Linux, or have no CUDA support on Windows. So even your hand-picked example is not that different from the process I just described, which is the standard for ML projects as of today, even for players like Meta or Nvidia.


Usually these projects depend on Python wheels and native binaries. No I haven't tried reproducing this project.

Interesting project btw, kudos to the dev.


Also also, HN cuts the anchors in URLs, so here is the full URL

- https://github.com/snakers4/silero-models#text-to-speech


Based on my limited understanding it seems to be possible to use (not train) these models using C# and the ONNX runtime, without needing any Python or other dependencies installed (you would need to download the models manually). https://github.com/snakers4/silero-models#onnx=

Is this correct?


For STT models - yes. For TTS models - not yet.
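For context, consuming the STT ONNX export from any language boils down to an onnxruntime session plus a decoder over the output logits. A sketch in Python, with a generic greedy CTC-style decode standing in for the repo's own Decoder utility; the model path, input name, and label set are placeholders, not Silero's exact interface:

```python
import numpy as np


def greedy_decode(logits, labels, blank=0):
    """Collapse repeated argmax indices and drop blanks (greedy CTC-style decoding)."""
    ids = logits.argmax(axis=-1)
    out, prev = [], blank
    for i in ids:
        if i != prev and i != blank:
            out.append(labels[i])
        prev = i
    return "".join(out)


def transcribe(model_path, batch, labels):
    # batch: float32 array of shape (batch, samples); needs `pip install onnxruntime`.
    import onnxruntime

    session = onnxruntime.InferenceSession(model_path)
    (logits,) = session.run(None, {"input": batch})
    return [greedy_decode(frame_logits, labels) for frame_logits in logits]
```

The same two steps (run the session, decode the logits) map directly onto the C# onnxruntime bindings.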


Awesome. Is there an active plan to make the TTS models available on ONNX? Or is it something that is possible but is not being worked on right now? Is there an issue / ticket somewhere that I can follow (Or should I create one?)


It is possible, albeit with a significant simplification of the capabilities of the models (i.e. all of the SSML stuff will be left out).

Also ONNX boasts native quantization that just works.

But we are not currently actively working on this. Most likely ONNX will only be available for commercial customers as special builds, but we have not decided on this yet.


Any chance this could be ported and used on Android / iOS? There is a Kaldi port for mobile and it works reasonably well.


These TTS models are not related to Kaldi, they are based off PyTorch and TorchScript.

A simplified version could be made, maybe with ONNX models (or plain Torch JIT) and some outer logic, but we have not done it yet for lack of incentive.


Off-topic: Any piece of advice for a Java-based TTS? (I know and use Mary TTS 5, but maybe there are better ones).


Our model can be simplified to remove all of the Python bits, and made to work with plain PyTorch jit-models or ONNX models (which both have a JAVA API), but we did not invest time in this yet.

Typically, JAVA ~ commercial usage, and commercial users are typically able to pay for a license and / or can use a model behind an API.


Not sure exactly what you mean by "JAVA ~ commercial usage" :)


What is the actual underlying technology powering this? I read several of the blog posts, the Colab notebook, and the GitHub page, but couldn't find anything saying what is actually being used? Sounds pretty great and I'm very curious to know a bit about how it's done!


It is PyTorch


I meant what ML model or variety of ML model. Is it using Tacotron2? Old concatenative TTS? Something else?


Neither of these


Totally random question: would it be possible to run speech-to-text -> text-to-speech in a practical manner for something like free audiobooks to fix the audio for some of the volunteer readers who have less than optimal recording setups?


Possible, sure... But you're going to have errors in transcription that'll be read out if it's fully automated. Might as well just have ebook to TTS and save yourself some time.


I would love for Silero to support Arabic.

As a team that builds standalone smart glasses we often find ourselves paying outrageous license fees to companies for having Arabic and Hebrew TTS engines on our device. Google consistently refuses to add Arabic to their Google TTS engine and hence we'll have to rely on paying out of our pocket for this.



