
"Enterprise-grade STT" and then a 19% word error rate on CommonVoice?

Scribosermo from 2020 was at 7% error rate. State of the art is around 3%.

Let me just illustrate fuck you how eerie ta thing and annoying it is if the eh aye gets every force word wrong.

Let me just illustrate for you how irritating and annoying it is if the AI gets every fourth word wrong.

(20 words, 4 mistakes => 20% word error rate)
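
(For reference, WER is usually computed as a word-level edit distance divided by the length of the reference; here's a quick self-contained sketch, not any particular toolkit's implementation. Note that an edit distance also charges inserted fragments, so on messy output it can come out a bit higher than an informal count of "mistakes".)

    def wer(reference: str, hypothesis: str) -> float:
        ref = reference.lower().split()
        hyp = hypothesis.lower().split()
        # d[i][j] = word-level edit distance between ref[:i] and hyp[:j]
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,          # deletion
                              d[i][j - 1] + 1,          # insertion
                              d[i - 1][j - 1] + cost)   # substitution or match
        return d[-1][-1] / len(ref)

    ref = "let me just illustrate for you how irritating and annoying it is if the ai gets every fourth word wrong"
    hyp = "let me just illustrate fuck you how eerie ta thing and annoying it is if the eh aye gets every force word wrong"
    print(f"WER: {wer(ref, hyp):.0%}")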

EDIT: Just to clarify, I'm only criticizing their "speech to text" quality. Their "text to speech" quality is top notch and close to state of the art.




Can you elaborate further? I am not familiar with the field, but their benchmarks here seem to show quality similar to Google: https://github.com/snakers4/silero-models/wiki/Quality-Bench...

The only trick I can see being played is that Google was benchmarked in September 2020, so it has likely improved since then and they don't want to show that. Is CommonVoice a better standard to use when comparing these tools?


CommonVoice is roughly the sound quality you'd expect if random people talked into their headset microphones. So that's the kind of quality you need to handle if you're building a phone system for the general public.

LibriSpeech, on the other hand, is audiobooks read in a quiet setting with a good microphone, and some speakers are even professional narrators. So that's the dataset to compare against for an office worker doing dictation with high-quality equipment.

Also, Google is kinda famous for having the worst speech recognition of the enterprise offerings (Microsoft's Azure STT has roughly half the error rate). Plus, they tested Google's old model.

But the main point I'm trying to make is that even if it is "as good as Google", a 20% word error rate is still pretty much unusable in practice.


CommonVoice is actually not a very good test set. Its texts are very specific (mostly Wikipedia and such), and the texts overlap between train and test, which leads to overfitting in most transformer models. If you test on a variety of domains, you'll see a totally different picture.
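
For what it's worth, this is easy to check on a CommonVoice download; rough sketch, assuming the usual train.tsv/test.tsv layout where the prompt text sits in a "sentence" column (the paths are placeholders):

    import pandas as pd

    # Fraction of test sentences that also appear verbatim in the train split
    # of a CommonVoice release. Adjust the paths to your local download.
    train = pd.read_csv("cv-corpus/en/train.tsv", sep="\t")
    test = pd.read_csv("cv-corpus/en/test.tsv", sep="\t")

    train_sentences = set(train["sentence"].str.strip().str.lower())
    test_sentences = test["sentence"].str.strip().str.lower()

    overlap = test_sentences.isin(train_sentences).mean()
    print(f"Test sentences that also appear in train: {overlap:.1%}")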


> Also, Google is kinda famous for having the worst speech recognition of the enterprise offerings.

Not in my experience. I tested basically all commercial speech recognition APIs a number of years ago and Google was significantly ahead of everyone else.

It was some time ago and I haven't tested since, but my casual use of speech recognition systems (e.g. via Alexa or Google Assistant) suggests that it's only gotten better since then.


Google's gotten worse for professional use than it once was, in my opinion. Maybe it's because they're targeting a wider variety of dialects and accents, but that's just a hypothesis. It used to be that if you spoke in the "dictation voice", where you enunciated clearly and bit your consonants, Google would nail every word except true homophones, but that isn't the case anymore.


Ok, thank you for the clarification.

I see your point, although "as good as Google" qualifies as "enterprise-level" in my book.


I was curious and poked at the TTS Colab to switch it from Russian to English (language = 'en', model_id = 'v3_en' under "V3", speaker = 'en_XXX' under "Text"). Short "this is a test" samples sound great, so I then tried feeding it the nearest bit of interesting conversational text I had to hand - this thread.

Here's your comment through the two apparently-most-developed models:

en_116: https://vocaroo.com/1axxFRHCs4YF

en_117: https://vocaroo.com/1983M4jVGMdR

Uuhhhh. It has a bit of a way to go to get to where Google et al are at, IMHO. It sounds vaguely like someone put DeepDream and GPT-3 into a blender and selected the "Transmutate into TTS model" option. On the one hand it's undeniably up there in terms of not sounding like the previous generation of TTS, buuuut yeah it has a bit of a way to go.

To be clear, it should take just about anyone under a minute to switch the Colab to English; this is just for whoever doesn't feel like fiddling (and/or is on mobile).
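
And if you'd rather skip the Colab entirely, the equivalent local snippet (adapted from the usage example in the silero-models repo; the speaker name and sample rate here are just my picks, so adjust to taste) looks something like:

    import torch
    import torchaudio

    # Same settings as the Colab tweak above: language 'en', model 'v3_en'.
    model, example_text = torch.hub.load(repo_or_dir='snakers4/silero-models',
                                         model='silero_tts',
                                         language='en',
                                         speaker='v3_en')
    model.to(torch.device('cpu'))

    audio = model.apply_tts(text="This is a test.",
                            speaker='en_116',      # one of the voices linked above
                            sample_rate=48000)

    # apply_tts returns a 1-D float tensor; write it out as a wav to listen.
    torchaudio.save('test.wav', audio.unsqueeze(0), 48000)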


I remember we had systems with a 15% error rate successfully deployed in multiple solutions. Yes, looking at the number you would think it's really bad, but in practice errors are more likely to happen on short words (such as "for", for example), those words carry less meaning, and the error would be "four" rather than "fuck". And we were working with studio-quality recordings XD


For STT you should just use Vosk, right?



