
"Enterprise-grade STT" and then a 19% word error rate on CommonVoice?

Scribosermo from 2020 was at 7% error rate. State of the art is around 3%.

Let me just illustrate fuck you how eerie ta thing and annoying it is if the eh aye gets every force word wrong.

Let me just illustrate for you how irritating and annoying it is if the AI gets every fourth word wrong.

(20 words, 4 mistakes => 20% word error rate)
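
(For reference, WER is usually computed as a word-level edit distance divided by the length of the reference; here's a quick self-contained sketch, not any particular toolkit's implementation. Note that an edit distance also charges inserted fragments, so on messy output it can come out a bit higher than an informal count of "mistakes".)

    def wer(reference: str, hypothesis: str) -> float:
        ref = reference.lower().split()
        hyp = hypothesis.lower().split()
        # d[i][j] = word-level edit distance between ref[:i] and hyp[:j]
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,          # deletion
                              d[i][j - 1] + 1,          # insertion
                              d[i - 1][j - 1] + cost)   # substitution or match
        return d[-1][-1] / len(ref)

    ref = "let me just illustrate for you how irritating and annoying it is if the ai gets every fourth word wrong"
    hyp = "let me just illustrate fuck you how eerie ta thing and annoying it is if the eh aye gets every force word wrong"
    print(f"WER: {wer(ref, hyp):.0%}")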

EDIT: Just to clarify, I'm only criticizing their "speech to text" quality. Their "text to speech" quality is top notch and close to state of the art.




Can you elaborate further? I am not familiar with the field, but their benchmarks here seem to show quality similar to Google: https://github.com/snakers4/silero-models/wiki/Quality-Bench...

The only trick I can see being played is that Google was benchmarked in September 2020, so it has likely improved since then and they don't want to show that. Is CommonVoice a better standard to use when comparing these tools?


CommonVoice is roughly the sound quality you'd expect if random people talked into their headset microphones. So that's the kind of quality you need to handle if you're building a phone system for the general public.

LibriSpeech, on the other hand, is audiobooks read in a quiet setting with a good microphone, and some speakers are even professional narrators. So that's the dataset to compare against for an office worker doing dictation with high-quality equipment.

Also, Google is kinda famous for having the worst speech recognition of the enterprise offerings (Microsoft's Azure STT has roughly half the error rate). Plus, they tested Google's old model.

But the main point I'm trying to make is that even if it is "as good as Google", a 20% word error rate is still pretty much unusable in practice.


CommonVoice is actually not a very good test set. Its texts are very specific (mostly Wikipedia and such), and the texts overlap between train and test, which leads to overfitting in most transformer models. If you test on a variety of domains, you'll see a totally different picture.
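
For what it's worth, this is easy to check on a CommonVoice download; rough sketch, assuming the usual train.tsv/test.tsv layout where the prompt text sits in a "sentence" column (the paths are placeholders):

    import pandas as pd

    # Fraction of test sentences that also appear verbatim in the train split
    # of a CommonVoice release. Adjust the paths to your local download.
    train = pd.read_csv("cv-corpus/en/train.tsv", sep="\t")
    test = pd.read_csv("cv-corpus/en/test.tsv", sep="\t")

    train_sentences = set(train["sentence"].str.strip().str.lower())
    test_sentences = test["sentence"].str.strip().str.lower()

    overlap = test_sentences.isin(train_sentences).mean()
    print(f"Test sentences that also appear in train: {overlap:.1%}")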


> Also, Google is kinda famous for having the worst speech recognition of the enterprise offerings.

Not in my experience. I tested basically all commercial speech recognition APIs a number of years ago and Google was significantly ahead of everyone else.

It was some time ago and I haven't tested since, but my casual use of speech recognition systems (e.g. via Alexa or Google Assistant) suggests that it's only gotten better since then.


Google's gotten worse for professional use than it once was, in my opinion. Maybe it's because they're targeting a wider variety of dialects and accents, but that's just a hypothesis. It used to be that if you spoke in the "dictation voice", where you enunciated clearly and bit your consonants, Google would nail every word except true homophones, but that isn't the case anymore.


Ok, thank you for the clarification.

I see your point, although "as good as Google" qualifies as "enterprise-level" in my book.


I was curious and poked at the TTS Colab to switch it from Russian to English (language = 'en', model_id = 'v3_en' under "V3", speaker = 'en_XXX' under "Text"). Short "this is a test" samples sound great, so I then tried feeding it the nearest bit of interesting conversational text I had to hand - this thread.

Here's your comment through the two apparently-most-developed models:

en_116: https://vocaroo.com/1axxFRHCs4YF

en_117: https://vocaroo.com/1983M4jVGMdR

Uuhhhh. It has a bit of a way to go to get to where Google et al are at, IMHO. It sounds vaguely like someone put DeepDream and GPT-3 into a blender and selected the "Transmutate into TTS model" option. On the one hand it's undeniably up there in terms of not sounding like the previous generation of TTS, buuuut yeah it has a bit of a way to go.

To be clear, it should take just about anyone under a minute to switch the Colab to English; this is just for whoever doesn't feel like fiddling (and/or is on mobile).
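
And if you'd rather skip the Colab entirely, the equivalent local snippet (adapted from the usage example in the silero-models repo; the speaker name and sample rate here are just my picks, so adjust to taste) looks something like:

    import torch
    import torchaudio

    # Same settings as the Colab tweak above: language 'en', model 'v3_en'.
    model, example_text = torch.hub.load(repo_or_dir='snakers4/silero-models',
                                         model='silero_tts',
                                         language='en',
                                         speaker='v3_en')
    model.to(torch.device('cpu'))

    audio = model.apply_tts(text="This is a test.",
                            speaker='en_116',      # one of the voices linked above
                            sample_rate=48000)

    # apply_tts returns a 1-D float tensor; write it out as a wav to listen.
    torchaudio.save('test.wav', audio.unsqueeze(0), 48000)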


I remember we had systems with a 15% error rate successfully deployed in multiple solutions. Yes, looking at the number you would think it's really bad, but in practice errors are more likely to happen on short words (such as "for", for example), those words carry less meaning, and the error would be "four" rather than "fuck". And we were working with studio-quality recordings XD


For STT you should just use Vosk, right?



