
Can you elaborate further? I am not familiar with the field, but their benchmarks here seem to show quality similar to Google: https://github.com/snakers4/silero-models/wiki/Quality-Bench...

The only trick I can see being played is that Google was benchmarked in September 2020, so it has likely improved since and they don't want to show that. Is CommonVoice a better standard to use when comparing these tools?




CommonVoice is roughly the sound quality you'd expect when random people talk into their headset microphones. So that's the kind of audio you need to handle to build a phone system for the general public.

LibriSpeech, on the other hand, is audiobooks read in a quiet setting with a good microphone, and some speakers are even professional narrators. So that's the dataset to compare against for an office worker using dictation with high-quality equipment.

Also, Google is kinda famous for having the worst speech recognition of the enterprise offerings (Microsoft's Azure STT has roughly half the error rate). Plus they tested Google's old model.

But the main point I'm trying to make is that even if it is "as good as Google", a 20% word error rate is still pretty much unusable in practice.
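For context on what a 20% word error rate means: WER is the minimum number of word substitutions, deletions, and insertions needed to turn the recognizer's output into the reference transcript, divided by the reference length. A minimal sketch (toy sentences, not a real benchmark harness):

```python
# Word error rate via word-level edit distance:
# WER = (substitutions + deletions + insertions) / reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match/substitution
    return d[len(ref)][len(hyp)] / len(ref)

# Two substituted words out of four -> WER 0.5
print(wer("the quick brown fox", "the quack brown box"))  # 0.5
```

At 20% WER, on average one word in five comes out wrong, which is why even "Google-level" accuracy can feel unusable for dictation.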


CommonVoice is actually not a very good test set. Its texts are very specific (mostly Wikipedia and the like), and the texts overlap between train and test, which leads most transformer models to overfit. If you test on a variety of domains, you'll see a totally different picture.
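The train/test overlap problem is easy to check for yourself. A hedged sketch (the transcript lists here are hypothetical; a real check would load the actual dataset splits and probably use fuzzier normalization):

```python
# Estimate what fraction of test transcripts also appear verbatim in
# the training split. High overlap inflates scores for models that
# memorize training text rather than generalize.
def transcript_overlap(train: list[str], test: list[str]) -> float:
    # Cheap normalization: lowercase and collapse whitespace.
    norm = lambda s: " ".join(s.lower().split())
    train_set = {norm(s) for s in train}
    shared = sum(1 for s in test if norm(s) in train_set)
    return shared / len(test)

# Toy example: one of the two test sentences leaks from training.
train = ["The cat sat on the mat.", "Wikipedia is an encyclopedia."]
test = ["the cat sat on the mat.", "A totally unseen sentence."]
print(transcript_overlap(train, test))  # 0.5
```

If a meaningful fraction of test sentences leak from the training text, the benchmark rewards memorization, which is the commenter's point about testing across multiple domains instead.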


> Also, Google is kinda famous for having the worst speech recognition of the enterprise offerings.

Not in my experience. I tested basically all commercial speech recognition APIs a number of years ago and Google was significantly ahead of everyone else.

It was some time ago and I haven't tested since, but my casual use of speech recognition systems (e.g. via Alexa or Google Assistant) suggests that it's only gotten better since then.


Google has gotten worse for professional use than it once was, in my opinion. Maybe it's because they're targeting a wider variety of dialects and accents, but that's just a hypothesis. It used to be that if you spoke in a "dictation voice", enunciating clearly and biting your consonants, Google would nail every word except true homophones, but that isn't the case anymore.


Ok, thank you for the clarification.

I see your point, although "as good as Google" qualifies as "enterprise-level" in my book.



