
Can you elaborate further? I am not familiar with the field, but their benchmarks here seem to show quality similar to Google: https://github.com/snakers4/silero-models/wiki/Quality-Bench...

The only trick I can see being played is that Google was benchmarked in September 2020, so it has likely improved since and they don't want to show that. Is CommonVoice a better standard to use when comparing these tools?




CommonVoice is roughly the sound quality you'd expect when random people talk into their headset microphones. So that's the kind of audio you need to handle to build a phone system for the general public.

LibriSpeech, on the other hand, is audiobooks read in a quiet setting with a good microphone, and some speakers are even professional narrators. So that's the dataset to compare against for an office worker using dictation with high-quality equipment.

Also, Google is kinda famous for having the worst speech recognition of the enterprise offerings (Microsoft's Azure STT has roughly half the error rate). Plus they tested Google's old model.

But the main point I'm trying to make is that even if it is "as good as Google", a 20% word error rate is still pretty much unusable in practice.
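For context on what a 20% word error rate means: WER is the minimum number of word substitutions, deletions, and insertions needed to turn the recognizer's output into the reference transcript, divided by the reference length. A minimal sketch (toy sentences, not a real benchmark harness):

```python
# Word error rate via word-level edit distance:
# WER = (substitutions + deletions + insertions) / reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match/substitution
    return d[len(ref)][len(hyp)] / len(ref)

# Two substituted words out of four -> WER 0.5
print(wer("the quick brown fox", "the quack brown box"))  # 0.5
```

At 20% WER, on average one word in five comes out wrong, which is why even "Google-level" accuracy can feel unusable for dictation.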


CommonVoice is actually not a very good test set. Its texts are very specific (mostly Wikipedia and the like), and the texts overlap between train and test, which leads most transformer models to overfit. If you test on a variety of domains, you'll see a totally different picture.
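The train/test overlap problem is easy to check for yourself. A hedged sketch (the transcript lists here are hypothetical; a real check would load the actual dataset splits and probably use fuzzier normalization):

```python
# Estimate what fraction of test transcripts also appear verbatim in
# the training split. High overlap inflates scores for models that
# memorize training text rather than generalize.
def transcript_overlap(train: list[str], test: list[str]) -> float:
    # Cheap normalization: lowercase and collapse whitespace.
    norm = lambda s: " ".join(s.lower().split())
    train_set = {norm(s) for s in train}
    shared = sum(1 for s in test if norm(s) in train_set)
    return shared / len(test)

# Toy example: one of the two test sentences leaks from training.
train = ["The cat sat on the mat.", "Wikipedia is an encyclopedia."]
test = ["the cat sat on the mat.", "A totally unseen sentence."]
print(transcript_overlap(train, test))  # 0.5
```

If a meaningful fraction of test sentences leak from the training text, the benchmark rewards memorization, which is the commenter's point about testing across multiple domains instead.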


> Also, Google is kinda famous for having the worst speech recognition of the enterprise offerings.

Not in my experience. I tested basically all commercial speech recognition APIs a number of years ago and Google was significantly ahead of everyone else.

It was some time ago and I haven't tested since, but my casual use of speech recognition systems (e.g. via Alexa or Google Assistant) suggests that it's only gotten better since then.


Google has gotten worse for professional use than it once was, in my opinion. Maybe it's because they're targeting a wider variety of dialects and accents, but that's just a hypothesis. It used to be that if you spoke in a "dictation voice", enunciating clearly and biting your consonants, Google would nail every word except true homophones, but that isn't the case anymore.


Ok, thank you for the clarification.

I see your point, although "as good as Google" qualifies as "enterprise-level" in my book.



