
On the YouTube captions dataset, it says English (US) WER is nearly 15%. On those 18 languages, it's nearly 20%. How does that match up with 32% relative being one word per thousand or so?
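For anyone who wants to check the arithmetic, here is a rough sketch. It assumes the ~15% figure can be treated as the baseline WER that the 32% relative reduction applies to, which the post doesn't actually confirm; the function names are made up for illustration.

```python
def absolute_reduction(baseline_wer, relative_reduction):
    """Absolute WER drop implied by a relative reduction from a baseline."""
    return baseline_wer * relative_reduction


def implied_baseline(absolute_drop, relative_reduction):
    """Baseline WER needed for a relative reduction to equal a given absolute drop."""
    return absolute_drop / relative_reduction


# Assumption: 15% WER is the baseline, 32% is the relative reduction.
drop = absolute_reduction(0.15, 0.32)
print(f"absolute drop: {drop * 1000:.0f} words per thousand")

# What baseline would make 32% relative equal one word per thousand?
baseline = implied_baseline(0.001, 0.32)
print(f"implied baseline WER: {baseline:.2%}")
```

Under that assumption, 32% relative works out to tens of words per thousand, and it would only shrink to one per thousand if the baseline WER were around 0.3%.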



In the paper itself[0], they do have a couple comparisons for English-only (Fig. 2, Table 3), and it looks like Whisper has an 11.5% error rate and USM has a 10.5% error rate. It's a truly negligible difference. There's no way I'd ever pay for an API for this if I knew I only cared about English.

I know that's not the point of this model (the point is that for a lot of languages, it's the only model available). But paywalling it seems greedy: you'll only extract money from those under-represented communities. On the other hand, maybe this never would have been built without the profit motive. Idk. I wish we could fund these things as "basic science research" without a need for direct profit. Let positive externalities pay us back down the road.

0: https://arxiv.org/pdf/2303.01037.pdf
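To put a number on "negligible", a quick sketch using the two English-only error rates quoted above. Treating Whisper as the baseline is my assumption, and `wer_gap` is just a hypothetical helper name.

```python
def wer_gap(baseline_wer, new_wer):
    """Return the (absolute, relative) improvement of new_wer over baseline_wer."""
    absolute = baseline_wer - new_wer
    relative = absolute / baseline_wer
    return absolute, relative


# Figures as quoted in the comment: Whisper 11.5%, USM 10.5% (English-only).
absolute, relative = wer_gap(11.5, 10.5)
print(f"absolute: {absolute:.1f} points, relative: {relative:.1%}")
```

That is one percentage point absolute, or a bit under 9% relative, so roughly ten fewer errors per thousand words.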


Whisper can also be fine-tuned on other languages. Don’t know how well it’ll do compared to this, but it’s at least a possibility.



