On the YouTube captions dataset, it says English (US) WER is nearly 15%. On thos...

runnerup · on March 30, 2023

In the paper itself[0], they do have a couple comparisons for English-only (Fig. 2, Table 3), and it looks like Whisper has an 11.5% error rate and USM has a 10.5% error rate. It's a truly negligible difference. There's no way I'd ever pay for an API for this if I knew I only cared about English.
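For anyone who wants to sanity-check numbers like these on their own audio, here's a rough sketch of how word error rate is usually computed: word-level edit distance divided by the number of reference words. The function name `wer` and the whitespace tokenization are my own simplifications, not the paper's evaluation code; real evaluations normalize text (casing, punctuation, numerals) first, and that can move the result by more than a point.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words.

    Assumption: tokens are split on whitespace with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# The two English error rates quoted above, as fractions.
whisper_wer = 0.115
usm_wer = 0.105
absolute_gap = whisper_wer - usm_wer      # one percentage point
relative_gap = absolute_gap / whisper_wer  # roughly 8.7% fewer errors
```

So the gap works out to about one word in a hundred, or roughly 8.7% fewer errors in relative terms.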

I know that's not the point of this model (the point is that for a lot of languages, it's the only model available). But paywalling it seems greedy: you'll only extract money from those under-represented communities. On the other hand, maybe this never would have been built without the profit motive. Idk. I wish we could fund these things as "basic science research" without a need for direct profit. Let positive externalities pay us back down the road.

0: https://arxiv.org/pdf/2303.01037.pdf

UncleEntity · on March 30, 2023

Whisper can also be fine-tuned on other languages. Don’t know how well it’ll do compared to this, but it’s at least a possibility.