On the YouTube captions dataset, it says English (US) WER is nearly 15%, and across those 18 languages it's nearly 20%. How does that match up with the claim that 32% relative works out to one error per thousand or so?
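For what it's worth, here's a quick back-of-envelope sketch, assuming the 32% relative figure is meant to apply against the ~15% en-US baseline (that assumption is mine, not from the paper):

    # Converting a relative WER reduction into an absolute one.
    baseline = 0.15                      # ~15% en-US WER on YouTube captions
    rel_reduction = 0.32                 # the quoted "32% relative" figure

    new_wer = baseline * (1 - rel_reduction)
    print(f"{new_wer:.1%}")              # 10.2% WER after the reduction
    print(f"{baseline - new_wer:.3f}")   # 0.048 -> ~4.8 points absolute,
                                         #   i.e. ~48 fewer errors per
                                         #   1,000 words, not one

So under that reading, "one per thousand" doesn't seem to follow from those numbers.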
In the paper itself [0], they do have a couple of English-only comparisons (Fig. 2, Table 3): Whisper comes in at an 11.5% error rate and USM at 10.5%. That's one percentage point absolute, a truly negligible difference. There's no way I'd ever pay for an API for this if I knew I only cared about English.
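Same arithmetic in the other direction, using those English-only numbers, to see what relative gap the one-point difference implies:

    whisper, usm = 0.115, 0.105               # English-only WERs from Fig. 2 / Table 3
    print(f"{(whisper - usm) / whisper:.1%}") # 8.7% relative improvement,
                                              #   one point absolute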
I know that's not the point of this model (the point is that, for a lot of languages, it's the only model available). But paywalling it seems greedy: you'll only be extracting money from those under-represented communities. On the other hand, maybe it never would have been built without the profit motive. I don't know. I wish we could fund these things as "basic science research" without a need for direct profit, and let the positive externalities pay us back down the road.