> About a third of Whisper’s audio dataset is non-English, and it is alternately given the task of transcribing in the original language or translating to English. We find this approach is particularly effective at learning speech to text translation and outperforms the supervised SOTA on CoVoST2 to English translation zero-shot.
That's intriguing. You can just set the model to translate everything into English, no matter which language the speaker is using, and it just works. Given that many people are much better at understanding English than at speaking it, this could make voice interfaces far more accessible with little extra work.
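As a minimal sketch of that setting, the open-source `openai-whisper` command-line tool exposes this as a `--task` flag; here `speech.wav` is a placeholder filename for any non-English recording:

```shell
# Transcribe AND translate into English in one pass, regardless of source language.
# Assumes `pip install openai-whisper` and ffmpeg are available; the model
# weights are downloaded on first use.
whisper speech.wav --task translate --model small
```

The same switch exists in the Python API as `model.transcribe("speech.wav", task="translate")`; the default task, `transcribe`, instead keeps the output in the original language.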