I suspect Whisper is more robust than other "SOTA" models, but this release is likely leaving a fair bit of accuracy on the table considering the amount of resources OpenAI is capable of throwing at training it.
Comparing results on the readily available test sets from the paper against some of my personal robust models (numbers are word error rates; for the Talon models, this is greedy decoding with no language model):
                       Talon   Talon   Talon   Whisper   wav2vec 2.0
                       28M     300M    1B      Large     960h
    librispeech clean   3.21    2.52    2.40    2.7        2.7
    librispeech other   8.21    6.56    5.63    5.6        6.2
    common voice       13.88   11.65    8.86    9.5       29.9
    tedlium             7.51    6.55    5.47    4.0       10.5
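For anyone unfamiliar with the metric: word error rate is (substitutions + insertions + deletions) divided by the number of reference words. A minimal sketch of that computation (my own illustrative helper, not the exact scoring script used for the table above):

    # Word error rate via word-level Levenshtein distance.
    def wer(reference: str, hypothesis: str) -> float:
        ref = reference.split()
        hyp = hypothesis.split()
        # Dynamic programming edit-distance table over words.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[len(ref)][len(hyp)] / len(ref)

    print(wer("the cat sat on the mat", "the cat sat on mat"))  # ~0.167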
I have a battery of more difficult tests on hand (including adversarial tests and diverse accent-specific metrics). I'll look at running these tests on each of the Whisper model sizes and following up with a larger comparison.
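A rough sketch of what that per-model-size loop might look like with the openai-whisper package (the audio paths and reference transcripts are placeholders, jiwer is just one common WER implementation, and the text normalization here is deliberately simplified):

    # Score each Whisper model size on a small (path, reference) test set.
    import whisper
    import jiwer

    test_set = [
        ("audio/sample_001.wav", "the quick brown fox jumps over the lazy dog"),
        # ... more (audio path, reference transcript) pairs
    ]

    for size in ["tiny", "base", "small", "medium", "large"]:
        model = whisper.load_model(size)
        references, hypotheses = [], []
        for path, reference in test_set:
            result = model.transcribe(path)
            hypotheses.append(result["text"].lower().strip())
            references.append(reference)
        print(size, "WER:", jiwer.wer(references, hypotheses))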