
I suspect Whisper is more robust than other "SOTA" models, but this release is likely still leaving a fair bit of accuracy on the table, given the resources OpenAI can throw at training it.

Comparing results on the readily available test sets from the paper against some of my own robust models (for the Talon models, this is greedy decoding, no language model; all numbers are word error rate, %):

                       Talon  Talon  Talon  Whisper  wav2vec 2.0
                       28M    300M   1B     Large    960h
    librispeech clean   3.21   2.52   2.40   2.7      2.7
    librispeech other   8.21   6.56   5.63   5.6      6.2
    common voice       13.88  11.65   8.86   9.5     29.9
    tedlium             7.51   6.55   5.47   4.0     10.5
I have a battery of more difficult tests on hand (including adversarial tests, and diverse accent-specific metrics). I'll look at running these tests on each of the Whisper model sizes and following up with a larger comparison.
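For anyone who wants to reproduce the Whisper column on their own audio, here's a minimal sketch of the scoring setup, assuming the openai-whisper and jiwer packages; the audio paths and reference transcripts below are placeholders, not real benchmark data:

    # Minimal sketch: score Whisper large with greedy decoding against
    # reference transcripts. temperature=0.0 with beam_size unset disables
    # sampling and beam search, i.e. greedy decoding.
    import whisper
    import jiwer

    model = whisper.load_model("large")

    pairs = [  # hypothetical (audio file, reference transcript) pairs
        ("clips/sample1.wav", "the quick brown fox jumps over the lazy dog"),
        ("clips/sample2.wav", "speech recognition is not a solved problem"),
    ]

    refs, hyps = [], []
    for path, ref in pairs:
        result = model.transcribe(path, temperature=0.0, language="en")
        refs.append(ref)
        # real comparisons need identical text normalization on both sides
        # (casing, punctuation, numbers); lowercasing is just a stand-in
        hyps.append(result["text"].strip().lower())

    print(f"WER: {jiwer.wer(refs, hyps):.2%}")

The interesting part for robustness is running that same loop over the accented and adversarial sets, not just the clean benchmarks.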



I'm looking forward to your comparison. It's really hard to make sense of how good this model actually is without being an expert in the area.



Talon was the first thing that came to my mind when I saw this news. Would be nice if it could benefit from Whisper. (Big fan of your work on Talon!)


It's interesting that they compare against wav2vec 2.0 instead of the NeMo Conformer (which is more accurate) in Table 2.
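For reference, getting a NeMo Conformer hypothesis for a side-by-side is only a few lines; a rough sketch, assuming the nemo_toolkit[asr] package, with a placeholder audio file:

    # Rough sketch: transcribe with a pretrained NeMo Conformer-CTC model.
    # "sample.wav" is a placeholder; NeMo expects 16 kHz mono audio.
    import nemo.collections.asr as nemo_asr

    model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained(
        model_name="stt_en_conformer_ctc_large"
    )
    transcripts = model.transcribe(["sample.wav"])
    print(transcripts[0])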


Indeed interesting.

On that note, a core Nvidia NeMo developer I follow posted this: https://twitter.com/HaseoX94/status/1572748653189791745

He calls it a "T5 for ASR" paper :) There are more insights in the thread, have a look! Curious to see what your blog post turns up as well!



