How is it that Apple, Google, or Microsoft are not further ahead of the game on speech recognition like this? They have the resources to hire the best ML researchers and throw tons of computing hours at it, yet Siri, Google, and Cortana continue to struggle to get anywhere near this level of comprehension.
Siri and Cortana have to run at least in real time, with reasonable compute resources. Probably faster than real time when the audio gets shipped off to the cloud and transcribed there. This model can't do that (at least not in the "large" version, which is what the examples use).
Also, you are comparing Whisper's highlight reel with everyday performance of other models. Nobody shows their weaknesses in their highlight reel.
Someone else in this thread[0] said Whisper was running at 17x real time for them. So, even a weak machine might be able to do an acceptable approximation of real time with Whisper.
Also, I feel like shipping to the cloud and back has been shown to be just as fast as on device transcription in a lot of scenarios. Doing it on device is primarily a benefit for privacy and offline, not necessarily latency. (Although, increasingly powerful smartphone hardware is starting to give the latency edge to local processing.)
Siri's dictation has had such terrible accuracy for me (an American English speaker without a particularly strong regional accent) and everyone else I know for so many years that it is just a joke in my family. Google and Microsoft have much higher accuracy in their models. The bar is so low for Siri that I automatically wonder how much Whisper is beating Siri in accuracy... because I assume it has to be better than that.
I really wish there were an easy demo for Whisper that I could try out.
“CPU” isn’t necessarily the benchmark, though. Most smartphones going back years have ML inference accelerators built in, and both Intel and AMD are starting to build in instructions to accelerate inference. Apple’s M1 and M2 have the same inference accelerator hardware as their phones and tablets. The question is whether this model is a good fit for those inference accelerators, and how well it works there, or how well it works running on the integrated GPUs these devices all have.
Brute forcing the model with just traditional CPU instructions is fine, but… obviously going to be pretty slow.
I have no experience on the accuracy of Talon, but I’ve heard that most open source models are basically overfit to the test datasets… so their posted accuracy is often misleading. If Whisper is substantially better in the real world, that’s the important thing, but I have no idea if that’s the case.
Ok, my test harness is ready (rough sketch of it below, after the numbers). My A40 box will be busy until later tonight, but on an NVIDIA A2 [1], this is the batchsize=1 throughput I'm seeing. Common Voice, default Whisper settings, card is staying at 97-100% utilization:
tiny.en: ~18 sec/sec
base.en: ~14 sec/sec
small.en: ~6 sec/sec
medium.en: ~2.2 sec/sec
large: ~1.0 sec/sec (fairly wide variance when ramping up as this is slow to process individual clips)
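For clarity, "sec/sec" means seconds of audio transcribed per second of wall clock, and the harness boils down to timing transcribe() over the clips. A simplified sketch (the clip path is a placeholder for the Common Voice download):

    import glob
    import time

    import whisper  # the openai-whisper package

    # Minimal version of the measurement: batch size 1, default settings.
    # "sec/sec" = seconds of audio transcribed per second of wall-clock time.
    model = whisper.load_model("tiny.en")

    audio_secs, wall_secs = 0.0, 0.0
    for path in glob.glob("common_voice/clips/*.mp3"):  # placeholder path
        audio = whisper.load_audio(path)                # 16 kHz float32 mono
        audio_secs += len(audio) / whisper.audio.SAMPLE_RATE
        t0 = time.time()
        model.transcribe(audio)
        wall_secs += time.time() - t0

    print(f"throughput: {audio_secs / wall_secs:.1f} sec/sec")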
Isn’t the A2 much weaker than a 3090? So those results are promising.
EDIT: for what it's worth, Nvidia rated the A2 at 18 TFLOPS of FP16, and Apple rates the current A16 Neural Engine at 17 TFLOPS of FP16. I'm sure it's not an "apples to apples" comparison.
If you count the GPU component and memory bandwidth, the Apple M2 is slightly weaker on paper for 16-bit inference than the NVIDIA A2, if you manage to use the whole chip efficiently. The A16 is then slightly weaker than the M2.
Sure, the Whisper Tiny model is probably going to be fast enough, but from my preliminary results I'm not sure it will be any better than other models that are much, much faster in this power class.
Whisper Large looks pretty cool, but it seems much harder to run in any meaningful realtime fashion. It's likely pretty useful for batch transcription though.
Even if you hit a realtime factor of 1x, the model can leverage up to 30 seconds of future audio context. So at 1x, if you speak for 10 seconds, you'll potentially need to wait another 10 seconds to use the result. This kind of latency is generally unsatisfying.
EDIT: After writing and posting the original version of this comment, I did an experiment where I dictated it to Siri, and then saved that audio (which was recorded simultaneously), which I then fed to both Whisper's tiny.en and medium.en... Siri did terribly for me. Whisper tiny.en was 100% accurate, as far as I can tell, and the only thing Whisper medium.en did was add a few commas that tiny.en had missed. I actually ended up playing the audio file for Siri as well, and that did not end well either. YMMV, but even the tiny model seems very useful. tiny.en took 17.5 seconds to process the ~1 minute audio file, and medium.en took 351 seconds, but I think there is a lot of room for performance optimization on this M2 MBA. The model evaluation was purely using the CPU, not GPU or neural engine, and it wasn't even using all of the CPU cores for whatever reason.
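For what it's worth, the comparison itself was nothing fancy; roughly this (sketch, with the filename as a placeholder for my recording):

    import time
    import whisper  # the openai-whisper package

    AUDIO = "siri_dictation_test.m4a"  # placeholder for the ~1 minute recording

    for name in ("tiny.en", "medium.en"):
        model = whisper.load_model(name)  # ran on CPU only on this M2 MBA
        t0 = time.time()
        text = model.transcribe(AUDIO)["text"]
        print(f"{name}: {time.time() - t0:.1f}s\n{text}\n")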
----
With Siri dictation, I feel like I usually spend at least as much time correcting its mistakes as I do speaking the dictation itself. In some cases, that is still faster/easier than typing, but I would rather have a voice model that can work in about the same total amount of time without requiring constant corrections. If I speak for 30 seconds, then I can do other things for 30 seconds while my phone processes it… that might actually be preferable if it gets it right. Otherwise, I’ll be spending 30 seconds actively editing it anyways. Even an improvement on the number of edits required per dictation would be nice. Admittedly, I feel like Google and Microsoft already do a much better job here.
It could be interesting to use the tiny model to give a preview of the writing while the large model is taking its time, and then allow the user to tap on words that changed to see the predictions from the tiny model and correct back to them if they want. I was doing some experiments a few minutes ago, and on one audio clip, the tiny model wrote down a very literal interpretation of an uncommon sci-fi word, and that was more accurate than either the medium or the large models. The rest of the time, the larger models did better, as expected.
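Roughly what I have in mind, as a sketch (the model names and the word-level diff are just illustrative; obviously untested as a UI):

    import difflib
    import whisper

    AUDIO = "clip.wav"  # placeholder

    preview = whisper.load_model("tiny.en").transcribe(AUDIO)["text"].split()
    final = whisper.load_model("large").transcribe(AUDIO)["text"].split()

    # Words where the large model diverges from the tiny preview are the ones a
    # UI could highlight and let the user tap to revert to the preview reading.
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(a=preview, b=final).get_opcodes():
        if op != "equal":
            print(" ".join(preview[i1:i2]), "->", " ".join(final[j1:j2]))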
But, I don’t know. This is interesting to me, but I agree there could be issues with making it workable for real time transcription.
See https://news.ycombinator.com/item?id=32929029 re accuracy, I'm working on a wider comparison. My models are generally more robust than open-source models such as Vosk and Silero, but I'm definitely interested in how my stuff compares to Whisper on difficult held-out data.
> Brute forcing the model with just traditional CPU instructions is fine, but… obviously going to be pretty slow.
It's not that simple. Many of the mobile ML accelerators are more targeted for conv net image workloads, and current-gen Intel and Apple CPUs have dedicated hardware to accelerate matrix math (which helps quite a bit here, and these instructions were in use in my tests).
Also, not sure which model they were using at 17x realtime on the 3090. (If it's one of the smaller models, that bodes even worse for non-3090 performance.) The 3090 is one of the fastest ML inference chips in the world, so it doesn't necessarily set realistic expectations.
There are also plenty of optimizations that aren't applied to the code we're testing, but I think it's fairly safe to say the Large model is likely to be slow on anything but a desktop-gpu-class accelerator just due to the sheer parameter size.
Good point about whether it runs in realtime or not; however, with ML I have found that weaknesses get addressed pretty fast by someone. There is a big step between proof of concept and practical application, though, so we shall see.
This AI has a 30 second delay on the audio processing because it needs to be able to "look into the future" to get these good results. That 30s delay would be unacceptable for Siri/Google/Cortana.
A lot of models we currently use seem to do the same thing. The model will transcribe a "best effort" interpretation in real time, then as you continue speaking, you'll see it go back and make corrections. I'm sure you can feed the first X seconds you have into the model, followed by (30-X) seconds of silence, and it will do real time transcription just fine... it would be weird if this broke anything. Then, as you get more speech, you keep getting better transcription of the first 30 seconds, and after that you switch to a 30 second sliding window.
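Roughly what I mean, as a sketch using the openai-whisper package (pad_or_trim zero-pads to 30 seconds, which is effectively the trailing silence; the audio capture itself is hand-waved here):

    import numpy as np
    import whisper

    SAMPLE_RATE = 16000   # Whisper expects 16 kHz audio
    WINDOW_SECONDS = 30   # the model's fixed context length

    model = whisper.load_model("tiny.en")

    def transcribe_so_far(samples: np.ndarray) -> str:
        """Transcribe whatever audio has arrived so far (a growing float32 buffer)."""
        window = samples[-WINDOW_SECONDS * SAMPLE_RATE:]   # 30 second sliding window
        window = whisper.pad_or_trim(window)               # zero-pad = trailing silence
        mel = whisper.log_mel_spectrogram(window).to(model.device)
        options = whisper.DecodingOptions(language="en", fp16=False)  # fp16=False on CPU
        return whisper.decode(model, mel, options).text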
Maybe I'm missing something, but I don't see the problem here.
Yes, that's because Whisper - like pretty much all of them - uses a Transformer encoder with Attention layers. And the Attention layers learn to look into the future.
And yes, what you describe could be done. But no, it won't reduce latency that much, because the model itself learns to delay the prediction w.r.t. the audio stream. That's why ASR-generated subtitles usually need to be re-aligned after the speech recognition step. And that's why there is research such as the FastEmit paper to prevent that, but then it is a trade-off between latency and quality again.
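To make the "looking into the future" point concrete: the difference is in the attention mask. A toy illustration (not Whisper's actual code):

    import torch

    T = 6  # six audio frames, just for illustration

    # Offline encoder (Whisper-style): every frame attends to every other frame,
    # future included, so good output only appears once the whole window is in.
    full_mask = torch.zeros(T, T)

    # Streaming-style encoder: frame t attends only to frames <= t (maybe plus a
    # small fixed lookahead), which is the latency/quality trade-off above.
    causal_mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)

    print(full_mask)
    print(causal_mask)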
Also, running your "low-latency" model with 1s chunks means you now need to evaluate the AI 30x as often as if you were using 30s chunks.
You just said the models pretty much all work the same way, and then you said doing what I described won't help. I'm confused. Apple and Google both offer real time, on device transcription these days, so something clearly works. And if you say the models already all do this, then running it 30x as often isn't a problem anyways, since again... people are used to that.
I doubt people run online transcription for long periods of time on their phone very often, so the battery impact is irrelevant, and the model is ideally running (mostly) on a low power, high performance inference accelerator anyways, which is common to many SoCs these days.
I meant that most research that has been released in papers or code recently uses the same architecture. But all of those research papers use something different than Apple and Google.
As for running the AI 30x, on current hardware that'll make it slower than realtime. Plus all of those 1GB+ models won't fit into a phone anyway.
> Plus all of those 1GB+ models won't fit into a phone anyway.
I don't think that's a requirement here. I've been playing with Whisper tonight, and even the tiny model drastically outperformed Siri dictation for me in my testing. YMMV, of course.
I tried feeding the four examples from this announcement into Google as dictation inputs, and it just sat there blankly. On the JFK speech test file in the repo, Google understands perfectly. The samples in the announcement are clearly outside the capabilities of anything Google has launched publicly, but I don't know how that translates to overall utility in everyday applications.
My experience with the APIs is that Google is excellent and Microsoft is slightly better. And the offline model I've been using that's nearly as good as both is facebook's wav2vec2-large-960h-lv60-self.
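If anyone wants to try that offline model, the usual Hugging Face recipe is roughly this (sketch; assumes a 16 kHz mono clip and the transformers + soundfile packages):

    import soundfile as sf
    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    NAME = "facebook/wav2vec2-large-960h-lv60-self"
    processor = Wav2Vec2Processor.from_pretrained(NAME)
    model = Wav2Vec2ForCTC.from_pretrained(NAME)

    samples, rate = sf.read("clip.wav")   # placeholder; must be 16 kHz mono
    inputs = processor(samples, sampling_rate=rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    print(processor.batch_decode(torch.argmax(logits, dim=-1))[0])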
Don't believe what's on marketing pages; those claims rarely transfer to the real world. I'll have to make time to try it and see. In theory, given the task diversity and sheer number of hours of training data, it should be a lot more robust, but I'll wait for evidence before believing any claims of SoTA.