“CPU” isn’t necessarily the benchmark, though. Most smartphones going back years have had ML inference accelerators built in, and both Intel and AMD are starting to add instructions to accelerate inference. Apple’s M1 and M2 have the same Neural Engine inference accelerator as their phones and tablets. The question is whether this model is a good fit for those inference accelerators and how well it runs there, or how well it runs on the integrated GPUs these devices all have.
Brute forcing the model with just traditional CPU instructions is fine, but… obviously going to be pretty slow.
I have no experience with the accuracy of Talon, but I’ve heard that most open source models are basically overfit to the test datasets… so their posted accuracy is often misleading. If Whisper is substantially better in the real world, that’s the important thing, but I have no idea if that’s the case.
Ok, my test harness is ready. My A40 box will be busy until later tonight, but on an NVIDIA A2 [1], this is the batchsize=1 throughput I'm seeing. Common Voice, default Whisper settings, card is staying at 97-100% utilization:
tiny.en: ~18 sec/sec
base.en: ~14 sec/sec
small.en: ~6 sec/sec
medium.en: ~2.2 sec/sec
large: ~1.0 sec/sec (fairly wide variance when ramping up as this is slow to process individual clips)
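In case anyone wants to reproduce rough numbers like these, here's a minimal sketch of the measurement using the open-source whisper package with default settings; the clip path is a placeholder, and "sec/sec" just means seconds of audio transcribed per second of wall-clock compute:

    import time
    import whisper

    AUDIO_PATH = "clip.wav"  # placeholder clip from whatever dataset you're testing

    model = whisper.load_model("tiny.en")  # or base.en / small.en / medium.en / large
    audio = whisper.load_audio(AUDIO_PATH)  # decoded to 16 kHz mono float32
    audio_seconds = len(audio) / whisper.audio.SAMPLE_RATE

    start = time.perf_counter()
    model.transcribe(AUDIO_PATH)  # default settings, batch size 1
    elapsed = time.perf_counter() - start

    print(f"{audio_seconds / elapsed:.1f} sec of audio per sec of compute")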
Isn’t the A2 much weaker than a 3090? So those results are promising.
EDIT: for what it's worth, Nvidia rated the A2 at 18 TFLOPS of FP16, and Apple rates the current A16 Neural Engine at 17 TFLOPS of FP16. I'm sure it's not an "apples to apples" comparison.
If you count the GPU component and memory bandwidth, the Apple M2 is slightly weaker on paper for 16-bit inference than the NVIDIA A2, if you manage to use the whole chip efficiently. The A16 is then slightly weaker than the M2.
Sure, the Whisper Tiny model is probably going to be fast enough, but from my preliminary results I'm not sure it will be any better than other models that are much, much faster in this power class.
Whisper Large looks pretty cool, but it seems much harder to run in any meaningful realtime fashion. It's likely pretty useful for batch transcription though.
Even if you hit a realtime factor of 1x, the model can leverage up to 30 seconds of future audio context. So at 1x, if you speak for 10 seconds, you'll potentially need to wait another 10 seconds to use the result. This kind of latency is generally unsatisfying.
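To make the arithmetic explicit (a toy calculation, assuming transcription only starts once you stop speaking):

    def expected_wait(utterance_seconds, realtime_factor):
        # Time spent waiting if processing begins after the utterance ends.
        return utterance_seconds / realtime_factor

    print(expected_wait(10, 1.0))   # 10.0 s wait at 1x (large on the A2 above)
    print(expected_wait(10, 18.0))  # ~0.56 s wait at ~18x (tiny.en on the A2)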
EDIT: After writing and posting the original version of this comment, I did an experiment where I dictated it to Siri, saved that audio (which was recorded simultaneously), and then fed it to both Whisper's tiny.en and medium.en... Siri did terribly for me. Whisper tiny.en was 100% accurate, as far as I can tell, and the only thing Whisper medium.en did was add a few commas that tiny.en had missed. I actually ended up playing the audio file for Siri as well, and that did not end well either. YMMV, but even the tiny model seems very useful. tiny.en took 17.5 seconds to process the ~1 minute audio file, and medium.en took 351 seconds, but I think there is a lot of room for performance optimization on this M2 MBA. The model evaluation was purely using the CPU, not GPU or neural engine, and it wasn't even using all of the CPU cores for whatever reason.
----
With Siri dictation, I feel like I usually spend at least as much time correcting its mistakes as I do speaking the dictation itself. In some cases, that is still faster/easier than typing, but I would rather have a voice model that can work in about the same total amount of time without requiring constant corrections. If I speak for 30 seconds, then I can do other things for 30 seconds while my phone processes it… that might actually be preferable if it gets it right. Otherwise, I’ll be spending 30 seconds actively editing it anyway. Even an improvement on the number of edits required per dictation would be nice. Admittedly, I feel like Google and Microsoft already do a much better job here.
It could be interesting to use the tiny model to give a preview of the writing while the large model is taking its time, and then allow the user to tap on words that changed to see the predictions from the tiny model and correct back to them if they want. I was doing some experiments a few minutes ago, and on one audio clip, the tiny model wrote down a very literal interpretation of an uncommon sci-fi word, and that was more accurate than either the medium or the large models. The rest of the time, the larger models did better, as expected.
But, I don’t know. This is interesting to me, but I agree there could be issues with making it workable for real-time transcription.
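Here's a rough sketch of the preview-then-refine idea, assuming the openai-whisper package; a real implementation would run the large model in the background and use proper word-level alignment, but difflib is enough to show which words the larger model changed:

    import difflib
    import whisper

    AUDIO_PATH = "clip.wav"  # placeholder

    # Fast preview first, slower refinement second (run sequentially here for simplicity).
    preview = whisper.load_model("tiny.en").transcribe(AUDIO_PATH)["text"].split()
    final = whisper.load_model("large").transcribe(AUDIO_PATH)["text"].split()

    # Words the large model changed; a UI could let the user tap these
    # and flip back to the tiny model's guess.
    matcher = difflib.SequenceMatcher(a=preview, b=final)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            print(f"tiny: {' '.join(preview[i1:i2])!r} -> large: {' '.join(final[j1:j2])!r}")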
See https://news.ycombinator.com/item?id=32929029 re accuracy, I'm working on a wider comparison. My models are generally more robust than open-source models such as Vosk and Silero, but I'm definitely interested in how my stuff compares to Whisper on difficult held-out data.
> Brute forcing the model with just traditional CPU instructions is fine, but… obviously going to be pretty slow.
It's not that simple. Many of the mobile ML accelerators are more targeted for conv net image workloads, and current-gen Intel and Apple CPUs have dedicated hardware to accelerate matrix math (which helps quite a bit here, and these instructions were in use in my tests).
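If you want to sanity-check what your own PyTorch build can dispatch to on CPU (this only shows which backends are available, not proof the matrix units are actually hit; whether Intel AMX/VNNI or Apple's matrix hardware gets used depends on the BLAS/oneDNN path the wheel was built with):

    import torch

    # Build-time configuration: shows which BLAS (MKL, OpenBLAS, Accelerate)
    # and oneDNN support this wheel was compiled with.
    print(torch.__config__.show())

    # Runtime availability of the CPU backends that can route matmuls to
    # dedicated matrix hardware on recent Intel and Apple chips.
    print("oneDNN/mkldnn:", torch.backends.mkldnn.is_available())
    print("MKL:", torch.backends.mkl.is_available())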
Also, not sure which model they were using at 17x realtime on the 3090. (If it's one of the smaller models, that bodes even worse for non-3090 performance.) The 3090 is one of the fastest ML inference chips in the world, so it doesn't necessarily set realistic expectations.
There are also plenty of optimizations that aren't applied to the code we're testing, but I think it's fairly safe to say the Large model is likely to be slow on anything but a desktop-gpu-class accelerator just due to the sheer parameter size.
I did some basic tests on CPU; the "small" Whisper model is in the ballpark of 0.5x realtime, which is probably not great for interactive use.
My models in Talon run closer to 100x realtime on CPU.