Distil-Whisper: distilled version of Whisper that is 6 times faster, 49% smaller (github.com/huggingface)
277 points by omarfarooq 11 months ago | 83 comments



Super exciting! I'll be shipping Distil-Whisper to whisper-turbo tomorrow! https://github.com/FL33TW00D/whisper-turbo

Should make running in the browser feasible even for underpowered devices: https://whisper-turbo.com/


I have the same plans for ctranslate2[0] and Willow Inference Server[1]!

[0] - https://github.com/OpenNMT/CTranslate2

[1] - https://heywillow.io/components/willow-inference-server/


How does one get a notification for when you've added it?


I'm terrible at managing this stuff but I'll certainly tweet about it: https://twitter.com/toverainc


It’s a shame that the README doesn’t link to the original Whisper, or at least not prominently. There’s the etiquette, but also I still don’t really know what this does.


The AI ethos is more academic and open in general. That's why stuff like this, and not linking directly to what they forked, aren't faux pas in that community but raise eyebrows here.

It does speech recognition


Wow, you managed to squeeze what's missing in the whole readme into just 4 words. Thanks.


Yeah, the readme just says “ASR”. Apparently that means automatic speech recognition.
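
For anyone curious what that looks like in practice, here's a minimal sketch using the Hugging Face pipeline API (the distil-large-v2 model id is the one mentioned elsewhere in this thread; treat it as an assumption until the weights are actually up):

    from transformers import pipeline

    # "ASR" = automatic speech recognition: audio file in, text out.
    asr = pipeline(
        "automatic-speech-recognition",
        model="distil-whisper/distil-large-v2",
    )
    print(asr("sample.wav")["text"])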

I guess it could be worse. In the future our ai overlords will just talk in embeddings (not even abbreviations), and we will have no clue what they are talking to each other about.


puny scum, barking in some guttural tongue when you could use the gods' Babel!


Shouldn't "academic" imply a useful citation that links the reader to the referenced works?


I don't disagree formally.

Informally, it feels odd/wrong to see people acting like it's a gap. I've been swimming in the AI stuff for a year so I chose to frame it as cultural mores.

I don't know what you do necessarily, so this broad analogy will sound cartoonish: sort of like questioning why a Windows text editor Github README didn't link to Microsoft.com and explain what Windows was.


I'm not an academic or full time developer, just a humble SRE. I of course agree that Notepad shouldn't need to link to the win32 API docs, but I don't think that's a fair analogy. Even outside of tech, it's generally considered bad form to use an acronym or jargon without defining it at least once or providing a useful reference for disambiguation. When you're dealing with a piece of tech with a name that overlaps with many unrelated things, that notion also applies. Furthermore, I would say that it is useful to almost no one to have a project description for something like Notepad that doesn't include the words "text editor". This trend of projects that don't have any indication of what they do harms discoverability for people that might find them useful, and really needs to stop, IMHO.


You can call it an accessibility issue. Depending on one's audience, is it or is it not fair to assume that they know what some specific term or idea means? They can link to Whisper, they can describe what Whisper is for, for anyone who has never heard of that within the context of AI before, but do they also need to provide disambiguations for what "AI" means? Or distillation? model? inference?

There are those who might argue (and those like me who merely take a Tellarite stance https://memory-alpha.fandom.com/wiki/Tellarite) that they can define the audience for their paper by who would actually choose to use the model that they made or who might try to further refine it. Specifying the audience that way does a fair job of singling out people who don't know what the word "AI" means, and may or may not do a good job filtering out people who have never heard of "Whisper" in the context of AI before. Because who is trying to get a more efficient version of a model they've never before encountered or thought about?


Allow me to clarify: the opening line of a project should do the following:

- Inform the reader what the project does.
- Indicate who the audience should be.
- Provide enough context so that the definitions of further acronyms and jargon can be returned by a search engine.

In my mind, "Whisper is an artificial intelligence voice recognition framework for research and incorporation into other software" would accomplish these things neatly. It need not be as overly verbose as you suggest.


The paper mentions it, I don’t think this is an etiquette error


I'm using this: https://github.com/guillaumekln/faster-whisper Smaller, faster, works well with CPU, multiple languages, etc.


That's just using the original model with a faster runtime. It's limited by the model itself, as is ggerganov/whisper.cpp. This changes the model.


So it is possible to combine that with distil for extra speed?


I'm the founder of Willow[0] (we use ctranslate2 as well) and I will be looking at this tomorrow, as soon as these models are released. HF claims they're drop-in compatible but we won't know for sure until someone looks at it.

[0] - https://heywillow.io/


I have to say I love Willow, well done. It's a bit slow now, because I'm not running recognition locally (as I'm sure many people aren't), but it will be fantastic news if this helps me offload recognition onto my NUC (ie CPU-only) and can shave lots of ms off that way.


Thanks!

I'll be looking at this as soon as it is released tomorrow.

Separately, we have some Willow Inference Server improvements in the works that increase the speed of speech recognition on CPU by as much as 50% (depending on CPU supported instruction sets, etc).

Between that, the performance we already have, and this work it will be a dramatic improvement - even on CPU. I'm really looking forward to posting the benchmarks when all of this comes together!


That's excellent news, that'll be great! I'm looking forward to that.


That's the implication. If the Distil models are in the same format as the original OpenAI models, then they can be converted for faster-whisper use per the conversion instructions at https://github.com/guillaumekln/faster-whisper/

So then we'll see whether we get the 6x model speedup on top of the stated 4x faster-whisper code speedup, at the same or nearly the same accuracy.

I would generally start with the assumption that if something is significantly faster the accuracy has to suffer a bit, but increasing model size and/or settings such as beam size to compensate should allow the same accuracy at higher performance (just not all of the stated performance gain).
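
If the drop-in compatibility claim holds (unverified at the time of this thread), the conversion plus load would look roughly like this; the output directory name is arbitrary:

    import ctranslate2
    from faster_whisper import WhisperModel

    # Convert the Hugging Face checkpoint to CTranslate2 format.
    ctranslate2.converters.TransformersConverter(
        "distil-whisper/distil-large-v2"
    ).convert("distil-large-v2-ct2", quantization="float16")

    # Load the converted model with faster-whisper and transcribe.
    model = WhisperModel("distil-large-v2-ct2", device="cuda", compute_type="float16")
    segments, info = model.transcribe("audio.wav", beam_size=5)
    for seg in segments:
        print(f"[{seg.start:.2f} -> {seg.end:.2f}] {seg.text}")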


Just a point of clarification - faster-whisper references it but ctranslate2[0] is what's really doing the magic here.

Ctranslate2 is a sleeper powerhouse project that enables a lot. They should be up front and center and get the credit they deserve.

[0] - https://github.com/OpenNMT/CTranslate2


Yup, should work nicely together.


If it's faster, why doesn't OpenAI implement it?


Because OpenAI focuses on putting out quality models. Efficient execution of ML models is another skill set entirely. Projects like CTranslate2 (which is what faster-whisper uses) are focused on fast model execution and work across all kinds of models from speech recognition to image and speech generation and everything in between.


Also because OpenAI benefits from a certain measure of inefficiency to prevent models from being easy for the masses to run without them being in the loop extracting money as well as compiling new training data out of every inference that users feed them.


I wonder if it's fast enough for wake word detection in WASM. Picovoice worked extremely well for this but it's proprietary.


There's also OpenWakeWord[0]. The models are readily available in tflite and ONNX formats and are impressively "light" in terms of compute requirements and performance.

It should be possible.

[0] - https://github.com/dscripka/openWakeWord
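
For reference, the openWakeWord detection loop is roughly this (a sketch: the model file name is a placeholder and the 1280-sample frame follows the project's 16 kHz / 80 ms convention):

    import numpy as np
    from openwakeword.model import Model

    # Load a wake word model (placeholder file name).
    oww = Model(wakeword_models=["hey_willow.tflite"])

    def wake_detected(frame: np.ndarray, threshold: float = 0.5) -> bool:
        # frame: 1280 samples of 16-bit, 16 kHz mono audio.
        scores = oww.predict(frame)  # {model_name: score}
        return any(score >= threshold for score in scores.values())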


I would think that using any version of whisper for this use-case would be like digging a posthole in your front yard with an orbital directed energy cannon powered by a fusion reactor.


This is a common viewpoint.

Have you used Echo/Alexa and seen what people do with it?

"Alexa make an entry on my calendar for lunch with Guillermo, Brian, and Kyle next week Wednesday at noon at Giordano's on Ohio street in Chicago". From 10-15 feet away, often with all kinds of noise, echo, who knows what. A child mumbling french can get within range of an Echo device and do this (with varying degrees of success).

Yes, a lot of that is handled on device in the audio frontend and elsewhere, but it often still bleeds through and makes the fundamental speech recognition challenging. Not to mention your own accent/voice/speech pattern.

That's firmly Whisper territory and doesn't even get into the flexible grammar, integrations, etc with entire other stacks.

Plus, many hundreds of millions of dollars and nearly a decade later Alexa still struggles with this.


Good response.

However, wouldn't your described use-case be an activity that occurs after wake word activation? Then hand off the rest of the audio stream to Whisper for transcription?


Thanks!

Yes, that's exactly what we do[0] (just like the commercial stuff).

Wake word and VAD are low-resource and even an ESP chip can handle that + stream. The ESP-BOX-3 is actually our main target device for voice hardware interface. It's the nearly infinite audio, speech, grammar, language, etc variability and complexity where you need the "big guns".

Another thing that seems to be getting lost on people - user expectations for voice interfaces are pretty high. If wake fails, a transcript is wrong, speech rec is slow, etc it's easier, faster, and far less frustrating to just take your phone out of your pocket. At that point why even have something poorly attempting to do voice?

[0] - https://heywillow.io/how-willow-works/#willow-inference-serv...


I'm glad I'm not crazy :D

Do you see an eventual future where some notional "model-on-chip" would hard-wire something like whisper into a dedicated integrated low-power chip for these more demanding uses?


I get asked that a lot.

It’s certainly possible. However, consider the market dynamics.

Look at the Coral accelerator from Google. It’s $60. It has 6m TOPS.

Sounds great, until you dig just a little bit deeper.

It has 6-8 MB of memory. A speech recognition model of sufficient quality for these tasks is measured in hundreds of megabytes. Non-starter.

Even with the might of Google behind it the price point, performance, memory, and therefore utility is quite limited for all but a few bespoke applications. Google also has a lot of experience with their TPUs from phones to datacenters so they reduced costs and benefited from shortcuts via that experience and scale.

Yet the capabilities and software ecosystem are pathetic, with even the official Python implementation not having a single commit for 18 months, being stuck on Python < 3.10.

A random $100 used Nvidia card has 8 GB of VRAM, 6 TFLOPS, and over 200 GB/s of memory bandwidth. CUDA is also hands down the most well supported software ecosystem. There isn't anything in ML that doesn't have tier 1 support for CUDA, and vice-versa. Even this ancient card fully supports CUDA 12, so it's future-proof well into a decade past its release date.

If Google can’t pull off something targeting this market with reasonable availability, price points, and software support a new entrant in the field doesn’t stand a chance.

If someone tried to manufacture such a device between the low manufacturing/sales volume, additional memory, and software ecosystem it would likely come in at multiples of the cost of a used Nvidia GPU and even then it couldn’t remotely compete on software.

GPUs catch a lot of flak on power usage but here's the thing: my GTX 1070 idles at 10 watts with all models loaded. It can do Frigate, transcoding with Plex/Jellyfin, and Willow voice sessions in its sleep and still have 80% of the VRAM free for whatever else I want to throw on it down the line.

It’s very difficult to compete with. Not impossible, but a very special set of things would have to come together to stand a chance.

The only thing I can possibly think of is a Raspberry Pi variant with an NPU and unified memory, but even that ecosystem would have a lot of work ahead of it to match what Nvidia (a $1T company) has built over 15 years with CUDA.


This reminds me of discussions about superfluous information in human language sentences. Consider the phrase "that man is bad" versus "that man bad". Somewhat crappy example, but basically yes, an idea can be conveyed in a more efficient representation, but what is lost through compaction is redundancy in a noisy environment.

If all you're doing is parsing "Alexa" out of the air... you're going to have a bad time because realistically, there is a contextual requirement. In AI applications, a proof-of-concept is great, but 99.9% accuracy is basically useless. Think of computer RAM that is accurate 99.9% of the time... that's a broken tool.

If it takes 2 seconds to say "Alexa", that's 43,200 2-second chunks in a day, but if the listener is using a sliding window at 60 Hz, that's 5.2 million opportunities to screw up each day. 99.9% success at parsing a 2-second slice of audio is insufficient.

At some point, no matter how much training you do for ONLY the word "Alexa", you're going to start getting diminishing returns, where the model needed to reach the desired accuracy gets bigger and bigger for less and less improvement. Logical context analysis can easily bridge the gap for much larger gains.
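
For what it's worth, the arithmetic above works out like this:

    seconds_per_day = 24 * 60 * 60            # 86,400
    two_second_chunks = seconds_per_day // 2  # 43,200 chunks per day
    window_evals = seconds_per_day * 60       # ~5.18 million evaluations at 60 Hz
    expected_misfires = window_evals * (1 - 0.999)
    print(two_second_chunks, window_evals, round(expected_misfires))
    # 43200 5184000 5184 -- thousands of chances to screw up per day at "99.9%"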


The model targets the decoder part of the system which is the speed bottleneck. So for tasks like classification it is not likely to be helpful. However a similar method could be used for that use case. (Coauthor)


It's probably still too big to be helpful with these model sizes, but if someone helpful runs the same training on `small.en` (and smaller) we might have something.

Yes, this is me praying to the benevolent HN gods that someone will pick this up and run with it. I don't have a GPU anywhere close to capable...


You'd be surprised how capable old GPUs are! I've had great success with people running Whisper-Turbo in the browser on really old hardware: https://whisper-turbo.com/


We have benchmarks[0] for Willow Inference Server using Whisper + ctranslate2 + some of our own optimizations.

TLD a six year old ~$100 used GTX 1070 is roughly 5x faster than a Threadripper PRO 5955WX at a fraction of the cost and power.

[0] - https://heywillow.io/components/willow-inference-server/#ben...


> TLD a six year old ~$100 used GTX 1070 is roughly 5x faster

Did you mean TIL?


It's not the inference, it's the training. They say in the paper: "We train with a batch size of 256 for a total of 80,000 optimisation steps, which amounts to eight epochs of training." That's a fair chunk of time. Mind you, `small.en` has smaller decoder layers than `medium.en`...


It seems they have only distilled on English data, so the distil-large-v2 model will probably perform badly with any other language, we'll see tomorrow when they are going to release their models.


That is a significant limitation. One of my favorite use cases is translating foreign language audio/video. Whisper translation quality is passable, not great, but enough to get the gist of what is being discussed.


> performs within 1% WER

From the paper, for short-form audio:

> the distil-large-v2 model achieves the lowest overall average WER of 10.1%. It is one percentage point higher than the large-v2 baseline, with 5.8 times faster inference speed and fewer than half the parameters.

Long-form is similar, except Distil-Whisper does slightly better than Whisper (fewer hallucinations, apparently).

10% WER seems awfully high, and doesn't match my experience with Whisper. Maybe my audio is nice and clean relative to their test set?


WER is a pretty strict metric. IIRC it can penalize missing repeats, disfluencies, and the like that an ASR model may reasonably decide to drop. Additionally it will penalize for incorrect pluralization, unique proper nouns that aren't in the model, etc. 10% is still very readable.

I built a tool in the mid-201Xs on an ASR engine with 20%+ WER, and even that was good enough for what we were trying to do.


I agree. When using the small or medium .en models, either for real-time speech recognition of a native English speaker or for transcribing podcasts of native English speakers, the error rate is nowhere near 10%. I might say it's something like 1%, of which the majority of errors are possibly subjective decisions about punctuation. But I have found the error rates are much higher on the tiny model and higher on the base model.

I assume therefore that the 10% word error rate is on very difficult audio such as pilots speaking to Air Traffic Control (distorted or clipped microphones with significant background noise), which I personally find can be difficult to 100% understand even though I'm a native English speaker and when both pilots and air traffic control are native English speakers.


Reading the paper, the table is showing performance on out-of-distribution test sets.


I see: what does that mean, exactly? :) If it means "data they don't usually test on", 10% does still sound pretty high.


Yes, it's not that clear to me either what test sets get a 10% error rate. Because in my use (native English dictation or native English podcast transcription) the small or medium original Whisper models have what I'll call a "discrepancy" rate of, say, 1-2%, which is mostly punctuation and "umms/errs" inclusion or not. The actual "error" rate is below 1% in my experience, and excluding surnames, brands, and place names that I don't know how to spell either, the remaining errors tend to be minor (missed plurals, etc.).

So I infer that these data sets are some deliberately difficult audio: call centre recordings with lots of background noise, phoneline quality audio etc. Maybe non-native speakers. If I only heard that sort of audio once I also might have an error rate of 10%.


Funnily enough, `-small`, `-base` and `-tiny` versions of this would be more exciting to me. `small.en` is the largest of the original whisper models that will run anywhere near usable speed on a raspberry pi zero 2 with whisper.cpp, and it's still too slow to really bother with for streaming. Anything smaller is too inaccurate for day to day use. If there was a distilled version which had a similar 6x speedup, that would be transformative.


I understand that, though I think significant speedups can be useful at multiple levels. For example, I am using either the base or small model with a beam size of 1 with faster-whisper for real-time dictation on a laptop CPU (Ryzen 4500U). The recognition time is just that bit too high when using a larger beam size, and much too high when using the medium model. So if these models offer a decent speedup, it means I can either increase beam size or go up a model size, which should lead to a good improvement in accuracy. With real-time dictation I find that small errors are quite annoying to deal with, so any improvement in accuracy is really useful.
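
For context, that setup is roughly the following with faster-whisper on CPU (standard options, nothing exotic):

    from faster_whisper import WhisperModel

    # int8 on CPU keeps the latency tolerable for real-time dictation.
    model = WhisperModel("small.en", device="cpu", compute_type="int8")

    # beam_size=1 is the speed/accuracy compromise; a 6x-faster distilled model
    # would leave headroom to raise beam_size or move up a model size instead.
    segments, _ = model.transcribe("dictation.wav", beam_size=1)
    print(" ".join(seg.text.strip() for seg in segments))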

At a larger level, say an exercise to transcribe a back catalogue of audio might need a $1000 GPU with the current model speeds to get the job done in a reasonable time. With models that run 6x faster it might be that a $200 GPU is sufficient. That could be quite a significant saving for a small company or charity etc.


Oh yes, that's absolutely true - faster is better for everyone. It's just that this particular breakpoint would put realtime transcription on a $17 device with an amazing support ecosystem. It's wild.

That being said, even with this distillation there's still the aspect that Whisper isn't really designed for streaming. It's fairly simplistic and always deals with 30 second windows. I was expecting there to have been some sort of useful transform you could do to the model to avoid quite so much reprocessing per frame, but other than https://github.com/mit-han-lab/streaming-llm (which I'm not even sure directly helps) I haven't noticed anything out there.


Another important use is low-latency transcription on a phone, for hard-of-hearing people like me. I've been tempted lately to buy a beefier phone.


I do wonder how low latency can feasibly get. I think there's an interesting barrier to cross with letting speech-to-text correct previously emitted tokens when later context reveals "mishearings". I know I've seen some systems do that, so it's definitely possible, just not something that's readily supported by Whisper as it stands.


On a partially-related note, has anyone packaged any version of whisper as an Android keyboard? It seems like a reasonably good fit, and I would be interested to see if it worked better than the deteriorating quality of Google's offering. I think it would work even with the existing versions, but a faster+smaller version would obviously be a better fit for running on phone hardware.


Fortunately, yes. Recently I've been playing with https://github.com/rpdrewes/whisper-websocket-server which uses K6nele as a frontend on Android, if you really care about performance.

Though if you're looking for a standalone app, you can give this a go: https://github.com/alex-vt/WhisperInput and run it right on your phone :]

For now they both run regular OpenAI Whisper (thus tiny.en), but as you can see there's tons of improvement potential with faster-whisper and now Distil-Whisper :D


How much faster in real wall-clock time is this in batched data than https://github.com/m-bain/whisperX ?


maybe 2-3x? faster-whisper says it's 2x faster than whisper.


Is there a good project out there that pairs whisper with something like llama.cpp to create a private local voice assistant?

Llama2 isn't as good as GPT-4 but it's a hell of a lot smarter at Q&A than Siri or Alexa or any of those things.

PSA: I will pay for such a thing if it's really good, privacy respecting, local-first, and preferably at least source available.


I literally played with cat'ing the output of one into the input of the other and it worked better than I had any reason to expect.
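
Roughly, that experiment looks like this in Python (openai-whisper plus llama-cpp-python; the GGUF path is a placeholder):

    import whisper
    from llama_cpp import Llama

    asr = whisper.load_model("base.en")
    llm = Llama(model_path="llama-2-7b-chat.Q4_K_M.gguf", n_ctx=2048)

    # Speech -> text -> answer; a TTS stage would complete the assistant loop.
    question = asr.transcribe("question.wav")["text"].strip()
    reply = llm(f"Q: {question}\nA:", max_tokens=256, stop=["Q:"])
    print(reply["choices"][0]["text"].strip())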

edit: in my 30 minutes of playing with it, I didn't find a good-sounding open-source text-to-speech model for the final stage of the pipeline.


It’s not open-source, but Play.HT has a new “Turbo model” that can begin generating text-to-speech within 150ms. I’ve tried it out, and it’s pretty impressive in terms of both quality and speed. There’s an API, so perhaps that would be worth looking into!


This lack of a decent open text to speech is really frustrating, because some of the closed ones are just scary good.


I've used various forms of Dragon since 1994 to help me deal with hand/arm problems. I would love to tell Windows to go take a flying leap, but I can't because I need to use Dragon. If I had real-time or near real-time recognition AND the ability to edit text by voice in any field or application, then I would be in a very good place. A better place would also include adding per-application/global commands to drive the application.

If you haven't lived with speech recognition, it's not apparent that the command space for a speech environment is significantly different from the command space for mouse and hands. In order to make the command space for speech work well, the application needs to present an API to the speech recognition environment with access to all functionality and data within the application.


> I would love to tell Windows to go take a flying leap but I can't because I need to use Dragon.

What does Dragon do for you that Talon can't?


Talon has advanced quite a bit since I last looked, but it looks like it still falls short. However, I will give it a trial run again.

It looks quite useful if you need command and control; for general dictation, not so much. To be fair, Dragon and other SR systems fail at speech-driven editing.

99.9% of my Dragon use is plain text dictation and, if the app is Select-and-Say enabled, editing and making corrections by voice. Speech commands as they are implemented are rarely useful, mostly because there are too many of them to remember. Fortunately, my hands have recovered enough that it is faster to type and mouse than it is to silently try to remember commands, construct what I want to say, and then say it without stumbling or pausing, then undo what was recognized and try again until I get it right.

One thing I'd like to do is difficult: have the same command give the same results in different contexts.

For example: say "make me root" and have the command recognize what machine on what network I'm connected to, send the command "sudo su -<enter>", and then send the right password from my password manager without me having to type anything.

Another example: have "tail [forever] <service log>" work the same no matter the distribution, logging method (syslog vs. journald), or privilege level. If I need a sudo before accessing log data, put one in and give it the password if needed without any action on my part.

Another thing that should be possible is dictating into an app (like Thunderbird) with text boxes and, when focus is outside of a box, turning off recognition output. If you leave recognition output active outside of a text box, then speech ends up typing hotkey commands from the letters in the words recognized.

Disaster example: you're hotkey-stroking your way through your email, the phone rings, and you forgot to turn your mic off. At the end of the call, you realize your out-of-context recognition has destroyed your mailbox and you have no idea how to recover it.

The biggest thing is that Dragon is an out-of-the-box solution: a 15-minute (or less) install and I am dictating with high accuracy and a large vocabulary.


If you think open source speech recognition is behind, don't even look at text-to-speech synthesis.

It's not even in the same galaxy.


Shameless plug for my project Willow:

https://heywillow.io/

Note that it's important to understand the realities here: when it comes to responsiveness and accuracy competitive with commercial solutions, this is a big challenge short of something like an RTX 3090/4090 running Llama with every performance optimization available.

Even with the potential improvements of this work and optimizations like CTranslate2 (used by our Willow Inference Server and faster-whisper), getting sub-one-second response times like Echo/Alexa more or less automatically calls for a GPU, even with every performance trick available. As I like to say when it comes to ML/AI/speech rec/speech synthesis: if you bring a CPU to a GPU fight you're going to lose - and all of the commercial implementations are certainly using GPU/TPU on top of who knows what else they've come up with over the years and their immense spending.

To get an idea of how dramatic this is, you can see the benchmarks with CTranslate2/faster-whisper and our Willow Inference Server here[0].

Looking at those real-world numbers, even at a claimed 6x performance improvement a mighty Threadripper PRO 5955WX can barely meet this goal with the models needed for voice assistant use cases under real-world conditions (medium/large). Throw an LLM in the mix and you're sitting around waiting at least several seconds for a response, even with ridiculous hardware. On anything less than ridiculous hardware (including GPU) that becomes at least tens of seconds very quickly.

At the fundamental level, a seven-year-old $100 used GTX 1070 is approximately 5x faster than a monster CPU like the Threadripper PRO 5955WX - at a fraction of the cost and power. That's just for the first step (speech rec); to get something approaching Alexa level you're in RTX 3090 territory because of performance and VRAM requirements.

Amazon has spent hundreds of millions of dollars (minimum) over the better part of a decade developing Echo/Alexa. The open source world has a long way to go to catch up.

[0] - https://heywillow.io/components/willow-inference-server/#ben...


I have something pretty rudimentary here: https://github.com/Ono-Sendai/project-2501 Whisper.cpp + chatGPT + windows text-to-speech.


https://github.com/ggerganov/whisper.cpp/tree/master/example... is worth a poke. llama.cpp supports llama2 on CPU.


I've tried the large-v2 on the translate task, but the results aren't great. I guess there needs to be another round of distillation with the translate task thrown in too.


Does anyone know if it is possible to fine-tune the whisper models to add new words? Say, brand names it doesn't yet know about?


You shouldn't need to fine-tune it at all. Whisper supports adding prompts -- not to be confused with GPT-style prompts -- these prompts let you specify the "style" of output the model should give. So if you're giving input that is somewhat ambiguous or has strange spellings of common pronunciations, you can do that via the prompt.

You say "I really like Jason". But, your audience is developers:

prompt=json

"I really like Jason" => "I really like JSON"

The docs give some more detail about how to structure the prompts and have examples of what does and doesn't work; it's meant for this exact purpose.
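
Concretely, this is the initial_prompt argument in openai-whisper (faster-whisper exposes the same option); the glossary text here is made up:

    import whisper

    model = whisper.load_model("base.en")
    result = model.transcribe(
        "standup.wav",
        initial_prompt="Vocabulary: JSON, Kubernetes, CTranslate2, Distil-Whisper.",
    )
    print(result["text"])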


My experience is that what you're describing is only initial_prompt, and it only affects the first 30-second transcription window of the audio in question.

It's effectively useless for helping the model transcribe new words in longer content. That also wouldn't be a long-term solution anyways... no one wants to compile a huge list of "words Whisper probably doesn't know" and have to pass those in every time the model is being used. Even if that worked, it would also distort the transcription, since you're not saying you know which words are in the actual speech, you're just passing in a list of words. So, you could end up influencing Whisper to choose the wrong words, giving priority to this list of random words being passed in.

I am similarly curious about how we can train Whisper models to learn new words over time, unless OpenAI plans to release updated models themselves.


Would attention sinks work here? https://github.com/mit-han-lab/streaming-llm - it sounds like they might. In theory it doesn't involve retraining, it's just a change to how the data is managed between invocations.


Have not read the paper yet but why do they only cut the decoder and not the encoder?


When distilling models for speed, you get a better win from removing decoder parameters, since they run serially, than from removing encoder parameters. For example, see this work: https://arxiv.org/abs/2006.10369

- paper co-author


They don't justify it explicitly, but they do talk about using the distilled model as an assistant for the original. With the encoder precisely the same for both you only need to additionally load the distilled decoder layers for a 2x speedup with the same accuracy as the original.
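
If that pans out, it would presumably look like standard speculative decoding in transformers, something along these lines (model ids and dtype handling are assumptions, not a confirmed recipe):

    import librosa
    import torch
    from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32

    processor = AutoProcessor.from_pretrained("openai/whisper-large-v2")
    model = AutoModelForSpeechSeq2Seq.from_pretrained(
        "openai/whisper-large-v2", torch_dtype=dtype
    ).to(device)
    # The distilled model acts as the draft/assistant decoder.
    assistant = AutoModelForSpeechSeq2Seq.from_pretrained(
        "distil-whisper/distil-large-v2", torch_dtype=dtype
    ).to(device)

    audio, _ = librosa.load("audio.wav", sr=16000)
    features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
    tokens = model.generate(features.to(device, dtype), assistant_model=assistant)
    print(processor.batch_decode(tokens, skip_special_tokens=True)[0])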


English only it seems :(


Hm isn't this problematic from a trademark pov?


Nah, Whisper isn't trademarked. The AI ethos is more academic and open in general, that's why stuff like this, and not linking directly to what they forked, aren't faux pas in that community but raise eyebrows here.


Nice! But next time do the press release when the product is released. Really tired of sites like HN pushing these stories out without any code or files. Feels like vaporware.



