Show HN: Port of OpenAI's Whisper model in C/C++ (github.com/ggerganov)
399 points by ggerganov on Dec 7, 2022 | 87 comments
Hi HN,

OpenAI recently released a model for automatic speech recognition called Whisper [0]. I decided to reimplement the inference of the model from scratch using C/C++. To achieve this I implemented a minimalistic tensor library in C and ported the high-level architecture of the model to C++. The entire implementation is less than 8000 lines of code and is contained in just 2 source files without any third-party dependencies. The GitHub project is here:

https://github.com/ggerganov/whisper.cpp

With this implementation I can very easily build and run the model - “make base.en”. It also allows me to run it on a wide range of devices. For example, I have provided examples of running the model on an iPhone, Raspberry Pi 4 and even in a web page via WebAssembly!

The implementation runs fully on the CPU and utilizes FP16, AVX intrinsics on x86 architectures and NEON + Accelerate framework on Apple Silicon. The latter is especially efficient and I observe that the inference is about 2-3 times faster compared to the current PyTorch implementation provided by OpenAI when running it on my MacBook M1 Pro. The WASM port utilizes SIMD 128-bit intrinsics - a feature supported in some modern web browsers [1].

I am very happy with the performance that I observe on Apple Silicon devices. I didn’t expect that the Accelerate framework [2] (i.e. CBLAS) offers such a dramatic performance boost for matrix multiplications so I was very pleasantly surprised! To enable the framework in your C/C++ projects, all you have to do is add `-framework Accelerate` to your clang command-line flags.
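
For anyone curious what that looks like in practice, here is a minimal, illustrative example (not code from whisper.cpp, and the matrix sizes are just for demonstration) of calling single-precision matrix multiplication through Accelerate's CBLAS interface:

    // Build on macOS with: clang matmul.c -framework Accelerate -o matmul
    #include <Accelerate/Accelerate.h>
    #include <stdio.h>

    int main(void) {
        enum { M = 2, K = 3, N = 2 };
        float A[M*K] = { 1, 2, 3,
                         4, 5, 6 };              // M x K, row-major
        float B[K*N] = { 7,  8,
                         9, 10,
                        11, 12 };                // K x N, row-major
        float C[M*N] = { 0 };                    // M x N result

        // C = 1.0 * A * B + 0.0 * C
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    M, N, K,
                    1.0f, A, K, B, N,
                    0.0f, C, N);

        printf("%.0f %.0f\n%.0f %.0f\n", C[0], C[1], C[2], C[3]);
        return 0;
    }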

This entire exercise of implementing the Whisper model was very interesting to me and helped me understand a lot about how the transformer architecture works. I also got a lot of positive feedback from people finding and using my project. We brainstormed a lot of interesting tools that can potentially be created with this library (such as a speech-to-text plugin for Vim, an RPi4 voice assistant, a WASM chat bot, etc.). If interested, check out the “Examples” section and the “Show and tell” discussions for some ideas!

Would love to know what you think about this project and about your experience with using the Accelerate framework in any of your projects. Cheers!

[0] https://github.com/openai/whisper

[1] https://chromestatus.com/feature/6533147810332672

[2] https://developer.apple.com/documentation/accelerate




Great work. It's a real breath of fresh air coming from huge frameworks that use CUDA.

So many different CUDA versions, with each framework using its own, all relying on a different driver, and everything needing a new version every 3 months and taking ~10 GB (and don't even get me started on cuDNN needing a manual logged-in install).

Here everything is just two files. For embedded systems that don't have a GPU it's perfect.

Here the parallelization and vectorization have been done by hand, but there is a glimmer of hope coming from various compiler projects:

Here is an interesting Intel project that does the parallelization and vectorization automatically for different architectures and is definitely worth a look: https://ispc.github.io/ispc.html

For auto-differentiation, when I need performance or memory I currently use Tapenade (http://tapenade.inria.fr:8080/tapenade/index.jsp) and/or manually written gradients when I need to fuse some kernels, but Enzyme (https://enzyme.mit.edu/) is also very promising.

MPI for parallelization across machines.


> MPI for parallelization across machines.

Some things never change.


Ditto on the CUDA and cuDNN part. My project that had been running fine for the past 4 years just "died" after a colleague's oversight in upgrading the GPU (1080 Ti -> 3090), since the project isn't compatible with the new cuDNN the card requires. It is just too much of a hassle maintaining that *expletive* stack, so I made the wise decision to kill it.


100%.

So much more practical to hack around and/or build small apps.


I vouch for this. Pretty solid and keeps improving. The OP is in the class of Magic Wizards of programming like Fabrice Bellard!

There are frequent updates and performance improvements. There is also a small community of active users around this.

Almost all feedback gets implemented and the OP is very responsive.

The OP made it possible to do state-of-the-art voice recognition without the PyTorch baggage and in C/C++, pretty incredible! It's one of those rare high-value projects.

Very grateful for this project and respect to the OP!

Someday, if an open version of ChatGPT becomes available, this could mean voice assistants that speak sense and understand the human - as long as you have a beefy machine.

The current efficiency is pretty surprising, even on a low spec device it performs faster than real time.

I don't know what to say. But I'm blown away.

I expect to see more magic from the OP in future.

He even has a project for a cool sound modem that works over ultrasound! Not new stuff, but the implementation is the most robust I have seen.

I recommend hackers here check out his other projects too and maybe contribute testing, patches and stuff!


Yup, this is so magical. I've always felt there was something off about requiring end users to set up what is essentially a PyTorch/ML dev environment every time they "just" want to run inference.

A single binary that does all this w/o the Python stack is just incredible!

edit: Got it going in 1 min!

I grabbed the prebuilt artifacts (windows)

- https://github.com/ggerganov/whisper.cpp/actions/runs/363552...

Then downloaded ggml-base.bin (148 MB) and saved it as models/ggml-base.en.bin

- https://huggingface.co/datasets/ggerganov/whisper.cpp/blob/m...

Ran it and everything worked! Amazing. Note that only the large (3 GB) Whisper v2 model is available at the moment, but I haven't seen any errors yet from the older small ones. Wild.


Can you expand on your steps a bit more? I've never used GitHub Actions, which seems to be step 1. Not sure how to get an installer.


Ok nevermind, figured it out, it requires login. Then the archive is at the very bottom of the page.


I've been watching this repo pretty much since the beginning, and the amount of work you've achieved is incredible.

I started tinkering with the code about a week ago and, despite knowing nothing about C/C++, I was able to make some edits to fit my use case and connect it to a custom Python front end (I initially tried to use Qt in C++ but struggled so much to get it to compile that I switched to Python instead). This probably means your code is very clean and well documented.

It's a game changer in terms of accessibility: it can caption almost anything live!

I'm very grateful for the effort that you've led. Thank you ggerganov, and thanks to everyone who contributed.


10/10 you're doing god's work my friend, can't wait to spend some time this weekend to try and understand what's going on here. I can't overstate how much I value small libraries. I can't think of a faster way to learn about a concept than to step through someone else's barebones implementation.


Thanks! Indeed, I agree that the project has an educational aspect and value. For me, it helped me get a better understanding of the neural network layers involved in the transformer model. Also, it was a good playground to practice my low-level optimization techniques. I guess another cool thing was that with the help of the community, we came up with a faster way to evaluate the Encoder (at the cost of some accuracy), which ultimately enabled the WASM and RPi4 examples (see #137 if interested in the discussion).


I liked reading the different implementations of the low-level tensor ops (simple C/AVX/AVX2/WASM128bit/ARM-NEON) -- it will help me learn about how to use x86 ASM. Thank you for writing this! Do you have any other recommendations/examples on how numerical code can be optimized via SIMD routines?


I don't have other recommendations as I am a novice myself when it comes to SIMD. I think the multiplication routines in `whisper.cpp` are relatively basic - dot product and fused multiply-add. With a bit of trial and error I came up with these implementations - not sure if they are optimal.
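
For a rough idea of what such a routine looks like (a simplified sketch for illustration, not the actual kernel in the repo), an AVX2/FMA dot product can be structured like this, assuming n is a multiple of 8:

    // Compile on x86 with: -mavx2 -mfma
    #include <immintrin.h>

    float dot_f32_avx2(const float *x, const float *y, int n) {
        __m256 acc = _mm256_setzero_ps();
        for (int i = 0; i < n; i += 8) {
            __m256 vx = _mm256_loadu_ps(x + i);
            __m256 vy = _mm256_loadu_ps(y + i);
            acc = _mm256_fmadd_ps(vx, vy, acc);   // acc += vx*vy (fused multiply-add)
        }
        // horizontal sum of the 8 partial sums in acc
        __m128 lo  = _mm256_castps256_ps128(acc);
        __m128 hi  = _mm256_extractf128_ps(acc, 1);
        __m128 sum = _mm_add_ps(lo, hi);
        sum = _mm_hadd_ps(sum, sum);
        sum = _mm_hadd_ps(sum, sum);
        return _mm_cvtss_f32(sum);
    }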


For those who want to try, here are the steps I took over about an hour to set it up:

1. Downloaded the Win10 artifact at the bottom of the page (requires logging in to GitHub): https://github.com/ggerganov/whisper.cpp/actions/runs/363552... Extracted and placed this folder in my F:\ drive, renaming it to 'Whisper'.

2. Downloaded `ggml-large.bin` here: https://huggingface.co/datasets/ggerganov/whisper.cpp/tree/m.... Within F:\Whisper, added a folder named 'models'. Moved ggml-large.bin to the 'models' folder.

3. Downloaded ffmpeg, extracted the archive to F:\FFMpeg, and set the environment variable by going to (right click) This PC -> Properties -> Advanced system settings -> (Advanced tab) -> Environment Variables -> click Path -> Edit -> (paste in ffmpeg's path, i.e. F:\FFMpeg\)

4. Used PowerShell to run ffmpeg against an mp3 file to convert it to WAV (which is the only format that works), e.g.:

ffmpeg -i F:\Rec\input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le F:\Output\output.wav

5. Opened PowerShell again, `cd`'d to the Whisper folder, and ran this:

./main -m models/ggml-large.bin -f F:\Rec\output.wav


This is really awesome, great work! I downloaded a long video from YouTube that is very challenging to transcribe (it is an interview with Hayek, who was very soft spoken and had a thick German accent) because I wanted to evaluate OpenAI’s claims about Whisper being “superhuman” in recognition. It was a bit picky about having the audio in the exact right format (it needs 16 kHz WAV files - it would be really nice if it just included ffmpeg in the release and automatically piped any input through ffmpeg first to convert it to the desired format), but once it got started it just cranked away extremely quickly on my iMac M1. And the results do seem to be pretty good. I just wish the model also did some basic speaker identification, so it could insert “Speaker1:” or something at the beginning of each line/timestamp. Even if it’s not sure, it could insert “Speaker<?>:” and that would still be useful.


For those interested:

Video link: https://www.youtube.com/watch?v=34Bre91Ey3Q

Resulting transcript text: https://pastebin.com/5M1iW8yf

The whole thing took 1.5 minutes to run on an M1.



Btw, there is the `yt-wsp.sh` helper script to download, convert and transcribe a video from a given URL:

  ./examples/yt-wsp.sh <video-url>


I actually couldn't get the script to work-- it keeps complaining that it can't find the Whisper executable. Do you know what I would have to do first to get it to work starting from a "blank slate" and cloning the repo? Thanks!


It's trying to run `WHISPER_EXECUTABLE` which it defaults to `whisper` - but if you follow the instructions in the README, you end up with `main`.

    WHISPER_EXECUTABLE=main examples/yt-wsp.sh youtube_url
Or the script itself prints out a handy hint.

    Whisper needs to be built into the main binary with make, then you can rename it to something like 'whisper' and add it to your PATH for convenience.


Ah thank you, I somehow missed that! Amazing job on this. You could literally create whole new companies from this system.


This is very cool. Excellent work!

Perhaps in combination with https://www.npmjs.com/package/peertube-plugin-transcription your port of Whisper could be used for generating subtitles for videos in PeerTube?

I just recently set up a PeerTube instance of my own and uploaded my first video on it ("No Brain Required - ChatGPT solves Advent of Code in Rust, episode 1", https://video.nstr.no/w/6z7PxB4J92H3NHhgMmfYVw)

I want to try and make use of your port of Whisper on my PeerTube instance, so that I can have subtitles generated for my videos on it :D


I don't fully understand what's going on behind the scenes, but I tried the repo some days ago (guess there's even a new model now) and everything seemed very simple to build and try, so thank you for your amazing work.


Offline models, especially speech recognition, are a game changer for many apps.

A fully CPU-based implementation, simple and with minimal dependencies, also helps tremendously to reduce the initial friction and enables potential low-cost applications.

Excellent and impressive work, can’t wait to try this thing at home.



Holy crap, I just loaded a random video file into the tiny model and it did it perfectly and quickly too. This is amazing.


This is awesome. I'm sure the fact that most ML models require an insane mess of Python packages is holding applications back.

Hook this up to ChatGPT and you've got something better than Google Assistant with almost no work.

(You can tell ChatGPT an API, and ask it to generate a script in response to a voice assistant query.)


Hi Georgi,

I am experimenting with your code now. Is there a way to force Whisper to only consider a limited vocabulary and then respond with confidence levels? I am working on an app where it is important to restrict answers, and I would like to know how confident it is that a response is one of a set of words. If the answer could be word A with a confidence level of 95% and word B with a level of 50%, I would want to know that so that I could perform context verification.

Thanks!

Bill


Hopefully this version will add the prompting ability that the original Whisper has. In the original Whisper, you can give it a prompt for the recognition, like "Please respond with only one of the following words: A, B, or C." It wouldn't be foolproof, but it helps.

https://github.com/openai/whisper/discussions/117#discussion...


A follow up on this - I came up with an interesting strategy to achieve this. Still a prototype, but I think it looks very promising:

https://github.com/ggerganov/whisper.cpp/pull/271

The source code is in `command.cpp` and I will soon write up some more details on how it works. If you give it a try, definitely let me know if it works for you.


Hi, it's not obvious how to achieve this, but it feels like it could be done. I think all the "tools" are available in the existing interface in `whisper.h` - for example, `whisper_get_probs()` gives you the probability for each token.
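
As a rough sketch of how it could fit together (the whisper_full_* helpers below are also in whisper.h - double-check the header for the exact names and signatures), something like this would print each emitted token together with its probability:

    #include <stdio.h>
    #include "whisper.h"

    // Run full inference on 16 kHz mono float PCM and print per-token probabilities.
    void print_token_probs(struct whisper_context * ctx,
                           const float * pcm, int n_samples) {
        struct whisper_full_params params =
            whisper_full_default_params(WHISPER_SAMPLING_GREEDY);

        if (whisper_full(ctx, params, pcm, n_samples) != 0) {
            fprintf(stderr, "whisper_full failed\n");
            return;
        }

        const int n_segments = whisper_full_n_segments(ctx);
        for (int i = 0; i < n_segments; ++i) {
            const int n_tokens = whisper_full_n_tokens(ctx, i);
            for (int j = 0; j < n_tokens; ++j) {
                printf("%s\tp = %.3f\n",
                       whisper_full_get_token_text(ctx, i, j),
                       whisper_full_get_token_p(ctx, i, j));
            }
        }
    }

From there, restricting to a fixed vocabulary would be a matter of comparing those tokens and probabilities against your allowed word list.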


Feel free to merge my fork, which is about 20% faster on my computer (Ryzen 7 5700G CPU, medium.en model): https://github.com/Const-me/whisper.cpp It also contains VS2022 projects to build on Windows; your CMake project results in AVX being disabled, which is critical for performance.

Also, I didn’t really understand your multithreading code in the ggml_graph_compute function, but that custom thread pool implementation IMO looks suspicious. Just too many atomics. It might be possible to improve a lot with a better multithreading strategy.


Awesome work. Curious whether the whisper model could be ported to tinygrad and how the performance would compare to your implementation.


Thanks! I'm also interested in seeing a CPU comparison against tinygrad. From what I've seen, tinygrad already utilizes the AMX coprocessor, so I expect to have comparable performance between the 2 on Apple Silicon.


Instant market created for toys you can talk to.

Next Christmas should see plenty of robots, teddy bears and other weird and wonderful sentient toys.


This is really cool, and I have been meaning to get my hands dirty with Whisper.

It looks like you’ve definitely maximized the parallelization on the CPU/AMX here, but have you tried getting it to run on the GPU or the Neural Engine? I love the portability, but I feel like you would get a massive parallelization boost while dramatically cutting energy consumption.


I did some experiments with adding Metal Performance Shaders support, but the performance that I achieved was only marginally better compared to the one I get when using just the Accelerate framework (there is an unmerged PR with the tests).

Honestly, I am a bit confused by all the different types of processing units available on Apple Silicon. If I understand correctly, we have the CPU, GPU, AMX coprocessor and Neural Engine on a single chip. I don't fully understand how these interact with each other. Can we use them all at the same time, or would there be some penalties? I'm interested in finding some resources/information on the topic.


You are correct, in that those are the four.

My understanding is that the AMX is more tightly coupled to the CPU, ultimately being accessible via an instruction set (https://github.com/corsix/amx), and it is useful if you need to do matrix multiplications interleaved with other CPU tasks. A common example would be a VIO loop or something where you want that data in the CPU caches.

The GPU and Neural Engine are not that – they take some time to set up and initialize. They also can parallelize tasks to a much higher degree. The GPU is more generalizable, because you can write compute shaders to do anything in parallel, but it uses a lot of resources. I'll have to check out the PR to see how exactly the MPS implementation matches up with the task at hand, because you could also consider writing Metal compute shaders by hand. Even if the performance is not much better, the CPU is free to do other things.

I know the least about the ANE, but it has specific hardware for running ML models, and you have to process the weights ahead of time to make sure they are in the right format. It can run ML models very efficiently and is the most battery friendly.


I suggest looking into Halide as it will make trying different paths much easier (https://halide-lang.org/).

I haven't looked at your code closely so can't say with certainty it would be the right fit but worth a look.


OpenCL would be much appreciated... It opens the door to using this on many more low-powered devices, but it could be very difficult, as you have already mentioned.


Very nice! Especially that it can run with good speed on the CPU without any external dependencies.

I wish something like this would also exist for Stable Diffusion - a simple, no dependency way to run it with C++ on the CPU with AVX. Do you know if that would be possible with your tensor library, or is it very hardcoded for Whisper?


> about 2-3 times faster compared to the current PyTorch implementation

This is surprising to me. Is this about CPU only? Then it would make sense.

Also, is there a particular reason why the whole code is basically 2 massive files (3k and 8k lines respectively)?


I am not very good with Python, so there is some chance I am doing something wrong. But my explanation of the results that I get is that PyTorch currently does not fully utilize FP16 + Metal or AMX when running on Apple Silicon. In contrast, my implementation stores the weights in 16-bit floating point precision (FP16) and also utilizes the AMX coprocessor through the Accelerate framework. As I mentioned in OP, the latter is very efficient for doing the matrix multiplications. According to my experiments, it is comparable in performance to running them on the Apple GPU via Metal.


Awesome work.

Honest question: do you think, with this marked improvement, it could be worth making a wrapper library for this C/C++ version in Python - for example, like NumPy?


I think PyTorch will very soon catch up in performance once Apple Silicon gets properly supported. On x86, I don't observe a very big performance improvement compared to current PyTorch. So overall, I'm not sure if it is very worth it, although it's not hard to wrap it thanks to the C-style API.


This is really outstanding work, thank you so much for doing this and open sourcing it! I'm sure this will enable many applications in the future!


It's very cool, I love the talk.wasm example.

It looks like it will open up new application scenarios for low-performance hardware products.


When I tried the medium model here it seemed to make more mistakes than the GPU version. Any idea why that would be?


I have only implemented the Greedy decoder which is worse compared to the BeamSearch decoder in most cases.


Yes this does make a difference to the results. I'm loving the speed of your C++ version but it'd be fab to have the BeamSearch decoder if we can?


Really amazing work - just cloned the repo on my Apple Silicon Mac and ran the streaming sample. Worked flawlessly.


Works well on my old Intel 2020 Macbook Pro. Having no dependencies meant I was able to just clone the repo and build (as long as you have the C++ tooling). There's something magical about how simple it is without having to jump through hoops getting your dependencies resolved.


Wow! I tested Whisper and was excited about the possibility of using it with my home automation system, and the accuracy was solid, but I found it just a bit too slow to feel like that was a good use case.

If this is a significant performance improvement it may be a practical option.


I've been watching this repo evolve since it was created. Seriously exemplary work Georgi.


Great work! I use it experimentally in a free service for people who want to subtitle videos. The small model runs on a USD 8/mo VPS. Looking into running the medium model soon.


Is there a feature roadmap?

OpenAI quietly launched Whisper V2

https://news.ycombinator.com/item?id=33884716


Large v2 has already been added to `whisper.cpp` (check the pinned issues). I was thinking about adding a roadmap soon.


How could I deploy and use this in my web app? I don't know much about WebAssembly, so if someone can ELI5, please - thanks.

Is the result some API that I send requests to?

Thanks a ton!


I am using your command executable as a personal assistant. I am using it for simple commands such as opening a browser / certain apps on the desktop.


Pretty cool stuff, can this be easily used to augment videos with translated subtitles? And if so, I'm assuming it would be fast enough to do so in real-time?


I love this! That talk.wasm package has a lot of potential…


This is my favorite example! I recorded a few video demonstrations where I talk with the AI, but they all sounded very cringe. So, the flagship video demo of talk.wasm is currently of 2 browsers talking with each other, which I think is not as impressive. If somebody gets this running and manages to record a conversation - would be happy to hear it!


Amazing, even more so that you rolled your own inference and didn’t use ONNX runtime. Thanks so much for sharing.


On Apple platforms, consider using the BNNS API in Accelerate, which has some fast paths not exposed by the BLAS API.


There are tons of open source STT models, what makes whisper so valuable? I especially don't get it on mobile, where the native STT built into the OS is now real-time and includes punctuation (at least for iOS). I love the open-source approach to the model, but it didn't strike me as particularly better than other open-source or built-in models.


In my limited testing, I've found Whisper to be much better (accuracy-wise) than other STT models.


For some reason, I recalled Whisper to be on-par or slightly worse than the other open source ones, but much better across languages. I appear to be wrong.


Just curious - what open source ones did you compare it with? Do you mean the WER was high, or?


Is there any improvement to GPU inference at all?

CPU inference is not cost-effective for my use case.


If you have a good GPU then you don't need `whisper.cpp`. Best use case currently is if you are running on Apple Silicon.


This is crazy valuable. Just the other day I had an idea that needed this!


Very cool, thank you.


Thank you for sharing your project.


This would be cool in Lisp.


Love it, great job!


There is no language called "C/C++".

If, to use it, you need C++, it's a C++ system, period. If parts are in C, that is an implementation detail, and is furthermore pointless; modernizing those to C++ would make them more robust and probably faster. Using C anywhere just makes it as bug-prone in those parts as C code everywhere is.


> If, to use it, you need C++, it's a C++ system, period. If parts are in C, that is an implementation detail

C and C++ are different languages, you can't necessarily treat a codebase of mixed C and C++ as if it were written entirely in C++.

This isn't purely academic, either. For a long time the MSVC compiler had pretty good C++ language support but surprisingly poor C support, which inconvenienced some folks.

> modernizing those to C++ would make them more robust and probably faster

I agree that adopting C++'s insistence on explicit type conversions is a good move for C codebases. I'd be surprised if performance were improved by reworking a C codebase to get it to compile as C++, though.


You can, in fact, compile the C code with a C++ compiler, given trivial adjustments. Adopting "explicit type conversions" would have similarly trivial results. The benefits would come from leaning into higher-level organization.

C++ code going faster than the equivalent C is routine. It doesn't come without effort, but the effort yields a more maintainable system, so it is commonly done.

MSVC's C compiler only recognized C90. It was generally easier to make your C99 code compile as C++ than to backport it to C90.


> There is no language called "C/C++".

He clearly means that he used two languages: C and C++.

"I decided to reimplement the inference of the model from scratch using C/C++. To achieve this I implemented a minimalistic tensor library in C and ported the high-level architecture of the model in C++."


It is obvious what he did just from reading what he wrote, after. It is not clear that he meant anything at all by the expression. You cannot use his thing in any but a C++ program, so mentioning C at all only adds confusion.


It is clear what he meant by the distinction:

  implemented a minimalistic tensor library in C

  ported the high-level architecture of the model in C++
The C part - which accounts for most of the lines of code - may be of interest to someone who wants to use it as the foundation for implementing a different model - not necessarily in C++.


>There is no language called "C/C++".

Yes there is.

Take the subset of C and C++ that is valid syntax (with the same semantics) in both languages - voila: C/C++.
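
For example, a small snippet written in that common subset compiles unchanged under both a C and a C++ compiler, with identical behavior (an illustrative snippet, not code from the project):

    #include <stdio.h>

    /* valid C and valid C++, identical semantics in both */
    static double dot(const float *x, const float *y, int n) {
        double sum = 0.0;
        for (int i = 0; i < n; ++i) {
            sum += (double) x[i] * (double) y[i];
        }
        return sum;
    }

    int main(void) {
        float a[3] = { 1.0f, 2.0f, 3.0f };
        float b[3] = { 4.0f, 5.0f, 6.0f };
        printf("%f\n", dot(a, b, 3));   /* prints 32.000000 */
        return 0;
    }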


But it is not, in fact, written in such a "language". Nothing is.


Why not Rust? C++ is obsolete


This kind of comment is exactly what Rust skeptics are talking about when they are referring to the toxic Rust community.


I wonder how ChatGPT3 would respond to this?

If they crawled /r/programming and Hacker News, it should be annoyed and come up with something along the lines of "fork it and write it in Rust yourself", "C++ has lots of inertia", "his main stack is C/C++", etc.


While there is no pure Rust model yet, there is a crate with bindings to the C++ model (probably not to this one).



