Hi HN,
OpenAI recently released a model for automatic speech recognition called Whisper [0]. I decided to reimplement the inference of the model from scratch using C/C++. To achieve this, I implemented a minimalistic tensor library in C and ported the high-level architecture of the model to C++. The entire implementation is less than 8000 lines of code, contained in just 2 source files without any third-party dependencies. The GitHub project is here:
https://github.com/ggerganov/whisper.cpp
With this implementation, building and running the model is as easy as “make base.en”. It also runs on a wide range of devices. For example, I have provided examples of running the model on an iPhone, a Raspberry Pi 4 and even in a web page via WebAssembly!
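For reference, a typical first session looks roughly like this (just a sketch - the exact commands and flags may differ depending on the repo version, so check the README):

    git clone https://github.com/ggerganov/whisper.cpp
    cd whisper.cpp
    make base.en                                          # build + download the base.en model
    ./main -m models/ggml-base.en.bin -f samples/jfk.wav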
The implementation runs fully on the CPU and utilizes FP16 arithmetic, AVX intrinsics on x86 architectures, and NEON intrinsics + the Accelerate framework on Apple Silicon. The latter is especially efficient: I observe that inference is about 2-3 times faster than the current PyTorch implementation provided by OpenAI when running it on my MacBook M1 Pro. The WASM port utilizes 128-bit SIMD intrinsics - a feature supported in some modern web browsers [1].
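To give a flavor of the hand-written vectorization, here is a minimal sketch (not the actual whisper.cpp kernel) of an AVX2/FMA dot product - the kind of inner loop that dominates the matrix multiplications:

    #include <immintrin.h>

    // Dot product of two float vectors using AVX2 + FMA.
    // Assumes n is a multiple of 8 (a real kernel would handle remainders).
    float dot_f32_avx2(const float * x, const float * y, int n) {
        __m256 acc = _mm256_setzero_ps();
        for (int i = 0; i < n; i += 8) {
            __m256 vx = _mm256_loadu_ps(x + i);
            __m256 vy = _mm256_loadu_ps(y + i);
            acc = _mm256_fmadd_ps(vx, vy, acc); // acc += vx * vy
        }
        // horizontal sum of the 8 lanes
        __m128 lo = _mm256_castps256_ps128(acc);
        __m128 hi = _mm256_extractf128_ps(acc, 1);
        lo = _mm_add_ps(lo, hi);
        lo = _mm_hadd_ps(lo, lo);
        lo = _mm_hadd_ps(lo, lo);
        return _mm_cvtss_f32(lo);
    }

Compile with `-mavx2 -mfma`; on Apple Silicon the NEON path plays the same role.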
I am very happy with the performance that I observe on Apple Silicon devices. I didn’t expect that the Accelerate framework [2] (i.e. CBLAS) would offer such a dramatic performance boost for matrix multiplications, so I was very pleasantly surprised! To enable the framework in your C/C++ projects, all you have to do is add `-framework Accelerate` to your clang command-line flags.
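As a small illustration, here is a sketch of calling the CBLAS interface that Accelerate exposes (the 2x2 matrices are just for demonstration):

    #include <stdio.h>
    #include <Accelerate/Accelerate.h>

    // Compile with: clang sgemm_demo.c -framework Accelerate
    int main(void) {
        float A[4] = { 1, 2, 3, 4 }; // 2x2, row-major
        float B[4] = { 5, 6, 7, 8 }; // 2x2, row-major
        float C[4] = { 0 };

        // C = 1.0*A*B + 0.0*C
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    2, 2, 2,     // M, N, K
                    1.0f, A, 2,  // alpha, A, lda
                          B, 2,  //        B, ldb
                    0.0f, C, 2); //  beta, C, ldc

        printf("%.0f %.0f\n%.0f %.0f\n", C[0], C[1], C[2], C[3]); // 19 22 / 43 50
        return 0;
    }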
This entire exercise of implementing the Whisper model was very interesting to me and helped me understand a lot about how the transformer architecture works. I also got a lot of positive feedback from people finding and using my project. We brainstormed a lot of interesting tools that can potentially be created with this library (such as a speech-to-text plugin for Vim, an RPi4 voice assistant, a WASM chat bot, etc). If interested, check out the “Examples” section and the “Show and tell” discussions for some ideas!
Would love to know what you think about this project and about your experience with using the Accelerate framework in any of your projects.
Cheers!
[0] https://github.com/openai/whisper
[1] https://chromestatus.com/feature/6533147810332672
[2] https://developer.apple.com/documentation/accelerate
So many different CUDA versions, with each framework using its own, all relying on a different driver; everything needs a new version every 3 months and takes ~10 GB (and don't even get me started on cuDNN needing a manual logged-in install).
Here everything is just two files. For embedded systems that don't have a GPU, it's perfect.
Here the parallelization and vectorization have been done by hand, but there is a glimmer of hope coming from various compiler projects:
Here is an interesting Intel project that does the parallelization and vectorization automatically for different architectures and is definitely worth a look: https://ispc.github.io/ispc.html
For auto-differentiation, when I need performance or memory I currently use Tapenade ( http://tapenade.inria.fr:8080/tapenade/index.jsp ) and/or manually written gradients when I need to fuse some kernels, but Enzyme ( https://enzyme.mit.edu/ ) is also very promising.
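To give an idea of why Enzyme is appealing, here is a minimal sketch modeled on its documentation; `__enzyme_autodiff` is not a regular function but a marker that the Enzyme LLVM plugin replaces with generated gradient code, so this only builds with that plugin loaded:

    #include <stdio.h>

    // Resolved at compile time by the Enzyme LLVM plugin.
    extern double __enzyme_autodiff(void *, double);

    double square(double x) {
        return x * x;
    }

    int main(void) {
        double x = 3.0;
        // d(x^2)/dx = 2x, so this should print 6
        double dx = __enzyme_autodiff((void *) square, x);
        printf("square(%f) = %f, derivative = %f\n", x, square(x), dx);
        return 0;
    }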
MPI for parallelization across machines.
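(For the MPI point, a minimal sketch of the usual pattern - each rank works on its own slice and the partial results are combined; assumes an MPI implementation such as Open MPI, built with mpicc:)

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char ** argv) {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        // Each rank sums its own strided slice of 0..99.
        double partial = 0.0;
        for (int i = rank; i < 100; i += size) {
            partial += i;
        }

        // Combine the partial sums on rank 0.
        double total = 0.0;
        MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0) {
            printf("total = %.0f (expected 4950)\n", total);
        }

        MPI_Finalize();
        return 0;
    }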