Hacker News

I want something that I can self host. I am perfectly OK with a single language and a few mistakes here and there.

Does such a thing exist? I would gladly donate to a kickstarter project for this before trying to build one myself.




Just download Whisper.

If you own a GPU, use this one: https://github.com/openai/whisper

If you don't own a GPU, use this one: https://github.com/ggerganov/whisper.cpp (this one is very, very slow)
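As a sketch of what "just download Whisper" looks like in practice: the openai-whisper pip package installs a `whisper` CLI, and something like the following builds and runs a transcription command. The model name and audio path are placeholders; `--model` and `--output_dir` are flags that CLI accepts.

```python
import os
import shutil
import subprocess

def whisper_cmd(audio_path, model="medium.en", output_dir="."):
    """Argument list for the `whisper` CLI installed by
    `pip install -U openai-whisper`."""
    return ["whisper", audio_path, "--model", model, "--output_dir", output_dir]

# Only attempt to run if the CLI and a sample file actually exist.
if __name__ == "__main__" and shutil.which("whisper") and os.path.exists("audio.mp3"):
    # Downloads model weights on first run; ffmpeg must be on PATH.
    subprocess.run(whisper_cmd("audio.mp3"), check=True)
```

On a CPU-only machine the same audio goes through whisper.cpp instead, but the idea is the same: one command, one audio file in, transcripts out.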


WhisperX also adds improved timestamping, closed-captioning output, and beta diarization (speaker labeling) support. Unfortunately it doesn't seem to support m4a out of the box, but you can convert to mp3 (upgrade the sound library dependency first) or wav with ffmpeg.
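For the m4a caveat, the conversion step is just an ffmpeg call; here's a minimal sketch (file names are placeholders; `-ar 16000 -ac 1` produces the 16 kHz mono audio Whisper models expect):

```python
import os
import shutil
import subprocess

def to_wav_cmd(src, dst):
    """ffmpeg arguments to convert m4a (or anything ffmpeg reads)
    to 16 kHz mono WAV; -y overwrites an existing output file."""
    return ["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1", dst]

# Only run if ffmpeg and a sample input are actually present.
if __name__ == "__main__" and shutil.which("ffmpeg") and os.path.exists("input.m4a"):
    subprocess.run(to_wav_cmd("input.m4a", "input.wav"), check=True)
```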


whisper.cpp is not universally very, very slow. With an M1 MacBook and the medium model it's faster than real time. Some accuracy may be lost because it uses a different search method, and more if you choose to run a smaller model.


You mean without using the OpenAI API? This project is open source and on GitHub, so you can self-host this if you want!


You (essentially) need a GPU, but here you go:

https://github.com/ahmetoner/whisper-asr-webservice

For your requirements, the medium.en model (at most) should be satisfactory.


https://github.com/ggerganov/whisper.cpp makes it relatively feasible to run on CPU.


Yes, but it doesn't provide an HTTP (or any other) API - it's a CLI.

OP said "self host" so I assumed they're looking for an implementation that provides an API endpoint.

It would be straightforward enough to create an API utilizing whisper.cpp but I'm not aware that such a thing exists.
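Such a wrapper would indeed be straightforward. As a rough sketch (not an existing project), here's a stdlib-only HTTP endpoint that shells out to a whisper.cpp binary - the `./main` binary and model paths below are assumptions about a local build, and the client is expected to POST a 16 kHz mono WAV body:

```python
import json
import subprocess
import tempfile
from http.server import BaseHTTPRequestHandler, HTTPServer

def whisper_cpp_cmd(binary, model, wav_path):
    # -m (model) and -f (input file) are whisper.cpp CLI flags; the
    # binary and model paths passed in below are assumptions.
    return [binary, "-m", model, "-f", wav_path, "--no-timestamps"]

class TranscribeHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Expects a raw 16 kHz mono WAV request body.
        length = int(self.headers.get("Content-Length", 0))
        with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
            tmp.write(self.rfile.read(length))
            tmp.flush()
            result = subprocess.run(
                whisper_cpp_cmd("./main", "models/ggml-medium.en.bin", tmp.name),
                capture_output=True, text=True,
            )
        body = json.dumps({"text": result.stdout.strip()}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

def serve(port=8080):
    # Call this next to a whisper.cpp build to expose it over HTTP.
    HTTPServer(("127.0.0.1", port), TranscribeHandler).serve_forever()
```

With the server running you'd hit it with something like `curl --data-binary @audio.wav http://127.0.0.1:8080/` - again, a sketch, not a hardened service.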

Additionally, whisper.cpp is remarkably performant considering it's running on a CPU, but it's still nowhere near competitive with GPU implementations (for obvious reasons). Depending on your expectations relative to the GPU-powered OpenAI Whisper endpoint, it could be disappointing.

The whisper.cpp benchmarks show it transcribing 3:24 of audio with medium.en in 30 seconds (on an M1 Pro!) - which is (again) incredible considering. That's 6.8x real-time.

As an example, we've spent quite a bit of time optimizing our self-hosted Whisper API endpoint, and it can do 3 minutes of audio (the max we currently care about) in 2.5 seconds with large-v2 and a beam size of 5 on an RTX 3090. That's 72x real-time with a much larger, more capable, and more accurate model - and we have further work to do.

Our focus is primarily "real time" dictation tasks with ~10-second sentences. All in, with internet latency of ~70ms, end-to-end time (from the end of a 10 s audio segment to returned results) is currently roughly 700ms. Medium.en is 400ms all in.

Not a fair comparison but yet another example of the massive performance differences between CPU and GPU for tasks like this.
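The real-time figures quoted above are just audio duration divided by wall-clock time:

```python
def realtime_factor(audio_seconds, wall_seconds):
    # Seconds of audio processed per second of compute.
    return audio_seconds / wall_seconds

# whisper.cpp benchmark: 3:24 (204 s) of audio in 30 s on an M1 Pro
print(realtime_factor(204, 30))    # 6.8x
# GPU endpoint: 3 min (180 s) in 2.5 s on an RTX 3090
print(realtime_factor(180, 2.5))   # 72.0x
```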

Additionally, my experience with this project has illustrated to me (yet again) the gulf between "we opened our model" and "actually use it at scale, in production, and be competitive in the marketplace". It's a HUGE difference, and the resources, knowledge, etc. required are substantial.


> You (essentially) need GPU but here you go

Don't most devs most likely already have a powerful GPU? Maybe I'm biased from also being a gamer and having worked in game development, which requires a powerful GPU anyway.


Whisper is extremely simple to use on the command line. Just install it with pip and you're off to the races.



