Hacker News

I want something that I can self host. I am perfectly OK with a single language and a few mistakes here and there.

Does such a thing exist? I would gladly donate to a kickstarter project for this before trying to build one myself.




Just download Whisper.

If you own a GPU, use this one: https://github.com/openai/whisper

If you don't own a GPU, use this one: https://github.com/ggerganov/whisper.cpp (this one is very, very slow)
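As a sketch of what "just download Whisper" looks like in practice: the openai-whisper pip package installs a `whisper` CLI, and something like the following builds and runs a transcription command. The model name and audio path are placeholders; `--model` and `--output_dir` are flags that CLI accepts.

```python
import os
import shutil
import subprocess

def whisper_cmd(audio_path, model="medium.en", output_dir="."):
    """Argument list for the `whisper` CLI installed by
    `pip install -U openai-whisper`."""
    return ["whisper", audio_path, "--model", model, "--output_dir", output_dir]

# Only attempt to run if the CLI and a sample file actually exist.
if __name__ == "__main__" and shutil.which("whisper") and os.path.exists("audio.mp3"):
    # Downloads model weights on first run; ffmpeg must be on PATH.
    subprocess.run(whisper_cmd("audio.mp3"), check=True)
```

On a CPU-only machine the same audio goes through whisper.cpp instead, but the idea is the same: one command, one audio file in, transcripts out.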


WhisperX also adds improved timestamping, closed-captioning output, and beta diarization (speaker labeling) support. Unfortunately it doesn't seem to support m4a out of the box, but you can convert to mp3 (upgrade the sound library dependency first) or wav with ffmpeg.
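For the m4a caveat, the conversion step is just an ffmpeg call; here's a minimal sketch (file names are placeholders; `-ar 16000 -ac 1` produces the 16 kHz mono audio Whisper models expect):

```python
import os
import shutil
import subprocess

def to_wav_cmd(src, dst):
    """ffmpeg arguments to convert m4a (or anything ffmpeg reads)
    to 16 kHz mono WAV; -y overwrites an existing output file."""
    return ["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1", dst]

# Only run if ffmpeg and a sample input are actually present.
if __name__ == "__main__" and shutil.which("ffmpeg") and os.path.exists("input.m4a"):
    subprocess.run(to_wav_cmd("input.m4a", "input.wav"), check=True)
```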


whisper.cpp is not universally very, very slow. With an M1 MacBook and the medium model it's faster than real time. Some accuracy may be lost because it uses a different search method, and more if you choose to run a smaller model.


You mean without using the OpenAI API? This project is open source and on GitHub, so you can self-host this if you want!


You (essentially) need a GPU, but here you go:

https://github.com/ahmetoner/whisper-asr-webservice

For your requirements, the medium.en model (at most) should be satisfactory.


https://github.com/ggerganov/whisper.cpp makes it relatively feasible to run on CPU.


Yes, but it doesn't provide an HTTP (or any other) API - it's a CLI.

OP said "self host" so I assumed they're looking for an implementation that provides an API endpoint.

It would be straightforward enough to create an API utilizing whisper.cpp but I'm not aware that such a thing exists.
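Such a wrapper would indeed be straightforward. As a rough sketch (not an existing project), here's a stdlib-only HTTP endpoint that shells out to a whisper.cpp binary - the `./main` binary and model paths below are assumptions about a local build, and the client is expected to POST a 16 kHz mono WAV body:

```python
import json
import subprocess
import tempfile
from http.server import BaseHTTPRequestHandler, HTTPServer

def whisper_cpp_cmd(binary, model, wav_path):
    # -m (model) and -f (input file) are whisper.cpp CLI flags; the
    # binary and model paths passed in below are assumptions.
    return [binary, "-m", model, "-f", wav_path, "--no-timestamps"]

class TranscribeHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Expects a raw 16 kHz mono WAV request body.
        length = int(self.headers.get("Content-Length", 0))
        with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
            tmp.write(self.rfile.read(length))
            tmp.flush()
            result = subprocess.run(
                whisper_cpp_cmd("./main", "models/ggml-medium.en.bin", tmp.name),
                capture_output=True, text=True,
            )
        body = json.dumps({"text": result.stdout.strip()}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

def serve(port=8080):
    # Call this next to a whisper.cpp build to expose it over HTTP.
    HTTPServer(("127.0.0.1", port), TranscribeHandler).serve_forever()
```

With the server running you'd hit it with something like `curl --data-binary @audio.wav http://127.0.0.1:8080/` - again, a sketch, not a hardened service.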

Additionally, whisper.cpp is remarkably performant considering it's running on a CPU, but it's still nowhere near competitive with GPU implementations (for obvious reasons). Depending on your expectations relative to the GPU-powered OpenAI Whisper endpoint, it could be disappointing.

The whisper.cpp benchmarks show it transcribing 3:24 of audio with medium.en in 30 seconds (on an M1 Pro!) - which is (again) incredible considering. That's 6.8x real-time.

As an example, we've spent quite a bit of time optimizing our self-hosted Whisper API endpoint, and it can do 3 minutes of audio (the max we currently care about) in 2.5 seconds with large-v2 and a beam size of 5 on an RTX 3090. That's 72x real-time with a much larger, more capable, and more accurate model - and we have further work to do.

Our focus is primarily "real time" dictation tasks with ~10-second sentences. All in, with internet latency of ~70ms, end-to-end time (from the end of a 10 s audio segment to returned results) is currently roughly 700ms. Medium.en is 400ms all in.

Not a fair comparison but yet another example of the massive performance differences between CPU and GPU for tasks like this.
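The real-time figures quoted above are just audio duration divided by wall-clock time:

```python
def realtime_factor(audio_seconds, wall_seconds):
    # Seconds of audio processed per second of compute.
    return audio_seconds / wall_seconds

# whisper.cpp benchmark: 3:24 (204 s) of audio in 30 s on an M1 Pro
print(realtime_factor(204, 30))    # 6.8x
# GPU endpoint: 3 min (180 s) in 2.5 s on an RTX 3090
print(realtime_factor(180, 2.5))   # 72.0x
```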

Additionally, my experience with this project has illustrated to me (yet again) the gulf between "we opened our model" and "actually use it at scale, in production, and be competitive in the marketplace". It's a HUGE difference, and the resources, knowledge, etc. required are substantial.


> You (essentially) need GPU but here you go

Don't most devs most likely already have a powerful GPU? Maybe I'm biased from also being a gamer and having worked in game development, which requires a powerful GPU anyway.


Whisper is extremely simple to use on the command line. Just install it with pip and you're off to the races.



