MacWhisper: Transcribe audio files on your Mac (gumroad.com)
240 points by cristoperb on Aug 23, 2023 | 93 comments



I've been using MacWhisper for a few months; it's fantastic.

Sometimes I'll send an mp3 or mp4 video through it and use the resulting transcript directly.

Other times I'll run a second step through https://claude.ai/ (because of its 100,000 token context) to clean it up. My prompt for that at the moment is:

> Reformat this transcript into paragraphs and sentences, fix the capitalization and make very light edits such as removing ums

That's often not necessary with Whisper output. It's great if you extract captions directly from YouTube, though - I wrote more about that here: https://simonwillison.net/2023/Aug/6/annotated-presentations...
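
If you want to script that second pass rather than pasting into claude.ai, here's a minimal sketch using the anthropic Python SDK; the model name and transcript.txt path are placeholders, not anything from the workflow above:

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    transcript = open("transcript.txt").read()  # placeholder path

    message = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder; use whatever model you have
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": "Reformat this transcript into paragraphs and sentences, "
                       "fix the capitalization and make very light edits such as "
                       "removing ums:\n\n" + transcript,
        }],
    )
    print(message.content[0].text)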


This is so good! I studied English, then moved to linguistics, then lived in the UK for almost a decade, and due to my accent none of the STT tools come close to the approach you just mentioned (Whisper + LLM). Thanks Simon!


I have a Python script on my mac that detects when I press-and-hold the right option key, and records audio while it's pressed. On release, it transcribes it with whispercpp and pastes it. Makes it very easy to record quick voice notes. Here it is: https://github.com/corlinp/whisperer/tree/whisper.cpp
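
The core of it is just a global key listener plus start/stop hooks. A rough sketch of the idea with pynput (start_recording and stop_and_transcribe are hypothetical placeholders; see the repo for the real implementation):

    from pynput import keyboard

    def start_recording():
        ...  # hypothetical: begin capturing mic audio

    def stop_and_transcribe():
        ...  # hypothetical: stop capture, run whisper.cpp, paste the result

    def on_press(key):
        if key == keyboard.Key.alt_r:  # right option key on macOS
            start_recording()

    def on_release(key):
        if key == keyboard.Key.alt_r:
            stop_and_transcribe()

    # A global listener like this is what triggers the accessibility prompt on macOS
    with keyboard.Listener(on_press=on_press, on_release=on_release) as listener:
        listener.join()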

I was working on a native version in the form of a taskbar app with a customizable prompt and all. However, I quickly realized that the behaviors I want require a bunch of accessibility permissions that would block it from the App Store and require more setup steps.

Would anybody still find that useful?


> However I quickly realized that the behaviors I want the app to do require a bunch of accessibility permissions

Which behaviours specifically?

Personally, I wouldn't worry too much about the App Store. I'm distributing Enso (http://enso.sonnet.io) via gumroad.com, and people download/pay for it. I think it's easier than using the App Store Connect route anyway.

Here's a good intro: https://rambo.codes/posts/2021-01-08-distributing-mac-apps-o...


Detecting an alt-key push even when yours isn't the active window, and editing the selected text field, both require accessibility permissions.

Thanks for the info about your app. It looks great!


Editing the field definitely needs the permissions, but detecting Alt-key holding should not.

You can do that using something like:

    import AppKit

    // Assigning a new item (or nil) cancels the previously scheduled one
    var reactOnOptionKeyHeld: DispatchWorkItem? { didSet { oldValue?.cancel() } }

    NSEvent.addGlobalMonitorForEvents(matching: .flagsChanged) { event in
        // Compare only the device-independent bits so caps lock etc. don't interfere
        guard event.modifierFlags.intersection(.deviceIndependentFlagsOnly) == .option else {
            reactOnOptionKeyHeld = nil  // option released (or another modifier pressed)
            return
        }
        reactOnOptionKeyHeld = DispatchWorkItem {
            // start recording
        }
        // Fire only if the key is still held after 1 second
        DispatchQueue.main.asyncAfter(deadline: .now() + 1, execute: reactOnOptionKeyHeld!)
    }
I see you're using Python with pynput, though, which creates a full key listener, so I guess that is why you need the permissions.


Thanks! Yes, to be clear I was rewriting it in Swift which I'm new at, so I appreciate the expert advice there.


Does anyone actually download their Mac software from the App Store?


I prefer to because it centralizes updates for apps from different makers. Way more convenient than manually checking each piece of software.


Unless it's a trusted recommendation, I'll look on the App Store first. Even with a recommendation, sometimes I'll still get it from the App Store anyway:

- The store itself is convenient for browsing and discovery.

- I don't need to do any kind of background checking on the developer prior to running their app

- Similarly they don't get access to my credit card details, so I don't have to be concerned about them storing it incorrectly or abusing it later

- It's also easier for me to pay them, as often foreign transactions are blocked despite me specifically approving them.

- I don't need to hand over my email or other personal details; small developers seem really bad at storing information and CC details properly. I use custom emails for everyone, and numerous times I've seen my data on-sold, stolen by ex-employees, or simply taken from the company by hacking groups.

- I don't have to do any reviews when there is an update; I can just accept it, knowing that the developer is still trusted and that the project hasn't been hijacked, such as the numerous painful times that popular open source projects have had malware snuck into them.

- If the app doesn't properly do what it claims (or at least what I thought it would), it's a few clicks to get a refund from Apple.

- Apple uses carrot and stick to keep developers' apps up to date with the system: first the carrot, and eventually the stick (delisting).

Some may argue that some of these things can still happen with an app store, but it's demonstrably less and there are processes in place to deal with that.

It's not a popular opinion, especially on HN, but there are plenty of developers who, whether through frustration or dealing with pedantic/rude* customer requests, treat their customers like shit; the store prevents that.

The app store ain't perfect, and there are plenty of functions which apps can't have if they're sold through it, but for all its flaws it's trustworthy and helps me utilise a far larger number of utilities than I'd normally be comfortable with personally maintaining.

* I sit on enough discords to see how breathtakingly rude and demanding some users are without realizing it, even for totally free software.


Yup, sometimes. Sometimes not.


Not being on the App Store isn't an issue; it wouldn't put anyone off downloading or using it. The majority of my apps aren't from there, and I imagine most long-time users are the same.


I want this taskbar app! (ideally with streaming transcription vs. at the end) How can I get it?


You need to add an "SRC_DIR" variable in run.sh, or it fails if you run it outside of the src dir:

    SRC_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
    python3.10 "$SRC_DIR/whisperer.py" "$@"

Also, did you solve the Python app icon bouncing all the time?


Thanks for actually trying it out! I must admit, I didn't pay much attention to the UX for installation here, since it was mostly for my own use, but it's great that you got it working! What do you think?

I have not solved the bouncing icon. That's one of the reasons it needs to be rewritten in Swift!


I think it's pretty cool to have a hotkey to type STT text anywhere, locally. It also helps with using LLMs: it opens you up to more runtimes, since most don't have a Whisper plugin, and using a Whisper plugin is usually awkward UX anyway.


I haven’t used MacOS native dictation, but I thought it could be used anywhere text could be input. Does your script have different functionality?

Edit: looks like Mac native dictation runs locally only with Apple Silicon, and maybe has a more limited geographic/linguistic reach?


- MacOS native dictation, in my experience, is slow to start up (indeterminate delay after pressing the dictation key)

- The accuracy is decent but the vocabulary is very limited. With Whisper, you can customize the prompt to include industry-specific terms and acronyms (see the sketch after the examples below).

See my example from the repo. Apple recognizes:

> Popular Linux distributions include Debby and Fed or Linux, and do Bantu. You can use windowing systems such as X eleven or Weiland with a desktop environment like KD plasma.

Whisper recognizes:

> Popular Linux distributions include Debian, Fedora, Linux, and Ubuntu. You can use windowing systems such as X11 or Wayland with a desktop environment like KDE Plasma.
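
That prompt customization is the initial_prompt parameter in the openai-whisper Python API, which seeds the decoder with domain terms so it prefers the right spellings. A minimal sketch (the file name is a placeholder):

    import whisper

    model = whisper.load_model("base.en")
    # Seeding the decoder with domain vocabulary biases it toward these spellings
    result = model.transcribe(
        "linux-notes.wav",  # placeholder file name
        initial_prompt="Debian, Fedora, Ubuntu, X11, Wayland, KDE Plasma.",
    )
    print(result["text"])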


To everyone saying the betas are better, no sir:

Popular Lennox distributions include Debbie and Fedora, Lennox, and Beau you can use windowing system such as X11 or Whelan with a desktop environment like Katy plasma


Just tried the macOS one; here's my nearly worthless result:

Popular Linux distributions include Devion fedora Linux and a bunch to you can we use when doing system such as excellent Wayland with a desktop environment like Katie plasma.


This captures the exact type of problem I have when using Siri for my shopping list.


macOS native dictation is not as good as Whisper in terms of accuracy; however, that is probably going to change with macOS Sonoma, since they will switch the speech recognition model to a better one (Transformer-based, iirc).


Was it ever addressed that even when you had the microphone turned off, it could still detect audio stimuli and reflect that in the oscillating sound-wave visual? Makes me wonder if it was/is always listening, even with Hey Siri disabled.


That "sounds" to me like the mic was properly cut off electrically, but the rest of the system as active, so you'd get electrical noise coming in. E.g., the mic and amp are powered down, but the ADC is still active.

My old Sun Ultra 40 M2 had a ton of electrical noise on my headphone jack, and I could def. tell when the CPU was busy from what I was hearing.


I meant like, even after toggling it off when you made noise or spoke, there was a visual representation/feedback for that shown in the wavy graph thing. Not just ambient/moving parts type noise. Just thought it was weird, never really thought about it too much.


I'm on the Sonoma beta and can confirm it's miles better


Better than whisper? I am running whisper.cpp locally on my Ventura. Should I update to Sonoma?


Better than dictation used to be on MacOS. I tried some whisper-based stuff, but it lacks the integration that the built-in dictation has (so I don't have to dictate somewhere else and copy/paste). It seems in the same ballpark as whisper, but I haven't done a comparison.


Whisper is cool. Back in college I wanted to do some projects with speech-to-text and text-to-speech as an interface like 10-12 years ago, but at that point the only option was google APIs that charged by the word or second.

On top of that, constantly sending data to google would have chewed a ton of battery compared to the "activation word" style solutions ("ok google/siri") that can be done on-device. The power for on-device processing was obviously going to come down over time, while wireless is much more governed by the laws of physics, and connectivity power budgets haven't gone down nearly as much over time. I am pretty sure there is a fundamental asymptotic limit for this, governed by Shannon entropy limit/channel width and power output. In the presence of a noise floor of X, for a bandwidth of Y, you simply cannot use less than Z total power for moving a given amount of data.
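
For the curious, the limit I mean is the Shannon-Hartley capacity and the minimum energy per bit that falls out of it:

    C = B \log_2\!\left(1 + \frac{S}{N_0 B}\right), \qquad
    \frac{E_b}{N_0} \ge \ln 2 \approx 0.693 \; (-1.59\ \text{dB})

In words: for a fixed noise density N_0 there is a hard floor on joules per bit no matter the protocol; radios can only move along that curve, not beat it.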

BTLE is really the first game-changer (especially if you are hooking into a broad network of receivers like Apple does with AirTags) but even then you are not really breaking this rule - you are just transmitting less often, and sending less data. It's just a different spot on the curve that happens to be useful for IoT. If you are, say, doing a keyboard over BTLE where the duty cycle is higher, the power will be too. Applications that need "100% duty cycle"/"interactive" (reachable at any time with minimal latency) still have not improved very much.

In hindsight I guess the answer would have been writing a mobile app that ties into Google/Siri keywords and actions, and letting the phone be the UI and only transmit BT/BTLE to the device. But BTLE hadn't hit the scene back then (or at least not nearly to the extent it has now) and I was less experienced/less aware of that solution space.


If you're looking for an alternative that runs on Linux, I just recently discovered Speech Note. It does speech to text, text to speech, and machine translation, all offline, with a GUI:

https://flathub.org/apps/net.mkiol.SpeechNote

https://github.com/mkiol/dsnote


I must have missed this on my usual peruse of new apps on Flathub. Thanks for making it; I look forward to trying it out.


While whisper.cpp is faster than faster-whisper on macOS due to Apple's Neural Engine [0], if you have a GPU on Windows or Linux, faster-whisper [1] is a lot faster than both OpenAI's reference Whisper implementation and whisper.cpp. Since faster-whisper is only a Python library, the CLI to use is wscribe or whisper-ctranslate2. It's pretty good.

[0] https://github.com/guillaumekln/faster-whisper/discussions/3...

[1] https://github.com/guillaumekln/faster-whisper
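
Basic usage, adapted from the faster-whisper README (model size, device, and file name are whatever fits your setup):

    from faster_whisper import WhisperModel

    # "cuda" assumes an NVIDIA GPU; use device="cpu" otherwise
    model = WhisperModel("large-v2", device="cuda", compute_type="float16")

    segments, info = model.transcribe("audio.mp3", beam_size=5)
    print("Detected language:", info.language)
    for segment in segments:
        print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))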


This basically does the same thing but free:

https://apps.apple.com/us/app/aiko/id1672085276


That's awesome that the dev released Aiko for free!

Not a deal breaker, but it was last updated 3 months ago and lacks a few QoL features of MacWhisper. Jordi is frequently pushing updates to MacWhisper: https://nitter.net/jordibruin/status/1692133387299864638


Hmm, that one has a lot fewer features.


Nothing that isn't scriptable with existing projects, tbh


Here's a multi-platform open source app that does the same thing but uses Vosk instead of Whisper.

https://github.com/bugbakery/audapolis


Been using it for a couple months, and Jordi keeps improving on it at a steady clip. It's great!!


I've used this for a few months to transcribe interviews and it works pretty well. The UI for dealing with multiple speakers is a bit cumbersome, and there are occasional crashes, but overall it's definitely a great app and worth the money.


The main problem I have faced with the Whisper model (large) is that when there is silence or a sizable gap without audio, it hallucinates and just puts out some random gibberish repeatedly until the transcription ends. How does this app handle this?


I've run into that many times. Would be nice to have a fix.
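
One common mitigation (not what MacWhisper does, as far as I know) is to run a voice activity detector first and skip the silent stretches; faster-whisper, mentioned elsewhere in this thread, has that built in:

    from faster_whisper import WhisperModel

    model = WhisperModel("large-v2")
    # vad_filter runs Silero VAD first and drops long silences before decoding,
    # which is the usual fix for Whisper hallucinating on gaps
    segments, _ = model.transcribe("interview.wav", vad_filter=True)
    for segment in segments:
        print(segment.text)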


https://github.com/MahmoudAshraf97/whisper-diarization

This project has been alright for transcribing audio with speaker diarization. A bit finicky. The OpenAI model is better than other paid products (Descript, Riverside), so I'm looking forward to trying MacWhisper.
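
For anyone wondering how these pipelines work: Whisper itself only produces timestamped text, so a separate diarization model supplies speaker turns, which then get merged by timestamp. A rough sketch of the diarization half with pyannote.audio (pyannote is just one common choice, not necessarily what that repo uses; the merge step is left out):

    from pyannote.audio import Pipeline

    # The pretrained pipeline is gated; pass use_auth_token=<HF token> if required
    pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")
    diarization = pipeline("interview.wav")  # placeholder file

    # Speaker turns, to be merged with Whisper's timestamped segments
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        print(f"{turn.start:.1f}s-{turn.end:.1f}s: {speaker}")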


I really like this app; I wish there were a way to play the video while editing the subtitles, though!


There is a great library that supports not only OpenAI's Whisper but many other engines that also work offline: https://github.com/Uberi/speech_recognition
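
The nice part is the uniform API across engines. A minimal offline sketch using its Whisper backend (the file name is a placeholder; requires openai-whisper installed):

    import speech_recognition as sr

    r = sr.Recognizer()
    with sr.AudioFile("clip.wav") as source:  # placeholder file
        audio = r.record(source)

    # Runs Whisper locally; swap in recognize_google, recognize_sphinx, etc.
    # for the other engines the library supports
    print(r.recognize_whisper(audio, model="base"))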


Out of curiosity, does anyone know what the state of the art for transcription is? Is there a possibility it will soon be "better than a person carefully listening and manually transcribing"?

I ask because I asked a friend to record a (for fun) lecture I couldn't attend, and unfortunately the speech audio levels are quite low, and I'm trying to figure out how to extract as much info as possible so I can hear it. If I could add context to the transcriber like "This is about the Bronze Age collapse and uses terminology commonly used in discussions on that topic", it might be even more useful.


Try uploading it to https://revoldiv.com/. We pre-process the file to make it a little more intelligible, and you can supply your context when uploading.


A few weeks ago I found myself wanting a speech-to-text transcriber that directly captures my computer's audio output (i.e. not mic input, not an audio file), but I could not find one. The best alternative I found was to have my computer direct audio output to a virtual audio input device, but I could not do this on my desktop because I do not have a sound card. I found software that did this, but it did not allow me to listen to the audio output while it was redirected to a virtual audio input.

Has anyone else tried to do something similar? How did you achieve it?


Audio Hijack[1] will let you route any audio to multiple virtual or actual outputs while adding the ability to listen to any part of the signal chain. Hope that solves it for you; it's saved my sanity a number of times! [1] https://rogueamoeba.com/audiohijack/


Love the idea behind this. High quality transcription + the data not leaving your device is excellent.

Any chance there's an iOS version of this coming down the pike? It would be great to have a voice-based note-taking app for when you're driving or walking and don't want to type into your phone, but just want to save a thought by quickly dictating it and have it accessible as text later.


I didn't know Whisper could differentiate voices for the per-speaker transcription. Is that new? Is it also available in the command-line Whisper builds?


It can't; per-speaker transcripts come from layering a separate diarization model on top and merging its speaker turns with Whisper's timestamps.


https://github.com/chidiwilliams/buzz

    brew install buzz

It's great.


If you want a quick and free web transcription and editing tool, we've built https://revoldiv.com/ with speaker detection and timestamps. It takes less than a minute to transcribe an hour-long video/audio file.


Yes but the point of this project is that it doesn't require you to share sensitive data with third parties.


Good point, but the problem with local hosting is that if you want to use the larger models it will take a long time to transcribe a file. We use multiple GPUs, do speaker detection and sound detection, and it has a rich audio editor.


Totally agree, having built a similar app I know speaker diarization is a killer feature that's hard to get. My problem is I'll never share these recordings ;).


Is gumroad a good platform for selling software like this? How is licensing handled?


Would be nice if it allowed importing mkv files; in the end it's just a container...


The OpenAI CLI does that; follow the instructions at https://github.com/openai/whisper


Thanks for the pointer. Curious: can Whisper be made to translate into languages other than English?


Check the options on that; I think you just need to select the right model.
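
For what it's worth, in the openai-whisper Python API it's the task option rather than the model, and as far as I know the built-in translation only goes into English, whatever the source language:

    import whisper

    model = whisper.load_model("medium")
    # task="translate" always produces English output; for English -> other
    # languages you'd need a separate translation step
    result = model.transcribe("audio.mp3", task="translate")
    print(result["text"])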


If you'd rather use a web app with minimal cost upfront check out PlainScribe :) https://www.plainscribe.com/


I tried to sign up and got a Clerk error: "You have reached your limit of 500 users. You can remove the user limit by upgrading to a paid plan or using a production instance."


Sorry about that. This should be fixed now.


Does anyone know of an easy-to-use Whisper fork with speaker attribution?


Shameless plug: I recently launched LLMStack (https://github.com/trypromptly/LLMStack), and I have some custom pipelines built as apps on LLMStack that I use to transcribe and translate.

Granted my use cases are not high volume or frequent but being able to take output from Whisper and pipe it to other models has been very powerful for me. It is also amazing how good the quality of Whisper is when handling non English audio.

We added LocalAI (https://localai.io) support to LLMStack in the last release. Will try to use whisper.cpp and see how that compares for my use cases.


Seriously great program. Licensing model is just fine. I use this all the time, as do my colleagues at other companies.

The developer, Jordi, has a great talk online about product development.


Is this just a front end to OpenAI's Whisper?

https://github.com/openai/whisper


Seems shady to me to charge for running larger free models you don't provide, on hardware your users provide. You are charging for OpenAI's features, not yours.


They're not charging for the model, they're charging for the UI.


The UI is free, the premium features are the model. Read the website.


Did you even look at the app?


Yes. Now let me help you read the paid features:

> Supports Tiny (English Only), Tiny, Base, Small, Medium and Large models

> Translate audio file into another language through Whisper (use the Medium or Large models, the results will not be perfect and I'm working on more advanced ways to do this)


Many such apps exist. I use Hello Transcribe from the App Store, $7 across all iDevices, with CoreML optimization.


I've gotten confused between the different Whispers. How is this different from the OpenAI API endpoint?


It runs locally, using Whisper.cpp[1], a Whisper implementation optimized to run on CPU, especially Apple Silicon.

Whisper itself is open source, and so is that implementation; the OpenAI endpoint is merely a convenience for those who don't wish to host a Whisper server themselves, deal with batching, rent GPUs, etc. If you're making a commercial service based on Whisper, the API might be worth it for the convenience, but if you're running it personally and have a good enough machine (an M1 MacBook Air will do), running it locally is usually better.

[1] https://github.com/ggerganov/whisper.cpp


FWIW, I will add that most laptops made in the past 10 years are fast enough for real-time transcription. Unless you're trying to transcribe in bulk, running it locally will usually be the best option.


This depends on the model being used; if you're doing anything that isn't English, you pretty much need large, and that needs considerable resources.


Great tool, but I can't wait until it can do real-time live transcribing.


superwhisper.com is also cool


$165!


Thanks!


So this is not Whisper Transcription 4 from the App Store?


Any insight on how Whisper works on older Intel Macs? I have a 2012 Mac mini with 16GB of RAM doing nothing; if I could use it to (slowly) transcribe media in the background, this becomes a must-buy.


Not well with Intel Macs unfortunately.


Anyone have a cached page? Seems to be hugged to death.


Why? Just use whisper directly. The model and code are available, and I think there's even a Homebrew formula...


Why? Just use MacWhisper and have a great interface with a bunch of options.


Why use a web browser? Just use curl directly. The code is available and I think there's even a homebrew formula...


I deserved that curl snark. :o)

Web browsers are mostly free and don't try to upsell you to a Pro paid version. The MacWhisper author deserves to be compensated for their work, so I'm not objecting to the existence of a paid version. This feels like yet another relatively low value freemium/upsell wrapper in the Mac shareware ecosystem to me.

I'm probably wrong and there's a real population that benefits from this work; clearly some folks perceive it as useful enough to pay for, and I'm just not in that audience to see it.

I think part of what rubs me the wrong way about this is that it feels to me like commercial freeloading due to the thinness of the commercialized wrapper around a free/open core in this case (whisper model + code); it feels ethically questionable unless the author contributes back some portion of the proceeds to research in some way -- I didn't see evidence of that. I'm probably being naive here, happy to have a less snarky discussion about it though.


I have both installed. I use macwhisper because the GUI is convenient.


I just want to drag and drop my files and be done.


Why'nt?



