On a related note, for anyone interested in this who wants better performance today:
I messed with a combo of Whisper and ChatGPT. I took a Whisper transcript and asked ChatGPT to fix mistranscriptions using the context of the transcript and likely phonetic confusions. I also asked it to replace transcribed words that don't make sense with "[unintelligible]", which improved the output even more.
Transcription error rate was almost nonexistent, even on the smallest Whisper model.
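Roughly, the cleanup pass looked something like this (the model name, prompt wording, and file names here are just illustrative, not my exact setup):

    # Minimal sketch: post-correct a Whisper transcript with a chat model.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    with open("whisper_transcript.txt") as f:
        transcript = f.read()

    prompt = (
        "The following is an automatic speech-recognition transcript. "
        "Fix words that were likely mistranscribed, using the surrounding context "
        "and plausible phonetic confusions. If a word still makes no sense, "
        "replace it with [unintelligible]. Return only the corrected transcript.\n\n"
        + transcript
    )

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content)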
Probably depends on the language. I tried to transcribe some hour-long talks in Chinese with Whisper the other day. Even with the large model, it manages to skip some sentences from time to time (no idea why), and in other instances it repeats the same sentence over and over. I tweaked parameters for a long time to no avail. While GPT should be able to clean up the repetition (there will be false positives though, since human speakers do repeat themselves occasionally), it can’t really fill in the missing sentences.
I have to transcribe a tonne of Chinese interviews soonish -- any further thoughts or experiments you can think of? Maybe some preprocessing steps to the audio? For example, cut it into one minute chunks with some overlap, then transcribe those, so that it can't skip those bits...? Or can we finetune it on a library of Chinese mp3s + transcripts?
Whisper cuts the audio into 30-second chunks. So if you have a one-minute recording where the first half contains conversation and the remainder nothing, it will still try to find something in that second 30-second block, without knowing what "speech" actually sounded like in the first chunk.
Try pre-processing it so that just "voice" is detected, not the meaning, just that someone is speaking, and cut the audio into snippets which only contain speech, so that Whisper doesn't have to guess whether a segment contains speech or not.
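Something simple like pydub's silence-based splitting can get you most of the way there; a rough sketch (the thresholds are just starting points to tune per recording):

    # Split the recording on silence so Whisper only ever sees speech-bearing chunks.
    from pydub import AudioSegment
    from pydub.silence import split_on_silence

    audio = AudioSegment.from_file("interview.mp3")

    chunks = split_on_silence(
        audio,
        min_silence_len=700,             # ms of silence that counts as a break (assumed value)
        silence_thresh=audio.dBFS - 16,  # anything 16 dB below average loudness counts as silence
        keep_silence=300,                # keep a little padding so words aren't clipped
    )

    for i, chunk in enumerate(chunks):
        chunk.export(f"chunk_{i:03d}.wav", format="wav")  # feed these to Whisper one by one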
Also, if you cut it up into chunks, let it transcribe each chunk, and request JSON as the output instead of the other output formats, you'll get a bunch of extra parameters with it which will help you identify problematic sections. For example, hallucinated repetitions usually have a higher "no_speech_prob", and segments with a lower "compression_ratio" will also not be that accurate.
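If you use the Python package instead of the CLI, the same per-segment fields come back from model.transcribe(), so you can eyeball problem spots with something like this (model and file names are placeholders):

    # Print Whisper's per-segment metadata so suspicious spots
    # (e.g. unusually high no_speech_prob) stand out for manual review.
    import whisper

    model = whisper.load_model("large")
    result = model.transcribe("chunk_001.wav")

    for seg in result["segments"]:
        print(
            f'{seg["start"]:7.1f}s  '
            f'no_speech={seg["no_speech_prob"]:.2f}  '
            f'compression={seg["compression_ratio"]:.2f}  '
            f'logprob={seg["avg_logprob"]:.2f}  '
            f'{seg["text"].strip()}'
        )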
If cost isn’t an issue, I’d use one of the established commercial offerings. Otherwise, you could try splitting into shorter chunks (20 minutes maybe?), doing multiple runs on each chunk, and picking the best run according to some criterion, e.g. character count after removing repetitions. Whisper isn’t deterministic, so some runs can be better than others; you could also tweak parameters like the compression ratio or silence thresholds between runs, but in my experience there’s not going to be an obvious winner leading to a marked improvement in quality. Anyway, I’m no expert, and maybe you’ll have better luck than me. My recordings do have background music and conversations in some places that might confuse the model.
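For example, a crude way to score runs and keep the best one (the temperatures and the repetition heuristic here are just guesses, not a recommendation):

    import re
    import whisper

    model = whisper.load_model("medium")

    def score(text: str) -> int:
        # Collapse immediate repeats of any phrase of 6+ characters, then count what's
        # left: a run full of hallucinated loops scores lower than a clean one.
        collapsed = re.sub(r"(.{6,}?)\1+", r"\1", text)
        return len(collapsed)

    # Several runs at different temperatures; keep the one that scores best.
    runs = [model.transcribe("part_01.wav", temperature=t)["text"] for t in (0.0, 0.2, 0.4)]
    best = max(runs, key=score)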
Can't use third party service -- compliance stuff. This is helpful though. Using another model to tidy it up -- maybe Alpaca -- could be an option too. Then we'll just do speaker separation etc. manually later.
I use a custom babbage model fine-tuned using the original Whisper transcripts as the prompt and the fixed transcript as the result. It does a very good job at correcting common jargon and names, fixing something like 1/3 to 1/2 of all errors. It's also very good at handling transcripts which Whisper fails to punctuate for whatever reason, adding punctuation and capitalization where appropriate. I used this to fix 1000+ transcripts of lectures totaling around 700 hours of speech on the cheap.
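The training data is just prompt/completion pairs in the legacy JSONL fine-tuning format, roughly like this (contents and separators invented for illustration, not my actual data):

    # Sketch of the training-data shape for a legacy prompt/completion fine-tune
    # (babbage-style): raw Whisper text as the prompt, the hand-fixed text as the completion.
    import json

    examples = [
        {
            "prompt": "raw whisper transcript of a lecture excerpt\n\n###\n\n",
            "completion": " the hand-corrected, punctuated version of that excerpt END",
        },
    ]

    with open("corrections.jsonl", "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")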
Personally I wouldn't recommend my approach, however, unless you are okay with doing some hardcore text manipulation and fuzzy math. It fails to produce text that matches up with the prompt 10% of the time, and there are lots of other caveats around what to do with text that doesn't fit into the prompt.
I've suspected as much, which makes using the Siri and Google Assistant of today even more infuriating! I know it's going to be exponentially better just around the corner!
I've been doing the same. Apart from sometimes changing the meaning of sentences, it does a pretty good job at improving the message one is trying to convey.
In my case I'm recording notes while exercising on my bike, so there's wind and friction and breathing noise. I transcribe them with Whisper, then try to filter out things that don't make much sense (where the temperature, compression_ratio, avg_logprob and/or no_speech_prob are not acceptable).
Since I start recording at will and don't prepare the sentence to be recorded, the recording is a bit chaotic, and ChatGPT does a good job of correcting Whisper's mistakes and transforming the sentences.
But, as I said, sometimes this fails and the real message gets lost.
This is tangential to the software techniques, but have you tried using throat microphones? They're essentially the microphone counterpart to bone-conducting headphones. They're aimed at use cases with a lot of noise (originally created for paragliding but also used in biking scenarios) and low-voice scenarios (e.g. covert operations, and even amplifying the voice of people with speech problems due to things like Parkinson's disease).
I'm using them to record multiple people in the same room but avoid cross-talk (for data collection in a research study). The audio isn't great, but I've used transcription services and they seem to be able to make the words out just fine.
This is one of my frustrations with generative AI APIs: the censoring of content. I understand why it's done, but it seems overboard to me, and it has created huge headaches with material that is not at all prurient but where there are some homonyms involved.
I use Whisper actively in my daily life, but I'm curious about the prompt you used to make these corrections. I didn't get great results in my trials. What technology are you using, GPT-3.5 or GPT-4? If you could provide me with some information on this, I'd like to incorporate it into my workflow right away. I understand that you may have fine-tuned custom models based on Davinci. If anyone working in this field has a chance to give me more detailed information, I'd really appreciate it.
I wish there was a way to use Whisper for continuous recognition! Unfortunately, Whisper doesn't support streaming — and AFAIK it's unlikely that it ever will.
Nice! I tried something similar where I orally gave a speech, transcribed it and then had ChatGPT turn it into a coherent text. Saved me a bunch of time.
Yeah, there are some models that I played with that can do this. They only work for 2 or 3 speakers currently though. The term for this is "diarization".
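pyannote.audio is one openly available option (not necessarily the best); a minimal sketch, assuming you have a HuggingFace access token for its pretrained speaker-diarization pipeline:

    # Assign speaker labels to time ranges; these can then be intersected
    # with Whisper's segment timestamps afterwards.
    from pyannote.audio import Pipeline

    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token="hf_...",  # placeholder HuggingFace token
    )
    diarization = pipeline("meeting.wav")

    for turn, _, speaker in diarization.itertracks(yield_label=True):
        # Each turn has start/end times plus a label like SPEAKER_00.
        print(f"{turn.start:7.1f}s - {turn.end:7.1f}s  {speaker}")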
I wonder, do any conference call services (zoom, GMeet, etc) offer the ability to record each participant's audio stream separately in a way that would make it easy to transcribe them separately then combine?