On a related note, for anyone interested in this who wants better performance today:
I messed with a combo of Whisper and ChatGPT. I took a Whisper transcript and asked ChatGPT to fix mistranscriptions using the context of the transcript and likely phonetic confusions. I also asked it to replace transcribed words that don't make sense with "[unintelligible]", which improved the output even more.
Transcription error rate was almost nonexistent, even on the smallest Whisper model.
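Roughly, the cleanup pass looked something like this (the model name, prompt wording, and file names here are just illustrative, not my exact setup):

    # Minimal sketch: post-correct a Whisper transcript with a chat model.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    with open("whisper_transcript.txt") as f:
        transcript = f.read()

    prompt = (
        "The following is an automatic speech-recognition transcript. "
        "Fix words that were likely mistranscribed, using the surrounding context "
        "and plausible phonetic confusions. If a word still makes no sense, "
        "replace it with [unintelligible]. Return only the corrected transcript.\n\n"
        + transcript
    )

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content)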
Probably depends on the language. I tried to transcribe some hour-long talks in Chinese with Whisper the other day. Even with the large model, it manages to skip some sentences from time to time (no idea why), and in other instances it repeats the same sentence over and over. I tweaked parameters for a long time to no avail. While GPT should be able to clean up the repetition (there will be false positives though, since human speakers do repeat themselves occasionally), it can’t really fill in the missing sentences.
I have to transcribe a tonne of Chinese interviews soonish -- any further thoughts or experiments you can think of? Maybe some preprocessing steps to the audio? For example, cut it into one minute chunks with some overlap, then transcribe those, so that it can't skip those bits...? Or can we finetune it on a library of Chinese mp3s + transcripts?
Whisper cuts the audio into 30-second chunks. So if you have a one-minute recording where the first half contains conversation and the remainder nothing, it will still try to find something in that second 30-second block, without knowing what "speech" actually sounded like in the first chunk.
Try pre-processing it so that just "voice" is detected, not the meaning, just that someone is speaking, and cut the audio into snippets which only contain speech, so that Whisper doesn't have to guess whether a segment contains speech or not.
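Something simple like pydub's silence-based splitting can get you most of the way there; a rough sketch (the thresholds are just starting points to tune per recording):

    # Split the recording on silence so Whisper only ever sees speech-bearing chunks.
    from pydub import AudioSegment
    from pydub.silence import split_on_silence

    audio = AudioSegment.from_file("interview.mp3")

    chunks = split_on_silence(
        audio,
        min_silence_len=700,             # ms of silence that counts as a break (assumed value)
        silence_thresh=audio.dBFS - 16,  # anything 16 dB below average loudness counts as silence
        keep_silence=300,                # keep a little padding so words aren't clipped
    )

    for i, chunk in enumerate(chunks):
        chunk.export(f"chunk_{i:03d}.wav", format="wav")  # feed these to Whisper one by one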
Also, if you cut it up into chunks, let it transcribe each chunk, and request JSON as the output instead of the other output formats, you'll get a bunch of extra parameters with it which will help you identify problematic sections. For example, hallucinated repetitions usually have a higher "no_speech_prob", and segments with a lower "compression_ratio" will also not be that accurate.
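If you use the Python package instead of the CLI, the same per-segment fields come back from model.transcribe(), so you can eyeball problem spots with something like this (model and file names are placeholders):

    # Print Whisper's per-segment metadata so suspicious spots
    # (e.g. unusually high no_speech_prob) stand out for manual review.
    import whisper

    model = whisper.load_model("large")
    result = model.transcribe("chunk_001.wav")

    for seg in result["segments"]:
        print(
            f'{seg["start"]:7.1f}s  '
            f'no_speech={seg["no_speech_prob"]:.2f}  '
            f'compression={seg["compression_ratio"]:.2f}  '
            f'logprob={seg["avg_logprob"]:.2f}  '
            f'{seg["text"].strip()}'
        )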
If cost isn’t an issue, I’d use one of the established commercial offerings. Otherwise, you could try splitting into shorter chunks (20 minutes maybe?), doing multiple runs on each chunk, and picking the best run according to some criterion, e.g. character count after removing repetitions. Whisper isn’t deterministic, so some runs can be better than others; you could also tweak parameters like the compression ratio or silence thresholds between runs, but in my experience there’s not going to be an obvious winner leading to a marked improvement in quality. Anyway, I’m no expert, and maybe you’ll have better luck than me. My recordings do have background music and conversations in some places that might confuse the model.
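For example, a crude way to score runs and keep the best one (the temperatures and the repetition heuristic here are just guesses, not a recommendation):

    import re
    import whisper

    model = whisper.load_model("medium")

    def score(text: str) -> int:
        # Collapse immediate repeats of any phrase of 6+ characters, then count what's
        # left: a run full of hallucinated loops scores lower than a clean one.
        collapsed = re.sub(r"(.{6,}?)\1+", r"\1", text)
        return len(collapsed)

    # Several runs at different temperatures; keep the one that scores best.
    runs = [model.transcribe("part_01.wav", temperature=t)["text"] for t in (0.0, 0.2, 0.4)]
    best = max(runs, key=score)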
Can't use third party service -- compliance stuff. This is helpful though. Using another model to tidy it up -- maybe Alpaca -- could be an option too. Then we'll just do speaker separation etc. manually later.
I use a custom babbage model fine-tuned using the original Whisper transcripts as the prompt and the fixed transcript as the result. It does a very good job at correcting common jargon and names, fixing something like 1/3 to 1/2 of all errors. It's also very good at handling transcripts which Whisper fails to punctuate for whatever reason, adding punctuation and capitalization where appropriate. I used this to fix 1000+ transcripts of lectures totaling around 700 hours of speech on the cheap.
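The training data is just prompt/completion pairs in the legacy JSONL fine-tuning format, roughly like this (contents and separators invented for illustration, not my actual data):

    # Sketch of the training-data shape for a legacy prompt/completion fine-tune
    # (babbage-style): raw Whisper text as the prompt, the hand-fixed text as the completion.
    import json

    examples = [
        {
            "prompt": "raw whisper transcript of a lecture excerpt\n\n###\n\n",
            "completion": " the hand-corrected, punctuated version of that excerpt END",
        },
    ]

    with open("corrections.jsonl", "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")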
Personally I wouldn't recommend my approach, however, unless you are okay with doing some hardcore text manipulation and fuzzy math. It fails to produce text that matches up with the prompt 10% of the time, and there are lots of other caveats around what to do with text that doesn't fit into the prompt.
I've suspected as much, which makes using the Siri and Google Assistant of today even more infuriating! I know it's going to be exponentially better just around the corner!
I've been doing the same. Apart from sometimes changing the meaning of sentences, it does a pretty good job at improving the message one is trying to convey.
In my case I'm recording notes while exercising on my bike, so there's wind and friction and breathing noise. I transcribe them with Whisper, then try to filter out things that don't make much sense (where the temperature, compression_ratio, avg_logprob and/or no_speech_prob are not acceptable).
Since I start recording at will and don't prepare the sentence to be recorded, the recording is a bit chaotic, and ChatGPT does a good job of correcting Whisper's mistakes and transforming the sentences.
But, as I said, sometimes this fails and the real message gets lost.
This is tangential to the software techniques, but have you tried using throat microphones? They're essentially the microphone counterpart to bone-conducting headphones. They're aimed at use cases with a lot of noise (originally created for paragliding but also used in biking scenarios) and low-voice scenarios (e.g. covert operations, and even amplifying the voice of people with speech problems due to things like Parkinson's disease).
I'm using them to record multiple people in the same room but avoid cross-talk (for data collection in a research study). The audio isn't great, but I've used transcription services and they seem to be able to make the words out just fine.
This is one of my frustrations with generative AI APIs: the censoring of content. I understand why it's done, but it seems overboard to me, and it has created huge headaches with material that is not at all prurient but where there are some homonyms involved.
I use Whisper actively in my daily life, but I'm curious about the prompt you used to make these corrections. I didn't get great results in my trials. What technology are you using, GPT-3.5 or GPT-4? If you could provide me with some information on this, I'd like to incorporate it into my workflow right away. I understand that you may have fine-tuned custom models based on Davinci. If anyone working in this field has a chance to give me more detailed information, I'd really appreciate it.
I wish there was a way to use Whisper for continuous recognition! Unfortunately, Whisper doesn't support streaming — and AFAIK it's unlikely that it ever will.
Nice! I tried something similar where I orally gave a speech, transcribed it and then had ChatGPT turn it into a coherent text. Saved me a bunch of time.
Yeah, there are some models that I played with that can do this. They only work for 2 or 3 speakers currently though. The term for this is "diarization".
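pyannote.audio is one openly available option (not necessarily the best); a minimal sketch, assuming you have a HuggingFace access token for its pretrained speaker-diarization pipeline:

    # Assign speaker labels to time ranges; these can then be intersected
    # with Whisper's segment timestamps afterwards.
    from pyannote.audio import Pipeline

    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token="hf_...",  # placeholder HuggingFace token
    )
    diarization = pipeline("meeting.wav")

    for turn, _, speaker in diarization.itertracks(yield_label=True):
        # Each turn has start/end times plus a label like SPEAKER_00.
        print(f"{turn.start:7.1f}s - {turn.end:7.1f}s  {speaker}")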
I wonder, do any conference call services (zoom, GMeet, etc) offer the ability to record each participant's audio stream separately in a way that would make it easy to transcribe them separately then combine?