
On a related note, for anyone interested in this who wants better performance today:

I messed with a combo of Whisper and ChatGPT. I took a Whisper transcript and asked ChatGPT to fix mistranscriptions using the context of the transcript and likely phonetic confusions. I also asked it to replace transcribed words that don't make sense with "[unintelligible]", which improved the output even more.
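
The rough shape of that pipeline looks something like this (a minimal sketch with the openai Python client and the open-source whisper package; the prompt wording and model names are illustrative):

  import whisper
  from openai import OpenAI

  # 1. Transcribe with a small Whisper model.
  model = whisper.load_model("base")
  transcript = model.transcribe("talk.mp3")["text"]

  # 2. Ask a chat model to repair likely mistranscriptions from context,
  #    marking anything it can't recover as [unintelligible].
  client = OpenAI()
  response = client.chat.completions.create(
      model="gpt-3.5-turbo",
      messages=[
          {"role": "system", "content": (
              "You fix speech-to-text errors. Using the surrounding context and "
              "plausible phonetic confusions, correct words that don't make sense. "
              "If a word cannot be recovered, replace it with [unintelligible]. "
              "Do not otherwise add or remove content."
          )},
          {"role": "user", "content": transcript},
      ],
  )
  print(response.choices[0].message.content)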

Transcription error rate was almost nonexistent, even on the smallest whisper model.




Probably depends on the language. I tried to transcribe some hour-long talks in Chinese with Whisper the other day. Even with the large model, it manages to skip some sentences from time to time (no idea why), and in other instances repeats the same sentence over and over. I tweaked parameters for a long time to no avail. While GPT should be able to clean up the repetition (there will be false positives though, since human speakers do repeat themselves occasionally), it can’t really fill in the missing sentences.


I have to transcribe a tonne of Chinese interviews soonish -- any further thoughts or experiments you can think of? Maybe some preprocessing steps to the audio? For example, cut it into one minute chunks with some overlap, then transcribe those, so that it can't skip those bits...? Or can we finetune it on a library of Chinese mp3s + transcripts?


Whisper cuts the audio into chunks of 30 seconds. So if you have a one-minute recording where the first half has conversation in it and the remainder nothing, it will still think it has to find something in that second 30-second block, without knowing what "speech" actually sounds like, the way it did in the first chunk.

Try to pre-process it so that just "voice" is detected -- not the meaning, just whether someone is speaking -- and cut the audio into snippets that contain only speech, so that Whisper doesn't have to guess whether a segment contains speech at all.
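
As a sketch of that pre-processing step, something like the webrtcvad package can flag which frames contain speech (the frame size and aggressiveness value here are arbitrary starting points):

  import webrtcvad

  def speech_frames(pcm16: bytes, sample_rate: int = 16000, frame_ms: int = 30):
      # Yield (is_speech, frame) for consecutive 30 ms frames of 16-bit mono PCM.
      vad = webrtcvad.Vad(2)  # aggressiveness from 0 (lenient) to 3 (strict)
      frame_bytes = int(sample_rate * frame_ms / 1000) * 2  # 2 bytes per sample
      for i in range(0, len(pcm16) - frame_bytes + 1, frame_bytes):
          frame = pcm16[i:i + frame_bytes]
          yield vad.is_speech(frame, sample_rate), frame

  # Keep only the speech frames (or per-utterance snippets built from them)
  # and feed those to Whisper instead of the raw recording.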

Also, if you cut it up into chunks, transcribe each chunk, and request JSON output instead of the other output formats, you'll get a bunch of extra parameters that help you identify problematic sections. For example, hallucinated repetitions usually have a higher "no_speech_prob", and segments with a poor "compression_ratio" also tend to be less accurate.
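
Filtering on those per-segment fields looks roughly like this (a sketch with the open-source whisper package; the thresholds are starting points to tune per recording, though 2.4 is the compression-ratio cutoff Whisper itself uses to flag failed decoding):

  import whisper

  model = whisper.load_model("small")
  result = model.transcribe("chunk_03.wav")

  for seg in result["segments"]:
      # A high no_speech_prob or a very high compression_ratio (repeated text
      # compresses well) are both hints that the segment may be hallucinated.
      suspicious = seg["no_speech_prob"] > 0.6 or seg["compression_ratio"] > 2.4
      marker = "??" if suspicious else "  "
      print(f'{marker} [{seg["start"]:7.2f}-{seg["end"]:7.2f}]{seg["text"]}')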


If cost isn’t an issue, I’d use one of the established commercial offerings. Otherwise, you could try splitting into shorter chunks (20 minutes maybe?), do multiple runs on each chunk and pick out the best run according to some criteria, e.g. character count after removing repetitions. Whisper isn’t deterministic so some runs can be better than others; you could also tweak parameters like compression ratio or silence thresholds between runs, but in my experience there’s not going to be an obvious winner leading to a marked improvement in quality. Anyway, I’m no expert, and maybe you’ll have better luck than me. My recordings do have background music and conversations in some places that might confuse the model.
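
For the "pick the best run" part, a rough sketch of what I mean (the temperatures and the repetition-collapsing rule are just one possible criterion):

  import re
  import whisper

  model = whisper.load_model("large")

  def dedup_length(text: str) -> int:
      # Character count after collapsing immediately repeated sentences.
      sentences = re.split(r"(?<=[。！？.!?])\s*", text)
      kept = [s for i, s in enumerate(sentences) if i == 0 or s != sentences[i - 1]]
      return len("".join(kept))

  # Transcribe the same chunk a few times and keep the run that retains
  # the most non-repeated text.
  runs = [model.transcribe("chunk_01.wav", temperature=t) for t in (0.0, 0.2, 0.4)]
  best = max(runs, key=lambda r: dedup_length(r["text"]))
  print(best["text"])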


Can't use third party service -- compliance stuff. This is helpful though. Using another model to tidy it up -- maybe Alpaca -- could be an option too. Then we'll just do speaker separation etc. manually later.


I use a custom babbage model fine-tuned using the original Whisper transcripts as the prompt and the fixed transcript as the result. It does a very good job of correcting common jargon and names, fixing something like 1/3 to 1/2 of all errors. It's also very good at repairing transcripts where Whisper fails to add punctuation for whatever reason, adding punctuation and capitalization where appropriate. I used this to fix 1000+ transcripts of lectures totaling around 700 hours of speech on the cheap.

Personally I wouldn't recommend my approach, however, unless you're okay with doing some hardcore text manipulation and fuzzy math. It fails to produce text that matches up with the prompt about 10% of the time, and there are lots of other caveats about what to do with text that doesn't fit into the prompt.
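
For anyone curious about the training data shape: the legacy prompt/completion fine-tunes expect JSONL pairs, roughly like this (the separator and stop token follow the old OpenAI fine-tuning guide's suggestions, and the strings are placeholders):

  import json

  # Each example pairs a raw Whisper transcript with its hand-corrected version.
  pairs = [
      ("raw whisper output with jargon errors ...", "corrected transcript ..."),
  ]
  with open("train.jsonl", "w") as f:
      for raw, fixed in pairs:
          f.write(json.dumps({
              "prompt": raw + "\n\n###\n\n",        # separator marks end of prompt
              "completion": " " + fixed + " END",   # leading space + stop token
          }) + "\n")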


I've suspected as such, which makes using the Siri and Google Assistant of today even more infuriating! I know it's going to be exponentially better just around the corner!


I've been doing the same. Apart from sometimes changing the meaning of sentences, it does a pretty good job at improving the message one is trying to convey.

In my case I'm recording notes while exercising on my bike, so there's wind and friction and breathing. I then transcribe them with Whisper and try to filter out things that don't make much sense (where the temperature, compression_ratio, avg_logprob and/or no_speech_prob are not acceptable).

Since I start recording at will and don't prepare the sentences beforehand, the recording is a bit chaotic, and ChatGPT does a good job of correcting Whisper's mistakes and reshaping the sentences.

But, as I said, sometimes this fails and the real message gets lost.


This is tangential to the software techniques, but have you tried using throat microphones? They're essentially the microphone counterpart to bone-conducting headphones. They're aimed at use cases with a lot of noise (originally created for paragliding but also used in biking scenarios) and at low-voice scenarios (e.g. covert operations, and even amplifying the voice of people with speech problems due to things like Parkinson's disease).

I'm using them to record multiple people in the same room but avoid cross-talk (for data collection in a research study). The audio isn't great, but I've used transcription services and they seem to be able to make the words out just fine.


I tried this as well but ChatGPT just refused to do the transcription fixing because PG13 movies contain content that is too controversial.


This is one of my frustrations with generative AI APIs: the censoring of content. I understand why it's done, but it seems overboard to me, and it has created huge headaches with material that is not at all prurient but where homonyms are involved.


Checkout https://exemplary.ai - we incorporate a similar approach with our transcript editor.


I use Whisper actively in my daily life, but I'm curious about the prompt you used to make these corrections. I didn't get great results in my trials. What are you using: GPT-3.5 or GPT-4? If you could provide me with some information on this, I'd like to incorporate it into my workflow right away. I understand that you may have fine-tuned a custom model on Davinci. If anyone working in this field has a chance to give me more detailed information, I'd really appreciate it.


Signed up for early access.


Great idea. Any code available? I haven't used the Whisper API yet, but the OpenAI API for ChatGPT seems pretty easy to work with.


Ask ChatGPT ;) (that's what I did)


Have you tried transcripts with swearing? Last time I tried, it censored "shit" with "stuff".

Not ideal; maybe I should've asked it to censor with asterisks instead of changing the whole word.


What fixed it for me was whisper + language tool. It produces very good results with almost no errors, even if the audio quality is bad.


Would you mind expanding on this? I'm curious about the "+ language tool" part. Which tool?


Not op, but he probably literally meant LanguageTool [0], an open-source Grammarly alternative.

[0] https://languagetool.org/
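
If that's the one, the language_tool_python wrapper makes it easy to run a Whisper transcript through it (a minimal sketch; it downloads and starts a local LanguageTool server on first use):

  import language_tool_python

  tool = language_tool_python.LanguageTool("en-US")
  raw = "the meeting were moved too tuesday becuase of the holiday"
  print(tool.correct(raw))  # applies the top suggestion for each rule match
  tool.close()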


Correct!


Even something like "spacy" can help with the grammar.


I wish there was a way to use Whisper for continuous recognition! Unfortunately, Whisper doesn't support streaming — and AFAIK it's unlikely that it ever will.


I've not tried it because I transcribe Japanese podcasts, but whisper.cpp[0] seems to support streaming.

[0]: https://github.com/ggerganov/whisper.cpp#real-time-audio-inp...


Nice! I tried something similar where I orally gave a speech, transcribed it and then had ChatGPT turn it into a coherent text. Saved me a bunch of time.


Did anyone find a solution to have whisper differentiate between multiple speakers in a conversation and mark them in the written output?


Yeah, there are some models that I played with that can do this. They only work for 2 or 3 speakers currently though. The term for this is "diarization".

https://huggingface.co/pyannote/speaker-diarization
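
Roughly, per that model card, usage looks like this (needs a Hugging Face access token with the model's terms accepted):

  from pyannote.audio import Pipeline

  pipeline = Pipeline.from_pretrained(
      "pyannote/speaker-diarization", use_auth_token="YOUR_HF_TOKEN")
  diarization = pipeline("audio.wav")

  # Each turn is a time span attributed to one speaker label, which you can
  # then line up with Whisper's segment timestamps.
  for turn, _, speaker in diarization.itertracks(yield_label=True):
      print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")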


I wonder, do any conference call services (zoom, GMeet, etc) offer the ability to record each participant's audio stream separately in a way that would make it easy to transcribe them separately then combine?


FWIW, GMeet supports meeting transcription natively.

https://support.google.com/meet/answer/12849897?hl=en


Zoom has this option.


Thanks


How do you work around the file size limit and generate a transcript in VTT or SRT? The shifting of timestamps is a major pain.
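
Right now I'm stitching chunks together by hand, roughly like this rough sketch with the open-source whisper package (the chunk files and offsets are placeholders):

  import whisper

  def fmt(seconds: float) -> str:
      # SRT timestamp: HH:MM:SS,mmm
      ms = int(round(seconds * 1000))
      h, ms = divmod(ms, 3_600_000)
      m, ms = divmod(ms, 60_000)
      s, ms = divmod(ms, 1000)
      return f"{h:02}:{m:02}:{s:02},{ms:03}"

  model = whisper.load_model("small")
  chunks = [("part1.wav", 0.0), ("part2.wav", 600.0)]  # (file, offset in seconds)

  index = 1
  with open("combined.srt", "w") as srt:
      for path, offset in chunks:
          for seg in model.transcribe(path)["segments"]:
              srt.write(f"{index}\n"
                        f"{fmt(seg['start'] + offset)} --> {fmt(seg['end'] + offset)}\n"
                        f"{seg['text'].strip()}\n\n")
              index += 1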


just found out about this today, maybe it's helpful:

https://github.com/m-bain/whisperX


I like output from https://github.com/jianfch/stable-ts way more


This deserves a paper! I look forward to your publication.


Can you share your prompt?



