On a related note, for anyone interested in this who wants better performance today:
I messed with a combo of Whisper and ChatGPT. I took a Whisper transcript and asked ChatGPT to fix mistranscriptions using the context of the transcript and potential phonetic issues. I also asked it to replace transcribed words that don't make sense with "[unintelligible]", which improved the output even more.
Transcription error rate was almost nonexistent, even on the smallest whisper model.
Probably depends on the language. I tried to transcribe some hour-long talks in Chinese with Whisper the other day. Even with the large model, it skips some sentences from time to time (no idea why) and repeats the same sentence over and over in other places. I tweaked parameters for a long time to no avail. While GPT should be able to clean up the repetition (there will be false positives, though, since human speakers do occasionally repeat themselves), it can't really fill in the missing sentences.
I have to transcribe a tonne of Chinese interviews soonish -- any further thoughts or experiments you can think of? Maybe some preprocessing steps to the audio? For example, cut it into one minute chunks with some overlap, then transcribe those, so that it can't skip those bits...? Or can we finetune it on a library of Chinese mp3s + transcripts?
Whisper cuts the audio into chunks of 30 seconds. So if you have a one-minute recording where the first half contains conversation and the second half nothing, it will still think it has to find something in that second 30-second block, without knowing what "speech" actually sounded like, the way it did in the first chunk.
Try pre-processing it with plain voice activity detection (not recognizing the meaning, just detecting that someone is speaking) and cut the audio into snippets that contain only speech, so that Whisper doesn't have to guess whether a segment contains speech or not.
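Something like this rough sketch could serve as that pre-processing step (assuming 16 kHz, 16-bit mono WAV input and the py-webrtcvad package; the frame size and aggressiveness below are picked arbitrarily):

    import wave
    import webrtcvad

    def frames_with_speech_flag(path, frame_ms=30, aggressiveness=2):
        # Yield (raw_pcm_frame, is_speech) pairs using WebRTC's voice activity detector.
        vad = webrtcvad.Vad(aggressiveness)
        with wave.open(path, "rb") as wf:
            sr = wf.getframerate()                      # webrtcvad expects 8/16/32/48 kHz
            samples_per_frame = int(sr * frame_ms / 1000)
            while True:
                frame = wf.readframes(samples_per_frame)
                if len(frame) < samples_per_frame * 2:  # 16-bit mono = 2 bytes per sample
                    break
                yield frame, vad.is_speech(frame, sr)

    def speech_only_pcm(path):
        # Keep just the voiced frames; split on long silences before feeding Whisper.
        return b"".join(f for f, voiced in frames_with_speech_flag(path) if voiced)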
Also, if you cut it up into chunks, let it transcribe each chunk, and request JSON output instead of the other output formats, you'll get a bunch of extra parameters with it that help you identify problematic sections. For example, hallucinated repetitions usually have a higher "no_speech_prob", and segments with a lower "compression_ratio" also tend to be less accurate.
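For illustration, a sketch of that kind of filtering with the openai-whisper Python package (the thresholds here are made up and would need tuning on your own audio; the file name is just an example):

    import whisper

    model = whisper.load_model("medium")
    result = model.transcribe("talk_chunk_03.wav")

    kept, suspect = [], []
    for seg in result["segments"]:
        # Flag segments whose metadata looks off; hallucinated repetitions often
        # show a high no_speech_prob and/or an unusual compression_ratio.
        if seg["no_speech_prob"] > 0.5 or not (1.0 < seg["compression_ratio"] < 2.4):
            suspect.append(seg)
        else:
            kept.append(seg["text"])

    print(" ".join(kept))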
If cost isn't an issue, I'd use one of the established commercial offerings. Otherwise, you could try splitting into shorter chunks (20 minutes maybe?), doing multiple runs on each chunk, and picking the best run according to some criterion, e.g. character count after removing repetitions. Whisper isn't deterministic, so some runs can be better than others; you could also tweak parameters like compression ratio or silence thresholds between runs, but in my experience there isn't going to be an obvious winner leading to a marked improvement in quality. Anyway, I'm no expert, and maybe you'll have better luck than me. My recordings do have background music and conversations in some places that might confuse the model.
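A toy version of that selection criterion (collapse immediate repetitions, keep the run with the most text left) might look like this; the sentence splitting is simplistic and the inputs are placeholders:

    import re

    def collapse_repeats(text):
        # Drop a sentence when it is identical to the one right before it.
        sentences = re.split(r"(?<=[。！？.!?])\s*", text)
        deduped = [s for i, s in enumerate(sentences) if i == 0 or s != sentences[i - 1]]
        return "".join(deduped)

    def best_run(candidate_transcripts):
        return max(candidate_transcripts, key=lambda t: len(collapse_repeats(t)))

    runs = ["...output of run 1...", "...output of run 2...", "...output of run 3..."]
    print(best_run(runs))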
Can't use third party service -- compliance stuff. This is helpful though. Using another model to tidy it up -- maybe Alpaca -- could be an option too. Then we'll just do speaker separation etc. manually later.
I use a custom babbage model fine-tuned using the original Whisper transcripts as the prompt and the fixed transcript as the result. It does a very good job of correcting common jargon and names, fixing something like 1/3 to 1/2 of all errors. It's also very good at repairing transcripts that Whisper, for whatever reason, fails to punctuate, adding punctuation and capitalization where appropriate. I used this to fix 1000+ transcripts of lectures totaling around 700 hours of speech on the cheap.
I personally wouldn't recommend my approach, however, unless you are okay with doing some hardcore text manipulation and fuzzy math. It fails to produce text that matches up with the prompt about 10% of the time, and there are lots of other caveats around what to do with text that doesn't fit into the prompt.
I've suspected as such, which makes using the Siri and Google Assistant of today even more infuriating! I know it's going to be exponentially better just around the corner!
I've been doing the same. Apart from sometimes changing the meaning of sentences, it does a pretty good job at improving the message one is trying to convey.
In my case I'm recording notes while exercising on my bike, so there's wind and friction and breathing. I then transcribe them with Whisper and try to filter out things that don't make much sense (where the temperature, compression_ratio, avg_logprob and/or no_speech_prob are not acceptable).
Since I start recording at will and don't prepare the sentence to be recorded, the recording is a bit chaotic, and ChatGPT does a good job in correcting Whisper's mistakes and transforming the sentences.
But, as I said, sometimes this fails and the real message gets lost.
This is tangential to the software techniques, but have you tried using throat microphones? They're essentially the microphone counterpart to the bone conducting headphones. They're aimed at use cases where there's a lot of noise (originally created for paragliding but also used in biking scenarios), and low voice scenarios (eg covert operations and even to amplify voice from people having speech problems due to things like Parkinson's disease).
I'm using them to record multiple people in the same room but avoid cross-talk (for data collection in a research study). The audio isn't great, but I've used transcription services and they seem to be able to make the words out just fine.
This is one of my frustrations with generative AI APIs, the censoring of content. I understand why it's done but it seems overboard to me, and has created huge headaches with material that is not at all prurient, but where there's some homonyms involved.
I use Whisper actively in my daily life, but I'm curious about the prompt you used to make these corrections. I didn't get great results in my trials. What technology are you using, GPT-3.5 or GPT-4? If you could provide me with some information on this, I'd like to incorporate it into my workflow right away. I understand that you may have fine-tuned a custom model on Davinci. If anyone working in this field has a chance to give me more detailed information, I'd really appreciate it.
I wish there was a way to use Whisper for continuous recognition! Unfortunately, Whisper doesn't support streaming — and AFAIK it's unlikely that it ever will.
Nice! I tried something similar where I orally gave a speech, transcribed it and then had ChatGPT turn it into a coherent text. Saved me a bunch of time.
Yeah, there are some models I've played with that can do this. They only work for 2 or 3 speakers currently, though. The term for this is "diarization".
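If it helps anyone, a minimal sketch of running one of those diarization models via the pyannote.audio library (this assumes a Hugging Face access token and that the pretrained pipeline name below is still current; the file name is a placeholder):

    from pyannote.audio import Pipeline

    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",  # gated model; requires accepting its terms
        use_auth_token="YOUR_HF_TOKEN",
    )
    diarization = pipeline("meeting.wav")

    for turn, _, speaker in diarization.itertracks(yield_label=True):
        print(f"{turn.start:6.1f}s - {turn.end:6.1f}s  {speaker}")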
I wonder, do any conference call services (zoom, GMeet, etc) offer the ability to record each participant's audio stream separately in a way that would make it easy to transcribe them separately then combine?
> USM, which is for use in YouTube (e.g., for closed captions), can perform automatic speech recognition (ASR) on widely-spoken languages like English and Mandarin, but also languages like Punjabi, Assamese, Santhali, Balinese, Shona, Malagasy, Luganda, Luo, Bambara, Soga, Maninka, Xhosa, Akan, Lingala, Chichewa, Nkore, Nzema to name a few.
I find this part to be the most impressive thing in the OP. Most of those languages are spoken by fewer than 0.1% of the world and Nzema is spoken by less than half a million people. Where are they even getting enough training data to figure those languages out?
They used data from youtube, with multiple selection methods. A little supervised data for 72 languages (90k hours of audio with text labels provided), some pseudo-supervised data in english (100k hours; the english labels were generated by having a model do the labeling), and a LOT of unsupervised data in 568 languages (12.1 million hours). That last group has no labels (aka captions) available, just the audio, but they create pseudo-labels by using the FLEURS dataset. I'm not really sure how this works as FLEURS itself only has 102 languages... I guess in different words, it just makes some phonetic impression of how that language should be written and then compares itself to how well it can do that impression?
> We demonstrate that utilizing a large unlabeled multilingual dataset to pre-train the encoder of our model and fine-tuning on a smaller set of labeled data enables us to recognize these under-represented languages. Moreover, our model training process is effective for adapting to new languages and data.
What this implies is that the core encoder model (trained on popular languages with a ton of data) does a really good job at learning the generalized basics of any language (period).
Then, they enter a comparatively minuscule amount of labeled data from the small languages at the end (supervised fine tuning - since parameters are tuned iteratively, the latest training data has the biggest impact on the final performance). The model has enough of a general “understanding” of grammar across languages that it can fill in the gaps it doesn’t get from that small labeled set.
That's the truly crazy part. The pre-trained encoder is an ersatz Rosetta Stone.
Cool though not entirely surprising that a network that learns how to process many human languages might develop internal subnets/structures that generalise to nearly all human languages. Could be seen as a humanlang coprocessor, as a sibling says.
Understanding those generic internal structures would be fascinating.
Right? I think understanding, or perhaps a better word is representing, the internal structure is also best left to a machine to extract. If it can categorize/cluster into various neural neighborhoods then we can feed that output (or perhaps many outputs from different runs) into another system that is responsible for explaining to humans.
Well, it’s certainly not evidence against, but it’s still possible that there are n-families of unrelated grammar structures and Google’s training caught enough examples of each of those families via popular languages.
Also, for all we know, they failed on a bunch of languages (unless there’s a paper somewhere I missed, but sadly the days of good AI papers from for-profits seems to be over).
Not necessarily. Universal grammar is about an innate grammar that has to be universal for any Homo sapiens language, whereas this empirical observation is more about the similarity of particular languages, many of which are significantly related to each other.
For example, if it had historically turned out (or will turn out in some future) that everyone on Earth speaks some variant of English or other Indo-European language; or some variant of Chinese or other Sino-Tibetan, then there would be many features of grammar that technically are universal across the population but that wouldn't be evidence that "universal grammar" (as used by Chomsky) is indeed a thing.
There is a lot of evidence that training can model underlying structures in languages. This means that the model doesn’t need to see as much of certain kinds of languages and can benefit from a larger corpus of various different languages.
Wouldn't the simpler explanation be that the pretraining teaches the model to extract human voices and phonemes from general audio, and so extending the model to other languages is then easier because the model is already working from phonemes?
It probably goes even beyond that. Most languages are related to other languages, and somebody good in German and Norwegian could do an ok job transcribing Swedish with minimal training. Similarly, learning English is a lot easier for German speakers than for Chinese, even though German and English are quite different in pronunciation.
The more languages from a language family a human speaks, the easier it is for them to learn each additional language from that family. It stands to reason that deep learning can benefit similarly.
Chomsky postulates that all languages boil down to a universal set of rules, and he went as far as saying that this is biologically motivated, as if there was a "language device" in the brain.
Norvig was one of the pioneers in proposing that NLP could be solved by statistical machine learning on large quantities of data.
So the joke here is that we're using large-scale machine learning to infer the underlying universal structure of language.
---
EDIT: After writing the comment above I asked ChatGPT-4 the same question, and it came up with the following answer, which I think is flawless.
---
The last sentence in the paragraph is referencing two well-known figures in the field of linguistics and artificial intelligence: Noam Chomsky and Peter Norvig.
Noam Chomsky is a linguist who proposed the idea of a "universal grammar," suggesting that all human languages share a common underlying structure, and that humans have an innate ability to acquire language. This idea implies that even with limited exposure to a specific language, humans can still learn it thanks to this innate structure.
Peter Norvig is a computer scientist and artificial intelligence researcher who is known for advocating a more data-driven, statistical approach to natural language processing. This approach emphasizes the importance of large datasets in teaching machines to understand and process language, rather than relying on pre-defined rules or structures.
The sentence "It's like Chomsky, but with the Norvig approach" suggests that the described language model incorporates elements of both Chomsky's universal grammar and Norvig's data-driven approach. The model benefits from an underlying structure (akin to Chomsky's universal grammar) while also leveraging a large corpus of various languages (as advocated by Norvig) to improve its performance in understanding and processing language.
We've been researching different speech models at Scrimba and went with Whisper on our own infrastructure. A few days ago I stumbled onto Deepgram, which blows Whisper out of the water in terms of speed and accuracy (we need high-precision word-level timestamps). I thought their claim of being 80x faster than Whisper had to be hyperbole, but it turned out to be true for us. Would recommend checking it out for anyone who needs performant speech-to-text.
In my experience the accuracy is at least a bit better than whisper-small on their enhanced models. But we've just started using it so haven't had time to do many direct comparisons with whisper. Their word-timestamps are _much_ better, which is important if you want to be able to edit the audio based on the transcription.
As for speed I have no idea how they make it so fast, but I'm sure they've written about it somewhere. My guess is at least that they are slicing the audio and parallelising it. Will look into Conformer-1 as well!
I saw Deepgram's claims as well and believed them too; then I tried it, and it was TERRIBLE. Don't believe them. It only does well on the benchmark they trained it on. It is faster, though, but the quality is terrible.
Did you try their enhanced models? We're using it for relatively high-quality audio files and their accuracy is better than the whisper small.en model. More importantly, their word level timestamps is worlds better than whisper.
Yeah, I'm not sure why people get so hyped up about Whisper. In production use it's middling at best, and there are commercial offerings that handily beat it in both accuracy and speed.
In most real-world settings, at least in my personal use, latency to a remote AI makes up most of the usability difficulty with automated speech recognition. The larger Whisper models can be run directly on a laptop using multithreading and achieve speech-to-text transcription that is sufficient to almost completely write whole emails, papers, and documents. In fact, I've written most of this comment using an ASR system on my phone that uses Whisper. While the smaller models (like the one used here) can need some correction, the bigger ones are almost perfect. They are more than sufficient, and for realtime interactive use I see no future market for paid APIs.
Yesterday I wrote virtually all the prose in the manuscript while walking around with a friend and discussing it. We didn't even look at the phone.
Obviously there's an academic element here because I'm saying I'm using it for writing. But it's more of a human-centric computing thing. I'm replacing a lot of time that my thumbs are spent tapping on keys, my fingers are spent tapping on keyboard, and my eyes are spent staring at the words that are appearing, looking for typographical errors to correct, with time organizing my thoughts in a coherent way that can be spoken and read easily. I'm basically using whisper to create a new way to write that's more fluid, direct, and flows exactly as my speech does. I've tried this for years with all of the various ASR models on all the phones I've had and never been satisfied in the same way.
Whisper democratises high-quality transcription of the languages I personally care about, whether using a CPU or a GPU. It's FOSS, self-hostable, and very easy to use.
That's a big deal for someone who wants to build something using speech recognition (voice assistant, transcription of audio or video) without resorting to APIs.
Is this in the realm of aspiration or something you've actually worked on? Because Whisper is incredibly difficult (I'd say impossible) to use in a real-time conversational setting. The transcription speed is too slow for interactive use, even on a GPU, once you step up above tiny or base. And when you step down that low, the accuracy is atrocious (especially in noisy settings or with accented voices), and then you have to post-process the output with a good NLP model to make it usable in whatever actions you're driving.
Look, it's nice that it's out there and free to use. For the sake of my wallet I hope it gets really good. But it isn't competitive with top of the line commercial offerings if you need to ship something today.
At the company I work for we are currently in the process of transitioning from being a German-only to an international company. Recently, we started using Whisper to live-transcribe + translate all our (German) all-hands meetings to English. Yes, it required some fine-tuning (i.e. what size do you choose for the audio chunks you pass to Whisper) but overall it's been working incredibly well – almost flawlessly so. I don't recall which model size we use but it does run on a beefy GPU.
What if you are operating in an internet-denied application? A remote radio sensing system with a low bandwidth back channel or an airplane cabin, just to name two.
Actively working on it. I've not noticed any performance problems even with the large model (though the plan was always to run the speech recognition on a GPU - your use case may differ). It seems to be doing fairly well even with slightly noisy inputs, and certainly has better bang/$ than other non-API solutions that service my native language.
While true real-time would definitely be nice, I can approximate it well enough with various audio slicing techniques.
That's very similar to CPU-based performance with modern CPUs and parallelization! Frankly, with whisper.cpp it tends to be a little faster than the length of the audio for the "small" model, and much faster for "base" and "tiny".
Doesn't even have to be that modern, my Ivy Bridge CPU already achieves faster than realtime performance - which makes me wonder if there is maybe some upstart cost for the GPU based solution and it would outperform the CPU only with longer clips.
If you use the large_v2 version of whisper, and give it a prompt to indicate what it's transcribing, it can do extremely well. But do use the prompt feature.
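For example, with the openai-whisper package the prompt is passed like this (the file name and prompt text are just placeholders):

    import whisper

    model = whisper.load_model("large-v2")
    result = model.transcribe(
        "board_meeting.mp3",
        initial_prompt="Acme Corp quarterly board meeting; speakers: Jane Doe (CEO), John Roe (CFO).",
    )
    print(result["text"])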
If it's the auto CC currently used by YouTube... I think it needs a few billion more (or less) sentences.
It is comically bad, with nonsense words that don't make any sense in context about every other sentence in English, and an absolute inability to produce a coherent thought in Japanese.
The auto CC used by YouTube has improved greatly; when I started using it years ago it was almost unintelligible, but now it's impressive. (At least in English and Italian it's practically perfect now.)
USM has a word error rate of ~15% on US English according to the article. That means it's getting roughly one or two words wrong in every sentence. If you're seeing wrong words in every other sentence it's doing better than you'd expect.
Agree on this. I wonder how threads like this even get off the ground. Anyone who watches about 10 minutes of auto-captioned videos can see how awful it is.
That’s funny; I watch a lot of YouTube — mostly in English, mind you, but sometimes in other languages with English auto-translated captions — and I find the quality of YouTube’s auto-captions better than most “professionally”-produced captions, to the point that I would sometimes switch explicit captions off and revert back to auto-captions if I could.
I think this says more about the generally-low quality of the captioning services used by YouTube content creators than anything, though.
My biggest complaint is that I often find that human captioners don't recognize domain-specific words. Like, names of ethnic foods. The word might be literally showing on the screen while the presenter is talking — and the "professional" captioner will just put [indistinct] there, as if they aren't even watching the video as they're captioning it. YouTube's auto-captions get these words right every time.
My impression is that USM is seemingly uniquely good at code-switching from word to word within a sentence — which makes sense, given its “universality.” I think, if they allowed it, it would even be able to embed clauses and quotations generated in one language and alphabet, into a sentence generated in an entirely different language and alphabet, keeping syntax and grammatical structure correct for the given language within its clause.
I actually feel like the autocaptioning is one of those "I'm living in the future" moments. It's amazing to play a video in Swedish, and just have it autotranslate. I love it. It's not as good in my opinion as good human translation, but I agree that I've seen many translations that were much much much worse than the autocaptioning.
I have had some channels where I was laughing out loud at the auto-captioning (probably more at the translation than the transcription), but I did get a laugh out of it after all, and I generally knew what they were saying.
I've also noticed that, at least with the videos I watch, the auto-captioning errors usually seem "phonemically correct": it substitutes a word that sounds the same, and I can easily figure out what was meant. I've noticed these problems more with non-American English (British or Australian, for example), especially where there are multiple people all speaking English but with different accents. It does seem to me the English speech recognition is honed in on some West Coast or Midwest US English accent.
I am surprised contextual cues aren't being utilized more but I'm very happy with the YouTube speech recognition.
They aren't hiding that, though? Literally the first graphic on the page shows their claim of a "word error rate" on test data has an error rate of around 14% in the best case compared to the state of the art at 15% (for en-US content).
It's only 1% better than the current state of the art. But it's still noteworthy. From the end of the abstract:
> We demonstrate that utilizing a large unlabeled multilingual dataset to pre-train the encoder of our model and fine-tuning on a smaller set of labeled data enables us to recognize these under-represented languages. Moreover, our model training process is effective for adapting to new languages and data.
It's amazing to me that the chaotic process of "machine learning" can end up with an internal state for languages that is readily adapted to entirely new languages.
For now, they've got this handling audio transcription, but with some hints that this approach could work well for translation. Perhaps we'll be able to use these improved models to decipher Linear A[0] or other un-deciphered languages. It sounds like "magic", but it's the kind that could maybe exist.
>> It's amazing to me that the chaotic process of "machine learning" can end up with an internal state for languages that is readily adapted to entirely new languages.
Yeah and interestingly, that was roughly Chomsky's breakthrough finding with respect to how humans learn language as children. We are born with an innate language acquisition device.
Chomsky's ideas around the Universal Grammar are just a theory. Similar to how his formal grammars somewhat represent real languages but never fully, the UG model will never explain it all, or even most of it. Brain biology just doesn't like the idea of formal things/rules/grammars.
Here's an alternative theory/approach. What if natural languages are just the way the device starts working when the number of neurons grows quickly? NL properties sort of emerge out of low-level details of brain work? Neurons are simple but the brain is not. Complex brain properties emerge from trivial parts the same way our full bodies emerge from a simple DNA/RNA system. Any details in these systems would be too statistical to expose a limited rules system.
Obviously, a powerful enough ML system can infer the system's properties. In fact, it can infer any function. The thing is, this doesn't mean there's some simpler model explaining the details of the emergent system's workings.
What is surprising is the way LLMs imitate a stateful function (our brain, with memory, fluid biology, etc) using a stateless inferred function (the model). I suspect this statefulness might be the answer to the question of "poverty of stimulus" problem.
The business model is the interesting part - it's proving out OpenAI's model of selling access to a model, rather than insights that Google derived from the model
Whisper also manages to add punctuation and line breaks, and can attribute the speech to a particular speaker. The YouTube version is all lowercase and has no punctuation. That "simple" change would make things so much better in YouTube CC.
These "pure speech" models could really benefit from being coupled to a large language model like ChatGPT.
YouTube live transcriptions are terrible, because they get confused by homonyms and can't follow the context in a sentence.
In the same manner that Dall-E joined an LLM to an image generator, they ought to train a combined speech model + LLM so that the uncertainties in the speech model output is disambiguated by the LLM.
This is exactly part of this Google USM approach, although these pretrained models are significantly smaller than ChatGPT. They reference this paper [1] which contains more details on the pretrained text-only alignment with the speech model.
Are you sure about that? I've seen YouTube captions understand homophones, and Google Assistant definitely can (though that may be linked to some other system).
Surely the speech recognition model itself learns some basic language statistics in order to recognise homophones?
Interesting that they enriched the training data by asking people to point out YouTube videos in specific languages for which they needed data:
> YT-513-U: We create an additional dataset called YT-513-U to ensure coverage of lower resource languages in our pre-training dataset. We reached out to vendors and native speakers to identify YT videos containing speech in specific long tail languages, collecting a dataset of unlabeled speech in 513 languages. [1]
That's cool, but Whisper is open source and I can run it today on my machine (even without a GPU) - it gives great results even compiled to WebAssembly and running in the browser with smaller models.
Totally free.
This needs to be much better to make sense and their own graphs show only marginal improvements in specific scenarios.
For my sci-fi story (alpha readers wanted; see profile), I used Whisper to transcribe an interview of a Malawian President. From there, I created a vocabulary comprised of only the president's words, which I used almost exclusively when writing his speech.
The results from Whisper are incredible, with very few mistakes. Though it did get Nelson Mandela's first name wrong (transcribed as Nesson). What's more, Whisper finished transcribing a 60-minute audio stream in 20 minutes on commodity hardware (T1000 G8 NVIDIA GPU). Broadly, here are the steps I used:
* Download and install podman.
* Download and install git.
* Download and install curl.
* Open a command prompt.
* Run the following commands to containerize Whisper:
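(The original commands aren't reproduced above; a minimal sketch of what they might look like, assuming the open-source openai-whisper package plus ffmpeg inside the container — none of this is the original author's exact setup.)

    # Containerfile
    FROM python:3.11-slim
    RUN apt-get update && apt-get install -y --no-install-recommends ffmpeg \
        && rm -rf /var/lib/apt/lists/*
    RUN pip install --no-cache-dir -U openai-whisper
    ENTRYPOINT ["whisper"]

    # Build, then run against an audio file in the current directory
    podman build -t whisper -f Containerfile .
    podman run --rm -v "$PWD:/data:Z" -w /data whisper interview.mp3 --model medium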
Whisper is great. You can get faster results running the tiny model. I used it for podcast transcription and it is much faster, and the quality is no worse than the medium model; for some podcast episodes the transcription is identical.
It's interesting because while evaluating Whisper for an ASR task I found it to have some entertaining hallucinations when provided with silent or garbled audio.
For instance, this was added to the transcription of a silent section of audio:
> Hello everyone welcome to my channel. Today Im going to show you how to make a very simple and easy recipe. I hope you enjoy the video. If you like my recipe dont forget to subscribe to my channel
It makes me wonder how much of Whisper is trained on audio from Youtube, which was transcribed by this model.
Similar experience; mine would turn background noises, when no one was talking, into random Japanese words repeated. I was using the large model. I ended up fixing it by using the medium.en model and setting condition_on_previous_text to false.
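Roughly what that looks like with the openai-whisper package (the file name is a placeholder):

    import whisper

    model = whisper.load_model("medium.en")
    result = model.transcribe(
        "noisy_background.wav",
        condition_on_previous_text=False,  # don't feed earlier output back in as context
    )
    print(result["text"])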
Now if only the timestamp timings were correct for noisy audio... I've tried stable-whisper and another fork whose name I forget, but I need to run the audio through RTX Voice if I want consistent timestamps...
I wonder if that's specifically a result of training it on youtube videos where the audio/subtitles don't actually match (recipe videos with no speech that have additional text added in the subtitles)
I'm imagining the near future will see a portable fast streaming model for real-time voice translation, piped into text to speech. Hook it up to an earpiece and you've got a real-life Babelfish
Whisper doesn’t do streaming transcription or translation. It’s one of its architectural disadvantages over more commonly used ASR approaches, such as transducers (see e.g. Microsoft Research’s system trained on ~350k hours that does streaming transcription and translation: https://arxiv.org/abs/2211.02499).
Whisper is incredible, and it's bananas that they give it away for free. It is able to decipher English with the thickest of accents, often better than me.
It's not even possible for me to verify the accuracy claims of this closed model with limited API access.
I really don't care about closed API models of anything that has a good/usable open source version. Whisper works well enough, I'm never going to follow up on this USM research or use it. The only reason to pay for the API access would be for some super niche language. And if Google is paywalling this only for the few customers who need it for use in terribly under-represented communities ... that's a kind of douchebaggery all its own.
The only reason people are paying for OpenAI's GPT-4 is because there's literally no usable open-source LLM. The instant a "good enough" one exists, OpenAI's revenues will drop by >95%.
Hopefully Google will at least use this in Google Home because it's still bad enough to notice.
I mean, some people--like myself--are already paying for Google's multi-language speech recognition API, and have been for years, so the idea that there is a new even-better model for it sounds cool to me? My primary annoyance is that this is Google, so of course they aren't going to put in even the minimal effort to just make this a new backend for their existing ridiculously-insane API I had to build a miserable slightly-custom http/2 stack to access :/.
Regardless, I don't want to use the API, but I'm working with public information anyway; and so, while I have considered moving to Whisper now that that's an option, it hasn't been a priority and it isn't clear to me that Whisper is good at random non-English languages anyway.
On the youtube captions dataset, it says english (US) WER is nearly 15%. On those 18 languages, it's nearly 20%. How does that match up with 32% relative being one per thousand or so?
In the paper itself[0], they do have a couple comparisons for English-only (Fig. 2, Table 3), and it looks like Whisper has an 11.5% error rate and USM has a 10.5% error rate. It's a truly negligible difference. There's no way I'd ever pay for an API for this if I knew I only cared about English.
I know that's not the point of this model (the point is that for a lot of languages, it's the only model available). But paywalling it seems greedy; you'll only extract money from those under-represented communities. On the other hand, maybe this never would have been built without the profit motive. Idk. I wish we could fund these things as "basic science research" without a need for direct profit. Let positive externalities pay us back down the road.
Wow, training on a dataset of 12 million hours is quite impressive! I can only imagine the engineering feats required to accomplish that. To put it into perspective, Whisper was trained on 650K hours, Speechmatics' Ursa was trained on 1M hours, and AssemblyAI trained Conformer-1 on 650K hours. I hope Meta is also working on something similar!
That being said, Speaker Diarisation is still a problem that hasn't been fully solved. As of yet, AI hasn't been able to outperform humans in this area.
I don't understand why Google doesn't let people use this in the wild; that's the best-case scenario for improving it. I've seen great pull requests for Whisper and lots of cool mods to help.
It feels like watching an ad for a product with no 0800 number, QR code, or web URL to go and use it. So frustrating.
Better get used to it from the corporate labs. Now that ChatGPT has ignited widespread interest, these companies are all about building business models with moats. The easy going 'let's just release it' days are most likely over.
People like to bash OpenAI because they've taken a profit-driven approach and haven't open sourced all of their products. However, they release their products for public consumption and improvement. In this case I wish Google would behave more like OpenAI.
There are a lot of text-to-speech models on Hugging Face. Some are really great and support many languages! They can be run offline easily, you just need a GPU and some python modules. https://huggingface.co/models?pipeline_tag=text-to-speech
Ah, you didn’t see the forms somebody at google had to fill out to make that form!
Three design documents, committee review LGTM, director-level LGTM, VP approval for headcount of person to make the form, HR forms to process head count to make the new form, legal review, privacy review, security review, and finally one of the original LGTM from committee was laid off, so you need to start over.
I would not use Whisper as a good yardstick for English-language transcription. I'm not sure what the hubbub is all about, but for me, a non-native English speaker, Whisper is not very impressive. There are engines out there that produce a far better word error rate on my speech than Whisper does.
Maybe it works well with native speakers? But since it's supposed to be so multilingual I hoped that it would work well with my accented speech... maybe that's a wrong conclusion to draw.
What alternative to whisper would you recommend?
I just recently looked at what’s available, and most of what I found was much worse - what did I miss?
This is an impressive feat. I wish auto-captioning were even better, though. At least for Japanese videos I find it to be less than great. If you throw auto-translation on top of that (which is impressive that YouTube attempts at all), it falls pretty flat.
It would work better if you could feed speech embeddings to the translation model directly, since it has more language knowledge to choose what the original was more likely to say.
(Of course that might just lead to picking more common translated phrases.)
Humans train to recognize speech on much smaller datasets. If a small human is awake 16 hrs/day, that amounts to maximum 5840 hrs/year or 58400 hrs per 10 years. Why do mathematical models use more data and produce lower quality results? Is it because they don't understand the meaning of words?
My personal explanation is how large ML models are stateless while the growing brain is a (very) stateful thing. Stateless imitation will never be able to fully replicate a stateful system.
We evolved into having these brains that learn in stages, when the early years are responsible for core functions (walking, running, talking, etc). There's a lot of input (all the senses) that feeds learning, while in later years we mostly just take for granted whatever we learnt as children.
I recently witnessed a speech in Indonesia for Nyepi, the Balinese New Year. I was trying to use Google Translate's live conversation feature to get an idea of what was being said. I still couldn't make out much more than "he's saying something about the importance of the holiday and being pure".
I pasted the auto-translation into ChatGPT and asked it to summarize:
> The speaker seems to be discussing the importance of Nyepi, a Balinese Day of Silence. They mention that happiness, peace, and prosperity can be achieved by being in harmony with space and time. The speaker also references ogoh-ogoh, which are statues symbolizing negative influences, and the Panca Maha Bhuta, or the five elements of life. They suggest that negative emotions and behaviors can lead to a "darkness of the mind."
> The speaker emphasizes the importance of self-control and using the Nyepi celebration as a milestone for personal growth. They mention the parading of the ogoh-ogoh, which represents negative behaviors being confronted and released. Following the Nyepi celebration, people should aim for a new and better life, leaving behind their past negative behaviors and emotions.
I'm not convinced of the accuracy, but I'd definitely say it's useful.
How long until we can have realtime translations so I can talk to someone in China on the phone and I hear English & they hear Chinese? That is one of my scifi dreams. Any open source projects working on this atm?
How could that ever work with languages that have different noun/verb/idea ordering?
Intent translation generally requires you to say your sentence in your language, then the translator adds/removes words and can swap order of stuff around.
You’d expect some delay to handle that, but it’s all taken care of with minimal delays. Have a UI that indicates it’s waiting for the translation or something. I’m sure someone can figure out a nice UX.
A very funny phrasing of some legitimately confusing news.
I keep thinking that there's no way that Brain/DeepMind are just getting stomped, lapped, generation-gapped by e.g. `ChatGPT`: they must have had an internal demo of this sort of thing like 2 years ago, right? At some point the Empire strikes back?
But the rollout and product integration has been so well done, so coordinated and cohesive that it's now just obvious that it was way too soon to count MSFT out of the game. It's all through search and Office 365 and GitHub/CoPilot/etc. and the whole stack in such a legitimately compelling way that you can almost forget that the DNA is Win32.
It's a bad thing to let Microsoft get a stranglehold on developer and user mindshare network effects: the 90s were rough. But with how cool it all is it's very tempting...
Google has almost no business sense, or marketing acumen. They've only ever done one thing well, from a business perspective - create a world class search algo, put it behind a minimalistic web page, and pay for it with ads. And they didn't even come up with that business model: they just did what the competition did, but made the idealistic-nerd version of it.
Everything since then has been a combination of algorithm and compsci research (which they are world-class in; credit where credit's due), vague ideas about things people might like, and copies (or buyouts) of their competition. They remind me of my engineering friends who tried to come up with business plans in college...good at building things, but clueless about figuring out what people actually need, what they should therefore actually build, and how to make it user-friendly. You know, all the stuff that you need if you want to run a business. (As I've said before on hn, their initial competition against youtube is a great example of this)
It's a surprise that a technology came along which upended them so abruptly, but it's been clear for a long time that they were only alive because their search engine couldn't be beat, and they didn't have a clue how to replicate that success.
There seems to be a lot of truth in that, but I think it's also maybe a little harsh as well.
Google hasn't needed to generate another monster revenue stream outside of ad sales, so it's possible that they never really tried all that hard (they've certainly killed it on the things that protect it, notably Android and Chrome). An utterly dominant position in how people access information that lasts for decades is probably "a hell of a drug".
For example, GCP is technically a really, really good cloud offering, maybe even the best for a lot of use cases (if you haven't looked at it lately, Cloud Bigtable looks friggin amazing; I wish I'd had that database for the last ten years). They've obviously failed to parlay that technical achievement into dominant market share, but maybe with the pressure on around search they get serious about whatever combination of pricing and marketing and customer support gets them some serious market share.
YouTube has been quietly building their TikTok competitor into something I'm actually starting to waste some time on; the people who work on that are clearly really good at their jobs, even if they started a little slow.
And on the LLM space, honestly I'm rooting for them: MSFT/OpenAI/ChatGPT need at least some competition and they are probably the best positioned to do so. Facebook/Meta is also doing this stuff in a more "open" way and that's keeping the pressure on around some competition too.
In general this LLM stuff is going to be a great thing long-term, but letting one company dominate both mindshare and marketshare is going to make that a much rougher road for society than if it's avoided.
Google senior management seems out of touch. It baffles me, since they have the money and the influx of talent. Once you have those things, how you use them becomes the problem, and that is all on management. Google might have become the old-school corporation incapable of innovating or producing new modern business lines. Having worked in those corporate environments, I can say that badly incentivized management can kill any giant of industry.
It's been going on for some time. Something that was once a joke made in good jest, a.k.a. Google's graveyard, is now their actual reputation, and it helps their strong big-tech competitors when competing on new services.
yes "They've only ever done one thing well, from a business perspective - create a world class search algo, put it behind a minimalistic web page, and pay for it with ads. And they didn't even come up with that business model: they just did what the competition did, but made the idealistic-nerd version of it."
Hot take, but OpenAI's whisper was released earlier in the year and was quite impressive. They were definitely "first" - even if this model claims to outperform that one.
The WER is going to vary a lot depending on conditions. This model is better than Whisper, and Whisper does much better than 10% on good-quality audio with common accents.
More like universal speech memorization. Model implies they had some sort of insight, a simplification, an understanding of how natural language works. This is just bragging about the number of parameters they can pull off.
If it can memorize all possible waveforms across all those languages and how they convert into words in only 2B parameters then I'm incredibly impressed.
(Of course it isn't doing this, and this criticism is just wrong.)
> Model implies they had some sort of insight, a simplification, an understanding of how natural language works.
Well it shows that using the same model for multiple languages increases performance which was something that was not at all clear 10 or even 5 years ago.
> This is just bragging about the number of parameters they can pull off.
>If it can memorize all possible waveforms across all those languages and how they convert into words in only 2B parameters then I'm incredibly impressed.
Well then I don't think you've looked at the topic very deeply. Our voices don't make every possible wave form. The IPA alphabet has existed for a generation now, is universal for all spoken human languages, and has a basis in physiological origins. Of more than 160 IPA symbols, relatively few will be used to transcribe speech in any one language. It should not take 2B parameters to do this.
Look at the inner ear and you'll see a big hint. Before even getting to the brain, the sound goes through a spiral-shaped chamber. Quoting Wikipedia: "The hair cells in the Organ of Corti are tuned to certain sound frequencies by way of their location in the cochlea...". Clearly, looking at time-series waveform data in the first place is a mistake. Rolling-window Fourier transform all the speech data first, then train on it. I will bet 2-digit sums of money that that simple preprocessing step would outperform what they've done.
Really, take any recording of people talking, open it up with audacity, and check out the spectrogram view. The sort of sounds we make with speech are a lot like an FM radio transmission. There's a baseband average pitch that we speak at, and then information encoded on top by deviating over/under the base band in smooth, simple ways. If that spectrogram was an OCR problem, it would already be solved.
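For anyone curious, the rolling-window Fourier transform step I'm describing is a few lines with scipy (just a sketch: the window and overlap sizes are arbitrary and the WAV file is assumed to be 16-bit mono):

    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import stft

    sr, audio = wavfile.read("speech.wav")
    audio = audio.astype(np.float32) / 32768.0         # scale 16-bit PCM to [-1, 1]
    freqs, times, Z = stft(audio, fs=sr, nperseg=512, noverlap=384)
    spectrogram_db = 20 * np.log10(np.abs(Z) + 1e-10)  # (n_freqs, n_frames) magnitudes in dB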
>A humble brag about how small the model is maybe?
I see a B at the end of a number and I assume it's part of a pissing contest. But if they really are playing parameter golf, bragging about how small they can get it, then I admit that's a step in the right direction.
I think I remember reading that languages can choose any point on the vowel diagram as a canonical phonetic expression of a vowel, even though a given language will only have a small finite inventory of vowels.
If this is right, one could interpret it as phonemes being digital (a native speaker can consistently transcribe them using a small number of IPA symbols), but (at least vowel) phones being analog (the typical frequency-spectrum appearance representing a specific vowel could best be represented by floating-point numbers that are culturally chosen somewhat arbitrarily by language evolution). Like /ɪ/ might be 0.82 closeness, 0.23 backness, 0.05 roundedness in some language, but 0.79 closeness, 0.26 backness, and 0.1 roundedness in another.
In turn, if that's right, I don't know how much variation there is among native speakers (from a particular region) in production of a given vowel (or how consistently and precisely children can learn these values). Presumably for recognition the brain rounds off somehow to the nearest vowel phoneme, kind of like picking the nearest point on the constellation diagram in digital communications demodulation?
>a specific vowel could best be represented by floating-point numbers that are culturally chosen somewhat arbitrary by language evolution). Like /ɪ/ might be 0.82 closeness, 0.23 backness, 0.05 roundedness in some language, but 0.79 closeness, 0.26 backness, and 0.1 roundedness in another.
That's not 3 floats' worth of data; those are all ints over 100.
We generally think of people, but not languages, as having deep or squeaky voices. Furthermore, we usually have no problem following when the speakergoesfaster or enunciates reaaaallllyy slooowwwlllyy or even ish they slur theirr wurds abit. I don't think a language-specific absolute value of vowel tone is a real thing. Rather, what matters is that the direction and magnitude of the shifts in tone are all in consistent proportions relative to each other. When you look at the vocalizations of humans, or even other mammals, on the spectrogram, there are very clearly just simple modulations happening, like moving linearly from a starting frequency to an ending frequency, or at most taking an x^2 curve to get there. When you think about it, this makes sense, as those sorts of patterns are trivial to implement on the hardware: just squeeze or release the muscle on the vocal cord harder. You're right that in principle any set of frequencies can be the vowels, but every language is going to discretize down to a fixed number. So instead of having or needing language-specific absolute "vowel frequencies", you can just look for k distinct levels in the individual speaker's voice.
If you've gotten as far as calculating the spectrogram, everything I described can be easily implemented by applying curve fitting to that spectrum and then thresholding the coefficients.
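A very loose sketch of that curve-fitting-and-thresholding idea (the polynomial order and threshold are invented for illustration; spectrogram and freqs are as produced by an STFT like the one sketched earlier):

    import numpy as np

    def fit_dominant_track(spectrogram, freqs, order=2, coeff_threshold=1e-2):
        # Dominant frequency bin per time frame, then a low-order polynomial fit
        # as a crude model of the smooth modulations described above.
        track = freqs[np.argmax(spectrogram, axis=0)]
        t = np.linspace(0.0, 1.0, len(track))
        coeffs = np.polyfit(t, track, order)
        return np.where(np.abs(coeffs) < coeff_threshold, 0.0, coeffs)  # drop small terms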
> You're right in principle any set of frequencies can be the vowels but every language is going to discretize down to a fixed number. So instead of having or needing language specific absolute "vowel frequencies", you can just look for k distinct levels in the individual speakers voice.
But if a native English speaker speaks Japanese with five appropriately-spaced English vowels, it will be directionally correct and comprehensible, but native Japanese speakers would notice that those aren't the normal vowels used by native speakers in Japanese.
So I guess I'm wondering about how much leeway there is in what sounds like a correct vowel to a native speaker, how much range there is in a given person's production, how much range there is in production among members of a community who interact with each other constantly, and how small a difference between vowels someone can detect as a listener.
Here is my rank conjecture. Part of the advantage to working in frequency space as I suggest is because the output is invariant under certain transformations (up to a multiplication by a phase factor e^(i omega)). In the 2d image case, the invariant is translations along either axis. In the 1d time series case, the invariant is translations along the time axis. That's just a fancy way to say you get the same frequencies out no matter what time you pressed record. So I conjecture that the sorts of changes to the sound wave form which don't get in the way of our comprehension are exactly the invariant of some transform similar to Fourier which happens as a low level processing step shortly after the ear. The exact extent to which a vowel tone differs from the native speaker average manifests as (and is encoded in) a phase shift on the output of the transform, but the output magnitudes stay invariant and thus recognizable.
I think the other thing that matters is that the relative proportions of vowel tones correspond to a word the listener expects to hear given the sentence/situation. A word pronounced so unconventionally that it's been turned into a homophone for another word will still generally be understood, even if the listener notices the "wrong" vowel tone. Which is where the 2bn parameters come in...
I fully agree that context, the semantic content itself, primes the listener's expectations of the next rabbit. But that sort of language model need not be GPT-level good. It just needs to identify which words/letters are unlikely to be next to each other and which ones they might be mistaken for. An unsophisticated Markov chain could do that. Is this really a 2B-parameter problem, or do we do things this way because parameters are currently cheaper than expertise?
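Something like this toy bigram model would do the unsophisticated version of that (the corpus sentences and the probability threshold are placeholders):

    from collections import Counter, defaultdict

    def train_bigrams(corpus_sentences):
        counts, totals = defaultdict(Counter), Counter()
        for sentence in corpus_sentences:
            words = sentence.lower().split()
            for prev, cur in zip(words, words[1:]):
                counts[prev][cur] += 1
                totals[prev] += 1
        return counts, totals

    def unlikely_words(sentence, counts, totals, threshold=1e-3):
        # Flag words that rarely follow their predecessor in the training corpus.
        words = sentence.lower().split()
        flagged = []
        for prev, cur in zip(words, words[1:]):
            p = counts[prev][cur] / totals[prev] if totals[prev] else 0.0
            if p < threshold:
                flagged.append(cur)
        return flagged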
> I will bet 2-digit sums of money that simple preprocessing step would outperform what they've done.
Okay. I take this bet. If your simple preprocessing step outperforms whisper on, say, me reading a random wikipedia article, I'll happily send you any 2-digit sum of money (denoted in dollars presumably), or donate it to a charity of your choice.
In retrospect I should have realized the corner I was painting myself into here. If I'm right, you just paid bargain-basement prices for an algorithm that can replace an expensive 2B-parameter ML process. But if I had offered what the bet is actually worth, it would look like I was trying to price people out of ever taking me up on it. Well, shit.