On a related note, for anyone interested in this who wants better performance today:
I messed with a combo of Whisper and ChatGPT. I took a Whisper transcript and asked ChatGPT to fix mistranscriptions using the context of the transcript and potential phonetic issues. I also asked it to replace transcribed words that don't make sense with "[unintelligible]", which improved the output even more.
Transcription error rate was almost nonexistent, even on the smallest whisper model.
Probably depends on the language. I tried to transcribe some hour-long talks in Chinese with Whisper the other day. Even with the large model, it skips some sentences from time to time (no idea why) and repeats the same sentence over and over in other places. I tweaked parameters for a long time to no avail. While GPT should be able to clean up the repetition (there will be false positives, though, since human speakers do occasionally repeat themselves), it can't really fill in the missing sentences.
I have to transcribe a tonne of Chinese interviews soonish -- any further thoughts or experiments you can think of? Maybe some preprocessing steps to the audio? For example, cut it into one minute chunks with some overlap, then transcribe those, so that it can't skip those bits...? Or can we finetune it on a library of Chinese mp3s + transcripts?
Whisper cuts the audio into chunks of 30 seconds. So if you have a one-minute recording where the first half contains conversation and the second half nothing, it will still think it has to find something in that second 30-second block, without knowing what "speech" actually sounded like, the way it did in the first chunk.
Try pre-processing it with plain voice activity detection (not recognizing the meaning, just detecting that someone is speaking) and cut the audio into snippets that contain only speech, so that Whisper doesn't have to guess whether a segment contains speech or not.
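Something like this rough sketch could serve as that pre-processing step (assuming 16 kHz, 16-bit mono WAV input and the py-webrtcvad package; the frame size and aggressiveness below are picked arbitrarily):

    import wave
    import webrtcvad

    def frames_with_speech_flag(path, frame_ms=30, aggressiveness=2):
        # Yield (raw_pcm_frame, is_speech) pairs using WebRTC's voice activity detector.
        vad = webrtcvad.Vad(aggressiveness)
        with wave.open(path, "rb") as wf:
            sr = wf.getframerate()                      # webrtcvad expects 8/16/32/48 kHz
            samples_per_frame = int(sr * frame_ms / 1000)
            while True:
                frame = wf.readframes(samples_per_frame)
                if len(frame) < samples_per_frame * 2:  # 16-bit mono = 2 bytes per sample
                    break
                yield frame, vad.is_speech(frame, sr)

    def speech_only_pcm(path):
        # Keep just the voiced frames; split on long silences before feeding Whisper.
        return b"".join(f for f, voiced in frames_with_speech_flag(path) if voiced)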
Also, if you cut it up into chunks, let it transcribe each chunk, and request JSON output instead of the other output formats, you'll get a bunch of extra parameters with it that help you identify problematic sections. For example, hallucinated repetitions usually have a higher "no_speech_prob", and segments with a lower "compression_ratio" also tend to be less accurate.
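For illustration, a sketch of that kind of filtering with the openai-whisper Python package (the thresholds here are made up and would need tuning on your own audio; the file name is just an example):

    import whisper

    model = whisper.load_model("medium")
    result = model.transcribe("talk_chunk_03.wav")

    kept, suspect = [], []
    for seg in result["segments"]:
        # Flag segments whose metadata looks off; hallucinated repetitions often
        # show a high no_speech_prob and/or an unusual compression_ratio.
        if seg["no_speech_prob"] > 0.5 or not (1.0 < seg["compression_ratio"] < 2.4):
            suspect.append(seg)
        else:
            kept.append(seg["text"])

    print(" ".join(kept))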
If cost isn't an issue, I'd use one of the established commercial offerings. Otherwise, you could try splitting into shorter chunks (20 minutes maybe?), doing multiple runs on each chunk, and picking the best run according to some criterion, e.g. character count after removing repetitions. Whisper isn't deterministic, so some runs can be better than others; you could also tweak parameters like compression ratio or silence thresholds between runs, but in my experience there isn't going to be an obvious winner leading to a marked improvement in quality. Anyway, I'm no expert, and maybe you'll have better luck than me. My recordings do have background music and conversations in some places that might confuse the model.
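A toy version of that selection criterion (collapse immediate repetitions, keep the run with the most text left) might look like this; the sentence splitting is simplistic and the inputs are placeholders:

    import re

    def collapse_repeats(text):
        # Drop a sentence when it is identical to the one right before it.
        sentences = re.split(r"(?<=[。！？.!?])\s*", text)
        deduped = [s for i, s in enumerate(sentences) if i == 0 or s != sentences[i - 1]]
        return "".join(deduped)

    def best_run(candidate_transcripts):
        return max(candidate_transcripts, key=lambda t: len(collapse_repeats(t)))

    runs = ["...output of run 1...", "...output of run 2...", "...output of run 3..."]
    print(best_run(runs))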
Can't use third party service -- compliance stuff. This is helpful though. Using another model to tidy it up -- maybe Alpaca -- could be an option too. Then we'll just do speaker separation etc. manually later.
I use a custom babbage model fine-tuned using the original Whisper transcripts as the prompt and the fixed transcript as the result. It does a very good job of correcting common jargon and names, fixing something like 1/3 to 1/2 of all errors. It's also very good at repairing transcripts that Whisper, for whatever reason, fails to punctuate, adding punctuation and capitalization where appropriate. I used this to fix 1000+ transcripts of lectures totaling around 700 hours of speech on the cheap.
I personally wouldn't recommend my approach, however, unless you are okay with doing some hardcore text manipulation and fuzzy math. It fails to produce text that matches up with the prompt about 10% of the time, and there are lots of other caveats around what to do with text that doesn't fit into the prompt.
I've suspected as such, which makes using the Siri and Google Assistant of today even more infuriating! I know it's going to be exponentially better just around the corner!
I've been doing the same. Apart from sometimes changing the meaning of sentences, it does a pretty good job at improving the message one is trying to convey.
In my case I'm recording notes while exercising on my bike, so there's wind and friction and breathing. I then transcribe them with Whisper and try to filter out things that don't make much sense (where the temperature, compression_ratio, avg_logprob and/or no_speech_prob are not acceptable).
Since I start recording at will and don't prepare the sentence to be recorded, the recording is a bit chaotic, and ChatGPT does a good job in correcting Whisper's mistakes and transforming the sentences.
But, as I said, sometimes this fails and the real message gets lost.
This is tangential to the software techniques, but have you tried using throat microphones? They're essentially the microphone counterpart to the bone conducting headphones. They're aimed at use cases where there's a lot of noise (originally created for paragliding but also used in biking scenarios), and low voice scenarios (eg covert operations and even to amplify voice from people having speech problems due to things like Parkinson's disease).
I'm using them to record multiple people in the same room but avoid cross-talk (for data collection in a research study). The audio isn't great, but I've used transcription services and they seem to be able to make the words out just fine.
This is one of my frustrations with generative AI APIs, the censoring of content. I understand why it's done but it seems overboard to me, and has created huge headaches with material that is not at all prurient, but where there's some homonyms involved.
I use Whisper actively in my daily life, but I'm curious about the prompt you used to make these corrections. I didn't get great results in my trials. What technology are you using, GPT-3.5 or GPT-4? If you could provide me with some information on this, I'd like to incorporate it into my workflow right away. I understand that you may have fine-tuned a custom model on Davinci. If anyone working in this field has a chance to give me more detailed information, I'd really appreciate it.
I wish there was a way to use Whisper for continuous recognition! Unfortunately, Whisper doesn't support streaming — and AFAIK it's unlikely that it ever will.
Nice! I tried something similar where I orally gave a speech, transcribed it and then had ChatGPT turn it into a coherent text. Saved me a bunch of time.
Yeah, there are some models I've played with that can do this. They only work for 2 or 3 speakers currently, though. The term for this is "diarization".
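If it helps anyone, a minimal sketch of running one of those diarization models via the pyannote.audio library (this assumes a Hugging Face access token and that the pretrained pipeline name below is still current; the file name is a placeholder):

    from pyannote.audio import Pipeline

    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",  # gated model; requires accepting its terms
        use_auth_token="YOUR_HF_TOKEN",
    )
    diarization = pipeline("meeting.wav")

    for turn, _, speaker in diarization.itertracks(yield_label=True):
        print(f"{turn.start:6.1f}s - {turn.end:6.1f}s  {speaker}")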
I wonder, do any conference call services (zoom, GMeet, etc) offer the ability to record each participant's audio stream separately in a way that would make it easy to transcribe them separately then combine?
> USM, which is for use in YouTube (e.g., for closed captions), can perform automatic speech recognition (ASR) on widely-spoken languages like English and Mandarin, but also languages like Punjabi, Assamese, Santhali, Balinese, Shona, Malagasy, Luganda, Luo, Bambara, Soga, Maninka, Xhosa, Akan, Lingala, Chichewa, Nkore, Nzema to name a few.
I find this part to be the most impressive thing in the OP. Most of those languages are spoken by fewer than 0.1% of the world and Nzema is spoken by less than half a million people. Where are they even getting enough training data to figure those languages out?
They used data from youtube, with multiple selection methods. A little supervised data for 72 languages (90k hours of audio with text labels provided), some pseudo-supervised data in english (100k hours; the english labels were generated by having a model do the labeling), and a LOT of unsupervised data in 568 languages (12.1 million hours). That last group has no labels (aka captions) available, just the audio, but they create pseudo-labels by using the FLEURS dataset. I'm not really sure how this works as FLEURS itself only has 102 languages... I guess in different words, it just makes some phonetic impression of how that language should be written and then compares itself to how well it can do that impression?
> We demonstrate that utilizing a large unlabeled multilingual dataset to pre-train the encoder of our model and fine-tuning on a smaller set of labeled data enables us to recognize these under-represented languages. Moreover, our model training process is effective for adapting to new languages and data.
What this implies is that the core encoder model (trained on popular languages with a ton of data) does a really good job at learning the generalized basics of any language (period).
Then, they enter a comparatively minuscule amount of labeled data from the small languages at the end (supervised fine tuning - since parameters are tuned iteratively, the latest training data has the biggest impact on the final performance). The model has enough of a general “understanding” of grammar across languages that it can fill in the gaps it doesn’t get from that small labeled set.
That's the truly crazy part. The pre-trained encoder is an ersatz Rosetta Stone.
Cool though not entirely surprising that a network that learns how to process many human languages might develop internal subnets/structures that generalise to nearly all human languages. Could be seen as a humanlang coprocessor, as a sibling says.
Understanding those generic internal structures would be fascinating.
Right? I think understanding, or perhaps a better word is representing, the internal structure is also best left to a machine to extract. If it can categorize/cluster into various neural neighborhoods then we can feed that output (or perhaps many outputs from different runs) into another system that is responsible for explaining to humans.
Well, it’s certainly not evidence against, but it’s still possible that there are n-families of unrelated grammar structures and Google’s training caught enough examples of each of those families via popular languages.
Also, for all we know, they failed on a bunch of languages (unless there’s a paper somewhere I missed, but sadly the days of good AI papers from for-profits seems to be over).
Not necessarily. Universal grammar is about an innate grammar that has to be universal for any Homo sapiens language, whereas this empirical observation is more about the similarity of particular languages, many of which are significantly related to each other.
For example, if it had historically turned out (or will turn out in some future) that everyone on Earth speaks some variant of English or other Indo-European language; or some variant of Chinese or other Sino-Tibetan, then there would be many features of grammar that technically are universal across the population but that wouldn't be evidence that "universal grammar" (as used by Chomsky) is indeed a thing.
There is a lot of evidence that training can model underlying structures in languages. This means that the model doesn’t need to see as much of certain kinds of languages and can benefit from a larger corpus of various different languages.
Wouldn't the simpler explanation be that the pretraining teaches the model to extract human voices and phonemes from general audio, and so extending the model to other languages is then easier because the model is already working from phonemes?
It probably goes even beyond that. Most languages are related to other languages, and somebody good in German and Norwegian could do an ok job transcribing Swedish with minimal training. Similarly, learning English is a lot easier for German speakers than for Chinese, even though German and English are quite different in pronunciation.
The more languages from a language family a human speaks, the easier it is for them to learn each additional language from that family. It stands to reason that deep learning can benefit similarly.
Chomsky postulates that all languages boil down to a universal set of rules, and he went as far as saying that this is biologically motivated, as if there was a "language device" in the brain.
Norvig was one of the pioneers in proposing that NLP could be solved by statistical machine learning on large quantities of data.
So the joke here is that we're using large-scale machine learning to infer the underlying universal structure of language.
---
EDIT: After writing the comment above I asked ChatGPT-4 the same question, and it came up with the following answer, which I think is flawless.
---
The last sentence in the paragraph is referencing two well-known figures in the field of linguistics and artificial intelligence: Noam Chomsky and Peter Norvig.
Noam Chomsky is a linguist who proposed the idea of a "universal grammar," suggesting that all human languages share a common underlying structure, and that humans have an innate ability to acquire language. This idea implies that even with limited exposure to a specific language, humans can still learn it thanks to this innate structure.
Peter Norvig is a computer scientist and artificial intelligence researcher who is known for advocating a more data-driven, statistical approach to natural language processing. This approach emphasizes the importance of large datasets in teaching machines to understand and process language, rather than relying on pre-defined rules or structures.
The sentence "It's like Chomsky, but with the Norvig approach" suggests that the described language model incorporates elements of both Chomsky's universal grammar and Norvig's data-driven approach. The model benefits from an underlying structure (akin to Chomsky's universal grammar) while also leveraging a large corpus of various languages (as advocated by Norvig) to improve its performance in understanding and processing language.
We've been researching different speech models at Scrimba and went with Whisper on our own infrastructure. A few days ago I stumbled onto Deepgram, which blows Whisper out of the water in terms of speed and accuracy (we need high-precision word-level timestamps). I thought their claim of being 80x faster than Whisper had to be hyperbole, but it turned out to be true for us. Would recommend checking it out for anyone who needs performant speech-to-text.
In my experience the accuracy is at least a bit better than whisper-small on their enhanced models. But we've just started using it so haven't had time to do many direct comparisons with whisper. Their word-timestamps are _much_ better, which is important if you want to be able to edit the audio based on the transcription.
As for speed I have no idea how they make it so fast, but I'm sure they've written about it somewhere. My guess is at least that they are slicing the audio and parallelising it. Will look into Conformer-1 as well!
I saw Deepgram's claims as well and believed them too; then I tried it, and it was TERRIBLE. Don't believe them. It only does well on the benchmark they trained it on. It is faster, though, but the quality is terrible.
Did you try their enhanced models? We're using it for relatively high-quality audio files and their accuracy is better than the whisper small.en model. More importantly, their word level timestamps is worlds better than whisper.
Yeah, I'm not sure why people get so hyped up about Whisper. In production use it's middling at best, and there are commercial offerings that handily beat it in both accuracy and speed.
In most real-world settings, at least in my personal use, latency to a remote AI makes up most of the usability difficulty with automated speech recognition. The larger Whisper models can be run directly on a laptop using multithreading and achieve speech-to-text transcription that is sufficient to almost completely write whole emails, papers, and documents. In fact, I've written most of this comment using an ASR system on my phone that uses Whisper. While the smaller models (like the one used here) can need some correction, the bigger ones are almost perfect. They are more than sufficient, and for realtime interactive use I see no future market for paid APIs.
Yesterday I wrote virtually all the prose in the manuscript while walking around with a friend and discussing it. We didn't even look at the phone.
Obviously there's an academic element here because I'm saying I'm using it for writing. But it's more of a human-centric computing thing. I'm replacing a lot of time that my thumbs are spent tapping on keys, my fingers are spent tapping on keyboard, and my eyes are spent staring at the words that are appearing, looking for typographical errors to correct, with time organizing my thoughts in a coherent way that can be spoken and read easily. I'm basically using whisper to create a new way to write that's more fluid, direct, and flows exactly as my speech does. I've tried this for years with all of the various ASR models on all the phones I've had and never been satisfied in the same way.
Whisper democratises high-quality transcription of the languages I personally care about, whether using a CPU or a GPU. It's FOSS, self-hostable, and very easy to use.
That's a big deal for someone who wants to build something using speech recognition (voice assistant, transcription of audio or video) without resorting to APIs.
Is this in the realm of aspiration or something you've actually worked on? Because Whisper is incredibly difficult (I'd say impossible) to use in a real-time conversational setting. The transcription speed is too slow for interactive use, even on a GPU, once you step up above tiny or base. And when you step down that low, the accuracy is atrocious (especially in noisy settings or with accented voices), and then you have to post-process the output with a good NLP model to make it usable in whatever actions you're driving.
Look, it's nice that it's out there and free to use. For the sake of my wallet I hope it gets really good. But it isn't competitive with top of the line commercial offerings if you need to ship something today.
At the company I work for we are currently in the process of transitioning from being a German-only to an international company. Recently, we started using Whisper to live-transcribe + translate all our (German) all-hands meetings to English. Yes, it required some fine-tuning (i.e. what size do you choose for the audio chunks you pass to Whisper) but overall it's been working incredibly well – almost flawlessly so. I don't recall which model size we use but it does run on a beefy GPU.
What if you are operating in an internet-denied application? A remote radio sensing system with a low bandwidth back channel or an airplane cabin, just to name two.
Actively working on it. I've not noticed any performance problems even with the large model (though the plan was always to run the speech recognition on a GPU - your use case may differ). It seems to be doing fairly well even with slightly noisy inputs, and certainly has better bang/$ than other non-API solutions that service my native language.
While true real-time would definitely be nice, I can approximate it well enough with various audio slicing techniques.
That's very similar to CPU-based performance with modern CPUs and parallelization! Frankly, with whisper.cpp it tends to be a little faster than the length of the audio for the "small" model, and much faster for "base" and "tiny".
Doesn't even have to be that modern, my Ivy Bridge CPU already achieves faster than realtime performance - which makes me wonder if there is maybe some upstart cost for the GPU based solution and it would outperform the CPU only with longer clips.
If you use the large_v2 version of whisper, and give it a prompt to indicate what it's transcribing, it can do extremely well. But do use the prompt feature.
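For example, with the openai-whisper package the prompt is passed like this (the file name and prompt text are just placeholders):

    import whisper

    model = whisper.load_model("large-v2")
    result = model.transcribe(
        "board_meeting.mp3",
        initial_prompt="Acme Corp quarterly board meeting; speakers: Jane Doe (CEO), John Roe (CFO).",
    )
    print(result["text"])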
If it's the auto CC currently used by YouTube... I think it needs a few billion more (or less) sentences.
It is comically bad, with nonsense words that don't make any sense in context about every other sentence in English, and an absolute inability to produce a coherent thought in Japanese.
The auto CC used by YouTube has improved greatly; when I started using it years ago it was almost unintelligible, but now it's impressive. (At least in English and Italian it's practically perfect now.)
USM has a word error rate of ~15% on US English according to the article. That means it's getting roughly one or two words wrong in every sentence. If you're seeing wrong words in every other sentence it's doing better than you'd expect.
Agree on this. I wonder how threads like this even get off the ground. Anyone who watches about 10 minutes of auto-captioned videos can see how awful it is.
That’s funny; I watch a lot of YouTube — mostly in English, mind you, but sometimes in other languages with English auto-translated captions — and I find the quality of YouTube’s auto-captions better than most “professionally”-produced captions, to the point that I would sometimes switch explicit captions off and revert back to auto-captions if I could.
I think this says more about the generally-low quality of the captioning services used by YouTube content creators than anything, though.
My biggest complaint is that I often find that human captioners don't recognize domain-specific words. Like, names of ethnic foods. The word might be literally showing on the screen while the presenter is talking — and the "professional" captioner will just put [indistinct] there, as if they aren't even watching the video as they're captioning it. YouTube's auto-captions get these words right every time.
My impression is that USM is seemingly uniquely good at code-switching from word to word within a sentence — which makes sense, given its “universality.” I think, if they allowed it, it would even be able to embed clauses and quotations generated in one language and alphabet, into a sentence generated in an entirely different language and alphabet, keeping syntax and grammatical structure correct for the given language within its clause.
I actually feel like the autocaptioning is one of those "I'm living in the future" moments. It's amazing to play a video in Swedish, and just have it autotranslate. I love it. It's not as good in my opinion as good human translation, but I agree that I've seen many translations that were much much much worse than the autocaptioning.
I have had some channels where I was laughing out loud at the auto-captioning (probably more at the translation than the transcription), but I did get a laugh out of it after all, and I generally knew what they were saying.
I've also noticed that, at least with the videos I watch, the auto-captioning errors usually seem "phonemically correct": it substitutes a word that sounds the same, and I can easily figure out what was meant. I've noticed these problems more with non-American English (British or Australian, for example), especially where there are multiple people all speaking English but with different accents. It does seem to me the English speech recognition is honed in on some West Coast or Midwest US English accent.
I am surprised contextual cues aren't being utilized more but I'm very happy with the YouTube speech recognition.
They aren't hiding that, though? Literally the first graphic on the page shows their claim of a "word error rate" on test data has an error rate of around 14% in the best case compared to the state of the art at 15% (for en-US content).
It's only 1% better than the current state of the art. But it's still noteworthy. From the end of the abstract:
> We demonstrate that utilizing a large unlabeled multilingual dataset to pre-train the encoder of our model and fine-tuning on a smaller set of labeled data enables us to recognize these under-represented languages. Moreover, our model training process is effective for adapting to new languages and data.
It's amazing to me that the chaotic process of "machine learning" can end up with an internal state for languages that is readily adapted to entirely new languages.
For now, they've got this handling audio transcription, but with some hints that this approach could work well for translation. Perhaps we'll be able to use these improved models to decipher Linear A[0] or other un-deciphered languages. It sounds like "magic", but it's the kind that could maybe exist.
>> It's amazing to me that the chaotic process of "machine learning" can end up with an internal state for languages that is readily adapted to entirely new languages.
Yeah and interestingly, that was roughly Chomsky's breakthrough finding with respect to how humans learn language as children. We are born with an innate language acquisition device.
Chomsky's ideas around the Universal Grammar are just a theory. Similar to how his formal grammars somewhat represent real languages but never fully, the UG model will never explain it all, or even most of it. Brain biology just doesn't like the idea of formal things/rules/grammars.
Here's an alternative theory/approach. What if natural languages are just the way the device starts working when the number of neurons grows quickly? NL properties sort of emerge out of low-level details of brain work? Neurons are simple but the brain is not. Complex brain properties emerge from trivial parts the same way our full bodies emerge from a simple DNA/RNA system. Any details in these systems would be too statistical to expose a limited rules system.
Obviously, a powerful enough ML system can infer the system's properties. In fact, it can infer any function. The thing is, this doesn't mean there's some simpler model explaining the details of the emergent system's workings.
What is surprising is the way LLMs imitate a stateful function (our brain, with memory, fluid biology, etc) using a stateless inferred function (the model). I suspect this statefulness might be the answer to the question of "poverty of stimulus" problem.
The business model is the interesting part - it's proving out OpenAI's model of selling access to a model, rather than insights that Google derived from the model
Whisper also manages to add punctuation and line breaks, and can attribute the speech to a particular speaker. The YouTube version is all lowercase and has no punctuation. That "simple" change would make things so much better in YouTube CC.
These "pure speech" models could really benefit from being coupled to a large language model like ChatGPT.
YouTube live transcriptions are terrible, because they get confused by homonyms and can't follow the context in a sentence.
In the same manner that Dall-E joined an LLM to an image generator, they ought to train a combined speech model + LLM so that the uncertainties in the speech model output is disambiguated by the LLM.
This is exactly part of this Google USM approach, although these pretrained models are significantly smaller than ChatGPT. They reference this paper [1] which contains more details on the pretrained text-only alignment with the speech model.
Are you sure about that? I've seen YouTube captions understand homophones, and Google Assistant definitely can (though that may be linked to some other system).
Surely the speech recognition model itself learns some basic language statistics in order to recognise homophones?
Interesting that they enriched the training data by asking people to point out YouTube videos in specific languages for which they needed data:
> YT-513-U: We create an additional dataset called YT-513-U to ensure coverage of lower resource languages in our pre-training dataset. We reached out to vendors and native speakers to identify YT videos containing speech in specific long tail languages, collecting a dataset of unlabeled speech in 513 languages. [1]
That's cool, but Whisper is open source and I can run it today on my machine (even without a GPU) - it gives great results even compiled to WebAssembly and running in the browser with smaller models.
Totally free.
This needs to be much better to make sense and their own graphs show only marginal improvements in specific scenarios.
For my sci-fi story (alpha readers wanted; see profile), I used Whisper to transcribe an interview of a Malawian President. From there, I created a vocabulary comprised of only the president's words, which I used almost exclusively when writing his speech.
The results from Whisper are incredible, with very few mistakes. Though it did get Nelson Mandela's first name wrong (transcribed as Nesson). What's more, Whisper finished transcribing a 60-minute audio stream in 20 minutes on commodity hardware (T1000 G8 NVIDIA GPU). Broadly, here are the steps I used:
* Download and install podman.
* Download and install git.
* Download and install curl.
* Open a command prompt.
* Run the following commands to containerize Whisper:
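(The original commands aren't reproduced above; a minimal sketch of what they might look like, assuming the open-source openai-whisper package plus ffmpeg inside the container — none of this is the original author's exact setup.)

    # Containerfile
    FROM python:3.11-slim
    RUN apt-get update && apt-get install -y --no-install-recommends ffmpeg \
        && rm -rf /var/lib/apt/lists/*
    RUN pip install --no-cache-dir -U openai-whisper
    ENTRYPOINT ["whisper"]

    # Build, then run against an audio file in the current directory
    podman build -t whisper -f Containerfile .
    podman run --rm -v "$PWD:/data:Z" -w /data whisper interview.mp3 --model medium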
Whisper is great. You can get faster results running the tiny model. I used it for podcast transcription and it is much faster, and the quality is no worse than the medium model; for some podcast episodes the transcription is identical.
It's interesting because while evaluating Whisper for an ASR task I found it to have some entertaining hallucinations when provided with silent or garbled audio.
For instance, this was added to the transcription of a silent section of audio:
> Hello everyone welcome to my channel. Today Im going to show you how to make a very simple and easy recipe. I hope you enjoy the video. If you like my recipe dont forget to subscribe to my channel
It makes me wonder how much of Whisper is trained on audio from Youtube, which was transcribed by this model.
Similar experience; mine would turn background noises, when no one was talking, into random Japanese words repeated. I was using the large model. I ended up fixing it by using the medium.en model and setting condition_on_previous_text to false.
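Roughly what that looks like with the openai-whisper package (the file name is a placeholder):

    import whisper

    model = whisper.load_model("medium.en")
    result = model.transcribe(
        "noisy_background.wav",
        condition_on_previous_text=False,  # don't feed earlier output back in as context
    )
    print(result["text"])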
Now if only the timestamp timings were correct for noisy audio... I've tried stable-whisper and another fork whose name I forget, but I need to run the audio through RTX Voice if I want consistent timestamps...
I wonder if that's specifically a result of training it on youtube videos where the audio/subtitles don't actually match (recipe videos with no speech that have additional text added in the subtitles)
I'm imagining the near future will see a portable fast streaming model for real-time voice translation, piped into text to speech. Hook it up to an earpiece and you've got a real-life Babelfish
Whisper doesn’t do streaming transcription or translation. It’s one of its architectural disadvantages over more commonly used ASR approaches, such as transducers (see e.g. Microsoft Research’s system trained on ~350k hours that does streaming transcription and translation: https://arxiv.org/abs/2211.02499).
Whisper is incredible, and it's bananas that they give it away for free. It is able to decipher English with the thickest of accents, often better than me.
It's not even possible for me to verify the accuracy claims of this closed model with limited API access.
I really don't care about closed API models of anything that has a good/usable open source version. Whisper works well enough, I'm never going to follow up on this USM research or use it. The only reason to pay for the API access would be for some super niche language. And if Google is paywalling this only for the few customers who need it for use in terribly under-represented communities ... that's a kind of douchebaggery all its own.
The only reason people are paying for OpenAI's GPT-4 is because there's literally no usable open-source LLM. The instant a "good enough" one exists, OpenAI's revenues will drop by >95%.
Hopefully Google will at least use this in Google Home because it's still bad enough to notice.
I mean, some people--like myself--are already paying for Google's multi-language speech recognition API, and have been for years, so the idea that there is a new even-better model for it sounds cool to me? My primary annoyance is that this is Google, so of course they aren't going to put in even the minimal effort to just make this a new backend for their existing ridiculously-insane API I had to build a miserable slightly-custom http/2 stack to access :/.
Regardless, I don't want to use the API, but I'm working with public information anyway; and so, while I have considered moving to Whisper now that that's an option, it hasn't been a priority and it isn't clear to me that Whisper is good at random non-English languages anyway.
On the youtube captions dataset, it says english (US) WER is nearly 15%. On those 18 languages, it's nearly 20%. How does that match up with 32% relative being one per thousand or so?
In the paper itself[0], they do have a couple comparisons for English-only (Fig. 2, Table 3), and it looks like Whisper has an 11.5% error rate and USM has a 10.5% error rate. It's a truly negligible difference. There's no way I'd ever pay for an API for this if I knew I only cared about English.
I know that's not the point of this model (the point is that for a lot of languages, it's the only model available). But paywalling it seems greedy; you'll only extract money from those under-represented communities. On the other hand, maybe this never would have been built without the profit motive. Idk. I wish we could fund these things as "basic science research" without a need for direct profit. Let positive externalities pay us back down the road.
Wow, training on a dataset of 12 million hours is quite impressive! I can only imagine the engineering feats required to accomplish that. To put it into perspective, Whisper was trained on 650K hours, Speechmatics' Ursa was trained on 1M hours, and AssemblyAI trained Conformer-1 on 650K hours. I hope Meta is also working on something similar!
That being said, Speaker Diarisation is still a problem that hasn't been fully solved. As of yet, AI hasn't been able to outperform humans in this area.
I don't understand why Google doesn't let people use this in the wild; that's the best-case scenario for improving it. I've seen great pull requests for Whisper and lots of cool mods to help.
It feels like watching an ad for a product with no 0800 number, QR code, or web URL to go and use it. So frustrating.
Better get used to it from the corporate labs. Now that ChatGPT has ignited widespread interest, these companies are all about building business models with moats. The easy going 'let's just release it' days are most likely over.
People like to bash OpenAI because they've taken a profit-driven approach and haven't open sourced all of their products. However, they release their products for public consumption and improvement. In this case I wish Google would behave more like OpenAI.
There are a lot of text-to-speech models on Hugging Face. Some are really great and support many languages! They can be run offline easily, you just need a GPU and some python modules. https://huggingface.co/models?pipeline_tag=text-to-speech
Ah, you didn’t see the forms somebody at google had to fill out to make that form!
Three design documents, committee review LGTM, director-level LGTM, VP approval for headcount of person to make the form, HR forms to process head count to make the new form, legal review, privacy review, security review, and finally one of the original LGTM from committee was laid off, so you need to start over.
I would not use Whisper as a good yardstick for English-language transcription. I'm not sure what the hubbub is all about, but for me, a non-native English speaker, Whisper is not very impressive. There are engines out there that produce a far better word error rate on my speech than Whisper does.
Maybe it works well with native speakers? But since it's supposed to be so multilingual I hoped that it would work well with my accented speech... maybe that's a wrong conclusion to draw.
What alternative to whisper would you recommend?
I just recently looked at what’s available, and most of what I found was much worse - what did I miss?
This is an impressive feat. I wish auto-captioning were even better, though. At least for Japanese videos I find it to be less than great. If you throw auto-translation on top of that (which is impressive that YouTube attempts at all), it falls pretty flat.
It would work better if you could feed speech embeddings to the translation model directly, since it has more language knowledge to choose what the original was more likely to say.
(Of course that might just lead to picking more common translated phrases.)
Humans train to recognize speech on much smaller datasets. If a small human is awake 16 hrs/day, that amounts to maximum 5840 hrs/year or 58400 hrs per 10 years. Why do mathematical models use more data and produce lower quality results? Is it because they don't understand the meaning of words?
My personal explanation is how large ML models are stateless while the growing brain is a (very) stateful thing. Stateless imitation will never be able to fully replicate a stateful system.
We evolved into having these brains that learn in stages, when the early years are responsible for core functions (walking, running, talking, etc). There's a lot of input (all the senses) that feeds learning, while in later years we mostly just take for granted whatever we learnt as children.
I recently witnessed a speech in Indonesia for Nyepi, the Balinese New Year. I was trying to use Google Translate's live conversation feature to get an idea of what was being said. I still couldn't make out much more than "he's saying something about the importance of the holiday and being pure".
I pasted the auto-translation into ChatGPT and asked it to summarize:
> The speaker seems to be discussing the importance of Nyepi, a Balinese Day of Silence. They mention that happiness, peace, and prosperity can be achieved by being in harmony with space and time. The speaker also references ogoh-ogoh, which are statues symbolizing negative influences, and the Panca Maha Bhuta, or the five elements of life. They suggest that negative emotions and behaviors can lead to a "darkness of the mind."
> The speaker emphasizes the importance of self-control and using the Nyepi celebration as a milestone for personal growth. They mention the parading of the ogoh-ogoh, which represents negative behaviors being confronted and released. Following the Nyepi celebration, people should aim for a new and better life, leaving behind their past negative behaviors and emotions.
I'm not convinced of the accuracy, but I'd definitely say it's useful.
How long until we can have realtime translations so I can talk to someone in China on the phone and I hear English & they hear Chinese? That is one of my scifi dreams. Any open source projects working on this atm?
How could that ever work with languages that have different noun/verb/idea ordering?
Intent translation generally requires you to say your sentence in your language, then the translator adds/removes words and can swap order of stuff around.
You’d expect some delay to handle that, but it’s all taken care of with minimal delays. Have a UI that indicates it’s waiting for the translation or something. I’m sure someone can figure out a nice UX.
A very funny phrasing of some legitimately confusing news.
I keep thinking that there's no way that Brain/DeepMind are just getting stomped, lapped, generation-gapped by e.g. `ChatGPT`: they must have had an internal demo of this sort of thing like 2 years ago, right? At some point the Empire strikes back?
But the rollout and product integration has been so well done, so coordinated and cohesive that it's now just obvious that it was way too soon to count MSFT out of the game. It's all through search and Office 365 and GitHub/CoPilot/etc. and the whole stack in such a legitimately compelling way that you can almost forget that the DNA is Win32.
It's a bad thing to let Microsoft get a stranglehold on developer and user mindshare network effects: the 90s were rough. But with how cool it all is it's very tempting...
Google has almost no business sense, or marketing acumen. They've only ever done one thing well, from a business perspective - create a world class search algo, put it behind a minimalistic web page, and pay for it with ads. And they didn't even come up with that business model: they just did what the competition did, but made the idealistic-nerd version of it.
Everything since then has been a combination of algorithm and compsci research (which they are world-class in; credit where credit's due), vague ideas about things people might like, and copies (or buyouts) of their competition. They remind me of my engineering friends who tried to come up with business plans in college...good at building things, but clueless about figuring out what people actually need, what they should therefore actually build, and how to make it user-friendly. You know, all the stuff that you need if you want to run a business. (As I've said before on hn, their initial competition against youtube is a great example of this)
It's a surprise that a technology came along which upended them so abruptly, but it's been clear for a long time that they were only alive because their search engine couldn't be beat, and they didn't have a clue how to replicate that success.
There seems to be a lot of truth in that, but I think it's also maybe a little harsh as well.
Google hasn't needed to generate another monster revenue stream outside of ad sales, so it's possible that they never really tried all that hard (they've certainly killed it on the things that protect it, notably Android and Chrome). An utterly dominant position in how people access information that lasts for decades is probably "a hell of a drug".
For example, GCP is technically a really, really good cloud offering, maybe even the best for a lot of use cases (if you haven't looked at it lately, Cloud Bigtable looks friggin amazing; I wish I'd had that database for the last ten years). They've obviously failed to parlay that technical achievement into dominant market share, but maybe with the pressure on around search they get serious about whatever combination of pricing and marketing and customer support gets them some serious market share.
YouTube has been quietly building their TikTok competitor into something I'm actually starting to waste some time on; the people who work on that are clearly really good at their jobs, even if they started a little slow.
And on the LLM space, honestly I'm rooting for them: MSFT/OpenAI/ChatGPT need at least some competition and they are probably the best positioned to do so. Facebook/Meta is also doing this stuff in a more "open" way and that's keeping the pressure on around some competition too.
In general this LLM stuff is going to be a great thing long-term, but letting one company dominate both mindshare and marketshare is going to make that a much rougher road for society than if it's avoided.
Google senior management seems out of touch. It baffles me, since they have the money and the influx of talent. Once you have those things, how you use them becomes the problem, and that is all on management. Google might have become the old-school corporation incapable of innovating or producing new modern business lines. Having worked in those corporate environments, I can say that badly incentivized management can kill any giant of industry.
It's been going on for some time. Something that was once a joke made in good jest, a.k.a. Google's graveyard, is now their actual reputation, and it helps their strong big-tech competitors when competing on new services.
yes "They've only ever done one thing well, from a business perspective - create a world class search algo, put it behind a minimalistic web page, and pay for it with ads. And they didn't even come up with that business model: they just did what the competition did, but made the idealistic-nerd version of it."
Hot take, but OpenAI's whisper was released earlier in the year and was quite impressive. They were definitely "first" - even if this model claims to outperform that one.
The WER is going to vary a lot depending on conditions. This model is better than Whisper, and Whisper does much better than 10% on good-quality audio with common accents.
More like universal speech memorization. Model implies they had some sort of insight, a simplification, an understanding of how natural language works. This is just bragging about the number of parameters they can pull off.
If it can memorize all possible waveforms across all those languages and how they convert into words in only 2B parameters then I'm incredibly impressed.
(Of course it isn't doing this, and this criticism is just wrong.)
> Model implies they had some sort of insight, a simplification, an understanding of how natural language works.
Well it shows that using the same model for multiple languages increases performance which was something that was not at all clear 10 or even 5 years ago.
> This is just bragging about the number of parameters they can pull off.
>If it can memorize all possible waveforms across all those languages and how they convert into words in only 2B parameters then I'm incredibly impressed.
Well then I don't think you've looked at the topic very deeply. Our voices don't make every possible wave form. The IPA alphabet has existed for a generation now, is universal for all spoken human languages, and has a basis in physiological origins. Of more than 160 IPA symbols, relatively few will be used to transcribe speech in any one language. It should not take 2B parameters to do this.
Look at the inner ear and you'll see a big hint. Before even getting to the brain, the sound goes through a spiral-shaped chamber. Quoting Wikipedia: "The hair cells in the Organ of Corti are tuned to certain sound frequencies by way of their location in the cochlea...". Clearly, looking at time-series waveform data in the first place is a mistake. Rolling-window Fourier transform all the speech data first, then train on it. I will bet 2-digit sums of money that that simple preprocessing step would outperform what they've done.
Really, take any recording of people talking, open it up with audacity, and check out the spectrogram view. The sort of sounds we make with speech are a lot like an FM radio transmission. There's a baseband average pitch that we speak at, and then information encoded on top by deviating over/under the base band in smooth, simple ways. If that spectrogram was an OCR problem, it would already be solved.
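For anyone curious, the rolling-window Fourier transform step I'm describing is a few lines with scipy (just a sketch: the window and overlap sizes are arbitrary and the WAV file is assumed to be 16-bit mono):

    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import stft

    sr, audio = wavfile.read("speech.wav")
    audio = audio.astype(np.float32) / 32768.0         # scale 16-bit PCM to [-1, 1]
    freqs, times, Z = stft(audio, fs=sr, nperseg=512, noverlap=384)
    spectrogram_db = 20 * np.log10(np.abs(Z) + 1e-10)  # (n_freqs, n_frames) magnitudes in dB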
>A humble brag about how small the model is maybe?
I see a B at the end of a number and I assume it's part of a pissing contest. But if they really are playing parameter golf, bragging about how small they can get it, then I admit that's a step in the right direction.
I think I remember reading that languages can choose any point on the vowel diagram as a canonical phonetic expression of a vowel, even though a given language will only have a small finite inventory of vowels.
If this is right, one could interpret it as phonemes being digital (a native speaker can consistently transcribe them using a small number of IPA symbols), but (at least vowel) phones being analog (the typical frequency-spectrum appearance representing a specific vowel could best be represented by floating-point numbers that are culturally chosen somewhat arbitrarily by language evolution). Like /ɪ/ might be 0.82 closeness, 0.23 backness, 0.05 roundedness in some language, but 0.79 closeness, 0.26 backness, and 0.1 roundedness in another.
In turn, if that's right, I don't know how much variation there is among native speakers (from a particular region) in production of a given vowel (or how consistently and precisely children can learn these values). Presumably for recognition the brain rounds off somehow to the nearest vowel phoneme, kind of like picking the nearest point on the constellation diagram in digital communications demodulation?
>a specific vowel could best be represented by floating-point numbers that are culturally chosen somewhat arbitrary by language evolution). Like /ɪ/ might be 0.82 closeness, 0.23 backness, 0.05 roundedness in some language, but 0.79 closeness, 0.26 backness, and 0.1 roundedness in another.
That's not 3 floats' worth of data; those are all ints over 100.
We generally think of people, but not languages, as having deep or squeaky voices. Furthermore, we usually have no problem following when the speakergoesfaster or enunciates reaaaallllyy slooowwwlllyy or even ish they slur theirr wurds abit. I don't think a language-specific absolute value of vowel tone is a real thing. Rather, what matters is that the direction and magnitude of the shifts in tone are all in consistent proportions relative to each other. When you look at the vocalizations of humans, or even other mammals, on the spectrogram, there are very clearly just simple modulations happening, like moving linearly from a starting frequency to an ending frequency, or at most taking an x^2 curve to get there. When you think about it, this makes sense, as those sorts of patterns are trivial to implement on the hardware: just squeeze or release the muscle on the vocal cord harder. You're right that in principle any set of frequencies can be the vowels, but every language is going to discretize down to a fixed number. So instead of having or needing language-specific absolute "vowel frequencies", you can just look for k distinct levels in the individual speaker's voice.
If you've gotten as far as calculating the spectrogram, everything I described can be easily implemented by applying curve fitting to that spectrum and then thresholding the coefficients.
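A very loose sketch of that curve-fitting-and-thresholding idea (the polynomial order and threshold are invented for illustration; spectrogram and freqs are as produced by an STFT like the one sketched earlier):

    import numpy as np

    def fit_dominant_track(spectrogram, freqs, order=2, coeff_threshold=1e-2):
        # Dominant frequency bin per time frame, then a low-order polynomial fit
        # as a crude model of the smooth modulations described above.
        track = freqs[np.argmax(spectrogram, axis=0)]
        t = np.linspace(0.0, 1.0, len(track))
        coeffs = np.polyfit(t, track, order)
        return np.where(np.abs(coeffs) < coeff_threshold, 0.0, coeffs)  # drop small terms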
> You're right in principle any set of frequencies can be the vowels but every language is going to discretize down to a fixed number. So instead of having or needing language specific absolute "vowel frequencies", you can just look for k distinct levels in the individual speakers voice.
But if a native English speaker speaks Japanese with five appropriately-spaced English vowels, it will be directionally correct and comprehensible, but native Japanese speakers would notice that those aren't the normal vowels used by native speakers in Japanese.
So I guess I'm wondering about how much leeway there is in what sounds like a correct vowel to a native speaker, how much range there is in a given person's production, how much range there is in production among members of a community who interact with each other constantly, and how small a difference between vowels someone can detect as a listener.
Here is my rank conjecture. Part of the advantage to working in frequency space as I suggest is because the output is invariant under certain transformations (up to a multiplication by a phase factor e^(i omega)). In the 2d image case, the invariant is translations along either axis. In the 1d time series case, the invariant is translations along the time axis. That's just a fancy way to say you get the same frequencies out no matter what time you pressed record. So I conjecture that the sorts of changes to the sound wave form which don't get in the way of our comprehension are exactly the invariant of some transform similar to Fourier which happens as a low level processing step shortly after the ear. The exact extent to which a vowel tone differs from the native speaker average manifests as (and is encoded in) a phase shift on the output of the transform, but the output magnitudes stay invariant and thus recognizable.
I think the other thing that matters is that the relative proportions of vowel tones correspond to a word the listener expects to hear given the sentence/situation. A word pronounced so unconventionally that it's been turned into a homophone for another word will still generally be understood, even if the listener notices the "wrong" vowel tone. Which is where the 2bn parameters come in...
I fully agree that context, the semantic content itself, primes the listener's expectations of the next rabbit. But that sort of language model need not be GPT-level good. It just needs to identify which words/letters are unlikely to be next to each other and which ones they might be mistaken for. An unsophisticated Markov chain could do that. Is this really a 2B-parameter problem, or do we do things this way because parameters are currently cheaper than expertise?
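Something like this toy bigram model would do the unsophisticated version of that (the corpus sentences and the probability threshold are placeholders):

    from collections import Counter, defaultdict

    def train_bigrams(corpus_sentences):
        counts, totals = defaultdict(Counter), Counter()
        for sentence in corpus_sentences:
            words = sentence.lower().split()
            for prev, cur in zip(words, words[1:]):
                counts[prev][cur] += 1
                totals[prev] += 1
        return counts, totals

    def unlikely_words(sentence, counts, totals, threshold=1e-3):
        # Flag words that rarely follow their predecessor in the training corpus.
        words = sentence.lower().split()
        flagged = []
        for prev, cur in zip(words, words[1:]):
            p = counts[prev][cur] / totals[prev] if totals[prev] else 0.0
            if p < threshold:
                flagged.append(cur)
        return flagged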
> I will bet 2-digit sums of money that simple preprocessing step would outperform what they've done.
Okay. I take this bet. If your simple preprocessing step outperforms whisper on, say, me reading a random wikipedia article, I'll happily send you any 2-digit sum of money (denoted in dollars presumably), or donate it to a charity of your choice.
In retrospect I should have realized the corner I was painting myself into here. If I'm right, you just paid bargain-basement prices for an algorithm that can replace an expensive 2B-parameter ML process. But if I had offered what the bet is actually worth, it would look like I was trying to price people out of ever taking me up on it. Well, shit.