Neat, https://github.com/openai/whisper - they have open-sourced it, even the model weights, so they are living up to their name in this instance.
The 4 examples are stunningly good (the examples have speakers with heavy accents, speaking in a foreign language, speaking with dynamic background noise, etc.); this is far and away better than anything else I've seen. Will be super curious to see other folks try it out and find out whether it's as robust as it seems, including when confronted with speech full of natural tics and uhhh's and uhmm's and everything in between.
I think it's fair to say that AI transcription accuracy is now decidedly superior to the average human's; what the implications of this are, I'm not sure.
It was already better. I edit a podcast and have > a decade of pro audio editing experience in the film industry, and I was already using a commercial AI transcription service to render the content to text and sometimes edit it as such (outputting edited audio).
Existing (and affordable) offerings are so good that they can cope with shitty recordings off a phone speaker and maintain ~97% accuracy over hour-long conversations. I'm sure it's been an absolute godsend for law enforcement and other people who need to gather poor-quality audio at scale, though much less great for the targets of repressive authority.
Having this fully open is a big deal though - now that level of transcription ability can be wrapped as an audio plugin and just used wherever. Given the parallel advances in resynthesis and understanding idiomatic speech, in a year or two I probably won't need to cut out all those uuh like um y'know by hand ever again, and every recording can be given a noise-reduction bath and come out sounding like it was recorded in a room full of soft furniture.
>~97% accuracy over hour-long conversations. I'm sure it's been an absolute godsend for law enforcement
97% accuracy means roughly three or four errors per minute of speech. That seems potentially extremely problematic for something like law enforcement use where decisions with significant impact on people's day and/or life might be made on the basis of "evidence".
No it isn't. That just means 2-3% of your content needs to be double-checked by a person at the audio level, saving huge amounts of time - equally true of human transcription, in which individual words are often [UNINTELLIGIBLE].
Would you want to review this fully before going into court, absolutely - because you'd want to play the recording to a jury for emotional impact. Can you rely on it when you want to quickly read through hours of conversation and make decisions about whether to invest further resources (which might just mean another hour of listening back to the original audio)? Also absolutely. Bear in mind that a lot of these errors have little to no semantic impact, being on the same level as typos or misspellings in a written communication.
Bear in mind too that if law enforcement (honest or not) is so interested in you that they're willing to record your conversations, your day is already ruined, you just don't know it yet. The change here is one of scale rather than quality.
Doesn't it mean 100% of your content needs to be double-checked? You can't easily identify which 2-3% of your content has errors. I'm aware that errors are more likely when the model is less confident of its predictions, but that shouldn't be enough.
(edit for clarification: errors are not always something like "[UNINTELLIGIBLE]", where the system knows it doesn't know; they can also be misrecognitions that the system believes in with high confidence.)
By the time you're prosecuting someone in court, yes of course you double, triple, quadruple check everything. That's why lawyers get paid the big bucks (for now...). But yes you can identify which content probably has errors and flag it as such.
Look, I have decades of experience dealing with human speech, and not just as an editor - I can trace the human voice from neural impulses in Broca's region through the physiology of vocal production, mechanical transduction into electrical signals, discrete Fourier transforms of the resultant waveforms into spectral information and back again, the reproduction of altered signals from time-aligned speakers to create a sense of spatialization, how those are processed in the human ear, and how the cilia are connected by nerves back to your brain. I'm a good enough editor that I can recognize many short words by sight of a waveform, or make 10 edits in a row by sight and know it will sound good on playback.
So when I say that machine transcription is as good as human realtime transcription now, I say so with the clear expectation that those decades of craft are very close to being rendered obsolete. I absolutely expect to hand off the mechanical part of editing to a machine within 2 years or so. It's already at the stage where I edit some interviews as text, like in a word processor, and then export the edited document as audio and it's Good Enough - not for every speaker, but more than half the time.
NPR and a lot of commercial broadcasters cut their material this way already, because you can get the same result from 30 minutes of reading and text editing that would require 3 hours of pure audio editing with no transcription.
What tools do you use to do this? I once hacked together an editor like this maybe a decade ago -- edit speech as text from OCR -- and sorely need one now.
Alignment of video to text is a big problem for me too.
> So when I say that machine transcription is as good as human realtime transcription now...
Would you go as far as to assert machine transcription can be used as an objective benchmark of a speaker’s verbal legibility?
It is fraught with political and interpersonal dynamics to approach someone even privately one on one today and gently suggest their career would get a huge boost if they hired a voice coach to help improve their verbal communication delivery. So even when I don’t directly mention their accent, it becomes a very sensitive subject with many.
However, if audio professionals like you can point to a system and say the raw biomechanics and acoustic physics of the world dictate that this is as physically and psychometrically good as audio parsing of human speech gets regardless whether the system was biologically evolved or ML evolved, the conversation can be couched even more objectively.
I enable recording and voice transcription in every meeting I can (ostensibly for DE&I but really for my own selfish purposes), and already observe in myself I have to work hard to overcome a tendency to gloss over speakers who don’t transcribe well when I review meeting transcripts to jot down any key information I might have missed taking notes upon during the meeting.
Note that I’m perfectly aware that my foreign language verbal skills are nowhere near the English skills of those I have tried to help. If the lingua franca of the coding world switched to Urdu tomorrow, then I’d hire help to learn and polish my spoken Urdu, like I went to a speech coach when learning public speaking because I can always use help in the many skills I lack.
Presumably you can use the 97% that is correctly transcribed to rapidly filter out the relevant content. This is likely to be only a small portion of the total content. Then you check 100% of that.
> I'm aware that errors are more likely when the model is less confident of its predictions, but that shouldn't be enough.
Suppose 90% of the errors are in the 10% where the model is least confident. Then you can review just 10% of your content and take a 2% error rate down to 0.2% error rate.
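For instance, Whisper's own transcribe output appears to include per-segment confidence fields, so a rough triage could look like the sketch below (field names such as avg_logprob and the audio filename are assumptions based on the released package, not a guaranteed interface):

    # Hedged sketch: surface the lowest-confidence segments of a Whisper
    # transcript for human review. Assumes each segment dict carries an
    # "avg_logprob" field, as the released package appears to emit.
    import whisper

    model = whisper.load_model("medium")
    result = model.transcribe("interview.wav")   # hypothetical recording
    segments = result["segments"]

    # Send the least-confident 10% of segments to a human listener.
    by_confidence = sorted(segments, key=lambda s: s["avg_logprob"])
    to_review = by_confidence[: max(1, len(segments) // 10)]

    for seg in to_review:
        print(f"{seg['start']:8.1f}s - {seg['end']:8.1f}s  {seg['text']}")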
You can also use multiple transcription engines and then use mismatches among the text streams to narrow down the % of content that needs to be reviewed. This is quite similar to multi-voting OCR for document images.
The principle is that the engines have different failure modes (hopefully) and therefore the 2-3% error rate of each engine is in different areas of the audio. The key underlying assumption is that the events are mutually exclusive.
With 3 engines, you can use something like 2-of-3 stream matches to override the stream that mismatches.
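A minimal sketch of that 2-of-3 idea, assuming the three hypotheses have already been time-aligned into the same word slots (the alignment itself is the hard part and is left out here):

    # Hedged sketch of 2-of-3 voting across three transcription engines.
    from collections import Counter

    def vote(stream_a, stream_b, stream_c):
        merged, disputed = [], []
        for i, words in enumerate(zip(stream_a, stream_b, stream_c)):
            top_word, votes = Counter(words).most_common(1)[0]
            merged.append(top_word)
            if votes < 2:              # all three engines disagree
                disputed.append(i)     # flag this slot for human review
        return merged, disputed

    a = "the suspect was not at the scene".split()
    b = "the suspect was not at the seen".split()
    c = "the suspect was knot at the scene".split()
    text, flags = vote(a, b, c)
    print(" ".join(text), flags)       # majority voting recovers the original sentence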
I had to do a lot of manual transcription in journalism school. Using a tool like Descript saved HOURS of my life. Generally it was 80% accurate, but going over a two-hour-long recording again at 3x speed while reading the transcript, fixing errors from memory or pausing, took a five-hour job down to 30-40 minutes. Either way, somebody is going to have to listen to the recording. This just removes a layer of grunt work.
Having done audio transcription in college as a side gig, it takes a lot longer than it sounds. Even at a decent 100wpm you'll take about 5 minutes to type out 1 minute of audio.
Not having to pause + rewind will save a ton of time for that 3%.
For real. The way people normally speak, with backtracking, repetition, restarting sentences, or stopping mid sentence and starting a new one with entirely different nouns or entire subjects is perfectly normal in synchronous conversation and isn't jarring, but written down as is, it's like 40% noise.
To be fair, you chose a video that displays an amalgamation of the biggest gaffes of 2021 for Biden.
“During his term as President of the United States, Donald Trump made tens of thousands of false or misleading claims. The Washington Post's fact-checker had tallied the number as 30,573 by January 2021, an average of about 21 per day by the end of his presidency.” [1][2][3][4]
I think it's fair to say there would be a 100-plus-hour video / documentary if they were all compiled into one. Lovely!
- [1] Fact Checker (January 20, 2021). "In four years, President Trump made 30,573 false or misleading claims". The Washington Post. Archived from the original on January 20, 2021.
- [2] Kessler, Glenn (January 23, 2021). "Trump made 30,573 false or misleading claims as president. Nearly half came in his final year". The Washington Post. Archived from the original on January 24, 2021. Retrieved January 24, 2021.
- [3] Elfrink, Tim (August 14, 2020). "'Do you regret at all, all the lying you've done?': A reporter's blunt question to Trump goes unanswered". The Washington Post. Retrieved August 14, 2020.
>equally true of human transcription, in which individual words are often [UNINTELLIGIBLE].
ML systems somewhat notoriously do not necessarily make the same sorts of errors that a human would. And I'd expect a large portion of the errors to be transcribing the wrong words rather than indicating that a word couldn't be transcribed. That sort of error means that you can't really get away with manually reviewing just 3% of the audio.
ML tending to make weird mistakes, rather than the subtle ones that make sense in context like human transcribers produce, is likely to make them easier to spot.
And there are humans in the loop too, and an enormous amount of redundancy in the questions and answers, so even plausible false transcriptions will get picked up on if they matter. Nobody gets sent to jail simply because the transcription process - human or machine - accidentally substitutes "I did it" in place of "I didn't" midway through a two hour interview.
The thing is that 'Likely' is very far away from 'always'.
There is no guarantee the mistake will be easy to spot.
For entertainment purposes AI transcription is awesome.
For serious business applications, the ability to recognize mistakes will continue to be a field to which serious attention is given. It would be interesting to see an AI process double-check itself, and also run a logic check on whether the transcription makes sense, so that it can report sections flagged as incongruous or of dubious reliability.
+1. There is a widespread "metric fallacy" or "task fallacy" going around. Models of course optimize for metrics, so they tend to perform well on those related metrics.
Humans, however, are not simply metric optimizers. Though it's always in the interest of those corporations producing metric optimizers (i.e. models) to paint humans as such, so their models shine in comparison. They want humans to look like bad machines, so it looks like they should be automated. Not to say they shouldn't in many cases, just that there's a clear one-sidedness in all corporate PR (and funded research, especially that research which is also PR).
All this to say that yes I agree with you. And if we humans don't want our unsustainable economic growth to turn us even more into machines (as our bureaucratic creep has done quite well thus far), we should fight such rhetoric that aims to paint humans simply as machines or task-doers.
When doing validation, I find it will often be the same errors repeated again and again in a transcription. Like it will fail on someone's or something's name (one that is rare / unique) and map it onto a known, similar-sounding word.
Sometimes even humans will disagree about what was said in a recording - I had this happen recently. I heard a specific sentence; the other person heard the exact opposite. I cannot say who was right: even after listening to the recording several times on headphones and speakers, I'm as certain of my interpretation as the other party was of theirs.
It'd [UNINTELLIGIBLE score="92%" alternatives="pro-rabble; pourable"]probably[/UNINTELLIGIBLE] be useful to make a markup-based output... though you'd probably find it gave you more info than you wanted.
It already exists. The commercial product I use most is called sonix.ai and I think they have a free tier or trial period. It has shortcomings but it's shockingly good, despite having some limitations.
Yeah, I tried to use automated transcription for a research project and we had to do it all manually because the few errors (I would say it did pretty well given our recording quality) were often dropping words like "not", which changed the whole meaning of a sentence! It was a useful assistance during transcription, but I really hope they would verify it was correct before arresting anyone based on it.
Microsoft announced their voice transcription technology a couple years ago and were also touting ~97-98% accuracy which was actually better than human transcription error rates. The errors are usually in part people garbling their own speech, or they move their head while talking and the microphone misses a syllable. Anything in that error bar would probably fall under "reasonable doubt"
I've worked with similar technology in the law enforcement space and the software is never used to make decisions. You can make out critical timestamps in conversations and a law enforcement officer will always manually confirm the software's assessments.
In all honesty, this is the correct mindset to have. I have limited expertise in this topic, and you should be aware that other law enforcement agencies probably do not handle this the same way.
This is called "a fishing expedition" and is wildly unconstitutional in the US.
>The right of the people to be secure in their persons, houses, papers, and effects, against unreasonable searches and seizures, shall not be violated, and no Warrants shall issue, but upon probable cause, supported by Oath or affirmation, and particularly describing the place to be searched, and the persons or things to be seized.
Besides I wasn't talking about the USA when I said this. I was remembering a conversation I once had with a person who worked as a technician in a telephone exchange.
Yes, it is wildly unconstitutional, but in practice don't the courts endorse the asinine "it's not a search unless we find something" argument from the NSA?
Power always just finds a way to rationalize what it wants to do.
Not really. Imagine that they do simple keyword matching on the text. Anything that's missed (because it fell in the 3% of errors), the criminals get away with. Anything that matches is then checked by a human (by listening to the audio at that timestamp). So you only need to manually check the matches, and even then only if something you're interested in is found.
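A hedged sketch of that triage flow, assuming Whisper's per-segment output (field names and the watch list are illustrative, not a real workflow):

    # Transcribe, keyword-match, and hand only the matching timestamps
    # to a human listener. Crude word matching; no stemming or fuzziness.
    import whisper

    KEYWORDS = {"shipment", "warehouse", "payment"}   # hypothetical watch list

    model = whisper.load_model("base")
    result = model.transcribe("intercept.wav")        # hypothetical recording

    for seg in result["segments"]:
        words = {w.strip(".,!?").lower() for w in seg["text"].split()}
        if words & KEYWORDS:
            # A human listens only to these spans of the original audio.
            print(f"check {seg['start']:.0f}s-{seg['end']:.0f}s: {seg['text']}")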
One would think that the few crucial bits of information gleaned are listened to manually, and that the machine transcription is not the only thing the judge or a jury sees.
For technical content, I use Rev.com and provide a glossary and real humans do the transcript. Other AI transcription services get lots wrong because the context often matters. Words like "TCP/IP" or "FAT disk format" or "Big Endian" I've never found AI so far to handle well.
There's already software that can imitate a person's voice, so we have all the pieces already to do speech-to-text, clean up with GPT-3, and back to text-to-speech in the original person's voice. Maybe with a style transfer to keep the person's inflections etc the same?
Since you work on podcasts, do any open source transcription tools currently identify the speaker in the output? This would be particularly helpful for interviews.
Not sure about open source, but in general, automated transcription systems need a separate track for each different speaker. So for example, for a phone call with one person on each end, you need two separate channels (recording systems usually split them left/right on one stereo file).
I use a service called sonix.ai. It's paid but I think they have a free tier or trial period, and it's not very expensive. I'm excited about this new OpenAI thing because I'd rather do it on my own hardware than send it to the cloud, but this company has earned its commercial success.
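For the track-per-speaker approach, here's a hedged sketch of what a DIY version could look like with Whisper: split a stereo call recording into two mono files, transcribe each, and merge by timestamp (filenames are hypothetical, and it assumes the input really is one speaker per channel):

    # Split a stereo recording into per-speaker tracks and transcribe each.
    import soundfile as sf
    import whisper

    audio, sr = sf.read("call.wav")           # stereo: shape (samples, 2)
    sf.write("caller.wav", audio[:, 0], sr)   # left channel
    sf.write("callee.wav", audio[:, 1], sr)   # right channel

    model = whisper.load_model("base")
    lines = []
    for speaker, path in [("Caller", "caller.wav"), ("Callee", "callee.wav")]:
        for seg in model.transcribe(path)["segments"]:
            lines.append((seg["start"], speaker, seg["text"]))

    # Interleave the two sides chronologically.
    for start, speaker, text in sorted(lines):
        print(f"[{start:7.1f}s {speaker}] {text}")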
That is an exciting possibility. Being able to fix bad setups and missed takes automagically. It’s always been possible, just expensive and time consuming for moderate improvements.
The French version is a little contrived. The speaker is a native speaker, but the text is obviously the result of a translation from English to French, not idiomatic French.
I will try to put the code to the test, see how it goes.
Interesting; I'm a non-native French speaker, and the original French piece struck me as entirely normal (but maybe it was just the perfect French accent that swayed me). Can you please point out what he said that wasn't idiomatic or naturally worded French?
Little details. The second sentence is really bizarre:
> Nous établissons que l'utilisation de données d'un tel nombre et d'une telle diversité est la raison pour laquelle le système est à même de comprendre de nombreux accents...
It doesn't sound natural at all. An idiomatic formulation would be more along the lines of:
Le recours à un corpus [de données] si riche et varié est ce qui permet au système de comprendre de nombreux accents (With 'corpus', 'données' is implied.)
Of course this is just an example, and I'm sure other French speakers could come up with a different wording, but "données d'un tel nombre et d'une telle diversité" sounds really wrong.
This is also weird and convoluted:
> Nous distribuons en tant que logiciel libre le code source pour nos modèles et pour l'inférence, afin que ceux-ci puissent servir comme un point de départ pour construire des applications utiles
It should at least be "le code source DE nos modèles" and "servir DE point de départ", and "en tant que logiciel libre" should be placed at the end of the clause (after 'inférence').
Also, "construire" isn't used for code but for buildings, and "applications utiles" is unusual, because "utiles" (useful) is assumed. "...pour le développement de nouvelles applications" would sound more French.
That's interesting; as a Québécois I don't agree with any of this. The only thing that raised an eyebrow was "est à même de", but it turns out it's just another way of saying "capable de", so I guess it's simply not a common idiom around here. Aside from that, I found the wording flowed well, even if I personally would've phrased it differently.
Gonna have to agree with the other reply, as a French Canadian: except for "servir comme un point de départ", which should be "servir de point de départ", that all sounds perfectly fine.
If this is actually "good" or even acceptable French Canadian, then it's a different language from French (and the blog post should mention it).
I kind of doubt it though -- the speaker doesn't have a Canadian accent (which is hard to miss), and in my (admittedly limited) experience, French Canadian isn't that different from French.
Older generations sometimes do. My grandma and her sisters nearly never use "on".
"Nous" is often used for larger groups or when the group is not very personally connected. For instance, when talking about your company doing something you will often use "nous". I would also use "nous" to refer to the whole list of invitees to a wedding. And in formal contexts like research papers, reports, etc., you would never use "on", always "nous".
I'm interested in building something with this to aid my own French learning. Would love to read your findings if you end up posting it somewhere like twitter/blog!
Trois mille six cents fois par heure, la Seconde
Chuchote Souviens-toi !– Rapide, avec sa voix
D'insecte, Maintenant dit Je suis Autrefois,
Et j'ai pompé ta vie avec ma trompe immonde !
Remember ! Souviens-toi ! prodigue ! Esto memor !
(Mon gosier de métal parle toutes les langues )
Les minutes, mortel folâtre, sont des gangues
Qu'il ne faut pas lâcher sans en extraire l'or !
Transcription:
> Trois mille six cents fois par heure, la seconde chuchote « Souviens toi », rapide, avec sa voix d''insecte, maintenant dit « Je suis autrefois », et j''ai pompé ta vie avec ma trompe immonde. « Remember, souviens toi, prodigue, est au mémoire, mon gosier de métal, parle toutes les langues, les minutes, mortelles folâtres, sont des gangs qu''il ne faut pas lâcher sans en extraire l''or. »
Not bad! Far from perfect but it's a difficult text. Interesting that it works better with Baudelaire than Pascal.
Tried again with Blaise Pascal -- the famous fragment of a letter where he says he's sorry he didn't have enough time to make it shorter.
Original:
> Mes révérends pères, mes lettres n’avaient pas accoutumé de se suivre de si près, ni d’être si étendues. Le peu de temps que j’ai eu a été cause de l’un et de l’autre. Je n’ai fait celle-ci plus longue que parce que je n’ai pas eu le loisir de la faire plus courte. La raison qui m’a obligé de me hâter vous est mieux connue qu’à moi. Vos réponses vous réussissaient mal. Vous avez bien fait de changer de méthode ; mais je ne sais si vous avez bien choisi, et si le monde ne dira pas que vous avez eu peur des bénédictins.
Transcription:
> Mes rêves errent pères, mais l'detre navais pas accoutumé de se suivre de si près ni d'detre si étendu. Le peu de temps que j'sais eu a été cause de l'de l'de l'de autre. J'sais n'detre plus longue que parce que j'sais pas eu le loisir de la faire plus courte. La raison qui m'sa obligée de me hâter vous est mieux connue qu'moi. Vos réponses vous réussissaient mal. Vous avez bien fait de changer de méthode, mais je ne sais pas si vous avez bien choisi et si le monde ne dira pas que vous avez eu peur des bénédictes.
Here there are many more mistakes, so many that the beginning of the text is unintelligible. The language from the 17th century is probably too different. Still on the "medium" model, as the large one crashes the Colab (not sure how to select a beefier machine.)
Depends on the way you're pronouncing it, maybe. To be intelligible, IMO it must be read differently from a modern text, with well-sounded liaisons and all vowels very distinct: "un" sounds different from "in", "â" clearly differs from "a", "ai" and "è" from "é", and for instance the "e" in "étendues" must be pronounced, though not loudly.
My test gives the following, much better than yours:
Mes *rêverants* pères, mes lettres n'avaient pas accoutumé de se suivre de si près ni d'être si étendues. Le peu de temps que j'ai eu a été cause de l'un et de l'autre. Je n'ai fait celle aussi plus longue que parce que je n'ai pas eu le loisir de *l'af*faire plus courte. La raison qui m'a obligé de me *ra*ter vous est mieux connue qu'à moi. Vos réponses vous réussiss*ez* mal. Vous avez bien fait de changer de méthode. Mais je ne sais si vous avez bien choisi et si le monde ne dira pas que vous avez eu peur des bénédict*eurs*.
Curious. As mentioned I did three tests, two which went pretty well and this one that went bad. I'm French and enunciated the three tests in the exact same way. It's possible there was a technical glitch in this one (that I erroneously attributed to the language of the 17th century)... Will have to try again.
I tried the beginning of L'étranger (because you seem to be a fan of Camus ;-)
Here's the original:
> Aujourd’hui, maman est morte. Ou peut-être hier, je ne sais pas. J’ai reçu un télégramme de l’asile : « Mère décédée. Enterrement demain. Sentiments distingués. » Cela ne veut rien dire. C’était peut-être hier.
> L’asile de vieillards est à Marengo, à quatre-vingts kilomètres d’Alger. Je prendrai l’autobus à deux heures et j’arriverai dans l’après-midi. Ainsi, je pourrai veiller et je rentrerai demain soir. J’ai demandé deux jours de congé à mon patron et il ne pouvait pas me les refuser avec une excuse pareille. Mais il n’avait pas l’air content. Je lui ai même dit : « Ce n’est pas de ma faute. » Il n’a pas répondu. J’ai pensé alors que je n’aurais pas dû lui dire cela. En somme, je n’avais pas à m’excuser. C’était plutôt à lui de me présenter ses condoléances.
Here's the transcription:
> Aujourdhui, maman est morte, peut être hier, je ne sais pas. J''ai reçu un télégramme de l''asile. Mère décédée, enterrement demain, sentiment distingué. Cela ne veut rien dire. C''était peut être hier.
> L''asile de Vieillard est à Maringot, à 80 km d''Alger. Je prendrai l''autobus à deux heures et j''arriverai dans l''après midi. Ainsi, je pourrai veiller et je rentrerai demain soir. J''ai demandé deux jours de congé à mon patron et il ne pouvait pas me les refuser avec une excuse pareille. Mais il n''avait pas l''air content. Je lui ai même dit, ce n''est pas de ma faute. Il n''a pas répondu. J''ai alors pensé que je n''aurais pas dû lui dire cela. En somme, je n''avais pas à m''excuser. C''était plutôt à lui de me présenter ses condoléances.
Except for the weird double quotes instead of the single apostrophe ('), it's close to perfect, and it only uses the "medium" model.
This is extremely exciting and fun! Happy to try other texts if you have something specific in mind!
More of this is welcome; they should live up to their name and original purpose and share other models (code, weights, dataset) with the open source community as well.
It seems far from good with mixed-language content, especially English and Japanese together. The timestamps are far from perfect, and it's nowhere close to human for the more ambiguous translations that depend on the context of a word - far below what anyone who spoke either language would consider acceptable. Maybe it's unfair to use music, but music is the most realistic test of whether it's superior to the average human.
This isn't exactly a hard story to fact-check. There is zero evidence for this in either the reddit thread or really anywhere. If they were willing to lie about the company name, why not just lie about the beef in their burgers? It would be equally scandalous.
Something being possible to do isn't enough evidence for rational people to believe that it happened. From my perspective, it's possible that you're Iron Mike Tyson, or that you died after your last comment and this one was posted by the assassin who killed you.
What? I never said it's evidence that it did happen, please don't make things up. I just pointed out the evidence provided to refute the claim is possibly invalid.
Because I'm not trying to prove that it did or didn't happen, but rather to draw a parallel between that and OpenAI's name. For all I care it could be an urban legend, but who cares - that's not the point.
You are right, it could be. The problem is that it's the kind of thing that would be almost impossible to disprove if it were false. So you can always raise doubts about a supposed disproof.
But it'd be really easy to prove if it were true, and no one has offered proof. And there've been plenty of people who've looked for such proof, afaict.
My default assumption in such cases is that it is likely false.
There are at least two companies that have branded [..] Kosher Gelatin™. One of them makes gelatin that is considered non-kosher by all of the major kashrus agencies.
"Kosher Gelatin®", when in the ingredients, just means the product contains pork.
For what it's worth, I've spent a few minutes googling and can't find any story that corroborates this. The only US trademark I can find around "kosher gelatin" is by the brand Kolatin, which is apparently certified Kosher.
In the US, for a while I remember we had billboards advertising McDonald's burgers as being "1 <hamburger> <hamburger>% beef". Because the hamburgers were of course circular, it looked kind of like "100%".
I remember thinking that surely an image of a hamburger does not legally constitute a zero.
Yes. The same is true of many products from many companies.
I feel bad about GPT-3 and DALL-E being released under the terms they were, but I don't feel bad about this. I'm not going to condemn OpenAI for the good things they did, but I will hold them accountable for bad things or good ones they didn't do.
I'd given up on OpenAI being open or ethical, but this is a start. It took them down from "evil super-villain" status to mere villain.
> It's one model and in a non-strategic area where there are existing open source projects (Kaldi, DeepSpeech, ...).
I can already tell this is much better than any of the existing open source projects with the exception of the wav2* sequence of projects and potentially nvidia's nemo.
Kaldi is an open, pluggable framework and is a ton more flexible and powerful than this. It's used by hundreds of teams, including a number of consumer tech companies you've heard of. They're not going to move to this over it.
Especially because ASR is a living organism. You have to constantly update your language model as new people, ideas, and words move into the normal lexicon. As people start talking about "COVID", "metaverse", "king charles", or whatever new things that happen, these need to be added to your language model. You need these updates monthly at a minimum and OpenAI didn't release the raw data which means you can't retrain it even if you wanted to spend the time/resources to.
So, this is an interesting research project and helpful for small teams and side projects, but it's unlikely it makes any real impact on the industry.
Kaldi just is not fast or high quality enough compared to other modern alternatives like wav2letter. I appreciate that it is more flexible than this, it certainly is - but I am not so sure about "powerful."
True. The potential of GPT-3 to cause internet mayhem was/is significant. I would argue that the mere act of announcing it was still a catalyst for an eventual GPT-3-like model being released. In revealing it, they established a target for what open source models could aim to achieve, and simultaneously got bad actors thinking about ways to abuse it.
It was a credible argument when GPT-3 was released. But now there are open models that are as capable as GPT-3 and that mayhem has not materialized, with the possible exception of GPT-4chan. They could release it now under a non-commercial license, if they cared to.
My experience with GPT-3 is that while it does perform better than those mini-GPT small models, the gap does not compensate for the fact that the small models are free/unrestricted and you can use them as much as you like.
As mentioned elsewhere in the thread there are some large models around the 50-200B band that compete directly with GPT-3, but I haven’t used these.
Two reasons. First, someone else will release something similar. Second, I didn't see a related push from them to work with others in the industry to do something productive towards safety with the time they bought by delaying availability of these kinds of models. So it felt disingenuous.
Several groups already have. Facebook's OPT-175B is available to basically anyone with a .edu address (models up to 66B are freely available) and Bloom-176B is 100% open:
I don’t see how GPT-3 is any more dangerous than Stable Diffusion, Photoshop, that fake news website the crazy person you’re friends with on Facebook really likes, or any of the number of other tools and services that can be used to generate or spread fake information.
I wouldn't really say Stable Diffusion marks images as AI-generated. There's a script in the Stable Diffusion repository that will do that, but it's not connected to the model itself in a meaningful way. I use Stable Diffusion a lot and I've never touched this script.
Trivial to remove, I give you that. But AFAIK, the original repository + most forks put the watermark automatically unless you've removed it on your own.
>Trivial to remove, I give you that. But AFAIK, the original repository + most forks put the watermark automatically unless you've removed it on your own.
Almost all of the 'low-vram' variant forks either have an argument to turn off the watermark (it saves a bit of memory) or come with it disabled altogether.
It would be pretty trivial to have an invisible watermark in GPT-3 output -- though you don't really need one: just score text with GPT-3 to find out whether it was likely GPT-3-generated or not.
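A hedged sketch of that scoring idea, using GPT-2 from the transformers library as a freely available stand-in for GPT-3 (unusually low perplexity under a related model is only a weak signal, not proof of machine generation):

    # Compute perplexity of a piece of text under GPT-2 as a crude detector.
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    def perplexity(text: str) -> float:
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            loss = model(**inputs, labels=inputs["input_ids"]).loss
        return torch.exp(loss).item()

    print(perplexity("The quick brown fox jumps over the lazy dog."))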
This is an astonishing package. Every AI voice-to-text model I've tried on "The Wire's" famous "fuck" scene [0] usually fails, because the youtube clip's audio quality is bad and it's a scene with virtually no dialogue except breathing and "Fuck". But Whisper returned impressive results [1]
Hey this looks great! I like to record audio notes while driving in my car after work, to kind of decompress my thoughts from the day. But I never go back and listen as they can be long and meandering. Sometimes in the audio log I will sum up my thoughts, but this might be 20 minutes in and hard to find. I really wish I had transcriptions so I could easily scan the full contents. I have tried Mozilla Deepspeech (I don't want a cloud solution) and I was surprised to find that I could not get Deepspeech to reliably transcribe them. There is a bit of road noise, though I think for a human listener they are easy to understand. It looks like this one might actually do the trick!
EDIT: Tried it and it worked great! It is very easy to use. I just did the pip install line in the readme and was ready to go. You literally just run the one pip install line, and then you run the program in the format "whisper my_audio.wav" and it goes. Really nice job OpenAI!
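For anyone who wants to call it from Python instead of the CLI, the repo's README shows roughly the following (a hedged sketch; the filename is a placeholder and the API may change):

    # Basic Whisper usage, mirroring the README.
    import whisper

    model = whisper.load_model("base")           # tiny / base / small / medium / large
    result = model.transcribe("my_audio.wav")    # same file you'd pass on the CLI
    print(result["text"])

    # Per-segment timestamps are also returned, which is handy for finding
    # the "summary" buried 20 minutes into a long voice note.
    for seg in result["segments"]:
        print(f"{seg['start']:7.1f}s  {seg['text']}")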
Is that application actually doing on-device transcription? Under "Data safety" on the Google Play page it says "This app may share these data types with third parties: Audio" which doesn't exactly instill confidence that my audio will 100% always stay on my device. It also says "Data is encrypted in transit", but if data stays on the device, why does it have to be "encrypted in transit"? There should be no transit at all.
Yes, it works completely offline, including transcription and recognition of music. There's an optional cloud sync feature, which I assume is the reason for the notice on Google Play.
I just tested it and it was pretty mediocre at least with my accent. I can definitely benefit from a decent app for quick note recording with a button press->transcribe->upload to gdrive/good UI app for later grepping.
I'll probably explore using this, but I've used an app called Just Press Record to do what you say. Runs on Apple Watch too, so you can tap a complication at any time in the day, speak, and you get a transcript on your phone, etc.
I do this too! I have been doing it for about a year now, and haven't ever run into someone else who does this kind of audio journaling. Would you be up for comparing notes sometime about how it is working out for you? I am finding that it is an extremely effective form of self-care, but with lots of personal caveats. I would be so interested to hear about your experience.
Oh cool! Yeah I have stopped doing it lately as I was not really using them (I would like to use them for making rough notes for future youtube video scripts), though in general it does seem like good self care too even if I don't review them. That said I just tried the base model on one of my voice logs and it was pretty good! Trying the medium model now and it seems basically perfect. So I will have to start doing these logs more!
Anyway I am pretty terrible with email but short exchanges can work for me, or maybe we can connect over signal. Send me a message to my email in my profile and I would be happy to sync up!
I suspect Whisper is more robust than other "SOTA" models, but this release is likely leaving a fair bit of accuracy on the table considering the amount of resources OpenAI is capable of throwing at training it.
Comparing the readily available test sets from the paper to some of my personal robust models (for the Talon models, this is greedy decoding, no language model):
    Test set (WER, %)    Talon 28M   Talon 300M   Talon 1B   Whisper Large   wav2vec 2.0 960h
    librispeech clean         3.21         2.52       2.40             2.7                2.7
    librispeech other         8.21         6.56       5.63             5.6                6.2
    common voice             13.88        11.65       8.86             9.5               29.9
    tedlium                   7.51         6.55       5.47             4.0               10.5
I have a battery of more difficult tests on hand (including adversarial tests, and diverse accent-specific metrics). I'll look at running these tests on each of the Whisper model sizes and following up with a larger comparison.
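For reference, word error rates like the ones in the table above are typically computed with something like the jiwer package (a hedged sketch; real evaluations normalize punctuation, casing, and numbers, which can move these numbers a lot):

    # Word error rate between a reference transcript and a hypothesis.
    import jiwer

    reference = "the quick brown fox jumps over the lazy dog"
    hypothesis = "the quick brown fox jumped over a lazy dog"

    print(f"WER: {jiwer.wer(reference, hypothesis) * 100:.2f}%")   # 2 errors / 9 words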
One of the things they point out is that the SoTA on e.g. LibriSpeech is only good at LibriSpeech, and doesn't generalise as well.
> Because Whisper was trained on a large and diverse dataset and was not fine-tuned to any specific one, it does not beat models that specialize in LibriSpeech performance, a famously competitive benchmark in speech recognition. However, when we measure Whisper’s zero-shot performance across many diverse datasets we find it is much more robust and makes 50% fewer errors than those models.
My own experience agrees: the generally available "SOTA" models are not especially robust, and can be _extremely_ bad (>50% absolute error rate) at some tasks. I'll post some preliminary numbers in a sibling comment and look into running my full set of tests on Whisper.
It looks like Whisper is probably leaving a lot of accuracy on the table, but initially it does seem to be a lot more robust than general "SOTA" models.
For a quick comparison, Silero's accuracy charts are kind of nice because they post results for a large variety of datasets. Scroll down to the EN V6 xlarge EE model (not the xlarge CE) [1]
Just tested this on some developer podcasts which usually fail hard given they're full of technical jargon, brand names, etc. Whisper is a revolution! It's picking up terms like Heroku, DigitalOcean, GitHub, ECS, AWS, etc. and capitalizing properly - something nothing else did unless you provided a whole pile of guiding vocabulary.
Did these podcasts have transcripts? You might be inadvertently evaluating it on data that it was trained on, which is basically cheating. Even if not, it might be trained on similar podcasts. Judging how good these kinds of models are is really hard.
Hold on, it does not just speech recognition but also language translation, in the same model?
What an interesting approach. What benefits does this have over having two dedicated models, one for speech-to-text, and another for translation?
It just seems so odd, given that the problems of speech-to-text and Spanish-to-English translation seem so different from one another (in terms of the problem domain). Seems so unusual to have both handled by one model!
Does knowledge of speech-to-text carry over into knowledge of translation? Does knowledge of translation carry over into knowledge of speech-to-text? So weird.
It seems these days that language-oriented models are commonly becoming multilingual by default. There are a lot of common threads when understanding sentence construction between different languages. French and English have different rules but they will still have things like nouns, adjectives, subjects, prepositions, etc. It seems that by training models on many languages you get both a more robust understanding of language, and it saves you the trouble of having to make many more localized models for every language. I also believe that the other languages help the models construct sentences in languages which have very small training sets. If it has a few examples in a rare language as well as good translations to a better-known language, then it can provide good support for the rare language.
We also see in image generation models that multi-modal networks are more powerful than single purpose networks. As we move towards more advanced AI systems I suspect we will see more and more generalizable networks with distinct advantages over separate networks that get plugged together.
My understanding is that multi-modal models are the primary focus of OpenAI right now, due to their stated goal of achieving AGI. This product is probably better thought of as an offshoot of their work to create a fully generalizable model, rather than a specific attempt to provide translation/transcription services.
Judging from the chart in their github README, Whisper performs much better in parsing Spanish audio than any other language and that in particular blows my mind. I would have expected English to be at the top of any such model, it being such an IT lingua franca.
Now I wonder if it works equally well with Spanish from Spain (and its different regions) and Spanish from the New World (and in its myriads of different flavours).
It sounds useful to me because you can use tone information to help with the translation, which text-to-text translation can't do. But I'm not sure if that's how this model actually works.
A bit more context on how it works:
The system's default audio input is captured with Python, split into small chunks, and then fed to OpenAI's original transcription function. It tries (currently rather poorly) to detect word breaks and avoids splitting the audio buffer mid-word. Given how the model is designed, this isn't the most natural way to use it, but I found it worth trying. It works acceptably well.
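A minimal sketch of that chunked approach, assuming the sounddevice package for capture and Whisper's ability to transcribe a raw 16 kHz float32 numpy array (no word-break detection here, so words straddling a chunk boundary will get mangled, which is exactly the problem described above):

    # Record short blocks from the default microphone and transcribe each one.
    import numpy as np
    import sounddevice as sd
    import whisper

    SAMPLE_RATE = 16_000        # Whisper expects 16 kHz mono
    CHUNK_SECONDS = 5

    model = whisper.load_model("base")

    while True:
        chunk = sd.rec(int(CHUNK_SECONDS * SAMPLE_RATE),
                       samplerate=SAMPLE_RATE, channels=1, dtype="float32")
        sd.wait()                                  # block until the chunk is recorded
        audio = np.squeeze(chunk)                  # (samples, 1) -> (samples,)
        result = model.transcribe(audio, fp16=False)
        print(result["text"].strip())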
It's called hallucination. As the model is trained on weakly supervised data, such errors do occasionally happen: the model picks up that such phrases occur in translations and inserts them even if they do not appear in the source. This is described in the paper.
I came across it during a silent/instrumental portion in the song I was testing. I asked only because I am curious how frequently the error might show up, I don't expect it to be very common. It's looking at phrase level instead of word level timestamps which is going to make it hard to tokenize music. I asked simply because the parent comment also tested on Japanese.
This really makes me want to build a Amazon Echo/Google Nest/etc replacement that's open hardware, open source and most importantly recognises voice completely offline. I find that I don't use these smart devices for much more than setting timers anyway so this seems like an easy project.
I just wonder what system requirements Whisper has and whether there are open source voice recognition models that are specifically built for embedded devices.
I really want all this too. The smallest model is ~80 MB and the largest is 3 GB. Not sure about system requirements yet, but models that small suggest this may be doable locally on a single-board computer.
Edit: According to this comment[0] the base model runs in real time on an M1 CPU. The tiny model apparently decodes an audio file twice as fast. These are promising results.
For an offline (non-streaming) model, 1x realtime is actually kind of bad, because you need to wait for the audio to be available before you can start processing it. So if you wait 10 seconds for someone to finish speaking, you won't have the result until 10 seconds after that.
You could use really small chunk sizes and process them in a streaming fashion, but that would impact accuracy, as you're significantly limiting available context.
This is only one side of the coin: you still need really good models for speech synthesis, and then you need it all working in near real time, ideally locally on device.
hah, of course someone had the idea already and executed on it. But yeah, basically that but without the screen (probably would go a long way to decrease the cost, $299 is pretty steep for such a device)
One thing they don't touch much on is the STT, as they use models from third parties. You could definitely do something that utilizes this model and then feeds the tokens to some of their parsing code. I've been working on something similar to this, but burned out around adding the STT portion [0].
[0]: https://github.com/Sheepybloke2-0/trashbot - It was called trashbot because the final implementation was going to look like oscar the grouch in a trashcan displaying the reminders.
[00:00.000 --> 00:06.500] Since the last one started, the number of times I've eaten has decreased.
[00:06.500 --> 00:11.000] If I get too carried away with the last one, I'll get hungry and do it.
[00:11.000 --> 00:14.500] I don't have time to eat.
[00:15.500 --> 00:18.000] I'm going to eat now.
[00:20.000 --> 00:23.000] It's going to take about 10 minutes from here.
[00:23.000 --> 00:31.000] It's been a while since I've had my last meal.
[00:31.000 --> 00:36.000] I feel like I'm losing my女子力.
[00:36.000 --> 00:39.000] I have to go back to my original self.
[00:39.000 --> 00:44.000] I have to get ready and go to bed.
[00:44.000 --> 00:46.000] It's not good.
[00:46.000 --> 00:51.000] I've been drinking a lot lately, so I'm going home.
[00:51.000 --> 00:53.000] I have to get my nails done this fall.
[00:53.000 --> 00:54.000] Halloween nails.
[00:54.000 --> 00:57.000] Halloween, Halloween, Halloween.
[00:57.000 --> 00:59.000] I'm going to the beauty salon today.
[00:59.000 --> 01:02.000] I'm going to get my nails done the day after tomorrow.
[01:02.000 --> 01:10.000] I used to look at a lot of clothes, but I stopped looking at them.
[01:10.000 --> 01:12.000] I'm going crazy.
[01:12.000 --> 01:22.000] My stomach's stopped in the middle of summer.
It's struggling with Norwegian. Which I guess isn't shocking. The large model performs a fair bit better than the small, though neither is "good".
Though I assume the amount of Norwegian it has been exposed to is fairly limited, so in that light I'm actually impressed as well.
I tried it on a news segment from the radio[1], this is the large model output:
[00:14.000 --> 00:17.200] En skamløs krenking av FN pakten.
[00:17.200 --> 00:24.000] USAs president og verdensledere svarer på den russiske presidentens atomtrusler og krigsmobilisering.
[00:25.500 --> 00:29.400] Arbeidsklær som er ment til å være til begge kjønn, har det med å være tilpasset.
[00:29.400 --> 00:33.400] Men hvordan ville det gått, om det var motsatt?
[00:34.100 --> 00:38.900] Dyrevernsorganisasjon vil ha digital merking av regnstyr,
[00:38.900 --> 00:44.900] men næringen selv insisterer på den gamle tradisjonsrike måten med rissing av kniv.
[00:45.600 --> 00:51.400] Mange strømselskaper er positive til å tilby kundene fastpris på strøm, og det årevis.
[00:51.400 --> 00:59.900] Da risikerer de å måtte betale mye i nettopp åretsvis, sier aktører som aldri tilbyr fastpris.
[00:59.900 --> 01:21.900] Dette er onsdagens Dagsnytten. Jeg heter Espen Ås.
For reference, here's what he actually said, from the source[1] itself:
* En skamløs krenking av FN-pakten. USAs president og verdensledere svarer på den russiske presidentens atomtrusler og krigsmobilisering.
* Arbeidsklær som er ment å være til begge kjønn, er som regel tilpasset ... menn. Hvordan hadde det gått om det var motsatt?
* Dyrevernsoganisasjon vil ha digital merking av reinsdyr, men næringen selv insisterer på den gamle tradisjonsrike måten med rissing av kniv.
* Mange strømselskaper er positive til å tilby kundene fastpris på strøm - og det i årevis.
- Da risikerer de å måtte betale mye i nettopp; årevis, sier aktør som aldri tilbyr fastpris
Dette er onsdagens Dagsnytt 18 - jeg heter Espen Aas.
The translation didn't fare that well though:
[00:14.000 --> 00:17.000] A shameless violation of the UN treaty.
[00:17.000 --> 00:24.000] The US president and world leaders respond to the Russian president's nuclear threats and war mobilization.
[00:24.000 --> 00:33.000] Work clothes that are meant to be for both genders have to be suitable, but how would it be if it was the other way around?
[00:34.000 --> 00:44.000] The animal welfare organization will have a digital marking of reindeer, but the industry itself insists on the old traditional way of tearing a knife.
[00:45.000 --> 00:51.000] Many electricity companies are positive in offering customers fixed electricity prices, and that is annual.
[00:51.000 --> 00:58.000] Then they risk having to pay a lot in just a year, says an actor who has never offered fixed prices.
[00:58.000 --> 01:20.000] This is Wednesday's Dagsnytt 18. My name is Espen Ås.
For reference, here's Google Translate's attempt, which is pretty good:
* A shameless violation of the UN Charter. The US president and world leaders respond to the Russian president's nuclear threats and war mobilization.
* Work clothes intended for both sexes are usually adapted to ... men. How would it have gone if it had been the other way around?
* Animal welfare organizations want digital marking of reindeer, but the industry itself insists on the old, traditional way of marking with a knife.
* Many electricity companies are positive about offering customers a fixed price for electricity - and for years.
- Then they risk having to pay a lot in precisely; for years, says a player who never offers a fixed price
This is Wednesday's Dagsnytt 18 - my name is Espen Aas.
Re-reading the transcription, I guess I was a bit harsh by saying it's not "good". It gets most of it right, but it keeps messing up some key words. Like "regnstyr" (not a word) rather than "reinsdyr" (reindeer), or "Dagsnytten" rather than "Dagsnytt 18".
It also didn't handle the hanging "... menn", instead thinking it was the start of the following sentence. Almost everyone would understand it was the end of the sentence based on the context.
The double-A vs Å is not an issue as it's the same letter, double-A is the older form.
The small model was considerably worse than the large one though.
I am impressed; some of the words are not that common, such as atomtrusler, krigsmobilisering, strømselskaper and dyrevernsorganisasjon, yet it got them right.
Everything (and everyone, including myself :D ) seems to struggle with Norwegian; it seems the corpus size is simply too small. And/or maybe the market.
Deepl didn't do any Norwegian last I looked, even though it does most other Germanic languages (including Danish and Swedish).
Duolingo doesn't have a Norwegian class for Germans either, though they do have one with English as the source language.
How are you getting the transcription of the NRK episode? I am learning Norwegian and often struggle to find reliable transcriptions for audio where the text exactly matches the audio (often subtitles are heavily edited compared to what's actually being said)
The stuff I quoted was listed as an abstract of sorts for the episode. I know NRK is very good at providing subtitles for their TV productions, but as you say they're abbreviated.
I'm guessing maybe audio books along with the actual books would be the best source for such? I mean there's Mozilla Voice, but it's quite limited in the Norwegian department and perhaps not quite as interesting as an audio book would be.
We shouldn't call this open source. The model definition + the data is the source code. The model weights are a compilation artifact.
> The source code must be the preferred form in which a programmer would modify the program. [...] Intermediate forms such as the output of a preprocessor or translator are not allowed.
If I asked a programmer from OpenAI to modify the model to better support Japanese speakers from Hokkaido, their "preferred form" of the model's source code would include the 680,000 hours of audio used to train the model.
Yes that means that there are almost no open source models and yes it's awesome that they released this and made the weights available. Just don't call it open source.
BTW, wouldn't you take the existing model and do additional Hokkaido Japanese speaker training on top of it, rather than retraining the model from scratch?
Yes. It's just like calling the release of compiled, closed binary blobs 'open source' even when the source for reproducing the compiled output is unavailable.
> If I asked a programmer from OpenAI to modify the model to better support Japanese speakers from Hokkaido, their "preferred form" of the model's source code would include the 680,000 hours of audio used to train the model.
Precisely. These 'users' lifting the model can't do it themselves. You will still be contacting OpenAI for support or to add support for another language and they will be the ones able to modify the model.
> Just don't call it open source.
That is true, it is still closed source, and already we are seeing the hype squad apologising for OpenAI because they 'open sourced' a closed model that you can't modify yourself.
OpenAI is still business as usual and nothing has changed.
You can do a lot with weights and no training data - for example you can pull the end layer off it and use it as a feature extractor.
And to modify it for Japanese speakers you'd fine-tune the existing model on additional data. If you wanted to modify the model itself, you can (sometimes, depending on what you want to do) modify an existing architecture by removing layers, adding replacements and fine-tuning.
I don't quite know what the right analogy for trained weights is. In many ways they are more valuable than the training data because the compute needed to generate them is significant. In other ways it is nice to be able to inspect the data.
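A hedged sketch of the feature-extractor idea mentioned above: run the log-Mel spectrogram through Whisper's audio encoder and skip the text decoder entirely (helper names follow the released package; treat them as assumptions if the API has moved):

    # Use Whisper's audio encoder as a fixed feature extractor.
    import torch
    import whisper

    model = whisper.load_model("base", device="cpu")

    audio = whisper.load_audio("clip.wav")        # hypothetical file
    audio = whisper.pad_or_trim(audio)            # pad/cut to 30 seconds
    mel = whisper.log_mel_spectrogram(audio)      # (n_mels, n_frames)

    with torch.no_grad():
        features = model.encoder(mel.unsqueeze(0))   # (1, frames, d_model)

    print(features.shape)   # feed these into a small downstream classifier, etc.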
> The source code must be the preferred form in which a programmer would modify the program.
As a machine learning programmer I'd much prefer the weights to the raw data. It's not realistic for me to use that training data in any way with any compute I have access to.
Like every model I've seen there is something like this:
>>A decoder is trained to predict the corresponding text...
Prediction of expected text in the context of the previous text.
While this is valuable in casual transcription, it can be extremely dangerous in serious contexts.
From personal experience, having given a deposition with an "AI" transcription, it will literally reverse the meanings of sentences.
This is because it produces the EXPECTED output in a context, and NOT THE ACTUAL OUTPUT.
Like a speaker that clips the output, these types of systems 'clip' the really valuable information out of a transcription. Worse yet, this is a completely silent failure, as the transcript LOOKS really good.
Basic info theory shows that there is more information contained in 'surprising' chunks of data than in expected ones. These systems actively work to substitute 'expected' speech to overwrite 'surprising' speech.
The transcript I got was utter trash: multiple pages of errata I had to submit, when the norm is a couple of lines. And as I said, some literally reversed the meaning in a consequential way, and did so completely silently.
This kind of silent active failure mode is terrifying. Unless it is solved, and I see no way to solve it without removing ALL predictive algos from the system, these types of systems must not be used in any situation of serious consequence, at least not without real redundancy and backup.
I've been saying this for years. Current "AI" algorithms are fundamentally flawed because they rely on a statistical approach. This works moderately well for some use cases but it will rarely give you 100% confidence.
Good luck with self-flying planes or self-running nuclear power plants.
>>Current "AI" algorithms are fundamentally flawed because they rely on a statistical approach.
YES! The old joke about "Artificial Stupidity" is actually more true than anyone realized.
These statistical so-called-AI systems actually work to actively REMOVE or sanitize out any unexpected information, making it all conform with the EXPECTED results from the training set.
This not only REMOVES the most high-information 'surprising' or unexpected nuggets, it actively HIDES them. When something unexpected comes up, it gets force fit into the expected prediction algorithms and output as if it were good.
I'm not saying that there are no useful things that can be done with this technology — there is a LOT of mundane work out there to be done.
But, we will never get this type of "AI" saying "Huh, that's odd, I wonder why that is?", which is exactly the kind of observation that leads a prepared and fertile mind to great discoveries.
One item I remember was that I said "Dr Kemeny" in relation to Dartmouth College (he was a famous mathematician, invented the BASIC programming language and was president of the college). It replaced those instances with "Jack Kennedy".
In another instance, I said that "Evidently, you have a reading comprehension problem.". It replaced it with "Evidently, I have a ...", completely reversing the meaning.
There were zero problems with the microphones or audio, and it was not rushed or mumbled talk. There were 80+ other examples over a few hours of talking, and some from other speakers. And those were just the obvious ones I could catch.
Another massive problem with this technology is that a human stenographer can notice when s/he missed something and didn't hear and ask the speaker to repeat or clarify what was said, and will often during a pause request clarification on spelling of names, addresses, etc. In contrast, this "AI" technology just barges ahead ASSuming that it knows what it is doing and inserts literally whatever sounds good in the transcript, completely silent that it doesn't have a clue.
Having seen this up close, I'm of the strong opinion that anyone foisting this software on the market without huge warnings that it is not usable for any critical function is basically a fraud. They know, or certainly should know, that these failures not only exist but are common and systemic, yet they barge along like it is OK. It is not.
Can this be used as a real-time transcription or is it too slow for that?
Curious what anyone is using these days for a real-time transcription. It doesn't have to be perfect, but just good enough.
My kids watch some YouTube videos where people make a mod that converts their speech to text, then looks for keywords and spawns a boss in Terraria if you say the wrong keyword, etc.
I made a clone of that with the .NET System.Speech.Recognition library. It... works... but my biggest problems are that #1 it waits until you are done speaking before converting to text in the callback, so there was too much of a delay for it to be fun (the point is that it should be checking a stream of chatter), and #2 the recognition is pretty crap; I mean it's nearly good enough for my silly purpose, but it's still pretty bad.
It might require too much work for what you are looking for, but the wav2letter library is the best real-time transcription OSS I have found by a considerable margin.
If your family uses Apple devices, Apple offers free on-device speech recognition. Only caveat is that it needs to be restarted every minute due to whatever stupid limitation (or bug) they've introduced.
The base model seems to run faster than real time on my machine. The “medium” model is larger and runs more slowly - roughly real time or maybe slightly slower.
That example at the top of the page (speed talking) blew me away. He started talking, I was stunned for a minute, then realised yes, it really was English, and I just burst out laughing.
That's so, so far beyond the previous state-of-the-art, it's absurd.
I did! There are a few places it transcribes incorrectly, but overall I'm very impressed. Here's the first ~30 seconds:
[00:00.000 --> 00:09.000] Look, I was going to go easy on you, not to hurt your feelings, but I'm only going to get this one chance.
[00:09.000 --> 00:11.000] Something's wrong, I can feel it.
[00:11.000 --> 00:17.000] It's just a feeling I've got, like something's about to happen, but I don't know what.
[00:17.000 --> 00:21.000] If that means what I think it means, we're in trouble, big trouble.
[00:21.000 --> 00:24.000] Had to be as bananas as you say, I'm not taking any chances.
[00:24.000 --> 00:26.000] You're just one to die for.
[00:26.000 --> 00:32.000] I'm beginning to feel like a rap god, rap god. All my people from the front to the back nod, back nod.
It was doing it slowly, but hadn't got to the insane bit when I killed it to try and get it working with CUDA. I had to do some digging, and it turns out I need a version of pytorch with CUDA enabled, so I had to go and install Anaconda, and now conda is stuck trying to "solve" my environment to install pytorch with CUDA.
So...probably?
Pre-post edit: I can't get it to work.
I've installed pytorch with CUDA via pip3 and installed the NVIDIA toolkit, but it doesn't see it:
I've wasted like an hour and a half on it now. I'm not a python dev, and don't have any ML experience so this was just for fun and now it's not anymore.
Welcome to every single Python ML project - dependency hell will quickly kill any enthusiasm one may have for trying out projects. It really feels archaic to have these issues with such cutting edge technology.
CUDA is not the problem, the problem is crappy code being released on Github where basic things like requirements.txt are missing, never mind an earnest attempt to provide details about the environment that the code was running on. This is on top of code that has lots of hard-coded references to files and directories, plus also many python libraries just breaking compatibility with each other on point releases.
I can't find a source now, but I remember reading some code where the maintainer had to change a huge chunk of code because the point change for a dependency library literally flipped either how the library handled height/width or BGR channels (I can't remember which one but it was preposterous) from the 2.5.4 to the 2.5.5 version. There is no reason for doing that - it breaks everything just for grins and giggles.
Python itself is also a problem, but that's a rant for another day. Ah, how I wish Ruby had become the de facto language of choice for ML/Deep Learning!
How is it Apple, Google, or Microsoft are not further ahead of the game on speech recognition like this? They have the resources to hire the best ML researchers and throw tons of computing hours at it, yet Siri, Google, and Cortana continue to struggle to get anywhere near this level of comprehension.
Siri and Cortana have to run at least in real time, with reasonable compute resources. Probably faster than real time when the audio gets shipped off to the cloud and transcribed there. This model can't do that (in the "large" version, which the examples use).
Also, you are comparing Whisper's highlight reel with everyday performance of other models. Nobody shows their weaknesses in their highlight reel.
Someone else in this thread[0] said Whisper was running at 17x real time for them. So, even a weak machine might be able to do an acceptable approximation of real time with Whisper.
Also, I feel like shipping to the cloud and back has been shown to be just as fast as on device transcription in a lot of scenarios. Doing it on device is primarily a benefit for privacy and offline, not necessarily latency. (Although, increasingly powerful smartphone hardware is starting to give the latency edge to local processing.)
Siri's dictation has had such terrible accuracy for me (an American English speaker without a particularly strong regional accent) and everyone else I know for so many years that it is just a joke in my family. Google and Microsoft have much higher accuracy in their models. The bar is so low for Siri that I automatically wonder how much Whisper is beating Siri in accuracy... because I assume it has to be better than that.
I really wish there was an easy demo for Whisper that I could try out.
“CPU” isn’t necessarily the benchmark, though. Most smartphones going back years have ML inference accelerators built in, and both Intel and AMD are starting to build in instructions to accelerate inference. Apple’s M1 and M2 have the same inference accelerator hardware as their phones and tablets. The question is whether this model is a good fit for those inference accelerators, and how well it works there, or how well it works running on the integrated GPUs these devices all have.
Brute forcing the model with just traditional CPU instructions is fine, but… obviously going to be pretty slow.
I have no experience on the accuracy of Talon, but I’ve heard that most open source models are basically overfit to the test datasets… so their posted accuracy is often misleading. If Whisper is substantially better in the real world, that’s the important thing, but I have no idea if that’s the case.
Ok, my test harness is ready. My A40 box will be busy until later tonight, but on an NVIDIA A2 [1], this is the batchsize=1 throughput I'm seeing. Common Voice, default Whisper settings, card is staying at 97-100% utilization:
tiny.en: ~18 sec/sec
base.en: ~14 sec/sec
small.en: ~6 sec/sec
medium.en: ~2.2 sec/sec
large: ~1.0 sec/sec (fairly wide variance when ramping up as this is slow to process individual clips)
Isn’t the A2 much weaker than a 3090? So those results are promising.
EDIT: for what it's worth, Nvidia rated the A2 at 18 TFLOPS of FP16, and Apple rates the current A16 Neural Engine at 17 TFLOPS of FP16. I'm sure it's not an "apples to apples" comparison.
If you count the GPU component and memory bandwidth, the Apple M2 is slightly weaker on paper for 16-bit inference than the NVIDIA A2, if you manage to use the whole chip efficiently. The A16 is then slightly weaker than the M2.
Sure, the Whisper Tiny model is probably going to be fast enough, but from my preliminary results I'm not sure it will be any better than other models that are much much faster at this power class.
Whisper Large looks pretty cool, but it seems much harder to run in any meaningful realtime fashion. It's likely pretty useful for batch transcription though.
Even if you hit a realtime factor of 1x, the model can leverage up to 30 seconds of future audio context. So at 1x, if you speak for 10 seconds, you'll potentially need to wait another 10 seconds to use the result. This kind of latency is generally unsatisfying.
EDIT: After writing and posting the original version of this comment, I did an experiment where I dictated it to Siri, and then saved that audio (which was recorded simultaneously), which I then fed to both Whisper's tiny.en and medium.en... Siri did terrible for me. Whisper tiny.en was 100% accurate, as far as I can tell, and the only thing Whisper medium.en did was add a few commas that tiny.en had missed. I actually ended up playing the audio file for Siri as well, and that did not end well either. YMMV, but even the tiny model seems very useful. tiny.en took 17.5 seconds to process the ~1 minute audio file, and medium.en took 351 seconds, but I think there is a lot of room for performance optimization on this M2 MBA. The model evaluation was purely using the CPU, not GPU or neural engine, and it wasn't even using all of the CPU cores for whatever reason.
----
With Siri dictation, I feel like I usually spend at least as much time correcting its mistakes as I do speaking the dictation itself. In some cases, that is still faster/easier than typing, but I would rather have a voice model that can work in about the same total amount of time without requiring constant corrections. If I speak for 30 seconds, then I can do other things for 30 seconds while my phone processes it… that might actually be preferable if it gets it right. Otherwise, I’ll be spending 30 seconds actively editing it anyways. Even an improvement on the number of edits required per dictation would be nice. Admittedly, I feel like Google and Microsoft already do a much better job here.
It could be interesting to use the tiny model to give a preview of the writing while the large model is taking its time, and then allow the user to tap on words that changed to see the predictions from the tiny model and correct back to them if they want. I was doing some experiments a few minutes ago, and on one audio clip, the tiny model wrote down a very literal interpretation of an uncommon sci-fi word, and that was more accurate than either the medium or the large models. The rest of the time, the larger models did better, as expected.
But, I don't know. This is interesting to me, but I agree there could be issues with making it workable for real-time transcription.
See https://news.ycombinator.com/item?id=32929029 re accuracy, I'm working on a wider comparison. My models are generally more robust than open-source models such as Vosk and Silero, but I'm definitely interested in how my stuff compares to Whisper on difficult held-out data.
> Brute forcing the model with just traditional CPU instructions is fine, but… obviously going to be pretty slow.
It's not that simple. Many of the mobile ML accelerators are more targeted for conv net image workloads, and current-gen Intel and Apple CPUs have dedicated hardware to accelerate matrix math (which helps quite a bit here, and these instructions were in use in my tests).
Also, not sure which model they were using at 17x realtime on the 3090. (If it's one of the smaller models, that bodes even worse for non-3090 performance.) The 3090 is one of the fastest ML inference chips in the world, so it doesn't necessarily set realistic expectations.
There are also plenty of optimizations that aren't applied to the code we're testing, but I think it's fairly safe to say the Large model is likely to be slow on anything but a desktop-gpu-class accelerator just due to the sheer parameter size.
Good point about realtime or not, however with ML I have found the weaknesses get addressed pretty fast by someone. There is a big step between proof of concept and practical application though, so we shall see.
This AI has a 30 second delay on the audio processing because it needs to be able to "look into the future" to get these good results. That 30s delay would be unacceptable for Siri/Google/Cortana.
A lot of models we currently use seem to do the same thing. The model will transcribe a "best effort" interpretation in real time, then as you can continue speaking, you'll see it go back and make corrections. I'm sure you can feed the first X seconds you have into the model, followed by (30-X) seconds of silence, and it will do real time transcription just fine... it would be weird if this broke anything. Then, as you get more speech, you continue getting better transcription of the first 30 seconds, then you switch to a 30 second sliding window.
Maybe I'm missing something, but I don't see the problem here.
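To make that concrete, a rough sketch of the silence-padding idea using the helpers the repo ships; the naive chunking loop is my own, not something Whisper provides:

```python
import numpy as np
import whisper

model = whisper.load_model("base.en")
# fp16=False keeps this runnable on CPU; drop timestamps for a live preview.
options = whisper.DecodingOptions(fp16=False, without_timestamps=True)
SAMPLE_RATE = whisper.audio.SAMPLE_RATE  # 16 kHz

def transcribe_so_far(samples: np.ndarray) -> str:
    """Decode whatever audio has arrived so far; pad_or_trim fills the rest
    of the 30-second window with silence (or keeps only the last 30 s)."""
    window = samples[-30 * SAMPLE_RATE:]
    window = whisper.pad_or_trim(window.astype(np.float32))
    mel = whisper.log_mel_spectrogram(window).to(model.device)
    return whisper.decode(model, mel, options).text

# `chunks` would be a stream of float32 buffers coming off a microphone:
# buffer = np.zeros(0, dtype=np.float32)
# for chunk in chunks:
#     buffer = np.concatenate([buffer, chunk])
#     print(transcribe_so_far(buffer))   # re-decode the sliding window
```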
Yes, that's because Whisper - like pretty much all of them - uses a Transformer encoder with Attention layers. And the Attention layers learn to look into the future.
And yes, what you describe could be done. But no, it won't reduce latency that much, because the model itself learns to delay the prediction w.r.t. the audio stream. That's why ASR-generated subtitles usually need to be re-aligned after the speech recognition step. And that's why there is research such as the FastEmit paper to prevent that, but then it is a trade-off between latency and quality again.
Also, running your "low-latency" model with 1s chunks means you now need to evaluate the AI 30x as often as if you'd be using 30s chunks.
You just said the models pretty much all work the same way, then you said doing what I described won't help. I'm confused. Apple and Google both offer real time, on device transcription these days, so something clearly works. And if you say the models already all do this, then running it 30x as often isn't a problem anyways, since again... people are used to that.
I doubt people run online transcription for long periods of time on their phone very often, so the battery impact is irrelevant, and the model is ideally running (mostly) on a low power, high performance inference accelerator anyways, which is common to many SoCs these days.
I meant that most research that has been released in papers or code recently uses the same architecture. But all of those research papers use something different than Apple and Google.
As for running the AI 30x, on current hardware that'll make it slower than realtime. Plus all of those 1GB+ models won't fit into a phone anyway.
> Plus all of those 1GB+ models won't fit into a phone anyway.
I don't think that's a requirement here. I've been playing with Whisper tonight, and even the tiny model drastically outperformed Siri dictation for me in my testing. YMMV, of course.
I tried feeding the four examples from this announcement into Google as dictation inputs and it just sits there blankly. On the JFK speech test file in the repo, Google understands perfectly. The samples in the announcement are clearly outside the capabilities of anything Google has launched publicly, but I don't know how that translates to overall utility in every day applications.
My experience with the APIs is Google is excellent and Microsoft is slightly better. And the offline model I've been using that's nearly as good as both is facebook's wav2vec2-large-960h-lv60-self.
Don't believe what's on marketing pages, they rarely transfer to the real world. Will have to make time to try it and see. In theory, given task diversity and sheer number of hours, it should be a lot more robust but will wait on evidence before believing any claims on SoTA.
Okay this is super impressive. I just downloaded Whisper and fed it a random flac file I had handy and it did a really good job. Also impressive that it works on my weak CPU:
A 3m07s flac took 5m to transcribe:
$ whisper --device cpu 'BLACKPINK - BORN PINK/01 Pink Venom.flac'
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: korean
[00:00.000 --> 00:10.000] Blackpink
[00:11.000 --> 00:14.000] Kick in the door, wave in the coco
[00:14.000 --> 00:16.000] 팝콘이는 친게 껴들 생각 말고
[00:16.000 --> 00:19.000] I talk to talk, run ways I walk walk
[00:19.000 --> 00:21.000] 힘 감고 팝 팝 안 봐도 척
[00:21.000 --> 00:24.000] By one and two by two
[00:24.000 --> 00:26.000] 내 손끝 두 하나에 타면 아지은 중
[00:26.000 --> 00:30.000] 갓 자쇼 지금 화려해 T makes no sense
[00:30.000 --> 00:32.000] You couldn't get a dollar out of me
[00:33.000 --> 00:38.000] 자 오늘 밤이야 눈톱을 품고
[00:38.000 --> 00:41.000] 미혼을 뺏음 down
[00:41.000 --> 00:43.000] Look what you made us do
[00:43.000 --> 00:47.000] 천천히 널 잠재울 파이어
[00:48.000 --> 00:52.000] 잠이 날 만큼 아름다워
[00:52.000 --> 00:53.000] I bring the pain like
[00:53.000 --> 00:57.000] 디스탑, 팽팽, 디스탑, 팽팽, 디스탑, 팽팽, 팽팽
[00:57.000 --> 00:58.000] Get em, get em, get em
[00:58.000 --> 01:00.000] Straight till you don't like
[01:00.000 --> 01:01.000] Whoa, whoa, whoa
[01:01.000 --> 01:03.000] Straight till you don't like
[01:03.000 --> 01:04.000] Ah, ah, ah
[01:04.000 --> 01:05.000] Taste that, pink venom
[01:05.000 --> 01:06.000] Taste that, pink venom
[01:06.000 --> 01:08.000] Taste that, pink venom
[01:08.000 --> 01:09.000] Get em, get em, get em
[01:09.000 --> 01:11.000] Straight till you don't like
[01:11.000 --> 01:12.000] Whoa, whoa, whoa
[01:12.000 --> 01:13.000] Straight till you don't like
[01:13.000 --> 01:14.000] Ah, ah, ah
[01:14.000 --> 01:15.000] Blackpink and Amo
[01:15.000 --> 01:17.000] Got it by the smack ram
[01:17.000 --> 01:18.000] But rest in peace
[01:18.000 --> 01:19.000] Please light up a candle
[01:19.000 --> 01:20.000] This the knife of a vando
[01:20.000 --> 01:22.000] Messed up and I'm still in saline
…SNIP…
I suspect this is coming. I mean, we do have decent text-to-speech systems already, but in this vein of "we used neural networks and now it's very, very good" you can imagine that to extend something like GPT-3 they could use this speech-to-text system so you could speak to it for input, and then a natural progression is to use text-to-speech to return the output, so you end up with a voice-oriented conversational system.
So I think TTS is a logical part of the system. I also think that there are peculiarities of voice interaction that aren’t captured in text training datasets, so they would need to do some fine tuning on actual voice conversation to make it feel natural.
A full NLP system would include speech recognition, TTS, a large language model, and a vector search engine. The LM should be multi modal, multi language and multi task, "multi-multi-model" for short haha. I'm wondering when we'll have this stack as default on all OSes. We want to be able to search, transcribe, generate speech, run NLP tasks on the language model and integrate with external APIs by intent detection.
On the search part there are lots of vector search companies - Weaviate, Deepset Haystack, Milvus, Pinecone, Vespa, Vald, GSI and Qdrant. But it has not become generally deployed on most systems, people are just finding out about the new search system. Large language models are still difficult to run locally. And all these models would require plenty of RAM and GPU. So the entry barrier is still high.
Ah very interesting thank you. I’m not familiar with research in to vector search, I’ll look that up.
But yeah you make a good point about LLMs being too large to run on a normal PC. I do somewhat suspect that we might see some rapid acceleration in the size of neural network processors as large models begin to offer more utility. I think for now they have limited appeal but we’re already seeing things like Tesla’s Dojo make large leaps in capability to rapidly process complex networks.
In five to ten years we may see built in accelerators come standard in most computers capable of running very complex models. Already Apple provides ever more powerful accelerators in their phones. You could imagine Adobe offering real time diffusion models as part of Photoshop, among other things.
Likewise, TTS is what I really want. My goal is to be able to create audio books from text. I've been using Amazon Polly and it's acceptable quality, but I would be ecstatic to be able to do it locally on my own hardware.
Check out NaturalReader. It has hundreds of amazing voices, a system for highlighting text as it is being read, works on books (pdf) and webpages, and is available on phones and in browsers on all platforms. So I could have the same voice on Mac, Linux and iPhone.
> About a third of Whisper’s audio dataset is non-English, and it is alternately given the task of transcribing in the original language or translating to English. We find this approach is particularly effective at learning speech to text translation and outperforms the supervised SOTA on CoVoST2 to English translation zero-shot.
That's intriguing. You can just set the model to transcribe everything into English, no matter which language the speaker is using, and it just works. Given that many people are much better at understanding English than at speaking it, this might make voice interfaces much more accessible without much work.
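A minimal sketch of what that looks like with the released package (the file name and model size here are just placeholders):

```python
import whisper

# Any multilingual checkpoint works here (i.e. not one of the ".en" models).
model = whisper.load_model("medium")

# task="translate" asks the decoder for English output regardless of the
# spoken language; the language is auto-detected if you don't specify it.
result = model.transcribe("voice_command.ogg", task="translate")
print(result["text"])
```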
Naively, training the same model on multiple languages has interesting implications.
On one hand, it may capture something "deeper" about language.
On the other hand, it's likely to do great in general, but miss particularities of some language.
Understanding the coverage of the training corpus seems a perennial problem. Is there any (shorthand) way to compare language model training corpora?
Clearly if they use common subsets we have a literal comparison. I'm more interested in whether there's progress in characterizing corpora by speech styles, fluency, vocabulary sets, (noise) environment, emotionality, proposition types, etc.
(btw: 25 minutes for a 9-minute segment on a 12-thread x86. Lots of jargon spelled as it sounds. Sentences capitalized but no punctuation. Overall good.)
I just tested the model [1] using an RTX3090, trying to translate a french text I found here [2].
Some observations:
- The full translation of the 6:22 minute video takes about 22 seconds (17x real time)
- It recognizes the language by default (and did a good job to recognize it was french audio)
- MIT License [3]!
- The quality of the transcription is good, but not perfect.
- The quality of the translation (if you don't consider transcription errors as a translation error) is generally very good.
---
The transcription:
> Bonjour à tous, <error>j'suis</error> espère que vous allez bien, c'est ENTI. Et aujourd', <error>aujourd',</error> on se retrouve <error>un peu physique</error> pour parler de la termo dynamique. Vous ne vous inquiétez pas, ça va bien se passer. On va y aller ensemble, <error>être à par exemple,</error> je vous accompagne à travers une série de vidéos pour vous expliquer les principes de base en termo dynamique. Et bah, c'est parti, on va y aller tranquillement. L'idée, c'est vous puissiez comprendre la termo dynamique dans son ensemble. Donc, je vais vraiment prendre mon temps pour <error>couplisser</error> bien comprendre les notions,
The translation:
> Hello everyone, I hope you're doing well, it's NT and today we find ourselves a little physical to talk about the thermo dynamic. Don't worry, it's going well, we're going to go together and be the same. I'm going to accompany you through a series of videos to explain the basic principles in thermo dynamic. Well, let's go, <error>we're going to go quietly</error>. The idea is that you can understand the thermo dynamic <error>in sound together</error>. So I'm really going to take my time to understand the notions,
---
All in all very happy that OpenAI is publishing their models. If Stable Diffusion is any guide, people will hack some crazy things with this.
It also runs well on a CPU and seems to have proper memory management. Wonderful timing, because I was using DeepSpeech for some audio recordings and it required me to script up a splitter to convert the files to .wav and then cut them into snippets of 10 seconds each. Everything about this just works out of the box. On a Core i5 I'm getting about 30 seconds of audio transcribed per minute. Transcriptionist jobs just turned into editor jobs. I love how it drops the disfluencies in the audio as well, because it was trained on transcription work, and that is one of the first things you learn to do (drop the uhs and ums and huhs etc., unless it is a strict verbatim transcription).
That's hilarious and honestly, incredibly bad. "Dans son ensemble" is a very common idiom (meaning "as a whole") while "in sound together" has to be pretty rare. "Son" means "his/hers/its" as well as "sound", and the former meaning is probably more common in general so I have no idea how this result could arise.
"Termo" also doesn't exist in French, it's "thermo", so the transcript even makes orthographic errors.
And I forgot about "couplisser", which is also a hilarious made-up word that sounds like it could mean something, but doesn't! Edit: Google finds exactly one reference to this, in a patent with a typo on the word "coulisser".
I'm still impressed by the transcript quality since it covers many languages, but the translation part is quite poor.
Was this with the `base` model? `large` is running ok on a P100 in colab, but is about 4% the speed of `base.en`. Certainly seems like some of these models will be fast enough for real-time.
I installed Whisper (and, I thought all the needed dependencies), and had it running on my M1 Max MacBook Pro with 64 GB ram, but it ran TERRIBLY slowly... taking an hour to do a couple of minutes...
I found this thread and wondered if Whisper was accessing all the cores or the gpu, so I've spent a couple of hours trying to get whisper to access the gpu - following the points made in this thread, and googling how to install via brew the various components.
Long story short, I keep getting an error message
"RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU."
or when I set --device to gpu, I get the error:
"RuntimeError: don't know how to restore data location of torch.storage._UntypedStorage (tagged with gpu)"
it's been a looong time since I wrote any code (remember BASIC?), so I realise I may be missing a lot here!!
does anyone have any pointers?
thanks!
edit: I'm now trying it one more time after trying to set the device using this line:
map_location=torch.device('gpu')
and I get this message as whisper begins:
~/opt/anaconda3/lib/python3.9/site-packages/whisper/transcribe.py:78: UserWarning: FP16 is not supported on CPU; using FP32 instead
warnings.warn("FP16 is not supported on CPU; using FP32 instead")
then I wait for whisper to do its magic ...tho it looks like it will remain very slow...
Really interesting, I can see ton of potential uses.
2 questions:
1) how does it compare to state-of-the-art FOSS solutions? I'm thinking of DeepSpeech or Vosk
2) would it be somehow possible to associate timestamps with the recognized words? That would be amazing for things such as audio editing or skipping to a particular location in a video
You rightly mentioned timestamps. There are many other important properties of a good ASR system, like vocabulary adaptability (whether you can introduce new words), streaming, confidences, or latency of the output. Compared to Vosk models, this model cannot work in a streaming manner, so it is not very suitable for real-time applications.
But in general the model is robust and accurate and trained on the amount of speech we never dreamed about in Vosk. We will certainly benefit from this model as a teacher (together with others like gigaspeech models). I recently wrote about it https://alphacephei.com/nsh/2022/06/14/voting.html
for 2), it's actually written in the description: "phrase-level timestamps", so it should be possible (phrase level is neat for skipping to a specific location in a video, but maybe not for audio editing).
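A minimal sketch of pulling those timestamps out of the Python API (the file name and formatting here are mine, not from the repo):

```python
import whisper

model = whisper.load_model("base")
result = model.transcribe("episode.mp3")

# Each segment has start/end times in seconds plus the text, which is
# enough to build a clickable transcript or jump to a spot in a video.
for seg in result["segments"]:
    print(f"[{seg['start']:8.2f} --> {seg['end']:8.2f}] {seg['text'].strip()}")
```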
Really incredible to see that their multilingual audio-to-English approach is viable. I'm super excited about this, and great to see OpenAI actually open up about something, for once.
Skimming the codebase I can't immediately see code to do additional training.
Being able to fine-tune the model to a specific language or case (eg. teach it specifically about some technical topic that might not be so prevalent in the current train set) would be majorly disruptive to current SOTA in "callcenter analytics" tech. Especially when combining Whisper with GPT3.
I knew there was a reason why I kept my MP3 library even after subscribing to Spotify. Now piping everything through whisper. So far the generated lyrics are reasonable, though it thinks the REM song says "Linnie Bruce is not afraid."
Would also like to know this. It looks like they're processing the audio file in 30 second chunks, so a naive approach of keeping a buffer of 30-second input stream chunks and just continually writing to an output .mp3 could work...
The model output can be tweaked to produce audio embeddings (akin to BERT for text embeddings and CLIP for image embeddings), which can lead to some interesting applications as the previous two examples have demonstrated.
Represent a given set of audio inputs as a numeric vector, which can then for example be finetuned for other ML/AI problems or placed in an embeddings database for easy ANN search with similar audio clips. In the extreme case it could facilitate better AI audio generation similar to how CLIP can guide a VQGAN.
Although the 30 second minimum input is a bit of a bummer since it may not allow much granularity in the resulting embeddings.
Ran it on Juicy by The Notorious B.I.G. and the results were considerably worse than for the mix of prog-rock and British Invasion music I had tried before, though at least some of that is due to the number of proper nouns in that song.
It took about 1000 CPU-minutes for this 5 minute song on my Ryzen 2700 with 12 OpenMP threads (about 100 minutes wall-clock).
whisper never-gonna-give-you-up.mp3 --language English --model small
[00:00.000 --> 00:27.000] We're no strangers to love You know the rules and so do I
[00:27.000 --> 00:35.000] I feel commitments while I'm thinking of You wouldn't get this from any other guy
[00:35.000 --> 00:43.000] I just wanna tell you how I'm feeling Gotta make you understand
[00:43.000 --> 00:47.000] Never gonna give you up Never gonna let you down
[00:47.000 --> 00:53.000] Never gonna run around and desert you Never gonna make you cry
[01:00.000 --> 01:09.000] We've known each other for so long Your heart's been aching but you're too shy to say
[01:09.000 --> 01:17.000] Inside we both know what's been going on We know the game and we're gonna play it
It was running for quite a long time (20 minutes) on my admittedly low-budget specs.
Note that I did not omit 00:53.000 -> 01:00.000.
Shouldn't there be some type of unintelligible warning since it wasn't able to transcribe that part?
Model small is about as good at recognizing lyrics as an untrained Newton was at recognizing handwriting.
Here's a comparison of Basket Case by Green Day:
Small:
[00:00.000 --> 00:05.000] Do you have the time to listen to me whine
[00:05.000 --> 00:10.000] About nothing and everything I'll have once?
[00:11.000 --> 00:16.000] I am one of those melodramatic fools
[00:16.000 --> 00:20.000] Neurotic to the bone, no doubt about it
[00:23.000 --> 00:27.000] Sometimes I give myself the creeps
[00:27.000 --> 00:32.000] Sometimes my mind plays tricks on me
[00:32.000 --> 00:38.000] It all keeps headed up, I think I'm pregnant
[00:38.000 --> 00:43.000] And I'm just paranoid, I'm just stuck
[00:47.000 --> 00:52.000] I went to a shrink to have a life like my dreams
[00:52.000 --> 00:57.000] She says it's like a sex that's bringing me down
[00:57.000 --> 01:03.000] I went to a whore, he said my life's a bore
[01:03.000 --> 01:08.000] Choked with my widest buzz that's bringing her down
[01:10.000 --> 01:14.000] Sometimes I give myself the creeps
[01:15.000 --> 01:19.000] Sometimes my mind plays tricks on me
[01:19.000 --> 01:25.000] It all keeps headed up, I think I'm pregnant
[01:25.000 --> 01:30.000] And I'm just paranoid, I'm just stuck
[01:30.000 --> 01:48.000] Grasping to control, it's all I better hold on
[02:08.000 --> 02:12.000] Sometimes I give myself the creeps
[02:13.000 --> 02:17.000] Sometimes my mind plays tricks on me
[02:18.000 --> 02:23.000] It all keeps headed up, I think I'm pregnant
[02:23.000 --> 02:30.000] And I'm just paranoid, I'm just stuck
[02:53.000 --> 03:13.000] Thanks for watching!
Medium:
[00:00.000 --> 00:05.000] Do you have the time to listen to me whine
[00:05.000 --> 00:10.000] About nothing and everything all at once?
[00:11.000 --> 00:16.000] I am one of those melodramatic fools
[00:16.000 --> 00:20.000] Neurotic to the bone, no doubt about it
[00:23.000 --> 00:27.000] Sometimes I give myself the creeps
[00:27.000 --> 00:32.000] Sometimes my mind plays tricks on me
[00:33.000 --> 00:36.000] It all keeps adding up
[00:36.000 --> 00:39.000] I think I'm cracking up
[00:39.000 --> 00:41.000] Am I just paranoid?
[00:41.000 --> 00:43.000] Am I just sad?
[00:47.000 --> 00:50.000] I went to a shrink
[00:50.000 --> 00:53.000] To analyze my dreams
[00:53.000 --> 00:58.000] She says it's lack of sex that's bringing me down
[00:58.000 --> 01:01.000] I went to a whore
[01:01.000 --> 01:04.000] He said my life's a bore
[01:04.000 --> 01:09.000] So quit my whining cause it's bringing her down
[01:10.000 --> 01:14.000] Sometimes I give myself the creeps
[01:16.000 --> 01:20.000] Sometimes my mind plays tricks on me
[01:20.000 --> 01:23.000] It all keeps adding up
[01:23.000 --> 01:26.000] I think I'm cracking up
[01:26.000 --> 01:28.000] Am I just paranoid?
[01:28.000 --> 01:30.000] Am I just sad?
[01:40.000 --> 01:44.000] Grasping to control
[01:44.000 --> 01:50.000] So I better hold on
[02:07.000 --> 02:11.000] Sometimes I give myself the creeps
[02:11.000 --> 02:16.000] Sometimes my mind plays tricks on me
[02:16.000 --> 02:19.000] It all keeps adding up
[02:19.000 --> 02:22.000] I think I'm cracking up
[02:22.000 --> 02:24.000] Am I just paranoid?
[02:24.000 --> 02:52.000] Am I just sad?
[02:54.000 --> 02:58.000] Thanks for watching!
Large was not obviously better than medium when I tried it. My impression was that it tended to fit more to a language model than the sounds heard, which corrected some errors and introduced some others, but I didn't try a lot of songs because large won't run on my GPU.
I am one of the top contributors to the tiny Mozilla Common Voice data-set for my language. The data-set is very small compared to those for other popular languages, and none of the other mentioned data-sets contribute anything in that language to training Whisper's model.
And even with so little data to train on it still works surprisingly well.
Their models range from 70 MB to 3 GB. The largest model is smaller than the optimised Stable Diffusion. Not sure what the inference speed is like, haven't tried it myself yet.
For those on NixOS, here's a quick and dirty flake.nix that will let you make a venv in which to "pip install".
Just put it in a flake.nix, and "nix develop" followed by "virtualenv ./venv; . ./venv/bin/activate; pip install git+https://github.com/openai/whisper.git"
This should, in theory, work with CUDA; my GPU doesn't have enough RAM to do it (it runs out at 2.9GiB allocated, I have 4GiB, but am running a compositing desktop, which chews up about 600MiB; not sure where the other ~400MiB went)
[edit]
I confirmed CUDA worked with the "small" model, which used 3.3GB of GPU ram, and resulted in much poorer recognition than the "medium" model on my CPU (but it ran at least two orders of magnitude faster).
CUDA worked fine with large on my 2080Ti FWIW. The speedup is ridiculous, as expected. My Ryzen 3800X used almost an hour transcribing a minute worth of speech, while the 2080Ti does it in like 10-20 seconds.
I'm on Windows, using Task Manager, the dedicated GPU memory went from 1GB before run to about 9.8GB for the most time during run, peaking at 10.2GB. So pretty close to the 11GB limit of my 2080Ti it seems.
I want to build a tool that takes a video and generates subtitles for it, then I want to index the subtitles and let people search for a specific quote to scrub to that part of the video using automatically generated urls.
This is for a specific fandom of a ton of content, lots of dirty audio mostly recorded in a gym setting with multiple people speaking.
I've never seen transcription and translation combined into a single step like this before...
Have I been living under a rock, or is this new?
I assume it should help performance, because it means emphasis, timing and tone can be used to inform the translation. Helps make better guesses about information missing from the source language.
I'm not in the Speech Recognition circles and am looking for open source speech recognition I can play around with - would this be the new state of the art?
For me as a deaf person the current state of art (in terms of speed & usability) is the Recorder app on a Google Pixel phone (4a/6 Pro is what I've used)
I've tried speaking to that demo several times... I used the built in feature to record from microphone, and I played back the samples to make sure they were audible and clear.
Sometimes it outputs the words "thank you" (which I did not say), sometimes it outputs a period. It never once output anything I said. It seems completely broken.
EDIT: apparently something about the combination of Safari+HF+Whisper was not working. I tried another Whisper demo on HF and had the same results. Switching to Chrome made it work flawlessly... I have no idea what kind of codec incompatibility was happening.
Given this, are there good (and available/open source) models for text to speech? Last time I tried everything still sounded extremely robotic, and/or were a pain to set up and run. It would be fun to set up a pipeline where the two processes 'communicate'.
This is so cool! I was just speaking to a non-technical family member about privacy concerns around using "OK Google" and the like. They responded inquiring about "private" alternatives, to which my answer was "I'm not aware of good ones that give you that level of accuracy and convenience."
Perhaps this development along with continued optimization and device compute power increases will lead us into a near-future where things like Mycroft devices and cellphones could have local-only speech-to-text and translation capabilities which are accurate even with environmental background noise variations encountered IRL.
Any opinions on what this means for speech-to-text companies like rev.ai and assembly.ai?
We've tested open source solutions for S2T, like Kaldi, but the quality was not good enough. However, one of the main advantages of a service like assembly.ai to me was that they offer sentence splitting in the form of punctuation, plus speaker detection, which Kaldi does not.
So I guess I answered my own question to some degree: an S2T service is more than just S2T. We already see assembly.ai add more and more features (like summarisation, PII redaction, etc.) that are a value-add to plain S2T.
You can apply the public punctuation model from Vosk on top of Kaldi output, and you can also get speaker labels with existing open source software.
On a quick video transcription test, this model is more accurate than AssemblyAI and Rev AI. It will be harder for them to sell pure ASR now. More business-oriented applications will still be important though, for example ASR as part of a callcenter analytics solution or as part of a medical ERP system.
The value of automatic summarization is small, without AI it is very hard to make it right, you need to be an expert in the field to understand what is important.
> you can also get speaker labels with existing open source software.
Hello Nickolay :)
Diarization has always been the hard part for me, especially since it is very difficult to do comparisons within your domain. The evaluation metrics are not descriptive enough imo.
Would you say Titanet or EcapaTDNN are decent for use in production alongside, say, Whisper, or any other ASR output, if given the timestamps, so as to bypass running VAD? I'm just about to run experiments to try pyannote's diarization model and google's uis-rnn to test out how well they work, but it's a tad beyond my ability to evaluate.
I also wonder if Whisper architecture would be good for generating embeddings, but I feel it's focused so much on what is said rather than how it's said that it might not transfer over well to speaker tasks.
Rev AI will also create a transcription separated by multiple speakers, which it doesn't appear Whisper can do (yet). I expect that Whisper will overtake the alternatives soon, given that it's open source, but today it's not there yet.
This is awesome to see! Our team at Shipyard [1] has been creating a lot of solution videos on YouTube recently to show teams how they can build A -> B solutions in a few minutes. We've been meaning to provide captions or transcripts for the backlog, but the overhead was either pretty high or too expensive.
Tested this out in the span of a few hours and got a solution up and running to download the video from Youtube, spit out the transcription and upload the resulting transcription file externally. We're still missing a piece to upload directly to YouTube, but it's a start!
As a part of this experiment, we built out some templates that will allow anyone to play around with Whisper in our platform. If you're interested in seeing it, we built a video for doing the process with our templates [2], or directly with Python [3].
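The core of that pipeline is roughly the sketch below (not our actual template; the yt-dlp invocation and file names are just placeholders):

```python
import subprocess
import whisper

video_url = "https://www.youtube.com/watch?v=..."  # placeholder

# 1. Pull down just the audio track with yt-dlp (assumes it's on PATH).
subprocess.run(
    ["yt-dlp", "-x", "--audio-format", "mp3", "-o", "video.%(ext)s", video_url],
    check=True,
)

# 2. Transcribe it and write the plain-text transcript out for upload.
model = whisper.load_model("small")
result = model.transcribe("video.mp3")
with open("video_transcript.txt", "w") as f:
    f.write(result["text"])
```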
I really wish I had this about half a year ago when I was building a tool to automatically turn online school lectures into searchable, clickable transcripts (kind of like YouTube or EdX transcripts).
I was originally using Adobe Premiere Pro's speech to text to do it, and wrote Python to convert its output to the Hyperaudio format on GitHub. With this, I can totally skip all of that step and this is fully open source, too.
App idea:
Build an app that takes a video and uses Hyperaudio or a similar project to add a clickable and searchable transcript (clicking in transcript seeks video)
You still interested in this? I'd be keen to chat to you, worked on a searchable transcript provider for educational youtube videos (likewise, unfortunately pre-whisper, so I did a lot of work with sentence completion perplexity and rpunct to try and improve transcript quality from youtube automatic transcriptions). Can be contacted at revision.ai and demo what we were able to do till now, would be great to hear your thoughts.
So I guess we can easily use this to generate subtitles?? Which would be nice! Cause ummm some of the movies that I download from the internet arrrrrr! don't have subtitles available
I'm seeing some weird bugs. For example, in one 30 minute mp3, about 6 minutes in it decided that someone said "2200." And then exactly 5.000 seconds later, "2200". And every 5.000 seconds after that, for the next 24 minutes. (No one actually repeated "2200" for 24 minutes.)
A second run gave better results, but in most runs I do see instances where phrases repeat from 2-20 times.
I'm surprised by the quality on non-English languages, given that 80+% of the training data is English, and the rest is split between tens of languages.
It's sometimes close to perfect, and sometimes goes off the rails; I think the model tries to establish some sort of consistency for each sentence; if it starts wrong on the first few words of a sentence, it can't build the rest properly.
1. Make sure you're using a model that isn't suffixed with `.en` (`base`, not `base.en`).
2. Use `model.transcribe(your_input_audio, language='Japanese', task='translate')` ... with the appropriate input language.
It understands my Swedish attempts at English really well with the medium.en model. (Although, it gives me a funny warning: `UserWarning: medium.en is an English-only model but received 'English'; using English instead.` I guess it doesn't want to be told to use English when that's all it can do.)
However, it runs very slowly. It uses the CPU on my MacBook, presumably because it hasn't got an NVIDIA card.
Googling about that I found [plaidML](https://github.com/plaidml/plaidml) which is a project promising to run ML on many different gpu architectures. Does anyone know whether it is possible to plug them together somehow? I am not an ML researcher, and don't quite understand anything about the technical details of the domain, but I can understand and write python code in domains that I do understand, so I could do some glue work if required.
I was comparing a batch of transcriptions between these models and vosk, and noticed that the medium.en model produces some weird results compared to the others. I've seen a number of loops with one word or a small sequence of words repeating several times. It seems more prone to output that reads like nonsense than the others.
More troubling is a short audio clip that got a few full sentences back, several times the text length that comes back from the other models or vosk. The content of the sentences is extremely far from the audio content. The best alignment I can find is the first word of medium.en's interpretation is somewhat phonetically similar to the audio.
The small.en model doesn't show these behaviors, at least in this data set.
The whole value of this model is in the 680,000 hours of training data, and to reuse that value you need the large model, not the smaller ones. The smaller versions just don't have enough capacity to represent the training data properly.
I get that. I'm saying the medium.en model specifically seems to have some weird edges to its behavior that is not present in the models up or down the scale from it, or similarly (the plain 'medium' model).
It's the only one that seems to be occasionally spitting out significant chunks of training data versus something that resembles the audio.
I'd love to find a way to test this with longer audio but I don't have GPU resources and not exactly sure how to load that into the Colab. Is anyone planning on hosting or sharing a model that can be used by others to test longer form audio (for podcast transcription)?
First off, it seems that the model can easily run on M1/M2 with minor modifications. However, the `aten::_index_put_impl_` operator is currently not supported, and the fallback always slows things down quite a lot.
Second, is there a bug with how the script processes incoming audio segments? For a short 4 second clip, what I got was:
> [00:00.000 --> 00:03.760] Okay, Eunice, travel plans. I need to be in New York on Monday, L.A. on Tuesday, New York on Wednesday, L.A. on Thursday. You're knocking Friday. Got it?
> [00:03.760 --> 00:28.760] Got it.
However the final segment should have been shy of 1 second.
It mistakenly thinks the last segment was 25 seconds long and makes you wait for processing.
AI speech recognition FN scares the heck out of me...
for so many reasons.
But one that really pisses me off is not being able to turn it off on the iphone, and the fact that aside from "hidden cameras in my airBnB" -- soon we will have to worry about secret listening machines EVERYWHERE
"Secret listening machines everywhere" was a pretty big thing in East Germany. It's also the central theme of the movie The Lives of Others.
Of course, the ability to scale this more cheaply (throwing more compute at it, instead of more people) is somewhat scary, but it's not really introducing a new capability. Especially since you still have to do something with the transcript. An AirBnB landlord who reads the transcript of what you said could as well have listened to the recording.
I think it's a new capability to add good speech to text, search, and models that can understand and process text. You have microphones recording speech everywhere, models turning that speech into easily searchable text, and something like GPT-3 reading all the speech and raising red flags for any transgressive idea you please.
Yes, and if you want AI that is searching for “dissenters” we shall soon have “speech police” or tickets or some format of authoritarian punitive actions powered by this
The Kurzweil podcast appearance on Lex Fridman is nuts, and while I love Kurzweil, holy crap, even with my dystopian outlook he makes it even worse when you listen to even half of it…
Exactly - imagine when we get to the point where, regardless of your "crime", your punishment is 'augmented' by the "thing that you said in the past" AND when it starts to be able to connect to APIs of your social/whatever accounts and AI-Auto-Cancel you....
We will see an explosion of AI capabilities in the next couple of years. This will have a huge impact on our lives, much of it good but some of it also bad.
I wonder how much the 30 second window is impacting performance?
Anecdotally, I feel like there are plenty of times that I need context from more than 30 seconds ago to understand some technical jargon that's being discussed.
I know this isn't a tech support forum but maybe someone here knows. I'm attempting the sample python code from the github and almost get a transcription running on my work laptop without a GPU, but I run into this error message:
>>> result = whisper.decode(model, mel, options)
Traceback (most recent call last):
[snip]
RuntimeError: "slow_conv2d_cpu" not implemented for 'Half'
It looks like a Torch error, is there some twiddling with "options" I can do to get it to run?
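Edit: my best guess is that the default options request half precision, which the CPU conv kernels apparently don't implement; passing fp16=False looks like the thing to try (untested guess on my part):

```python
import whisper

model = whisper.load_model("base")
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# fp16 defaults to True; on a CPU-only machine the half-precision
# conv kernels aren't implemented, so ask for full precision instead.
options = whisper.DecodingOptions(fp16=False)
result = whisper.decode(model, mel, options)
print(result.text)
```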
Can you plug this into a computer on your premises to get speech recognition without amazon, apple or google's cloud (or any other cloud) involvement?
Right now I decline all speech recognition because I don't want Orwellian listening devices in my house or pocket, and haven't seen an answer. (Also, I haven't cared enough about speech command interfaces to bother with a load of research - lazy me.)
I just wrote a script with Hazel to automatically transcribe my voice notes to txt. It handles punctuation extremely well. What a wonderful contribution!
Be wary of using this model - the licensing of this model seems sketchy. Several of the datasets used for training like WSJ and TED-LIUM have clear non-commercial clauses. I'm not a lawyer but releasing a model as "MIT" seems dubious, and hopefully OpenAI has paid for the appropriate licenses during training as they are no longer a research-only non profit.
This is a big dispute right now: OpenAI and other AI companies generally take the position that models learning from data does not make the output of the models a derivative work of that data. For example, GitHub Co-pilot uses all publicly available GitHub code regardless of license, and DALLE-2/StableDiffusion/etc use lots of non-free images. I don't think this has been challenged in court yet, and I'm very curious to see what happens when it is.
I think it might be even less problematic with something like Whisper than with DALLE/SD? Merely consuming data to train a system or create an index is not usually contrary to the law (otherwise Google wouldn't exist) – it's the publication of copyright content that's thorny (and is something you can begin to achieve with results from visual models that include Getty Photos logo, etc.)
I think it'd be a lot harder to make a case for an accurate audio to text transcription being seen to violate the copyright of any of the training material in the way a visual could.
> models learning from data does not make the output of the models a derivative work of that data
Most of the debate seems to be happening on the question of whether everything produced by models trained on copyrighted work represents a derivative work. I argue that at the very least some of it does; so the claim said to be made by the AI companies (see quote above) is clearly a false one.
We're in a weird place now where AI is able to generate "near verbatim" work in a lot of cases, but I don't see an obvious case for treating this any differently than a human reproducing IP with slight modifications. (I am not a lawyer.)
For example, copyright law currently prevents you from selling a T-shirt with the character Spider-Man on it. But plenty of AI models can give you excellent depictions of Spider-Man that you could put on a T-shirt and try to sell. It's quite silly to think that any judge is going to take you seriously when you argue that your model, which was trained on a dataset that included pictures of Spider-Man, and was then asked to output images using "Spider-Man" as a search term, has magically circumvented copyright law.
(I think there's a valid question about whether models represent "derivative work" in the GPL sense specifically, but I'm using the idea more generally here.)
That's right: the model is definitely capable of creating things that are clearly a derivative work of what it was trained on. But this still leaves a few questions:
* Does the model require a copyright license? Personally I think it's very likely a derivative work, but that doesn't necessarily mean you need a license. The standard way this works in the US is the four factors of fair use (https://copyright.columbia.edu/basics/fair-use.html) where Factor 1 is strongly in favor of the model being unrestricted while 2-4 are somewhat against (and in some cases 4 is strongly against).
* Is all output from the model a derivative work of all of the input? I think this is pretty likely no, but unclear.
* Does the model reliably only emit derivative works of specific inputs when the user is trying to get it to do that? Probably no, which makes using one of these models risky.
This is even slightly more direct: access to WSJ data requires paying LDC for the download, and the pricing varies depending on what institution / license you're from. The cost may be a drop in the bucket compared to compute, but I don't know that these licenses are transferable to the end product. We might be a couple court cases away from finding out but I wouldn't want to be inviting one of those cases :)
Are there any AI/ML models that don't use sketchy licensed datasets? Everything seems to be "downloaded from the internet, no license" or more explicitly proprietary. The only exception I can think of would be coqui/DeepSpeech?
How well does it do for technical and domain oriented speech? For example I have audio recordings of a senior explaining some very technical aspects of our software. Will it understand the technical terms in that speech?
I guess I will need to download and run on it to see how correct it is.
It would be exceptional to get a healthy competitor to microsoft/nuance's dragon monopoly on voice recognition in healthcare. At a couple thousand bucks a license and the more recent SaaS subscription trend there is a lot of money to be made in that space.
This is absolute garbage Python, as I am neither a Python developer nor a good developer. I was trying to play around with real-time transcription. However, it does work!
* recording
* done recording
Recording saved to file.wav
Press enter to transcribe
/Users/laptop/Development/Personal/Public/pythonProject1/venv/lib/python3.9/site-packages/whisper/transcribe.py:70: UserWarning: FP16 is not supported on CPU; using FP32 instead
warnings.warn("FP16 is not supported on CPU; using FP32 instead")
Detected language: english
Goodbye, I need to go pick up my wife.
Press enter to start recording
Any improvements welcome here.
```
# Minimal record-and-transcribe loop. The recording helper here assumes the
# `sounddevice` and `scipy` packages (my choice - swap in whatever recorder you like).
import sounddevice as sd
from scipy.io import wavfile

import whisper


def record_microphone(seconds, samplerate=16000, filename="file.wav"):
    # Capture mono audio from the default input device and save it as a WAV file.
    print("* recording")
    audio = sd.rec(int(seconds * samplerate), samplerate=samplerate, channels=1)
    sd.wait()  # block until the recording finishes
    print("* done recording")
    wavfile.write(filename, samplerate, audio)
    return filename


if __name__ == '__main__':
    seconds = 5
    model = whisper.load_model("base")  # load the model once, outside the loop
    while True:
        print("Press enter to start recording")
        input()
        filename = record_microphone(seconds)
        print("Recording saved to " + filename)
        print("Press enter to transcribe")
        input()
        result = model.transcribe(filename)
        print(result["text"])
```
Oh, this is a relief, to have something open source in this field. I had been using Mozilla DeepSpeech for transcribing my voice notes, often with hilarious-to-incomprehensible results. DeepSpeech is dead, so I will be sure to check this out.
Most of the comments here are about law enforcement. I would like to point out that it might be a boon for dictation software. This may make it easier to dictate text/code etc. in any environment.
It seems like Stability AI's release of Stable Diffusion has led to some real disruption in the ML field regarding open source, and this doesn't seem to be limited to image generation. Excited to see what comes next.
I'm thinking of releasing a plugin for Unity that can be used to match a phrase to an action. Seeing Whisper makes me think I should include a way to use voice and not just text.
Is this practical to use on the "edge" (for voice control)? Would love to know if anyone has a rough idea of how fast/slow this would be on an M1 Mac or a V100.
I've been experimenting with voice interfaces where typing is replaced by talking, but I find it hard to transition users to voice - we 'seem' to prefer typing to talking.
Personally, I would rather type than talk when interacting with a computer. The only time I use voice interfaces is when the physical interface is so poor it's just easier to use voice. Apple TV devices are an example of this.
Combine the translation + transcription with voice synthesis, and once compute power allows for this to be miniaturized we will be able to have babel-fish technology in real life.
Could someone tell me whether it's possible to somehow feed data into this project to improve its translation and transcription capabilities on our own?
Hmm are there any noteworthy open sourced speech to speech models? Like transform a spoken line to another voice, copying both the words spoken and the inflections?
I'm seeing even worse. On my M1 Max 2021 macbook pro, I tried transcribing a 30 minute video file and left it on overnight and it was only half way through. I feel like something could be wrong with my setup but I'm only using the defaults.
Why not make a demo that you can try out via navigator.mediaDevices.getUserMedia? Of course you will get good results if you demo using the training set.
Oh nice - I have an immediate use case for this. This looks accessible enough that the sci-fi dream of instantaneous audio translation is suddenly within reach.
I'm still not successfully using the GPU, but it's working decently quickly (with the base model - it's incredibly slow to use the Large model) using just the CPU. I'm going to have to check what magic stable-diffusion is doing to enable the GPU :(
There's a --device flag you can pass. I've been trying to get `--device cuda` to work on my Windows machine and it's saying that torch wasn't compiled with CUDA. Trying to figure out what's going on there.
And on the M1, supposedly PyTorch has support for hardware acceleration using MPS (Metal Performance Shaders, announced here https://pytorch.org/blog/introducing-accelerated-pytorch-tra...) but when I tried `--device mps` it blew up with an error "input types 'tensor<1x1280x3000xf16>' and 'tensor<1xf32>' are not broadcast compatible".
> I've been trying to get `--device cuda` to work on my Windows machine and it's saying that torch wasn't compiled with CUDA.
I struggled with the same. Here's what worked for me:
Use pip to uninstall PyTorch first; it should be "pip uninstall torch" or similar.
Find the CUDA version you have installed[1]. Go to the PyTorch Get Started page[2] and use their guide/wizard to generate the pip string, and run that. I had to change pip3 to pip FWIW, and with CUDA 11.6 installed I ended up with "pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116".
After that I could use --device cuda, and the difference was immense. On my 2080Ti it went from roughly an hour per minute of audio with the large model to 10-20 seconds.
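If you're unsure whether the reinstall actually took, a quick sanity check from Python is to confirm torch can see the GPU before handing it to Whisper (the file name below is just a placeholder):

```
import torch
import whisper

# If this prints False, the installed torch build has no CUDA support and
# --device cuda will keep failing; reinstall from the CUDA wheel index as above.
print(torch.cuda.is_available())

if torch.cuda.is_available():
    model = whisper.load_model("large", device="cuda")  # put the weights on the GPU
    result = model.transcribe("recording.mp3")          # placeholder file name
    print(result["text"])
```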
Yep, same for me, on M1 after enabling MPS (with `model.to("mps")`) it just either SIGSEGV or SIGABRTs every time with that line. The extremely unclean nature of the abort is making it hard to debug :(
I noticed the size seems to correspond to the model. With a large model, the error is tensor<1x1280x3000xf16>. With tiny, it's tensor<1x384x3000xf16>, and with medium it's tensor<1x1024x3000xf16>. It also seems like a bad thing that those are f16's but the "expected" data is f32.
I'm giving up for the night, but https://github.com/Smaug123/whisper/pull/1/files at least contains the setup instructions that may help others get to this point. Got it working on the GPU, but it's… much much slower than the CPU? Presumably due to the 'aten::repeat_interleave.self_int' CPU fallback.
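For reference, this is roughly the MPS setup I'm attempting - a sketch, assuming the PYTORCH_ENABLE_MPS_FALLBACK escape hatch for ops that MPS doesn't implement yet; it's this path that still aborts for me:

```
import os
# Must be set before torch is imported; lets unsupported ops fall back to the CPU
# (which is where the repeat_interleave fallback mentioned above comes from).
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch
import whisper

assert torch.backends.mps.is_available(), "this torch build has no MPS support"

model = whisper.load_model("base")                  # load on the CPU first...
model = model.to("mps")                             # ...then move the weights to Metal
result = model.transcribe("test.wav", fp16=False)   # placeholder file; fp16=False since the
                                                    # f16/f32 mismatch above looks suspicious
print(result["text"])
```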
Also hitting a nice little PyTorch bug:
> File "/Users/patrick/Documents/GitHub/whisper/whisper/decoding.py", line 388, in apply
logits[:, self.tokenizer.encode(" ") + [self.tokenizer.eot]] = -np.inf
> RuntimeError: dst_.nbytes() >= dst_byte_offset INTERNAL ASSERT FAILED at "/Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/mps/operations/Copy.mm":200, please report a bug to PyTorch.
I got it working inside a docker container on my M1 MBP. FWIW, I'm having my $180 tinyminimicro PC run a translation task while my M1 MBP runs a transcription task with the same audio input. So far, the PC is actually outputting results a lot faster than the MBP. Interesting results.
Probably need to pass some kind of options when initializing. The command itself works fine, just shows a warning: warnings.warn("FP16 is not supported on CPU; using FP32 instead")
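For what it's worth, you can silence that by passing the option through transcribe() yourself - a minimal sketch (placeholder file name), assuming you're staying on the CPU:

```
import whisper

model = whisper.load_model("base")
# Requesting fp32 up front means the code never tries half precision on the CPU,
# so the "FP16 is not supported on CPU" warning is never triggered.
result = model.transcribe("file.wav", fp16=False)
print(result["text"])
```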
(after running the command for setuptools)
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: pip in /Users/xxx/Library/Python/3.9/lib/python/site-packages (22.2.2)
Requirement already satisfied: setuptools in /Users/xxx/Library/Python/3.9/lib/python/site-packages (65.3.0)
----
after trying whisper installation:
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [20 lines of output]
Traceback (most recent call last):
File "/Users/xxx/Library/Python/3.9/lib/python/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 363, in <module>
main()
File "/Users/xxx/Library/Python/3.9/lib/python/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 345, in main
json_out['return_val'] = hook(*hook_input['kwargs'])
File "/Users/xxx/Library/Python/3.9/lib/python/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 130, in get_requires_for_build_wheel
return hook(config_settings)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/site-packages/setuptools/build_meta.py", line 154, in get_requires_for_build_wheel
return self._get_build_requires(
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/site-packages/setuptools/build_meta.py", line 135, in _get_build_requires
self.run_setup()
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/site-packages/setuptools/build_meta.py", line 150, in run_setup
exec(compile(code, __file__, 'exec'), locals())
File "setup.py", line 2, in <module>
from setuptools_rust import Binding, RustExtension
File "/private/var/folders/lj/7x6d3dxd3cbdtt484k6xsmyh0000gn/T/pip-build-env-ieaydl8r/overlay/lib/python3.9/site-packages/setuptools_rust/__init__.py", line 1, in <module>
from .build import build_rust
File "/private/var/folders/lj/7x6d3dxd3cbdtt484k6xsmyh0000gn/T/pip-build-env-ieaydl8r/overlay/lib/python3.9/site-packages/setuptools_rust/build.py", line 23, in <module>
from setuptools.command.build import build as CommandBuild # type: ignore[import]
ModuleNotFoundError: No module named 'setuptools.command.build'
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
Not quite sure if this is related, but since there are a bunch of lines in there referencing Rust: I had to install the Rust compiler on my Mac (`brew install rust` if you use Homebrew). This is not mentioned in the installation instructions.
Nope, that doesn't look good! I honestly just googled the error and installing setuptools fixed it for me, but I barely know anything about the Python ecosystem so I'm really just fumbling around here.
I got some super weird results with the 'medium' model and language Japanese (with --task translate). The song is False Sympathy by Mondo Grosso.
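(For anyone wanting to reproduce this kind of run through the Python API rather than the CLI, here's a rough sketch - the file name is a placeholder for my rip:)

```
import whisper

model = whisper.load_model("medium")
# task="translate" makes the model emit English directly instead of Japanese text;
# language can also be omitted to let it autodetect.
result = model.transcribe("false_sympathy.opus", language="ja", task="translate")
for segment in result["segments"]:
    print(f"[{segment['start']:.1f}s -> {segment['end']:.1f}s] {segment['text']}")
```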
"[01:17.000 --> 01:32.000] Translated by Releska" when using the translate to english. That entire part of the song is instrumental. This line does not appear at all in the original transcribe only in the opus format rip.
It shows up in the yt rip in format 251 (opus), but not in format 140 (aac from youtube), nor the flac rip. All three are giving different results.
The translation quality is tied to bitrate. Same song converted to different words, the only difference being bitrates and formats. Converting my own rip with the same parameters as yt (opus @140 and then @130) didn't allow me to reproduce this error.
The model hung for a solid extra minute at the end when translating to english, the last 90ish seconds of the song took real time 60 seconds, while the entire rest took about 90.
The same behavior was not observed with the transcribe.
Some of the english words are incorrect but that was expected. The first Japanese "mistake" I found was "全ては二人の" instead of "すべては ふたりの". With the left being what whisper wrote. A single random word "hey" was transcribed/translated to english even though it's the singer elongating the 園 while singing the 楽園. "落ちてゆく 二人で繋がれた二人のラグ HEY" instead of "落ちていく 鎖でつながれた 二人の楽園" .
I am using the official subtitles released on the YouTube video.
It's a complex Japanese song with both Japanese and English, and the original transcription took about 20 real-time seconds to produce the first line, 130 seconds for the whole song. It seems to show results in 20-second window increments, but this seems to depend on what it considers audio and what it is throwing away.
On my computer I wasn't able to use the large model because I ran out of VRAM (I have 8 GB; not sure how much more it'd require), so I ran it with medium.
The song is False Sympathy by Mondo Grosso. The MV is suggestive, in case that matters. I grabbed a fresh audio rip from YouTube because I didn't want to take the CD out of its case.
It is translating this version differently from the director's cut version. I ripped both as Opus.
There is something weird about how it is handling the Opus-encoded version, as I find the same "Translated by Releska" in a WAV version transcoded from the Opus.
Japanese output will contain a lot of tiny mistakes. However, the whole output is still good enough - like 95%-plus good enough.
I found a lot of mistakes in 3-4-character kanji compounds ... and I guess most native Japanese speakers make mistakes from time to time too, which is why they put up a lot of buzzwords on screen with all kinds of highlighting to avoid second-guessing.
If the Whisper models provide any benefits over the existing Talon models, and if it's possible to achieve any kind of reasonable interactive performance, I will likely integrate Whisper models into Talon.
Talon's speech engine backend is modular, with Dragon, Vosk, the WebSpeech API, and Talon's own engine all used in different ways by users.
Seriously, when I first landed on the page without reading anything else, I thought it was text-to-speech with the "Micro Machines" example and I was floored. The speech-to-text is obviously mind-blowing too.
Got my hopes high that there's finally an open source solution that can deal with the Georgian language, only to get my hopes brutally destroyed. It successfully detects the language and then produces garbage. Passing the language manually produced similar results.
Detected language: georgian
[00:00.000 --> 00:21.560] én
[00:21.560 --> 00:23.240] 我伦伦…
[00:23.280 --> 00:43.720] 我伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦因为b forestry
On the medium model:
Detected language: georgian
სრჱირესრრრრრრრრრრრრრნსსსრრრრრეე რრირრრრრრრრრე რსრნგნრრრრსრრრრრრრორრრრრრრრრრრ� ḵḸḇḤḾḤḾḤḾḤḾḤḾḤḾḤḾḤḾḾḤḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾ� ḥḾḼḥḾ
ḥḾḾ ḥḾḾ ḤḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾ� ḲḵḽḻḽḾ Ḫḵḽḻḽ so� ḻḽḽ ḻḽḻḻḽ ḱᴇ᷻ᵒ ḳᶟᄤḱ ḯᵁ Ḳᴄᴍᴆ Ḧᴍ� Ḧᵒ ḳᴍᴇ ḽᴄᴍᴛᴄ Ḧᴇᴆ ḳᵗᴇ ḽḮᴆ Ḫᴇᴾ ḿᴏᴇᴄᴄᴏ
ច�izar� wait �ห� examined ᑇទមះៈេំ supervision ង� იეეეეეეეეეეეეეეეეე მაეე ეაეეეეეეეეეეეეეეეეეეეე დაეეეეეეეეეეეეე უეეეეეეეეეეეეე ეა� მიი სმეიი მმიეი Ⴢქ სიიეი
სავიე სიიითთიიმემი, რაეე სიიმე სიიი ღიიიიწეირი საეიეიი სიიეი სი� ვეეფვეიიიე ქლეეშეეროეეეეეეეეეეეეე. ეგეზ ეყაკშეიეეეეეეეეეეეეეეეეეეეეეეეეეეეეეა, ნრროპიროო მმუმინ
სეეკნფეე სეეჍიგოშ სჟებიმელელეეკირპიე სემეიმე სეეიმმმ სეენემეეი სე� ᑦ� Famose m인데요 hqe bywall jaini threshold ji jani den poder vlogging bywall Take the text Ba
tou yodamj je te shake ba te shake baou contour but whatever Baou cube baou cup Baou rope Baou people Qeful Qeful იმიიიმიბთმითიიითიიიიიიიი
რაოეოოოენპეეეიეიიიიიიიიიომიიიიიიიიი რიიიიიიიიიიიმიი� ნსეეეეეეეეეეეეეეე სარეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეე� მጇივ ეეეიდჼვვ ნაბდადებ
ლმირეეეეფედუივევეეეიიეეეეე რარეიეეეევეეეეევეე სარრეეეეეეეეეეეეეეეეეეეეეეეეეეე ხშიიიიიიიიიიიიი ლიიიიიიი ლიიიიიიიიიი ლიიი ლიიიიიიი ლაიიიიი ეიიიიიიიიიიიიიიი იიიი მ�
I've also tested it on a few other audio inputs, and it failed to produce meaningful results on all of them with all models.
There was one case with another audio clip [2] and the tiny model where it got at least some words close to their phonetic values, but printed them in Cyrillic instead of Georgian and tried to interpret some Georgian words as Russian.
I tried it out on a Hindi speech (https://www.youtube.com/watch?v=4EpfJxKyosE). The transcription starts off decent, but kind of gets stuck repeating the same thing at the 02:40 mark:
[00:00.000 --> 00:10.000] पचास ताल में हमने प्रगती किये, इससे को इंटार नहीं कर सकता।
[00:10.000 --> 00:20.000] छुनाओ के दौरान वोट मांगते हुए, सरकार की नीतियों पर कठोर से कठोर प्रहार करते हुए,
[00:20.000 --> 00:28.000] और पुरानी सरकार की नीतियों नहीं आलोचना करने के लिए लैक बहुत सामग्री थी।
[00:28.000 --> 00:35.000] हर जगे मैंने ये कहा कि मैं उन लोगों में से नहीं हूँ, जो पचास वर्च की उपलड्यों पर पानी फिर दे।
[00:35.000 --> 00:43.000] ऐसा करना देश के पुर्षार्थ पर पानी फिरना होगा। ऐसा करना देश के किसान के साथ अन्याय करना होगा।
[00:43.000 --> 01:01.000] मल्दूर के साथ जात्ती करनी होगा। आम आद्मी के साथ भी वो अच्छा व्योहार नहीं होगा। जो स्वाल आज मन में उच्छा है और उच्छना चाही है। आदावी को पचास साथ होने आये, हम जैनती मनाने जा रहे हैं।
[01:01.000 --> 01:18.000] आज देश की स्तिती क्या है। हम पिछर के होगे हैं। प्रगती की दोड़ में, जो देश हमारे साथ आजाद हुए थे, वो हम से आगे बढ़ के। जो देश हमारे बाच जन में थे, वो हमें पीचे छोड़ थे।
[01:18.000 --> 01:34.000] दुनिया के गरी तम देशों में हमारी गड़न आये। वीस फीज़ी से जाना लो गरीबी की रेका के नीचे। राक्तपती महुदाय के विभाशन में गाऊं का उल्लेक हैं ना पीरे का पानी नहीं।
[01:34.000 --> 01:50.000] हम प्राथमी शिक्षा अनिवारे नहीं कर सकते हैं। लड्कियों की शिक्षा की उपेक्षा हो रही हैं। लड्कि का जन्म लेना तो इस देश में अभी तक एक अभिशाप है।
[01:50.000 --> 02:07.000] क्या सरकारी कदम उठाकर समाज में जाग्दृती पैदा करकें। क्या सब लोगों को जुटाकर ये तो ऐसा काम है जिस में कोई दलबंदी के लिए इस्थान नहीं। हम देश का नक्षा नहीं बदल सकते हैं। देश में साधनों की कमी नहीं है।
[02:07.000 --> 02:07.000] और साधनों की अगर कमी है तो उसको ठीक दन्त से प्राप्त किया जा सकता है। साधन बड़ाए भी जा सकते है। लेकिन जो साधन हैं उनका ठीक उपयोग नहीं हो रहा। जंता के उपर टेक्स लगाकर जो दन्नि कप्ता किया जाता है। उसका लाग जंता तक नहीं पहु
[02:37.000 --> 02:37.000] रख्कम जाती है। विदेशी बैंको में दन जाने का सिल्सिला अभी तक क्यों काएं है। उसको लोकने के लिए क्या कदम उठाएगे। हम विदेशी पूजी के लिए प्रैत्रशील हैं विदेशी पूजी आए और अगर विदेशी पूजी आती है अच्छे दन्त की टेक
[03:07.000 --> 03:07.000] अच्छे दन्त की पूजी आती है अच्छे दन्त की पूजी आती है अच्छे दन्त की पूजी आती है अच्छे दन्त की पूजी आती है
[03:37.000 --> 03:39.000] अच्छे दन्त की पूजी आती है अच्छे दन्त की पूजी आती है
[04:07.000 --> 04:09.000] अच्छे दन्त की पूजी आती है अच्छे दन्त की पूजी आती है
[04:37.000 --> 04:39.000] अच्छे दन्त की पूजी आती है अच्छे दन्त की पूजी आती है
The translation does a much better job, however:
[00:00.000 --> 00:10.000] In the last 50 years, we have made progress, no one can deny this.
[00:10.000 --> 00:20.000] During the elections, while asking for votes, while attacking the government's policies harshly,
[00:20.000 --> 00:28.000] and to criticize the policies of the old government, a lot of material was needed.
[00:28.000 --> 00:35.000] Everywhere, I have said that I am not one of those people who pour water on the fruits of 50 years.
[00:35.000 --> 00:39.000] To do this, we will have to pour water on the efforts of the country.
[00:39.000 --> 00:43.000] To do this, we will have to do injustice with the farmers of the country.
[00:43.000 --> 00:45.000] We will have to do caste with the laborers.
[00:45.000 --> 00:50.000] Even with the common man, that will not be a good behavior.
[00:50.000 --> 00:55.000] The question that arises in the mind today and should arise,
[00:55.000 --> 01:01.000] Freedom has come to be 50 years, we are going to celebrate.
[01:01.000 --> 01:04.000] What is the situation of the country today?
[01:04.000 --> 01:07.000] Why did we get separated?
[01:07.000 --> 01:14.000] In the race of progress, the country that got freedom along with us, they went ahead of us.
[01:14.000 --> 01:19.000] The country that was after us, they left us behind.
[01:19.000 --> 01:25.000] In the poorest countries of the world, they counted us.
[01:25.000 --> 01:29.000] 20% of the population is below the poverty line.
[01:29.000 --> 01:35.000] In the speech of the President, there is no mention of villages or drinking water.
[01:35.000 --> 01:39.000] We cannot enforce primary education.
[01:39.000 --> 01:43.000] The education of girls is being neglected.
[01:43.000 --> 01:50.000] The birth of a girl is still a curse in this country.
[01:50.000 --> 01:55.000] Is it by taking government steps, by creating awareness in the society?
[01:55.000 --> 02:01.000] Is it by uniting all the people that there is no place for party?
[02:01.000 --> 02:05.000] Can't we change the map of the country?
[02:05.000 --> 02:08.000] There is no shortage of resources in the country.
[02:08.000 --> 02:14.000] And if there is a shortage of resources, it can be obtained in the right way, resources can be increased.
[02:14.000 --> 02:21.000] But the resources that are there, they are not being used properly.
[02:21.000 --> 02:30.000] The wealth that is collected by taxing the public, its profit does not reach the public, it does not reach the common man.
[02:30.000 --> 02:32.000] Where does it go?
[02:32.000 --> 02:35.000] Whose pockets are filled?
[02:35.000 --> 02:39.000] Whose treasury does that money go to?
[02:39.000 --> 02:44.000] Why is the chain of money going to foreign banks still established?
[02:44.000 --> 02:47.000] What steps have been taken to stop it?
[02:47.000 --> 02:52.000] We are motivated for foreign worship, foreign worship has come.
[02:52.000 --> 03:01.000] And if foreign worship comes for good technology, for infrastructure,
[03:01.000 --> 03:06.000] for education, then no one will object.
[03:06.000 --> 03:11.000] I believe that our communist friends will not object either.
[03:11.000 --> 03:19.000] But is the maximum use of the resources in the country happening?
[03:19.000 --> 03:26.000] Is it not true that corruption has become a national disease?
[03:26.000 --> 03:31.000] I remember that Swargi Rajiv Gandhi had said in a speech that I send one rupee from Delhi,
[03:31.000 --> 03:36.000] but where I send the rupee, as I reach there, 19 paise are left.
[03:36.000 --> 03:41.000] I asked him how this miracle happens.
[03:41.000 --> 03:47.000] Bhaskar said that when the rupee runs, it shrinks.
[03:47.000 --> 03:54.000] The rupee shrinks, it gets into the hand, it goes into the pocket, it becomes small.
[03:54.000 --> 03:58.000] It is difficult to recognize the rupee.
[03:58.000 --> 04:02.000] The rupee can be hidden.
[04:02.000 --> 04:06.000] The situation of the currency of the country is not good.
[04:06.000 --> 04:10.000] First, the government expenditure has increased, it is increasing.
[04:10.000 --> 04:17.000] It needs common consent to reduce without reducing.
[04:17.000 --> 04:24.000] No one can work in the same way.
[04:24.000 --> 04:27.000] Yes, our old Prime Minister Narasimha Raoji,
[04:27.000 --> 04:34.000] if he would have tried in this direction after stabilizing himself, then he would have succeeded.
[04:34.000 --> 04:47.000] But he was stuck in some such things that he could not pay attention to these problems.