Whisper – open source speech recognition by OpenAI (openai.com)
1705 points by _just7_ on Sept 21, 2022 | 481 comments



Neat, https://github.com/openai/whisper - they have open-sourced it, even the model weights, so they are living up to their name in this instance.

The 4 examples are stunningly good (the examples have speakers with heavy accents, speaking in foreign languages, speaking with dynamic background noise, etc.); this is far and away better than anything else I've seen. Will be super curious to see other folks trying it out and seeing if it's as robust as it seems, including when confronted with speech full of natural tics and uhhh's and uhmm's and everything in between.

I think it's fair to say that AI transcription accuracy is now decidedly superior to the average human's; what the implications of this are, I'm not sure.


It was already better. I edit a podcast and have > a decade of pro audio editing experience in the film industry, and I was already using a commercial AI transcription service to render the content to text and sometimes edit it as such (outputting edited audio).

Existing (and affordable) offerings are so good that they can cope with shitty recordings off a phone speaker and maintain ~97% accuracy over hour-long conversations. I'm sure it's been an absolute godsend for law enforcement and other people who need to gather poor-quality audio at scale, though much less great for the targets of repressive authority.

Having this fully open is a big deal though - now that level of transcription ability can be wrapped as an audio plugin and just used wherever. Given the parallel advances in resynthesis and understanding idiomatic speech, in a year or two I probably won't need to cut out all those uuh like um y'know by hand ever again, and every recording can be given a noise reduction bath and come out sounding like it was recorded in a room full of soft furniture.


>~97% accuracy over hour-long conversations. I'm sure it's been an absolute godsend for law enforcement

97% accuracy means roughly three or four errors per minute of speech. That seems potentially extremely problematic for something like law enforcement use where decisions with significant impact on people's day and/or life might be made on the basis of "evidence".


No it isn't. That just means 2-3% of your content needs to be double-checked by a person at the audio level, saving huge amounts of time - equally true of human transcription, in which individual words are often [UNINTELLIGIBLE].

Would you want to review this fully before going into court? Absolutely - because you'd want to play the recording to a jury for emotional impact. Can you rely on it when you want to quickly read through hours of conversation and make decisions about whether to invest further resources (which might just mean another hour of listening back to the original audio)? Also absolutely. Bear in mind that a lot of these errors have little to no semantic impact, being on the same level as typos or misspellings in a written communication.

Bear in mind too that if law enforcement (honest or not) is so interested in you that they're willing to record your conversations, your day is already ruined, you just don't know it yet. The change here is one of scale rather than quality.


Doesn't it mean 100% of your content needs to be double-checked? You can't easily identify which 2-3% of your content has errors. I'm aware that errors are more likely when the model is less confident of its predictions, but that shouldn't be enough.

(edit for clarification: errors are not always something like "[UNINTELLIGIBLE]", where the system knows it doesn't know; they can also be misrecognitions that the system believes in with high confidence.)


By the time you're prosecuting someone in court, yes of course you double, triple, quadruple check everything. That's why lawyers get paid the big bucks (for now...). But yes you can identify which content probably has errors and flag it as such.

Look, I have decades of experience dealing with human speech, and not just as an editor - I can trace the human voice from neural impulses in Broca's region through the physiology of vocal production, mechanical transduction into electrical signals, discrete Fourier transforms of the resultant waveforms into spectral information and back again, the reproduction of altered signals from time-aligned speakers to create a sense of spatialization, how those are processed in the human ear, and how the cilia are connected by nerves back to your brain. I'm a good enough editor that I can recognize many short words by sight of a waveform, or make 10 edits in a row by sight and know it will sound good on playback.

So when I say that machine transcription is as good as human realtime transcription now, I say so with the clear expectation that those decades of craft are very close to being rendered obsolete. I absolutely expect to hand off the mechanical part of editing to a machine within 2 years or so. It's already at the stage where I edit some interviews as text, like in a word processor, and then export the edited document as audio and it's Good Enough - not for every speaker, but more than half the time.

NPR and a lot of commercial broadcasters cut their material this way already, because you can get the same result from 30 minutes of reading and text editing that would require 3 hours of pure audio editing with no transcription.


What tools do you use to do this? I once hacked together an editor like this maybe a decade ago -- edit speech as text from a speech recognizer -- and sorely need one now.

Alignment of video to text is a big problem for me too.


This can be done via https://www.descript.com/ You can edit video/audio by editing the transcript.

You can even add/modify words that weren't originally there https://www.descript.com/overdub


Thank you!


> So when I say that machine transcription is as good as human realtime transcription now...

Would you go as far as to assert machine transcription can be used as an objective benchmark of a speaker’s verbal legibility?

It is fraught with political and interpersonal dynamics to approach someone even privately one on one today and gently suggest their career would get a huge boost if they hired a voice coach to help improve their verbal communication delivery. So even when I don’t directly mention their accent, it becomes a very sensitive subject with many.

However, if audio professionals like you can point to a system and say the raw biomechanics and acoustic physics of the world dictate that this is as physically and psychometrically good as audio parsing of human speech gets, regardless of whether the system was biologically evolved or ML evolved, the conversation can be couched even more objectively.

I enable recording and voice transcription in every meeting I can (ostensibly for DE&I but really for my own selfish purposes), and already observe in myself I have to work hard to overcome a tendency to gloss over speakers who don’t transcribe well when I review meeting transcripts to jot down any key information I might have missed taking notes upon during the meeting.

Note that I’m perfectly aware that my foreign language verbal skills are nowhere near the English skills of those I have tried to help. If the lingua franca of the coding world switched to Urdu tomorrow, then I’d hire help to learn and polish my spoken Urdu, like I went to a speech coach when learning public speaking because I can always use help in the many skills I lack.


Presumably you can use the 97% that is correctly transcribed to rapidly filter out the relevant content. This is likely to be only a small portion of the total content. Then you check 100% of that.


You double check things that you think are important, in this case, passages that will be used as evidence in court.


> I'm aware that errors are more likely when the model is less confident of its predictions, but that shouldn't be enough.

Suppose 90% of the errors are in the 10% where the model is least confident. Then you can review just 10% of your content and take a 2% error rate down to 0.2% error rate.
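With Whisper specifically, each transcribed segment comes back with an avg_logprob you could sort on. A rough sketch of that review workflow (the file name is just a placeholder):

    import whisper

    # Rough sketch: transcribe, then flag the least-confident ~10% of segments
    # for human review, using the per-segment average log-probability.
    model = whisper.load_model("base")
    result = model.transcribe("interview.mp3")  # placeholder file name

    segments = sorted(result["segments"], key=lambda s: s["avg_logprob"])
    review_budget = max(1, len(segments) // 10)  # bottom 10% by confidence

    for seg in segments[:review_budget]:
        print(f"REVIEW {seg['start']:7.1f}s-{seg['end']:7.1f}s: {seg['text'].strip()}")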


You can also use multiple transcription engines and then use mismatches among the text streams to narrow down the % of content that needs to be reviewed. This is quite similar to multi-voting OCR for document images.

The principle is that the engines have different failure modes (hopefully) and therefore the 2-3% error rate of each engine is in different areas of the audio. The key underlying assumption is that the events are mutually exclusive.

With 3 engines, you can use something like 2-of-3 stream matches to override the stream that mismatches.
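A toy sketch of the 2-of-3 vote, assuming the three transcripts have already been word-aligned (real engines insert and drop words, so in practice you'd run an edit-distance alignment first):

    from collections import Counter

    def vote(streams):
        # Majority vote per aligned position; positions with no 2-of-3 match
        # are flagged for human review.
        merged, flagged = [], []
        for i, words in enumerate(zip(*streams)):
            word, count = Counter(words).most_common(1)[0]
            if count >= 2:
                merged.append(word)
            else:
                merged.append("[REVIEW: " + "/".join(words) + "]")
                flagged.append(i)
        return " ".join(merged), flagged

    a = "the suspect left at nine".split()
    b = "the suspect left at night".split()
    c = "the suspect left at nine".split()
    print(vote([a, b, c]))  # position 4 is resolved by the 2-of-3 match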


I had to do a lot of manual transcription in Journalism school. Using a tool like Descript saved HOURS of my life. Generally it was 80% accurate, but going over a two-hour-long recording again at 3x speed while reading over the transcript, fixing errors from memory or pausing, took a five-hour job down to 30-40 minutes. Either way, somebody is going to have to listen to the recording. This just removes a layer of grunt work.


Having done audio transcription in college as a side gig, I can say it takes a lot longer than it sounds. Even at a decent 100wpm you'll take about 5 minutes to type out 1 minute of audio.

Not having to pause + rewind will save a ton of time for that 3%.


Maybe you could run the text through a grammar checker to identify the errors.


That might work if people were required to speak grammatically.


For real. The way people normally speak, with backtracking, repetition, restarting sentences, or stopping mid-sentence and starting a new one with entirely different nouns or subjects, is perfectly normal in synchronous conversation and isn't jarring, but written down as is, it's like 40% noise.


For a good example of this, read ANY of Trump's speeches transcribed.


I mean if you want to make it unnecessarily political, Biden's are worse: https://www.youtube.com/watch?v=3bWM1zsnTJc


To be fair, you chose a video that displays an amalgamation of the biggest gaffes of 2021 for Biden.

“During his term as President of the United States, Donald Trump made tens of thousands of false or misleading claims. The Washington Post's fact-checker had tallied the number as 30,573 by January 2021, an average of about 21 per day by the end of his presidency.” [1][2][3][4]

I think it's fair to say there would be a 100-plus-hour-long video / documentary if they were all compiled into one. Lovely!

  - [1] Fact Checker (January 20, 2021). "In four years, President Trump made 30,573 false or misleading claims". The Washington Post. Archived from the original on January 20, 2021.

  - [2] Kessler, Glenn (January 23, 2021). "Trump made 30,573 false or misleading claims as president. Nearly half came in his final year". The Washington Post. Archived from the original on January 24, 2021. Retrieved January 24, 2021.

  - [3] Elfrink, Tim (August 14, 2020). "'Do you regret at all, all the lying you've done?': A reporter's blunt question to Trump goes unanswered". The Washington Post. Retrieved August 14, 2020.
  - [4] https://en.m.wikipedia.org/wiki/Veracity_of_statements_by_Do...


Oh no no, I wasn't trying to be political, it's just one that I read... and wow, you're right!


>equally true of human transcription, in which individual words are often [UNINTELLIGIBLE].

ML systems somewhat notoriously do not necessarily make the same sorts of errors that a human would. And I'd expect a large portion of the errors to be transcribing the wrong words rather than indicating that a word couldn't be transcribed. That sort of error means that you can't really get away with manually reviewing just 3% of the audio.


ML tends to make weird mistakes rather than the subtle, contextually plausible ones human transcribers make, which likely makes them easier to spot.

And there are humans in the loop too, and an enormous amount of redundancy in the questions and answers, so even plausible false transcriptions will get picked up on if they matter. Nobody gets sent to jail simply because the transcription process - human or machine - accidentally substitutes "I did it" in place of "I didn't" midway through a two-hour interview.


The thing is that 'Likely' is very far away from 'always'. There is no guarantee the mistake will be easy to spot.

For entertainment purposes AI transcription is awesome.

For serious business applications, the ability to recognize mistakes will continue to be a field to which serious attention is given. It would be interesting to see an AI process double-check itself, and also run a logic check on whether the transcription makes sense, so that it can report sections flagged as incongruous or of dubious reliability.


+1. There is a widespread "metric fallacy" or "task fallacy" going around. Models of course optimize for metrics, so they tend to perform well on those related metrics.

Humans, however, are not simply metric optimizers. Though it's always in the interest of those corporations producing metric optimizers (i.e. models) to paint humans as such, so their models shine in comparison. They want humans to look like bad machines, so it looks like they should be automated. Not to say they shouldn't in many cases, just that there's a clear one-sidedness in all corporate PR (and funded research, especially that research which is also PR).

All this to say that yes I agree with you. And if we humans don't want our unsustainable economic growth to turn us even more into machines (as our bureaucratic creep has done quite well thus far), we should fight such rhetoric that aims to paint humans simply as machines or task-doers.


If you know which 2-3% are the false positives, you have a very lucrative business model.


When doing validation, I find it will often be the same errors repeated again and again in a transcription. Like it will fail on someone's or something's name (that is rare / unique) and map it onto a known, similar-sounding word.


Sometimes even humans will disagree about what was said in a recording - I had this happen recently. I heard a specific sentence; the other person heard the exact opposite. I cannot say who was right; even after listening to the recording several times on headphones and speakers, I'm as certain of my interpretation as was the other party.


I think an [UNINTELLIGIBLE] indication would be a great addition to automatic transcription systems.


It'd [UNINTELLIGIBLE score="92%" alternatives="pro-rabble; pourable"]probably[/UNINTELLIGIBLE] be useful to make a markup-based output... though you'd probably find it gave you more info than you wanted.


It already exists. The commercial product I use most is called sonix.ai, and I think they have a free tier or trial period. It has some shortcomings, but it's shockingly good.


Google Voice voicemail transcription used to do this, with varying levels of gray. It seems that feature is gone, now.


Yeah, I tried to use automated transcription for a research project and we had to do it all manually because the few errors (I would say it did pretty well given our recording quality) were often dropping words like "not", which changed the whole meaning of a sentence! It was a useful assistance during transcription, but I really hope they would verify it was correct before arresting anyone based on it.


Microsoft announced their voice transcription technology a couple of years ago and were also touting ~97-98% accuracy, which was actually better than human transcription error rates. The errors are usually in part due to people garbling their own speech, or they move their head while talking and the microphone misses a syllable. Anything in that error bar would probably fall under "reasonable doubt".


If it's anything like Microsoft Teams transcription, I doubt the 97%+ accuracy.


I've worked with similar technology in the law enforcement space and the software is never used to make decisions. You can make out critical timestamps in conversations and a law enforcement officer will always manually confirm the software's assessments.


Given that law enforcement has made similar claims about technology use in the past that turned out to be false, I have no faith in this claim.


In all honesty, this is the correct mindset to have. I have limited expertise in this topic, and you should be aware that other law enforcement agencies probably do not handle this the same way.


I imagine a certain percentage of a given population is on a voice call at any one time.

1. Set up a computer with voice recognition software that flags certain patterns.

2. Connect computer to voice call communication network.

3. Configure computer to switch between calls every x number of seconds.

Think of it like a system to generate leads for law enforcement that can be integrated with other systems to produce the best quality leads.


This is called "a fishing expedition" and is wildly unconstitutional in the US.

>The right of the people to be secure in their persons, houses, papers, and effects, against unreasonable searches and seizures, shall not be violated, and no Warrants shall issue, but upon probable cause, supported by Oath or affirmation, and particularly describing the place to be searched, and the persons or things to be seized.


Are you sure about that? [0]

Besides I wasn't talking about the USA when I said this. I was remembering a conversation I once had with a person who worked as a technician in a telephone exchange.

[0] - https://en.wikipedia.org/wiki/Jewel_v._NSA


Yes, it is wildly unconstitutional, but in practice don't the courts endorse the asinine "it's not a search unless we find something" argument from the NSA?

Power always just finds a way to rationalize what it wants to do.


see: Operation PRISM


Not really. Imagine that they do simple keyword matching on the text. Anything that's missed because the keyword falls in the 3% of errors, the criminals get away with. Anything that matches is then checked by a human (by listening to the audio at that timestamp). So you only need to manually check the matches, and even then only if something you're interested in is found.
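Sketching that workflow with Whisper's segment timestamps (the keywords and file name are made up):

    import whisper

    KEYWORDS = {"shipment", "warehouse", "cash"}  # hypothetical terms of interest

    model = whisper.load_model("base")
    result = model.transcribe("wiretap.mp3")  # hypothetical recording

    # Only segments containing a keyword get handed to a human, with timestamps
    # so they can jump straight to that spot in the audio.
    for seg in result["segments"]:
        if set(seg["text"].lower().split()) & KEYWORDS:
            print(f"Listen at {seg['start']:.0f}s-{seg['end']:.0f}s: {seg['text'].strip()}")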


One would think that the few crucial bits of information gleaned are listened to manually, and the machine transcription is not the only thing the judge or a jury sees.


You have absolutely ruined someone's day way before they're sitting in front of a jury.


Stuff like that is a very good tell that someone has zero experience with law enforcement.


I've not found that to be the case.

For technical content, I use Rev.com and provide a glossary, and real humans do the transcript. Other AI transcription services get lots wrong because the context often matters. I've so far never found AI to handle words like "TCP/IP" or "FAT disk format" or "Big Endian" well.

I'm interested to test out whisper on this one.

https://corecursive.com/063-apple-2001/


There's already software that can imitate a person's voice, so we have all the pieces already to do speech-to-text, clean up with GPT-3, and back to text-to-speech in the original person's voice. Maybe with a style transfer to keep the person's inflections etc the same?


I think something similar already exists. See this, for example: https://koe.ai/recast/

Although I don't know if they're using anything similar to what you suggest. Very cool idea, anyway!


Since you work on podcasts, do any open source transcription tools currently identify the speaker in the output? This would be particularly helpful for interviews.


Not sure about open source, but in general, automated transcription systems need a separate track for each different speaker. So for example, for a phone call with one person on each end, you need two separate channels (recording systems usually split them left/right on one stereo file).
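A rough sketch of that setup for a stereo call recording (one speaker per channel), using pydub to split the channels; the file names are hypothetical:

    from pydub import AudioSegment  # needs ffmpeg installed on the system
    import whisper

    call = AudioSegment.from_file("call.wav")
    left, right = call.split_to_mono()  # one speaker per channel
    left.export("caller.wav", format="wav")
    right.export("callee.wav", format="wav")

    model = whisper.load_model("base")
    for speaker, path in [("Caller", "caller.wav"), ("Callee", "callee.wav")]:
        print(f"{speaker}: {model.transcribe(path)['text'].strip()}")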


I'm not sure if you've tried Descript, but their ML-based "Studio Sound" filter makes bad audio sound like it was recorded and edited nicely.


Any recommendations for particular services?


I use a service called sonix.ai. It's paid but I think they have a free tier or trial period, and it's not very expensive. I'm excited about this new OpenAI thing because I'd rather do it on my own hardware than send it to the cloud, but this company has earned its commercial success.


That is an exciting possibility. Being able to fix bad setups and missed takes automagically. It’s always been possible, just expensive and time consuming for moderate improvements.


The French version is a little contrived. The speaker is a native speaker, but the text is obviously the result of a translation from English to French, not idiomatic French.

I will try to put the code to the test, see how it goes.


Interesting. I'm a non-native French speaker, and the original French piece struck me as being entirely normal (but maybe it was just the perfect French accent that swayed me). Can you please point out what he said that wasn't idiomatic or naturally worded French?


Little details. The second sentence is really bizarre:

> Nous établissons que l'utilisation de données d'un tel nombre et d'une telle diversité est la raison pour laquelle le système est à même de comprendre de nombreux accents...

It doesn't sound natural at all. An idiomatic formulation would be more along the lines of:

Le recours à un corpus [de données] si riche et varié est ce qui permet au système de comprendre de nombreux accents (With 'corpus', 'données' is implied.)

Of course this is just an example, and I'm sure other French speakers could come up with a different wording, but "données d'un tel nombre et d'une telle diversité" sounds really wrong.

This is also weird and convoluted:

> Nous distribuons en tant que logiciel libre le code source pour nos modèles et pour l'inférence, afin que ceux-ci puissent servir comme un point de départ pour construire des applications utiles

It should at least be "le code source DE nos modèles" and "servir DE point de départ", and "en tant que logiciel libre" should be placed at the end of the clause (after 'inférence').

Also, "construire" isn't used for code but for buildings, and "applications utiles" is unusual, because "utiles" (useful) is assumed. "...pour le développement de nouvelles applications" would sound more French.


That's interesting, as a québécois I don't agree with any of this. The only thing that raised an eyebrow was "est à même de", but if it turns out it's just another way of saying "capable de", I guess it's simply not a common idiom around here. Aside from that, I found the wording flowed well even if I personally would've phrased it differently.


Mystery solved. It was a Québécois.


Gonna have to agree with the other reply, as a French-Canadian: except for "servir comme un point de départ", which should be "servir de point de départ", that all sounds perfectly fine.


If this is actually "good" or even acceptable French Canadian, then it's a different language from French (and the blog post should mention it).

I kind of doubt it though -- the speaker doesn't have a Canadian accent (which is hard to miss), and in my (admittedly limited) experience, French Canadian isn't that different from French.


How funny to see that to French people, Quebec french sounds like machine translated english :)


At the start, the "Nous établissons" part, for example. You wouldn't write that if you were starting from scratch in French.


That's the first thing that I discovered when I visited Paris for the first time.

No one says "Nous", there, ever. Perhaps the politicians, while giving a speech. Everyone else uses the more informal "On".

I felt duped by my French classes.


Older generations sometimes do. My grandma and her sisters nearly never use "on".

It is often used for larger groups or when the group is not very personally connected. For instance, when talking about your company doing something, you will often use "nous". I would also use "nous" to refer to the whole list of invitees to a wedding. And in formal contexts like research papers, reports, etc., you would never use "on", always "nous".


You can see from the transcript where the model made some errors, for example:

> We distribute as a free software the source code for our models and for the inference [...]

Should be

> We are open-sourcing models and inference code [...]

Another example

> We establish that the use of such a number of data is such a diversity and the reason why our system is able [...]

Should be

> We show that the use of such a large and diverse dataset leads to improved robustness [...]


I'm interested in building something with this to aid my own French learning. Would love to read your findings if you end up posting it somewhere like twitter/blog!


Last try for tonight with Baudelaire.

Original:

    Trois mille six cents fois par heure, la Seconde
    Chuchote Souviens-toi !– Rapide, avec sa voix
    D'insecte, Maintenant dit Je suis Autrefois,
    Et j'ai pompé ta vie avec ma trompe immonde !

    Remember ! Souviens-toi ! prodigue ! Esto memor !
    (Mon gosier de métal parle toutes les langues )
    Les minutes, mortel folâtre, sont des gangues
    Qu'il ne faut pas lâcher sans en extraire l'or !
Transcription:

> Trois mille six cents fois par heure, la seconde chuchote « Souviens toi », rapide, avec sa voix d''insecte, maintenant dit « Je suis autrefois », et j''ai pompé ta vie avec ma trompe immonde. « Remember, souviens toi, prodigue, est au mémoire, mon gosier de métal, parle toutes les langues, les minutes, mortelles folâtres, sont des gangs qu''il ne faut pas lâcher sans en extraire l''or. »

Not bad! Far from perfect but it's a difficult text. Interesting that it works better with Baudelaire than Pascal.


Tried again with Blaise Pascal -- the famous fragment of a letter where he says he's sorry he didn't have enough time to make it shorter.

Original:

> Mes révérends pères, mes lettres n’avaient pas accoutumé de se suivre de si près, ni d’être si étendues. Le peu de temps que j’ai eu a été cause de l’un et de l’autre. Je n’ai fait celle-ci plus longue que parce que je n’ai pas eu le loisir de la faire plus courte. La raison qui m’a obligé de me hâter vous est mieux connue qu’à moi. Vos réponses vous réussissaient mal. Vous avez bien fait de changer de méthode ; mais je ne sais si vous avez bien choisi, et si le monde ne dira pas que vous avez eu peur des bénédictins.

Transcription:

> Mes rêves errent pères, mais l'detre navais pas accoutumé de se suivre de si près ni d'detre si étendu. Le peu de temps que j'sais eu a été cause de l'de l'de l'de autre. J'sais n'detre plus longue que parce que j'sais pas eu le loisir de la faire plus courte. La raison qui m'sa obligée de me hâter vous est mieux connue qu'moi. Vos réponses vous réussissaient mal. Vous avez bien fait de changer de méthode, mais je ne sais pas si vous avez bien choisi et si le monde ne dira pas que vous avez eu peur des bénédictes.

Here there are many more mistakes, so many that the beginning of the text is unintelligible. The language from the 17th century is probably too different. Still on the "medium" model, as the large one crashes the Colab (not sure how to select a beefier machine.)

Still fascinating and exciting though.


Depends on the way you're pronouncing it, maybe. To be intelligible, IMO it must be read differently from a modern text, with well-sounding liaisons and all vowels very distinct: "un" sounds different from "in", "â" clearly differs from "a", "ai" and "è" from "é", and for instance the "e" in "étendues" must be pronounced, though not loudly.

My test gives that, much better than yours:

Mes *rêverants* pères, mes lettres n'avaient pas accoutumé de se suivre de si près ni d'être si étendues. Le peu de temps que j'ai eu a été cause de l'un et de l'autre. Je n'ai fait celle aussi plus longue que parce que je n'ai pas eu le loisir de *l'af*faire plus courte. La raison qui m'a obligé de me *ra*ter vous est mieux connue qu'à moi. Vos réponses vous réussiss*ez* mal. Vous avez bien fait de changer de méthode. Mais je ne sais si vous avez bien choisi et si le monde ne dira pas que vous avez eu peur des bénédict*eurs*.


Curious. As mentioned I did three tests, two which went pretty well and this one that went bad. I'm French and enunciated the three tests in the exact same way. It's possible there was a technical glitch in this one (that I erroneously attributed to the language of the 17th century)... Will have to try again.


I'm playing with a Colab posted in this thread (https://news.ycombinator.com/item?id=32931349), and it's incredibly fun and accurate!

I tried the beginning of L'étranger (because you seem to be a fan of Camus ;-)

Here's the original:

> Aujourd’hui, maman est morte. Ou peut-être hier, je ne sais pas. J’ai reçu un télégramme de l’asile : « Mère décédée. Enterrement demain. Sentiments distingués. » Cela ne veut rien dire. C’était peut-être hier.

> L’asile de vieillards est à Marengo, à quatre-vingts kilomètres d’Alger. Je prendrai l’autobus à deux heures et j’arriverai dans l’après-midi. Ainsi, je pourrai veiller et je rentrerai demain soir. J’ai demandé deux jours de congé à mon patron et il ne pouvait pas me les refuser avec une excuse pareille. Mais il n’avait pas l’air content. Je lui ai même dit : « Ce n’est pas de ma faute. » Il n’a pas répondu. J’ai pensé alors que je n’aurais pas dû lui dire cela. En somme, je n’avais pas à m’excuser. C’était plutôt à lui de me présenter ses condoléances.

Here's the transcription:

> Aujourdhui, maman est morte, peut être hier, je ne sais pas. J''ai reçu un télégramme de l''asile. Mère décédée, enterrement demain, sentiment distingué. Cela ne veut rien dire. C''était peut être hier.

> L''asile de Vieillard est à Maringot, à 80 km d''Alger. Je prendrai l''autobus à deux heures et j''arriverai dans l''après midi. Ainsi, je pourrai veiller et je rentrerai demain soir. J''ai demandé deux jours de congé à mon patron et il ne pouvait pas me les refuser avec une excuse pareille. Mais il n''avait pas l''air content. Je lui ai même dit, ce n''est pas de ma faute. Il n''a pas répondu. J''ai alors pensé que je n''aurais pas dû lui dire cela. En somme, je n''avais pas à m''excuser. C''était plutôt à lui de me présenter ses condoléances.

Except for the weird double quotes instead of the single apostrophe ('), it's close to perfect, and it only uses the "medium" model.

This is extremely exciting and fun! Happy to try other texts if you have something specific in mind!


More of this is welcome; they should live up to their name and original purpose and share other models (code, weights, datasets) with the open source community as well.


Can't wait to see twelve new $49.99/mo speech parser services pop up in the next few weeks.


Make hay before Google gives away free hay.

That said there is value in integration of this into other things.


This has been running on my laptop all day for a 15 min mp3! Definitely not cheap to run, then (can't imagine how much AWS compute would be required).


It seems far from good with mixed-language content, especially with English and Japanese together. The timestamps are far from perfect, and it's nowhere close to human for the more ambiguous translations that depend on the context of a word. It's far below what anyone who spoke either language would consider acceptable. Maybe it's unfair to use music, but music is the most realistic test of whether it's superior to the average human.


Some music is hard for even people to make out the lyrics to.


> Neat, https://github.com/openai/whisper - they have open-sourced it, even the model weights, so they are living up to their name in this instance.

Perhaps it will encourage people to add voice commands to their apps, which can be sent to GPT-3.


Is the training dataset and code open too?


[flagged]



This seems to be primarily based on the referenced Snopes article https://news.ycombinator.com/item?id=32929237


This seems to not be true for McDonald's: https://www.snopes.com/fact-check/mcdonalds-100-beef/


[flagged]


This isn't exactly a hard story to fact-check. There is zero evidence for this in either the Reddit thread or really anywhere. If they were willing to lie about the company name, why not just lie about the beef in their burgers? It would be equally scandalous.


The company name could be 100% legit; there is nothing stopping you from forming a company with that name and not even selling beef.


Something being possible to do isn't enough evidence for rational people to believe that it happened. From my perspective, it's possible that you're Iron Mike Tyson, or that you died after your last comment and this one was posted by the assassin who killed you.


What? I never said it's evidence that it did happen; please don't make things up. I just pointed out that the evidence provided to refute the claim is possibly invalid.


You haven't offered any evidence is the point.


Because I'm not trying to prove that it happened or not, but rather to draw a parallel between that and OpenAI's name. For all I care it could be an urban legend, but who cares, that's not the point.


You are right, it could be. The problem is that it's the kind of thing that would be almost impossible to disprove if it were false. So you can always raise doubts about a supposed disproof.

But it'd be really easy to prove if it were true, and no one has offered proof. And there've been plenty of people who've looked for such proof, afaict.

My default assumption in such cases is that it is likely false.


If this was more than an urban legend someone would be able to dig up a company with this name and some indication that McD was working with them.


It definitely happens.

There are at least two companies that have branded [..] Kosher Gelatin™. One of them makes gelatin that is considered non-kosher by all of the major kashrus agencies.

"Kosher Gelatin®", when in the ingredients, just means the product contains pork.


For what it's worth, I've spent a few minutes googling and can't find any story that corroborates this. The only US trademark I can find around "kosher gelatin" is by the brand Kolatin, which is apparently certified Kosher.


I believe that you believe this, but you got had. Pretty funny though.


In the US, for a while I remember we had billboards advertising McDonald's burgers as being "1 <hamburger> <hamburger>% beef". Because the hamburgers were of course circular, it looked kind of like "100%".

I remember thinking that surely an image of a hamburger does not legally constitute a zero.


If consumer laws are so easily circumvented then I have little respect for those making these laws.


It seems like OpenAI are finally living up to their name for once with this release? Anything I'm missing?

From what I can gather:

1. Includes model weights. I can't find the URL, but they reference them enough and have a CLI tool, so I presume I just haven't found them yet.

2. Includes code: https://github.com/openai/whisper

3. Released under MIT License: https://github.com/openai/whisper/blob/main/LICENSE


It's one model and in a non-strategic area where there are existing open source projects (Kaldi, DeepSpeech, ...).

For a company that raised $1B, that's not exactly living up to their name and original mission.


Yes. The same is true of many products from many companies.

I feel bad about GPT-3 and DALL-E being released under the terms they were, but I don't feel bad about this. I'm not going to condemn OpenAI for the good things they did, but I will hold them accountable for bad things or good ones they didn't do.

I'd given up on OpenAI being open or ethical, but this is a start. It took them down from "evil super-villain" status to mere villain.


> It's one model and in a non-strategic area where there are existing open source projects (Kaldi, DeepSpeech, ...).

I can already tell this is much better than any of the existing open source projects with the exception of the wav2* sequence of projects and potentially nvidia's nemo.


Kaldi is an open, pluggable framework and is a ton more flexible and powerful than this. It's used by hundreds of teams, including a number of consumer tech companies you've heard of. They're not going to move to this over it.

Especially because ASR is a living organism. You have to constantly update your language model as new people, ideas, and words move into the normal lexicon. As people start talking about "COVID", "metaverse", "king charles", or whatever new things that happen, these need to be added to your language model. You need these updates monthly at a minimum and OpenAI didn't release the raw data which means you can't retrain it even if you wanted to spend the time/resources to.

So, this is an interesting research project and helpful for small teams and side projects, but it's unlikely it makes any real impact on the industry.


Kaldi just is not fast or high quality enough compared to other modern alternatives like wav2letter. I appreciate that it is more flexible than this, it certainly is - but I am not so sure about "powerful."


Have you actually tried to use Kaldi though? I have. It's basically impenetrable unless your full time job is working with Kaldi.


This kind of model is harder to abuse, so I guess it passed their internal checks much more easily.

I can understand not releasing GPT-3, even if I disagree with the decision.


> This kind of model is harder to abuse, so I guess it passed their internal checks much more easily.

The version I choose to believe: stability.ai ate DALL-E for lunch, and that woke them up.


This is probably also true.


True. The potential of GPT-3 to cause internet mayhem was/is significant. I would argue that the mere act of announcing it was still a catalyst for an eventual GPT-3-like model being released. In revealing it, they established a target for what open source models could aim to achieve, and simultaneously got bad actors thinking about ways to abuse it.


It was a credible argument when GPT-3 was released. But now there are open models that are as capable as GPT-3 and that mayhem has not materialized, with the possible exception of GPT-4chan. They could release it now under a non-commercial license, if they cared to.


Can you provide an example of an open model as capable as GPT-3?

I know there's some "mini-GPT" type models around, but they don't seem nearly as capable.


My experience with GPT-3 is that while it does perform better than those mini-GPT small models, the gap does not compensate for the fact that the small models are free/unrestricted and you can use them as much as you like.

As mentioned elsewhere in the thread there are some large models around the 50-200B band that compete directly with GPT-3, but I haven’t used these.


> I can understand not releasing GPT-3, even if I disagree with the decision.

Why do you disagree?


Two reasons. First, someone else will release something similar. Second, I didn't see a related push from them to work with others in the industry to do something productive towards safety with the time they got by delaying availability of these kinds of models. So it felt disingenuous.


Several groups already have. Facebook's OPT-175B is available to basically anyone with a .edu address (models up to 66B are freely available) and Bloom-176B is 100% open:

https://github.com/facebookresearch/metaseq

https://huggingface.co/bigscience/bloom


Yup. I meant when it had just come out.


I don’t see how GPT-3 is any more dangerous than Stable Diffusion, Photoshop, that fake news website the crazy person you’re friends with on Facebook really likes, or any of the number of other tools and services that can be used to generate or spread fake information.


All of your examples are limited in some way, but GPT-3 wouldn't have any meaningful limits.

Stable Diffusion: Marks images as AI-generated. (invisible watermark, but still, it's there)

Photoshop: Requires time & effort from a human.

Fake news website: Requires time & effort from a human.


I wouldn't really say Stable Diffusion marks images as AI-generated. There's a script in the Stable Diffusion repository that will do that, but it's not connected to the model itself in a meaningful way. I use Stable Diffusion a lot and I've never touched this script.

https://github.com/CompVis/stable-diffusion/blob/69ae4b35e0a...


What "script" are you using for doing txt2img? The watermark function is automatically called when you use the CLI in two places, https://github.com/CompVis/stable-diffusion/blob/69ae4b35e0a... and https://github.com/CompVis/stable-diffusion/blob/69ae4b35e0a...

Trivial to remove, I give you that. But AFAIK, the original repository + most forks put the watermark automatically unless you've removed it on your own.


>Trivial to remove, I give you that. But AFAIK, the original repository + most forks put the watermark automatically unless you've removed it on your own.

Almost all of the 'low-vram' variant forks either have an argument to turn off the watermark (it saves a bit of memory) or come with it disabled altogether.


I linked to the same file you did, that is the "script" I was referring to. And I said that I didn't use it.

My point is that the Python API is more interesting than the txt2img script, and it doesn't add any watermarks.


SD only does that if you don't delete the line of code that does it...


It would be pretty trivial to have an invisible watermark in GPT-3 output -- though you don't really need one: just score text with GPT-3 to find out if it was likely GPT-3-generated or not.


Because why should the wealthy and connected be the only ones -allowed- to have access to such life-improving technology?



Large is 3GB to save everyone a click. Tiny is 72MB.


That's unexpectedly lightweight - enough to run on some phones.



This is an astonishing package. Every AI voice-to-text model I've tried on "The Wire's" famous "fuck" scene [0] usually fails, because the youtube clip's audio quality is bad and it's a scene with virtually no dialogue except breathing and "Fuck". But Whisper returned impressive results [1]

[0] https://www.youtube.com/watch?v=DS6pE88Xg3s

[1]

    $ yt-dlp --extract-audio --audio-format mp3 -o wire-fuck.mp3 https://www.youtube.com/watch?v=DS6pE88Xg3s

    $ whisper --language en wire-fuck.mp3
    [00:00.000 --> 00:02.000]  Oh
    [00:13.260 --> 00:15.260]  Fuck
    [00:15.260 --> 00:31.260]  Motherfucker
    [00:50.700 --> 00:52.700]  Fuck
    [00:52.700 --> 00:58.700]  Oh
    [00:58.700 --> 01:10.700]  Fuck
    [01:28.700 --> 01:55.900]  Fuck
    [02:02.340 --> 02:03.700]  Motherfuck.
    [02:10.220 --> 02:11.220]  Oh, fuck.
    [02:11.780 --> 02:12.780]  Oh, fuck.
    [02:25.900 --> 02:27.900]  Fuck, fuck, fuck, fuck, fuck, fuck.
    [02:27.900 --> 02:28.900]  Motherfucker.
    [02:32.900 --> 02:33.900]  Oh, fuck.
    [02:34.900 --> 02:35.900]  Fuck.
    [02:35.900 --> 02:36.900]  Oh, fuck.
    [02:36.900 --> 02:37.900]  Oh, fuck.
    [02:37.900 --> 02:38.900]  Oh, fuck.
    [02:48.900 --> 02:49.900]  Motherfucker.
    [02:53.900 --> 02:54.900]  Fucking A.
    [02:54.900 --> 02:56.900]  Mm hmm.
    [02:56.900 --> 03:12.900]  Fuck.
    [03:26.900 --> 03:28.900]  Motherfucker.
    [03:28.900 --> 03:32.900]  Fuck me.
    [03:58.900 --> 04:01.900]  Oh.
    [04:28.900 --> 04:34.900]  Fuck.


As interesting as it is funny. Great benchmark! Here's the rev.ai output for comparison:

  Speaker 0    00:00:12    Oh, fuck motherfucker. Okay. Fuck, fuck, fuck, fuck, fuck, fuck, fuck, fuck. 
 My little fuck.  
  Speaker 1    00:02:10    Oh, fuck. Oh, fuck,  
  Speaker 0    00:02:25    Fuck, fuck, fuck, fuck, fuck, fuck, fuck, fuck my motherfucker.  
  Speaker 1    00:02:53    Fucking a.  
  Speaker 0    00:02:54    Mm-hmm. <affirmative> motherfucker. Fuck me. Um,


I've been on HN since 2012 and this might be one of the best comments I've ever read


nsfw


Hey this looks great! I like to record audio notes while driving in my car after work, to kind of decompress my thoughts from the day. But I never go back and listen as they can be long and meandering. Sometimes in the audio log I will sum up my thoughts, but this might be 20 minutes in and hard to find. I really wish I had transcriptions so I could easily scan the full contents. I have tried Mozilla Deepspeech (I don't want a cloud solution) and I was surprised to find that I could not get Deepspeech to reliably transcribe them. There is a bit of road noise, though I think for a human listener they are easy to understand. It looks like this one might actually do the trick!

EDIT: Tried it and it worked great! It is very easy to use. I just did the pip install line in the readme and was ready to go. You literally just run the one pip install line, and then you run the program in the format "whisper my_audio.wav" and it goes. Really nice job OpenAI!
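In case anyone wants to do the same thing: a small sketch that batch-transcribes a folder of notes to sidecar .txt files you can grep later (the paths are just placeholders):

    import pathlib
    import whisper

    model = whisper.load_model("medium")

    for audio in sorted(pathlib.Path("voice-notes").glob("*.wav")):
        out = audio.with_suffix(".txt")
        if out.exists():
            continue  # already transcribed on a previous run
        result = model.transcribe(str(audio))
        out.write_text(result["text"].strip() + "\n")
        print(f"{audio.name} -> {out.name}")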


Google's recorder app for android will let you record audio files and make some transcriptions, right on the device.


Is that application actually doing on-device transcription? Under "Data safety" on the Google Play page it says "This app may share these data types with third parties: Audio" which doesn't exactly instill confidence that my audio will 100% always stay on my device. It also says "Data is encrypted in transit" but if data stays on the device, why it has to be "encrypted in transit"? There should be no transit at all.


Yes, it works completely offline, including transcription and recognition of music. There's an optional cloud sync feature, which I assume is the reason for the notice on Google Play.

(Work for Google, don't speak for them.)


Thanks. Who's the third party that might get access to the audio? The first party would be me, the second party would be Google, and then the third?


I think it's just Google for backup, or other apps via Android's standard sharing sheet. You can read the details here: https://support.google.com/pixelphone/answer/9516618?hl=en


I just tested it and it was pretty mediocre, at least with my accent. I can definitely benefit from a decent app for quick note recording: button press -> transcribe -> upload to gdrive, plus a good UI for later grepping.


Was this with the default base model, or the medium or large model? This can be specified with the --model flag.


I meant the 'Google's recorder app' from the parent comment and not Whisper.


Ah right, sorry got my comment threads mixed up! Someone else was asking about performance with accented English speakers in another comment.


Google's recorder app is NOT available for most phones, only Pixels and a couple of other selected handsets.


I'll probably explore using this, but I've used an app called Just Press Record to do what you say. Runs on Apple Watch too, so you can tap a complication at any time in the day, speak, and you get a transcript on your phone, etc.


I do this too! I have been doing it for about a year now, and haven't ever run into someone else that does this kind of audio-journaling. Would you be up for comparing notes sometime about how it is working out for you? I am finding that it is extremely effective form of self-care, but with lots of personal caveats. I would be so interested to hear your experience.


Oh cool! Yeah I have stopped doing it lately as I was not really using them (I would like to use them for making rough notes for future youtube video scripts), though in general it does seem like good self care too even if I don't review them. That said I just tried the base model on one of my voice logs and it was pretty good! Trying the medium model now and it seems basically perfect. So I will have to start doing these logs more!

Anyway I am pretty terrible with email but short exchanges can work for me, or maybe we can connect over signal. Send me a message to my email in my profile and I would be happy to sync up!


I do this too, and I’ve built some software for it just for myself.

I’d love to chat and hear about how you use this! My email is in my profile, or I’m @tekacs on Twitter (and everywhere). :)


Count me in!! I'm actually working on tools to turn these transcriptions into something more social.


Comparing this model's word error rates to the state of the art [1] on a few common test sets:

                           Whisper    SoTA
  LibriSpeech test-clean      2.7%     1.8%
  LibriSpeech test-other      5.6%     2.9%
  Switchboard                13.1%     4.9%
  CallHome                   15.8%     9.5%
The authors do explicitly state that they're trying to do a lot of fancy new stuff here, like be multilingual, rather than pursuing just accuracy.

[1] https://github.com/syhw/wer_are_we
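For anyone unfamiliar with the metric: WER is the word-level edit distance (substitutions + deletions + insertions) between the hypothesis and the reference transcript, divided by the number of reference words. A minimal sketch:

    def wer(reference: str, hypothesis: str) -> float:
        # Standard edit-distance DP over words.
        ref, hyp = reference.split(), hypothesis.split()
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1,        # insertion
                              d[i - 1][j - 1] + cost) # substitution or match
        return d[len(ref)][len(hyp)] / max(len(ref), 1)

    print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1/6 ~= 0.167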


I suspect Whisper is more robust than other "SOTA" models, but this release is likely leaving a fair bit of accuracy on the table considering the amount of resources OpenAI is capable of throwing at training it.

Comparing the readily available test sets from the paper to some of my personal robust models (for the Talon models, this is greedy decoding, no language model):

                       Talon  Talon  Talon  Whisper  wav2vec 2.0
                       28M    300M   1B     Large    960h
    librispeech clean   3.21   2.52   2.40   2.7      2.7
    librispeech other   8.21   6.56   5.63   5.6      6.2
    common voice       13.88  11.65   8.86   9.5     29.9
    tedlium             7.51   6.55   5.47   4.0     10.5
I have a battery of more difficult tests on hand (including adversarial tests, and diverse accent-specific metrics). I'll look at running these tests on each of the Whisper model sizes and following up with a larger comparison.


I'm looking forward to your comparison. It's really hard to make sense of how good this model actually is without being an expert in the area.



Talon was the first thing that came to my mind when I saw this news. Would be nice if it could benefit from Whisper. (Big fan of your work on Talon!)


It is interesting how they compare with wav2vec2 instead of nemo conformer (which is more accurate) in Table 2.


Indeed interesting.

On that note, a core Nvidia NeMo developer I follow posted this: https://twitter.com/HaseoX94/status/1572748653189791745

He calls it a "T5 for ASR" paper :) More insights in there, have a look! Curious to see what your blog would put up as well!


One of the things they point out is that the SoTA on e.g. LibriSpeech is only good at LibriSpeech, and doesn't generalise as well.

> Because Whisper was trained on a large and diverse dataset and was not fine-tuned to any specific one, it does not beat models that specialize in LibriSpeech performance, a famously competitive benchmark in speech recognition. However, when we measure Whisper’s zero-shot performance across many diverse datasets we find it is much more robust and makes 50% fewer errors than those models.


My own experience agrees: the generally available "SOTA" models are not especially robust, and can be _extremely_ bad (>50% absolute error rate) at some tasks. I'll post some preliminary numbers in a sibling comment and look into running my full set of tests on Whisper.

It looks like Whisper is probably leaving a lot of accuracy on the table, but initially it does seem to be a lot more robust than general "SOTA" models.

For a quick comparison, Silero's accuracy charts are kind of nice because they post results for a large variety of datasets. Scroll down to the EN V6 xlarge EE model (not the xlarge CE) [1]

[1] https://github.com/snakers4/silero-models/wiki/Quality-Bench...


Just tested this on some developer podcasts which usually fail hard given they're full of technical jargon, brand names, etc. Whisper is a revolution! It's picking up terms like Heroku, DigitalOcean, GitHub, ECS, AWS, etc. and capitalizing properly - something nothing else did unless you provided a whole pile of guiding vocabulary.


Did these podcasts have transcripts? You might be inadvertently evaluating it on data that it was trained on, which is basically cheating. Even if not, it might be trained on similar podcasts. Judging how good these kinds of models are is really hard.


No transcripts, no. And recent episodes, within the past couple of weeks, so probably not part of the training either.


True. The test should only be done on the material released after the model.


Hold on, it does not only speech recognition, but also language translation, in the same model?

What an interesting approach. What benefits does this have over having two dedicated models, one for speech-to-text, and another for translation?

It just seems so odd, given the problems of speech-to-text and Spanish-to-English seems so different from one another (in terms of the problem domain). Seems so unusual to have both handled by one model!

Does knowledge of speech-to-text carry over into knowledge of translation? Does knowledge of translation carry over into knowledge of speech-to-text? So weird.


It seems these days that language-oriented models are commonly becoming multilingual by default. There are a lot of common threads when understanding sentence construction between different languages. French and English have different rules but they will still have things like nouns, adjectives, subjects, prepositions, etc. It seems that by training models on many languages you get both a more robust understanding of language, and it saves you the trouble of having to make many more localized models for every language. I also believe that the other languages help the models construct sentences in languages which have very small training sets. If it has a few examples in a rare language as well as good translations to a better-known language, then it can provide good support for the rare language.

We also see in image generation models that multi-modal networks are more powerful than single purpose networks. As we move towards more advanced AI systems I suspect we will see more and more generalizable networks with distinct advantages over separate networks that get plugged together.


Would a multilingual model perhaps also be better at understanding non-native speech?


Good question but I don’t know the answer.


My understanding is that multi-modal models are the primary focus of OpenAI right now, due to their stated goal of achieving AGI. This product is probably better thought of as an offshoot of their work to create a fully generalizable model, rather than a specific attempt to provide translation/transcription services.


Judging from the chart in their GitHub README, Whisper performs much better at parsing Spanish audio than any other language, and that in particular blows my mind. I would have expected English to be at the top of any such model, it being such an IT lingua franca.

Now I wonder if it works equally well with Spanish from Spain (and its different regions) and Spanish from the New World (in its myriad different flavours).


It sounds useful to me because you can use tone information to help with the translation, which text-to-text translation can't do. But I'm not sure if that's how this model actually works.


I tried running it in realtime with live audio input (kind of).

If you want to give it a shot, you can find the python script in this repo: https://github.com/tobiashuttinger/openai-whisper-realtime

A bit more context on how it works: the system's default audio input is captured with Python, split into small chunks, and then fed to OpenAI's original transcription function. It tries (currently rather poorly) to detect word breaks and doesn't split the audio buffer in those cases. With how the model is designed, it doesn't make the most sense to do this, but I found it would be worth trying. It works acceptably well.
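For anyone curious, a stripped-down sketch of the same idea (not the repo's actual code): record fixed-size chunks from the default input with the sounddevice package and transcribe each one, with no word-break handling at all:

    import sounddevice as sd
    import whisper

    SAMPLE_RATE = 16000    # whisper expects 16 kHz audio
    CHUNK_SECONDS = 5

    model = whisper.load_model("base")

    while True:
        # Record one chunk, then transcribe it; words spanning a chunk
        # boundary get mangled, which is what the repo tries to avoid.
        audio = sd.rec(int(CHUNK_SECONDS * SAMPLE_RATE),
                       samplerate=SAMPLE_RATE, channels=1, dtype="float32")
        sd.wait()
        text = model.transcribe(audio.flatten(), fp16=False)["text"].strip()
        if text:
            print(text)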


Haven’t tried it yet but love the concept!

Have you thought of using VAD (voice activity detection) for breaks? Back in my day (a long time ago) the webrtc VAD stuff was considered decent:

https://github.com/wiseman/py-webrtcvad

Model isn’t optimized for this use but I like where you’re headed!


Interesting. I'll take a look at this, thanks!



[flagged]


impressive


Japanese results looks pretty impressive!

Took マッコウクジラ14頭が海岸に打ち上げられる オーストラリア(2022年9月21日) https://www.youtube.com/watch?v=bZkNIzeRBk4

Extracted audio with youtube-dl -f bestaudio https://www.youtube.com/watch\?v\=bZkNIzeRBk4

Converted into:

    [00:00.000 --> 00:13.000] オーストラリア南部の島で、真っ向くじら14棟が海岸に打ち上げられて死んでいるのが見つかり、専門家が調査のため原地入りしました。
    [00:13.000 --> 00:25.000] 原地メディアによりますと、オーストラリア南部のキング棟で、19日、少なくとも14棟の真っ向くじらが海岸に打ち上げられて死んでいるのが見つかりました。
    [00:25.000 --> 00:31.000] ほとんどが若いオーストを見られ、専門家が現場に重むき調査に当たっています。
    [00:31.000 --> 00:41.000] くじらの死害は大きく運んだり埋めたりすることが難しいため、自然に分解されるのを待つ方針が検討されています。
    [00:41.000 --> 00:52.000] また、死害を狙い、サメが海に集まる可能性があるとして、原地東局はサーファーなどに周囲に近づかないように呼びかけています。
    [00:52.000 --> 01:02.000] 一方、21日にはタスマニア棟でおよそ230棟のくじらが浜辺に打ち上げられた状態で見つかりました。
    [01:02.000 --> 01:07.000] およそ半数がまだ生きている模様で急助活動が進められています。
    [01:07.000 --> 01:23.000] 見つかったのは、ゴンドーくじらの仲間と見られています。


Shocked at how good the results are, and how easy of an installation it is.

Here are the exact steps to follow to get it running on Ubuntu 22.04 via WSL and yt-dlp:

  1. pip install git+https://github.com/openai/whisper.git

  2. yt-dlp -f 'ba' -x --audio-format mp3 https://www.youtube.com/watch/?v\=bZkNIzeRBk4

  3. renamed the file to test.mp3

  4. whisper test.mp3 --language Japanese --task translate --model large
Note: the large model will download a ~3 GB file


I did something similar (my ytdl is ytdlp too). You don't even have to grab just the audio, it'll take a webm: https://i.imgur.com/03UFGc8.gif

Amazing work.


Because ffmpeg is used under the hood (it's pulled in via the requirements):

https://github.com/openai/whisper/blob/main/requirements.txt

it should process most formats.


"--model large" option produces much better results at higher resources consuming costs


Did you try translating them to english? I want to see if you get a similar error as me with a random phrase "Translated by Releska" showing up.


It's called hallucination. Because the model is trained on weakly supervised, web-scraped data, such errors occasionally happen: the model picks up that phrases like that occur in translations and inserts them even if they do not appear in the source. This is described in the paper.


I came across it during a silent/instrumental portion of the song I was testing. I asked only because I am curious how frequently the error might show up; I don't expect it to be very common. It's looking at phrase-level instead of word-level timestamps, which is going to make it hard to tokenize music. I asked simply because the parent comment also tested on Japanese.


This really makes me want to build a Amazon Echo/Google Nest/etc replacement that's open hardware, open source and most importantly recognises voice completely offline. I find that I don't use these smart devices for much more than setting timers anyway so this seems like an easy project.

I just wonder what system requirements Whisper has and whether there are open source voice recognition models that are specifically built for embedded devices.


I really want all this too. The smallest model is ~80 MB and the largest is ~3 GB. Not sure about system requirements yet; but models that small suggest this may be doable locally on a single-board computer.

Edit: According to this comment[0] the base model runs in real time on an M1 CPU. The tiny model apparently decodes an audio file twice as fast. These are promising results.

[0] https://news.ycombinator.com/item?id=32927360#32929739


For an offline (non-streaming) model, 1x realtime is actually kind of bad, because you need to wait for the audio to be available before you can start processing it. So if you wait 10 seconds for someone to finish speaking, you won't have the result until 10 seconds after that.

You could use really small chunk sizes and process them in a streaming fashion, but that would impact accuracy, as you're significantly limiting available context.


I'd be interested to see how well it performs on something like an RPi. M1 is pretty beefy.


To be more precise, the original comment said "M1 Max", which is itself significantly beefier than a bare "M1".


This is only one side of the coin, you still need really good models for Speech Synthesis and then be able to have it all working in almost real time, ideally locally on device.


As far as TTS goes, Mycroft.ai[0] has released a decent offline one.

[0]https://mycroft.ai/


I'm pretty sure mycroft sends your speech snippets to Google for processing so it's not exactly offline.

https://mycroft-ai.gitbook.io/docs/using-mycroft-ai/customiz...

I'm currently trying to setup a deepspeech server on my raspberry pi to see if it works ok for commanding spotify.

Edit: just realised you said `TTS` not `STT`


pico2wave with "-l=en-GB" option to get the British lady voice is pretty decent (way better than the other voices it does for some reason).


Are you thinking about reimplementing Mycroft?

The Mycroft team has done a lot of cool and important work in the field to ship an actual personal assistant product (stuff like wake-word detection).


hah, of course someone had the idea already and executed on it. But yeah, basically that but without the screen (probably would go a long way to decrease the cost, $299 is pretty steep for such a device)


One thing they don't touch much on is the STT, as they use models from third parties. You could definitely do something that utilizes this model and then feeds the tokens to some of their parsing code. I've been working on something similar to this, but burned out around adding the STT portion [0].

[0]: https://github.com/Sheepybloke2-0/trashbot - It was called trashbot because the final implementation was going to look like oscar the grouch in a trashcan displaying the reminders.


Well, you can always install Mycroft on a Pi, or on your computer.

Almond is also interesting as a voice assistant, though I think it doesn't perform speech recognition itself.


Super impressive. I tested it on a Japanese streamer whose enunciation isn't exactly perfect and it did a decent job: https://www.youtube.com/watch?v=ROiOU1scaNA

  [00:00.000 --> 00:06.500]  Since the last one started, the number of times I've eaten has decreased.
  [00:06.500 --> 00:11.000]  If I get too carried away with the last one, I'll get hungry and do it.
  [00:11.000 --> 00:14.500]  I don't have time to eat.
  [00:15.500 --> 00:18.000]  I'm going to eat now.
  [00:20.000 --> 00:23.000]  It's going to take about 10 minutes from here.
  [00:23.000 --> 00:31.000]  It's been a while since I've had my last meal.
  [00:31.000 --> 00:36.000]  I feel like I'm losing my女子力.
  [00:36.000 --> 00:39.000]  I have to go back to my original self.
  [00:39.000 --> 00:44.000]  I have to get ready and go to bed.
  [00:44.000 --> 00:46.000]  It's not good.
  [00:46.000 --> 00:51.000]  I've been drinking a lot lately, so I'm going home.
  [00:51.000 --> 00:53.000]  I have to get my nails done this fall.
  [00:53.000 --> 00:54.000]  Halloween nails.
  [00:54.000 --> 00:57.000]  Halloween, Halloween, Halloween.
  [00:57.000 --> 00:59.000]  I'm going to the beauty salon today.
  [00:59.000 --> 01:02.000]  I'm going to get my nails done the day after tomorrow.
  [01:02.000 --> 01:10.000]  I used to look at a lot of clothes, but I stopped looking at them.
  [01:10.000 --> 01:12.000]  I'm going crazy.
  [01:12.000 --> 01:22.000]  My stomach's stopped in the middle of summer.


It's struggling with Norwegian. Which I guess isn't shocking. The large model performs a fair bit better than the small, though neither is "good".

Though I assume the amount of Norwegian it has been exposed to is fairly limited, so in that light I'm actually impressed as well.

I tried it on a news segment from the radio[1], this is the large model output:

    [00:14.000 --> 00:17.200]  En skamløs krenking av FN pakten.
    [00:17.200 --> 00:24.000]  USAs president og verdensledere svarer på den russiske presidentens atomtrusler og krigsmobilisering.
    [00:25.500 --> 00:29.400]  Arbeidsklær som er ment til å være til begge kjønn, har det med å være tilpasset.
    [00:29.400 --> 00:33.400]  Men hvordan ville det gått, om det var motsatt?
    [00:34.100 --> 00:38.900]  Dyrevernsorganisasjon vil ha digital merking av regnstyr,
    [00:38.900 --> 00:44.900]  men næringen selv insisterer på den gamle tradisjonsrike måten med rissing av kniv.
    [00:45.600 --> 00:51.400]  Mange strømselskaper er positive til å tilby kundene fastpris på strøm, og det årevis.
    [00:51.400 --> 00:59.900]  Da risikerer de å måtte betale mye i nettopp åretsvis, sier aktører som aldri tilbyr fastpris.
    [00:59.900 --> 01:21.900]  Dette er onsdagens Dagsnytten. Jeg heter Espen Ås.
For reference, here's what he actually said, from the source[1] itself:

    * En skamløs krenking av FN-pakten. USAs president og verdensledere svarer på den russiske presidentens atomtrusler og krigsmobilisering.
    * Arbeidsklær som er ment å være til begge kjønn, er som regel tilpasset ... menn. Hvordan hadde det gått om det var motsatt?
    * Dyrevernsoganisasjon vil ha digital merking av reinsdyr, men næringen selv insisterer på den gamle tradisjonsrike måten med rissing av kniv.
    * Mange strømselskaper er positive til å tilby kundene fastpris på strøm - og det i årevis.
    - Da risikerer de å måtte betale mye i nettopp; årevis, sier aktør som aldri tilbyr fastpris
    Dette er onsdagens Dagsnytt 18 - jeg heter Espen Aas.
The translation didn't fare that well though:

    [00:14.000 --> 00:17.000]  A shameless violation of the UN treaty.
    [00:17.000 --> 00:24.000]  The US president and world leaders respond to the Russian president's nuclear threats and war mobilization.
    [00:24.000 --> 00:33.000]  Work clothes that are meant to be for both genders have to be suitable, but how would it be if it was the other way around?
    [00:34.000 --> 00:44.000]  The animal welfare organization will have a digital marking of reindeer, but the industry itself insists on the old traditional way of tearing a knife.
    [00:45.000 --> 00:51.000]  Many electricity companies are positive in offering customers fixed electricity prices, and that is annual.
    [00:51.000 --> 00:58.000]  Then they risk having to pay a lot in just a year, says an actor who has never offered fixed prices.
    [00:58.000 --> 01:20.000]  This is Wednesday's Dagsnytt 18. My name is Espen Ås.
For reference, here's Google Translate's attempt, which is pretty good:

    * A shameless violation of the UN Charter. The US president and world leaders respond to the Russian president's nuclear threats and war mobilization.
    * Work clothes intended for both sexes are usually adapted to ... men. How would it have gone if it had been the other way around?
    * Animal welfare organizations want digital marking of reindeer, but the industry itself insists on the old, traditional way of marking with a knife.
    * Many electricity companies are positive about offering customers a fixed price for electricity - and for years.
    - Then they risk having to pay a lot in precisely; for years, says a player who never offers a fixed price
    This is Wednesday's Dagsnytt 18 - my name is Espen Aas.

[1]: https://radio.nrk.no/podkast/dagsnytt_atten/l_5ce3e323-97a3-... (not sure if it's available outside of Norway)


Re-reading the transcription, I guess I was a bit harsh by saying it's not "good". It gets most of it right, but it keeps messing up some key words. Like "regnstyr" (not a word) rather than "reinsdyr" (reindeer), or "Dagsnytten" rather than "Dagsnytt 18".

It also didn't handle the hanging "... menn", instead thinking it was the start of the following sentence. Almost everyone would understand it was the end of the sentence based on the context.

The double-A vs Å is not an issue as it's the same letter, double-A is the older form.

The small model was considerably worse than the large one though.


I am impressed; some of the words are not that common, such as atomtrusler, krigsmobilisering, strømselskaper and dyrevernsorganisasjon, yet it got them right.


Everything (and everyone, including myself :D) seems to struggle with Norwegian; the corpus size is probably simply too small. And/or maybe the market is.

Deepl didn't do any Norwegian last I looked, even though it does most other Germanic languages (including Danish and Swedish).

Duolingo doesn't have a Norwegian class for Germans either, though they do have one with English as the source language.


How are you getting the transcription of the NRK episode? I am learning Norwegian and often struggle to find reliable transcriptions for audio where the text exactly matches the audio (often subtitles are heavily edited compared to what's actually being said)


The stuff I quoted was listed as an abstract of sorts for the episode. I know NRK is very good at providing subtitles for their TV productions, but as you say they're abbreviated.

I'm guessing maybe audio books along with the actual books would be the best source for such? I mean there's Mozilla Voice, but it's quite limited in the Norwegian department and perhaps not quite as interesting as an audio book would be.


How long until this gets implemented in Twitch? Real-time subtitles for any stream in the language of your choice?! That would be huge.


Translation is not the strongest part. Transcription looks very good.


We shouldn't call this open source. The model definition + the data is the source code. The model weights are a compilation artifact.

> The source code must be the preferred form in which a programmer would modify the program. [...] Intermediate forms such as the output of a preprocessor or translator are not allowed.

> https://opensource.org/osd

If I asked a programmer from OpenAI to modify the model to better support Japanese speakers from Hokkaido, their "preferred form" of the model's source code would include the 680,000 hours of audio used to train the model.

Yes that means that there are almost no open source models and yes it's awesome that they released this and made the weights available. Just don't call it open source.


The Debian deep learning team's machine learning policy would call this a "toxic candy" model:

https://salsa.debian.org/deeplearning-team/ml-policy

BTW, wouldn't you take the existing model and do additional Hokkaido Japanese speaker training on top of it, rather than retraining the model from scratch?


Yes. It's just like calling the release of compiled, closed binary blobs 'open source' even when the source for reproducing the compiled output is unavailable.

> If I asked a programmer from OpenAI to modify the model to better support Japanese speakers from Hokkaido, their "preferred form" of the model's source code would include the 680,000 hours of audio used to train the model.

Precisely. The 'users' picking up the model can't do that themselves. You will still be contacting OpenAI for support or to add support for another language, and they will be the only ones able to modify the model.

> Just don't call it open source.

That is true; it is still closed source, and already we are seeing the hype squad apologising for OpenAI because they 'open sourced' a closed model that you can't modify yourself.

OpenAI is still business as usual and nothing has changed.


>You will still be contacting OpenAI for support or to add support for another language and they will be the ones able to modify the model.

This isn't quite correct. The model weights are all you need to fine-tune the model on your own audio.

Without the original training set this still isn't open source, but you aren't powerless to modify the model.


This isn't really true.

You can do a lot with weights and no training data - for example you can pull the end layer off it and use it as a feature extractor.

And to adapt it for Japanese speakers you'd fine-tune the existing model on additional data. If you wanted to modify the architecture itself, you can (sometimes, depending on what you want to do) remove layers, add replacements and fine-tune.

I don't quite know what the right analogy for trained weights is. In many ways they are more valuable than the training data, because the compute needed to generate them is significant. In other ways it is nice to be able to inspect the data.

> The source code must be the preferred form in which a programmer would modify the program.

As a machine learning programmer I'd much rather have the weights than the raw data. It's not realistic for me to use that training data in any meaningful way with the compute I have access to.


Like every model I've seen there is something like this:

>>A decoder is trained to predict the corresponding text...

Prediction of expected text in the context of the previous text.

While this is valuable in casual transcription, it can be extremely dangerous in serious contexts.

From personal experience, having given a deposition with an "AI" transcription, it will literally reverse the meanings of sentences.

This is because it produces the EXPECTED output in a context, and NOT THE ACTUAL OUTPUT.

Like a speaker that clips the output, these types of systems 'clip' the really valuable information out of a transcription. Worse yet, this is a completely silent failure, as the transcript LOOKS really good.

Basic info theory shows that there is more information contained in 'surprising' chunks of data than in expected ones. These systems actively work to substitute 'expected' speech to overwrite 'surprising' speech.

The transcript I got was utter trash: multiple pages of errata I had to submit, when the norm is a couple of lines. And as I said, some errors literally reversed the meaning in a consequential way, and yet completely silently.

This kind of silent active failure mode is terrifying. Unless it is solved, and I see no way to solve it without removing ALL predictive algos from the system, these types of systems must not be used in any situation of serious consequence, at least not without real redundancy and backup.


I've been saying this for years. Current "AI" algorithms are fundamentally flawed because they rely on a statistical approach. This works moderately well for some use cases, but it will rarely give you 100% confidence. Good luck with self-flying planes or self-running nuclear power plants.


>>Current "AI" algorithms are fundamentally flawed because they rely on a statistical approach.

YES! The old joke about "Artificial Stupidity" is actually more true than anyone realized.

These statistical so-called-AI systems actually work to actively REMOVE or sanitize out any unexpected information, making it all conform with the EXPECTED results from the training set.

This not only REMOVES the most high-information 'surprising' or unexpected nuggets, it actively HIDES them. When something unexpected comes up, it gets force fit into the expected prediction algorithms and output as if it were good.

I'm not saying that there are no useful things that can be done with this technology — there is a LOT of mundane work out there to be done.

But, we will never get this type of "AI" saying "Huh, that's odd, I wonder why that is?", which is exactly the kind of observation that leads a prepared and fertile mind to great discoveries.


Do you have a demo audio clip for this? I'd be interested to see how it looks in practice.


Sorry, I don't have anything available.

One item I remember was that I said "Dr Kemeny" in relation to Dartmouth College (he was a famous mathematician, invented the BASIC programming language and was president of the college). It replaced those instances with "Jack Kennedy".

In another instance, I said that "Evidently, you have a reading comprehension problem.". It replaced it with "Evidently, I have a ...", completely reversing the meaning.

There were zero problems with the microphones or audio, and it was not rushed or mumbled talk. There were 80+ other examples over a few hours of talking, including some from other speakers. And those were just the obvious ones I could catch.

Another massive problem with this technology is that a human stenographer can notice when s/he missed something or didn't hear clearly and ask the speaker to repeat or clarify what was said, and will often, during a pause, request clarification on the spelling of names, addresses, etc. In contrast, this "AI" technology just barges ahead ASSuming that it knows what it is doing and inserts literally whatever sounds good into the transcript, completely silent about the fact that it doesn't have a clue.

Having seen this up close, I'm of the strong opinion that anyone foisting this software on the market without huge warnings that this is not usable for any critical functions is, basically a fraud. They know or certainly should know that these failures not only exist but are common and systemic, yet they barge along like it is OK. It is not.


Can this be used as a real-time transcription or is it too slow for that?

Curious what anyone is using these days for a real-time transcription. It doesn't have to be perfect, but just good enough.

My kids watch some YouTube videos where people make a mod that converts their speech to text, then looks for keywords and spawns a boss in Terraria if you say the wrong keyword, etc.

I made a clone of that with the .NET System.Speech.Recognition library. It... works.. but my biggest problem is that #1 it waits until you are done speaking to translate to text on the callback, so there was too much of a delay for it to be fun.. the point is that it will be checking a stream of chatter. #2 is the recognition is pretty crap, I mean it's nearly good enough for my silly purpose but it's still pretty bad.


I tried it out and it's way too slow on my machine, which is no slouch (Ryzen 9 5950X/RTX 3080).

It's doing seconds of translation per minute for me at least.


It might require too much work for what you are looking for, but the wav2letter library is the best real-time transcription OSS I have found by a considerable margin.


Out of interest, did you try Nemo? https://github.com/NVIDIA/NeMo


No. I don't think it had streaming capabilities when I was doing this test two years ago, although I see it does now.


If your family uses Apple devices, Apple offers free on-device speech recognition. Only caveat is that it needs to be restarted every minute due to whatever stupid limitation (or bug) they've introduced.

https://developer.apple.com/documentation/speech/recognizing...

Also, see `requiresOnDeviceRecognition`


The base model seems to run faster than real time on my machine. The “medium” model is larger and runs more slowly - roughly real time or maybe slightly slower.




Depends if you're trying to run it offline or over the cloud.


That example at the top of the page (speed talking) blew me away. He started talking, I was stunned for a minute, then realised yes, it really was English, and I just burst out laughing.

That's so, so far beyond the previous state-of-the-art, it's absurd.


It's a Micro Machines ad from the '80s. He talked like that in all of them!

As for speed, to a computer we don't talk very fast, not even that guy.

I wonder if it could handle Rap God by Eminem....Let's find out!


Did you find out :D?


I did! There are a few places it transcribes incorrectly, but overall I'm very impressed. Here's the first ~30 seconds:

    [00:00.000 --> 00:09.000]  Look, I was going to go easy on you, not to hurt your feelings, but I'm only going to get this one chance.
    [00:09.000 --> 00:11.000]  Something's wrong, I can feel it.
    [00:11.000 --> 00:17.000]  It's just a feeling I've got, like something's about to happen, but I don't know what.
    [00:17.000 --> 00:21.000]  If that means what I think it means, we're in trouble, big trouble.
    [00:21.000 --> 00:24.000]  Had to be as bananas as you say, I'm not taking any chances.
    [00:24.000 --> 00:26.000]  You're just one to die for.
    [00:26.000 --> 00:32.000]  I'm beginning to feel like a rap god, rap god. All my people from the front to the back nod, back nod.


It was doing it slowly, but hadn't got to the insane bit when I killed it to try to get it working with CUDA. Some digging later, it turns out I need a version of pytorch with CUDA enabled, so I had to go and install Anaconda, and now conda is stuck trying to "solve" my environment to install pytorch with CUDA.

So...probably?

Pre-post edit: I can't get it to work.

I've installed pytorch with cuda via pip3, installed the nVidia toolkit and it doesn't see it:

    >>> import torch
    >>> torch.cuda.is_available()
    False

I've wasted like an hour and a half on it now. I'm not a python dev, and don't have any ML experience so this was just for fun and now it's not anymore.


Welcome to every single Python ML project - dependency hell will quickly kill any enthusiasm one may have for trying out projects. It really feels archaic to have these issues with such cutting edge technology.


You can blame CUDA quite a bit for that. Proprietary, you need to sort out which driver you need, plus an nvidia GPU...

I tried compiling pytorch with vulkan support, but there are a few LDFLAGS that are wrong. I'll try to solve that some time later.

One piece of advice: use distribution packages! Arch provides pytorch-cuda, and has PKGBUILDS as well.

For reproducibility, I wish we were all on Nix/Guix, but that's not the case (and the CUDA+HW dependency would make it complicated).


CUDA is not the problem, the problem is crappy code being released on Github where basic things like requirements.txt are missing, never mind an earnest attempt to provide details about the environment that the code was running on. This is on top of code that has lots of hard-coded references to files and directories, plus also many python libraries just breaking compatibility with each other on point releases.

I can't find a source now, but I remember reading some code where the maintainer had to change a huge chunk of code because the point change for a dependency library literally flipped either how the library handled height/width or BGR channels (I can't remember which one but it was preposterous) from the 2.5.4 to the 2.5.5 version. There is no reason for doing that - it breaks everything just for grins and giggles.

Python itself is also a problem, but that's a rant for another day. Ah, how I wish Ruby had become the defacto language of choice for ML/Deep Learning!


Try running the pytorch/pytorch Docker image, but you will need the NVIDIA container runtime installed. I am sure somebody will soon release a Docker image for this as well.


Given how robust it seems to be with fast speech, I wonder if you could save cycles by speeding up the audio before feeding it in.


How is it that Apple, Google, and Microsoft are not further ahead of the game on speech recognition like this? They have the resources to hire the best ML researchers and throw tons of computing hours at it, yet Siri, Google, and Cortana continue to struggle to get anywhere near this level of comprehension.


Siri and Cortana have to run at least in real time, with reasonable compute resources. Probably faster than real time when the audio gets shipped off to the cloud and transcribed there. This model can't do that (in the "large" version, which the examples use).

Also, you are comparing Whisper's highlight reel with everyday performance of other models. Nobody shows their weaknesses in their highlight reel.


Someone else in this thread[0] said Whisper was running at 17x real time for them. So, even a weak machine might be able to do an acceptable approximation of real time with Whisper.

Also, I feel like shipping to the cloud and back has been shown to be just as fast as on device transcription in a lot of scenarios. Doing it on device is primarily a benefit for privacy and offline, not necessarily latency. (Although, increasingly powerful smartphone hardware is starting to give the latency edge to local processing.)

Siri's dictation has had such terrible accuracy for me (an American English speaker without a particularly strong regional accent) and everyone else I know for so many years that it is just a joke in my family. Google and Microsoft have much higher accuracy in their models. The bar is so low for Siri that I automatically wonder how much Whisper is beating Siri in accuracy... because I assume it has to be better than that.

I really wish there was an easy demo for Whisper that I could try out.

[0]: https://news.ycombinator.com/item?id=32928207


17x realtime on a 3090

I did some basic tests on CPU, the "small" Whisper model is in the ballpark of 0.5x realtime, which is probably not great for interactive use.

My models in Talon run closer to 100x realtime on CPU.


“CPU” isn’t necessarily the benchmark, though. Most smartphones going back years have ML inference accelerators built in, and both Intel and AMD are starting to build in instructions to accelerate inference. Apple’s M1 and M2 have the same inference accelerator hardware as their phones and tablets. The question is whether this model is a good fit for those inference accelerators, and how well it works there, or how well it works running on the integrated GPUs these devices all have.

Brute forcing the model with just traditional CPU instructions is fine, but… obviously going to be pretty slow.

I have no experience on the accuracy of Talon, but I’ve heard that most open source models are basically overfit to the test datasets… so their posted accuracy is often misleading. If Whisper is substantially better in the real world, that’s the important thing, but I have no idea if that’s the case.


Ok, my test harness is ready. My A40 box will be busy until later tonight, but on an NVIDIA A2 [1], this is the batchsize=1 throughput I'm seeing. Common Voice, default Whisper settings, card is staying at 97-100% utilization:

  tiny.en: ~18 sec/sec
  base.en: ~14 sec/sec
  small.en: ~6 sec/sec
  medium.en: ~2.2 sec/sec
  large: ~1.0 sec/sec (fairly wide variance when ramping up as this is slow to process individual clips)
[1] https://www.nvidia.com/en-us/data-center/products/a2/


Isn’t the A2 much weaker than a 3090? So those results are promising.

EDIT: for what it's worth, Nvidia rated the A2 at 18 TFLOPS of FP16, and Apple rates the current A16 Neural Engine at 17 TFLOPS of FP16. I'm sure it's not an "apples to apples" comparison.


If you count the GPU component and memory bandwidth, the Apple M2 is slightly weaker on paper for 16-bit inference than the NVIDIA A2, if you manage to use the whole chip efficiently. The A16 is then slightly weaker than the M2.

Sure, the Whisper Tiny model is probably going to be fast enough, but from my preliminary results I'm not sure it will be any better than other models that are much much faster at this power class.

Whisper Large looks pretty cool, but it seems much harder to run in any meaningful realtime fashion. It's likely pretty useful for batch transcription though.

Even if you hit a realtime factor of 1x, the model can leverage up to 30 seconds of future audio context. So at 1x, if you speak for 10 seconds, you'll potentially need to wait another 10 seconds to use the result. This kind of latency is generally unsatisfying.


EDIT: After writing and posting the original version of this comment, I did an experiment where I dictated it to Siri, and then saved that audio (which was recorded simultaneously), which I then fed to both Whisper's tiny.en and medium.en... Siri did terrible for me. Whisper tiny.en was 100% accurate, as far as I can tell, and the only thing Whisper medium.en did was add a few commas that tiny.en had missed. I actually ended up playing the audio file for Siri as well, and that did not end well either. YMMV, but even the tiny model seems very useful. tiny.en took 17.5 seconds to process the ~1 minute audio file, and medium.en took 351 seconds, but I think there is a lot of room for performance optimization on this M2 MBA. The model evaluation was purely using the CPU, not GPU or neural engine, and it wasn't even using all of the CPU cores for whatever reason.

----

With Siri dictation, I feel like I usually spend at least as much time correcting its mistakes as I do speaking the dictation itself. In some cases, that is still faster/easier than typing, but I would rather have a voice model that can work in about the same total amount of time without requiring constant corrections. If I speak for 30 seconds, then I can do other things for 30 seconds while my phone processes it… that might actually be preferable if it gets it right. Otherwise, I’ll be spending 30 seconds actively editing it anyways. Even an improvement on the number of edits required per dictation would be nice. Admittedly, I feel like Google and Microsoft already do a much better job here.

It could be interesting to use the tiny model to give a preview of the writing while the large model is taking its time, and then allow the user to tap on words that changed to see the predictions from the tiny model and correct back to them if they want. I was doing some experiments a few minutes ago, and on one audio clip, the tiny model wrote down a very literal interpretation of an uncommon sci-fi word, and that was more accurate than either the medium or the large models. The rest of the time, the larger models did better, as expected.

But, I don’t know. This is interesting to me, but I agree there could be issues with making it workable for real-time transcription.


See https://news.ycombinator.com/item?id=32929029 re accuracy, I'm working on a wider comparison. My models are generally more robust than open-source models such as Vosk and Silero, but I'm definitely interested in how my stuff compares to Whisper on difficult held-out data.

> Brute forcing the model with just traditional CPU instructions is fine, but… obviously going to be pretty slow.

It's not that simple. Many of the mobile ML accelerators are more targeted for conv net image workloads, and current-gen Intel and Apple CPUs have dedicated hardware to accelerate matrix math (which helps quite a bit here, and these instructions were in use in my tests).

Also, not sure which model they were using at 17x realtime on the 3090. (If it's one of the smaller models, that bodes even worse for non-3090 performance.) The 3090 is one of the fastest ML inference chips in the world, so it doesn't necessarily set realistic expectations.

There are also plenty of optimizations that aren't applied to the code we're testing, but I think it's fairly safe to say the Large model is likely to be slow on anything but a desktop-gpu-class accelerator just due to the sheer parameter size.


> I really wish there was an easy demo for Whisper that I could try out.

Like the colab notebook linked on the official Whisper github project page?


Sure, but I did see one linked in another thread here on HN after posting that comment.


Good point about realtime or not, however with ML I have found the weaknesses get addressed pretty fast by someone. There is a big step between proof of concept and practical application though, so we shall see.


Siri until iOS 15 was done in the cloud, IIRC.


This AI has a 30 second delay on the audio processing because it needs to be able to "look into the future" to get these good results. That 30s delay would be unacceptable for Siri/Google/Cortana.


A lot of models we currently use seem to do the same thing. The model will transcribe a "best effort" interpretation in real time, then as you continue speaking, you'll see it go back and make corrections. I'm sure you can feed the first X seconds you have into the model, followed by (30-X) seconds of silence, and it will do real-time transcription just fine... it would be weird if this broke anything. Then, as you get more speech, you continue getting better transcription of the first 30 seconds, then you switch to a 30-second sliding window.

Maybe I'm missing something, but I don't see the problem here.
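
A minimal sketch of that growing-buffer idea, assuming 16 kHz float32 audio; whisper.pad_or_trim zero-pads anything shorter than 30 seconds, which plays the role of the trailing silence described above:

    import numpy as np
    import whisper

    model = whisper.load_model("base")
    SAMPLE_RATE = 16000
    WINDOW = 30 * SAMPLE_RATE              # the model's 30-second receptive window

    buffer = np.zeros(0, dtype=np.float32)

    def feed(new_audio):
        """Append newly captured audio and re-transcribe the most recent 30 seconds."""
        global buffer
        buffer = np.concatenate([buffer, new_audio])
        window = buffer[-WINDOW:]              # becomes a sliding window past 30 s
        window = whisper.pad_or_trim(window)   # zero-pads short audio up to 30 s
        return model.transcribe(window, fp16=False)["text"]

Each call redoes the work for the whole window, so this buys the go-back-and-correct behaviour rather than lower cost.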


Yes, that's because Whisper - like pretty much all of them - uses a Transformer encoder with Attention layers. And the Attention layers learn to look into the future.

And yes, what you describe could be done. But no, it won't reduce latency that much, because the model itself learns to delay the prediction w.r.t. the audio stream. That's why ASR-generated subtitles usually need to be re-aligned after the speech recognition step. And that's why there is research such as the FastEmit paper to prevent that, but then it is a trade-off between latency and quality again.

Also, running your "low-latency" model with 1s chunks means you now need to evaluate the AI 30x as often as if you'd be using 30s chunks.


You just said the models pretty much all work the same way, then you said doing what I described won't help. I'm confused. Apple and Google both offer real time, on device transcription these days, so something clearly works. And if you say the models already all do this, then running it 30x as often isn't a problem anyways, since again... people are used to that.

I doubt people run online transcription for long periods of time on their phone very often, so the battery impact is irrelevant, and the model is ideally running (mostly) on a low power, high performance inference accelerator anyways, which is common to many SoCs these days.


I meant that most research that has been released in papers or code recently uses the same architecture. But all of those research papers use something different than Apple and Google.

As for running the AI 30x, on current hardware that'll make it slower than realtime. Plus all of those 1GB+ models won't fit into a phone anyway.


> Plus all of those 1GB+ models won't fit into a phone anyway.

I don't think that's a requirement here. I've been playing with Whisper tonight, and even the tiny model drastically outperformed Siri dictation for me in my testing. YMMV, of course.


In my unmeasured, empirical observation, Google has amazing speech recognition.


I tried feeding the four examples from this announcement into Google as dictation inputs and it just sits there blankly. On the JFK speech test file in the repo, Google understands perfectly. The samples in the announcement are clearly outside the capabilities of anything Google has launched publicly, but I don't know how that translates to overall utility in every day applications.


I agree they have the best compared to Apple, Amazon, Microsoft. However I don't think it is as good as what is being shown here by OpenAI.


My experience with the APIs is Google is excellent and Microsoft is slightly better. And the offline model I've been using that's nearly as good as both is facebook's wav2vec2-large-960h-lv60-self.

Don't believe what's on marketing pages; they rarely transfer to the real world. I'll have to make time to try it and see. In theory, given the task diversity and sheer number of hours, it should be a lot more robust, but I'll wait on evidence before believing any claims of SoTA.


Weird. I started working on an ASR SaaS in my spare time, and at least on the test podcasts, Google was the worst: https://www.sammaspeech.com/blogs/post/speech-recognition-ac...


OpenAI is owned by Microsoft FYI.


Is it? Googling suggests that Microsoft invested in OpenAI but doesn’t actually own it.


Oh, my bad; looks like they only bought an exclusive license to GPT-3.


Okay this is super impressive. I just downloaded Whisper and fed it a random flac file I had handy and it did a really good job. Also impressive that it works on my weak CPU:

A 3m07s flac took 5m to transcribe:

  $ whisper --device cpu 'BLACKPINK - BORN PINK/01 Pink Venom.flac'
  Detecting language using up to the first 30 seconds. Use `--language` to specify the language
  Detected language: korean
  [00:00.000 --> 00:10.000]  Blackpink
  [00:11.000 --> 00:14.000]  Kick in the door, wave in the coco
  [00:14.000 --> 00:16.000]  팝콘이는 친게 껴들 생각 말고
  [00:16.000 --> 00:19.000]  I talk to talk, run ways I walk walk
  [00:19.000 --> 00:21.000]  힘 감고 팝 팝 안 봐도 척
  [00:21.000 --> 00:24.000]  By one and two by two
  [00:24.000 --> 00:26.000]  내 손끝 두 하나에 타면 아지은 중
  [00:26.000 --> 00:30.000]  갓 자쇼 지금 화려해 T makes no sense
  [00:30.000 --> 00:32.000]  You couldn't get a dollar out of me
  [00:33.000 --> 00:38.000]  자 오늘 밤이야 눈톱을 품고
  [00:38.000 --> 00:41.000]  미혼을 뺏음 down
  [00:41.000 --> 00:43.000]  Look what you made us do
  [00:43.000 --> 00:47.000]  천천히 널 잠재울 파이어
  [00:48.000 --> 00:52.000]  잠이 날 만큼 아름다워
  [00:52.000 --> 00:53.000]  I bring the pain like
  [00:53.000 --> 00:57.000]  디스탑, 팽팽, 디스탑, 팽팽, 디스탑, 팽팽, 팽팽
  [00:57.000 --> 00:58.000]  Get em, get em, get em
  [00:58.000 --> 01:00.000]  Straight till you don't like
  [01:00.000 --> 01:01.000]  Whoa, whoa, whoa
  [01:01.000 --> 01:03.000]  Straight till you don't like
  [01:03.000 --> 01:04.000]  Ah, ah, ah
  [01:04.000 --> 01:05.000]  Taste that, pink venom
  [01:05.000 --> 01:06.000]  Taste that, pink venom
  [01:06.000 --> 01:08.000]  Taste that, pink venom
  [01:08.000 --> 01:09.000]  Get em, get em, get em
  [01:09.000 --> 01:11.000]  Straight till you don't like
  [01:11.000 --> 01:12.000]  Whoa, whoa, whoa
  [01:12.000 --> 01:13.000]  Straight till you don't like
  [01:13.000 --> 01:14.000]  Ah, ah, ah
  [01:14.000 --> 01:15.000]  Blackpink and Amo
  [01:15.000 --> 01:17.000]  Got it by the smack ram
  [01:17.000 --> 01:18.000]  But rest in peace
  [01:18.000 --> 01:19.000]  Please light up a candle
  [01:19.000 --> 01:20.000]  This the knife of a vando
  [01:20.000 --> 01:22.000]  Messed up and I'm still in saline
  …SNIP…


Looks like it defaults to the model called "small".

I just ran some benchmarks - M1 Max, pytorch, with a 1.29 second flac (looks like the matrix math was running on a single thread):

    tiny
    146.522ms detect_lang
    549.131ms decode_one
    0.057ms tokenizer

    base
    354.885ms detect_lang
    1046.679ms decode_one
    0.011ms tokenizer

    small
    803.892ms detect_lang
    3194.503ms decode_one
    0.017ms tokenizer

    medium
    2279.689ms detect_lang
    10128.255ms decode_one
    0.023ms tokenizer

    large
    3656.478ms detect_lang
    17249.024ms decode_one
    0.016ms tokenizer


For more benchmarks: on an RTX 2060 (6 GB), the "small" model for me is roughly 10x real-time and the tiny model is 30x real-time.


This is awesome. But I really want the other way.

To be able to give it text and hear the speech. A TTS (text to speech).

As a language learner, the ability to create my own sentences (based on existing ones I have, changing a word here or there) would be amazing.

How long till we have this I wonder. I know I could use a service to do this currently. But having something running locally, I'd prefer.

Hopefully someone in the OpenAI team reads this. :)


I suspect this is coming. I mean, we do have decent text-to-speech systems already, but in this vein of “we used neural networks and now it’s very, very good” you can imagine extending something like GPT-3: this speech-to-text system lets you speak to it for input, and the natural progression is for it to use text-to-speech to return the output, so you end up with a voice-oriented conversational system.

So I think TTS is a logical part of the system. I also think that there are peculiarities of voice interaction that aren’t captured in text training datasets, so they would need to do some fine tuning on actual voice conversation to make it feel natural.

All in due time I suppose.


A full NLP system would include speech recognition, TTS, a large language model, and a vector search engine. The LM should be multi modal, multi language and multi task, "multi-multi-model" for short haha. I'm wondering when we'll have this stack as default on all OSes. We want to be able to search, transcribe, generate speech, run NLP tasks on the language model and integrate with external APIs by intent detection.

On the search part there are lots of vector search companies - Weaviate, Deepset Haystack, Milvus, Pinecone, Vespa, Vald, GSI and Qdrant. But it has not become generally deployed on most systems, people are just finding out about the new search system. Large language models are still difficult to run locally. And all these models would require plenty of RAM and GPU. So the entry barrier is still high.


Ah very interesting thank you. I’m not familiar with research in to vector search, I’ll look that up.

But yeah you make a good point about LLMs being too large to run on a normal PC. I do somewhat suspect that we might see some rapid acceleration in the size of neural network processors as large models begin to offer more utility. I think for now they have limited appeal but we’re already seeing things like Tesla’s Dojo make large leaps in capability to rapidly process complex networks.

In five to ten years we may see built in accelerators come standard in most computers capable of running very complex models. Already Apple provides ever more powerful accelerators in their phones. You could imagine Adobe offering real time diffusion models as part of Photoshop, among other things.


Likewise, TTS is what I really want. My goal is to be able to create audio books from text. I've been using Amazon Polly and it's acceptable quality, but I would be ecstatic to be able to do it locally on my own hardware.


Check out NaturalReader. It has hundreds of amazing voices, a system for highlighting text as it is being read, works on books (pdf) and webpages, and is available on phones and in browsers on all platforms. So I could have the same voice on Mac, Linux and iPhone.


> About a third of Whisper’s audio dataset is non-English, and it is alternately given the task of transcribing in the original language or translating to English. We find this approach is particularly effective at learning speech to text translation and outperforms the supervised SOTA on CoVoST2 to English translation zero-shot.

That's intriguing. You can just set the model to transcribe everything into English, no matter which language the speaker is using, and it just works. Given that many people are much better at understanding English than at speaking it, this might make voice interfaces much more accessible without much work.
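
If you're using the Python API rather than the CLI, this appears to be just the task argument to transcribe; roughly (the file name is a placeholder):

    import whisper

    model = whisper.load_model("medium")

    # transcription in the original (auto-detected) language
    original = model.transcribe("clip.mp3")

    # same audio, but translated to English instead
    english = model.transcribe("clip.mp3", task="translate")

    print(original["language"], original["text"])
    print(english["text"])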


Naively, training the same model on multiple languages has interesting implications.

On one hand, it may capture something "deeper" about language.

On the other hand, it's likely to do great in general, but miss particularities of some language.

Understanding the coverage of the training model seems a perennial problem. Is there any (shorthand) way to compare language model training corpora?

Clearly if they use common subsets we have a literal comparison. I'm more interested in whether there's progress in characterizing corpora by speech styles, fluency, vocabulary sets, (noise) environment, emotionality, proposition types, etc.

(btw: 25 minutes for a 9-minute segment on a 12-thread x86. Lots of jargon spelled as it sounds. Sentences capitalized but no punctuation. Overall good.)


I just tested the model [1] using an RTX3090, trying to translate a french text I found here [2].

Some observations:

- The full translation of the 6:22 minute video takes about 22 seconds (17x real time)

- It recognizes the language by default (and did a good job to recognize it was french audio)

- MIT License [3]!

- The quality of the transcription is good, but not perfect.

- The quality of the translation (if you don't consider transcription errors as a translation error) is generally very good.

---

The transcription:

> Bonjour à tous, <error>j'suis</error> espère que vous allez bien, c'est ENTI. Et aujourd', <error>aujourd',</error> on se retrouve <error>un peu physique</error> pour parler de la termo dynamique. Vous ne vous inquiétez pas, ça va bien se passer. On va y aller ensemble, <error>être à par exemple,</error> je vous accompagne à travers une série de vidéos pour vous expliquer les principes de base en termo dynamique. Et bah, c'est parti, on va y aller tranquillement. L'idée, c'est vous puissiez comprendre la termo dynamique dans son ensemble. Donc, je vais vraiment prendre mon temps pour <error>couplisser</error> bien comprendre les notions,

The translation:

> Hello everyone, I hope you're doing well, it's NT and today we find ourselves a little physical to talk about the thermo dynamic. Don't worry, it's going well, we're going to go together and be the same. I'm going to accompany you through a series of videos to explain the basic principles in thermo dynamic. Well, let's go, <error>we're going to go quietly</error>. The idea is that you can understand the thermo dynamic <error>in sound together</error>. So I'm really going to take my time to understand the notions,

---

All in all very happy that OpenAI is publishing their models. If Stable Diffusion is any guide, people will hack some crazy things with this.

[1] https://github.com/openai/whisper [2] https://www.youtube.com/watch?v=OFLt-KL0K7Y [3] https://github.com/openai/whisper/blob/main/LICENSE


It also runs well on a CPU and seems to have proper memory management. Wonderful timing, because I was using DeepSpeech for some audio recordings and it required me to script up a splitter to convert the files into .wav and then cut snippets of 10 seconds each. Everything about this just works out of the box. On a Core i5 I'm getting about 30 seconds of audio processed every minute. Transcriptionist jobs just turned into editor jobs. I love how it drops the disfluencies in the audio as well, because it was trained on transcription work, and that is one of the first things you learn to do (drop the uhs and ums and huhs etc., unless it is a strict verbatim transcription).


> dans son ensemble

> in sound together

That's hilarious and honestly, incredibly bad. "Dans son ensemble" is a very common idiom (meaning "as a whole") while "in sound together" has to be pretty rare. "Son" means "his/hers/its" as well as "sound", and the former meaning is probably more common in general so I have no idea how this result could arise.

"Termo" also doesn't exist in French, it's "thermo", so the transcript even makes orthographic errors.

And I forgot about "couplisser" which is also a hilarious made-up word that sounds like it could mean something, but doesn't! Edit Google finds exactly one reference of this, in a patent with a typo on the word "coulisser".

I'm still impressed by the transcript quality since it covers many languages, but the translation part is quite poor.


Was this with the `base` model? `large` is running ok on a P100 in colab, but is about 4% the speed of `base.en`. Certainly seems like some of these models will be fast enough for real-time.


Is it translation or transcription? Or both?

Both, wow. This is really interesting.


Both, the blog covers it in detail. Pass in audio in any language, and get an English transcription out.


It can do both - I've edited my original post to show the translation task.


How did you get it to use the GPU?

I have it running right now and it's not touching the GPU.


--device "cuda"


My version of pytorch didn't have CUDA. I had to install conda to get it, and now it's currently installing.

Whatever version `pip install git+https://github.com/openai/whisper.git` grabbed didn't include it by default.


I installed Whisper (and, I thought all the needed dependencies), and had it running on my M1 Max MacBook Pro with 64 GB ram, but it ran TERRIBLY slowly... taking an hour to do a couple of minutes...

I found this thread and wondered if Whisper was accessing all the cores or the gpu, so I've spent a couple of hours trying to get whisper to access the gpu - following the points made in this thread, and googling how to install via brew the various components.

Long story short, I keep getting an error message

"RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU."

or when I set --device to gpu, it get the error: "RuntimeError: don't know how to restore data location of torch.storage._UntypedStorage (tagged with gpu)"

it's been a looong time since I wrote any code (remember basic?), so realise I may be missing a lot here!!

does anyone have any pointers?

thanks!

edit: I'm now trying it one more time after trying to set the cpu using this line:

map_location=torch.device('gpu')

and I get this message as whisper begins:

    ~/opt/anaconda3/lib/python3.9/site-packages/whisper/transcribe.py:78: UserWarning: FP16 is not supported on CPU; using FP32 instead
      warnings.warn("FP16 is not supported on CPU; using FP32 instead")

then I wait for whisper to do its magic... though it looks like it will remain very slow...


Really interesting, I can see a ton of potential uses.

2 questions:

1) How does it compare to state-of-the-art FOSS solutions? I'm thinking of DeepSpeech or Vosk.

2) Would it somehow be possible to associate timestamps with the words recognized? That would be amazing for things such as audio editing or skipping to a particular location in a video.


You rightly mentioned timestamps. There are many other important properties of a good ASR system, like vocabulary adaptability (whether you can introduce new words) or streaming. Or confidences. Or latency of the output. Compared to Vosk models, this model cannot work in a streaming manner, so it is not very suitable for real-time applications.

But in general the model is robust and accurate and trained on the amount of speech we never dreamed about in Vosk. We will certainly benefit from this model as a teacher (together with others like gigaspeech models). I recently wrote about it https://alphacephei.com/nsh/2022/06/14/voting.html


> goffi

For 2), it's actually mentioned in the description: "phrase-level timestamps", so it should be possible (phrase level is neat for skipping to a specific location in a video, but maybe not for audio editing).
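
Concretely, the result dictionary returned by transcribe already carries per-segment start/end times, roughly like this (the file name is a placeholder):

    import whisper

    model = whisper.load_model("base")
    result = model.transcribe("interview.mp3")

    for seg in result["segments"]:
        # each segment has phrase-level start/end times in seconds
        print(f"{seg['start']:8.2f} --> {seg['end']:8.2f}  {seg['text']}")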


Really incredible to see that their multilingual audio-to-English approach is viable. I'm super excited about this, and it's great to see OpenAI actually open up about something, for once.

Skimming the codebase I can't immediately see code to do additional training.

Being able to fine-tune the model to a specific language or case (eg. teach it specifically about some technical topic that might not be so prevalent in the current train set) would be majorly disruptive to current SOTA in "callcenter analytics" tech. Especially when combining Whisper with GPT3.


I knew there was a reason why I kept my MP3 library even after subscribing to Spotify. Now piping everything through whisper. So far the generated lyrics are reasonable, though it thinks the REM song says "Linnie Bruce is not afraid."

No surprise that it appears to have successfully transcribed all the recordings of Harvard Sentences I could find. https://en.wikipedia.org/wiki/Harvard_sentences


How can I use this (or something similar) for live translation? I don't mind if there's a 30s delay.

As in I don't want to input a file, I want to input the microphone sound.


Was wondering the same.

I really wish I would have been paying attention in Unix class...

Something like `microphone | chunk 3s | whisper | stdout` would be SO COOL!!! I think that's possible but too lazy to look more.


Would also like to know this. It looks like they're processing the audio file in 30 second chunks, so a naive approach of keeping a buffer of 30-second input stream chunks and just continually writing to an output .mp3 could work...


The model output can be tweaked to produce audio embeddings (akin to BERT for text embeddings and CLIP for image embeddings), which can lead to some interesting applications as the previous two examples have demonstrated.


What do you mean exactly by audio embeddings?


Represent a given set of audio inputs as a numeric vector, which can then for example be finetuned for other ML/AI problems or placed in an embeddings database for easy ANN search with similar audio clips. In the extreme case it could facilitate better AI audio generation similar to how CLIP can guide a VQGAN.

Although the 30 second minimum input is a bit of a bummer since it may not allow much granularity in the resulting embeddings.
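
Here's a rough sketch of what pulling embeddings out of the encoder could look like; load_audio, pad_or_trim, log_mel_spectrogram and the encoder are all part of the released code, but mean-pooling the frames into a single vector is my own simplification, not something the repo does:

    import torch
    import whisper

    model = whisper.load_model("base")

    def embed(path):
        audio = whisper.load_audio(path)
        audio = whisper.pad_or_trim(audio)               # the 30-second window mentioned above
        mel = whisper.log_mel_spectrogram(audio).to(model.device)
        with torch.no_grad():
            features = model.encoder(mel.unsqueeze(0))   # (1, frames, d_model)
        return features.mean(dim=1).squeeze(0)           # mean-pool to one vector

    a, b = embed("clip_a.wav"), embed("clip_b.wav")
    print(torch.cosine_similarity(a, b, dim=0))

Vectors like these are what you'd index in an ANN/embeddings database.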


I just threw a random rock MP3 at it, and a first readthrough shows no transcription errors; this is quite good.

Now I just want OCR that's even 50% as good as this...


Ran a few other songs through it and found one obvious mistranscription:

"He's the bedroom cosmic rocker" (should be "He's the veteran cosmic rocker" in Veteran Cosmic Rocker by The Moody Blues)

I also noticed that it's a little on the conservative side for detecting speech; all songs were missing at least part of one line.


Ran it on Juicy by The Notorious B.I.G. and the results were considerably worse than for the mix of prog-rock and British Invasion music I had tried before, though at least some of that is due to the number of proper nouns in that song.

It took about 1000 CPU-minutes for this 5 minute song on my Ryzen 2700 with 12 OpenMP threads (about 100 minutes wall-clock).


Here's the output of

    whisper never-gonna-give-you-up.mp3 --language English --model small

    [00:00.000 --> 00:27.000]  We're no strangers to love You know the rules and so do I
    [00:27.000 --> 00:35.000]  I feel commitments while I'm thinking of You wouldn't get this from any other guy
    [00:35.000 --> 00:43.000]  I just wanna tell you how I'm feeling Gotta make you understand
    [00:43.000 --> 00:47.000]  Never gonna give you up Never gonna let you down
    [00:47.000 --> 00:53.000]  Never gonna run around and desert you Never gonna make you cry
    [01:00.000 --> 01:09.000]  We've known each other for so long Your heart's been aching but you're too shy to say
    [01:09.000 --> 01:17.000]  Inside we both know what's been going on We know the game and we're gonna play it
It was running for quite a long time (20 minutes) on my admittedly low-budget specs.

Note that I did not omit 00:53.000 -> 01:00.000.

Shouldn't there be some type of unintelligible warning since it wasn't able to transcribe that part?


Model small is about as good at recognizing lyrics as an untrained Newton was at recognizing handwriting.

Here's a comparison of Basket Case by Greenday:

Small:

    [00:00.000 --> 00:05.000]  Do you have the time to listen to me whine
    [00:05.000 --> 00:10.000]  About nothing and everything I'll have once?
    [00:11.000 --> 00:16.000]  I am one of those melodramatic fools
    [00:16.000 --> 00:20.000]  Neurotic to the bone, no doubt about it
    [00:23.000 --> 00:27.000]  Sometimes I give myself the creeps
    [00:27.000 --> 00:32.000]  Sometimes my mind plays tricks on me
    [00:32.000 --> 00:38.000]  It all keeps headed up, I think I'm pregnant
    [00:38.000 --> 00:43.000]  And I'm just paranoid, I'm just stuck
    [00:47.000 --> 00:52.000]  I went to a shrink to have a life like my dreams
    [00:52.000 --> 00:57.000]  She says it's like a sex that's bringing me down
    [00:57.000 --> 01:03.000]  I went to a whore, he said my life's a bore
    [01:03.000 --> 01:08.000]  Choked with my widest buzz that's bringing her down
    [01:10.000 --> 01:14.000]  Sometimes I give myself the creeps
    [01:15.000 --> 01:19.000]  Sometimes my mind plays tricks on me
    [01:19.000 --> 01:25.000]  It all keeps headed up, I think I'm pregnant
    [01:25.000 --> 01:30.000]  And I'm just paranoid, I'm just stuck
    [01:30.000 --> 01:48.000]  Grasping to control, it's all I better hold on
    [02:08.000 --> 02:12.000]  Sometimes I give myself the creeps
    [02:13.000 --> 02:17.000]  Sometimes my mind plays tricks on me
    [02:18.000 --> 02:23.000]  It all keeps headed up, I think I'm pregnant
    [02:23.000 --> 02:30.000]  And I'm just paranoid, I'm just stuck
    [02:53.000 --> 03:13.000]  Thanks for watching!

Medium:

    [00:00.000 --> 00:05.000]  Do you have the time to listen to me whine
    [00:05.000 --> 00:10.000]  About nothing and everything all at once?
    [00:11.000 --> 00:16.000]  I am one of those melodramatic fools
    [00:16.000 --> 00:20.000]  Neurotic to the bone, no doubt about it
    [00:23.000 --> 00:27.000]  Sometimes I give myself the creeps
    [00:27.000 --> 00:32.000]  Sometimes my mind plays tricks on me
    [00:33.000 --> 00:36.000]  It all keeps adding up
    [00:36.000 --> 00:39.000]  I think I'm cracking up
    [00:39.000 --> 00:41.000]  Am I just paranoid?
    [00:41.000 --> 00:43.000]  Am I just sad?
    [00:47.000 --> 00:50.000]  I went to a shrink
    [00:50.000 --> 00:53.000]  To analyze my dreams
    [00:53.000 --> 00:58.000]  She says it's lack of sex that's bringing me down
    [00:58.000 --> 01:01.000]  I went to a whore
    [01:01.000 --> 01:04.000]  He said my life's a bore
    [01:04.000 --> 01:09.000]  So quit my whining cause it's bringing her down
    [01:10.000 --> 01:14.000]  Sometimes I give myself the creeps
    [01:16.000 --> 01:20.000]  Sometimes my mind plays tricks on me
    [01:20.000 --> 01:23.000]  It all keeps adding up
    [01:23.000 --> 01:26.000]  I think I'm cracking up
    [01:26.000 --> 01:28.000]  Am I just paranoid?
    [01:28.000 --> 01:30.000]  Am I just sad?
    [01:40.000 --> 01:44.000]  Grasping to control
    [01:44.000 --> 01:50.000]  So I better hold on
    [02:07.000 --> 02:11.000]  Sometimes I give myself the creeps
    [02:11.000 --> 02:16.000]  Sometimes my mind plays tricks on me
    [02:16.000 --> 02:19.000]  It all keeps adding up
    [02:19.000 --> 02:22.000]  I think I'm cracking up
    [02:22.000 --> 02:24.000]  Am I just paranoid?
    [02:24.000 --> 02:52.000]  Am I just sad?
    [02:54.000 --> 02:58.000]  Thanks for watching!


For what it's worth, even the large model balks on Easy (Aesop Rock), eg.

"Fountainheads spittle sniglets quicker than quidditch seekers snatch golden snitches."

becomes

"Stirred up out mids bittles, snicklets, cricket and quidditch seekers net golden snitches."

¯\_(ツ)_/¯


Large was not obviously better than medium when I tried it. My impression was that it tended to fit more to a language model than the sounds heard, which corrected some errors and introduced some others, but I didn't try a lot of songs because large won't run on my GPU.


Cool!

I am one of the top contributors to the tiny Mozilla Common Voice dataset for my language. The dataset is very small compared to those for other popular languages, and none of the other datasets mentioned contribute any data in that language to Whisper's training.

And even with so little data to train on it still works surprisingly well.


Where do they mention what datasets they've used? I've tried looking at the paper but can't find it.


Nevermind: I found it. It's on page 19 and 20 of the paper, under Appendix A ("Evaluation Datasets").


[zalgo redacted]


Hey - can you please not zalgo on HN? It messes up the threads. I've redacted it from your posts now.


Is there a list of system requirements somewhere ? Can it run on cheaper low memory GPUs ? maybe CPUs ?


Their models range from about 70 MB to 3 GB. The largest model is smaller than the optimised Stable Diffusion. Not sure what the inference speed is like; I haven't tried it myself yet.


I just tested it myself. Its fast enough on colab, couple of seconds but not sure if its fast enough to transcribe realtime audio yet.


"small" runs in realtime on Macbook Air M1 CPU.


Colab is using one of the larger models. Tiny probably runs in realtime on a single core of an RPi.


On my ancient desktop it happily fell back to running on CPU just fine.


Did respectably with some mumble rap: https://controlc.com/d353dafb

(some NSFW words in the lyrics obv)


Whisper performed a lot better than I would've expected it to!


For those on NixOS, here's a quick and dirty flake.nix that will let you make a venv in which to "pip install".

Just put it in a flake.nix, and "nix develop" followed by "virtualenv ./venv; . ./venv/bin/activate; pip install git+https://github.com/openai/whisper.git"

    {
      description = "Python 3.9 development environment";

      outputs = { self, nixpkgs }:
        let
          system = "x86_64-linux";
          pkgs = import nixpkgs { inherit system; };
        in {
          devShells.${system}.default = pkgs.mkShell {
            buildInputs = [
              pkgs.ffmpeg
              pkgs.python39
              pkgs.python39Packages.pip
              pkgs.python39Packages.numpy
              pkgs.python39Packages.pytorch
              pkgs.python39Packages.virtualenv
            ];
          };
        };
    }


This should, in theory, work with CUDA; my GPU doesn't have enough RAM to do it (it runs out at 2.9GiB allocated, I have 4GiB, but am running a compositing desktop, which chews up about 600MiB; not sure where the other ~400MiB went)

[edit]

I confirmed CUDA worked with the "small" model, which used 3.3GB of GPU ram, and resulted in much poorer recognition than the "medium" model on my CPU (but it ran at least two orders of magnitude faster).

    {
      description = "Python 3.9 development environment";
      outputs = { self, nixpkgs }:
      let
        system = "x86_64-linux";
        pkgs = import nixpkgs {
          inherit system;
          config.allowUnfree = true;
          config.cudaSupport = true;
        };
      in {
        devShells.${system}.default = pkgs.mkShell {
          buildInputs = with pkgs; [
            cudatoolkit linuxPackages.nvidia_x11
            cudaPackages.cudnn
            libGLU libGL
            xorg.libXi xorg.libXmu freeglut
            xorg.libXext xorg.libX11 xorg.libXv xorg.libXrandr zlib 
            ncurses5 stdenv.cc binutils
            ffmpeg
            python39
            python39Packages.pip
            python39Packages.numpy
            python39Packages.pytorch-bin
            python39Packages.virtualenv
          ];

          shellHook = ''
              export LD_LIBRARY_PATH="${pkgs.linuxPackages.nvidia_x11}/lib"
          '';          
        };
      };
    }


CUDA worked fine with large on my 2080Ti FWIW. The speedup is ridiculous, as expected. My Ryzen 3800X used almost an hour transcribing a minute worth of speech, while the 2080Ti does it in like 10-20 seconds.


How much GPU ram did it use?


I'm on Windows, using Task Manager, the dedicated GPU memory went from 1GB before run to about 9.8GB for the most time during run, peaking at 10.2GB. So pretty close to the 11GB limit of my 2080Ti it seems.


Does this work with multiple speakers?

I want to build a tool that takes a video and generates subtitles for it, then I want to index the subtitles and let people search for a specific quote to scrub to that part of the video using automatically generated urls.

This is for a specific fandom of a ton of content, lots of dirty audio mostly recorded in a gym setting with multiple people speaking.


pretty sure such a tool made HN front page a few months ago


I've never seen transcription and translation combined into a single step like this before...

Have I been living under a rock, or is this new?

I assume it should help performance, because it means emphasis, timing and tone can be used to inform the translation. Helps make better guesses about information missing from the source language.


I'm not in the Speech Recognition circles and am looking for open source speech recognition I can play around with - would this be the new state of the art?


For me as a deaf person the current state of art (in terms of speed & usability) is the Recorder app on a Google Pixel phone (4a/6 Pro is what I've used)


Most probably


Yes


Here's a live demo on Hugging Face Spaces if you want to try - https://huggingface.co/spaces/Amrrs/openai-whisper-live-tran...


I've tried speaking to that demo several times... I used the built in feature to record from microphone, and I played back the samples to make sure they were audible and clear.

Sometimes it outputs the words "thank you" (which I did not say), sometimes it outputs a period. It never once output anything I said. It seems completely broken.

EDIT: apparently something about the combination of Safari+HF+Whisper was not working. I tried another Whisper demo on HF and had the same results. Switching to Chrome made it work flawlessly... I have no idea what kind of codec incompatibility was happening.


this is amazing! got it working in French too


Given this, are there good (and available/open source) models for text to speech? Last time I tried everything still sounded extremely robotic, and/or were a pain to set up and run. It would be fun to set up a pipeline where the two processes 'communicate'.


Measuring performance in rounds of successful Chinese whisper

(irony)


The big question is why is Google's speech recognition in Gboard voice typing still so shit?

https://news.ycombinator.com/item?id=32862172

MIT licensed model seems way better


This is so cool! I was just speaking to a non-technical family member about privacy concerns around using "OK Google" and the like. They responded inquiring about "private" alternatives, to which my answer was "I'm not aware of good ones that give you that level of accuracy and convenience."

Perhaps this development along with continued optimization and device compute power increases will lead us into a near-future where things like Mycroft devices and cellphones could have local-only speech-to-text and translation capabilities which are accurate even with environmental background noise variations encountered IRL.

Great work OpenAI team!


Any opinions on what this means for speech-to-text companies like rev.ai and assembly.ai?

We've tested open source solutions for s2t, like kaldi, but the quality was not good enough. However, one of the main advantages of a service like assembly.ai to me was that they offer sentence splitting in form of punctuation and speaker detection, which Kaldi does not.

So I guess I answered my own question to some degree: an S2T service is more than just S2T. We already see assembly.ai add more and more features (like summarisation, PII redaction, etc.) that are a value-add on top of plain S2T.

Still, curious to hear what your take on that is.


You can apply the public punctuation model from Vosk on top of Kaldi output, and you can also get speaker labels with existing open source software.

On quick video transcription test this model is more accurate than AssemblyAI and Rev AI. It will be harder for them to sell pure ASR now. Some more business-oriented applications will still be important though, for example ASR as part of callcenter analytics solution or as a part of medical ERP system.

The value of automatic summarization is small, without AI it is very hard to make it right, you need to be an expert in the field to understand what is important.


> you can also get speaker labels with existing open source software.

Hello Nickolay :)

Diarization has always been the hard part for me, especially since it is very difficult to do comparisons within your domain. The evaluation metrics are not descriptive enough imo.

Would you say Titanet or EcapaTDNN are decent for use in production alongside, say, Whisper, or any other ASR output, if given the timestamps, so as to bypass running VAD? I'm just about to run experiments to try pyannote's diarization model and google's uis-rnn to test out how well they work, but it's a tad beyond my ability to evaluate.

I also wonder if Whisper architecture would be good for generating embeddings, but I feel it's focused so much on what is said rather than how it's said that it might not transfer over well to speaker tasks.


Rev AI will also create a transcription separated by multiple speakers, which it doesn't appear Whisper can do (yet). I expect that Whisper will overtake the alternatives soon, given that it's open source, but today it's not there yet.


This is awesome to see! Our team at Shipyard [1] has been creating a lot of solution videos on YouTube recently to show teams how they can build A -> B solutions in a few minutes. We've been meaning to provide captions or transcripts for the backlog, but the overhead was either pretty high or too expensive.

Tested this out in the span of a few hours and got a solution up and running to download the video from Youtube, spit out the transcription and upload the resulting transcription file externally. We're still missing a piece to upload directly to YouTube, but it's a start!
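
For anyone who wants to reproduce the download-and-transcribe part, a minimal sketch (assuming yt-dlp and ffmpeg are installed; the URL is a placeholder and the upload step is left out):

    import subprocess
    import whisper

    URL = "https://www.youtube.com/watch?v=XXXXXXXXXXX"   # placeholder video

    # grab just the audio track as an mp3
    subprocess.run(["yt-dlp", "-x", "--audio-format", "mp3",
                    "-o", "video_audio.%(ext)s", URL], check=True)

    model = whisper.load_model("base")
    result = model.transcribe("video_audio.mp3")
    with open("video_audio.txt", "w", encoding="utf-8") as f:
        f.write(result["text"])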

As a part of this experiment, we built out some templates that will allow anyone to play around with Whisper in our platform. If you're interested in seeing it, we built a video for doing the process with our templates [2], or directly with Python [3].

Hope someone finds this useful!

[1] https://www.shipyardapp.com [2] https://www.youtube.com/watch?v=XGr4v3aY1e8 [3] https://www.youtube.com/watch?v=xfJpGgyUkvM


I really wish I had this about half a year ago when I was building a tool to automatically turn online school lectures into searchable, clickable transcripts (kind of like YouTube or EdX transcripts).

I was originally using Adobe Premiere Pro's speech to text to do it, and wrote Python to convert its output to the Hyperaudio format on GitHub. With this, I can totally skip all of that step and this is fully open source, too.

App idea:

Build an app that takes a video and uses Hyperaudio or a similar project to add a clickable and searchable transcript (clicking in transcript seeks video)


You could already do the speech recognition in a fully open source way with vosk easily, although Whisper may be more accurate


You still interested in this? I'd be keen to chat to you, worked on a searchable transcript provider for educational youtube videos (likewise, unfortunately pre-whisper, so I did a lot of work with sentence completion perplexity and rpunct to try and improve transcript quality from youtube automatic transcriptions). Can be contacted at revision.ai and demo what we were able to do till now, would be great to hear your thoughts.


So I guess we can easily use this to generate subtitles?? Which would be nice! Cause ummm some of the movies that I download from the internet arrrrrr! don't have subtitles available
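
The result dict includes per-segment timestamps, so a hedged sketch for writing an .srt could look like this (file names are placeholders, and it assumes the segment keys stay "start"/"end"/"text"):

    import whisper

    def fmt_timestamp(seconds):
        # SRT wants HH:MM:SS,mmm
        ms = int(round(seconds * 1000))
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1_000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    model = whisper.load_model("small")
    result = model.transcribe("movie_audio.mp3")

    with open("movie.srt", "w", encoding="utf-8") as f:
        for i, seg in enumerate(result["segments"], start=1):
            f.write(f"{i}\n")
            f.write(f"{fmt_timestamp(seg['start'])} --> {fmt_timestamp(seg['end'])}\n")
            f.write(seg["text"].strip() + "\n\n")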


Dude, this is insane. This is so much better than other speech to text libraries I've tried.


I'm seeing some weird bugs. For example, in one 30 minute mp3, about 6 minutes in it decided that someone said "2200." And then exactly 5.000 seconds later, "2200". And every 5.000 seconds after that, for the next 24 minutes. (No one actually repeated "2200" for 24 minutes.)

A second run gave better results, but in most runs I do see instances where phrases repeat from 2-20 times.


A notebook is available to try with your microphone on Colab here: https://colab.research.google.com/drive/1nBZ-pDIaIi3N1DIIXvJ...

I'm surprised by the quality on non-English languages, given that 80+% of the training data is English, and the rest is split between tens of languages.


Thanks! I played with this in French and posted the results as replies to this comment: https://news.ycombinator.com/item?id=32928643

It's sometimes close to perfect, and sometimes goes off the rail; I think that maybe the model tries to establish some sort of consistency for each sentence; if starts wrong for the first few words of a sentence, it can't build the rest properly.

But it's super fun.


How do you get this to translate instead of just transcribe?


To be more specific than the above:

1. Make sure you're using a model that isn't suffixed with `.en` (`base`, not `base.en`).

2. Use `model.transcribe(your_input_audio, language='Japanese', task='translate')` ... with the appropriate input language.


Just specify language and record an audio in another language.

>result = model.transcribe("audio.wav", language="english")


That actually seems to set the language for it to transcribe (as opposed to it guessing), with the following triggering a translation to English:

result = model.transcribe("audio.wav", task="translate")

But your post helped me figure out the above, so thank you!


I ran it on this clip

https://clips.twitch.tv/ReliablePopularWerewolfOSkomodo-pcuw...

because... hard accent.

on the first run whisper thought it was Welsh, so I had to run with --language en, and it did pretty well

https://i.imgur.com/TQiYU9X.png

took 36 seconds in Google colab


Good to see them releasing model weights - hopefully now that Stable Diffusion is out they will release Dall-E 2 source and weights as well.


It understands my Swedish attempts at English really well with the medium.en model. (Although, it gives me a funny warning: `UserWarning: medium.en is an English-only model but receipted 'English'; using English instead.`. I guess it doesn't want to be told to use English when that's all it can do.)

However, it runs very slowly. It uses the CPU on my MacBook, presumably because it hasn't got an NVIDIA card.

Googling about that I found [plaidML](https://github.com/plaidml/plaidml) which is a project promising to run ML on many different gpu architectures. Does anyone know whether it is possible to plug them together somehow? I am not an ML researcher, and don't quite understand anything about the technical details of the domain, but I can understand and write python code in domains that I do understand, so I could do some glue work if required.


Hoping to see this out to use in open source voice assistants, eg. mycroft


Here [1] is a video tutorial on building a web UI that accepts microphone input and runs it through Whisper for speech transcription

[1] https://www.youtube.com/watch?v=ywIyc8l1K1Q&ab_channel=1litt...


Thank you for sharing!


I was comparing a batch of transcriptions between these models and vosk, and noticed that the medium.en model produces some weird results compared to the others. I've seen a number of loops with one word or a small sequence of words repeating several times. It seems more prone to output that reads like nonsense than the others.

More troubling is a short audio clip that got a few full sentences back, several times the text length that comes back from the other models or vosk. The content of the sentences is extremely far from the audio content. The best alignment I can find is the first word of medium.en's interpretation is somewhat phonetically similar to the audio.

The small.en model doesn't show these behaviors, at least in this data set.


The whole value of this model is in the 680,000 hours of training data, and to benefit from that you need the large model, not the smaller ones. The smaller versions just don't have enough capacity to represent the training data properly.


I get that. I'm saying the medium.en model specifically seems to have some weird edges to its behavior that is not present in the models up or down the scale from it, or similarly (the plain 'medium' model).

It's the only one that seems to be occasionally spitting out significant chunks of training data versus something that resembles the audio.


I'd love to find a way to test this with longer audio but I don't have GPU resources and not exactly sure how to load that into the Colab. Is anyone planning on hosting or sharing a model that can be used by others to test longer form audio (for podcast transcription)?


Their Scottish accent example is pretty good, I'd like to see it work on some very strong English accents like this one: https://www.youtube.com/watch?v=nJ7QB3om-QY


Those are Irish.


Are you sure? I just ran some of Kimmy's sketches through it and ... The results are garbage.


Detected language: english

[00:00.000 --> 00:05.400] Gordy and County Kerry are investigating the theft of up to 60 sheep on Mount Brandon.

[00:05.400 --> 00:10.400] One of the farmers is offering a reward for information leading to the return of the use,

[00:10.400 --> 00:12.200] which are worth thousands of euro.

[00:12.200 --> 00:14.200] Well, I'm fine with that.

[00:14.200 --> 00:15.200] That's right.

[00:15.200 --> 00:16.200] Do you own them?

[00:16.200 --> 00:17.200] Anyone can say it.

[00:17.200 --> 00:18.200] Fine with that.

[00:18.200 --> 00:22.720] Last Saturday, Mikey Joe O'Shea brought his flock of Scotch sheep down from the mountain

[00:22.720 --> 00:25.320] commonage ahead of lambing.

[00:25.320 --> 00:29.840] He discovered over 50 were missing, allowing for a number of deaths and

[00:29.840 --> 00:30.840] strays.

[00:30.840 --> 00:34.600] Mikey is convinced over 45 sheep have been stolen.

[00:34.600 --> 00:35.600] It was a good night.

[00:35.600 --> 00:36.600] It would be a full moon there.

[00:36.600 --> 00:37.600] It would be a good night.

[00:37.600 --> 00:38.600] It would be bright out.

[00:38.600 --> 00:40.600] There could be anyone going up in the mountains.

[00:40.600 --> 00:41.600] It would be a good night.

[00:41.600 --> 00:43.600] Well, that was 45 sheep missing.

[00:43.600 --> 00:49.600] Mikey and the lambs and everything in the sheep, they counted out a nice bit of money.

[00:49.600 --> 00:52.200] They've been doing the boat in Nassan.

[00:52.200 --> 00:53.200] It's a big one. [00:53.200 --> 00:54.200] It's a big one. [00:54.200 --> 00:55.200] It's a big one.

[00:55.200 --> 00:59.000] Mikey's next door neighbor says some of his sheep have also been stolen.

[00:59.000 --> 01:00.000] Come back. [01:00.000 --> 01:01.000] Come back. [01:01.000 --> 01:02.000] Come back.

[01:02.000 --> 01:03.000] I've been missing about 10 years.

[01:03.000 --> 01:04.000] It's not all that difficult.

[01:04.000 --> 01:06.320] All they've got to do is have a good dog.

[01:06.320 --> 01:10.560] Have a good dog and go at night, some moonshine night.

[01:10.560 --> 01:11.560] Just put the dog around him.

[01:11.560 --> 01:14.120] Put him on a trailer and walk him.

[01:14.120 --> 01:18.360] And then probably somebody else to pick him up.

[01:18.360 --> 01:29.960] Everybody's doing it north, but he's doing it.


Wow that is incredibly impressive. At 0:53 is it translating as well? Didn't sound like English to me.


Wow!


First off, it seems that the model can easily run on M1/M2 with minor modification. However, the `aten::_index_put_impl_` operator is currently not supported, and the fallback always slows things down quite a lot.

Second, is there a bug with how the script processes incoming audio segments? For a short 4 second clip, what I got was:

> [00:00.000 --> 00:03.760] Okay, Eunice, travel plans. I need to be in New York on Monday, L.A. on Tuesday, New York on Wednesday, L.A. on Thursday. You're knocking Friday. Got it?

> [00:03.760 --> 00:28.760] Got it.

However the final segment should have been shy of 1 second. It mistakenly thinks the last segment was 25 seconds long and makes you wait for processing.


The model only works on 30-second chunks, so shorter input needs padding (and the CLI does the padding for you).
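
Roughly what that looks like with the Python API, mirroring the README example (the clip name is a placeholder):

    import whisper

    model = whisper.load_model("base")
    audio = whisper.load_audio("short_clip.wav")
    audio = whisper.pad_or_trim(audio)   # pads a ~4 s clip with silence out to 30 s
    mel = whisper.log_mel_spectrogram(audio).to(model.device)

    # pass fp16=False here if you hit the 'Half' error on CPU mentioned elsewhere in the thread
    result = whisper.decode(model, mel, whisper.DecodingOptions())
    print(result.text)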


AI speech recognition FN scares the heck out of me...

for so many reasons.

But one that really pisses me off is not being able to turn it off on the iphone, and the fact that aside from "hidden cameras in my airBnB" -- soon we will have to worry about secret listening machines EVERYWHERE


"Secret listening machines everywhere" was a pretty big thing in East Germany. It's also the central theme of the movie The Lives of Others.

Of course, the ability to scale this more cheaply (throwing more compute at it, instead of more people) is somewhat scary, but it's not really introducing a new capability. Especially since you still have to do something with the transcript. An AirBnB landlord who reads the transcript of what you said could as well have listened to the recording.


I think it's a new capability to add good speech to text, search, and models that can understand and process text. You have microphones recording speech everywhere, models turning that speech into easily searchable text, and something like GPT-3 reading all the speech and raising red flags for any transgressive idea you please.


Yes, and if you want AI that is searching for “dissenters” we shall soon have “speech police” or tickets or some format of authoritarian punitive actions powered by this


"John Spartan, you have been fined one credit for violation of the Verbal Morality Statute."


I'd argue that cheap, pervasive, always-on surveillance with a backlog of searchable transcriptions is a qualitatively different capability.


Exactly.

We are entering the next era…

The Kurzweil podcast appearance on Lex Fridman is nuts and while I love kurzweil, holy crap even with my distopian outlook he makes it even worse when you listen to even half of it…


Exactly - imagine when we get to the point where, regardless of your "crime", your punishment is 'augmented' by the "thing that you said in the past" AND when it starts to be able to connect to APIs of your social/whatever accounts and AI-Auto-Cancel you....

Basically digital assassination.


Also, based on their demo, this model seems like it might have comprehension well above the level of a typical human.

Anyway, it's out there now. No way to turn back.


We will see an explosion of AI capabilities in the next couple of years. This will have a huge impact on our lives, much of it good but some of it also bad.


“Good” for ensuring you’re a compliant consumer - bad if you’re an individual person


This is dropping right in the middle of Interspeech 2022.

I don’t believe OpenAI has anyone presenting at the conference, so presumably this was timed to coincide with that and get buzz at the conference.

Curious how this model compares with foss STT from the startup Coqui.


@dang Can we change the link to the github here[1]?

It seems to describe the project better for a technical audience.

[1]: https://github.com/openai/whisper


I wonder how much the 30 second window is impacting performance?

Anecdotally, I feel like there are plenty of times that I need context from more than 30 seconds ago to understand some technical jargon that's being discussed.


Anyone know if it is possible to output IPA using this?

International Phonetic Alphabet (IPA)

- https://wikipedia.org/wiki/International_Phonetic_Alphabet

_________

EDIT: Based on list of languages in the tokenizer code here, doesn’t appear IPA is supported:

https://github.com/openai/whisper/blob/5f8d4bcc254d4f3e833d3...


Check out this notebook for an example on how to run Whisper as a txtai pipeline in Python or as an API service: https://colab.research.google.com/github/neuml/txtai/blob/ma...


I know this isn't a tech support forum but maybe someone here knows. I'm attempting the sample python code from the github and almost get a transcription running on my work laptop without a GPU, but I run into this error message:

>>> result = whisper.decode(model, mel, options)

Traceback (most recent call last):

[snip]

RuntimeError: "slow_conv2d_cpu" not implemented for 'Half'

It looks like a Torch error, is there some twiddling with "options" I can do to get it to run?


I seem to have worked around it by tweaking the "options" line from the sample code to this:

>>> options = whisper.DecodingOptions(fp16=False)


I am running on work laptop not using GPU. (I'm running in docker). I just get

    warnings.warn("FP16 is not supported on CPU; using FP32 instead")
And it works.


It seems the cmdline script is smart enough to switch over automatically, but invoking it from python just fails if the correct option isn't set


I just tried it in a few Korean YouTube videos and it's surprisingly accurate, to an extent where I would've thought it was done by a human.


Can you plug this into a computer on your premises to get speech recognition without amazon, apple or google's cloud (or any other cloud) involvement?

Right now I decline all speed recognition because I don't want orwellian listening devices in my house or pocket and haven't seen an answer. (Also haven't been too bothered about speech command interfaces to bother with a load of research - lazy me).


Btw, Apple's speech recognition can work completely offline, on-device. Not sure about Google or Microsoft, though.


Yes, after the download of the model weights (from https://openaipublic.azureedge.net/) it's an entirely offline process.


This is pretty incredible! https://i.imgur.com/03UFGc8.gif


I just wrote a script with Hazel to automatically transcribe my voice notes to txt. It handles punctuation extremely well. What a wonderful contribution!
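
The Python side of that can be as small as this sketch (paths are made up; Hazel, or any file watcher, just calls the script with the new voice note's path):

    import sys
    from pathlib import Path

    import whisper

    note = Path(sys.argv[1])                 # path handed over by the file watcher
    model = whisper.load_model("base")
    result = model.transcribe(str(note))
    note.with_suffix(".txt").write_text(result["text"], encoding="utf-8")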


Exactly what I was planning to do! Want to share yours with me?


Be wary of using this model - the licensing of this model seems sketchy. Several of the datasets used for training like WSJ and TED-LIUM have clear non-commercial clauses. I'm not a lawyer but releasing a model as "MIT" seems dubious, and hopefully OpenAI has paid for the appropriate licenses during training as they are no longer a research-only non profit.


This is a big dispute right now: OpenAI and other AI companies generally take the position that models learning from data does not make the output of the models a derivative work of that data. For example, GitHub Co-pilot uses all publicly available GitHub code regardless of license, and DALLE-2/StableDiffusion/etc use lots of non-free images. I don't think this has been challenged in court yet, and I'm very curious to see what happens when it is.


I think it might be even less problematic with something like Whisper than with DALLE/SD? Merely consuming data to train a system or create an index is not usually contrary to the law (otherwise Google wouldn't exist) – it's the publication of copyright content that's thorny (and is something you can begin to achieve with results from visual models that include Getty Photos logo, etc.)

I think it'd be a lot harder to make a case for an accurate audio to text transcription being seen to violate the copyright of any of the training material in the way a visual could.


They're not just training a system but publishing the trained system


> models learning from data does not make the output of the models a derivative work of that data

Most of the debate seems to be happening on the question of whether everything produced by models trained on copyrighted work represents a derivative work. I argue that at the very least some of it does; so the claim said to be made by the AI companies (see quote above) is clearly a false one.

We're in a weird place now where AI is able to generate "near verbatim" work in a lot of cases, but I don't see an obvious case for treating this any differently than a human reproducing IP with slight modifications. (I am not a lawyer.)

For example, copyright law currently prevents you from selling a T-shirt with the character Spider-Man on it. But plenty of AI models can give you excellent depictions of Spider-Man that you could put on a T-shirt and try to sell. It's quite silly to think that any judge is going to take you seriously when you argue that your model, which was trained on a dataset that included pictures of Spider-Man, and was then asked to output images using "Spider-Man" as a search term, has magically circumvented copyright law.

(I think there's a valid question about whether models represent "derivative work" in the GPL sense specifically, but I'm using the idea more generally here.)


That's right: the model is definitely capable of creating things that are clearly a derivative work of what they were trained on. But this still leaves a few questions:

* Does the model require a copyright license? Personally I think it's very likely a derivative work, but that doesn't necessarily mean you need a license. The standard way this works in the US is the four factors of fair use (https://copyright.columbia.edu/basics/fair-use.html) where Factor 1 is strongly in favor of the model being unrestricted while 2-4 are somewhat against (and in some cases 4 is strongly against).

* Is all output from the model a derivative work of all of the input? I think this is pretty likely no, but unclear.

* Does the model reliably only emit derivative works of specific inputs when the user is trying to get it to do that? Probably no, which makes using one of these models risky.

(Not a lawyer)


This is even slightly more direct: access to WSJ data requires paying LDC for the download, and the pricing varies depending on what institution / license you're from. The cost may be a drop in the bucket compared to compute, but I don't know that these licenses are transferable to the end product. We might be a couple court cases away from finding out but I wouldn't want to be inviting one of those cases :)


I think they didn't use WSJ for training, only for evaluation. Paper includes WSJ under "Evaluation datasets"


Are there any AI/ML models that don't use sketchy licensed datasets? Everything seems to be "downloaded from the internet, no license" or more explicitly proprietary. The only exception I can think of would be coqui/DeepSpeech?


I've been trying Whisper on my old setup (Mac Pro 2012 running Mojave, with Radeon RX 580), and it's a pretty amazing tool.

Unfortunately my system is not ideal for today's AI tools. Whisper runs only on the CPU, and it's slow.

I know PyTorch recently added Metal support, but only for M-based Macs. Has anyone found a way to make it work with Intel Macs?


Is it also a translation model? All the example transcripts are in English, regardless of the language of the purportedly transcribed audio.

The description makes it sound like it is a model for transcribing English audio.

> We’ve trained and are open-sourcing a neural net called Whisper that approaches human level robustness and accuracy on English speech recognition.


Are there any published benchmarks available outlining how this compares to other open source ASR software, such as Coqui.ai?


Maybe with this we'll finally get simple bilingual NLU so I can walk around Obi talking to my phone. "Siri, what's Hochlochziegel in English?"


Hold on to your papers


How well does it do for technical and domain oriented speech? For example I have audio recordings of a senior explaining some very technical aspects of our software. Will it understand the technical terms in that speech?

I guess I will need to download and run on it to see how correct it is.


It would be exceptional to get a healthy competitor to Microsoft/Nuance's Dragon monopoly on voice recognition in healthcare. At a couple thousand bucks a license, plus the more recent SaaS subscription trend, there is a lot of money to be made in that space.


This is absolute garbage python as I am neither a python developer, nor a good developer. I was trying to play around with real time transcriptions. However, it does work!

    * recording
    * done recording
    Recording saved to file.wav
    Press enter to transcribe
    /Users/laptop/Development/Personal/Public/pythonProject1/venv/lib/python3.9/site-packages/whisper/transcribe.py:70: UserWarning: FP16 is not supported on CPU; using FP32 instead
      warnings.warn("FP16 is not supported on CPU; using FP32 instead")
    Detected language: english
    Goodbye, I need to go pick up my wife.
    Press enter to start recording

Any improvements welcome here.

    # This is a sample Python script.

    import wave

    import pyaudio
    import whisper


    def record_microphone(seconds):
        CHUNK = 1024
        FORMAT = pyaudio.paInt16
        CHANNELS = 1
        RATE = 44100
        RECORD_SECONDS = seconds
        WAVE_OUTPUT_FILENAME = "file.wav"

        p = pyaudio.PyAudio()

        # open the default input device for 16-bit mono capture
        stream = p.open(format=FORMAT,
                        channels=CHANNELS,
                        rate=RATE,
                        input=True,
                        frames_per_buffer=CHUNK)

        print("* recording")

        frames = []

        for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
            data = stream.read(CHUNK)
            frames.append(data)

        print("* done recording")

        stream.stop_stream()
        stream.close()
        p.terminate()

        # write the captured frames out as a WAV file for Whisper to read
        wf = wave.open(WAVE_OUTPUT_FILENAME, 'wb')
        wf.setnchannels(CHANNELS)
        wf.setsampwidth(p.get_sample_size(FORMAT))
        wf.setframerate(RATE)
        wf.writeframes(b''.join(frames))
        wf.close()

        return WAVE_OUTPUT_FILENAME


    if __name__ == '__main__':
        seconds = 5
        model = whisper.load_model("base")  # load the model once, not on every loop
        while True:
            print("Press enter to start recording")
            input()
            filename = record_microphone(seconds)
            print("Recording saved to " + filename)
            print("Press enter to transcribe")
            input()
            result = model.transcribe(filename)
            print(result["text"])


Oh, it's a relief to have something open source in this field. I had been using Mozilla DeepSpeech for transcribing my voice notes, often with results ranging from hilarious to incomprehensible. DeepSpeech is dead, so I will be sure to check this out.


DeepSpeech got spun out of Mozilla to coqui.ai and they are continuing the open nature of the project.


I ran it on some fire department radio recordings from scanners on Broadcastify. It did remarkably well.

For reference, GCP's Speech-to-Text didn't detect any speech from this clip -- even when using the enhanced phone model.


This would be a cool thing to integrate into Dragonfly https://github.com/dictation-toolbox/dragonfly


It would. I wonder how this compares with Kaldi, one of the two open source speech recognition engines that Dragonfly currently supports.


Most of the comments here are about law enforcement. I would like to point out that it might be a boon for dictation software. This may make it easier to dictate text/code etc. in any environment.


It seems like Stable AIs release has led to some real disruption in the ML field regarding open source, and this doesn't seem to be limited to image generation. Excited to see what comes next.


I'm thinking of releasing a plugin for Unity that can be used to match a phrase to an action. Seeing Whisper is making me think I should include a way to use voice and not just text.


Is this practical to be used on the "edge" (for voice-control)? Would love to know if anyone has a rough idea roughly how fast/slow this would be on a M1 Mac or V100


I’ve been experimenting with voice-interfaces where typing is replaced by talking, but I find it hard to transition users to voice - we ‘seem’ to prefer typing to talking.

I wonder if this will change.


Personally, I would rather type than talk when interacting with a computer. The only time I use voice interfaces are when the physical interface is so poor it's just easier to use voice. Apple TV devices are an example of this.


Combine the translation + transcription with voice synthesis, and once compute power allows for this to be miniaturized we will be able to have babel-fish technology in real life.


Could someone tell me whether it's possible to somehow feed data into this project to improve its translation and transcription capabilities on our own?


Hmm are there any noteworthy open sourced speech to speech models? Like transform a spoken line to another voice, copying both the words spoken and the inflections?


My first take: it is slow.

The "base" model (supposedly 16x faster than the large one) takes more than the audiofile playback time on my machine to do transcriptions.


I'm seeing even worse. On my M1 Max 2021 macbook pro, I tried transcribing a 30 minute video file and left it on overnight and it was only half way through. I feel like something could be wrong with my setup but I'm only using the defaults.


Why not make a demo that you can try out via navigator.mediaDevices.getUserMedia? Of course you will get good results if you demo using the training set.


Oh man I remember LOVING Micro Machines as a kid.

But also, this tool seems much better than Otter.ai, which gets every third word wrong when transcribing microbiology recordings


Oh nice - I have an immediate use case for this. This looks accessible enough that the sci-fi dream of instantaneous audio translation is suddenly within reach.


It's actually better than Google Meet subtitle system.


I know a manual transcription company, which is still seeing modest growth from existing clients who also use ASR, so it's not quite there yet


As a casual observer I get the sense that OpenAI and others are very rapidly creating building blocks of something much bigger…


Pretty cool, and it seems to work on AMD GPUs as well. I've just tried it on my RX6800 with the ROCm build of PyTorch.


Quite a high error rate on a very clean-spoken Dutch audio, but way better than anything else I have tried.


I just tested it on a few of my YouTube videos in Korean and it's surprisingly good at transcription.


I recorded myself speaking French and was able to translate decently well on my laptop. Very impressive!


Anyone get it running on m1 mac?

I keep getting `ModuleNotFoundError: No module named 'setuptools.command.build'`


I'm still not successfully using the GPU, but it's working decently quickly (with the base model - it's incredibly slow to use the Large model) using just the CPU. I'm going to have to check what magic stable-diffusion is doing to enable the GPU :(


There's a --device flag you can pass. I've been trying to get `--device cuda` to work on my Windows machine and it's saying that torch wasn't compiled with CUDA. Trying to figure out what's going on there.

And on the M1, supposedly PyTorch has support for hardware acceleration using MPS (Metal Performance Shaders, announced here https://pytorch.org/blog/introducing-accelerated-pytorch-tra...) but when I tried `--device mps` it blew up with an error "input types 'tensor<1x1280x3000xf16>' and 'tensor<1xf32>' are not broadcast compatible".


> I've been trying to get `--device cuda` to work on my Windows machine and it's saying that torch wasn't compiled with CUDA.

I struggled with the same. Here's what worked for me:

Use pip to uninstall pytorch first, should be "pip uninstall torch" or similar.

Find the CUDA version you got installed[1]. Go to PyTorch get started page[2] and use their guide/wizard to generate the pip string, and run that. I had to change pip3 to pip FWIW, and with Cuda 11.6 installed I ended up with "pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116".

After that I could use --device cuda, and the difference was immense. On my 2080Ti it went from roughly an hour for a minute with large model, to 10-20 seconds.

[1]: https://stackoverflow.com/a/55717476

[2]: https://pytorch.org/get-started/locally/


Yep, same for me, on M1 after enabling MPS (with `model.to("mps")`) it just either SIGSEGV or SIGABRTs every time with that line. The extremely unclean nature of the abort is making it hard to debug :(


I noticed the size seems to correspond to the model. With a large model, the error is tensor<1x1280x3000xf16>. With tiny, it's tensor<1x384x3000xf16>, and with medium it's tensor<1x1024x3000xf16>. It also seems like a bad thing that those are f16's but the "expected" data is f32.


I'm giving up for the night, but https://github.com/Smaug123/whisper/pull/1/files at least contains the setup instructions that may help others get to this point. Got it working on the GPU, but it's… much much slower than the CPU? Presumably due to the 'aten::repeat_interleave.self_int' CPU fallback.

Also hitting a nice little PyTorch bug:

> File "/Users/patrick/Documents/GitHub/whisper/whisper/decoding.py", line 388, in apply logits[:, self.tokenizer.encode(" ") + [self.tokenizer.eot]] = -np.inf

> RuntimeError: dst_.nbytes() >= dst_byte_offset INTERNAL ASSERT FAILED at "/Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/mps/operations/Copy.mm":200, please report a bug to PyTorch.


I got it working inside a docker container on my M1 MBP. FWIW, I'm having my $180 tinyminimicro PC run a translation task while my M1 MBP runs a transcription task with the same audio input. So far, the PC is actually outputting results a lot faster than the MBP. Interesting results.


I got requirements installed, but then when running the Python example, I get:

RuntimeError: "slow_conv2d_cpu" not implemented for 'Half'


Probably need to pass some kind of options when initializing. The command itself works fine, just shows a warning: warnings.warn("FP16 is not supported on CPU; using FP32 instead")


using this in the sample code worked for me:

>>> options = whisper.DecodingOptions(fp16=False)


Yep, I had this too. `pip3 install -U pip setuptools` took care of it. (If you get an error about pip3, try `pip` instead)


I'm really new to pip, but does this look ok?

(after running the command for setuptools)

    Defaulting to user installation because normal site-packages is not writeable
    Requirement already satisfied: pip in /Users/xxx/Library/Python/3.9/lib/python/site-packages (22.2.2)
    Requirement already satisfied: setuptools in /Users/xxx/Library/Python/3.9/lib/python/site-packages (65.3.0)

---- after trying whisper installation:

    × Getting requirements to build wheel did not run successfully.
    │ exit code: 1
    ╰─> [20 lines of output]
        Traceback (most recent call last):
          File "/Users/xxx/Library/Python/3.9/lib/python/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 363, in <module>
            main()
          File "/Users/xxx/Library/Python/3.9/lib/python/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 345, in main
            json_out['return_val'] = hook(*hook_input['kwargs'])
          File "/Users/xxx/Library/Python/3.9/lib/python/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 130, in get_requires_for_build_wheel
            return hook(config_settings)
          File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/site-packages/setuptools/build_meta.py", line 154, in get_requires_for_build_wheel
            return self._get_build_requires(
          File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/site-packages/setuptools/build_meta.py", line 135, in _get_build_requires
            self.run_setup()
          File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/site-packages/setuptools/build_meta.py", line 150, in run_setup
            exec(compile(code, __file__, 'exec'), locals())
          File "setup.py", line 2, in <module>
            from setuptools_rust import Binding, RustExtension
          File "/private/var/folders/lj/7x6d3dxd3cbdtt484k6xsmyh0000gn/T/pip-build-env-ieaydl8r/overlay/lib/python3.9/site-packages/setuptools_rust/__init__.py", line 1, in <module>
            from .build import build_rust
          File "/private/var/folders/lj/7x6d3dxd3cbdtt484k6xsmyh0000gn/T/pip-build-env-ieaydl8r/overlay/lib/python3.9/site-packages/setuptools_rust/build.py", line 23, in <module>
            from setuptools.command.build import build as CommandBuild  # type: ignore[import]
        ModuleNotFoundError: No module named 'setuptools.command.build'
        [end of output]

    note: This error originates from a subprocess, and is likely not a problem with pip.
    error: subprocess-exited-with-error


Not quite sure if this is related, but since there's a bunch of statements in there referencing rust: I had to install the rust compiler on my Mac (`brew install rust` if you use homebrew). This is not mentioned in the installation instructions.


Nope, that doesn't look good! I honestly just googled the error and installing setuptools fixed it for me, but I barely know anything about the Python ecosystem so I'm really just fumbling around here.


haha same, thanks


I got super weird results with the 'medium' model and Japanese as the language (with --task translate). The song is False Sympathy by Mondo Grosso.

"[01:17.000 --> 01:32.000] Translated by Releska" when using the translate to english. That entire part of the song is instrumental. This line does not appear at all in the original transcribe only in the opus format rip.

It shows up in the yt rip in format 251 (opus), but not in format 140 (aac from youtube), nor the flac rip. All three are giving different results.

The translation quality seems tied to bitrate: the same song gets converted to different words, the only difference being bitrate and format. Converting my own rip with the same parameters as the YouTube version (opus @140 and then @130) didn't let me reproduce this error.

The model hung for a solid extra minute at the end when translating to english, the last 90ish seconds of the song took real time 60 seconds, while the entire rest took about 90. The same behavior was not observed with the transcribe.

Some of the english words are incorrect but that was expected. The first Japanese "mistake" I found was "全ては二人の" instead of "すべては ふたりの". With the left being what whisper wrote. A single random word "hey" was transcribed/translated to english even though it's the singer elongating the 園 while singing the 楽園. "落ちてゆく 二人で繋がれた二人のラグ HEY" instead of "落ちていく 鎖でつながれた 二人の楽園" .

I am using the official subtitles released on the youtube video.

It's a complex Japanese song with both japanese and english, and the original transcribe took about 20 real time seconds to start with the first line, 130 seconds for the whole song. It seems to be showing results in 20 second window increments, but this seems to depend on what it considers audio and what it is throwing away.

On my computer I wasn't able to use the large model because I ran out of VRAM, I have 8gb, not sure how much more it'd require. So I ran it with medium.

The MV is suggestive, in case that matters. I grabbed a fresh audio rip from YouTube because I didn't want to take the CD out of its case.

https://www.youtube.com/watch?v=B6Y-WsgpzlQ

It is translating this version differently from the director's cut version. I ripped both as opus.

There is something weird about how it is handling the opus encoded version, as I find the same "Translated by Releska" in a wav version transcoded from the opus.


Japanese output will contain a lot of tiny mistakes. However, the whole output is still good enough. Like 95%-plus good enough.

Found a lot of mistakes in 3-4 character kanji compounds ... and I guess most native Japanese speakers make mistakes from time to time too, which is why they pop up so many words on screen with all kinds of highlighting to avoid double guessing.


Where do you think this place services like Otter.ai, Descript, etc.?


Would be nice to give more details about the provenance and construction of the training data.


Hard to keep up with all the great things. The AI community is really moving quick right now.


Why build a separate model when you can integrate it right into GPT?


Is it feasible to use this for Talon-like voice-driven computer usage?


If the Whisper models provide any benefits over the existing Talon models, and if it's possible to achieve any kind of reasonable interactive performance, I will likely integrate Whisper models into Talon.

Talon's speech engine backend is modular, with Dragon, Vosk, the WebSpeech API, and Talon's own engine all used in different ways by users.


Maybe, a number of speech recognition engines have been integrated into https://github.com/dictation-toolbox/dragonfly


So it's 100% better than Siri's speech dictation, I see


Now someone just needs to pipe the output into stable diffusion.


Looking forward to see if this works well with foreign accents


They have an example in the post with a very thick Scottish accent. You should listen to it. It's pretty impressive.


This could be used to make some really cool RPG games!


That's all good and great, now please do OCR...


This could be really cool for Mycroft/Rhasspy etc.


Great project, not so great package name.


Great to see OpenAI finally being open :)


is there a high quality text to speech equivalent project like this?


Seriously, when I first landed on the page without reading anything else I thought it was text to speech with the “micro machine” example and I was floored. The speech to text is obviously mind blowing too.


Got my hopes up that there's finally an open source solution that can deal with the Georgian language, only to have them brutally destroyed. It successfully detects the language and then produces garbage. Passing the language manually produced similar results.

Result of my own recording:

  Detected language: georgian
   ᔨᴉᴉ�ちゃんᓁᔇ � remnants ᡔ� founding ហ�ockey� slee សᕁ �eling ភᕩ�icularly អᕖᕤ�APPLAUSEPS ថ�Dav頻道 ប�DING� Możai បፘ្ទក ុក ឵� orchestral ុក ឵� arter ូ� Brettំ � 
  hilarious ល ឬ ᔼ� vårក បក ្៙ � Poll statements ឭ᪨្pson. ჩჩრუესი჏მეისლემვეერრშუეაირელმირისასასსსესსერერსივეესრრილმეხრე რეიმიმეფემსესე�
Results of clear Georgian audio [1].

On tiny model:

  Detected language: georgian
  [00:00.000 --> 00:21.560]  én
  [00:21.560 --> 00:23.240] 我伦伦…
  [00:23.280 --> 00:43.720] 我伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦伦因为b forestry

On medium model:

  Detected language: georgian
   სრჱირესრრრრრრრრრრრრრნსსსრრრრრეე რრირრრრრრრრრე რსრნგნრრრრსრრრრრრრორრრრრრრრრრრ� ḵḸḇḤḾḤḾḤḾḤḾḤḾḤḾḤḾḤḾḾḤḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾ� ḥḾḼḥḾ 
  ḥḾḾ ḥḾḾ ḤḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾḾ� ḲḵḽḻḽḾ Ḫḵḽḻḽ so� ḻḽḽ ḻḽḻḻḽ ḱᴇ᷻ᵒ ḳᶟᄤḱ ḯᵁ Ḳᴄᴍᴆ Ḧᴍ� Ḧᵒ ḳᴍᴇ ḽᴄᴍᴛᴄ Ḧᴇᴆ ḳᵗᴇ ḽḮᴆ Ḫᴇᴾ ḿᴏᴇᴄᴄᴏ 
  ច�izar� wait �ห� examined ᑇទមះៈេំ supervision ង� იეეეეეეეეეეეეეეეეე მაეე ეაეეეეეეეეეეეეეეეეეეეე დაეეეეეეეეეეეეე უეეეეეეეეეეეეე ეა� ჆ მიი სმეიი მმიეი Ⴢქ სიიეი 
  სავიე სიიითთიიმემი, რაეე სიიმე სიიი ღიიიიწეირი საეიეიი სიიეი სი� ვეეფვეიიიე ქლეეშეეროეეეეეეეეეეეეე. ეგეზ ეყაკშეიეეეეეეეეეეეეეეეეეეეეეეეეეეეეეა, ნრროპიროო მმუმინ 
  სეეკნფეე სეეჍიგოშ სჟებიმელელეეკირპიე სემეიმე სეეიმმმ სეენემეეი სე� ᑦ� Famose m인데요 hqe bywall jaini threshold ji jani den poder vlogging bywall Take the text Ba 
  tou yodamj je te shake ba te shake baou contour but whatever Baou cube baou cup Baou rope Baou people Qeful Qeful იმიიიმიბთმითიიითიიიიიიიი 
  რაოეოოოენპეეეიეიიიიიიიიიომიიიიიიიიი რიიიიიიიიიიიმიი� ნსეეეეეეეეეეეეეეე სარეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეეე� მጇი჏ვ ეეეიდჼვვ ნაბდადებ 
  ლმირეეეეფედუივევეეეიიეეეეე რარეიეეეევეეეეევეე სარრეეეეეეეეეეეეეეეეეეეეეეეეეეე ხშიიიიიიიიიიიიი ლიიიიიიი ლიიიიიიიიიი ლიიი ლიიიიიიი ლაიიიიი ეიიიიიიიიიიიიიიი იიიი მ�

I've also tested it on few other audio inputs and it failed to produce meaningful results on all of them with all models.

There was one case with another audio clip [2] and the tiny model where it got at least some words close to their phonetic values, but printed them in Cyrillic instead of Georgian and tried to interpret some Georgian words as Russian:

  whisper audio.wav --language Georgian --task transcribe --model tiny
  [00:00.000 --> 00:02.000]  «Зураб Герча Джапарзис Ганц Хатеваром
  [00:02.000 --> 00:04.000]  умерен цупасу Хизгеблоту кащепаста
  [00:04.000 --> 00:06.000]  а опозационермии член шонахлари
  [00:06.000 --> 00:07.000]  с дрородисат Сакартолом
  [00:07.000 --> 00:09.000]  с акутаритеритория бюнда дай бронос
  [00:09.000 --> 00:10.000]  та тасовый торуси сам кадр
  [00:10.000 --> 00:12.000]  Сакартоломший ровно украйенисту
  [00:12.000 --> 00:13.000]  щойго екнебо
  [00:13.000 --> 00:14.000]  амсясахеб кирчи метитаусу
  [00:14.000 --> 00:15.000]  хлебислидерма
  [00:15.000 --> 00:17.000]  уцноктангадацема щейсяа уградунца
  ...



[1] https://www.youtube.com/watch?v=rE_zx_6RhL0 [2] https://www.youtube.com/watch?v=elrXgO8hjtI
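
For anyone who wants to poke at this themselves, the snippet below (adapted from the example in the Whisper README) prints the language-probability distribution before decoding, so you can see how confident the model actually is that a clip is Georgian ("ka"); the audio file name is a placeholder:

  import whisper

  model = whisper.load_model("medium")

  # Load the clip and pad/trim it to the 30-second window the model expects.
  audio = whisper.load_audio("audio.wav")
  audio = whisper.pad_or_trim(audio)

  # Log-Mel spectrogram on the same device as the model.
  mel = whisper.log_mel_spectrogram(audio).to(model.device)

  # Language identification: probs maps language codes to probabilities.
  _, probs = model.detect_language(mel)
  top5 = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:5]
  print("top languages:", top5)

  # Decode with the language forced to Georgian.
  options = whisper.DecodingOptions(language="ka", fp16=False)
  result = whisper.decode(model, mel, options)
  print(result.text)

Since the detection step already reports Georgian, the garbage presumably comes from the decoder's very limited Georgian training coverage rather than from language ID.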


I tried it out on a Hindi speech (https://www.youtube.com/watch?v=4EpfJxKyosE). The transcription starts off decent, but kind of gets stuck repeating the same thing at the 02:40 mark:

    [00:00.000 --> 00:10.000]  पचास ताल में हमने प्रगती किये, इससे को इंटार नहीं कर सकता।
    [00:10.000 --> 00:20.000]  छुनाओ के दौरान वोट मांगते हुए, सरकार की नीतियों पर कठोर से कठोर प्रहार करते हुए,
    [00:20.000 --> 00:28.000]  और पुरानी सरकार की नीतियों नहीं आलोचना करने के लिए लैक बहुत सामग्री थी।
    [00:28.000 --> 00:35.000]  हर जगे मैंने ये कहा कि मैं उन लोगों में से नहीं हूँ, जो पचास वर्च की उपलड्यों पर पानी फिर दे।
    [00:35.000 --> 00:43.000]  ऐसा करना देश के पुर्षार्थ पर पानी फिरना होगा। ऐसा करना देश के किसान के साथ अन्याय करना होगा।
    [00:43.000 --> 01:01.000]  मल्दूर के साथ जात्ती करनी होगा। आम आद्मी के साथ भी वो अच्छा व्योहार नहीं होगा। जो स्वाल आज मन में उच्छा है और उच्छना चाही है। आदावी को पचास साथ होने आये, हम जैनती मनाने जा रहे हैं।
    [01:01.000 --> 01:18.000]  आज देश की स्तिती क्या है। हम पिछर के होगे हैं। प्रगती की दोड़ में, जो देश हमारे साथ आजाद हुए थे, वो हम से आगे बढ़ के। जो देश हमारे बाच जन में थे, वो हमें पीचे छोड़ थे।
    [01:18.000 --> 01:34.000]  दुनिया के गरी तम देशों में हमारी गड़न आये। वीस फीज़ी से जाना लो गरीबी की रेका के नीचे। राक्तपती महुदाय के विभाशन में गाऊं का उल्लेक हैं ना पीरे का पानी नहीं।
    [01:34.000 --> 01:50.000]  हम प्राथमी शिक्षा अनिवारे नहीं कर सकते हैं। लड्कियों की शिक्षा की उपेक्षा हो रही हैं। लड्कि का जन्म लेना तो इस देश में अभी तक एक अभिशाप है।
    [01:50.000 --> 02:07.000]  क्या सरकारी कदम उठाकर समाज में जाग्दृती पैदा करकें। क्या सब लोगों को जुटाकर ये तो ऐसा काम है जिस में कोई दलबंदी के लिए इस्थान नहीं। हम देश का नक्षा नहीं बदल सकते हैं। देश में साधनों की कमी नहीं है।
    [02:07.000 --> 02:07.000]  और साधनों की अगर कमी है तो उसको ठीक दन्त से प्राप्त किया जा सकता है। साधन बड़ाए भी जा सकते है। लेकिन जो साधन हैं उनका ठीक उपयोग नहीं हो रहा। जंता के उपर टेक्स लगाकर जो दन्नि कप्ता किया जाता है। उसका लाग जंता तक नहीं पहु
    [02:37.000 --> 02:37.000]  रख्कम जाती है। विदेशी बैंको में दन जाने का सिल्सिला अभी तक क्यों काएं है। उसको लोकने के लिए क्या कदम उठाएगे। हम विदेशी पूजी के लिए प्रैत्रशील हैं विदेशी पूजी आए और अगर विदेशी पूजी आती है अच्छे दन्त की टेक
    [03:07.000 --> 03:07.000]  अच्छे दन्त की पूजी आती है अच्छे दन्त की पूजी आती है अच्छे दन्त की पूजी आती है अच्छे दन्त की पूजी आती है
    [03:37.000 --> 03:39.000]  अच्छे दन्त की पूजी आती है अच्छे दन्त की पूजी आती है
    [04:07.000 --> 04:09.000]  अच्छे दन्त की पूजी आती है अच्छे दन्त की पूजी आती है
    [04:37.000 --> 04:39.000]  अच्छे दन्त की पूजी आती है अच्छे दन्त की पूजी आती है
The translation does a much better job, however:

    [00:00.000 --> 00:10.000]  In the last 50 years, we have made progress, no one can deny this.
    [00:10.000 --> 00:20.000]  During the elections, while asking for votes, while attacking the government's policies harshly,
    [00:20.000 --> 00:28.000]  and to criticize the policies of the old government, a lot of material was needed.
    [00:28.000 --> 00:35.000]  Everywhere, I have said that I am not one of those people who pour water on the fruits of 50 years.
    [00:35.000 --> 00:39.000]  To do this, we will have to pour water on the efforts of the country.
    [00:39.000 --> 00:43.000]  To do this, we will have to do injustice with the farmers of the country.
    [00:43.000 --> 00:45.000]  We will have to do caste with the laborers.
    [00:45.000 --> 00:50.000]  Even with the common man, that will not be a good behavior.
    [00:50.000 --> 00:55.000]  The question that arises in the mind today and should arise,
    [00:55.000 --> 01:01.000]  Freedom has come to be 50 years, we are going to celebrate.
    [01:01.000 --> 01:04.000]  What is the situation of the country today?
    [01:04.000 --> 01:07.000]  Why did we get separated?
    [01:07.000 --> 01:14.000]  In the race of progress, the country that got freedom along with us, they went ahead of us.
    [01:14.000 --> 01:19.000]  The country that was after us, they left us behind.
    [01:19.000 --> 01:25.000]  In the poorest countries of the world, they counted us.
    [01:25.000 --> 01:29.000]  20% of the population is below the poverty line.
    [01:29.000 --> 01:35.000]  In the speech of the President, there is no mention of villages or drinking water.
    [01:35.000 --> 01:39.000]  We cannot enforce primary education.
    [01:39.000 --> 01:43.000]  The education of girls is being neglected.
    [01:43.000 --> 01:50.000]  The birth of a girl is still a curse in this country.
    [01:50.000 --> 01:55.000]  Is it by taking government steps, by creating awareness in the society?
    [01:55.000 --> 02:01.000]  Is it by uniting all the people that there is no place for party?
    [02:01.000 --> 02:05.000]  Can't we change the map of the country?
    [02:05.000 --> 02:08.000]  There is no shortage of resources in the country.
    [02:08.000 --> 02:14.000]  And if there is a shortage of resources, it can be obtained in the right way, resources can be increased.
    [02:14.000 --> 02:21.000]  But the resources that are there, they are not being used properly.
    [02:21.000 --> 02:30.000]  The wealth that is collected by taxing the public, its profit does not reach the public, it does not reach the common man.
    [02:30.000 --> 02:32.000]  Where does it go?
    [02:32.000 --> 02:35.000]  Whose pockets are filled?
    [02:35.000 --> 02:39.000]  Whose treasury does that money go to?
    [02:39.000 --> 02:44.000]  Why is the chain of money going to foreign banks still established?
    [02:44.000 --> 02:47.000]  What steps have been taken to stop it?
    [02:47.000 --> 02:52.000]  We are motivated for foreign worship, foreign worship has come.
    [02:52.000 --> 03:01.000]  And if foreign worship comes for good technology, for infrastructure,
    [03:01.000 --> 03:06.000]  for education, then no one will object.
    [03:06.000 --> 03:11.000]  I believe that our communist friends will not object either.
    [03:11.000 --> 03:19.000]  But is the maximum use of the resources in the country happening?
    [03:19.000 --> 03:26.000]  Is it not true that corruption has become a national disease?
    [03:26.000 --> 03:31.000]  I remember that Swargi Rajiv Gandhi had said in a speech that I send one rupee from Delhi,
    [03:31.000 --> 03:36.000]  but where I send the rupee, as I reach there, 19 paise are left.
    [03:36.000 --> 03:41.000]  I asked him how this miracle happens.
    [03:41.000 --> 03:47.000]  Bhaskar said that when the rupee runs, it shrinks.
    [03:47.000 --> 03:54.000]  The rupee shrinks, it gets into the hand, it goes into the pocket, it becomes small.
    [03:54.000 --> 03:58.000]  It is difficult to recognize the rupee.
    [03:58.000 --> 04:02.000]  The rupee can be hidden.
    [04:02.000 --> 04:06.000]  The situation of the currency of the country is not good.
    [04:06.000 --> 04:10.000]  First, the government expenditure has increased, it is increasing.
    [04:10.000 --> 04:17.000]  It needs common consent to reduce without reducing.
    [04:17.000 --> 04:24.000]  No one can work in the same way.
    [04:24.000 --> 04:27.000]  Yes, our old Prime Minister Narasimha Raoji,
    [04:27.000 --> 04:34.000]  if he would have tried in this direction after stabilizing himself, then he would have succeeded.
    [04:34.000 --> 04:47.000]  But he was stuck in some such things that he could not pay attention to these problems.
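
The repetition from the 02:40 mark on looks like the decoder getting stuck conditioning on its own previous output. If I'm reading the transcribe() options right, disabling condition_on_previous_text (while keeping the temperature-fallback schedule) is worth trying; a rough sketch, untested on this particular clip and with a placeholder file name:

  import whisper

  model = whisper.load_model("medium")

  # condition_on_previous_text=True (the default) feeds each window's text back
  # in as a prompt for the next window; once the decoder starts looping, the
  # loop gets reinforced. Disabling it sometimes breaks the cycle.
  result = model.transcribe(
      "speech.mp3",
      language="hi",
      task="transcribe",
      condition_on_previous_text=False,
      temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
  )
  for seg in result["segments"]:
      print(f"[{seg['start']:7.2f} --> {seg['end']:7.2f}] {seg['text']}")

  # The English output above corresponds to task="translate":
  # translated = model.transcribe("speech.mp3", language="hi", task="translate")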


We have reached sentient mode.



