We can't though. This is what the marketing department keeps shouting from the r...

anonzzzies · 2024-11-02T19:21:36 1730575296

Google speech to text, Siri, Zoom subtitles, YouTube subtitles etc. insert almost only things I didn't say into the transcriptions. Whisper understands exactly what I say, even if I mumble, use abbreviations etc., and at speed too. Maybe it does something wrong sometimes; the old way is infinitely worse. It's almost a joke between me and my colleagues to switch on speech to text when doing team calls (even 1 on 1); it gets 99% completely wrong; we talk about programming TypeScript; it transcribes about robots and sex and rocks in the purple water; it's funny to read. If you turned off the audio, you, as a third party, would think we are drunk or on acid, while with sound you would follow everything fine.

I assume native English speakers do better (?) but we speak English (with accents) and Whisper has no issues at all.

jdiff · 2024-11-02T19:45:15 1730576715

Again, it's seriously impressive tech, and it has its uses. But the failure modes are wildly different and severe, made even more severe by how impressive it is in the usual case. Medical transcriptions find themselves containing cancer diagnoses that were never uttered in reality, for instance.

A failing traditional STT can be spotted by glancing through a transcript. A failing Whisper can only be identified by thorough comparison against the audio, with the failures being far more impactful and important to spot.