We can't, though. This is what the marketing department keeps shouting from the rooftops and what the futurists keep prophesying, but in practice these things seem to be giant liability machines with no common sense, self-doubt, or ability to say "I don't know, let me ask someone else."
They're neat, they have uses, but they're not replacements for anything. Even Whisper cannot replace a human transcriptionist as it will just make up and insert random lines that are not present in the source audio.
Google speech-to-text, Siri, Zoom subtitles, YouTube subtitles, etc. put almost nothing but things I didn't say into the transcriptions. Whisper understands exactly what I say, even if I mumble or use abbreviations, and at speed too. Maybe it gets something wrong sometimes; the old way is infinitely worse. It's almost a joke between me and my colleagues to switch on speech-to-text during team calls (even 1-on-1): it gets 99% completely wrong. We talk about programming in TypeScript; it transcribes something about robots and sex and rocks in the purple water. It's funny to read. If you turned off the audio, you, as a third party, would think we were drunk or on acid, while with sound you would follow everything fine.
I assume native English speakers do better (?), but we speak English with accents and Whisper has no issues at all.
Again, it's seriously impressive tech, and it has its uses. But the failure modes are wildly different and severe, made even more severe by how impressive it is in the usual case. Medical transcriptions find themselves containing cancer diagnoses that were never uttered in reality, for instance.
A failing traditional speech-to-text system can be spotted by glancing through a transcript. A failing Whisper can only be identified by thorough comparison against the source audio, with the failures being far more impactful and important to spot.