Any opinions on what this means for speech-to-text companies like rev.ai and assembly.ai?
We've tested open source solutions for S2T, like Kaldi, but the quality was not good enough. However, one of the main advantages of a service like assembly.ai, to me, is that it offers sentence splitting in the form of punctuation, plus speaker detection, which Kaldi does not.
So I guess I answered my own question to some degree: an S2T service is more than just S2T. We already see assembly.ai adding more and more features (like summarisation, PII redaction, etc.) that are a value-add on top of plain S2T. Still, I'm curious to hear what your take on that is.
You can apply the public punctuation model from Vosk on top of Kaldi output, and you can also get speaker labels with existing open source software.
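For context, a minimal sketch of plain Vosk recognition on top of a Kaldi-style model (the "model" directory and "audio.wav" are placeholders); the raw lowercase output is what you would then feed to the separately distributed recasepunc punctuation checkpoint from the Vosk models page:

```python
import json
import wave

from vosk import Model, KaldiRecognizer

# Path to an unpacked Vosk/Kaldi acoustic model directory (placeholder).
model = Model("model")

wf = wave.open("audio.wav", "rb")  # expects 16 kHz mono PCM
rec = KaldiRecognizer(model, wf.getframerate())

while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    rec.AcceptWaveform(data)

# Raw, unpunctuated lowercase transcript; the recasepunc punctuation
# model is applied on top of exactly this kind of text.
print(json.loads(rec.FinalResult())["text"])
```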
In a quick video transcription test, this model is more accurate than AssemblyAI and Rev AI. It will be harder for them to sell pure ASR now. Some more business-oriented applications will still be important, though, for example ASR as part of a call-center analytics solution or a medical ERP system.
The value of automatic summarisation is small; without real AI it is very hard to get right, because you need to be an expert in the field to understand what is important.
> you can also get speaker labels with existing open source software.
Hello Nickolay :)
Diarization has always been the hard part for me, especially since it is very difficult to do comparisons within your own domain. The evaluation metrics are not descriptive enough, imo.
Would you say TitaNet or ECAPA-TDNN are decent for production use alongside, say, Whisper or any other ASR output, if given the timestamps, so as to bypass running VAD? I'm just about to run experiments with pyannote's diarization model and Google's UIS-RNN to test how well they work, but it's a tad beyond my ability to evaluate.
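For reference, a minimal sketch of the pyannote experiment, assuming the pretrained pyannote/speaker-diarization pipeline from the Hugging Face hub (the access token, file name, and ASR segments are placeholders); picking the speaker active at each segment's midpoint is one crude way to attach labels to Whisper timestamps:

```python
from pyannote.audio import Pipeline

# Pretrained diarization pipeline from the Hugging Face hub
# (requires accepting the model terms; the token here is a placeholder).
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization", use_auth_token="hf_...")

diarization = pipeline("audio.wav")

# Hypothetical ASR segments with timestamps, e.g. from Whisper's output.
asr_segments = [
    {"start": 0.0, "end": 4.2, "text": "hello and welcome"},
    {"start": 4.2, "end": 9.8, "text": "thanks for having me"},
]

def speaker_at(t):
    # Return the speaker whose turn overlaps time t, if any.
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        if turn.start <= t <= turn.end:
            return speaker
    return "unknown"

for seg in asr_segments:
    mid = (seg["start"] + seg["end"]) / 2
    print(speaker_at(mid), seg["text"])
```

Midpoint matching is the simplest heuristic; computing overlap durations per turn would be more robust when a segment spans a speaker change.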
I also wonder if the Whisper architecture would be good for generating embeddings, but I feel it's focused so much on what is said rather than how it's said that it might not transfer well to speaker tasks.
Rev AI will also create a transcript separated by speaker, which it doesn't appear Whisper can do (yet). I expect Whisper will overtake the alternatives soon, given that it's open source, but today it's not there yet.