Any opinions on what this means for speech-to-text companies like rev.ai and assembly.ai?
We've tested open source solutions for S2T, like Kaldi, but the quality was not good enough. However, one of the main advantages of a service like assembly.ai, to me, is that it offers sentence splitting in the form of punctuation, plus speaker detection, which Kaldi does not.
So I guess I answered my own question to some degree: an S2T service is more than just S2T. We already see assembly.ai adding more and more features (like summarisation, PII redaction, etc.) that are a value-add on top of plain S2T. Still, I'm curious to hear what your take on that is.
You can apply the public punctuation model from Vosk on top of Kaldi output, and you can also get speaker labels with existing open source software.
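For context, a minimal sketch of plain Vosk recognition on top of a Kaldi-style model (the "model" directory and "audio.wav" are placeholders); the raw lowercase output is what you would then feed to the separately distributed recasepunc punctuation checkpoint from the Vosk models page:

```python
import json
import wave

from vosk import Model, KaldiRecognizer

# Path to an unpacked Vosk/Kaldi acoustic model directory (placeholder).
model = Model("model")

wf = wave.open("audio.wav", "rb")  # expects 16 kHz mono PCM
rec = KaldiRecognizer(model, wf.getframerate())

while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    rec.AcceptWaveform(data)

# Raw, unpunctuated lowercase transcript; the recasepunc punctuation
# model is applied on top of exactly this kind of text.
print(json.loads(rec.FinalResult())["text"])
```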
In a quick video transcription test, this model is more accurate than AssemblyAI and Rev AI. It will be harder for them to sell pure ASR now. Some more business-oriented applications will still be important, though, for example ASR as part of a call-center analytics solution or a medical ERP system.
The value of automatic summarisation is small; without real AI it is very hard to get right, because you need to be an expert in the field to understand what is important.
> you can also get speaker labels with existing open source software.
Hello Nickolay :)
Diarization has always been the hard part for me, especially since it is very difficult to do comparisons within your own domain. The evaluation metrics are not descriptive enough, imo.
Would you say TitaNet or ECAPA-TDNN are decent for production use alongside, say, Whisper or any other ASR output, if given the timestamps, so as to bypass running VAD? I'm just about to run experiments with pyannote's diarization model and Google's UIS-RNN to test how well they work, but it's a tad beyond my ability to evaluate.
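For reference, a minimal sketch of the pyannote experiment, assuming the pretrained pyannote/speaker-diarization pipeline from the Hugging Face hub (the access token, file name, and ASR segments are placeholders); picking the speaker active at each segment's midpoint is one crude way to attach labels to Whisper timestamps:

```python
from pyannote.audio import Pipeline

# Pretrained diarization pipeline from the Hugging Face hub
# (requires accepting the model terms; the token here is a placeholder).
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization", use_auth_token="hf_...")

diarization = pipeline("audio.wav")

# Hypothetical ASR segments with timestamps, e.g. from Whisper's output.
asr_segments = [
    {"start": 0.0, "end": 4.2, "text": "hello and welcome"},
    {"start": 4.2, "end": 9.8, "text": "thanks for having me"},
]

def speaker_at(t):
    # Return the speaker whose turn overlaps time t, if any.
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        if turn.start <= t <= turn.end:
            return speaker
    return "unknown"

for seg in asr_segments:
    mid = (seg["start"] + seg["end"]) / 2
    print(speaker_at(mid), seg["text"])
```

Midpoint matching is the simplest heuristic; computing overlap durations per turn would be more robust when a segment spans a speaker change.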
I also wonder if the Whisper architecture would be good for generating embeddings, but I feel it's focused so much on what is said rather than how it's said that it might not transfer well to speaker tasks.
Rev AI will also create a transcript separated by speaker, which it doesn't appear Whisper can do (yet). I expect Whisper will overtake the alternatives soon, given that it's open source, but today it's not there yet.