
We have been a pretty large user of this feature within Watson for the last 6 months... While it is pretty good, it lacks the ability to take external inputs such as stereo recordings with channel markers. I've been working on migrating our solution to VoiceBase, which in my opinion has a much more robust solution than IBM with respect to speaker diarization, specifically because it includes a feature for channel markers. The result is a conversational transcription that is much easier to read. Prior to this we used the LIUM project to attempt diarization on a single-channel recording, with mixed results. Without a doubt, speech-to-text has rapidly improved in the last 12 months.
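(For anyone curious what "channel markers" buys you in practice: if each speaker is recorded on their own channel of a stereo file, diarization reduces to splitting the channels and transcribing each one separately. A minimal sketch using only the Python stdlib `wave` module, assuming 16-bit PCM stereo input; the function name and file layout are my own, not any vendor's API:)

```python
import wave

def split_stereo(path, left_out, right_out):
    """Split a 16-bit stereo WAV into two mono WAVs, one per speaker channel."""
    with wave.open(path, "rb") as src:
        assert src.getnchannels() == 2 and src.getsampwidth() == 2
        framerate = src.getframerate()
        frames = src.readframes(src.getnframes())
    # Samples are interleaved as L0 R0 L1 R1 ... (2 bytes each);
    # de-interleave by taking alternating 2-byte slices.
    left = b"".join(frames[i:i + 2] for i in range(0, len(frames), 4))
    right = b"".join(frames[i + 2:i + 4] for i in range(0, len(frames), 4))
    for out_path, data in ((left_out, left), (right_out, right)):
        with wave.open(out_path, "wb") as dst:
            dst.setnchannels(1)
            dst.setsampwidth(2)
            dst.setframerate(framerate)
            dst.writeframes(data)
```

(Each mono file can then be sent to the speech-to-text service independently, with the channel identity telling you who spoke.)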



Why migrate to another service in 2017 when open source toolkits like Kaldi give you better results, more features, and no vendor lock-in?


Cool - well, we are hiring, so if you'd like to do this, reach out. We have lots of neat projects like this going on all the time.


As someone unfamiliar with the terminology: are the speakers isolated on single tracks, or is each channel a mix, with the system distinguishing speakers by differences in relative volume? The latter seems tremendously valuable, if difficult to accomplish.


For the evaluation in the paper, speakers are on separate channels (each side of the telephone conversation is its own mono channel). Generally there are solutions for separating speakers on a single channel that can work fairly well (assuming your training data is similar to the target domain) if you know the number of speakers beforehand, but it's tremendously hard if you don't (think transcription of large meetings).
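(When speakers really are on separate channels, turning the per-channel transcripts back into one readable conversation is just an interleave by start time. A minimal sketch; the `Segment` record and its fields are my own assumption about what a transcription service returns, not any particular API:)

```python
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str   # channel identity, e.g. "A" / "B"
    start: float   # utterance start time in seconds
    text: str

def interleave(*channels):
    """Merge per-channel transcript segments into one conversational
    transcript, ordered by utterance start time."""
    segments = sorted((s for ch in channels for s in ch), key=lambda s: s.start)
    return "\n".join(f"[{s.start:6.1f}] {s.speaker}: {s.text}" for s in segments)
```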


In fairness, sorting out speakers in a conference call is hard for a human.



