We have been a pretty heavy user of this feature within Watson for the last six months... while it is pretty good, it lacks the ability to take external inputs such as stereo recordings with channel markers. I've been working on migrating our solution to VoiceBase, which in my opinion has a much more robust offering than IBM with respect to speaker diarization specifically, because it supports channel markers. The result is a conversational transcript that is much easier to read. Prior to this we used the LIUM project to attempt diarization on a single-channel recording, with mixed results. Without a doubt, speech to text has improved rapidly in the last 12 months.
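To make the channel-marker idea concrete, here is a minimal sketch (in Python) of what diarization-by-channel amounts to: split the stereo call into one mono track per speaker, transcribe each track separately, and interleave the segments by timestamp. The transcribe() function is a hypothetical stand-in for whatever speech-to-text API you use (Watson, VoiceBase, etc.); the file name and speaker labels are illustrative too. Everything else is standard-library WAV handling plus numpy.

    import wave
    import numpy as np

    def split_channels(path):
        """Split a 16-bit stereo WAV into one numpy array per channel."""
        with wave.open(path, "rb") as w:
            assert w.getnchannels() == 2, "expected a stereo recording"
            frames = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
            rate = w.getframerate()
        # Samples are interleaved L, R, L, R, ... so slice them apart.
        return frames[0::2], frames[1::2], rate

    left, right, rate = split_channels("call.wav")

    # Hypothetical: transcribe() returns [(start_sec, end_sec, text), ...]
    segments = (
        [(t, "Agent", txt) for t, _, txt in transcribe(left, rate)]
        + [(t, "Caller", txt) for t, _, txt in transcribe(right, rate)]
    )

    # Merge the two channels back into one readable conversation.
    for start, speaker, text in sorted(segments):
        print(f"[{start:7.2f}s] {speaker}: {text}")

Since each speaker sits on their own channel, there is no statistical speaker separation to do at all, which is why the resulting transcripts come out so much cleaner.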
As someone unfamiliar with the terminology: are the speakers isolated on separate tracks, or is there a mix on each channel from which the system can distinguish speakers by differences in relative volume? The latter seems tremendously valuable, if difficult to accomplish.
For the evaluation in the paper, the speakers are on separate channels (one mono channel per speaker; it's a telephone conversation, after all). Generally there are solutions for separating speakers on a single channel that can work fairly well (assuming your training data is similar to the target domain) if you know the number of speakers beforehand, but it's tremendously hard if you don't (think transcription of large meetings).
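To illustrate why knowing the speaker count matters, here is a toy sketch of single-channel diarization with k known in advance: extract MFCC features over roughly one-second windows and cluster them into k groups. Real systems (LIUM included) layer segmentation, GMM/i-vector modeling, and re-segmentation on top of this; the file name, window size, and use of plain k-means here are all simplifying assumptions. Assumes librosa and scikit-learn are installed.

    import librosa
    import numpy as np
    from sklearn.cluster import KMeans

    y, sr = librosa.load("call.wav", sr=16000, mono=True)

    hop = 512
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)

    # Average frames into ~1s windows so each point spans enough
    # speech to carry speaker identity.
    frames_per_win = int(sr / hop)  # roughly one second of frames
    n_win = mfcc.shape[1] // frames_per_win
    windows = (mfcc[:, :n_win * frames_per_win]
               .reshape(13, n_win, frames_per_win)
               .mean(axis=2).T)

    k = 2  # number of speakers, known beforehand
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(windows)

    # Print a rough per-second speaker timeline.
    for i, lab in enumerate(labels):
        print(f"{i:4d}s - {i + 1:4d}s  speaker {lab}")

With k fixed, the clustering step is well posed; with k unknown you have to estimate the number of clusters from the data itself, which is where the large-meeting case gets hard.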