Wow, training on a dataset of 12 million hours is quite impressive! I can only imagine the engineering feats required to accomplish that. To put it into perspective, Whisper was trained on 680K hours, Speechmatics' Ursa on 1M hours, and AssemblyAI's Conformer-1 on 650K hours. I hope Meta is also working on something similar!
That being said, speaker diarisation still hasn't been fully solved. As of yet, AI hasn't been able to outperform humans in this area.