Data in ML is critical, and this release from Mozilla is absolute gold for voice research.
This dataset will help the many independent deep learning practitioners such as myself who aren't working at FAANG and have only had access to datasets such as LJS [1] or self-constructed datasets cobbled together and manually transcribed.
Despite the limited materials available, there's already some truly amazing stuff being created. We've seen a lot of visually creative work produced in the past few years, but the artistic community is only getting started with voice and sound.
https://www.youtube.com/watch?v=3qR8I5zlMHs
https://www.youtube.com/watch?v=L69gMxdvpUM
Another really cool thing popping up is TTS systems trained on non-English speakers reading English corpora. I've heard Angela Merkel reciting copypastas, and it's quite amazing.
I've personally been dabbling in TTS as one of my "pandemic side projects" and found it to be quite fun and rewarding:
https://trumped.com
https://vo.codes
Besides TTS, one of the areas I think this dataset will really help with is Voice Conversion (VC). It'll be awesome to join Discord or TeamSpeak and talk in the voice of Gollum or Rick Sanchez. The VC field needs more data to perfect non-aligned training (where source and target speakers aren't reciting the same temporally aligned training text), and this will be extremely helpful.
I think the future possibilities for ML techniques in art and media are nearly limitless. It's truly an exciting frontier to watch rapidly evolve and to participate in.
[1] https://keithito.com/LJ-Speech-Dataset/
Curious to know why researchers don't use audiobooks/videos and their transcripts when data isn't available. Is it because these don't capture different dialects/accents?
This is great! I’m always excited to see new common voice releases.
As someone actively using the data, I wish I could more easily see (and download lists for?) the older releases, as there have been 3-4 dataset updates for English now. If we don’t have access to versioned datasets, there’s no way to reproduce old whitepapers or models that use Common Voice. And at this point I don’t remember the statistics (hours, accent/gender breakdown) for each release. It would be neat to see that over time on the website.
I’m glad they’re working on single word recognition! This is something I’ve put significant effort into. It’s the biggest gap I’ve found in the existing public datasets - listening to someone read an audiobook or recite a sentence doesn’t seem to prepare the model very well for recognizing single words in isolation.
My model and training process have been adapted for that, though I’m still not sure of the best way to balance training for that sort of thing. I have maybe 5 examples of each English word in isolation but 5,000 examples of each number (Speech Commands), and it seems like the model will prefer e.g. “eight” over “ace”, I guess due to training balance.
Maybe I should randomly sample, say, 50 of the 5,000 examples of each over-represented word every epoch, so the model still has a chance to learn from them without over-training on them?
What if you first trained a classifier that told you whether the utterance is a single word or multiple words? Then, based on that prediction, you would use one of two separate models.
What you're describing is undersampling the over-represented classes; the mirror image, oversampling the rare ones, is also common, and there are many other general techniques for dealing with imbalanced datasets, as it's a very common situation.
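As a rough illustration, here's a minimal sketch of oversampling via inverse-frequency sample weights with PyTorch's WeightedRandomSampler. The labels and counts below are made up to mirror your 5,000-vs-5 situation, and `dataset` is assumed to be a torch Dataset aligned with them:

    from collections import Counter
    from torch.utils.data import DataLoader, WeightedRandomSampler

    # One word label per training example (toy numbers: 5000 "eight" vs 5 "ace").
    labels = ["eight"] * 5000 + ["ace"] * 5
    counts = Counter(labels)

    # Weight each example by the inverse of its class frequency, so batches
    # drawn with replacement see "ace" about as often as "eight".
    weights = [1.0 / counts[w] for w in labels]
    sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)

    # loader = DataLoader(dataset, batch_size=32, sampler=sampler)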
Thanks, the oversampling mention gives me a good reference to start.
The model itself has generalized pretty well to handle both single and multi word utterances I think, without a separate classifier, but I'm definitely not going to rule out multi-model recognition in the long run.
My main issues with single words right now are:
- The model sometimes plays favorites with numbers (ace vs eight)
- Collecting enough word-granularity training data for words-that-are-not-numbers (I've done a decent job of this so far, but it's a slow and painful process. I've considered building a frontend to turn sentence datasets into word datasets with careful alignment)
For that last point, forced alignment tools may be useful.
One issue to watch for, though, is elision: a word in a sentence is often said differently from the word in isolation. For example, saying "last" and "time" separately, one typically includes the final t in "last", yet said together it commonly comes out more like "las time".
Yeah, I'm familiar with forced alignment. This is slightly nicer than generic forced alignment, because my model has already been trained on the alignment of all of my training data; my character-based models already have pretty good guesses for the word alignments.
I think I'd be very cautious about it and use a model with a different architecture than the aligner to validate extracted words, and probably experiment with training on the data a bit to see whether the resulting model makes sense. I do have examples of most English words to compare extracted words against.
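If it helps anyone thinking about the same pipeline, here's a minimal sketch of the extraction step, assuming you already have per-word (start, end) times in seconds from your aligner; soundfile and the paths/format here are just illustrative:

    import os
    import soundfile as sf

    def extract_words(wav_path, alignments, out_dir="word_clips"):
        """Slice a sentence recording into per-word clips.

        alignments: list of (word, start_sec, end_sec) tuples, e.g. taken
        from a character-level model's alignment output.
        """
        os.makedirs(out_dir, exist_ok=True)
        audio, sr = sf.read(wav_path)
        for i, (word, start, end) in enumerate(alignments):
            clip = audio[int(start * sr):int(end * sr)]
            sf.write(os.path.join(out_dir, f"{i:03d}_{word}.wav"), clip, sr)

    # extract_words("sentence.wav", [("last", 0.10, 0.48), ("time", 0.48, 0.90)])

Given the elision point above, the extracted clips are definitely worth validating with a separate model before treating them as isolated-word examples.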
Does this dataset include people with voice or speech disorders (or other disabilities)? I don’t see any mention of it in this announcement or the forums, though I haven’t looked thoroughly (yet).
Examples: dysphonias of various kinds, dysarthria (e.g. from ALS / cerebral palsy), vocal fold atrophy, stuttering, people with laryngectomies / voice prosthesis, and many more.
Altogether, this represents millions of people for whom current speech recognition systems do not work well. This is an especially tragic situation, since people with disabilities depend more heavily on assistive technologies like ASR. Data/ML bias is rightfully a hot topic lately, so I feel that the voices of people w/ disabilities need to be amplified as well (npi).
Gathering, collecting, and publishing such a dataset would be great, and would certainly improve baseline speech recognition for people with disordered speech, but it can only help so much without personalizing to a specific individual. This is true for anybody, but more so for disordered speech. This is an area where I think "generic" solutions will inevitably struggle, even if they are somewhat specialized for "generic dysarthric" speech.
However, this means that the gains to be had from personalized training are greater for disordered speech than for "average" speech. I develop kaldi-active-grammar [0], which specializes the Kaldi speech recognition engine for real-time command & control with many complex grammars. I am also working on making it easier to train personalized speech models, and to fine tune generic models with training for an individual. I have posted basic numbers on some small experiments [1]. Such personalized training can be time consuming (depending on how far one wants to take it), but as my parent comment says, disabled people may need to rely more on ASR, which means they have that much more to gain by investing the time for training.
Nevertheless, a Common Voice disordered speech dataset would be quite helpful, both for research, and for pre-training models that can still be personalized with further training. It is good to see (in my sibling comment) that it is being discussed.
It’s not in the current dataset, but offering such a disordered speech dataset has been discussed. I imagine it’s something that will probably be offered at some point in future.
I have heard a few people with speech disorders when validating clips. I also recall some discussion of it in the Discord or issue tracker. At the moment it is entirely up to the community to encourage people with voice or speech disorders to submit; so long as they meet the validation criteria, they will be included. I can't see a flag in the user profile for recording a disorder, so it's not likely you can filter just these recordings from the data.
After taking some time to dig into the project and forums, I’m more concerned. I have worked on building and validating large-scale ML datasets in tricky domains before and am best friends with an SLP, so I have some context for understanding the challenges involved with creating a dataset like this. I apologize if my tone is too harsh—my intent is to help hold the ML community to a higher standard on this issue and generate productive conversation via my criticisms.
The Common Voice FAQs say the right words about the mission of the project:
>”As voice technologies proliferate beyond niche applications, we believe they must serve all users equally.”
>”We want the Common Voice dataset to reflect the audio quality a speech-to-text engine will hear in the wild, so we’re looking for variety. In addition to a diverse community of speakers, a dataset with varying audio quality will teach the speech-to-text engine to handle various real-world situations, from background talking to car noise. As long as your voice clip is intelligible, it should be good enough for the dataset.”
However, your data validation criteria both implicitly and explicitly exclude entire classes of people from the dataset, and allow the validators to impose an arbitrary standard of purity regarding what constitutes “correct” speech. In so doing, you are influencing who is and isn’t understood by systems built upon this data. Examples from the docs (https://discourse.mozilla.org/t/discussion-of-new-guidelines...):
>”You need to check very carefully that what has been recorded is exactly what has been written - reject if there are even minor errors.”
As currently stated, this criterion leads to the categorical exclusion of people for whom speaking without “even minor errors” is not possible (e.g. lalling and other phonological disorders, where certain phonemes can’t be formed), based on the validators’ subjective perception of data cleanliness.
>”Most recordings are of people talking in their natural voice. You can accept the occasional non-standard recording that is shouted, whispered, or obviously delivered in a ‘dramatic’ voice. Please reject sung recordings and those using a computer-synthesized voice.”
Look at this kid’s face light up and tell me that’s not his new natural voice. An electrolarynx is not a computer-synthesized voice (you manipulate the muscles in your neck to generate vibrations, like an external set of vocal cords), although it would almost certainly be mistaken for one and summarily sent to the “clip graveyard” (https://voice.mozilla.org/en/about).
>”I tend to click ‘no’ and move on for extreme mispronounced words. I’m of the opinion that soon enough, another speaker from their nationality will submit a correct recording.”
Again, the use of the word “correct” here is problematic. Rejecting borderline cases and waiting for “cleaner” samples is a severe trap to fall into, regardless of the domain.
>”I do the same as you. Accept if it’s an elongation; reject if the reader takes two attempts to start the word.”
Again, this almost categorically excludes people with a stutter and other types of speech disorders.
@dabinat gets it right with this comment:
>”There are uses for CV and DeepSpeech beyond someone directly dictating to their computer. In my opinion, CV’s voice archive should contain as many different ways to say something as possible.”
But then...
>”You may well be right. I’d be interested to hear what the programmers’ expectations are.”
>”I will ping @kdavis and @josh_meyer for feedback on the ML expectations (in terms of what’s good/bad for deepspeech).”
Yikes. So the data is being selected to improve performance benchmarks of the speech recognition model, not to better reflect the nuances and variety of speech in the real world (the stated goal of Common Voice). It can very easily be the case that cherry-picking data to improve test benchmarks decreases the model’s generalizability in other applications. Narrowing the range of human speech to make the problem easier (i.e. simpler to build a model that works well for most people) is antithetical to your stated mission. We can’t keep measuring AI progress in parameters and petabytes; it has to be about the people it helps.
>”I agree that we don’t want to scare off new contributors off by presenting the guidelines up-front as an off-putting wall of text that they have to read.”
Limiting the amount of documentation/training available to data annotators in an effort not to scare them is a surefire way to end up with inconsistently labeled data.
Although I find the above examples to be dismaying, I do not mean to ascribe any ill intent to your team or the volunteers. I understand the complexities at play here. But the outright dismissal of certain types of voices as out-of-scope or not “correct” is causing real harm to real people, because ASR systems simply do not work well for people with various disabilities. I could find no direct mention or acknowledgement of the existence of speech disorders anywhere* on the website or forum.
I believe there needs to be a more deliberate effort to construct a more representative dataset in order to meet your stated mission (which I am willing to volunteer my time towards). Just some initial ideas:
- Augment the dataset by folding in samples from external datasets (e.g. https://github.com/talhanai/speech-nlp-datasets). I’m not sure on the approach, but if movie scripts can be adapted, presumably so can other voice datasets.
- Retain samples with speech errors like mispronunciations and stutters (perhaps with a flag indicating the error). In fact, why not retain all samples, flagging those that are unintelligible? At least keep it available, for data provenance purposes (so it is known what was excluded and can be reversed).
- Establish a relationship with speech-language pathologists to collect or validate samples (eg: universities or the VA, who have many complex/polytrauma voice patients). Sessions with SLPs often involve having patients read sentences aloud, so it’s a familiar task. This is probably the best way to collect data from people with voice disorders, so volunteer annotators aren’t responsible for analyzing a complex subset.
- Use inter-annotator agreement measures to characterize uncertainty about sample accuracy, rather than binary accept/reject criteria (see the sketch after this list).
- Collect/solicit more samples from people >70yrs old, since they are currently underrepresented in your data. Is there anyone over the age of 80 in your dataset at all?
- Improve your documentation and standards to be more explicitly transparent about the ways in which the dataset does not currently represent everyone, and about plans for bridging these gaps.
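To make the agreement suggestion concrete, here's a minimal sketch of turning per-clip validator votes into a Fleiss' kappa score rather than a hard accept/reject threshold (toy numbers; binary accept/reject categories and an equal number of validators per clip are assumed):

    import numpy as np

    def fleiss_kappa(votes):
        """Fleiss' kappa for a (n_clips, n_categories) vote matrix.

        votes[i, j] = how many validators put clip i into category j
        (e.g. columns = accept, reject). Assumes the same number of
        validators rated every clip.
        """
        votes = np.asarray(votes, dtype=float)
        n = votes.sum(axis=1)[0]                   # validators per clip
        p_j = votes.sum(axis=0) / votes.sum()      # overall category proportions
        P_i = (np.square(votes).sum(axis=1) - n) / (n * (n - 1))
        return (P_i.mean() - np.square(p_j).sum()) / (1 - np.square(p_j).sum())

    # Four clips, three validators each, columns = (accept, reject).
    print(fleiss_kappa([[3, 0], [2, 1], [0, 3], [1, 2]]))  # ~0.33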
Hi! I'm the lead engineer for Common Voice and I wanted to thank you for taking the time to write such a thoughtful comment. Your criticisms on disordered speech and exclusion criteria are absolutely on point, and the issue of speaker diversity in our dataset as well as the lack of nuance in our validation mechanism is something we're very aware of and actively working to address. We are and have historically been a very small team and have thus far concentrated our efforts on language diversity, which is an explanation but not an excuse for some of these gaps.
There are some legal concerns with folding in samples from other datasets for licensing reasons, because all of our data (sentences and audio) are CC0, but I definitely hear you on looking for ways to expand our scope. As part of our commitment to open data, all voice samples are released as part of our dataset regardless of their validation status, and we do not filter or discard any contributions from our community. One of the things the team is currently scoping is how to allow contributors to provide reasons for rejecting a particular clip, to enable exactly the kind of post-hoc analysis you're describing.
Please do join us in Discourse or Matrix, we would love for you to be involved in ongoing discussions on how to improve inclusion and accessibility. Again, thank you for taking the time to express this, I really appreciate it.
It might be a good idea to post this comment directly to Discourse to make sure it gets Mozilla’s attention.
But as I mentioned, this has been discussed, including the ability for users to add flags to their profiles to indicate disordered speech.
IMO it might be better to include disordered speech in a separate dataset with separate validation requirements, which would require new features on the site. But the new “target segments” feature is a step towards achieving such a thing.
You raise good points. For what it's worth, I think all "invalidated" samples are still included in the distribution (invalidated.tsv), with the number of up and down votes for each (but not the reasoning).
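For example, one could already turn those vote counts into a soft agreement score per clip instead of a hard accept/reject; a minimal sketch with pandas, assuming the release's TSVs still carry the up_votes/down_votes columns from earlier versions (paths are illustrative):

    import pandas as pd

    clips = pd.concat(
        [pd.read_csv("validated.tsv", sep="\t"), pd.read_csv("invalidated.tsv", sep="\t")],
        ignore_index=True,
    )

    votes = clips["up_votes"] + clips["down_votes"]
    # Fraction of validators on the majority side: 1.0 = unanimous, ~0.5 = split.
    clips["agreement"] = clips[["up_votes", "down_votes"]].max(axis=1) / votes

    # Clips where validators disagreed (e.g. 2-1, 3-2) rather than agreeing outright.
    contested = clips[(votes >= 3) & (clips["agreement"] < 0.75)]
    print(len(contested), "contested clips")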
Google is certainly doing some great work with this, both Project Euphonia and other research [0]. However, as far as I know, the Euphonia dataset is closed and only usable by Google. A Common Voice disordered speech dataset would (presumably) be open to all, allowing independent projects and research. (I would love to have access to such a dataset.)
I would love to work for Mozilla on this effort full time. I have experience in voice data collection / annotation / processing at 2 FAANG companies. Anyone have an in? I'm thinking of reaching out to the person who wrote this post directly.
The people on those related projects seem like a great bunch of people to work with.
I don't have "an in" but it's probably worth having a look over the Common Voice and Deep Speech forums on Discourse to see who the main people are. They also hang out in their Matrix Chat groups, so might be able to get in touch that way. Links are below.
How long do you think it will be before we have personalized language/reading coaches talking to us during our morning commute to the downstairs office?
Why on earth are they using mp3 for the dataset? It's absolutely ancient and probably the worst choice possible. Opus is widely used for voice because it gets flawless results at minuscule bitrates. And don't tell me it's because users find mp3 simpler, because if you're doing machine learning I expect you know how to use an audio file.
Probably because they're uploading (and playing back) from a webpage and Web Audio is weird and inconsistent, so sticking to a builtin codec is probably more reliable. As someone who trains on their data, it seems usable anyway. Training on 1000 hours of Common Voice makes my model better in very clear ways.
Yeah, compatibility with Apple browsers in particular was very important for them. I'd added functionality to normalize audio for verification, but they removed it multiple times because it didn't work on Safari for various reasons.
In general I don't think normalization should happen on the backend anyway. It's useful for training data to have multiple loudness levels, so that the network learns to handle them all.
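If anything, the opposite can be done at training time: perturb the gain as augmentation so the model sees even more loudness variety. A minimal sketch (numpy; the dB range is arbitrary):

    import numpy as np

    def random_gain(audio, low_db=-6.0, high_db=6.0, rng=np.random):
        """Scale a float waveform by a random gain in dB each time it's loaded."""
        gain_db = rng.uniform(low_db, high_db)
        return audio * (10.0 ** (gain_db / 20.0))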
They make it really easy to contribute! You don't need to make an account (you can though) and you read/review short sentences. I just added 75 recordings and it only took ~30 minutes. Also if you speak other languages you can contribute in them. It would really be great if there was a comprehensive public voice dataset for people to do all sorts of interesting things with.
The whole project is very exciting. I hope it really is a game changer that enables private individuals and startups to create new neural networks without a big investment in data collection.
I worked on the Esperanto dataset of Common Voice over the last year, and we have now collected over 80 hours in Esperanto. I hope that in a year or two we'll have collected enough data to create the first usable neural network for a constructed language, and maybe the first voice assistant in Esperanto too. I will train a first experimental model with this release soon.
This is interesting. As someone who always has tons of interview data to transcribe for academic research, what speech-to-text systems should I be looking into to help save some time? Is DeepSpeech suited for this use?
I've been working on this. I think I can reliably hit the quality ballpark of STT APIs at the acoustic model level, but not yet at the language model level (word probabilities) in a low-powered way.
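For anyone in a similar spot, one common way to close that gap is to rescore the acoustic model's n-best hypotheses with an off-the-shelf word-level LM (shallow fusion); a minimal sketch with kenlm, where the ARPA model path and fusion weight are made up:

    import kenlm

    lm = kenlm.Model("word_3gram.arpa")  # hypothetical pre-trained word LM

    def rescore(hypotheses, lm_weight=0.5):
        """Re-rank (text, acoustic_log_prob) pairs with a word-level LM.

        kenlm's score() returns a log10 probability, which is fine for
        ranking; lm_weight trades off acoustic vs. language model scores.
        """
        return max(
            hypotheses,
            key=lambda h: h[1] + lm_weight * lm.score(h[0], bos=True, eos=True),
        )

    # print(rescore([("eight of spades", -4.2), ("ace of spades", -4.5)]))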