Only 0.2 %? 1 out of every 500? That seems like a lot to me, especially given th...

danbruc · on July 11, 2019

In response to a deleted comment that said the following: [2]

Google is not training one language model, but many of them (I'd estimate ~70 language models from the voice settings menu on my phone). So 0.2% in total doesn't sound too unrealistic to me as this should be closer to 0.002% per language.

There is a large variation in the number of speakers between different languages. What would they want to do? Aim for the same number of training points for each language? Then for a language with 20 times fewer speakers - Thai compared to English - they would have to look at 20 times 0.2 %, i.e. 4 % [1], of all interactions in Thai. Add to this that the distribution across languages is most likely very skewed, i.e. languages spoken in poorer regions of the world have a lot fewer users than languages spoken in richer regions.

Or maybe they want more training points for more frequently used languages. If they aim for a number of training points proportional to the number of interactions, then every interaction has a 0.2 % chance of being used as a training sample regardless of the language. If you perform two interactions per day - and I will happily admit that I have not the slightest clue whether this is even the right order of magnitude, I have never used any such system - then you reach 500 interactions within one year, which means that after one year of usage you have a reasonable chance that at least one of your interactions has become a training point.
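To put a rough number on "reasonable chance", here is a minimal sketch. It assumes every interaction is picked independently with the same 0.2 % probability, which is my simplification and not anything Google has stated; the function name is made up for illustration.

```python
def chance_at_least_one(rate: float, interactions: int) -> float:
    """Probability that at least one of `interactions` gets sampled,
    assuming each one is picked independently with probability `rate`."""
    return 1.0 - (1.0 - rate) ** interactions


rate = 0.002  # 0.2 %, the figure under discussion

# 500 interactions, and 730 = two interactions per day for a full year
for n in (500, 730):
    print(f"{n} interactions: {chance_at_least_one(rate, n):.1%}")
```

Under that assumption the chance comes out at roughly 63 % after 500 interactions and a bit above three in four after a full year at two per day, so likely but far from certain.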

[1] Probably not actually true: due to the large number of English speakers, the percentage for English would most likely be less than 0.2 %, but right now I cannot be bothered to figure out the correct numbers.

[2] Meta question - would this generally be considered acceptable without naming the user who made the comment? Or should deleted stay deleted?