Only 0.2 %? 1 out of every 500? That seems like a lot to me, especially given that there must be millions if not billions of interactions. How many of those things are out there? And how many interactions does the average user perform? And they keep them all? Forever? I could probably find the numbers myself or at least estimate them, I just don't care enough. I would, however, be happy to learn them if someone happens to know them.
In response to a deleted comment that said the following [2]:
Google is not training one language model, but many of them (I'd estimate ~70 language models from the voice settings menu on my phone). So 0.2% in total doesn't sound too unrealistic to me as this should be closer to 0.002% per language.
There is a large variation in the number of speakers between different languages. What would they want to do? Aim for the same number of training points for each language? Then for a language with 20 times fewer speakers - Thai compared to English - they would have to look at 4 % [1] of all interactions in Thai, i.e. 20 times the 0.2 % rate. Add to this that the distribution across languages is most likely very skewed, i.e. languages spoken in poorer regions of the world have a lot fewer users than languages spoken in richer regions.
Or maybe they want more training points for more frequently used languages. If they aim for a number of training points proportional to the number of interactions, then every interaction has a 0.2 % chance of being used as a training sample regardless of the language. If you perform two interactions per day - and I will happily admit that I have not the slightest clue whether this is even the right order of magnitude, I have never used any such system - then you reach 500 interactions within one year, which means that after one year of usage you have a reasonable chance that at least one of your interactions has become a training point.
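For anyone who wants to check the arithmetic, here is a minimal Python sketch of both scenarios. It is only an illustration under stated assumptions: the interaction volumes are made up (only the 20:1 ratio comes from the text above), the function names are my own, and each interaction is assumed to be sampled independently with the same probability.

```python
def equal_share_rates(interactions_by_language, overall_rate=0.002):
    """Per-language sampling rates if every language gets the same
    number of training samples and the overall rate is `overall_rate`."""
    total = sum(interactions_by_language.values())
    samples_per_language = overall_rate * total / len(interactions_by_language)
    return {
        lang: samples_per_language / count
        for lang, count in interactions_by_language.items()
    }


def chance_of_at_least_one(interactions, rate=0.002):
    """Probability that at least one of `interactions` interactions is
    sampled, assuming independent sampling at a fixed `rate`."""
    return 1 - (1 - rate) ** interactions


# Hypothetical volumes: only the 20:1 ratio is taken from the comment.
volumes = {"English": 20_000_000, "Thai": 1_000_000}
rates = equal_share_rates(volumes)
for lang, rate in rates.items():
    print(f"{lang}: {rate:.3%} of interactions sampled")

# Proportional sampling: every interaction has a 0.2 % chance.
p_500 = chance_of_at_least_one(500)       # after 500 interactions
p_year = chance_of_at_least_one(2 * 365)  # two interactions per day for a year
print(f"500 interactions: {p_500:.1%}")
print(f"one year at 2/day: {p_year:.1%}")
```

With these made-up volumes the Thai rate comes out 20 times the English rate, but the English rate drops below 0.2 %, which is the caveat in [1]. Under proportional sampling the chance of at least one sampled interaction is roughly 63 % after 500 interactions and roughly 77 % after a year at two per day.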
[1] Probably not actually true, because due to the large number of English speakers the percentage for English would most likely be less than 0.2 %, but right now I cannot be bothered figuring out the correct numbers.
[2] Meta question - would this generally be considered acceptable without naming the user that made the comment? Or should deleted be deleted?