
That's not a great metric, and it's going to be incredibly language dependent. For example, the European languages share a lot of similarities, so it should be unsurprising that a model trained on English can do pretty well on French and German. But if you test it on a language that is fairly disjoint (say, Chinese), then you're held back by the lack of data for that language in the dataset (or you run into the exact same issue as before).

It's definitely a metric worth trying, but we must also recognize its serious limits. Evaluation is quite difficult, and the better our models perform, the more difficult evaluation becomes. Anyone saying otherwise is likely trying to sell you something.
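
To make the language-dependence point concrete, here's a minimal sketch (the languages and scores are entirely made up, not from any real benchmark): an aggregate accuracy can look fine while hiding exactly the kind of gap on a more distant language described above, which is why per-language breakdowns matter.

    # Hypothetical eval results; the outcomes are invented purely to
    # illustrate how an aggregate metric can mask a per-language gap.
    from collections import defaultdict

    # (language, answered_correctly) pairs from a hypothetical run
    results = [
        ("en", True), ("en", True), ("en", True), ("en", False),
        ("fr", True), ("fr", True), ("fr", False),
        ("de", True), ("de", True), ("de", False),
        ("zh", True), ("zh", False), ("zh", False), ("zh", False),
    ]

    overall = sum(correct for _, correct in results) / len(results)
    print(f"overall accuracy: {overall:.2f}")  # looks respectable in aggregate

    by_lang = defaultdict(list)
    for lang, correct in results:
        by_lang[lang].append(correct)

    for lang, outcomes in sorted(by_lang.items()):
        acc = sum(outcomes) / len(outcomes)
        print(f"{lang}: {acc:.2f} (n={len(outcomes)})")  # the zh gap only shows up here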



