
Exactly. I work on Laser2; the approach is the same as Laser [0], but Laser2 performs better on some low-resource languages.

A more ELI5 explanation would be something like this: Laser is an encoder/decoder architecture trained on a translation task from language X to English/Spanish, with the particularity of having only one vector between the encoder and decoder. Once the system is trained (on public translation datasets), the vector the encoder produces for an input sentence represents the "meaning" of that sentence, since the decoder relies only on that information to generate a translation. And this vector representation lives in the same space for every language the system was trained on.
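The "one vector between the encoder and decoder" idea can be sketched in a few lines. This is a toy stand-in, not the real LASER encoder (which is a BiLSTM over BPE tokens): the hashed per-token pseudo-embeddings are purely an assumption for illustration, and only the max-pooling step mirrors the actual design. The point is that sentences of any length collapse into one fixed-size vector, which is all a decoder would see.

```python
import numpy as np

DIM = 1024  # LASER sentence embeddings are 1024-dimensional

def toy_encode(sentence: str) -> np.ndarray:
    """Map a sentence of any length to one fixed-size vector."""
    token_vectors = []
    for token in sentence.split():
        # Deterministic pseudo-embedding per token (hypothetical,
        # just so the example is self-contained).
        rng = np.random.default_rng(abs(hash(token)) % (2**32))
        token_vectors.append(rng.standard_normal(DIM))
    # Max-pooling over token states: the whole sequence collapses
    # into a single vector, regardless of sentence length.
    return np.max(np.stack(token_vectors), axis=0)

v_short = toy_encode("hello world")
v_long = toy_encode("a much longer sentence with many more tokens in it")
assert v_short.shape == v_long.shape == (DIM,)
```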

So we use that system to mine data from CommonCrawl: given any language pair (say Romanian-Nepali), two sentences whose vectors are close to each other in that latent space have the same meaning.
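In its simplest form the mining criterion is just a similarity check between the two encoder vectors. The sketch below uses plain cosine similarity with an illustrative threshold; the production pipeline scores candidates with a margin criterion over nearest neighbors rather than a raw threshold, so treat both the function and the 0.9 value as assumptions.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_translation_pair(v_src: np.ndarray, v_tgt: np.ndarray,
                        threshold: float = 0.9) -> bool:
    # threshold is an illustrative value, not a production setting
    return cosine(v_src, v_tgt) >= threshold

v = np.ones(4)
assert is_translation_pair(v, v)  # identical vectors always match
assert not is_translation_pair(v, np.array([1.0, -1.0, 1.0, -1.0]))
```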

We use fastText's language classifier [1] to filter CommonCrawl by language, compute the vector representations with Laser, and find close vectors with Faiss [2].
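The search step of that pipeline can be sketched with a brute-force nearest-neighbor lookup. Here numpy's exact inner-product search over L2-normalized vectors stands in for Faiss (which computes the same thing, but scaled to billions of vectors with approximate indexes); the random "corpus" and "queries" are placeholders for LASER embeddings of sentences that already passed the language-ID filter.

```python
import numpy as np

def search(queries: np.ndarray, corpus: np.ndarray, k: int = 1) -> np.ndarray:
    """Return indices of the k nearest corpus vectors per query
    (exact cosine search; Faiss does the equivalent at scale)."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = q @ c.T                       # cosine similarity matrix
    return np.argsort(-scores, axis=1)[:, :k]

rng = np.random.default_rng(0)
corpus = rng.standard_normal((100, 8))    # "target language" embeddings
# Queries are slightly perturbed copies of corpus rows 3 and 42,
# mimicking near-identical meanings in the source language.
queries = corpus[[3, 42]] + 0.01 * rng.standard_normal((2, 8))
nearest = search(queries, corpus)
assert nearest[:, 0].tolist() == [3, 42]
```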

[0] https://engineering.fb.com/ai-research/laser-multilingual-se...

[1] https://fasttext.cc/docs/en/language-identification.html

[2] https://github.com/facebookresearch/faiss




> given any language pair (say Romanian-Nepali), two sentences whose vectors are close to each other in that latent space have the same meaning.

That's the goal, but have you asked any Romanian-Nepali bilinguals how well it works? (I realize those might be hard to come by.) I had a look at some of the language pairs in CCMatrix, and I noticed that the highest-confidence matches for some of them (English-Chinese, English-Korean) include a lot of quotations from old religious texts. That wouldn't be a problem if they were actually the same quotations, but it looked more like the model managed to identify archaic language and then got overly confident that two archaic sentences must have the same meaning.

I wonder whether there's been a human evaluation of the mined training data, or whether you rely on catching any problems downstream when you measure the BLEU of the trained model.
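For concreteness, the downstream check mentioned above scores model output against reference translations with BLEU. This is a minimal unigram+bigram sketch with a brevity penalty and crude smoothing, written only to show the shape of the metric; real evaluations use a tool like sacrebleu, which also standardizes tokenization.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis: str, reference: str, max_n: int = 2) -> float:
    """Toy BLEU: geometric mean of n-gram precisions * brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((h & r).values())        # clipped n-gram matches
        total = max(sum(h.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # crude smoothing
    # Penalize hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

assert bleu("the cat sat on the mat", "the cat sat on the mat") == 1.0
```

The weakness raised above is exactly what this metric can miss: if mislabeled archaic pairs are rare relative to the test set, corpus-level BLEU barely moves.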



