
Exactly. I work on Laser2; the approach is the same as Laser [0], but Laser2 performs better on some low-resource languages.

A more ELI5 explanation would be something like this: Laser is an encoder/decoder architecture trained on a translation task from language X to English/Spanish, with the particularity of having only one vector between the encoder and decoder. Once the system is trained (on public translation datasets), the vector the encoder produces for an input sentence represents the "meaning" of that sentence, since the decoder relies only on that information to generate a translation. And this vector representation lives in the same space for every language the system was trained on.
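The "one vector between the encoder and decoder" idea can be sketched in a few lines. This is a toy stand-in, not the real LASER encoder (which is a BiLSTM over BPE tokens): the hashed per-token pseudo-embeddings are purely an assumption for illustration, and only the max-pooling step mirrors the actual design. The point is that sentences of any length collapse into one fixed-size vector, which is all a decoder would see.

```python
import numpy as np

DIM = 1024  # LASER sentence embeddings are 1024-dimensional

def toy_encode(sentence: str) -> np.ndarray:
    """Map a sentence of any length to one fixed-size vector."""
    token_vectors = []
    for token in sentence.split():
        # Deterministic pseudo-embedding per token (hypothetical,
        # just so the example is self-contained).
        rng = np.random.default_rng(abs(hash(token)) % (2**32))
        token_vectors.append(rng.standard_normal(DIM))
    # Max-pooling over token states: the whole sequence collapses
    # into a single vector, regardless of sentence length.
    return np.max(np.stack(token_vectors), axis=0)

v_short = toy_encode("hello world")
v_long = toy_encode("a much longer sentence with many more tokens in it")
assert v_short.shape == v_long.shape == (DIM,)
```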

So we use that system to mine data from CommonCrawl: given any language pair (say Romanian-Nepali), two sentences whose vectors are close to each other in that latent space have the same meaning.
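In its simplest form the mining criterion is just a similarity check between the two encoder vectors. The sketch below uses plain cosine similarity with an illustrative threshold; the production pipeline scores candidates with a margin criterion over nearest neighbors rather than a raw threshold, so treat both the function and the 0.9 value as assumptions.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_translation_pair(v_src: np.ndarray, v_tgt: np.ndarray,
                        threshold: float = 0.9) -> bool:
    # threshold is an illustrative value, not a production setting
    return cosine(v_src, v_tgt) >= threshold

v = np.ones(4)
assert is_translation_pair(v, v)  # identical vectors always match
assert not is_translation_pair(v, np.array([1.0, -1.0, 1.0, -1.0]))
```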

We use fastText's language classifier [1] to filter CommonCrawl by language, compute the vector representations with Laser, and find close vectors with Faiss [2].
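The search step of that pipeline can be sketched with a brute-force nearest-neighbor lookup. Here numpy's exact inner-product search over L2-normalized vectors stands in for Faiss (which computes the same thing, but scaled to billions of vectors with approximate indexes); the random "corpus" and "queries" are placeholders for LASER embeddings of sentences that already passed the language-ID filter.

```python
import numpy as np

def search(queries: np.ndarray, corpus: np.ndarray, k: int = 1) -> np.ndarray:
    """Return indices of the k nearest corpus vectors per query
    (exact cosine search; Faiss does the equivalent at scale)."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = q @ c.T                       # cosine similarity matrix
    return np.argsort(-scores, axis=1)[:, :k]

rng = np.random.default_rng(0)
corpus = rng.standard_normal((100, 8))    # "target language" embeddings
# Queries are slightly perturbed copies of corpus rows 3 and 42,
# mimicking near-identical meanings in the source language.
queries = corpus[[3, 42]] + 0.01 * rng.standard_normal((2, 8))
nearest = search(queries, corpus)
assert nearest[:, 0].tolist() == [3, 42]
```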

[0] https://engineering.fb.com/ai-research/laser-multilingual-se...

[1] https://fasttext.cc/docs/en/language-identification.html

[2] https://github.com/facebookresearch/faiss




> given any language pair (say Romanian-Nepali), two sentences whose vectors are close to each other in that latent space have the same meaning.

That's the goal, but have you asked any Romanian-Nepali bilinguals how well it works? (I realize those might be hard to come by.) I had a look at some of the language pairs in CCMatrix, and I noticed that the highest-confidence matches for some of them (English-Chinese, English-Korean) include a lot of quotations from old religious texts. That wouldn't be a problem if they were actually the same quotations, but it looked more like the model managed to identify archaic language and then got overly confident that two archaic sentences must have the same meaning.

I wonder whether there's been a human evaluation of the mined training data, or whether you rely on catching any problems downstream when you measure the BLEU of the trained model.
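For concreteness, the downstream check mentioned above scores model output against reference translations with BLEU. This is a minimal unigram+bigram sketch with a brevity penalty and crude smoothing, written only to show the shape of the metric; real evaluations use a tool like sacrebleu, which also standardizes tokenization.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis: str, reference: str, max_n: int = 2) -> float:
    """Toy BLEU: geometric mean of n-gram precisions * brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((h & r).values())        # clipped n-gram matches
        total = max(sum(h.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # crude smoothing
    # Penalize hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

assert bleu("the cat sat on the mat", "the cat sat on the mat") == 1.0
```

The weakness raised above is exactly what this metric can miss: if mislabeled archaic pairs are rare relative to the test set, corpus-level BLEU barely moves.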



