They do rely on English and Spanish translations to construct the dataset.
They have existing tools to jointly embed sentences from multiple languages (LASER). These models are trained on a translation task using parallel corpora in which one side is either English or Spanish.
Using these models and the joint embeddings they produce, FB can mine the web for new pairs by roughly identifying whether two sentences in two different languages form a translation pair.
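At its core, that pair-detection step reduces to a vector-similarity check. A minimal sketch in Python (the threshold and the function name are illustrative, not FB's actual code; it assumes you already have the two joint embeddings):

```python
import numpy as np

def is_translation_pair(vec_a: np.ndarray, vec_b: np.ndarray,
                        threshold: float = 0.9) -> bool:
    """Treat two sentences as a candidate translation pair when their
    joint embeddings point in nearly the same direction (cosine similarity)."""
    cos = vec_a @ vec_b / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
    return cos >= threshold
```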
Exactly. I work on Laser2; the approach is the same as Laser [0], but Laser2 performs better on some low-resource languages.
A more ELI5 explanation would be something like this: Laser is an encoder/decoder architecture trained on a translation task from language X to English/Spanish, with the particularity of having only one vector between the encoder and decoder. Once this system is trained (on public translation datasets), the vector the encoder gives you for an input sentence represents the "meaning" of that sentence, since the decoder relies only on that information to generate a translation. And this vector representation is the same for any language the system was trained on.
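Here's a rough PyTorch sketch of that single-vector bottleneck (illustrative names, not the actual Laser code; the real encoder is a BiLSTM whose hidden states get max-pooled into one vector, which is what this mimics):

```python
import torch
import torch.nn as nn

class SingleVectorEncoder(nn.Module):
    """Compresses a whole sentence into ONE fixed-size vector. A decoder
    trained on top of this vector alone is forced to make it carry the
    sentence's meaning."""
    def __init__(self, vocab_size: int, dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.bilstm = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        states, _ = self.bilstm(self.embed(token_ids))  # (batch, seq, 2*dim)
        # Max-pool over time: one vector per sentence, whatever its length.
        sentence_vec, _ = states.max(dim=1)             # (batch, 2*dim)
        return sentence_vec

enc = SingleVectorEncoder(vocab_size=32000)
vec = enc(torch.randint(0, 32000, (1, 7)))  # -> shape (1, 1024)
```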
So we use that system to mine data from Common Crawl: given any language pair (say Romanian-Nepali), two sentence vectors close to each other in that latent space mean the sentences have the same meaning.
We use fastText's language classifier [1] to filter Common Crawl, compute the vector representations with Laser, and find close vectors with Faiss [2].
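A toy version of that pipeline (the laserembeddings package is a community wrapper standing in for our encoder; the sample sentences and the plain-cosine threshold are illustrative, and the real mining runs sharded over billions of sentences with a margin-based score rather than raw cosine):

```python
import fasttext                    # language identification [1]
import faiss                       # nearest-neighbour search [2]
from laserembeddings import Laser  # community wrapper around the LASER encoder

# Assumes fastText's public LID model (lid.176.bin) and the LASER model
# files have already been downloaded.
lid = fasttext.load_model("lid.176.bin")
laser = Laser()

def filter_lang(sentences, lang):
    """Keep only the sentences the classifier assigns to `lang`."""
    labels, _ = lid.predict(sentences)
    return [s for s, l in zip(sentences, labels) if l[0] == f"__label__{lang}"]

# Toy stand-ins for text pulled out of Common Crawl.
ro_sents = filter_lang(["Bună ziua, ce mai faci?"], "ro")
ne_sents = filter_lang(["नमस्ते, तपाईंलाई कस्तो छ?"], "ne")

# Embed both sides into the shared space; L2-normalize so that inner
# product equals cosine similarity.
ro_vecs = laser.embed_sentences(ro_sents, lang="ro").astype("float32")
ne_vecs = laser.embed_sentences(ne_sents, lang="ne").astype("float32")
faiss.normalize_L2(ro_vecs)
faiss.normalize_L2(ne_vecs)

# Index the Nepali side; for each Romanian sentence, take its nearest
# neighbour and keep pairs above an (illustrative) similarity threshold.
index = faiss.IndexFlatIP(ne_vecs.shape[1])
index.add(ne_vecs)
scores, ids = index.search(ro_vecs, 1)
candidates = [(ro_sents[i], ne_sents[ids[i, 0]], float(scores[i, 0]))
              for i in range(len(ro_sents)) if scores[i, 0] > 0.9]
```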
> given any language pair (say Romanian-Nepali), two sentence vectors close to each other in that latent space mean the sentences have the same meaning
That's the goal, but have you asked any Romanian-Nepali bilinguals how well it works? (I realize those might be hard to come by.) I had a look at some of the language pairs in CCMatrix, and I noticed that the highest-confidence matches for some of them (English-Chinese, English-Korean) include a lot of quotations from old religious texts. That wouldn't be a problem if they were actually the same quotations, but it looked more like the model managed to identify archaic language and then got overly confident that two archaic sentences must have the same meaning.
I wonder whether there's been a human evaluation of the mined training data, or whether you rely on catching any problems downstream when you measure the BLEU of the trained model.
Of course, the problem with these "synthetic" datasets is that the translation models can overfit to the flaws of the multilingual sentence encoders/embedders used to mine them.
Still, it lets you massively increase the amount of data you're training on, which is generally worth it.