
To be fair, Wikipedia basically appears in its entirety in the training data. It’s a good test to see whether the translation model and all the plumbing work well, but not whether the model generalises well.



Honest question: there are plenty of articles on Wikipedia where the different language versions of a page are vastly different (it feels like the majority in my experience, but that's no proof of course). How would that be useful as training data unless it's heavily curated?


The datasets these models are trained on are sentence pairs. So even if only a couple of sentences on two Wikipedia sites are direct translations of each other, they will have appeared in the training set. They don’t even have to appear on the same topic page: English Wikipedia might have a whole category for a topic while Estonian Wikipedia has just one long page, and direct translations will still be identified and used in training.
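
To make that concrete, a parallel training file is often just one tab-separated sentence pair per line, roughly like this (the file name and pairs here are made up for illustration):

    # Hypothetical en-et pair file: "<English sentence>\t<Estonian sentence>" per line.
    pairs = []
    with open("en-et.train.tsv", encoding="utf-8") as f:
        for line in f:
            src, tgt = line.rstrip("\n").split("\t", 1)
            pairs.append((src, tgt))

    # Each pair stands alone; nothing records which article or category
    # page either sentence originally came from.
    print(pairs[0])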

I also think that the domain and the type of language used on Wikipedia are pretty consistent, which will help a lot with unseen sentences.

By no means are these models bad! It’s just that Wikipedia is a particularly easy test for them.


How are these identified? Are they human-curated? If not, it seems like you need a translator to decide whether they are equivalent sentence pairs in order to build your translator.


You're pretty much right on the money. For ParaCrawl[1] (which I worked on) we used fast machine translation systems that were "good enough": translate one side of each candidate pair into the language of the other, check whether the two sides matched sufficiently, and then deal with all the false positives through various filtering methods. Other datasets I know of use multilingual sentence embeddings, like LASER[2], to compute the distance between two sentences.
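
For a rough feel of the embedding-distance approach, here's a minimal sketch using the laserembeddings package linked below (the example sentences and the 0.85 threshold are made up; real mining pipelines add a lot more filtering):

    import numpy as np
    from laserembeddings import Laser  # pip install laserembeddings
                                       # then: python -m laserembeddings download-models

    laser = Laser()

    # Candidate sentences pulled from an English page and an Estonian page.
    # LASER maps sentences from many languages into one shared vector space,
    # so translations land close together regardless of language.
    en_vecs = laser.embed_sentences(["Tallinn is the capital of Estonia."], lang="en")
    et_vecs = laser.embed_sentences(["Tallinn on Eesti pealinn."], lang="et")

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Keep a pair if its similarity clears a (made-up) threshold; real mining
    # also filters on length ratio, language ID, fluency, and so on.
    score = cosine(en_vecs[0], et_vecs[0])
    if score > 0.85:
        print("probable translation pair:", score)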

Both of these methods have a bootstrapping problem, but at this point in MT we have enough data for many languages to get started. Previous iterations of ParaCrawl used things like document structure and the overlap of named entities between sentences to identify matching pairs, but that approach is much less robust. I don't know how they solve this problem today for low-resource languages.
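
As a toy illustration of the named-entity-overlap idea (not ParaCrawl's actual pipeline): numbers and proper names often survive translation unchanged, so they make cheap alignment anchors:

    import re

    def rough_entities(sentence):
        # Toy stand-in for real NER: capitalised words and digit strings are
        # often copied verbatim across translations.
        tokens = re.findall(r"\w+", sentence)
        return {t for t in tokens if t[0].isupper() or t.isdigit()}

    def anchor_overlap(src, tgt):
        a, b = rough_entities(src), rough_entities(tgt)
        if not a or not b:
            return 0.0
        return len(a & b) / len(a | b)  # Jaccard overlap of shared anchors

    # Shared anchors ("Tallinn", "1248") hint that these sentences line up,
    # but nothing checks that the rest of the content matches, which is why
    # this kind of heuristic is less robust.
    print(anchor_overlap("Tallinn received city rights in 1248.",
                         "Tallinn sai linnaõigused aastal 1248."))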

[1] https://paracrawl.eu

[2] https://github.com/yannvgn/laserembeddings



