
To be fair, Wikipedia basically appears in its entirety in the training data. It’s a good test to see whether the translation model and all the plumbing work well, but not whether the model generalises well.



Honest question: there are plenty of articles on Wikipedia where the different language versions of a page are vastly different (it feels like the majority in my experience, but that's no proof of course). How would that be useful as training data unless it's heavily curated?


The datasets these models are trained on are sentence pairs. So even if only a couple of sentences on two Wikipedia sites are direct translations of each other, they will have appeared in the training set. They don’t even have to appear on the same topic page: English Wikipedia might have a whole category for a topic while Estonian Wikipedia has just one long page, and direct translations will still be identified and used in training.
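
To make that concrete, a parallel training file is often just one tab-separated sentence pair per line, roughly like this (the file name and pairs here are made up for illustration):

    # Hypothetical en-et pair file: "<English sentence>\t<Estonian sentence>" per line.
    pairs = []
    with open("en-et.train.tsv", encoding="utf-8") as f:
        for line in f:
            src, tgt = line.rstrip("\n").split("\t", 1)
            pairs.append((src, tgt))

    # Each pair stands alone; nothing records which article or category
    # page either sentence originally came from.
    print(pairs[0])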

I also think that the domain and the type of language used on Wikipedia are pretty consistent, which will help a lot with unseen sentences.

By no means are these models bad! It’s just that Wikipedia is a particularly easy test for them.


How are these identified? Are they human-curated? If not, it seems like you need a translator to decide whether they are equivalent sentence pairs in order to build your translator.


You're pretty much right on the money. For ParaCrawl[1] (which I worked on) we used fast machine translation systems that were "good enough": translate one side of each candidate pair into the language of the other, check whether the two sides matched sufficiently, and then deal with all the false positives through various filtering methods. Other datasets I know of use multilingual sentence embeddings, like LASER[2], to compute the distance between two sentences.
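
For a rough feel of the embedding-distance approach, here's a minimal sketch using the laserembeddings package linked below (the example sentences and the 0.85 threshold are made up; real mining pipelines add a lot more filtering):

    import numpy as np
    from laserembeddings import Laser  # pip install laserembeddings
                                       # then: python -m laserembeddings download-models

    laser = Laser()

    # Candidate sentences pulled from an English page and an Estonian page.
    # LASER maps sentences from many languages into one shared vector space,
    # so translations land close together regardless of language.
    en_vecs = laser.embed_sentences(["Tallinn is the capital of Estonia."], lang="en")
    et_vecs = laser.embed_sentences(["Tallinn on Eesti pealinn."], lang="et")

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Keep a pair if its similarity clears a (made-up) threshold; real mining
    # also filters on length ratio, language ID, fluency, and so on.
    score = cosine(en_vecs[0], et_vecs[0])
    if score > 0.85:
        print("probable translation pair:", score)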

Both of these methods have a bootstrapping problem, but at this point in MT we have enough data for many languages to get started. Previous iterations of ParaCrawl used things like document structure and the overlap of named entities between sentences to identify matching pairs, but that approach is much less robust. I don't know how they solve this problem today for low-resource languages.
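
As a toy illustration of the named-entity-overlap idea (not ParaCrawl's actual pipeline): numbers and proper names often survive translation unchanged, so they make cheap alignment anchors:

    import re

    def rough_entities(sentence):
        # Toy stand-in for real NER: capitalised words and digit strings are
        # often copied verbatim across translations.
        tokens = re.findall(r"\w+", sentence)
        return {t for t in tokens if t[0].isupper() or t.isdigit()}

    def anchor_overlap(src, tgt):
        a, b = rough_entities(src), rough_entities(tgt)
        if not a or not b:
            return 0.0
        return len(a & b) / len(a | b)  # Jaccard overlap of shared anchors

    # Shared anchors ("Tallinn", "1248") hint that these sentences line up,
    # but nothing checks that the rest of the content matches, which is why
    # this kind of heuristic is less robust.
    print(anchor_overlap("Tallinn received city rights in 1248.",
                         "Tallinn sai linnaõigused aastal 1248."))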

[1] https://paracrawl.eu

[2] https://github.com/yannvgn/laserembeddings



