This is a really interesting example. I can't read Chinese but I'm assuming the input is an address or location?
Such domain knowledge would be vital in producing a good translation and sanity-checking the output, but a straight sequence-to-sequence machine translation would not capture that context. It looks like that is what is happening with the first two translations, while Google's may have actually realized it's an address (but maybe not; as you say, the answer is wrong, so maybe their ML is just better).
Your example highlights the point that naked ML models can only ever be so good, and that it's really as part of a system that they can be truly effective. You can imagine the translation pipeline combining a classifier or NER model that identifies an address, a translation model, and an English model that checks whether the answer is sensible.
Without context it's a good attempt; only with deeper cultural knowledge would someone be able to guess that "Miao" on its own is too short for a street name and that it's likely "Tangmiao" together. Though without knowing the place, I think it would also be reasonable to guess that the whole thing is the road name, like "Gudangtangmiao Road".
Actually DeepL seems to be the best of the bunch.
Here is another test:
Input: 北京市西城区地安门西大街49号
DeepL: No.49, Di'anmen West Street, Xicheng District, Beijing
Google: 49 Di'anmen West Street, Xicheng District, Beijing
Watson: No. 49 Avenue West Main Street in Xicheng District, Beijing
Libre: 49th Anniversary Street, Westtown, Beijing
(Libre is the worst for all address-type input, which is what I'm interested in. Shameless plug: I'm building a geocoder for China at https://geocode.xyz/CN . I've tested over 3k addresses so far.)
DeepL seems to have a bit of an issue with uncommon street suffixes in Chinese. For example:
江苏省苏州市姑苏区东中市374号槔桥头
Google: Bridge Head, No. 374, Dongzhong City, Gusu District, Suzhou City, Jiangsu Province
Deepl: Pulley Bridge, No. 374, Dongzhong City, Gusu District, Suzhou, Jiangsu Province
Libre: Cambridge No. 374 in the eastern part of the city of Jiangsu, province of Jiangsu
Watson: No. 374 Bridge Head in East China, Suzhou, Suzhou, Jiangsu Province.
You can kind of see which service leans literal versus interpretive. It might be a bit unfair to use an address that is more than just the literal street address, although in actual speech this address is at an intersection with two bridges, and it's enough of a local landmark that the addition makes sense. The uncertainty over the proper name of the bridge actually doesn't throw Google or DeepL off.

It's the 市 suffix for a street, not unique in this city but certainly rare, that gets all of the services. Libre at least gives it a try; Watson just pretends it doesn't exist and, for whatever reason, translates the old name of the city to the new name of the city, which now encompasses a much greater area and exists in a different context. DeepL seems to have figured out that you aren't really supposed to have two cities in one address and tries to rectify that in spite of the literal text. I would imagine a human translator would use the entirety of the street name and append "street" in English.

It's definitely interesting to see how these services handle somewhat nonstandard, much older address patterns that don't necessarily originate in Mandarin and frequently relate to local landmarks that no longer exist, all of which requires contextual work beyond your standard post-1949 street naming, which tends to be fairly standardized both thematically and in form.
This could become big; however, there is still work to do in order to match something like DeepL[1]
I took [2] and had it translated on both sites, and while the translation was already pretty good (definitely good enough to understand the article), it made some fatal mistakes: "trillion" in English is "Billionen" in German, while "billion" in English is "Milliarden" in German, and it got only the latter right. Not something you want for a scientific text...
What amazed me with Deepl.com was this: There is a paragraph
Yuanming Wang, a doctoral candidate in the School of Physics at the University of Sydney, has developed an
ingenious method to help track down the missing matter. She has applied her technique...
In German, there is a word for a female "doctoral candidate": "Doktorandin". LibreTranslate did not apply this language construct and spoke of a "Doktorand" (male form), but translated the "she" correctly. DeepL, however, got it right. Which is amazing.
DeepL is really impressive.
I translated a very convoluted very technical text from German to English and the result was ok.
I still have to edit it a bit to make it sound less German, but overall it's readable and understandable.
It would have taken me quite a few hours to translate it, now I just have to edit it a bit.
I tried this on a fragment of the Artikel des Tages (article of the day) box on the German Wikipedia (https://de.wikipedia.org/wiki/) for today (February 7, 2021).
Input: "Das Frauenstimmrecht in der Schweiz (Stimm- und Wahlrecht) wurde durch eine eidgenössische Volksabstimmung am 7. Februar 1971 eingeführt. Formell wurde das Frauenstimmrecht am 16. März 1971 wirksam."
Result: "Women's right to vote in Switzerland (voice and election rights) was governed by a federal referendum on 7 October. In the case of the United Kingdom, the Commission has not been able to take the necessary measures to ensure that the aid is granted. In the case of women's rights, the March 1971."
The translation of the first sentence conveys the meaning but changes the date from February 7 to October 7 and drops the year. The next part triggers... something in the translation engine, in any case the original has nothing to do with the UK or the (EU) Commission.
I suspect this is due to German date notation, where "7th" is written "7.", and the period confuses the translator, making it think a sentence ends there. Without periods:
Input: "Das Frauenstimmrecht in der Schweiz (Stimm- und Wahlrecht) wurde durch eine eidgenössische Volksabstimmung am 7 Februar 1971 eingeführt. Formell wurde das Frauenstimmrecht am 16 März 1971 wirksam."
Result: "Women's right to vote in Switzerland (voice and election rights) was introduced by a federal referendum on 7 February 1971. In addition, the Council adopted a resolution on the establishment of a European Community strategy for the development of the European Union."
The first sentence is fine now. The second is nonsense, the original sentence refers to Switzerland and means "Women's right to vote formally came into force on March 16, 1971". Still nothing to do with the EU. This broken translation remains the same when looking at just this sentence in isolation.
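The period hypothesis is easy to illustrate. Here's a toy sentence splitter (my own sketch, not the project's actual preprocessing code) that treats every period followed by whitespace as a sentence boundary; it duly chops the German date in half, so the model would see a fragment ending in "am 7." with the rest of the date stranded in the next "sentence":

```python
import re

def naive_split(text):
    # Split at whitespace that immediately follows a period.
    # German ordinal dates ("am 7. Februar 1971") get cut mid-date.
    return re.split(r"(?<=\.)\s+", text)

german = ("Das Frauenstimmrecht in der Schweiz wurde durch eine "
          "eidgenössische Volksabstimmung am 7. Februar 1971 eingeführt.")
for part in naive_split(german):
    print(part)
# The first fragment ends with "am 7." and the date is split apart.
```

A real splitter would need an abbreviation/ordinal list or a learned model (such as the ones NLTK or spaCy ship) to avoid this.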
So: Thanks for this, I'm very happy to see this project. But it needs some more work to iron out the kinks. Good luck!
> The translation of the first sentence conveys the meaning but changes the date from February 7 to October 7 and drops the year.
As discussed earlier in this thread, Argos Translate tokenizes the incoming text into what are essentially words and then passes them to the Transformer network. For situations where the risk of translations that are fluent and seem reasonable but are totally wrong is high, neural translation might not make sense even though the quality is generally higher. Instead you would probably want to use rule-based translation, like what the Apertium project does.
Was the model trained on a corpus of EU documents? I'd imagine that's a great source of multilingual training data, but could lead to an obsession with the EU in the translations
It was partially trained on EU and UN documents. Interesting observation; putting less emphasis on parliamentary/legal data in favor of a data set like OpenSubtitles, which has more casual language use, might give better performance.
The sentence in question could easily be found in a legal document, it's not really casual language.
The word "Formell" seems to trigger the EU focus, maybe it's very characteristic for established norms within the EU's translation service. Changing it to the closely related "Formal" makes the translation work: "The law on women's voting became effective on 16 March 1971." Another option would be "Offiziell", which gives the so-so: "Women's rights were officially enforced on 16 March 1971."
It does a somewhat decent job of English <-> Russian, keeping most of the meaning, but some mistakes it makes are hilarious.
For example, I tried this: "I had tried translating a test sentence and I got a rate limit related error..." It used the word "приговор", which does mean "sentence", except it's the kind courts give. The right word is "предложение", which means a (grammatical) sentence, but also a suggestion.
Languages are hard. I'm impressed.
edit: "I'm impressed." for some reason translates into "Я впечатлена." — technically correct but as though the speaker is female. That's weird.
In Russian, the "default" form is male. Usually. For adjectives specifically, the base (dictionary) form is masculine singular. Verbs only have gender in the past tense, so there's no default, I guess?
The process for translation is to "tokenize" a stream of characters into "tokens" like this:
"Hello Sarah, what time is it?"
<Hello><_Sarah><,><_what><_time><_is><_it><?>
Then the tokens are used for inference with a Transformer net. There are no guarantees that the output will be consistent under even small changes in the input: based on the data it was trained on (and luck), the network has slightly different connections for the <_Sara> token than for the <_Sarah> token, leading to different output.
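The tokenization step above can be sketched in a few lines. A real system uses a learned SentencePiece/BPE vocabulary, but this toy regex-based version (my own approximation, not Argos Translate's code) produces the same surface form for that sentence, with a leading space folded into the token and shown as "_":

```python
import re

def tokenize(text):
    # Grab each word or punctuation mark together with the space
    # (if any) that precedes it, then mark that space as "_".
    pieces = re.findall(r" ?\w+| ?[^\w\s]", text)
    return ["<" + p.replace(" ", "_") + ">" for p in pieces]

print("".join(tokenize("Hello Sarah, what time is it?")))
# <Hello><_Sarah><,><_what><_time><_is><_it><?>
```

Real subword tokenizers also split rare words into multiple pieces, which this sketch doesn't attempt.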
Very glad to see something like this, this was on my high-priority Free software needs list.
I'd very much like if it could be used programmatically, and not just via API - C/Python/Rust bindings, etc. I'd like to build some automatically translating forum software with it.
Well, it would be nice if it worked, but it could not even translate "merry christmas" to German; it just left it as is.
Apparently it needs the C to be capitalized ...
This is really cool. I sent it several paragraphs to translate and the results were quite understandable. On a quick check I don't think the quality is as good as Google Translate (currently), but it's pretty darn decent already & I expect it to keep getting better. No machine translator is going to be perfect... but for many uses that's fine, it just needs to be "good enough".
It comes up with some amusing translations. E.g., "Fruit flies like bananas" translates into French as "Les fruits volent comme des bananes", which is something like "The fruits fly like bananas." This is a standard example of a translation problem: in "Fruit flies", is "flies" a verb, or are we talking about the animals called fruit flies?
I encourage people to contribute to the OPUS parallel corpus database (data or help): https://opus.nlpl.eu/
That's the training data set used by LibreTranslate. In general, the better the corpus used for training, the better the translation results will be.
I would assume Hong Kong records, since Cantonese is an official language there (at least for English<->Cantonese); I would also assume Guangdong province would be a source of material as well.
For most people in Guangdong, Cantonese is at most a spoken language. They learn Standard Written Chinese with Mandarin pronunciation at school and if they want to write down something said in Cantonese, they might substitute characters of equivalent meaning (e.g. 是 instead of 係) or with similar pronunciation instead of the "official" characters used in Hong Kong.
The Hong Kong government is not much different. Their actual language policy is "Chinese and English are the official languages of Hong Kong. Committed to openness and accountability, the Government produces important documents in both English and Chinese. Correspondence with individual members of the public is always in the language appropriate to the recipients. Simultaneous interpretation in English / Cantonese / Putonghua is made available to meetings of the Legislative Council and Government boards and committees as needed." https://www.csb.gov.hk/english/aboutus/org/scsd/1470.html
So they recognize Cantonese and Putonghua as different spoken forms, but only one written language. I've never seen a Hong Kong government website offer translation into both Cantonese and Mandarin, it's always just Standard Written Chinese with a choice of Traditional or Simplified characters.
Most written Cantonese content on the internet is probably produced by Hong Kongers in informal contexts such as forums, but then it's not clearly marked as such and might be mixed with Standard Written Chinese and English.
It's interesting and sad to see the forced assimilative process of erasing written Cantonese. I remember HK in the late 90s still had newspapers that published in written Cantonese. Just across the border in Shenzhen, without the British influence and prior to the explosion of industry and tech in the early 2000s, you could still see nonstandard signage in Cantonese in store windows.

I think getting rid of spoken Cantonese is likely a generational effort, not something that can be done in a decade or so, but I've both experienced and done field work on how the Wu dialects were more or less systematically erased from official, and now even private, realms. The Shanghai variety, itself developed only in the early 1800s from a pidgin of the Suzhou and Nanjing varieties mixed with northern influences, is actually quite well documented in writing by foreign sources, with a pidgin developing off of that, English, and Portuguese that also survives in English sources and has been studied academically in great detail by Chinese authors in English, but not in Chinese to anything close to the same degree.

Starting with the millennial generation, speaking the dialects in schools, even outside of class, became subject to punishment. With public education enforcing the rule starting at the pre-kindergarten level, across two or three generations even those whose first language is one of the dialects were more or less forced into becoming Mandarin speakers and lost their fluency. I have little reason to doubt that something similar will simply be forced upon Hong Kong as well. Luckily sci-hub is your friend, and written Cantonese seems to be better represented than written Wu, going by a cursory search.
In typical academic fashion, it's behind a login wall and doesn't offer an easy way to download the whole corpus. (Understandable, given that it's based on transcribing movies that are probably still copyright-protected, but annoying.) Also, no translations.
That corpus is CC-BY licensed (yay!) and puts the download page front and center, so I like it. There are no translations either, but recordings are included, so it might still be useful for a project of mine.
An interesting alternative OSS natural language translator is Apertium: https://www.apertium.org/
Their public site doesn't support English/French, though, so it's hard for me to judge it.
Hmm... "Let's see how well it works." seems to be handled correctly when translating from English into any language except Chinese, where the apostrophe is turned into <unk> and the sentence is otherwise untranslated.
Does that mean there's a different model for each language pair?
There are different models for each language pair. Currently there are only pre-trained models to and from English and other language pairs "pivot" through English.
Thanks for the explanation. Pivoting through English isn't ideal, but I'm just glad someone is working on this at all.
Thinking about it a bit more, it's a bit weird that the failure mode of a weak model would be to regurgitate the input unchanged. I'd rather have expected random Chinese gibberish in that case. Doesn't that mean the model has seen at least a few cases where English sentences were left untranslated in the training data?
I wanted to download the training data to check, but the instructions here https://github.com/argosopentech/onmt-models#download-data say to use OPUS-Wikipedia, which has no en-zh pairs, so the Chinese data must be from some other source.
Pivoting through English isn't inherent to Argos Translate; you could train a French-German model or whatever you want. I've just been focusing on training models to and from English in order to add new languages. The ideal strategy is to have models that know multiple languages.
Quoting a previous HN comment:
I think cloud translation is still pretty valuable in a lot of cases, since the model for one single translation direction is ~100MB. In addition to offering more language options without a large download, cloud translation lets you use more specialized models, for example French to Spanish; I just have a model to and from English for each language, and any other translations have to "pivot" through English. With cloud translation you can also use one model with multiple input and output languages, which gives better quality between languages that don't have as much data available and lets you support direct translation between a large number of languages. Here's a talk where Google explains how they do this for Google Translate: https://youtu.be/nR74lBO5M3s?t=1682. You could do this locally, but it would have its own set of challenges around getting the right model for the languages you want to translate.
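For what it's worth, the pivot strategy is simple to sketch. The dictionaries below are toy stand-ins for real per-pair neural models, and the function is my own illustration rather than the Argos Translate API:

```python
# Toy stand-ins for per-language-pair models: only X<->English exists.
MODELS = {
    ("fr", "en"): {"bonjour": "hello"},
    ("en", "de"): {"hello": "hallo"},
}

def translate(text, src, tgt):
    if (src, tgt) in MODELS:
        # A direct model exists: translate word by word (toy version).
        table = MODELS[(src, tgt)]
        return " ".join(table.get(w, w) for w in text.split())
    # No direct model: pivot through English (src -> en, then en -> tgt).
    return translate(translate(text, src, "en"), "en", tgt)

print(translate("bonjour", "fr", "de"))  # "hallo", via English
```

The downside the thread mentions is visible in the structure: any information the intermediate English rendering loses (gender, formality, ambiguity) is gone before the second hop starts.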
> Thinking about it a bit more, it's a bit weird that the failure mode of a weak model would be to regurgitate the input unchanged. I'd rather have expected random Chinese gibberish in that case. Doesn't that mean the model has seen at least a few cases where English sentences were left untranslated in the training data?
This was added last week, it's just not live on libretranslate.com yet:
That's definitely an improvement, but what I was actually wondering about was why the model was copying the input verbatim in the first place.
Looking through the en-zh data in Opus, it looks like a bit of a mixed bag. This sample from OpenSubtitles v1 contains an ad for MyDVDrip in the Chinese version only: http://opus.nlpl.eu/OpenSubtitles/v1/en-zh_sample.html Filtering the data with some heuristics might be a good idea.
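A sketch of the kind of heuristic filtering that might help. The markers and thresholds here are illustrative guesses on my part, not anything LibreTranslate or OPUS actually uses:

```python
# Drop sentence pairs that look like embedded subtitle ads or bad
# alignments. "字幕" means "subtitles" and shows up in fansub credits.
AD_MARKERS = ("http://", "www.", "DVDrip", "字幕")

def keep_pair(src, tgt):
    if any(m.lower() in (src + tgt).lower() for m in AD_MARKERS):
        return False                    # looks like an ad or credit line
    if not src.strip() or not tgt.strip():
        return False                    # one side is empty
    ratio = len(src) / max(len(tgt), 1)
    return 0.2 <= ratio <= 5.0          # reject wildly mismatched lengths

pairs = [
    ("Hello.", "你好。"),
    ("Hello.", "MyDVDrip 字幕组出品"),
]
print([keep_pair(s, t) for s, t in pairs])  # [True, False]
```

The length-ratio bound would need tuning for Chinese, where one character often corresponds to several English characters.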
You're right, there's surprisingly little data available for English-Chinese. I'm not sure why; it seems like there would be a lot of demand for translating between them.
There's a lot of demand, but the people most likely to satisfy that demand for free are Chinese fan translators, and they're more likely to upload their work to Chinese sites where Western dataset collectors are unlikely to find them...
I was looking for a fuzzy thesaurus of software terminology to help with searching docs; I tend to try the wrong keywords/phrases too often. I did not find anything; the existing ones are all too general, with too many synonyms unrelated to software/data. Can this be used somehow?
Data availability is going to be a problem, I think. Checking Malagasy Wikipedia https://mg.wikipedia.org , there are only 93k articles. That's even less than Latin Wikipedia https://la.wikipedia.org at 134k. And much of the text in these articles probably isn't a direct translation of an article in another language, so the amount usable for parallel-text mining is going to be very small.
"Cookie" is used in French as well to describe American-style cookies: the big ones with chocolate and nuts inside. The usual French word for a cookie is "biscuit", which is typically hard, flat, and unleavened. So "cookie" can be said to be a French word as well :)
[1] https://www.npmjs.com/package/translate
> Note: it takes few minutes to show on npmjs.com, but if you do `npm info translate` you can see how the new version has been published.