Show HN: LibreTranslate – Open-source neural machine translation API (libretranslate.com)
216 points by pjfin123 on Feb 6, 2021 | 65 comments



This is pretty sweet! Especially it being free. I just added support for LibreTranslate to my JavaScript translation library, `translate` [1]:

    import translate from 'translate';
    translate.engine = 'libre';

    const text = await translate('Hello world', 'es');
Please let me know in an issue if you start charging for it at some point so I can point out how to get a key and everything!

[1] https://www.npmjs.com/package/translate

> Note: it takes a few minutes to show on npmjs.com, but if you do `npm info translate` you can see that the new version has been published.
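For those not using JavaScript, the same public instance can also be hit directly over HTTP. A minimal Python sketch, assuming the `/translate` endpoint and the `q`/`source`/`target` JSON fields of libretranslate.com (check the instance's own docs; field names may differ between versions):

```python
import json
import urllib.request

def build_request(text, source, target,
                  url="https://libretranslate.com/translate"):
    # Build a POST request for the (assumed) /translate endpoint.
    data = json.dumps({"q": text, "source": source, "target": target}).encode()
    return urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"})

# Uncomment to actually call the public instance (rate limits apply):
# with urllib.request.urlopen(build_request("Hello world", "en", "es")) as r:
#     print(json.load(r)["translatedText"])
```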


Very cool, thanks for making this!


Comparing this to IBM Watson and Google Translate:

Chinese Input: 古荡塘苗路华星现代产业园E座正门

LibreTranslate: Ordinary gate of the modern industrial plant of the Hyong Chung Chung

IBM Watson: Ancient Slut Pond Miao Luhua Star Modern Industrial Park E Zhengmen

Google Translate: Main entrance of Block E, Huaxing Modern Industrial Park, Miao Road, Gudangtang

None is accurate, but it is nice to have options.


This is a really interesting example. I can't read Chinese but I'm assuming the input is an address or location?

Such domain knowledge would be vital in providing a good translation and sanity-checking the output. But a straight sequence-to-sequence machine translation would not capture that context. It looks like that is what is happening with the first two translations, while Google's may have actually recognized it's an address (but maybe not; as you say the answer is still wrong, so maybe their ML is just better).

Your example highlights the point that naked ML models can only ever be so good, and that it's only as part of a larger system that they can be truly effective. You can imagine in the translation some combination of a classifier or NER that identifies an address, a translation model, and an English model that detects a sensible answer.


What is the correct translation? That a bilingual human might give?


Google's, I think. It's an address referring to the main gate of an area.

Actually it's pretty darn accurate from what I can tell. Maybe my Chinese just isn't very good, but I don't think I can do much better.


Yeah, it's almost perfect; the only inaccuracy is that the district is just "Gudang" and "tang" is part of the road name (i.e. "Tangmiao Road").

If you check the location on Google Maps you can see it's Tangmiao Road and Gudang District (the "District" part is not present in the address): https://www.google.com/maps/place/Huaxing+Modern+Industrial+...

Without context it's a good attempt; only with deeper cultural knowledge would someone be able to guess that "Miao" on its own is too short for a street name and that it's likely "Tangmiao" together. Though without knowing the place, I think it would also be reasonable to guess that the whole thing is the road name, like "Gudangtangmiao Road".

I guess using Google Maps to validate Google Translate is kinda questionable so here's the same location on Bing Maps: https://www.bing.com/maps?osid=d3dfc1f0-f8ed-48cd-898d-0a765...


DeepL does get the segmentation correct:

"Main entrance of Block E of Huaxing Modern Industrial Park, Gudang Tangmiao Road"

https://www.deepl.com/translator#zh/en/%E5%8F%A4%E8%8D%A1%E5...


Actually DeepL seems to be the best of the bunch. Here is another test:

Input: 北京市西城区地安门西大街49号

DeepL: No.49, Di'anmen West Street, Xicheng District, Beijing

Google: 49 Di'anmen West Street, Xicheng District, Beijing

Watson: No. 49 Avenue West Main Street in Xicheng District, Beijing

Libre: 49th Anniversary Street, Westtown, Beijing

(Libre is the worst for all address-type input - which is what I'm interested in - shameless plug - I'm building a Geocoder for China at https://geocode.xyz/CN . I've so far tested over 3k addresses)


Deepl seems to have a bit of an issue with uncommon street suffixes in Chinese. For example: 江苏省苏州市姑苏区东中市374号槔桥头

Google: Bridge Head, No. 374, Dongzhong City, Gusu District, Suzhou City, Jiangsu Province

Deepl: Pulley Bridge, No. 374, Dongzhong City, Gusu District, Suzhou, Jiangsu Province

Libre: Cambridge No. 374 in the eastern part of the city of Jiangsu, province of Jiangsu

Watson: No. 374 Bridge Head in East China, Suzhou, Suzhou, Jiangsu Province.

You can kind of see which service goes for the literal more than the interpretive. It might be a bit unfair to use an address that is more than just the literal street address, although in actual speech this address is at an intersection with two bridges, and it's enough of a local landmark that the addition would make sense.

Except for the uncertainty over the proper name of the bridge, which actually doesn't throw Google or Deepl off, it's the 市 suffix for a street, not unique in this city but certainly rare, that gets all of the services. Libre at least gives it a try; Watson just pretends it doesn't exist and for whatever reason translates the old name of the city to the new name of the city, which obviously now encompasses a much greater area and exists in a different context. Deepl seems to have figured out that you aren't really supposed to have two cities in one address and tries to rectify that in spite of the literal. I would imagine that a human translator would use the entirety of the street name and add "street" in English at the end.

Definitely interesting to see how these services handle somewhat nonstandard and much older address patterns that don't necessarily originate in Mandarin and frequently relate to local landmarks that no longer exist, all of which requires some contextual work beyond your standard post-1949 naming of streets, which tends to be fairly standardized both thematically and in form.


Deepl is the best in 99% of cases. I have no idea how they managed to get so much better than the other players in the translation game.


It's better than Watson, shrug.


I haven't seen a machine translation as hilariously bad as Watson's in years.


Should have included Bing's; their Chinese translation is actually more developed than I had previously thought, until someone showed me.


This could become big; however, there is still work to do in order to match something like DeepL [1].

I took [2] and had it translated on both sites, and while the translation was already pretty good (definitely good enough to understand the article), it made some fatal mistakes ("trillion" in English is "Billionen" in German, while "billion" in English is "Milliarden" in German; it got only the latter right). Not something you want for a scientific text...

What amazed me with Deepl.com was this: There is a paragraph

    Yuanming Wang, a doctoral candidate in the School of Physics at the University of Sydney, has developed an
    ingenious method to help track down the missing matter. She has applied her technique...
In German, there is a word for a female "doctoral candidate": "Doktorandin". LibreTranslate did not apply this construct but used "Doktorand" (the male form), yet translated the "she" correctly. DeepL, however, got it right. Which is amazing.

[1]: https://deepl.com/translator [2]: https://phys.org/news/2021-02-student-astronomer-galactic.ht...


DeepL is really impressive. I translated a very convoluted very technical text from German to English and the result was ok. I still have to edit it a bit to make it sound less German, but overall it's readable and understandable. It would have taken me quite a few hours to translate it, now I just have to edit it a bit.


I tried this on a fragment of https://de.wikipedia.org/wiki/ 's Artikel des Tages (article of the day) box for today (February 7, 2021).

Input: "Das Frauenstimmrecht in der Schweiz (Stimm- und Wahlrecht) wurde durch eine eidgenössische Volksabstimmung am 7. Februar 1971 eingeführt. Formell wurde das Frauenstimmrecht am 16. März 1971 wirksam."

Result: "Women's right to vote in Switzerland (voice and election rights) was governed by a federal referendum on 7 October. In the case of the United Kingdom, the Commission has not been able to take the necessary measures to ensure that the aid is granted. In the case of women's rights, the March 1971."

The translation of the first sentence conveys the meaning but changes the date from February 7 to October 7 and drops the year. The next part triggers... something in the translation engine, in any case the original has nothing to do with the UK or the (EU) Commission.

I suspect this is due to German date notation, where "7th" is written "7.", and the period confuses the translator, making it think a sentence ends there. Without periods:

Input: "Das Frauenstimmrecht in der Schweiz (Stimm- und Wahlrecht) wurde durch eine eidgenössische Volksabstimmung am 7 Februar 1971 eingeführt. Formell wurde das Frauenstimmrecht am 16 März 1971 wirksam."

Result: "Women's right to vote in Switzerland (voice and election rights) was introduced by a federal referendum on 7 February 1971. In addition, the Council adopted a resolution on the establishment of a European Community strategy for the development of the European Union."

The first sentence is fine now. The second is nonsense, the original sentence refers to Switzerland and means "Women's right to vote formally came into force on March 16, 1971". Still nothing to do with the EU. This broken translation remains the same when looking at just this sentence in isolation.
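A quick way to see the suspected failure mode: a naive splitter that treats every period followed by whitespace as a sentence boundary will cut German ordinal dates like "7. Februar" in half. A toy sketch (I don't know what segmenter the real pipeline uses; this just illustrates the hypothesis):

```python
import re

def naive_split(text):
    # Split on a period followed by whitespace. Real sentence segmenters
    # handle abbreviations and ordinals; this deliberately doesn't.
    return re.split(r"(?<=\.)\s+", text)

parts = naive_split("Das Frauenstimmrecht wurde am 7. Februar 1971 eingeführt.")
# The German ordinal "7." is mistaken for a sentence end,
# so the date ends up split across two "sentences".
```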

So: Thanks for this, I'm very happy to see this project. But it needs some more work to iron out the kinks. Good luck!


Interesting, thanks for the detailed write up.

> The translation of the first sentence conveys the meaning but changes the date from February 7 to October 7 and drops the year.

As discussed earlier in this thread, Argos Translate tokenizes the incoming text into what are essentially words and then passes them to the Transformer network. For situations where the risk of translations that are fluent and seem reasonable but are totally wrong is high, neural translation might not make sense even though the quality is generally higher. Instead you would probably want to use rule-based translations like what the Apertium project does.


Was the model trained on a corpus of EU documents? I'd imagine that's a great source of multilingual training data, but it could lead to an obsession with the EU in the translations.


It was partially trained on EU and UN documents. Interesting observation; less emphasis on parliamentary/legal data vs. a data set like OpenSubtitles, which has more casual language use, might give better performance.


The sentence in question could easily be found in a legal document, it's not really casual language.

The word "Formell" seems to trigger the EU focus, maybe it's very characteristic for established norms within the EU's translation service. Changing it to the closely related "Formal" makes the translation work: "The law on women's voting became effective on 16 March 1971." Another option would be "Offiziell", which gives the so-so: "Women's rights were officially enforced on 16 March 1971."


It does a somewhat decent job of English <-> Russian, keeping most of the meaning, but some mistakes it makes are hilarious.

For example, I tried this: "I had tried translating a test sentence and I got a rate limit related error..." It used the word "приговор", which does mean "sentence", except it's the kind courts give. The right word is "предложение", which means a (grammatical) sentence, but also a suggestion.

Languages are hard. I'm impressed.

edit: "I'm impressed." for some reason translates into "Я впечатлена." — technically correct but as though the speaker is female. That's weird.


> "I'm impressed." for some reason translates into "Я впечатлена." — technically correct but as though the speaker is female...

Context indeed is insufficient. On the other hand some languages have neutral forms, which may be more appropriate in mixed or insufficient contexts.

For example, in this case the Russian equivalent could be impersonal "Впечатляет!", as in "This impresses me!"

However, this is more interpretation than translation.

BTW, the French translation gives it in the masculine: "Je suis impressionné."


> technically correct but as though the speaker is female. That's weird.

The majority of the world's population is female, so absent other information this seems to be the correct guess if no "generic" option is available.


In Russian, the "default" form is male. Usually. For adjectives and adverbs specifically, the base form is male and singular. Verbs only have gender in the past tense, so there's no default, I guess?


Interesting if I do English to French on the following:

Hello Sarah, what time is it?

it translates to

Bonjour Sarah, quelle heure est-il ?

Now if I change the input to

Hello Sara, what time is it?

it translates instead to:

Quelle heure est-il ?

Any idea why the one character difference in the name affects the translation in this way?


The process for translation is to "tokenize" a stream of characters into "tokens" like this:

"Hello Sarah, what time is it?"

<Hello><_Sarah><,><_what><_time><_is><_it><?>

Then the tokens are used for inference with a Transformer net. There is no guarantee that output will be consistent under even small changes in the input. The network, based on the data it was trained on (and luck), has slightly different connections for the <_Sara> token than for the <_Sarah> token, leading to different output.
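A toy greedy longest-match subword tokenizer makes the point concrete (the vocabulary here is invented for illustration; real systems like SentencePiece learn theirs from data):

```python
def tokenize(word, vocab):
    # Greedy longest-match segmentation into subword units.
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # unknown character falls back to itself
            i += 1
    return tokens

# Toy vocabulary: "_Sarah" was frequent enough to become one unit, "_Sara" wasn't.
vocab = {"_Sarah", "_Sa", "ra", "_", "S", "a", "r", "h"}
tokenize("_Sarah", vocab)  # a single token the model has seen often
tokenize("_Sara", vocab)   # different pieces -> different network inputs
```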

Here's a video of some Linux YouTubers reviewing Argos Translate (https://github.com/argosopentech/argos-translate), the underlying translation library, and getting unexpected outputs: https://www.youtube.com/watch?v=geMs9dxl1N8


Pigeonhole principle?

More verb tenses in French than in English means ambiguity in where things may come from and go to.


Very glad to see something like this, this was on my high-priority Free software needs list.

I'd very much like if it could be used programmatically, and not just via API - C/Python/Rust bindings, etc. I'd like to build some automatically translating forum software with it.


It's based on argos-translate, which has python bindings:

https://github.com/argosopentech/argos-translate


And a PyQt native Desktop app.


OT: I'd definitely be interested in seeing the rest of your high-priority Free software needs list!


Besides Free translations APIs..

- A replacement for Aspera, which I wrote about here: https://www.ccdatalab.org/blog/a-desperate-plea-for-a-free-s...

- Transnational decision-making software (distributed Schulze method or similar)

- Project management software for (for instance, Wikipedia-scale) massively collaborative projects


Well, it would be nice if it worked, but it could not even translate "merry christmas" to German; it just left it as is. Apparently it needs the C to be capitalized ...


On a positive note I think it's great that we're seeing efforts in this direction.

Fixing capitalization and spelling is a fairly easy thing to do, just put a spell-checker before the input. Maybe that would be a good pull request.


There's a more in-depth discussion of this issue here: https://github.com/uav4geo/LibreTranslate/issues/20

In some cases using all lower case can help avoid this risk if capitalization isn't important.


This is really cool. I sent it several paragraphs to translate and the results were quite understandable. On a quick check I don't think the quality is as good as Google Translate (currently), but it's pretty darn decent already & I expect it to keep getting better. No machine translator is going to be perfect... but for many uses that's fine, it just needs to be "good enough".

It comes up with some amusing translations. E.g., "Fruit flies like bananas" translates into French as "Les fruits volent comme des bananes", which is something like "The fruits fly like bananas." This is a standard example of a translation problem (in "Fruit flies", is "flies" a verb, or are we talking about the insects called fruit flies?).

I encourage people to contribute to the OPUS parallel corpus database (data or help): https://opus.nlpl.eu/ That's the training data set used by LibreTranslate. In general, the better the corpus used for training, the better the translation results will be.


Good. Now can we finally stop pretending Cantonese doesn't exist?


Do you know any good sources for large amounts of translated Cantonese text?


I would assume Hong Kong records, as Cantonese is an official language (at least for English<->Cantonese); I would also assume Guangdong province would be a source of material as well.


For most people in Guangdong, Cantonese is at most a spoken language. They learn Standard Written Chinese with Mandarin pronunciation at school and if they want to write down something said in Cantonese, they might substitute characters of equivalent meaning (e.g. 是 instead of 係) or with similar pronunciation instead of the "official" characters used in Hong Kong.

The Hong Kong government is not much different. Their actual language policy is "Chinese and English are the official languages of Hong Kong. Committed to openness and accountability, the Government produces important documents in both English and Chinese. Correspondence with individual members of the public is always in the language appropriate to the recipients. Simultaneous interpretation in English / Cantonese / Putonghua is made available to meetings of the Legislative Council and Government boards and committees as needed." https://www.csb.gov.hk/english/aboutus/org/scsd/1470.html

So they recognize Cantonese and Putonghua as different spoken forms, but only one written language. I've never seen a Hong Kong government website offer translation into both Cantonese and Mandarin, it's always just Standard Written Chinese with a choice of Traditional or Simplified characters.

Most written Cantonese content on the internet is probably produced by Hong Kongers in informal contexts such as forums, but then it's not clearly marked as such and might be mixed with Standard Written Chinese and English.

I think the largest collection of monolingual Cantonese text is probably Cantonese Wikipedia (which is small) and the largest collections of translated Cantonese I'm aware of are even smaller: Tatoeba with 6095 sentences https://tatoeba.org/eng/sentences/show_all_in/yue/und and CantoDict with 1547 sentences http://www.cantonese.sheik.co.uk/scripts/examplelist.htm

But there might be larger datasets I'm not aware of, which is why I was asking.


It's interesting and sad to see the forced assimilative process of erasing written Cantonese. I remember HK in the late 90s still had newspapers that published in written Cantonese. Just across the border in Shenzhen, without the British influence and prior to the explosion of industry and tech in the early 2000s, you could still see nonstandard signage in Cantonese in store windows. I think getting rid of spoken Cantonese is likely a generational effort, not just something that can be done in a decade or so, but I've both experienced and done field work on how the Wu dialects were more or less systematically erased from official, and now even private, realms.

The Shanghai variety, itself developed only in the early 1800s from a pidgin of the Suzhou and Nanjing varieties mixed with northern influences, is actually quite well documented by foreign sources in writing, with a pidgin developing off of that and English and Portuguese that also survives in English sources and has been studied academically in great detail by Chinese authors in English, but not in Chinese to anything close to the same degree. Starting with the millennial generation, the speaking of the dialects in schools, even outside of class, became subject to punishment. With public education starting at the pre-kindergarten level enforcing the rule, across two or three generations even those whose first language is one of the dialects became more or less forced into being Mandarin speakers, losing their fluency.

I have little reason to doubt that something similar will simply be forced upon Hong Kong as well. Luckily sci-hub is your friend, and written Cantonese seems to be better represented than written Wu from a cursory search.


BC is a bit of a quiet haven for Cantonese culture. Check UBC, I know they do language preservation projects related to Cantonese.


Thanks for the tip. Their Cantonese program's website is here: https://cantonese.arts.ubc.ca/

Via the announcement for the "Language Archiving in the Digital Era" workshop https://cantonese.arts.ubc.ca/language-archiving-in-the-digi... ...

I found the "Corpus of Mid-20th Century Hong Kong Cantonese" https://hkcc.eduhk.hk/

In typical academic fashion, it's behind a login wall and doesn't offer an easy way to download the whole corpus. (Understandable, given that it's based on transcribing movies that are probably still copyright-protected, but annoying.) Also, no translations.

They mention 香港粵語語料庫 as a related project, but the link is dead. I found what appears to be the new website: http://compling.hss.ntu.edu.sg/hkcancor/

That corpus is CC-BY licensed (yay!) and puts the download page front and center, so I like it. There are no translations either, but recordings are included, so it might still be useful for a project of mine.

Thanks again!


Modern techniques don't necessarily need this for NMT.


Well done. The UI is nice and easy to use. The results looked good for the few sentences I tried (Arabic <--> English).

May I know which datasets were used to train the models?


OPUS parallel corpus: http://opus.nlpl.eu/

It's really great, they have a large amount of data and organize it to make it easy to access.


There are plenty of FOSS-powered services like this out there, for example https://www.modernmt.com/translate/ : it is an EU-funded project.

What I dislike here is that it looks like just a fancy repackaging of already existing technologies borrowed from academia, like OpenNMT.


An interesting alternative OSS natural language translator is Apertium: https://www.apertium.org/ Their public site doesn't support English/French, though, so it's hard for me to judge it.


Hmm... "Let's see how well it works." seems to be handled correctly when translating from English into any language except Chinese, where the apostrophe is turned into <unk> and the sentence is otherwise untranslated.

Does that mean there's a different model for each language pair?


There are different models for each language pair. Currently there are only pre-trained models to and from English, and other language pairs "pivot" through English.

ex:

es -> en -> fr
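The pivot strategy is straightforward to express. A sketch, assuming some direct-translation function `translate(text, src, tgt)` that only has models to and from English (the actual Argos Translate API may differ):

```python
def pivot_translate(text, src, tgt, translate):
    # Use English as an intermediate when no direct model exists.
    if src == "en" or tgt == "en":
        return translate(text, src, tgt)
    return translate(translate(text, src, "en"), "en", tgt)

# Fake direct "models" for demonstration only:
models = {("es", "en"): {"hola": "hello"}, ("en", "fr"): {"hello": "bonjour"}}
fake = lambda text, s, t: models[(s, t)][text]
pivot_translate("hola", "es", "fr", fake)  # es -> en -> fr
```

Note that errors compound: any mistake in the es -> en step is baked into the en -> fr input.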

Chinese is the weakest language pair currently, but I'm currently working on improving it: https://github.com/argosopentech/argos-translate/issues/17


Thanks for the explanation. Pivoting through English isn't ideal, but I'm just glad someone is working on this at all.

Thinking about it a bit more, it's a bit weird that the failure mode of a weak model would be to regurgitate the input unchanged. I'd rather have expected random Chinese gibberish in that case. Doesn't that mean the model has seen at least a few cases where English sentences were left untranslated in the training data?

I wanted to download the training data to check, but the instructions here https://github.com/argosopentech/onmt-models#download-data say to use OPUS-Wikipedia, which has no en-zh pairs, so the Chinese data must be from some other source.


Pivoting through English isn't inherent to Argos Translate; you could train a French-German model or whatever you want. I've just been focusing on training models to add new languages. The ideal strategy is to have models that know multiple languages.

Quoting a previous HN comment:

I think cloud translation is still pretty valuable in a lot of cases, since the model for one single-direction translation is ~100MB. In addition to having more language options without a large download, cloud translations let you use more specialized models, for example French to Spanish. I just have a model to and from English for each language, and any other translations have to "pivot" through English. For cloud translations you can also use one model with multiple input and output languages, which gives better-quality translation between languages that don't have as much data available and lets you support direct translation between a large number of languages. Here's a talk where Google explains how they do this for Google Translate: https://youtu.be/nR74lBO5M3s?t=1682. You could do this locally but it would have its own set of challenges for getting the right model for the languages you want to translate.

> Thinking about it a bit more, it's a bit weird that the failure mode of a weak model would be to regurgitate the input unchanged. I'd rather have expected random Chinese gibberish in that case. Doesn't that mean the model has seen at least a few cases where English sentences were left untranslated in the training data?

This was added last week, it's just not live on libretranslate.com yet:

https://github.com/uav4geo/LibreTranslate/issues/33

The training scripts are just an example for English-Spanish; OPUS (http://opus.nlpl.eu/) has data for English-Chinese.


> https://github.com/uav4geo/LibreTranslate/issues/33

That issue is about emojis, which confused me a bit, but I see that the linked commit is about replacing <unk> by the corresponding source token. https://github.com/argosopentech/argos-translate/commit/6a0f...

That's definitely an improvement, but what I was actually wondering about was why the model was copying the input verbatim in the first place.

Looking through the en-zh data in Opus, it looks like a bit of a mixed bag. This sample from OpenSubtitles v1 contains an ad for MyDVDrip in the Chinese version only: http://opus.nlpl.eu/OpenSubtitles/v1/en-zh_sample.html Filtering the data with some heuristics might be a good idea.
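Some common filtering heuristics for noisy parallel data are easy to sketch: drop pairs with an empty side, pairs where the "translation" is identical to the source (which teaches the model to copy), and pairs with wildly mismatched lengths (often ads or alignment errors). A toy version, with arbitrary thresholds:

```python
def keep_pair(src, tgt, max_ratio=3.0):
    # Reject obviously bad sentence pairs from a parallel corpus.
    if not src.strip() or not tgt.strip():
        return False                 # empty side
    if src.strip() == tgt.strip():
        return False                 # untranslated copy
    ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
    return ratio <= max_ratio        # length mismatch -> likely misaligned

pairs = [
    ("Hello world", "你好，世界"),
    ("Hello world", "Hello world"),                    # copy -> dropped
    ("Hi", "MyDVDrip 广告 广告 广告 广告 广告 广告"),   # ad spam -> dropped
]
cleaned = [p for p in pairs if keep_pair(*p)]
```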


You're right, there's surprisingly little data available for English-Chinese. I'm not sure why; it seems like there would be a lot of demand for translating between them.

For the en-zh model, this copying is a known issue: https://github.com/argosopentech/argos-translate/issues/4


There's a lot of demand, but the people most likely to satisfy that demand for free are Chinese fan translators, and they're more likely to upload their work to Chinese sites where Western dataset collectors are unlikely to find them...


I was looking for a fuzzy thesaurus of software terminology to help with searching docs. I tend to try the wrong keywords/phrases too often. I didn't find anything; the existing ones are all too general, with too many synonyms unrelated to software/data. Can this be used somehow?


Does it continue learning? It might be good to somehow federate it so that instances can share knowledge or results.

Were its current knowledge to stay static, I imagine it would become outdated after some time. But most importantly, it wouldn't correct mistakes.


Pretty interesting, but the quality is far from top notch. The most reasonable and 20x cheaper alternative to Google Translate that I know of is [1].

[1] https://www.deeptranslate.net/


“He didn’t know what to think” is translated to Portuguese as “Ele não sabia o que fazer”, which is incorrect. “Fazer” is “to do”, not “to think”.


I had tried translating a test sentence and I got a rate limit related error...

Would be glad to see this for Malagasy though


Data availability is going to be a problem, I think. Checking Malagasy Wikipedia https://mg.wikipedia.org , there are only 93k articles. That's even less than Latin Wikipedia https://la.wikipedia.org at 134k. And much of the text in these articles probably isn't a direct translation of an article in another language, so the amount usable for parallel-text mining is going to be very small.


Hmm, english -> french

I want a cookie -> Je veux un cookie.

Seems like it still has a little bit to go.


Cookie is used in French as well to describe American-style cookies... like the big ones with chocolate and nuts inside. Of course, the French for cookie is "biscuit", which is typically hard, flat and unleavened. So "cookie" can be said to be a French word as well :)


English to Russian (both directions) works... well, it does not :D



