This claim seems a bit exaggerated. Microsoft Research Asia, as one example, has been doing neural machine translation for a large number of languages without relying on English for quite some time, and those models are used in production scenarios too.
When I was doing my MBA about a decade ago, I teamed up with two other grad students to propose, raise money for, and ultimately install a demonstration/residential-size wind turbine on campus. It was part of our work positioning entrepreneurship around clean energy.
We worked with the college press office to draft our press release, and in it we made the claim that it was the first wind turbine on a school campus in the greater Boston area.
Well, one of the bigger papers picked up the story, and the press office ended up getting a nasty voicemail from some parent at a private high school in some Boston suburb. The parent wanted our school to know that HIS kid's school was the first to put up a wind turbine.
This actually worried our little team because we didn't want to say something untrue. But the PR person explained that this "first to do X" type of thing is a very common press tactic and it is not unusual to have your claim disputed.
And, second, that "greater Boston area" is not really a narrowly defined term.
This was something of a relief, and from then on we had a running gag about the PTA dad with sour grapes about our wind turbine.
The clue is in the name: Public Relations. If "massaging" the truth to make things appear better than they are, and shifting focus from the real problems to the positives, isn't the job, I don't know what is.
You can call it hyperbole and good marketing, but then it also goes to the extent of PR after oil spills, tailings dams wiping out villages, and poisoning or risking the health of your own workforce, like, for example, DuPont and C8.
It's hard not to be cynical about PR because you cannot do it effectively without creating / propagating a distorted view of the truth.
But then again, the reason it still essentially needs to exist is because there is no universal truth, no universal sense of good or bad. So for example, do I forgive nuclear power and GM food companies for trying to influence public discourse on the safety of their work? Of course: it's essential in a world where they are competing against lots of unwarranted negative attention, much of it baseless as far as the science goes.
Anyway, in spite of my ranting and bias, I'd really conclude it's a necessary evil, and that the majority of the industry is innocuous and just trying to showcase the work and values of its clients...
Also, newspaper editors bear a huge part of the blame, when these days loads of press releases make it to print almost unedited (and, as has happened to me and others I know, the bits they did change were even more wrong - including in a national-level newspaper)... Fact-checking of claims and sources is sorely lacking, especially in the fast modern news cycle.
> But then again, the reason it still essentially needs to exist is because there is no universal truth, no universal sense of good or bad.
That may be true to some extent if you really push the philosophy angle, but I'd say the real world is much simpler than that. There's stuff that happened, and there are consequences - just because the consequences may not be fully computable in the amount of time and effort anyone is willing to expend on it doesn't mean truth suddenly becomes fuzzy. The territory is sharp; it's just the map that's uncertain. But for that, the proper words are "I'm not sure", not "I have my truth, you have yours".
> So for example, do I forgive nuclear power and GM food companies for trying to influence public discourse on the safety of their work? Of course: it's essential in a world where they are competing against lots of unwarranted negative attention, much of it baseless as far as the science goes.
And because of what I wrote above, I despise both. Yes, I understand the practical necessity - one side lies because the other side lies too, both stuck in a feedback loop. But I'd still say both are behaving unethically. Two wrongs don't make a right.
I subscribe to the viewpoint I've best seen phrased in an old blog post[0]: "promoting less than maximally accurate beliefs is an act of sabotage. Don’t do it to anyone unless you’d also slash their tires".
> I'd say the real world is much simpler than that. There's stuff that happened, and there are consequences - just because the consequences may not be fully computable in the amount of time and effort anyone is willing to expend on it doesn't mean truth suddenly becomes fuzzy. The territory is sharp; it's just the map that's uncertain.
Oh but it definitely DOES mean that truth becomes fuzzy. There is NO territory we can meaningfully talk about outside our subjective maps. Likewise, there is no such thing as "absolute truth" - and I do mean this on a very practical, day-to-day level, not in an abstract philosophical way.
The sooner we accept and embrace this, the sooner we can move away from "I'm right and you're wrong" to "let's make progress on what actually matters to the parties involved".
One challenge with this is that people do not seem to respond to nuance and tentative indications so much as to definite conclusions pointing toward specific actions.
I wonder if part of the problem is that people do not have the time, or the cultural inclination, to enjoy philosophical consideration of matters the way they do tawdry headlines.
> the proper words are "I'm not sure", not "I have my truth, you have yours".
In this I agree. Looking back on it now, if some other school had already put in any sort of turbine, I would have looked for another headline.
At the time, our group wasn't aware of this other school's efforts. It felt like we were all over cleantech in New England, so the dispute was a genuine surprise.
I pay close attention to the misuse of facts or lack of context around facts now. I try to look for the truth even if it isn't the way I might want it to be.
When I find out the truth runs counter to an editor's headline in the news, I get upset enough to try and pinpoint where the spin is coming from.
Or, as it is called in English: "lying by omission". Indeed, such an idiom does not exist in Dutch. Which raises interesting questions that take us into Sapir-Whorf territory...
Do you have a reference/citation for any system that does 100x100 translation? That's the claim here - first to do 100x100 language directions from a single model. But if we are wrong I'd happily update with a reference/citation.
Easy to test. "Koirissanikohan" should be a sentence (or a word) conveying exactly this meaning: "I wonder if also probably in my dogs there is ..".
It will not happen in a pure AI-based system without hard-coded parsing rules and without changing the default dictionary-based, English-like language model.
I'll never consider machine translation a solved problem until there is good Finnish<->English. Living in Finland as a native English speaker with terrible Finnish skills, this is a daily source of frustration and sometimes amusement.
I've been in Finland for a month as a native English speaker and came to exactly the same conclusion, especially after witnessing all the mistakes that Google Translate makes.
OT, I hope you don't mind me asking, how does this work out for you? I have some reasons to consider moving to Finland but the language barrier seems very rough. Is it even possible for the average hacker to get by with just English?
It's been good so far, but luckily my partner is Finnish, so that helps quite a bit. I'm also taking classes and trying to use the language when I can. I've learned French and Dutch to conversational level in the past, but I've found Finnish to be much, much harder.
In terms of daily life, in Helsinki or maybe the other big cities you could generally get by just fine with English only. I know a few folks at work who've been here a long time and have survived OK! In terms of written information, everything is in Finnish or Swedish, but occasionally (and, from what I gather, increasingly) also in English. I've found pretty much everyone in a customer/client-facing position speaks very good English too.
The one situation that comes to mind where it's genuinely quite difficult is grocery shopping and following cooking instructions. Since I like to cook I enjoy trying new things, but Google Translate is totally hopeless there. Thankfully, I've found that translating the Swedish to English seems to work OK.
> I've learned French and Dutch to conversational level in the past
They are both neighbouring countries of the UK, with similar idiom. Finnish is in another league. Congrats BTW; many native English speakers these days try but, in my experience, end up using English in business settings only.
There is no rigid structure, as the spelling is adjusted to make words easier to pronounce. I'd say a pure text-reading AI would never learn all the nuances and rules on its own, because so much is lost in translation.
>> Maybe some texts that describe a given language should also be used.
But in what language would those texts be? Language models used for translation are trained on sentence pairs. How would e.g. a book on the grammar of Finnish, written in Finnish and without a translation in English, help learn how to translate Finnish into English?
I'm genuinely asking. This sounds like an interesting idea - but how exactly would it work?
I don't know! My thinking is that if humans can read a Finnish book and learn facts about the Finnish grammar, there must be a way to exploit this in machine learning.
The problem is that we Finns can use a word (a known base word plus an inflection added to get the wanted meaning) in a sentence that has never been said or written before, and native speakers will still understand it.
Basically, a full training set with all the words does not exist. Even less so a set with actual translations for them.
This also means that autocomplete etc. is pretty much useless for us. I just disable it on my iPhone, as it actively makes typing out a sentence harder.
(Also, to make things harder, as others have pointed out, the base word and inflections get changed to make the word easier to pronounce, so there is no static form for them when you stack them.)
There are many languages that work similarly, and often it's more the orthography than the grammar that's problematic. E.g. English and German form compound nouns in the same way, but the constituent parts are usually separated by spaces in English orthography, while they're run together as one long string in German.
That doesn't mean it's impossible to work with other languages, just that "words separated by spaces" is the wrong abstraction for processing them. It just happens to be a heuristic that works well enough for English, so a lot of functionality (like autocomplete) assumes that it works the same for other languages. It would be perfectly feasible to offer partial completions of long words in languages like Finnish or German, if only the space key were treated as less special. (Just compare to Chinese and Japanese, where autocomplete works despite no spaces at all.)
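To make that concrete, here's a minimal toy sketch of the idea: complete up to the next morpheme boundary rather than the next space. The stem and suffix lists are invented for illustration, and it ignores the consonant gradation and other sound changes mentioned elsewhere in this thread.

    # Toy morpheme-level autocomplete: suggest the next *morpheme* of a
    # partially typed word instead of whole space-delimited words.
    # Stems and suffixes are invented; real Finnish also applies sound
    # changes (consonant gradation) that this sketch ignores.
    SUFFIXES = ["ssa", "sta", "lla", "ni", "si", "kin", "ko", "han"]

    def complete(prefix, stems):
        # first try to finish the stem itself
        out = [s for s in stems if s.startswith(prefix) and s != prefix]
        # if a full stem has been typed, offer one more suffix, not a word
        if any(prefix == s or prefix.startswith(s) for s in stems):
            out += [prefix + suf for suf in SUFFIXES]
        return out[:5]

    print(complete("koi", ["koira", "kauppa", "talo"]))    # -> ['koira']
    print(complete("koira", ["koira", "kauppa", "talo"]))  # stem done: offers suffixes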
Not having a static form might create some redundancy in the lexicon, but that's not more of a problem than the vowel mutation in English "sing", "sang", "sung", "song". Treating different surface realizations of the same underlying base form as independent might actually be beneficial for getting accurate results that take into account how the base form is modified.
In Dutch, like in German, we write compound words without a space, so you can also invent completely "new" words. If I enter such words in Google Translate most of them are correctly translated to English and French. Granted, splitting a word into its compound parts (= several existing words with maybe an "s", "e", or "en" between them for easier pronunciation) is easier than having a base word + inflection + changes for pronunciation, but it shouldn't be impossible for an AI to do this. I do think you'll probably need custom rules or extra information per language, corresponding roughly to some grammatical rules or patterns.
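A naive version of that splitting is simple enough to sketch (the vocabulary and linking elements here are illustrative only; real systems add frequency statistics on top to pick among competing analyses):

    # Toy greedy splitter for Dutch/German-style compounds, trying the
    # linking elements "s", "e" and "en" between parts.
    VOCAB = {"fiets", "winkel", "deur"}
    LINKERS = ["", "s", "e", "en"]

    def split_compound(word):
        if word in VOCAB:
            return [word]
        for i in range(1, len(word)):
            if word[:i] not in VOCAB:
                continue
            for link in LINKERS:
                rest = word[i:]
                if rest.startswith(link):
                    tail = split_compound(rest[len(link):])
                    if tail:
                        return [word[:i]] + tail
        return None  # no analysis found

    print(split_compound("fietsenwinkeldeur"))  # ['fiets', 'winkel', 'deur']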
I would bet that pure text-reading AI can absolutely deduce the rules. Finnish morphology (word structure) isn't that complicated. It's routinely used as an example in linguistics classes because it is simple and regular. If you want to see truly crazy word structure, try Georgian or Navajo.
Actually, in the 1970s Finnish morphology was considered a hard AI problem. I know this for a fact, because I saw Fred Karlsson in a Xerox Interlisp workstation demo in 1980. A few years later they obviously realized that only some cheap undergraduate labor was needed.
The problem is of course how to learn the meanings, when they are omitted in translations. Even a native speaker is not always sure: is "koiran-ko-han-ko" a tautology, or does the second question particle refer to the "han"? "You wanted a dog, yes?" vs. "You wanted a dog, yes? You sure now?"
A pure context-free, meaning-based translator would just pick a random word from the list, like "kauppammekinkohan", and translate it right away: "I wonder if our shops also might be...".
Tangentially related, this pisses me off more than it should regarding the "AI" in phone keyboards. Under the hood they're barely anything better than Markov chains.
I'm almost sure autocompletion in Finnish (or Estonian or Hungarian) is borderline useless, as the chances of writing a word that was never written before are quite high.
But even with more "sane" grammars these models barely work. When I write in Spanish, there are often verb forms missing that I have to type out fully.
For example, in Spanish all forms are on conjugation lists, but it can be tough to find a set of corpora that covers them all. I know that a 2016 Spanish Wikipedia dump I played with covered only about 20% of all the verb forms present in rae.es.
Then, on top of all those forms (I'd say about 16-18 tense/mood/aspect forms are in common use), you have to take into account enclitic/affixed pronouns, and the whole thing goes awry quickly.
"Comámosnoslas" - If you search on Google, there are only 9 results[0], this post likely to become the 10th. But it's a completely normal word a Spanish speaker may use and will understand.
comamos ("let us eat") nos (emphasis "for ourselves") las ("them", feminine). "Let's eat them!" but with emphasis lost in translation.
E.g. usage: "Hay 3 pizzas, ¡comámosnoslas!". This sentence, funnily, is properly translated by Google Translate into English, but it tries to correct it to "comamosnos las", which is absolutely broken Spanish.
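The combinatorics alone show why corpus coverage is hopeless. A toy count (the lists are a small illustrative sample, and it ignores the accent adjustments that real attachment triggers, as in "comámosnoslas"):

    # Toy count of Spanish surface forms: a few verb forms that can take
    # enclitics, times optional indirect and direct object pronouns.
    verb_forms = ["comamos", "come", "comed", "comer", "comiendo"]
    indirect = ["", "me", "te", "se", "nos", "os"]
    direct = ["", "lo", "la", "los", "las"]

    surface = {v + i + d for v in verb_forms for i in indirect for d in direct}
    print(len(surface))  # up to 5 * 6 * 5 = 150 strings from just five base forms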
> I'm almost sure autocompletion in Finnish (or Estonian or Hungarian) is borderline useless, as the chances of writing a word that was never written before are quite high.
I agree with you in that regard, speaking Turkish.
Machine translation now generates fluent text (at least for en-ja), but my dissatisfaction with it is that it sometimes misses a negation word that is very important to the meaning. IMO, not missing negation words matters more than fluency.
I fully agree with this. As a data scientist, I always think that this is a "natural" consequence of one of the main (if not _the_ main) metrics used to evaluate machine translation algorithms, which is BLEU: https://en.wikipedia.org/wiki/BLEU
According to this metric, if you have a moderately long sentence like "I am not the person who said the president should be reelected" and your translation misses the "not", you would still get a score of roughly 11/12 ~ 92%. And since BLEU only looks at local n-gram overlap, word order barely matters, so "I am the person who said the president should not be reelected", while meaning the opposite, would still score very high.
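(You can check this in a minute with nltk's sentence-level BLEU; I won't promise exact numbers, since they depend on the smoothing method, but both mutilated hypotheses come out far higher than their meaning deserves:)

    # Quick experiment with nltk's sentence-level BLEU (pip install nltk).
    # Smoothing avoids degenerate zero scores on short sentences.
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    ref = "I am not the person who said the president should be reelected".split()
    dropped = "I am the person who said the president should be reelected".split()
    moved = "I am the person who said the president should not be reelected".split()

    smooth = SmoothingFunction().method1
    print(sentence_bleu([ref], dropped, smoothing_function=smooth))
    print(sentence_bleu([ref], moved, smoothing_function=smooth))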
Of course these are rather artificial examples, and in general machine translation algorithms and their evaluation work because it's "easier" to create an algorithm that gets the right translation than one that unintentionally fools the metric systematically. Nevertheless, if the research community used a metric that punished this kind of mistake more strongly, I suspect that over time a few new algorithms could come up that improve on this specific point.
Alas, I don't know of any such metric (nor would I know how to design one, of course; otherwise I'd publish it ;-) ).
I tried using en-ja on Google Translate for: "fall through the cracks".
It's an extremely common idiom and I used it in a proper context. It fails, horribly. I don't know why I even bother checking Google Translate; it basically fails every single time to produce natural Japanese.
Our teachers could tell in a second if it was made by Google Translate.
Would appreciate it if someone could ELI5 how they assembled a dataset for this task without relying on English. I am not able to figure out their LASER 2.0 system.
They do rely on English & Spanish translations to construct the dataset.
They have existing tools to jointly embed sentences from multiple languages (Laser). These models are trained on a translation task, using parallel corpora involving either English or Spanish.
Using these models and the joint embeddings they produce, FB can mine the web for new pairs by roughly identifying whether two sentences in two different languages constitute a translation pair.
Exactly. I work on Laser2; the approach is the same as Laser [0], but Laser2 performs better on some low-resource languages.
A more ELI5 explanation would be something like this: Laser is an encoder/decoder architecture trained on a translation task from language X to English/Spanish, with the particularity of having only one vector between the encoder and decoder. Once this system is trained (with public translation datasets), the vector that the encoder gives you for an input sentence represents the "meaning" of that sentence, since the decoder relies only on that information to generate a translation. And this vector representation is the same for any language the system was trained on.
So we use that system to mine data from CommonCrawl: given any language pair (say Romanian-Nepali), having two vectors close to each other in that latent space means the sentences have the same meaning.
We use fastText's language classifier [1] to filter CommonCrawl, compute the vector representations with Laser, and find close vectors thanks to Faiss [2].
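For the curious, the mining criterion is roughly a margin between a candidate pair's cosine similarity and the average similarity of each sentence to its own nearest neighbours, which penalises "hub" sentences that sit close to everything. A rough sketch, not the production code (k and the threshold are illustrative, and it assumes you already have unit-normalised float32 embeddings from an encoder like Laser):

    # Rough sketch of margin-based bitext mining over sentence embeddings.
    import numpy as np
    import faiss

    def knn_mean_sim(queries, index, k):
        # mean cosine similarity of each query to its k nearest neighbours
        sims, _ = index.search(queries, k)
        return sims.mean(axis=1)

    def mine_pairs(src, tgt, k=4, threshold=1.04):
        d = src.shape[1]
        src_index = faiss.IndexFlatIP(d); src_index.add(src)
        tgt_index = faiss.IndexFlatIP(d); tgt_index.add(tgt)
        # best target match (and its cosine) for every source sentence
        best_sim, best_idx = tgt_index.search(src, 1)
        # margin: cosine divided by the average neighbourhood similarity
        denom = (knn_mean_sim(src, tgt_index, k)
                 + knn_mean_sim(tgt[best_idx[:, 0]], src_index, k)) / 2
        margin = best_sim[:, 0] / denom
        keep = margin > threshold
        return np.nonzero(keep)[0], best_idx[keep, 0], margin[keep]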
> given any language pair (say Romanian-Nepali), having two vectors close to each other in that latent space means the sentences have the same meaning.
That's the goal, but have you asked any Romanian-Nepali bilinguals how well it works? (I realize those might be hard to come by.) I had a look at some of the language pairs in CCMatrix, and I noticed that the highest-confidence matches for some of them (English-Chinese, English-Korean) include a lot of quotations from old religious texts. That wouldn't be a problem if they were actually the same quotations, but it looked more like the model managed to identify archaic language and then got overly confident that two archaic sentences must have the same meaning.
I wonder whether there's been a human evaluation of the mined training data, or whether you rely on catching any problems downstream when you measure the BLEU of the trained model.
Of course, the problem with these "synthetic" datasets is when the translation models overfit to the problems in these multilingual sentence encoders/embedders.
Still, it lets you massively increase the amount of data you're training on, which is generally worth it.
In theory you could do this with books? Sentences in translations should still line up, and you could add some heuristics to identify paragraphs, quote marks, etc. You might be wrong occasionally, but doing the extraction per chapter or per paragraph would mitigate that.
Only in theory. Sentences are hardly the same in other languages, as books are not one-to-one translations, and a sentence in one language might be three in another. Even the paragraphs might not line up in Asian languages.
It depends on the language pairings. Take a look at Don Quixote in English and Spanish; they're very comparable at the sentence level (at least in this translation). Of course, it also depends on the artistic license of the translation. Maybe it's something you could do with a bit of human intervention to sync up the two texts.
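The standard trick for that syncing is length-based alignment (Gale & Church, 1993): sentence lengths correlate strongly across translations, so a dynamic program over 1-1, 1-2 and 2-1 matches gets you most of the way. A rough sketch with a deliberately crude cost function:

    # Sketch of length-based sentence alignment in the Gale-Church spirit:
    # align two sentence lists with a DP over 1-1, 1-2 and 2-1 "beads",
    # scoring each bead by the mismatch in total character length.
    import math

    def align(src, tgt):
        INF = float("inf")

        def cost(a, b):
            la, lb = sum(len(s) for s in a), sum(len(s) for s in b)
            return abs(math.log((la + 1) / (lb + 1)))

        n, m = len(src), len(tgt)
        best = [[INF] * (m + 1) for _ in range(n + 1)]
        back = [[None] * (m + 1) for _ in range(n + 1)]
        best[0][0] = 0.0
        for i in range(n + 1):
            for j in range(m + 1):
                if best[i][j] == INF:
                    continue
                for di, dj in ((1, 1), (1, 2), (2, 1)):
                    if i + di <= n and j + dj <= m:
                        c = best[i][j] + cost(src[i:i+di], tgt[j:j+dj])
                        if c < best[i+di][j+dj]:
                            best[i+di][j+dj] = c
                            back[i+di][j+dj] = (i, j)
        # walk back from the end to recover the aligned beads
        beads, i, j = [], n, m
        while (i, j) != (0, 0) and back[i][j] is not None:
            pi, pj = back[i][j]
            beads.append((src[pi:i], tgt[pj:j]))
            i, j = pi, pj
        return beads[::-1]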
I'm guessing you've had a look at https://www.deepl.com/translator ? I don't speak Portuguese so I can't judge, but it explicitly offers European Portuguese as a translation target, and their other translations are usually top-notch.
Douglas Hofstadter's test sentence:
In their house, everything comes in pairs. There’s his car and her car, his towels and her towels, and his library and hers.
DeepL translation:
Dans leur maison, tout vient par deux. Il y a sa voiture et la sienne, ses serviettes et les siennes, sa bibliothèque et la sienne.
DH translation:
Chez eux, ils ont tout en double. Il y a sa voiture à elle et sa voiture à lui, ses serviettes à elle et ses serviettes à lui, sa bibliothèque à elle et sa bibliothèque à lui.
As a French native, I can tell you that DH's translation is great. And DeepL's output is still something you can recognize as automatic translation at first sight.
Don’t get me wrong: more often than not, current MT tools give good enough results to understand the topic at hand. But those are still hideous translations. I wouldn’t buy a book translated through them, for example.
Is there a reason why "sa bibliothèque à elle et sa bibliothèque à lui" is a better translation of "his library and hers" than "sa bibliothèque et la sienne"? I know it's just a word-for-word translation, but the latter seems more natural, and the former unnecessarily verbose. Especially in context, where the original English deliberately interrupts the pattern of "his car and her car, his towels and her towels", as a matter of prosody.
Not that I'd expect the computer to "know" that. I'm just curious, as a French learner, about why DH's translation is better.
> Is there a reason why […] is a better translation […]?
What do you mean by "a reason why"? From a synchronic or a diachronic point of view? Yes, there are reasons that linguists can provide. But for the mere layman, it will simply be weird to encounter such an utterance, full stop. Not that your question is nonsense; I would simply expect most people to be unable to tell you why, despite knowing it sounds weird.
> I know it's just a word-for-word translation, but the latter seems more natural, and the former unnecessarily verbose.
What you mean by "seems more natural" is that it sounds closer to what you are accustomed to in your own language. French is generally more verbose than English, especially in the usual written form.
Now, as your "why" is nontheless completely legitimate, here is one explanation .In the case of "sa bibliothèque et la sienne", unlike in English which insist on the possessor (his/her takes the gender of the one who owns the object), French insists on the possessed (sien/sienne takes the gender of the object which is possessed, and every substantive have a gender). So would you translate "her library and his" in the same way, you would end up with the very same translation "sa bibliothèque et la sienne". On the other hand "sa bibliothèque à elle et sa bibliothèque à lui" would preserve the information of who owns each library.
If you really wanted something that comes as close as possible to the original prosody, you might use "sa bibliothèque à lui, et elle la sienne."
The service you link to doesn't seem to offer translation from and to Greek. It's also missing many other EU languages, e.g. Swedish, Danish, Hungarian, Czech, etc.
Edit: to clarify, other translation services I've used, notably Google, do translate from and to Greek but make a meal out of it.
I was using the automatic translation for Mongolian on Facebook recently, and I couldn't make much sense of the output. I wonder what kind of translation software is being used there.
Most of the models announced in papers can't be deployed in production because they are not optimised for inference efficiency. For example, Google Translate, the public service, is not as good as the SOTA in papers.
Just for research. Most research models are not used for anything. Imagine having to serve a model that requires 2 x 32 GB GPUs to billions of users.
Text to speech is also much worse in deployment compared to research. The recent research models have much better intonation.
GPT-3 is the worst offender here: it's so big that it becomes almost uneconomical to run, and certainly impossible to offer for free (estimated requirements are 11 Tesla V100 GPUs).
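(The arithmetic behind that estimate is just weight storage: 175B parameters at two bytes each in fp16, divided across 32 GB cards:)

    # Back-of-envelope for the "11 V100s" figure: fp16 weights alone.
    params = 175e9        # GPT-3 parameter count
    bytes_per_param = 2   # fp16
    gpu_mem = 32e9        # 32 GB V100
    print(params * bytes_per_param / gpu_mem)  # ~10.9 GPUs just for the weights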
They could have sold this as a service to those who need higher-quality translation; think about the GPU time required to build the models otherwise going to waste. (Also, laymen like me would think that all the effort is being wasted.) And practical usage is a kind of test too, isn't it?
Between German and English, one example that stuck with me was it confusing "farmer" and "builder", because in German both of these map to "-bauer" within compound words. Cue a number of auto-translated job adverts advertising positions as a "street farmer" and suchlike…
Afrikaans Albanian Amharic Arabic Armenian Asturian Azerbaijani Bashkir Basque Belarusian Bengali Bosnian Breton Bulgarian Burmese Catalan Cebuano Chinese Croatian Czech Danish Dutch Eastern Punjabi English Estonian Finnish French Fulah Galician Ganda Georgian German Greek Gujarati Haitian Hausa Hebrew Hindi Hungarian Icelandic Igbo Ilokano Indonesian Irish Italian Japanese Javanese Kannada Kazakh Khmer Korean Lao Latvian Lingala Lithuanian Luxembourgish Macedonian Malagasy Malay Malayalam Marathi Mongolian Nepali Northern Sotho Norwegian (Bokmål) Occitan Oriya Oromo Pashto Persian Polish Portuguese Romanian Russian Scottish Gaelic Serbian Sindhi Sinhalese Slovak Slovenian Somali Spanish Sundanese Swahili Swati Swedish Tagalog Tamil Thai Tswana Turkish Ukrainian Urdu Uzbek Vietnamese Welsh West Frisian Wolof Xhosa Yiddish Yoruba Zulu
This is a pretty huge milestone. However, I really want to see the next generation of MTL. From what I've seen, we're still a long way away from natural sounding translations.
Some pairs are also very bad right now, particularly Korean->English.
> From what I've seen, we're still a long way away from natural sounding translations.
IMO, this is a dataset problem. I think we're finally seeing the "next generation" of dataset collection in AI, so there will hopefully be improvements for pairs with sparser sets.
> we're still a long way away from natural sounding translations.
DeepL makes perfect translations between English and German. Perfect in the sense that it reads as if a professional translator had translated it. Google, Microsoft, and Facebook might be far off, but DeepL isn't.
Not perfect between English and French but certainly great.
The possibility of inserting your corrections into the suggested translation and letting the algorithm rewrite the following text makes it a huge time saver when one needs to do a translation.