When I was in Japan I did proofreading for a Japanese feature phone. A major Japanese brand, actually. That was really comical.
There was an Australian guy for English, a German guy, an Italian lady, and me for French. What they did prior to the meeting was:
* have Japanese people with a poor level of English (maybe the software engineers, actually) translate from Japanese to English
* have translators who got only the strings, and absolutely no context, translate from that weird English to the other languages.
In the meeting we had all the strings, and one person from the manufacturer who had access to the "super-confidential" unreleased device.
More than half of the translations were off because of lack of context. The French guy actually translated "Garbage day" to something like "Shitty day", apparently he thought that was a way to mark in your calendar that you had a really bad day.
Pretty often we had sentences like "delete one", and invariably one of us had to ask "One what? I need to know if it's masculine/feminine/neuter". Of course they hadn't prepared for that, and it was too late to change the code, so they made us do ugly things like "%n item(s)".
Also, the Australian guy was losing faith in humanity:
- That sentence, it's completely wrong, it just doesn't mean anything in English. People will just go "WTF?" when they read that
- We're not allowed to change the English strings, they're already validated
- .....
I don't know why nobody seems to put information like "warning: this phone's UI in <your local language> is total and utter crap".
Anyway, what you wrote is exactly why I stick to using all software and web services - OS, text editors, Facebook, et al. - in en_US instead of my native pl_PL. Because translations are always crappy - even from the big players. Lack of context is the key here - translated text often feels out of place, because there is usually some overarching idea behind it that isn't communicated to the translators. Then there is the lack of consistency. Words in the original text often have some site-specific meaning, which also tends to get lost in the translation process. For example, on Facebook the word "like" refers to a well-defined thing, not the dictionary meaning, so it's totally not OK to randomly replace it with synonyms during translation [0].
I realized at some point that I often look at a crappy translation, guess what the English original was, and then in my mind translate it to what it should have been in the first place. Because for some strange reason I, the user, have the context, and the paid translation team does not. I guess I'm going to put that into my "Translation issues" file in the "Mysteries of capitalism" drawer, right next to the "how on Earth can multi-million media companies not do a movie translation that isn't total crap" file. I mean, seriously, you're better off looking for pirated subtitles even if you bought the original, because the pirates at least seem to have watched the movie they're translating.
</rant>
[0] - I wish more translators would use the approach Jehovah's Witnesses used when doing their own Bible translation. Since it was designed to be studied and analyzed, they preferred accuracy over aesthetics - therefore one of the translation rules was "as much as possible, let's have any given word in the original text always be represented by the same word in English". Adhering to that single rule would eliminate like half of the "context missing" problems with software translations.
You know what multi-million movie has a translation that isn't total crap? Frozen. They really put resources into that. You can look up random Disney songs on YouTube in different languages, and then look up the Frozen songs, and you can sort of tell that they've done a better job even if you don't speak the language.
I agree. Frozen, and other Pixar/Disney/DreamWorks children's movies (like Shrek), tend to be of awesome quality in all languages. But I attribute this to the fact that those movies are not translated - they're localized, which by definition requires much more work and much closer attention.
The Latin American Spanish localization of DreamWorks' Shrek is a great example.
They brought in Eugenio Derbez, a Mexican comedian, to voice Donkey (voiced in English by Eddie Murphy). Donkey in particular speaks in colloquialisms and pop culture references with wordplay, so Derbez wrote a bunch of new lines and jokes that referenced Latin American colloquialisms and pop culture.
Children learn different fairy tales in different countries, so they also managed to change the identity of some of the characters without changing their appearances (and without altering video at all, just audio).
Exactly. I mentioned Shrek for a reason. It was (and still is) hugely popular in my country (Poland), and one of the reasons for that is the deep localization. They replaced original jokes and pop culture references with local ones.
@daxelrod: Wow, thanks for that. Quite interesting. I always wondered about how that was done and if it was a direct conversion of sorts but I guess it's not. Very interesting.
All of this information comes from the teacher of a Spanish class I took (we watched the Latin American version of Shrek in the class). I wish I had some more tangible sources to cite.
EDIT:
http://www.imdb.com/name/nm0220240/otherworks: "Jeffrey Katzenberg and Dreamworks allowed [Eugenio Derbez] not only to dub Donkey's voice, but to translate and adapt the script of "Shrek" and "Shrek 2" to make it more appealing to Latin America"
I also remember that the Gingerbread Man was one of the characters who was altered, but I don't remember the name of the Latin American replacement.
Pinpon is a puppet
very handsome and made out of cardboard.
He washes his little face
with soap and water.
He untangles his hair
with an ivory comb.
And in spite of the hair pulling
he cries not nor even winces.
Not to mention that the context of each line of dialogue is pretty close to unambiguous, since you have the source material right in front of you for a movie. I imagine it's still a huge job, but it has to be more enjoyable than translating/localizing for us asshole programmers and our magic translation strings.
If, by analogy, the visuals of a movie musical are the "backend" and the audio is the "frontend," then what these localizers do is the equivalent of completely redesigning the entire frontend. Dubbed musicals have an incredible number of constraints in terms of number of syllables, scansion, etc., so the script translators need to be given a tremendous amount of leeway and creative freedom. They're basically lyricists in their own right, in a world where everything's composed melody-first!
In software, this would translate to localization coders being able to (and having the talent to) rewrite the entire frontend logic. And if your software product is going to make multi-millions in new markets by virtue of feeling like it's translated natively, it might be worth retaining native-speaker coder(s) to maintain a branch that parallels (and consistently merges in) your master branch, and rewrites display logic as it comes in. I'd imagine the Googles of the world do exactly this.
Disney generally spends a lot of effort on translations.
Idina Menzel, who voiced Elsa, also played the lead in the Broadway musical Wicked. Several translations of Frozen use an actress for Elsa who played the lead in a localized version of Wicked.
For Big Hero 6 we also translated and replaced all of the Japanese text in San Fransokyo with Chinese and Korean for the Chinese and Korean markets. By "text" I mean all of the CG signs, posters, and environmental set dressing in the actual movie (not just the dialogue).
We literally had to re-render the entire movie for each translation. Disney Animation takes these translations very seriously :-)
To be honest, I thought the Dutch translation was relatively awkward (I have to admit I've only heard the Dutch "Let It Go" version, not the rest of the movie). I thought the Flemish version was much nicer, despite some Flemish phrasing sounding off as Dutch...
(For those wondering: Flemish is the Belgian variant of Dutch.) And I agree... I also prefer the Flemish dubbed voices. For example, Timon & Pumbaa in The Lion King are Flemish, to great effect.
It makes sense if your target demographic, or at least a large part of it, cannot read English at the level necessary to use your tool.
However, for a lot of modern tools that isn't the case. Often a translated tool is an order of magnitude less usable because of broken translations and inability to Google things.
It's also infuriating that lots of tools look at your Windows region setting when deciding your language. No, I don't want your broken native translation on my English Windows installation just because I like to still have € in front of my currency.
That's if you're lucky! Google does far worse, and picks a language based on their often-broken geo-IP system. It's not even fixable. When viewing alt text for Google doodles, for instance, I get "localized" text even if the rest of the UI is in English. Google Play also has jacked-up section titles from time to time.
Netflix's search box is the same way. This is in addition to the discriminatory practice of often not providing subtitles in the language of the audio track.
Chrome would also install in the geo-IP language, regardless of system settings. How arrogant is that? They deliberately disregard your OS language settings and pick another for you. And it'd force the default Google search to go to the localized version until it detected a new location. Which, in Denver, has had me appear in France and then Hungary, as their database somehow accumulates errors. And again, there's no way to fully opt out - even selecting English and .com would still show country-specific logos and such on e.g. YouTube.
Xbox is also a mess. You can buy games that are country restricted, with zero warning. When downloading, the Xbox provides no indication of a problem until it finishes. Then it does a geo check and reports "download corrupted". Xbox support wanted to RMA the unit, as they were convinced this was a hardware issue and had no KB info on country restrictions. Using a VPN fixed it.
Basically it seems that many developers are brain dead or simply do not care about travelers, expats, or anyone with different language prefs.
Don't forget monolingual managers that just don't care though. I've done so many last-minute string translations... "Oh shit, we forgot French, this was supposed to launch last week, you have 1 hour." And then insert the types of cases from the original article here where you need grammatical logic in a fixed string. Ugh.
I have tended to use the localized versions of OSes and software, not because I prefer them but because I often ended up supporting end users with localized versions.
> Adhering to that single rule would eliminate like half of the "context missing" problems with software translations.
Sort of. Almost all the issues I have with translators and context come from the same word being used for different things. This is particularly true for words like "date" and "time", which can have different translations depending on context (is it the time of day, or the time the test has been running?)
So as well as using the same word for the same meaning throughout, using different words for similar-yet-subtly-distinct meanings is also required.
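gettext actually has a mechanism for exactly this: message contexts. A minimal sketch (my own illustration, assuming the gettext.h convenience header that defines pgettext):

#include "gettext.h"  /* gnulib convenience header defining pgettext */
#include <cstdio>

/* The same English word gets a distinct msgctxt per meaning, so the
   translator sees two separate catalog entries and can render each
   one differently. */
void show_labels()
{
    std::puts(pgettext("time of day", "Time"));
    std::puts(pgettext("elapsed duration of the test", "Time"));
}

With that, "Time" can become "Uhrzeit" in one place and "Dauer" in the other without the two strings colliding in the catalog.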
I'm pretty sure the general opinion among everyone other than Jehovah's Witnesses is that their Bible translation is not good, and that the same-word rule is one of the reasons why.
And I really don't see how having a similar rule for translations of text in software would eliminate most "context missing" problems. It seems what it would actually do is stop translators even guessing those missing contexts.
> And I really don't see how having a similar rule for translations of text in software would eliminate most "context missing" problems. It seems what it would actually do is stop translators even guessing those missing contexts.
Well, for one, translations would at least be consistent. In software, words often have a specific meaning related to the application itself. You're not free to translate the word "like" on Facebook however you like, because it has its own specific meaning that is different from the dictionary one. The same applies to things like tools in Photoshop, etc. In general, wrong use of synonyms for things that have an application-specific meaning is one of the most common problems with translations I see (and the same happens in official movie translations).
When you don't have (or can't be bothered to get) context, this is the least you can do to play it safe.
I have a tool that does this. I produced machine-generated translations into German and French as an expedient. The German messages were corrected by a native speaker, and I cleaned up the French as best I could. When French is in use, an extra message is printed appealing for someone to edit the weak translations.
I dealt with word order issues by avoiding formatted strings with more than one replacement field. Only one string needs to deal with singular vs plural quantities.
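A minimal sketch of that pattern (my illustration, not the actual tool), using gettext's ngettext so each string carries exactly one replacement field:

#include <libintl.h>
#include <clocale>
#include <cstdio>

int main()
{
    std::setlocale(LC_ALL, "");
    /* "myapp" and the locale directory are placeholders. */
    bindtextdomain("myapp", "/usr/share/locale");
    textdomain("myapp");

    int n = 3;
    /* ngettext picks whichever plural form the active language's
       catalog defines; the string has a single %d, so translators
       can move it around freely. */
    std::printf(ngettext("Scanned %d directory.\n",
                         "Scanned %d directories.\n", n), n);
}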
Lack of context was a definite problem with the machine translations. It would inconsistently choose translated words that should have been the same because the short string snippets could not be evaluated for their meaning within the narrow domain of the program using them.
Same here. Technical text translation is normally a disaster. I truly hate web pages that switch languages based on IP location, as MSDN sometimes does. In my language the translators always get wrong the keywords that should not be translated, especially when they're adjectives, e.g.:
* "You should use the 'new' keyword..." Often get translated as something like "You should use the word that is not old..." / "Debe usar la nueva palabra..."
To be fair this is because the code is wrong. It should use:
printf(_("You should use the '%s' keyword"), "new");
indicating that the main string should be translated (_ == gettext) and the keyword itself should not. This also happens to generate slightly smaller binaries and fewer translations in the case where you have several keywords.
@eloisant: I work in automotive, where I deal with translations for instrument clusters for one of the largest auto makers. Automotive companies are moving to what are called reconfigurables, which is basically an instrument cluster with no mechanical gauges; just a screen with gauges rendered by a 3D engine. Center stacks too.
I kid you not, the way we translate is to use Google Translate as a first pass, and then the screens get reviewed by people who know the language. I guess some things slip through, and we evidently pissed off a lot of Chinese folks because of a similar flub. The folks doing the first pass don't know any other languages.
It's quite comical but also a real pain. One of the things I had to work into our code was left-to-right vs. right-to-left text; you would think it would just be a C-style string, but we have to know the direction for text justification semantics.
I don't actually do translations but work on the HMI where the text is displayed. Another pain point is that we have a certain space where text needs to be displayed and everything is fitted using English but after translating to other languages, some strings are much longer than the allotted space.
I can totally imagine that. Recently I've been helping a guy with tweaking his integrated car navigation/radio/media player, and I spent some time browsing through its firmware and file system. I saw the translation strings and man, they were horrible. Also, half of the stuff wasn't even translated (though it didn't show anywhere in the UI).
BTW, half of the files and directories were named in Chinese, which made my work very "fun", but that's another story...
> Another pain point is that we have a certain space where text needs to be displayed and everything is fitted using English but after translating to other languages, some strings are much longer than the allotted space.
As a rule of thumb, if you're using English strings to design your UI, you should account for 30% more space so other languages can fit in. Of course, sometimes this might not be enough.
I worked for a place that used German as the "placeholder text", to avoid the problem of text longer than the allotted space. German is supposedly about 1.5 times as long as English, so if you can fit the German version, you won't have any problems with other translations being longer than the space.
That's the other way around: the guy who did the translation without context wrote "jour de merde" ("shitty day"), and I saved them from releasing that in their phone's calendar app during my one-day job paid in cash.
As far as I can tell, the best localisation tool that almost nobody is using is http://www.grammaticalframework.org/. Licensing is a mix of GPL, BSD and MIT pieces.
It's a high-level functional programming language with a dependent type system, specialised for operating on language ASTs. Its resource library, to quote, "covers the morphology and basic syntax of currently 29 languages: Afrikaans, Bulgarian, Catalan, Chinese, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hindi, Japanese, Italian, Latvian, Maltese, Nepali, Norwegian bokmål, Persian, Polish, Punjabi, Romanian, Russian, Sindhi, Spanish, Swedish, Thai, Urdu."
In essence, once it has the language-independent AST, it can produce output in all its supported languages with the correct tenses, genders, inflections, etc.
It also seems to have tools for assisted parsing, so you could have an English document and interactively parse it into the correct AST. In addition, the text can be parameterised semantically, so if you changed the gender of a person, that could propagate to all the right places and update the translations as required.
While it seems the upfront cost may be quite high in having to learn such a complex system, I think the benefits of having reproducible, high-quality outputs into n languages for free could make this highly advantageous in many applications.
I'm very skeptical that this would work outside of toy examples, though it depends on what is meant by language-independent AST.
For example, the best way to translate the Spanish "X dió un golpe a Y" would be "X hit Y". But my naive idea of what the AST for the Spanish sentence would look like would be something like `(GIVE (X HIT Y))`, which when naively transduced to English would be "X gave a hit to Y", which is either unidiomatic or means the wrong thing altogether. In order to avoid this problem, the AST would have to be a more abstract representation of the semantics. And coming up with a sufficiently expressive, tractable, and neutral representation of natural-language semantics is an unsolved problem that people are still devoting their whole careers to.
I was briefly involved in a very early stage startup that was considering using systems like this for better machine translation. We ran into problems like the above, and also: ambiguity, and the fact that the hand-written grammars and semantic representation systems were just very brittle and incomplete.
> which when naively transduced to English would be "X gave a hit to Y", which is either unidiomatic or means the wrong thing altogether
What's interesting is that there are dialects of English (Hiberno-English, spoken in Ireland) where "X gave Y a hit" would be a way to say "X hit Y". :)
That sounds like a great technical approach, although it can't necessarily remove the problem of idiomaticity that the article mentions (with the example of "I didn't search any directories"). Probably better examples are possible, in any case where the most idiomatic way to express something isn't a literal translation of that thing from other languages. Maybe like:
English "I don't care"
Portuguese "tanto faz" (literally 'so much does')
German "[das] ist mir egal" (literally '[it] is equal for me')
Side note: as you might expect, Wikipedia's internationalization is the only system that attempts to do quantities and other formatting correctly for every goddamn language on the planet, but is considerably easier for translators to work with than the OP's examples (sorry, Sean ;)
I did some work on bringing it to JavaScript and making it HTML-aware, and since then Santhosh Thottingal has vastly extended it and it's become pervasive at Wikipedia. More projects should use it, or at least learn from it.
Ironically, this is exactly the case described in the article, and even the article got it slightly wrong. The rules in pseudocode - for this specific sentence! - are as follows:
if ((n % 10 == 1) && (n % 100 != 11)) {
    // singular nominative (not accusative, as the article says): 1 котенок, 101 котенок, 301 котенок
} else if ((n % 10 >= 2) && (n % 10 <= 4) && (n % 100 < 12 || n % 100 > 14)) {
    // singular genitive: 2 котенка, 43 котенка, 1024 котенка
} else {
    // all other cases, plural genitive: 5 котят, 11 котят, 212 котят
}
But this is true only for this sentence because kittens here are the subject, not the object: with Russian verb "есть" (to be, to exist) the literal translation would be "Kitten/Kittens exists/exist belonging to Harry".
Once the declension of the numbered subject(s) turns to accusative: "Гарри гладит 1 котенка" ("Harry pets one kitten"), the rules become much simpler:
I've been looking for something like this for weeks, with no luck till now; I guess I should have searched for internationalization instead of localization in this specific case on Google and GitHub. This is a great starting point, thanks a bunch for sharing. Perfect for me.
Slightly OT, but my favourite localization error was in Ubuntu, when they had that nice netbook interface (that would later become Unity). The network icon label was "Rojo" in the Spanish localization, which is the word for the colour red. What?
Well, if you translate "Net" to Spanish you get "Red"; and if you translate that again (by mistake), you get "Rojo". There you are :)
I once got a Spanish translation back and was briefly angry that the translator had left a few strings as "TODO"...
"What?!" I asked myself. "To do? Why did they ship me an incomplete file?"
Then I looked to see what the "untranslated" strings were: "ALL". (I don't speak Spanish, but I have enough of a smattering of European languages that it was immediately obvious what was going on: "todo" is Spanish for "all".)
When I worked on Outlook.com, we got feedback from a British user that he couldn't understand what the option to "Connect devices and apps with DAD" meant. Turns out we had forgotten to prevent localization of the "POP" protocol everywhere it was used.
I'm really surprised that this is the first time you've seen "OT" being used. I can only presume it's because you don't normally read English websites/discussion boards, but I still find it intriguing that you've never encountered it before.
In Sweden we use "OT" as well but it's referring to "Off Topic" and not something Swedish.
I read mostly English websites/discussion boards (reddit, HN) and I know a bunch of abbreviations (IIRC, INB4, AFAIK, QED, IFF, ST...), but I don't remember seeing OT. Learning new stuff every day :D
An amateur error perhaps, but hardly limited to free software.
In a particularly expensive piece of enterprise software for the finance sector that was translated into German, a form field was labelled 'Hauptstadt', which translates back to 'capital (city)', as in Washington, D.C.
Now, this kind of geographic information doesn't make sense at all in financial software, but the well-paid developer (who had to double as the software's translator, because hiring a proper translator would of course have been too expensive ...) simply couldn't be bothered to look up the different words 'capital' translates to in German, one of which is also the right choice in this case: 'Kapital' (meaning 'capital' as in 'assets').
That's a head-smacker. Some guy somewhere must have actually painted that on the trailer - does that guy not think "Hey, this doesn't really make sense. Are you sure they didn't want somebody to translate this to Arabic?"
Or maybe he did and his boss told him to shut up and do as he's told. Or he thought - "Screw this, I wanna go home on time. I'm not gonna bother to find an Arabic translator now. I'll just paint exactly what it says, and they can't get me in much trouble for that."
There's a certain schadenfreude to be had where the failed translation is also typo'd - but only once. These sorts of things are especially embarrassing, as it not only shows no effort was spent proofing, but that nothing tripped a sanity check at all.
> A professional translator makes sure to check the context of the translation, it doesn't go blindly translating sentences and words without context.
From my experience with both software and movies, that's exactly what a professional, paid translator does (or is forced to).
Amateurs at least watch the movie/run the software before translating things, which is something I definitely cannot say about "professionals" translating movies and software to my language.
From my experience with professional translators I would have to go out of my way to specify context if I wanted a good translation, because professional translators are too technologically illiterate to get stuff right.
Not to mention the difficulty in getting them to use Virtaal or PoEdit properly.
The problem you describe is one of incompetence. The professionals are not really professionals. And yet, someone is paying them as if they were. That is not the case everywhere, I assure you.
A little anecdata: In Portugal, where foreign stuff is translated and subtitled, the quality of translations/subtitles for English-language movies and series is very good. Or at least it used to be; I can't be sure anymore, since I don't currently watch TV. I grew up used to reading good translations. Those translations taught me to understand and speak the English language when I was growing up, since age 5. I understand I might be a bit biased because of my personal experience.
My personal experience comes from Poland, where quality of movie translations has been basically a running joke throughout the years among the population that knows even a little bit of English. I do know personally a few professional translators - they're really smart and competent people, and most of them are even quite tech-savvy. Sadly, they don't get to translate movies.
Ah, that's why Word 2007 ended up with »Keine Gliederung« (»no (document) outline«) for the setting where you can select a shape outline colour ...
In all seriousness, translators make mistakes by getting the context wrong all the time; this applies to both professional translators and amateurs, perhaps with different severity. But in my eyes that's not really a problem of the translators, but rather of the tools we use. A long list of strings with maybe only a vague context is a horrible UX for the translator. Who could fault them for translating »outline« as »Gliederung« in a word-processing application? A string table is probably the most convenient format for programmers, but not very much so for those maintaining translations.
Qt gets a bit of this right. If the translator has access to the source code and uses Qt Linguist, at least the strings that are directly in .ui files are shown with a mock-up of the respective window where the control with that text is highlighted [0]. That already helps a lot with context errors. Of course, it does nothing for text in the source code, and so our translator went ahead and translated »Breite« (line width) with »latitude« because the application in question was, after all, related to maps, geography and GPS; it just happened to have a setting to change the width of certain lines, which, in German, then mapped to the same word.
Qt also has a nice way of handling plurals in that both Linguist and the QTranslator class are aware of the languages and the special rules concerning plural forms in them. You create a translatable string like »Searched %n directories« and then create translations for that. English, German and lots of others just map to two forms (1, rest). Russian gets three, I think (1/21/31/..., 2/22/32/..., 11/12/rest), and so on. Downside is that you only get to handle a single plural in a string [1], and you have to create an en→en translation as well (to account for multiple forms in the source language). But generally it's a quite nice implementation. Gettext has something similar where you can write your own plural form matching rules somehow, but most translators I met don't really want to write math.
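For reference, the plural-aware call side looks roughly like this (my sketch, not any particular project's code):

#include <QObject>
#include <QString>

/* Qt substitutes %n and picks the plural form that the loaded .ts
   translation defines for the active language; a Russian translator
   gets three fields in Linguist for this one source string. */
QString searchedMessage(int dirCount)
{
    return QObject::tr("Searched %n directory(s)", nullptr, dirCount);
}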
Visual Studio with Windows Forms has a mode of creating the string tables for translation where you can just change the language in the Form's properties and then proceed to change each control text. This nicely solves the context problem in that the translator edits the window directly and it looks like it normally does. But it also creates a whole bunch of other problems: Translators need VS, they need to edit the project directly and can accidentally mess up the UI with a mouse twitch when selecting a button to translate. They also might miss things that are buried in menus since there is no real measurement of completion and what's still missing. I've seen various projects, especially in web environments adopt a similar custom-written approach, though, where you can edit the UI directly in the application to translate it. Still with the problem that translators might miss hard-to-find and buried strings (one might argue that you should get rid of buried and obscure places in your UI anyway, though).
Long ramblings ... it was a topic I considered for my Diploma thesis while studying. But I couldn't think of a good way that could retain context for the translator in a general case, or at least most of the time. I thought about replacing each and every translatable string in a program with a custom identifier and then later trying to find those again via UI Automation, or maybe screenshots and OCR, to be able to map strings to parts of screenshots of the program. It would have required running the program once with those custom identifiers and once in normal mode, and somehow matching up the identifier screenshots, the normal screenshots and the string tables. And still with the problem that you'd need to manually go down each and every dialog and menu, including context menus, messages that only appear in certain states, etc.
Perhaps there just is no good solution, except maybe for developers to properly annotate each translatable string they use. In the »Breite«/»width«/»latitude« case I went through the source and added translator comments detailing the meaning of the word for every instance, but with large applications having thousands of translatable strings that could become unwieldy quickly.
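For the record, gettext supports this workflow reasonably well: xgettext --add-comments=TRANSLATORS copies specially marked source comments into the .po entry right next to the string. A sketch, reusing the _ == gettext convention from above:

/* TRANSLATORS: "Width" is the stroke width of a line drawn on the
   map, not geographic latitude. */
printf(_("Width: %d"), line_width);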
I like the Visual Studio approach. I usually resort to running the software I'm translating to get context. Sometimes it's very hard to get some of the text to show up.
I can only think of one way to give translators context: Comments, the exact same way you give other programmers context for your code.
> But I couldn't think of a good way that could retain context for the translator in a general case, or at least most of the time.
How about assigning identifiers to all strings, then adding tooltips for those identifiers in the application? So the translator can hover over a menu item to determine which string they are supposed to translate.
This requires the translator to exercise the whole application, which is, of course, also rather difficult.
The goal in my mind was to try generating context information for translators as automatically as possible while retaining the usual workflow developers would use in a given framework for localisable resources.
But yes, the requirement to cover every control, menu and dialog as well as every possible code path that uses a localisable string from the source code makes the whole endeavour very impractical to solve. With dialogs built in markup, e.g. Qt's .ui or XAML it's easy enough to give context, but the hard part is strings in code where you never know where they'll end up.
To be fair, it was a beta of the "Ubuntu Netbook Remix", and that bug was fixed shortly after.
I took some screenshots because I really liked the interface, but none of them show the bug, because I usually stick to en_GB and the problem was spotted on my wife's netbook (she's a Spanish teacher and likes to have the OS interface in Spanish to amuse the kids).
Am I the only one who gets confused about what the menus and other things on a non-English computer or device actually mean when translated?
This coming from someone who doesn't speak English as their first language.
I speak Romanian, Hungarian and English perfectly but my devices are always in English since it would be harder to figure out what they mean otherwise unless you have the structure memorized already.
You are definitely not alone in that. My native language is Arabic, but if I switch my phone or computer to Arabic I wouldn't be able to use it. I dread testing translations of apps I'm working on because it forces me to switch to Arabic. It doesn't help that there's a lot of jargon that just doesn't translate that well. For example, the translation of "Tap" to Arabic is the equivalent of either "Peck" or "Perforate". It's just that that's what's generally agreed on as the proper translation and everybody uses it, however, if I didn't know that it would take me a while to figure it out for myself.
All things being equal, I guess that's because you accept paying the necessary price of learning the English interfaces of your devices, but don't consider it reasonable to spend any learning effort on other languages.
> but don't consider it reasonable to spend any learning effort on other languages
One problem is that English is the language the majority of technology is created in, and it has developed agreed-on terms for a lot of things. Everyone calls their tabs "tabs" in English, but when it comes to translating, every other app uses a different word. Even if names are consistent within an app, they are inconsistent between apps. This leads to a lot of problems if, say, an app reports an error that's related to some OS thing and calls it by a different translated name than the OS translation does.
EDIT: And I'm not making this up. I worked on a localized Ubuntu for a very short while before switching back to English because messages I got in my command line were painfully inconsistent with each other and with agreed-upon proper translations in my language.
If it's intended to be a romanization of ね, then yes, "ne" is more common. Alternatively it could be a reference to Ender's Game, where "neh" is used in the same way.
Could also be from German, although this would also be written ne instead of neh.
People from the northern part of Germany use this in pretty much the same way it is used in Japanese (at least according to what I know with my limited knowledge of Japanese)
I always thought of this as a strange quirk that the same language construct can evolve in two unrelated languages.
It's just like parallel evolution in biology.
There's also Brazilian Portuguese "né", which is an end-of-sentence tag with exactly the same meaning. It's a contraction of "não é" ('isn't it') and is used in a way akin to German "nicht wahr".
Also "isso" which in German is colloquial for "Ist so" and means "That's it" or "Exactly". When in Brazil I always found it funny that they use "isso", short for "isso mesmo",
in much the same way.
It's funny that "gel", which you hear pretty often in Austria (and probably South Germany), means the same as well.
It's approximately the same as "nicht wahr?", or "oder?", which can't really be translated, but suggest wanting some kind of verification for what was said.
Since I'm German, this is probably at least the reason why I found it so easy to accept this construct, when I stumbled upon it in Ender's Game. It just felt natural, so my brain decided to use it in English, too. :)
1. L::M::L::G enables a PO-based workflow. This is not correct except for trivial lexicons. L::M::L::G cannot parse the plural extract from the article, so it is incompatible with properly working PO emitters.
2. Maketext is used on typepad and "works really well". This claim is not backed by any evidence. The verifiable facts from the article pointing out the pluralisation problems of L::M still stand unchallenged.
In order to avoid these pitfalls, I usually get out of "sentence" mode into "label" mode. For instance: "Directories scanned: 12". Probably not well suited to all cases, but usually good enough for mine, though I only have to support pt-BR, es-ES and en-US, so maybe that's not saying much.
Exactly. All this work, or just restructure the message. It should be acceptable in most languages, because charts and spreadsheets aren't going to have per-cell labels. And it has the benefit of being easier to read and parse.
Also, it's really terrible style to use the first person in an app unless it's actually sentient. Otherwise it's annoyingly like Clippy, or just plain obnoxious and presumptuous.
> Also, it's really terrible style to use the first person in an app unless it's actually sentient. Otherwise it's annoyingly like Clippy, or just plain obnoxious and presumptuous.
I agree, though I found another nice use case, well demonstrated by Bret Victor[0][1]. I played around with it for a while, and I find that describing what will happen in a normal sentence, parts of which you can tweak, is a pretty good way of doing options pages.
Kudos to the author of the article for his perseverance in decorating messages to fit grammar. I'd go another way and just use a more formal, dry format:
Number of scanned directories: %g
Number of found files: %g
That solves the problem with Slavic languages, at least. The Italian aversion to 0 may be mitigated by printing 'none', I guess. Please correct me if this form does not fit other languages.
Just scanned the comments to see if anyone would suggest that :) Being the lazy type, that's the first thought I had reading the article. Perhaps it's not appropriate for all target audiences and all target languages but many times you can find a much easier solution by going about it in a totally different way.
Funny how Slovene seems to tick all the complication checkboxes :D We have 4 grammatical numbers (singular, dual, a plural for 3 and 4, and a plural for 5 and above), they repeat at mod 100 (so 101 is singular, 102 dual, ...), it's an inflectional language with 3 grammatical genders, sentences should take a different form depending on whether the user is male or female, ...
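For the curious, the plural-index rule that gettext catalogs commonly use for Slovene maps that description to four forms; roughly (my sketch, ignoring the gender issue):

/* Standard gettext plural header for Slovene, as far as I know:
   nplurals=4; plural=(n%100==1 ? 0 : n%100==2 ? 1 :
                       n%100==3 || n%100==4 ? 2 : 3); */
int slovene_plural_index(unsigned n)
{
    switch (n % 100) {
        case 1:         return 0;  /* singular: 1, 101, 201, ... */
        case 2:         return 1;  /* dual: 2, 102, 202, ... */
        case 3: case 4: return 2;  /* 3, 4, 103, 104, ... */
        default:        return 3;  /* 0, 5..100, 105, ... */
    }
}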
> Compared to the grammar of Slavic languages, English is super easy.
And thus is solved the mystery of why all us Slavic speakers have such good English. Especially online. Because nobody had the guts to localize to our languages.
If they did localize to our languages, "Omg, this looks weird and smells off," we all cried, and switched our interfaces to English. Admit it, you did this in high school or even middle school, because the <insert Slavic language> translation just didn't make sense.
And so all of us who are now in a position to localize these interfaces don't because it seems silly and pointless.
We come full circle. Younglings after us use computers in English because no good localization exists.
Yeah. I remember being a kid and HATING Polish translations of English games. Even though I barely knew any English, playing a game in Polish just felt wrong. Nowadays I think it just felt more mysterious and interesting if, apart from being difficult, the game was in a language I couldn't fully understand. But yeah, it contributed massively to me learning English, so it was definitely a good thing. But I also know what you mean - I wouldn't translate my programs into Polish unless I had to, because it just feels wrong to have them in my native tongue.
I think the poor understanding of English was part of the problem. We didn't understand, we just knew that you found what you need when you click "Insert" or whatever. To us "insert" didn't mean insert, it meant "That menu where you find an Image to put in your document".
In our mind the English word and its native translation have semantically different meanings. I discover this problem a lot now that I'm older. I understand the words and their translations, but they still mean semantically different things to me in different languages.
The funniest part is how when I'm in the US I know all the English words for pots and pans and stuff, but when I'm home in Slovenia and my girlfriend is here, it becomes almost impossible to translate. Because in Slovenia the pots and pans have Slovenian names, in the US they have English names. And to me those are completely different.
> In our mind the English word and its native translation have semantically different meanings.
Because they quite often do (I'd wager more often than not).
To use your example, English "Insert" cuts through a different part of concept-space than Polish "Wstaw" (which is what usually ends up as a menu label). It's a fine translation up until you ask to insert a DVD - now you should say "Wsuń" or "Włóż".
And I don't think this is a problem. As I grow older I realize that this is how it should be. Personally, I think that the moment you stop mapping words from the new language to words in your native one is when you actually start to be proficient in the language you're learning. Having mappings like "fork" -> "widelec" -> "<that pointy metal thing you use to eat>" in your head is totally the wrong way to use the language. "fork" and "widelec" are two different labels referring to two different areas of concept-space that happen to intersect somewhere in the area where you think about eating utensils.
That's why I keep my internal monologue (and speech, if people I'm talking with don't mind) switching constantly between English and Polish - there are concepts I can express with one language that I can't express in another, and any attempt at translation feels like lossy compression.
(and then I get tons of hate for occasionally saying "robi sens" instead of "ma sens"; I know what the proper translation of "makes sense" is, but the Polish expression emphasizes 'sense' as a property of things while English shows it as something that can be produced, and sometimes the idea I want to express is closer to the English than Polish version)
The problem arises when your girlfriend (or friend or whatever) visits you in the homeland and you suddenly find yourself struggling to find words because many of the things at home don't have English names. It becomes even worse when you have to act as translator between them and your family who doesn't speak English or doesn't speak it as well.
Yes, I had a lot of fun over the holidays ...
Even something as simple as "Did your mum like me?" becomes difficult to translate because your mum said X and X doesn't quite translate into English with all the connotations preserved.
And don't even get me started at the case where you don't know the noun beforehand and need to synthesize the entire numeral phrase at runtime ("You have %d %s.")
Because, in Polish, numerals inflect for gender and case, come in two main variants ("normal" and "collective", and that's not including ordinals), and for some genders can (but don't necessarily have to) undergo case changes that affect the verb part of the sentence under certain circumstances. Add to this the fact that there's not even a definitive consensus on how many grammatical genders there are in Polish (opinions range from 3 to 9, with some of the theories based on numeral connectivity), and you're all set.
The Polish word "dwa" (two) has at least seventeen distinct grammatical forms, each of which has arcane rules that govern its usage.
Not to mention that I would never realise that "imenik" means directory. An imenik is a phone book. Potentially the Contacts app on my phone. But never a directory.
I think this has to do with vocabulary registers as well. We've learned to use English words for computer things. Using Slovene translations just feels weird unless they're a bastardisation of the English word. Similar to how English uses French words for foods, because cuisine was a thing of the elite, and they didn't eat pig, they ate porc.
This looks quite nice and powerful and indeed an elegant way of solving the problem with multiple plural forms in a string. The only concern I have with that syntax is that it's yet another DSL, or markup language and translators need to know it, or could get it wrong. Granted, a program for helping translators might do automatic linting (much as Qt's Linguist already warns if you omit placeholders from the translated phrase that are there in the original).
Another thing is that the mini-language grows complex enough that the resulting text can be quite hard to read and understand:
{people, plural, offset:1 =0{No one went.} =1{{user1} went.} =2{{user1} and {user2} went.} other{{user1} and # others went.}}
is just a single placeholder (or two), and it takes a while to even parse how it's supposed to work.
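For concreteness, here's how such a message gets formatted through ICU's C++ API (my own illustration, not from any of the tools mentioned here):

#include <unicode/msgfmt.h>
#include <unicode/unistr.h>
#include <iostream>

int main()
{
    UErrorCode status = U_ZERO_ERROR;
    /* One positional argument; "one"/"other" are English's two
       plural categories, and # stands for the number itself. */
    icu::MessageFormat fmt(
        u"{0, plural, one{Searched # directory.} other{Searched # directories.}}",
        icu::Locale::getEnglish(), status);

    icu::Formattable args[] = { (int32_t)5 };
    icu::UnicodeString out;
    icu::FieldPosition pos;
    fmt.format(args, 1, out, pos, status);

    std::string utf8;
    out.toUTF8String(utf8);
    std::cout << utf8 << "\n";  /* "Searched 5 directories." */
}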
Well, if we're going to introduce that level of complexity into a DSL, why not go full Turing-complete and write it in code?
(case (length folks)
  (0 "No one went.")
  (1 ((elt folks 0) " went."))
  (2 ((elt folks 0) " and " (elt folks 1) " went."))
  (otherwise ((elt folks 0) " and " (- (length folks) 1) " others went.")))
You can wrap that in a lambda that concatenates resulting strings and voilà, you have "smart" string tables. And it's not a problem to make it even more DSL-y and translator friendly.
And then you have the exact same problem as if you'd write that logic in your source code. Just with half a dozen layers of abstraction, a more cumbersome way of displaying strings in your application and another programming language on top. I'd say that's not a net positive.
I disagree. That logic has to go somewhere anyway - you can't skip it because it's inherent in the problem of displaying a proper message. So you could at least write it in an expressive language instead of encoding it into what looks almost as readable as regular expressions.
No you don't, in much the same way that you don't need a translator that "knows JSON or XML". Just don't tell them it's Lisp. That's how you do DSLs.
Also, I advocate closer work between translators and developers. Let the translators give the text and explain corner cases to someone who can code up the logic.
BTW, Lisp is only hard for people who acquired this stupid meme that "Lisp is weird/for crazy people". You'd be hard-pressed to find something simpler in terms of syntax and readability.
I recommend teaching the translator about the markup. It's really easy to understand.
The way L10ns tries to mitigate the complexity of the markup language is to provide buttons for pasting the correct markup for translators. Also, a developer translates into his own language first, and this example will always be visible to the translator of another language for easy reference. So the programmer offloads the logic thinking from the translator.
IMO, ICU's MessageFormat markup is also slowly becoming the standard for localizing strings. It is already used widely by big organizations such as Apple, Google and Yahoo.
L10ns pre-compiles all message strings, so the parsing cost is paid ahead of time.
Just curious... are Slovene web services more likely to ask your gender after you sign up so they can get the grammar right or do they just go with 'male' or something?
Russian speaker here; I don't think it would be relevant, since services ought to be addressing you in second person, which I believe is gender-neutral in just about every Slavic language.
The two Turkish letters dotted and dotless i are often confused by users of poorly localised software. Wikipedia links to a murder case allegedly caused by this:
http://en.wikipedia.org/wiki/Dotted_and_dotless_I
A real horror story.
(Less seriously, Unicode has counterintuitive case-changing behaviours with those letters. If you are working outside the Turkish locale and uppercase a dotless I and then lowercase it, it gains a dot. I am curious about this design decision, since it seems like a basic error in operating at the level of glyphs rather than symbols. Or maybe the opposite.)
Upper- and lower-casing can't be assumed to be inverses; there are plenty of other cases where round-tripping will change the text (e.g. precomposed characters that don't have a precomposed upper case). The correct lower-casing of "I" in English is definitely "i"; the correct upper-casing of "ı" in English is maybe a wrong question, because it just isn't an English letter, so I guess you could argue for leaving it unchanged, but converting it to "I" is probably what the person who wrote "ı" would want to happen when it was upper-cased. Maybe?
Yes; the Turkish "I"s under discussion here are the most immediate case, but there are other cases where you have two almost-aliases in one case that aren't present in another case even ignoring composition. E.g. the ohm symbol "Ω" lowercases to a standard omega "ω", but that uppercases to a standard uppercase omega "Ω", because there's a distinct codepoint for "ohm symbol" (even though it's "just" omega, perhaps because some legacy codepages included it as a symbol without including a full greek alphabet) but no corresponding lowercase codepoint.
> E.g. the ohm symbol "Ω" lowercases to a standard omega "ω", but that uppercases to a standard uppercase omega "Ω", because there's a distinct codepoint for "ohm symbol" (even though it's "just" omega, perhaps because some legacy codepages included it as a symbol without including a full greek alphabet) but no corresponding lowercase codepoint.
Except that the NFKD form (which is what I was specifically asking about) for 'OHM SIGN' is 'GREEK CAPITAL LETTER OMEGA'.
As you said and linked elsewhere in the thread, the Unicode consortium takes the viewpoint that there should be one codepoint for each glyph, even if that glyph has multiple semantic meanings in different languages (e.g. "U"). So by that standard they should probably be the same codepoint, but in that case it's hard to argue that the Roman capital I and the Turkish capital dotless I should be different codepoints.
Alternately you could argue that ohm symbol shouldn't lowercase to omega, which, maybe. I think the right view is simply that lower- and upper-casing aren't always well defined, are culturally and contextually dependent, and are probably something you should only ever be doing for display, not for semantic purposes. (If you want to do case-insensitive comparisons of strings, Unicode comes with algorithms for that which do a better job than upper- or lower-casing the strings before comparing)
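If you do need casing for display, libraries expose the locale explicitly. A sketch with ICU (my own illustration, not something from the thread):

#include <unicode/unistr.h>
#include <unicode/locid.h>
#include <iostream>

int main()
{
    /* The same source letter upper-cases differently per locale:
       "i" becomes U+0130 "İ" under Turkish rules, plain "I" otherwise. */
    icu::UnicodeString tr_upper(u"i");
    tr_upper.toUpper(icu::Locale("tr"));

    icu::UnicodeString en_upper(u"i");
    en_upper.toUpper(icu::Locale::getEnglish());

    std::string a, b;
    tr_upper.toUTF8String(a);  /* "İ" */
    en_upper.toUTF8String(b);  /* "I" */
    std::cout << a << " vs " << b << "\n";
}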
With Unicode, why aren't the two Turkish Is just treated as if they have nothing to do with the normal Latin I? The fact that the glyph for uppercase dotless I resembles the glyph for uppercase Latin I should be irrelevant, surely. It's a kind of typographic false friend situation.
Maybe there's a missing level of indirection in Unicode that prevents it from doing this, but I can't see how there could be.
One answer is that Unicode had to import existing documents; I suspect that a lot of documents were written in a Turkish codepage that would have been an 8-bit encoding with the lower half as ASCII, which wouldn't have bothered with a different codepoint for a "Turkish" I. As I said, you can't rely on upper/lower-casing round-tripping correctly in general.
(I was about to give the example of ß, which is usually uppercased to SS. But interestingly Unicode has now adopted a codepoint for the (disputed, and currently lacking a typographic consensus) capital version, ẞ. So maybe a codepoint for "uppercase Turkish I" is on the way. Turkish users will still expect to be able to lowercase "I" to a dotless lowercase i though, since a lot of existing documents will have "I"s in)
I did a bit of research on this and you're right, legacy encodings are one problem. More seriously there seems to be no established way to manage multilingual text which includes homoglyphs (say by using colour coding) so you would really be replacing one problem with another.
It does seem like this Turkish I problem is the most conspicuous situation, maybe unique, where changing locale changes the behaviour of toupper/tolower. Unicode, on the other hand, has many homoglyphs and duplicate characters which all need to be dealt with.
Yeah, Turkish I is a bit of a red herring, due to it being a quirk in Unicode specifically. From a previous comment of mine:
> They [the Unicode consortium] should have specified the Turkish alphabet to use ı and a diacritic to make the dotted one. That would have made (in this case) capitalization locale-independent. [...] I dislike the common usage of Turkish i as an example, because it is such an obviously fixable (if legacy stuff weren't a concern) flaw in Unicode rather than a fundamental issue.
That's helpful. Wikipedia page for "glyph" seems to concur:
"For example, in most languages written in any variety of the Latin alphabet the dot on a lower-case "i" is not a glyph because it does not convey any distinction, and an i in which the dot has been accidentally omitted is still likely to be read as an "i". In Turkish, however, it is a glyph because that language has two distinct versions of the letter "i", with and without a dot."
The Turkish I problem is seriously annoying. We recently tried adding Turkish translations to our website, only to find that .NET suddenly treats DataRow("ID") differently than DataRow("id") when your locale is set to Turkish.
I think if you stab somebody over a text message, there's more going on than just a missing dot. As if it'd be acceptable to kill somebody even if they did call your daughter a prostitute.
The article itself could benefit from corrections: "with a knife on his chest". That doesn't sound so bad.
> Less seriously, Unicode has counterintuitive case-changing behaviours with those letters. If you are working outside the Turkish locale and uppercase a dotless I and then lowercase it, it gains a dot.
AFAIK the only solution would be to error out when uppercasing a dotless i in a non-Turkish locale, which I'm not sure sounds better. Or going back in time and creating a separate category of i and I for the Turkish script.
There's always an exception. You have two choices.
First: you might split it into three strings and trust the translator to handle it.
For example, you might pick:
"Your query matched %(filecount)s in %(directorycount)s"?
"%g files"
"%g directories"
with appropriate plurals for the %g ones.
Another translator might pick:
"Your query %(filecount)s %(directorycount)s"?
"matched %g files"
"in %g directories"
to get the flexibility they need.
You only hit a problem there when you get one word that requires declension dependent on both numbers in a way that you can't fit to be contiguous with the numbers. It's not foolproof, but it's certainly enough flexibility if you're willing to re-word things slightly.
Alternatively, and this is my favoured way, you can have the result of a plural lookup be a second msgid, for example:
msgid "files_found_by_n_directories"
msgstr[0] "files_found_directory_0"
msgstr[1] "files_found_directory_1"
msgstr[2] "files_found_directory_2"
msgid "files_found_directory_0"
msgstr[0] "Your query didn't match any files"
msgstr[1] "Your query matched a file in no directories"
msgstr[2] "Your query matched %(file)g files in no directories"
msgid "files_found_directory_1"
msgstr[0] "Your query didn't match any files in a directory"
msgstr[1] "Your query matched %(file)g file in a directory"
msgstr[2] "Your query matched %(file)g files in a directory"
msgid "files_found_directory_2"
msgstr[0] "Your query didn't match any files in %(directories)g directories"
msgstr[1] "Your query matched %(file)g file in %(directories)g directories"
msgstr[2] "Your query matched %(file)g files in %(directories)g directories"
Let's face it - sometimes it is better to just avoid the problem than to (try to) solve it. Just use some other form of the sentence; people will not even notice, and you will save yourself tons of problems.
Interesting, but I am wondering if it is really worth going through all this trouble just to support a few edge cases.
Personally, I usually don't even notice small mistakes like "1 directories" (or similar mistakes in my native language). Sometimes I will see the correct version somewhere and think "Oh, nice that they thought of that" but I definitely don't expect it.
Are the possible returns of having a "perfect" translation really high enough to justify investing in a much more complex system? I am sure translators who can code functions instead of just putting values into an Excel table will come at quite a premium as well...
You may not notice it. But there are other people who do.
That is the difference between a very well designed product and a product that just does the job.
That is the reason why engineers should not design interfaces; that should be left to UI/UX experts. Just the other day I, an engineer myself, complained to another engineer that his product does the job nicely and looks OK, but it was missing that little twist of a finished product that I'd really like to use. The interface was designed the way I would have done it myself, for lack of UI/UX knowledge.
Of course, I'm sure there are people who notice this. My question was whether something like this makes enough of a difference to justify the investment.
As a developer you only have a limited amount of funds and time to spend on your product. Your goal is to invest these in a way that gives you the biggest returns. Of course it's great to not just have a translation that "does the job", and if you can do better you definitely should. But if that takes a lot of work and only has a negligible effect on your sales, shouldn't you be prioritizing other things?
We have hundreds of thousands to millions of pageviews per day, in tens of languages, on the various pages of our site. The site grows through A/B testing, and I have to say that many of these small things do add up over time (they add up to measurable conversion in the funnel). There is the odd one that surprisingly does nothing or does worse, but generally, paying attention to language details did prove effective for us. I work on this stuff every day in a small team where we know more than 30 languages between us, and sometimes just two or three of these small changes more than pay for our annual wages in a single month.
> Of course, I'm sure there are people who notice this. My question was whether something like this makes enough of a difference to justify the investment.
Honestly - probably not. And that's exactly why we live in the world that is full of crappy stuff that works barely enough to get sold. I personally strongly appreciate when someone, however little economical sense that may have, goes the extra mile and makes their product polished. I'm willing to pay for it.
I don't know - in Russian, either "ты" (singular/informal) or "вы" (plural/formal) works fine for both genders. Now, for your example, the Google-translated result is "Вы уверены, что хотите выйти?", which seems fine to me!
"Wy" (plural you) don't work for single person in Polish, it sounds like "communist-speach" to us (almost like you called people "tovarishch") :) It was only used by soviet puppet politicians during communism (as carbon copy of Russian expression).
So it needs "Ty" (singular you), and with singular you "sure" translates to "pewien" for male recipients and "pewna" for female.
In Polish it should be "Jesteś pewny(or pewien)/pewna, że chcesz wyjść?"
So, it's Polish-specific, not Slavic-specific as I thought.
How is that typically handled in software? I can also imagine non-software examples where this might be hard (A sign reading "Warning: you are entering a restricted area", etc.)
I explained it poorly: verbs in the second person are gender-agnostic; adjectives are gender-dependent.
The "restriced area" is funny, standard version is "Nieuprawnionym wstęp wzbroniony" ~ "For non-priviledged-ones entry is forbidden", it's plural noun made from adjective, in nominative it would be different for male and mixed/female groups (uprawnieni/uprawnione), but fortunately in plural in dative case it's "uprawnionym" for both genders so it works out OK. I guess it's common and maybe that's why the dative case works that way?
In most dialogs in software, adjectives are the problem, especially "are you sure". Usually software that doesn't know your gender for other reasons just uses the male version.
In the non-software world, formal documents are often written with "he/she" in every gender-dependent place, often using the passive voice to cut down on the number of "/"s.
Also, if you 'localize' something using Google Translate[1], please let the user choose the language somewhere in the app.
For example, the Hostelworld iOS app[2] requires the user to change the language for the entire device. For something of a language perfectionist like me, that leaves the app virtually useless.
[1] Translating to English from other languages works fine for me.
Pluralization is a challenge, but we're able to solve this with some pretty simple HTML tags.
For example:
<div>I have <var pluralize="3">3</var> dogs!</div>
Localize.js identifies the <var> tag with the pluralize attribute, and pluralizes the phrase to any language (including languages like Arabic which can have 6 different plural forms).
Can you explain your example a little more? What would translators see in this case for Arabic; would they need to provide three translations? And if there were two variables, nine translations?
"Localization" also implies a lot more than just translation. Does Localize.js handle work like culture-specific number and date formatting? Different collation of records for different languages? How does it identify application-generated text versus user data (eg. on a blog, does it translate blog comments entered by readers, or just text like "Please enter a comment below")?
You don't solve the inflexion problem.
You would need something where the lemmatization/tokenization of the sentence is easily accessible to a grammar engine.
Something horrible like:
<div><lemm="personal pronoun">I</lemm> <lemm=verb>have</lemm> <case=accusative quant=3><lemm=noun quantifier /><lemm=...></div>
Lemmatization isn't necessary in this case, as there's no need for a grammar engine.
Here's how we handle pluralization on our backend (all abstracted away from the user): For a language with 2 plural forms (e.g. French), we create 2 different phrases, one singular and one plural. For languages like Arabic, we create 6 different phrases, one for each plural form. We then send each plural form of the phrase to a human translator, who translates each phrase independently.
This ensures proper plural forms for all variations, without the need for complicated grammar syntax.
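In rough sketch form (Python; the names and the per-language plural-form counts here are illustrative, not the actual Localize.js internals), that expansion might look like:

    # one copy of the phrase per plural form; each goes to a human translator
    PLURAL_FORM_COUNTS = {"fr": 2, "en": 2, "ar": 6}  # roughly per CLDR

    def phrase_variants(source_phrase, lang):
        return [
            {"lang": lang, "plural_form": i, "source": source_phrase}
            for i in range(PLURAL_FORM_COUNTS[lang])
        ]

    for job in phrase_variants('I have <var pluralize="3">3</var> dogs!', "ar"):
        print(job)  # six jobs, one per Arabic plural form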
I've long found that "externalized" translations in po files (or any equivalent) are more trouble than they're worth, for exactly this reason. Translations need to be functions, so they need to be written in a format that's good for writing functions - i.e. a programming language. What we want is a MessageSource interface, and a bunch of language-specific implementations.
Fortunately I work in Scala, so it's very easy to have an "embedded DSL" that's ordinary, first-class code but not much harder for non-technical translators to read or write than the .po format; we can write helpers for grammatical case or numbers or similar. But having the full power of a programming language there means that when you hit a case you haven't thought of (and you will), you can fall back to just an if/else.
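In spirit, the idea is something like the following (sketched in Python rather than Scala for consistency with the other examples here; all names are invented):

    class MessageSource:
        def files_found(self, nfiles: int, ndirs: int) -> str:
            raise NotImplementedError

    class EnglishMessages(MessageSource):
        def files_found(self, nfiles, ndirs):
            if nfiles == 0:
                return "Your query didn't match any files"
            files = "1 file" if nfiles == 1 else "%d files" % nfiles
            dirs = "1 directory" if ndirs == 1 else "%d directories" % ndirs
            return "Your query matched %s in %s" % (files, dirs)

    # A PolishMessages implementation is free to use arbitrary if/else
    # logic for the cases the .po format can't express.
    print(EnglishMessages().files_found(2, 1))  # Your query matched 2 files in 1 directory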
I don't agree with the article. The author goes about manually implementing localizations, eventually throwing out GNU gettext. But it DOES have excellent plural support, and a header in your PO file allows Chinese to use "nplurals=1; plural=0;", for example:
http://localization-guide.readthedocs.org/en/latest/l10n/plu...
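For comparison, here are the stock Plural-Forms headers for Chinese and Polish (these are the standard expressions from the gettext plural tables; Polish gets the usual three-form rule):

    "Plural-Forms: nplurals=1; plural=0;\n"                                                          (Chinese)
    "Plural-Forms: nplurals=3; plural=(n==1 ? 0 : n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2);\n"  (Polish)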
That might be ok for technical people who are used to this kind of "machine speak", but for end users it's often not acceptable. Consider the difference between:
You write tr("I scanned %1 directory.", "", count) and it takes care of applying the correct translation with the right plurals depending on the number.
How does that deal with the case where it should translate to "I didn't scan any directories"? According to the documentation, "In the translated version the variables must still appear." http://doc.qt.digia.com/4.2/linguist-translators.html
We version (localize) projects all the time, so I thought reasonable logic for checking whether a string is empty was to test whether its trimmed length was greater than two, allowing for the two stray characters that show up all the time in content. For example, someone who's not sure how to translate something would type "??" in a field. So, {if $slide.title|length>2} ... This worked for 4-5 languages.
Then our Chinese (Mandarin) division called and asked me to look into buggy behavior with their translations. Turns out a whole sentence got translated to ... two characters, and wasn't showing up.
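A safer check treats only known placeholder junk as untranslated, instead of relying on length (a sketch in Python; the "??" convention is the one from the workflow above):

    def is_translated(s):
        s = s.strip()
        # a complete Chinese sentence can be two characters long,
        # so never use length as a proxy for "still untranslated"
        return bool(s) and s not in {"?", "??"}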
I once suggested an idea: maybe instead of strings in tables one could use something better suited to the task at hand, like, say, code? Maybe let the tables store not only strings but functions as well, so that you could handle the more complex cases directly?
I remember being hit over the head with the gettext manual and told something about translators not knowing how to code.
Heck, I still think it's a neater idea than gettext.
I wonder if Lojban[1] could serve as an unambiguous, largely-context-independent way to store representations of the central concepts, which can then be translated into natural human language.
I saw most of this coming. I have studied a little Ancient Greek, which has the same problem as Arabic, and Polish, which is similar to Russian, and now I am living in Italy.
I guess it is one of those things that we as programmers just forget about too much, expecting translation to be a mechanical process in the last stage of the development cycle.
I wonder if it isn't better to generate the messages as an AST, and have a language generator (the back-end of a compiler, really) that generates strings for each language. I'm sure there would be fewer edge cases that way.
I'm sure someone smart will tell me what assumption I made that's invalid in some language I do not know, but this is already better, simpler, and more correct than a lot of the text-mapping-based internationalizations I've seen.
And I will be able to correct my S-expr to get rid of more assumptions, perhaps just by making it longer, but I won't need to write all the convoluted code prevalent in the article.
Also, with AST-based representation you can add whatever context you need. For example, were I to have used "pig" instead of "cat", this would have already worked fine, since we have (OANIMAL, "pig") and not (OPERSON, "pig"). This is trivial, but you can add whatever amount of context required.
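A toy illustration of the idea (Python; the node categories follow the commenter's S-expr notation, while the renderer is invented):

    from dataclasses import dataclass

    @dataclass
    class Noun:
        category: str  # "OPERSON", "OANIMAL", ...
        lemma: str
        count: int = 1

    def render_en(noun):
        # English back-end: naive pluralization; other back-ends could
        # consult the category and count for gender/case decisions
        word = noun.lemma if noun.count == 1 else noun.lemma + "s"
        return "%d %s" % (noun.count, word)

    print(render_en(Noun("OANIMAL", "pig", 3)))  # 3 pigs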
Now I'm just wondering about the lexicon, since the primitive tokens are things like "like". One potential problem is when two languages don't have words with enough semantic overlap to be comfortable using one as a translation for another. Another potential problem is when...
An analogy to my other comment on AST translation: in English you "like doing something" but in German you "do something gladly" (and again in English you "like a person" but in German you can "have a person dear", akin to English "hold dear"). If we expect that the AST can produce a translation using the single verb "like", we may be in for trouble if the target language doesn't do that (although maybe code can be written that uses the AST and that's aware of this complexity as part of the realization of the translation).
Another example could come from the problem of describing states of being or perception, like "I'm cold", "I'm tired", "I'm hungry", "I'm thirsty", "I'm sick", etc. In English we really like using "to be" plus adjectives for such situations, but other languages have other preferred strategies. For example Latin has specialized verbs for the actions of (at least) being hungry, thirsty, or sick (like esurio, sitio, aegroto); in Romance languages people often "have" hunger or thirst (Spanish "tengo hambre", lit. 'I have hunger'; Portuguese "estou com fome", lit. 'I am with hunger'); in German it is cold "to" a person ("mir ist kalt", not "ich bin kalt" 'I am a cold person').
If you imagine having your AST start from Latin, you may have a challenging story about how you could get from "sitisne?" to "are you thirsty?", "¿tienes sed?", "está com sede?" and "Hast du Durst?" -- not to deny that it may be achievable with enough work.
- While it may be more machine-readable, it's absolutely not human-readable
- Something like (OPRONOUN (3, 0)) assumes that pronouns work like in English. Want to represent text in a language for which it is not the case with this system? No dice.
In short, I'd rather have an improved gettext (without hacks to detect if you're gettext'ing printf'd text...) than your proposed AST solution.
This has been done: it's called ICU MessageFormat. Instead of being a general-purpose language, though, the markup is tailored specifically for localization. One project that supports ICU MessageFormat is http://l10ns.org
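For reference, an ICU MessageFormat pattern for the running files/directories example would look roughly like this (standard ICU plural syntax, not specific to l10ns):

    Your query matched {files, plural, =0 {no files} one {# file} other {# files}}
    in {dirs, plural, one {# directory} other {# directories}}.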
First, I am talking about generating text. In software. Translators would really only need to check the text for correctness, and when it isn't correct, they should file bug reports so that the programmers implementing the language back-end can fix them.
Second, an AST is nothing but the result of applying a grammar to some input, something translators are very familiar with (probably more so than programmers). It is utterly trivial to teach a translator a new textual representation of what they already know.
It's a tree of nodes connected by arrows. They learned the parsing stage in school. At least here in my country, everybody is required to decompose a phrase into its underlying grammatical constituents in the eighth grade. You just never draw it as a tree, but it's trivial to learn to draw it as a tree and read it from a tree.
This is by far my favorite bit of documentation. I have to go look it up every time a manager or client starts asking for localization to justify my high estimates on how long it'll take.
So the solution is to design and implement your interface in a Slavic language (presumably the most complex we've found so far) and translate down to other languages with less demanding rules?
> The %g slots are in an order reverse to what they are in English. You wonder how you'll get gettext to handle that.
I learned C/C++ after C# and this is one thing that really got to me. String formatting in C++ is extremely primitive, which would be forgivable if more recent iterations of the stdlib had something that wasn't so completely incompetent.
For those who don't use .Net, the Italian translation would have been: "In {1:g} directories contains {0:g} files match your query." It doesn't solve all the problems, but being able to specify indices in your template string does solve many.
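The same trick exists elsewhere: POSIX printf allows "%2$s"-style positional arguments, and Python's str.format takes indices too, for example:

    template_it = "In {1:g} directories: {0:g} files match your query."
    print(template_it.format(42, 7))  # In 7 directories: 42 files match your query.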
Makes me wonder why anyone would ever design a translation system that doesn't have the ability to reorder placeholders from the very first version. Perhaps if the author doesn't know anything about languages and their differences - but in that case they probably shouldn't write such a library ...
Until you need to display negative numbers¹. Or native numerals². Or need to know what the numbers even mean. :-)
__________
¹ I tend to set my minus sign to U+2212 to catch errors in code where we just use ToString() instead of ToString(CultureInfo.InvariantCulture). Almost as much fun as putting a Unicode character into your user name that isn't representable in the current legacy codepage on Windows.
² ۱۲ ۱۰ ۴ probably won't work as input to the application trying to parse that line ;)