When I was in Japan I did proofreading for a Japanese feature phone. A major Japanese brand, actually. That was really comical.
There was an Australian guy for English, a German guy, an Italian lady, and me for French. What they did prior to the meeting was:
* have Japanese people with a poor level of English (maybe the software engineers, actually) translate from Japanese to English
* have translators who got only the strings, and absolutely no context, translate from that weird English to the other languages.
In the meeting we had all the strings, and one person from the manufacturer who had access to the "super-confidential" unreleased device.
More than half of the translations were off because of lack of context. The French guy actually translated "Garbage day" to something like "Shitty day", apparently he thought that was a way to mark in your calendar that you had a really bad day.
Pretty often we had sentences like "delete one", and invariably one of us had to ask "One what? I need to know if it's masculine/feminine/neuter". Of course they hadn't prepared for that, and it was too late to change the code, so they made us do ugly things like "%n item(s)".
Also, the Australian guy was losing faith in humanity:
- That sentence, it's completely wrong, it just doesn't mean anything in English. People will just go "WTF?" when they read that
- We're not allowed to change the English strings, they're already validated
- .....
I don't know why nobody seems to put information like "warning: this phone's UI in <your local language> is total and utter crap".
Anyway, what you wrote is exactly why I stick to using all software and web services - OS, text editors, Facebook, et al. - in en_US instead of my native pl_PL. Because translations are always crappy - even from the big players. Lack of context is the key here - translated text often feels out of place, because there is usually some overarching idea behind it that isn't communicated to the translators. Then there is the lack of consistency. Words in the original text often have some site-specific meaning, which also tends to get lost in the translation process. For example, on Facebook the word "like" refers to a well-defined thing, not the dictionary meaning, so it's totally not OK to randomly replace it with synonyms during translation [0].
I realized at some point that I often look at a crappy translation, guess what the English original was, and then in my mind translate it to what it should have been in the first place. Because for some strange reason I, the user, have the context, and the paid translation team does not. I guess I'm going to put that into my "Translation issues" file in the "Mysteries of capitalism" drawer, right next to the "how on Earth can multi-million media companies not do a movie translation that isn't total crap" file. I mean, seriously, you're better off looking for pirated subtitles even if you bought the original, because the pirates at least seem to have watched the movie they're translating.
</rant>
[0] - I wish more translators would use the approach Jehovah's Witnesses used when doing their own Bible translation. Since it was designed to be studied and analyzed, they preferred accuracy over aesthetics - therefore one of the translation rules was "as much as possible, let's have any given word in the original text always be represented by the same word in English". Adhering to that single rule would eliminate like half of the "context missing" problems with software translations.
You know what multi-million movie has a translation that isn't total crap? Frozen. They really put resources into that. You can look up random Disney songs on YouTube in different languages, and then look up the Frozen songs, and you can sort of tell that they've done a better job even if you don't speak the language.
I agree. Frozen, and other Pixar/Disney/DreamWorks children's movies (like Shrek), tend to be of awesome quality in all languages. But I attribute this to the fact that those movies are not translated - they're localized, which by definition requires much more work and much closer attention.
The Latin American Spanish localization of DreamWorks' Shrek is a great example.
They brought in Eugenio Derbez, a Mexican comedian, to voice Donkey (voiced in English by Eddie Murphy). Donkey in particular speaks in colloquialisms and pop culture references with wordplay, so Derbez wrote a bunch of new lines and jokes that referenced Latin American colloquialisms and pop culture.
Children learn different fairy tales in different countries, so they also managed to change the identity of some of the characters without changing their appearances (and without altering video at all, just audio).
Exactly. I mentioned Shrek for a reason. It was (and still is) hugely popular in my country (Poland), and one of the reasons for that is the deep localization. They replaced original jokes and pop culture references with local ones.
@daxelrod: Wow, thanks for that. Quite interesting. I always wondered about how that was done and if it was a direct conversion of sorts but I guess it's not. Very interesting.
All of this information comes from the teacher of a Spanish class I took (we watched the Latin American version of Shrek in the class). I wish I had some more tangible sources to cite.
EDIT:
http://www.imdb.com/name/nm0220240/otherworks: "Jeffrey Katzenberg and Dreamworks allowed [Eugenio Derbez] not only to dub Donkey's voice, but to translate and adapt the script of "Shrek" and "Shrek 2" to make it more appealing to Latin America"
I also remember that the Gingerbread Man was one of the characters who was altered, but I don't remember the name of the Latin American replacement.
Pinpon is a puppet
very handsome and made out of cardboard.
He washes his little face
with soap and water.
He untangles his hair
with an ivory comb.
And in spite of the hair pulling
he cries not nor even winces.
Not to mention that the context of each line of dialogue is pretty close to unambiguous, since you have the source material right in front of you for a movie. I imagine it's still a huge job, but it has to be more enjoyable than translating/localizing for us asshole programmers and our magic translation strings.
If, by analogy, the visuals of a movie musical are the "backend" and the audio is the "frontend," then what these localizers do is the equivalent of completely redesigning the entire frontend. Dubbed musicals have an incredible number of constraints in terms of number of syllables, scansion, etc., so the script translators need to be given a tremendous amount of leeway and creative freedom. They're basically lyricists in their own right, in a world where everything's composed melody-first!
In software, this would translate to localization coders being able to (and having the talent to) rewrite the entire frontend logic. And if your software product is going to make multi-millions in new markets by virtue of feeling like it's translated natively, it might be worth retaining native-speaker coder(s) to maintain a branch that parallels (and consistently merges in) your master branch, and rewrites display logic as it comes in. I'd imagine the Googles of the world do exactly this.
Disney generally spends a lot of effort on translations.
Idina Menzel, who voiced Elsa, also played the lead in the Broadway musical Wicked. Several translations of Frozen use an actress for Elsa who played the lead in a localized version of Wicked.
For Big Hero 6 we also translated and replaced all of the Japanese text in San Fransokyo with Chinese and Korean for the Chinese and Korean markets. By "text" I mean all of the CG signs, posters, and environmental set dressing in the actual movie (not just the dialogue).
We literally had to re-render the entire movie for each translation. Disney Animation takes these translations very seriously :-)
To be honest, I thought the Dutch translation was relatively awkward (I have to admit I've only heard the Dutch "Let It Go" version, not the rest of the movie). I thought the Flemish version was much nicer, despite some Flemish phrasing sounding off as Dutch...
(For those wondering: Flemish is the Belgian variant of Dutch.) And I agree... I also prefer the Flemish dubbed voices. For example, Timon & Pumbaa in The Lion King are Flemish, to great effect.
It makes sense if your target demographic, or at least a large part of it, cannot read English at the level necessary to use your tool.
However, for a lot of modern tools that isn't the case. Often a translated tool is an order of magnitude less usable because of broken translations and inability to Google things.
It's also infuriating that lots of tools look at your Windows region setting when deciding your language. No, I don't want your broken native translation on my English Windows installation just because I like to still have € in front of my currency.
That's if you're lucky! Google does far worse, and picks a language based on their often-broken geo-IP system. It's not even fixable. When viewing alt text for Google doodles, for instance, I get "localized" text even if the rest of the UI is in English. Google Play also has jacked-up section titles from time to time.
Netflix's search box is the same way. This is in addition to the discriminatory practice of often not providing subtitles in the language of the audio track.
Chrome would also install in the geo-IP language, regardless of system settings. How arrogant is that? They deliberately disregard your OS language settings and pick another for you. And it'd force the default Google search to go to the localized version until it detected a new location. Which, in Denver, has had me appear in France and then Hungary, as their database somehow accumulates errors. And again, there's no way to fully opt out - even selecting English and .com would still show country-specific logos and such on e.g. YouTube.
Xbox is also a mess. You can buy games that are country restricted, with zero warning. When downloading, the Xbox provides no indication of a problem until it finishes. Then it does a geo check and reports "download corrupted". Xbox support wanted to RMA the unit, as they were convinced this was a hardware issue and had no KB info on country restrictions. Using a VPN fixed it.
Basically it seems that many developers are brain dead or simply do not care about travelers, expats, or anyone with different language prefs.
Don't forget monolingual managers that just don't care though. I've done so many last-minute string translations... "Oh shit, we forgot French, this was supposed to launch last week, you have 1 hour." And then insert the types of cases from the original article here where you need grammatical logic in a fixed string. Ugh.
I have tended to use the localized versions of OSes and software, not because I prefer them but because I often ended up supporting end users with localized versions.
> Adhering to that single rule would eliminate like half of the "context missing" problems with software translations.
Sort of. Almost all the issues I have with translators and context come from the same word being used for different things. This is particularly true for words like "date" and "time", which can have different translations depending on context (is it the time of day, or the time the test has been running?)
So as well as using the same word for the same meaning throughout, using different words for similar-yet-subtly-distinct meanings is also required.
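gettext actually has a mechanism for exactly this: message contexts. A minimal sketch (my own illustration, assuming the gettext.h convenience header that defines pgettext):

#include "gettext.h"  /* gnulib convenience header defining pgettext */
#include <cstdio>

/* The same English word gets a distinct msgctxt per meaning, so the
   translator sees two separate catalog entries and can render each
   one differently. */
void show_labels()
{
    std::puts(pgettext("time of day", "Time"));
    std::puts(pgettext("elapsed duration of the test", "Time"));
}

With that, "Time" can become "Uhrzeit" in one place and "Dauer" in the other without the two strings colliding in the catalog.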
I'm pretty sure the general opinion among everyone other than Jehovah's Witnesses is that their Bible translation is not good, and that the same-word rule is one of the reasons why.
And I really don't see how having a similar rule for translations of text in software would eliminate most "context missing" problems. It seems what it would actually do is stop translators even guessing those missing contexts.
> And I really don't see how having a similar rule for translations of text in software would eliminate most "context missing" problems. It seems what it would actually do is stop translators even guessing those missing contexts.
Well, for one, translations would at least be consistent. In software, words often have a specific meaning related to the application itself. You're not free to translate the word "like" on Facebook however you like, because it has its own specific meaning that is different from the dictionary one. The same applies to things like tools in Photoshop, etc. In general, wrong use of synonyms for things that have an application-specific meaning is one of the most common problems with translations I see (and the same happens in official movie translations).
When you don't have (or can't be bothered to get) context, this is the least you can do to play it safe.
I have a tool that does this. I produced machine-generated translations into German and French as an expedient. The German messages were corrected by a native speaker, and I cleaned up the French as best I could. When French is in use, an extra message is printed appealing for someone to edit the weak translations.
I dealt with word order issues by avoiding formatted strings with more than one replacement field. Only one string needs to deal with singular vs plural quantities.
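A minimal sketch of that pattern (my illustration, not the actual tool), using gettext's ngettext so each string carries exactly one replacement field:

#include <libintl.h>
#include <clocale>
#include <cstdio>

int main()
{
    std::setlocale(LC_ALL, "");
    /* "myapp" and the locale directory are placeholders. */
    bindtextdomain("myapp", "/usr/share/locale");
    textdomain("myapp");

    int n = 3;
    /* ngettext picks whichever plural form the active language's
       catalog defines; the string has a single %d, so translators
       can move it around freely. */
    std::printf(ngettext("Scanned %d directory.\n",
                         "Scanned %d directories.\n", n), n);
}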
Lack of context was a definite problem with the machine translations. It would inconsistently choose translated words that should have been the same because the short string snippets could not be evaluated for their meaning within the narrow domain of the program using them.
Same here. Technical text translation is normally a disaster. I truly hate web pages that switch languages based on IP location, as MSDN sometimes does. In my language the translators always get wrong the keywords that should not be translated, especially when they're adjectives, e.g.:
* "You should use the 'new' keyword..." Often get translated as something like "You should use the word that is not old..." / "Debe usar la nueva palabra..."
To be fair this is because the code is wrong. It should use:
printf(_("You should use the '%s' keyword"), "new");
indicating that the main string should be translated (_ == gettext) and the keyword itself should not. This also happens to generate slightly smaller binaries and fewer translations in the case where you have several keywords.
@eloisant: I work in automotive, where I deal with translations for instrument clusters for one of the largest auto makers. Automotive companies are moving to what are called reconfigurables, which is basically an instrument cluster with no mechanical gauges; just a screen with gauges rendered by a 3D engine. Center stacks too.
I kid you not, the way we translate is to use Google Translate as a first pass, and then the screens get reviewed by people who know the language. I guess some things slip through, and we evidently pissed off a lot of Chinese folks because of a similar flub. The folks doing the first pass don't know any other languages.
It's quite comical but also a real pain. One of the things I had to work into our code was left-to-right vs. right-to-left text; you would think it would just be a C-style string, but we have to know the direction for text justification semantics.
I don't actually do translations but work on the HMI where the text is displayed. Another pain point is that we have a certain space where text needs to be displayed and everything is fitted using English but after translating to other languages, some strings are much longer than the allotted space.
I can totally imagine that. Recently I've been helping a guy with tweaking his integrated car navigation/radio/media player, and I spent some time browsing through its firmware and file system. I saw the translation strings and man, they were horrible. Also, half of the stuff wasn't even translated (though it didn't show anywhere in the UI).
BTW, half of the files and directories were named in Chinese, which made my work very "fun", but that's another story...
> Another pain point is that we have a certain space where text needs to be displayed and everything is fitted using English but after translating to other languages, some strings are much longer than the allotted space.
As a rule of thumb, if you're using English strings to design your UI, you should account for 30% more space so other languages can fit in. Of course, sometimes this might not be enough.
I worked for a place that used German as the "placeholder text", to avoid the problem of text longer than the allotted space. German is supposedly about 1.5 times as long as English, so if you can fit the German version, you won't have any problems with other translations being longer than the space.
That's the other way around: the guy who did the translation without context wrote "jour de merde" ("shitty day"), and I saved them from releasing that in their phone's calendar app during my one-day job paid in cash.
As far as I can tell, the best localisation tool that almost nobody is using is http://www.grammaticalframework.org/. Licensing is a mix of GPL, BSD and MIT pieces.
It's a high-level functional programming language with a dependent type system, specialised for operating on language ASTs. Its resource library, to quote, "covers the morphology and basic syntax of currently 29 languages: Afrikaans, Bulgarian, Catalan, Chinese, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hindi, Japanese, Italian, Latvian, Maltese, Nepali, Norwegian bokmål, Persian, Polish, Punjabi, Romanian, Russian, Sindhi, Spanish, Swedish, Thai, Urdu."
In essence, once it has the language-independent AST, it can produce output in all its supported languages with the correct tenses, genders, inflections, etc.
It also seems to have tools for assisted parsing, so you could have an English document and interactively parse it into the correct AST. In addition, the text can be parameterised semantically, so if you changed the gender of a person, that could propagate to all the right places and update the translations as required.
While it seems the upfront cost may be quite high in having to learn such a complex system, I think the benefits of having reproducible, high-quality outputs into n languages for free could make this highly advantageous in many applications.
I'm very skeptical that this would work outside of toy examples, though it depends on what is meant by language-independent AST.
For example, the best way to translate the Spanish "X dió un golpe a Y" would be "X hit Y". But my naive idea of what the AST for the Spanish sentence would look like would be something like `(GIVE (X HIT Y))`, which when naively transduced to English would be "X gave a hit to Y", which is either unidiomatic or means the wrong thing altogether. In order to avoid this problem, the AST would have to be a more abstract representation of the semantics. And coming up with a sufficiently expressive, tractable, and neutral representation of natural-language semantics is an unsolved problem that people are still devoting their whole careers to.
I was briefly involved in a very early stage startup that was considering using systems like this for better machine translation. We ran into problems like the above, and also: ambiguity, and the fact that the hand-written grammars and semantic representation systems were just very brittle and incomplete.
> which when naively transduced to English would be "X gave a hit to Y", which is either unidiomatic or means the wrong thing altogether
What's interesting is that there are dialects of English (Hiberno-English, spoken in Ireland) where "X gave Y a hit" would be a way to say "X hit Y". :)
That sounds like a great technical approach, although it can't necessarily remove the problem of idiomaticity that the article mentions (with the example of "I didn't search any directories"). Probably better examples are possible, in any case where the most idiomatic way to express something isn't a literal translation of that thing from other languages. Maybe like:
English "I don't care"
Portuguese "tanto faz" (literally 'so much does')
German "[das] ist mir egal" (literally '[it] is equal for me')
Side note: as you might expect, Wikipedia's internationalization is the only system that attempts to do quantities and other formatting correctly for every goddamn language on the planet, but is considerably easier for translators to work with than the OP's examples (sorry, Sean ;)
I did some work on bringing it to JavaScript and making it HTML-aware, and since then Santhosh Thottingal has vastly extended it and it's become pervasive at Wikipedia. More projects should use it, or at least learn from it.
Ironically, this is exactly the case described in the article, and even the article got it slightly wrong. The rules in pseudocode - for this specific sentence! - are as follows:
if ((n % 10 == 1) && (n % 100 != 11)) {
    // singular nominative (not accusative, as the article says): 1 котенок, 101 котенок, 301 котенок
} else if ((n % 10 >= 2) && (n % 10 <= 4) && (n % 100 < 12 || n % 100 > 14)) {
    // singular genitive: 2 котенка, 43 котенка, 1024 котенка
} else {
    // all other cases, plural genitive: 5 котят, 11 котят, 212 котят
}
But this is true only for this sentence because kittens here are the subject, not the object: with Russian verb "есть" (to be, to exist) the literal translation would be "Kitten/Kittens exists/exist belonging to Harry".
Once the declension of the numbered subject(s) turns to accusative: "Гарри гладит 1 котенка" ("Harry pets one kitten"), the rules become much simpler:
I've been looking for something like this for weeks, with no luck till now; I guess I should have searched for internationalization instead of localization in this specific case on Google and GitHub. This is a great starting point, thanks a bunch for sharing. Perfect for me.
Slightly OT, but my favourite localization error was in Ubuntu, when they had that nice netbook interface (that would later become Unity). The network icon label was "Rojo" in the Spanish localization, which is the word for the colour red. What?
Well, if you translate "Net" to Spanish you get "Red"; and if you translate that again (by mistake), you get "Rojo". There you are :)
I once got a Spanish translation back and was briefly angry that the translator had left a few strings as "TODO"...
"What?!" I asked myself. "To do? Why did they ship me an incomplete file?"
Then I looked to see what the "untranslated" strings were: "ALL". (I don't speak Spanish, but I have enough of a smattering of European languages that it was immediately obvious what was going on: "todo" is Spanish for "all".)
When I worked on Outlook.com, we got feedback from a British user that he couldn't understand what the option to "Connect devices and apps with DAD" meant. Turns out we had forgotten to prevent localization of the "POP" protocol everywhere it was used.
I'm really surprised that this is the first time you've seen "OT" being used. I can only presume it's because you don't normally read English websites/discussion boards, but I still find it intriguing that you've never encountered it before.
In Sweden we use "OT" as well but it's referring to "Off Topic" and not something Swedish.
I read mostly English websites/discussion boards (reddit, HN) and I know a bunch of abbreviations (IIRC, INB4, AFAIK, QED, IFF, ST...), but I don't remember seeing OT. Learning new stuff every day :D
An amateur error perhaps, but hardly limited to free software.
In a particularly expensive piece of enterprise software for the finance sector that was translated into German, a form field was labelled 'Hauptstadt', which translates back to 'capital (city)', as in Washington, D.C.
Now, this kind of geographic information doesn't make sense at all in financial software, but the well-paid developer (who had to double as the software's translator, because hiring a proper translator would of course have been too expensive ...) simply couldn't be bothered to look up the different words 'capital' translates to in German, one of which is also the right choice in this case: 'Kapital' (meaning 'capital' as in 'assets').
That's a head-smacker. Some guy somewhere must have actually painted that on the trailer - does that guy not think "Hey, this doesn't really make sense. Are you sure they didn't want somebody to translate this to Arabic?"
Or maybe he did and his boss told him to shut up and do as he's told. Or he thought - "Screw this, I wanna go home on time. I'm not gonna bother to find an Arabic translator now. I'll just paint exactly what it says, and they can't get me in much trouble for that."
There's a certain schadenfreude to be had where the failed translation is also typo'd - but only once. These sorts of things are especially embarrassing, as it not only shows no effort was spent proofing, but that nothing tripped a sanity check at all.
> A professional translator makes sure to check the context of the translation, it doesn't go blindly translating sentences and words without context.
From my experience with both software and movies, that's exactly what a professional, paid translator does (or is forced to).
Amateurs at least watch the movie/run the software before translating things, which is something I definitely cannot say about "professionals" translating movies and software to my language.
From my experience with professional translators I would have to go out of my way to specify context if I wanted a good translation, because professional translators are too technologically illiterate to get stuff right.
Not to mention the difficulty in getting them to use Virtaal or PoEdit properly.
The problem you describe is one of incompetence. The professionals are not really professionals. And yet, someone is paying them as if they were. That is not the case everywhere, I assure you.
A little anecdata: In Portugal, where foreign stuff is translated and subtitled, the quality of translations/subtitles for English-language movies and series is very good. Or at least it used to be; I can't be sure anymore, since I don't currently watch TV. I grew up used to reading good translations. Those translations taught me to understand and speak the English language when I was growing up, since age 5. I understand I might be a bit biased because of my personal experience.
My personal experience comes from Poland, where quality of movie translations has been basically a running joke throughout the years among the population that knows even a little bit of English. I do know personally a few professional translators - they're really smart and competent people, and most of them are even quite tech-savvy. Sadly, they don't get to translate movies.
Ah, that's why Word 2007 ended up with »Keine Gliederung« (»no (document) outline«) for the setting where you can select a shape outline colour ...
In all seriousness, translators make mistakes by getting the context wrong all the time; this applies to both professional translators and amateurs, perhaps with different severity. But in my eyes that's not really a problem of the translators, but rather of the tools we use. A long list of strings with maybe only a vague context is a horrible UX for the translator. Who could fault them for translating »outline« as »Gliederung« in a word-processing application? A string table is probably the most convenient format for programmers, but not very much so for those maintaining translations.
Qt gets a bit of this right. If the translator has access to the source code and uses Qt Linguist, at least the strings that are directly in .ui files are shown with a mock-up of the respective window where the control with that text is highlighted [0]. That already helps a lot with context errors. Of course, it does nothing for text in the source code, and so our translator went ahead and translated »Breite« (line width) with »latitude« because the application in question was, after all, related to maps, geography and GPS; it just happened to have a setting to change the width of certain lines, which, in German, then mapped to the same word.
Qt also has a nice way of handling plurals in that both Linguist and the QTranslator class are aware of the languages and the special rules concerning plural forms in them. You create a translatable string like »Searched %n directories« and then create translations for that. English, German and lots of others just map to two forms (1, rest). Russian gets three, I think (1/21/31/..., 2/22/32/..., 11/12/rest), and so on. Downside is that you only get to handle a single plural in a string [1], and you have to create an en→en translation as well (to account for multiple forms in the source language). But generally it's a quite nice implementation. Gettext has something similar where you can write your own plural form matching rules somehow, but most translators I met don't really want to write math.
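For reference, the plural-aware call side looks roughly like this (my sketch, not any particular project's code):

#include <QObject>
#include <QString>

/* Qt substitutes %n and picks the plural form that the loaded .ts
   translation defines for the active language; a Russian translator
   gets three fields in Linguist for this one source string. */
QString searchedMessage(int dirCount)
{
    return QObject::tr("Searched %n directory(s)", nullptr, dirCount);
}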
Visual Studio with Windows Forms has a mode of creating the string tables for translation where you can just change the language in the Form's properties and then proceed to change each control text. This nicely solves the context problem in that the translator edits the window directly and it looks like it normally does. But it also creates a whole bunch of other problems: Translators need VS, they need to edit the project directly and can accidentally mess up the UI with a mouse twitch when selecting a button to translate. They also might miss things that are buried in menus since there is no real measurement of completion and what's still missing. I've seen various projects, especially in web environments adopt a similar custom-written approach, though, where you can edit the UI directly in the application to translate it. Still with the problem that translators might miss hard-to-find and buried strings (one might argue that you should get rid of buried and obscure places in your UI anyway, though).
Long ramblings ... it was a topic I considered for my Diploma thesis while studying. But I couldn't think of a good way that could retain context for the translator in a general case, or at least most of the time. I thought about replacing each and every translatable string in a program with a custom identifier and then later trying to find those again via UI Automation, or maybe screenshots and OCR, to be able to map strings to parts of screenshots of the program. It would have required running the program once with those custom identifiers and once in normal mode, and somehow matching up the identifier screenshots, the normal screenshots and the string tables. And still with the problem that you'd need to manually go down each and every dialog and menu, including context menus, messages that only appear in certain states, etc.
Perhaps there just is no good solution, except maybe for developers to properly annotate each translatable string they use. In the »Breite«/»width«/»latitude« case I went through the source and added translator comments detailing the meaning of the word for every instance, but with large applications having thousands of translatable strings that could become unwieldy quickly.
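For the record, gettext supports this workflow reasonably well: xgettext --add-comments=TRANSLATORS copies specially marked source comments into the .po entry right next to the string. A sketch, reusing the _ == gettext convention from above:

/* TRANSLATORS: "Width" is the stroke width of a line drawn on the
   map, not geographic latitude. */
printf(_("Width: %d"), line_width);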
I like the Visual Studio approach. I usually resort to running the software I'm translating to get context. Sometimes it's very hard to get some of the text to show up.
I can only think of one way to give translators context: Comments, the exact same way you give other programmers context for your code.
> But I couldn't think of a good way that could retain context for the translator in a general case, or at least most of the time.
How about assigning identifiers to all strings, then adding tooltips for those identifiers in the application? So the translator can hover over a menu item to determine which string they are supposed to translate.
This requires the translator to exercise the whole application, which is, of course, also rather difficult.
The goal in my mind was to try generating context information for translators as automatically as possible while retaining the usual workflow developers would use in a given framework for localisable resources.
But yes, the requirement to cover every control, menu and dialog as well as every possible code path that uses a localisable string from the source code makes the whole endeavour very impractical to solve. With dialogs built in markup, e.g. Qt's .ui or XAML it's easy enough to give context, but the hard part is strings in code where you never know where they'll end up.
To be fair, it was a beta of the "Ubuntu Netbook Remix", and that bug was fixed shortly after.
I took some screenshots because I really liked the interface, but none of them show the bug, because I usually stick to en_GB and the problem was spotted on my wife's netbook (she's a Spanish teacher and likes to have the OS interface in Spanish to amuse the kids).
Am I the only one who gets confused about what the menus and other things on a non-English computer or device actually mean when translated?
This coming from someone who doesn't speak English as their first language.
I speak Romanian, Hungarian and English perfectly but my devices are always in English since it would be harder to figure out what they mean otherwise unless you have the structure memorized already.
You are definitely not alone in that. My native language is Arabic, but if I switch my phone or computer to Arabic I wouldn't be able to use it. I dread testing translations of apps I'm working on because it forces me to switch to Arabic. It doesn't help that there's a lot of jargon that just doesn't translate that well. For example, the translation of "Tap" to Arabic is the equivalent of either "Peck" or "Perforate". It's just that that's what's generally agreed on as the proper translation and everybody uses it, however, if I didn't know that it would take me a while to figure it out for myself.
All things being equal, I guess that's because you accept paying the necessary price of learning the English interfaces of your devices, but don't consider it reasonable to spend any learning effort on other languages.
> but don't consider it reasonable to spend any learning effort on other languages
One problem is that English is the language the majority of technology is created in, and it has developed agreed-on terms for a lot of things. Everyone calls their tabs "tabs" in English, but when it comes to translating, every other app uses a different word. Even if names are consistent within an app, they are inconsistent between apps. This leads to a lot of problems if, say, an app reports an error that's related to some OS thing and calls it by a different translated name than the OS translation does.
EDIT: And I'm not making this up. I worked on a localized Ubuntu for a very short while before switching back to English because messages I got in my command line were painfully inconsistent with each other and with agreed-upon proper translations in my language.
If it's intended to be a romanization of ね, then yes, "ne" is more common. Alternatively it could be a reference to Ender's Game, where "neh" is used in the same way.
Could also be from German, although this would also be written ne instead of neh.
People from the northern part of Germany use this in pretty much the same way it is used in Japanese (at least according to what I know with my limited knowledge of Japanese)
I always thought of this as a strange quirk that the same language construct can evolve in two unrelated languages.
It's just like parallel evolution in biology.
There's also Brazilian Portuguese "né", which is an end-of-sentence tag with exactly the same meaning. It's a contraction of "não é" ('isn't it') and is used in a way akin to German "nicht wahr".
Also "isso" which in German is colloquial for "Ist so" and means "That's it" or "Exactly". When in Brazil I always found it funny that they use "isso", short for "isso mesmo",
in much the same way.
It's funny that "gel", which you hear pretty often in Austria (and probably South Germany), means the same as well.
It's approximately the same as "nicht wahr?", or "oder?", which can't really be translated, but suggest wanting some kind of verification for what was said.
Since I'm German, this is probably at least the reason why I found it so easy to accept this construct, when I stumbled upon it in Ender's Game. It just felt natural, so my brain decided to use it in English, too. :)
1. L::M::L::G enables a PO-based workflow. This is not correct except for trivial lexicons. L::M::L::G cannot parse the plural extract from the article, so it is incompatible with properly working PO emitters.
2. Maketext is used on typepad and "works really well". This claim is not backed by any evidence. The verifiable facts from the article pointing out the pluralisation problems of L::M still stand unchallenged.
In order to avoid these pitfalls, I usually get out of "sentence" mode into "label" mode. For instance: "Directories scanned: 12". Probably not well suited to all cases, but usually good enough for mine, though I only have to support pt-BR, es-ES and en-US, so maybe that's not saying much.
Exactly. All this work, or just restructure the message. It should be acceptable in most languages, because charts and spreadsheets aren't going to have per-cell labels. And it has the benefit of being easier to read and parse.
Also, it's really terrible style to use the first person in an app unless it's actually sentient. Otherwise it's annoyingly like Clippy, or just plain obnoxious and presumptuous.
> Also, it's really terrible style to use the first person in an app unless it's actually sentient. Otherwise it's annoyingly like Clippy, or just plain obnoxious and presumptuous.
I agree, though I found another nice use case, well demonstrated by Bret Victor[0][1]. I played around with it for a while, and I find that describing what will happen in a normal sentence, parts of which you can tweak, is a pretty good way of doing options pages.
Kudos to the author of the article for his perseverance in decorating messages to fit grammar. I'd go another way and just use a more formal, dry format:
Number of scanned directories: %g
Number of found files: %g
That solves the problem with Slavic languages, at least. The Italian aversion to 0 may be mitigated by printing 'none', I guess. Please correct me if this form does not fit other languages.
Just scanned the comments to see if anyone would suggest that :) Being the lazy type, that's the first thought I had reading the article. Perhaps it's not appropriate for all target audiences and all target languages but many times you can find a much easier solution by going about it in a totally different way.
Funny how Slovene seems to tick all the complication checkboxes :D We have 4 grammatical numbers (singular, dual, a plural for 3 and 4, and a plural for 5 and above), they repeat at mod 100 (so 101 is singular, 102 dual, ...), it's an inflectional language with 3 grammatical genders, sentences should take a different form depending on whether the user is male or female, ...
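For the curious, the plural-index rule that gettext catalogs commonly use for Slovene maps that description to four forms; roughly (my sketch, ignoring the gender issue):

/* Standard gettext plural header for Slovene, as far as I know:
   nplurals=4; plural=(n%100==1 ? 0 : n%100==2 ? 1 :
                       n%100==3 || n%100==4 ? 2 : 3); */
int slovene_plural_index(unsigned n)
{
    switch (n % 100) {
        case 1:         return 0;  /* singular: 1, 101, 201, ... */
        case 2:         return 1;  /* dual: 2, 102, 202, ... */
        case 3: case 4: return 2;  /* 3, 4, 103, 104, ... */
        default:        return 3;  /* 0, 5..100, 105, ... */
    }
}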
> Compared to the grammar of Slavic languages, English is super easy.
And thus is solved the mystery of why all us Slavic speakers have such good English. Especially online. Because nobody had the guts to localize to our languages.
If they did localize to our languages, "Omg, this looks weird and smells off," we all cried, and switched our interfaces to English. Admit it, you did this in high school or even middle school, because the <insert Slavic language> translation just didn't make sense.
And so all of us who are now in a position to localize these interfaces don't because it seems silly and pointless.
We come full circle. Younglings after us use computers in English because no good localization exists.
Yeah. I remember being a kid and HATING Polish translations of English games. Even though I barely knew any English, playing a game in Polish just felt wrong. Nowadays I think it just felt more mysterious and interesting if, apart from being difficult, the game was in a language I couldn't fully understand. But yeah, it contributed massively to me learning English, so it was definitely a good thing. But I also know what you mean - I wouldn't translate my programs into Polish unless I had to, because it just feels wrong to have them in my native tongue.
I think the poor understanding of English was part of the problem. We didn't understand, we just knew that you found what you need when you click "Insert" or whatever. To us "insert" didn't mean insert, it meant "That menu where you find an Image to put in your document".
In our mind the English word and its native translation have semantically different meanings. I discover this problem a lot now that I'm older. I understand the words and their translations, but they still mean semantically different things to me in different languages.
The funniest part is how when I'm in the US I know all the English words for pots and pans and stuff, but when I'm home in Slovenia and my girlfriend is here, it becomes almost impossible to translate. Because in Slovenia the pots and pans have Slovenian names, in the US they have English names. And to me those are completely different.
> In our mind the English word and its native translation have semantically different meanings.
Because they quite often do (I'd wager more often than not).
To use your example, English "Insert" cuts through a different part of concept-space than Polish "Wstaw" (which is what usually ends up as a menu label). It's a fine translation up until you ask to insert a DVD - now you should say "Wsuń" or "Włóż".
And I don't think this is a problem. As I grow older I realize that this is how it should be. Personally, I think that the moment you stop mapping words from the new language to words in your native one is when you actually start to be proficient in the language you're learning. Having mappings like "fork" -> "widelec" -> "<that pointy metal thing you use to eat>" in your head is totally the wrong way to use the language. "fork" and "widelec" are two different labels referring to two different areas of concept-space that happen to intersect somewhere in the area where you think about eating utensils.
That's why I keep my internal monologue (and speech, if people I'm talking with don't mind) switching constantly between English and Polish - there are concepts I can express with one language that I can't express in another, and any attempt at translation feels like lossy compression.
(and then I get tons of hate for occasionally saying "robi sens" instead of "ma sens"; I know what the proper translation of "makes sense" is, but the Polish expression emphasizes 'sense' as a property of things while English shows it as something that can be produced, and sometimes the idea I want to express is closer to the English than Polish version)
The problem arises when your girlfriend (or friend or whatever) visits you in the homeland and you suddenly find yourself struggling to find words because many of the things at home don't have English names. It becomes even worse when you have to act as translator between them and your family who doesn't speak English or doesn't speak it as well.
Yes, I had a lot of fun over the holidays ...
Even something as simple as "Did your mum like me?" becomes difficult to translate because your mum said X and X doesn't quite translate into English with all the connotations preserved.
And don't even get me started at the case where you don't know the noun beforehand and need to synthesize the entire numeral phrase at runtime ("You have %d %s.")
Because, in Polish, numerals inflect for gender and case, come in two main variants ("normal" and "collective", and that's not including ordinals), and for some genders can (but don't necessarily have to) undergo case changes that affect the verb part of the sentence under certain circumstances. Add to this the fact that there's not even a definitive consensus on how many grammatical genders there are in Polish (opinions range from 3 to 9, with some of the theories based on numeral connectivity), and you're all set.
The Polish word "dwa" (two) has at least seventeen distinct grammatical forms, each of which has arcane rules that govern its usage.
Not to mention that I would never realise that "imenik" means directory. An imenik is a phone book. Potentially the Contacts app on my phone. But never a directory.
I think this has to do with vocabulary registers as well. We've learned to use English words for computer things. Using Slovene translations just feels weird unless they're a bastardisation of the English word. Similar to how English uses French words for foods, because cuisine was a thing of the elite, and they didn't eat pig, they ate porc.
This looks quite nice and powerful and indeed an elegant way of solving the problem with multiple plural forms in a string. The only concern I have with that syntax is that it's yet another DSL, or markup language and translators need to know it, or could get it wrong. Granted, a program for helping translators might do automatic linting (much as Qt's Linguist already warns if you omit placeholders from the translated phrase that are there in the original).
Another thing is that the mini-language grows complex enough that the resulting text can be quite hard to read and understand:
{people, plural, offset:1 =0{No one went.} =1{{user1} went.} =2{{user1} and {user2} went.} other{{user1} and # others went.}}
is just a single placeholder (or two), and it takes a while to even parse how it's supposed to work.
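For concreteness, here's how such a message gets formatted through ICU's C++ API (my own illustration, not from any of the tools mentioned here):

#include <unicode/msgfmt.h>
#include <unicode/unistr.h>
#include <iostream>

int main()
{
    UErrorCode status = U_ZERO_ERROR;
    /* One positional argument; "one"/"other" are English's two
       plural categories, and # stands for the number itself. */
    icu::MessageFormat fmt(
        u"{0, plural, one{Searched # directory.} other{Searched # directories.}}",
        icu::Locale::getEnglish(), status);

    icu::Formattable args[] = { (int32_t)5 };
    icu::UnicodeString out;
    icu::FieldPosition pos;
    fmt.format(args, 1, out, pos, status);

    std::string utf8;
    out.toUTF8String(utf8);
    std::cout << utf8 << "\n";  /* "Searched 5 directories." */
}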
Well, if we're going to introduce that level of complexity into a DSL, why not go full Turing-complete and write it in code?
(case (length folks)
  (0 "No one went.")
  (1 ((elt folks 0) " went."))
  (2 ((elt folks 0) " and " (elt folks 1) " went."))
  (otherwise ((elt folks 0) " and " (- (length folks) 1) " others went.")))
You can wrap that in a lambda that concatenates resulting strings and voilà, you have "smart" string tables. And it's not a problem to make it even more DSL-y and translator friendly.
And then you have the exact same problem as if you'd write that logic in your source code. Just with half a dozen layers of abstraction, a more cumbersome way of displaying strings in your application and another programming language on top. I'd say that's not a net positive.
I disagree. That logic has to go somewhere anyway - you can't skip it because it's inherent in the problem of displaying a proper message. So you could at least write it in an expressive language instead of encoding it into what looks almost as readable as regular expressions.
No you don't, in much the same way that you don't need a translator that "knows JSON or XML". Just don't tell them it's Lisp. That's how you do DSLs.
Also, I advocate closer work between translators and developers. Let the translators give the text and explain corner cases to someone who can code up the logic.
BTW, Lisp is only hard for people who acquired this stupid meme that "Lisp is weird/for crazy people". You'd be hard-pressed to find something simpler in terms of syntax and readability.
I recommend teaching the translator about the markup. It's really easy to understand.
The way L10ns tries to mitigate the complexity of the markup language is to provide buttons for pasting the correct markup for translators. Also, a developer translates into his own language first, and this example will always be visible to the translator of another language for easy reference. So the programmer offloads the logic thinking from the translator.
IMO, ICU's MessageFormat markup is also slowly becoming the standard for localizing strings. It is already used widely by big organizations such as Apple, Google and Yahoo.
L10ns pre-compiles all message strings, so the parsing cost is paid ahead of time.
Just curious... are Slovene web services more likely to ask your gender after you sign up so they can get the grammar right or do they just go with 'male' or something?
Russian speaker here; I don't think it would be relevant, since services ought to be addressing you in second person, which I believe is gender-neutral in just about every Slavic language.
The two Turkish letters dotted and dotless i are often confused by users of poorly localised software. Wikipedia links to a murder case allegedly caused by this:
http://en.wikipedia.org/wiki/Dotted_and_dotless_I
A real horror story.
(Less seriously, Unicode has counterintuitive case-changing behaviours with those letters. If you are working outside the Turkish locale and uppercase a dotless I and then lowercase it, it gains a dot. I am curious about this design decision, since it seems like a basic error in operating at the level of glyphs rather than symbols. Or maybe the opposite.)
Upper- and lower-casing can't be assumed to be inverses; there are plenty of other cases where round-tripping will change the text (e.g. precomposed characters that don't have a precomposed upper case). The correct lower-casing of "I" in English is definitely "i"; the correct upper-casing of "ı" in English is maybe a wrong question, because it just isn't an English letter, so I guess you could argue for leaving it unchanged, but converting it to "I" is probably what the person who wrote "ı" would want to happen when it was upper-cased. Maybe?
Yes; the Turkish "I"s under discussion here are the most immediate case, but there are other cases where you have two almost-aliases in one case that aren't present in another case even ignoring composition. E.g. the ohm symbol "Ω" lowercases to a standard omega "ω", but that uppercases to a standard uppercase omega "Ω", because there's a distinct codepoint for "ohm symbol" (even though it's "just" omega, perhaps because some legacy codepages included it as a symbol without including a full greek alphabet) but no corresponding lowercase codepoint.
> E.g. the ohm symbol "Ω" lowercases to a standard omega "ω", but that uppercases to a standard uppercase omega "Ω", because there's a distinct codepoint for "ohm symbol" (even though it's "just" omega, perhaps because some legacy codepages included it as a symbol without including a full greek alphabet) but no corresponding lowercase codepoint.
Except that the NFKD form (which is what I was specifically asking about) for 'OHM SIGN' is 'GREEK CAPITAL LETTER OMEGA'.
As you said and linked elsewhere in the thread, the Unicode consortium takes the viewpoint that there should be one codepoint for each glyph, even if that glyph has multiple semantic meanings in different languages (e.g. "U"). So by that standard they should probably be the same codepoint, but in that case it's hard to argue that the Roman capital I and the Turkish capital dotless I should be different codepoints.
Alternately you could argue that ohm symbol shouldn't lowercase to omega, which, maybe. I think the right view is simply that lower- and upper-casing aren't always well defined, are culturally and contextually dependent, and are probably something you should only ever be doing for display, not for semantic purposes. (If you want to do case-insensitive comparisons of strings, Unicode comes with algorithms for that which do a better job than upper- or lower-casing the strings before comparing)
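If you do need casing for display, libraries expose the locale explicitly. A sketch with ICU (my own illustration, not something from the thread):

#include <unicode/unistr.h>
#include <unicode/locid.h>
#include <iostream>

int main()
{
    /* The same source letter upper-cases differently per locale:
       "i" becomes U+0130 "İ" under Turkish rules, plain "I" otherwise. */
    icu::UnicodeString tr_upper(u"i");
    tr_upper.toUpper(icu::Locale("tr"));

    icu::UnicodeString en_upper(u"i");
    en_upper.toUpper(icu::Locale::getEnglish());

    std::string a, b;
    tr_upper.toUTF8String(a);  /* "İ" */
    en_upper.toUTF8String(b);  /* "I" */
    std::cout << a << " vs " << b << "\n";
}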
With Unicode, why aren't the two Turkish Is just treated as if they have nothing to do with the normal Latin I? The fact that the glyph for uppercase dotless I resembles the glyph for uppercase Latin I should be irrelevant, surely. It's a kind of typographic false friend situation.
Maybe there's a missing level of indirection in Unicode that prevents it from doing this, but I can't see how there could be.
One answer is that Unicode had to import existing documents; I suspect that a lot of documents were written in a Turkish codepage that would have been an 8-bit encoding with the lower half as ASCII, which wouldn't have bothered with a different codepoint for a "Turkish" I. As I said, you can't rely on upper/lower-casing round-tripping correctly in general.
(I was about to give the example of ß, which is usually uppercased to SS. But interestingly Unicode has now adopted a codepoint for the (disputed, and currently lacking a typographic consensus) capital version, ẞ. So maybe a codepoint for "uppercase Turkish I" is on the way. Turkish users will still expect to be able to lowercase "I" to a dotless lowercase i though, since a lot of existing documents will have "I"s in)
I did a bit of research on this and you're right, legacy encodings are one problem. More seriously there seems to be no established way to manage multilingual text which includes homoglyphs (say by using colour coding) so you would really be replacing one problem with another.
It does seem like this Turkish I problem is the most conspicuous situation, maybe unique, where changing locale changes the behaviour of toupper/tolower. Unicode, on the other hand, has many homoglyphs and duplicate characters which all need to be dealt with.
Yeah, Turkish I is a bit of a red herring, due to it being a quirk in Unicode specifically. From a previous comment of mine:
> They [the Unicode consortium] should have specified the Turkish alphabet to use ı and a diacritic to make the dotted one. That would have made (in this case) capitalization locale-independent. [...] I dislike the common usage of Turkish i as an example, because it is such an obviously fixable (if legacy stuff weren't a concern) flaw in Unicode rather than a fundamental issue.
That's helpful. Wikipedia page for "glyph" seems to concur:
"For example, in most languages written in any variety of the Latin alphabet the dot on a lower-case "i" is not a glyph because it does not convey any distinction, and an i in which the dot has been accidentally omitted is still likely to be read as an "i". In Turkish, however, it is a glyph because that language has two distinct versions of the letter "i", with and without a dot."
The Turkish I problem is seriously annoying. We recently tried adding Turkish translations to our website, only to find that .NET suddenly treats DataRow("ID") differently than DataRow("id") when your locale is set to Turkish.
I think if you stab somebody over a text message, there's more going on than just a missing dot. As if it'd be acceptable to kill somebody even if they did call your daughter a prostitute.
The article itself could benefit from corrections: "with a knife on his chest". That doesn't sound so bad.
> Less seriously, Unicode has counterintuitive case-changing behaviours with those letters. If you are working outside the Turkish locale and uppercase a dotless I and then lowercase it, it gains a dot.
AFAIK the only solution would be to error out when uppercasing a dotless i in a non-Turkish locale, which I'm not sure sounds better. Or going back in time and creating a separate category of i and I for the Turkish script.
There's always an exception. You have two choices.
First: you might split it into three strings and trust the translator to handle it.
For example, you might pick:
"Your query matched %(filecount)s in %(directorycount)s"?
"%g files"
"%g directories"
with appropriate plurals for the %g ones.
Another translator might pick:
"Your query %(filecount)s %(directorycount)s"?
"matched %g files"
"in %g directories"
to get the flexibility they need.
You only hit a problem there when you get one word that requires declension dependent on both numbers in a way that you can't fit to be contiguous with the numbers. It's not foolproof, but it's certainly enough flexibility if you're willing to re-word things slightly.
Alternatively, and this is my favoured way, you can have the result of a plural lookup be a second msgid, for example:
msgid "files_found_by_n_directories"
msgstr[0] "files_found_directory_0"
msgstr[1] "files_found_directory_1"
msgstr[2] "files_found_directory_2"
msgid "files_found_directory_0"
msgstr[0] "Your query didn't match any files"
msgstr[1] "Your query matched a file in no directories"
msgstr[2] "Your query matched %(file)g files in no directories"
msgid "files_found_directory_1"
msgstr[0] "Your query didn't match any files in a directory"
msgstr[1] "Your query matched %(file)g file in a directory"
msgstr[2] "Your query matched %(file)g files in a directory"
msgid "files_found_directory_2"
msgstr[0] "Your query didn't match any files in %(directories)g directories"
msgstr[1] "Your query matched %(file)g file in %(directories)g directories"
msgstr[2] "Your query matched %(file)g files in %(directories)g directories"
Let's face it - sometimes it is better to just avoid the problem than to (try to) solve it. Just use some other form of the sentence; people will not even notice, and you will save yourself tons of problems.
Interesting, but I am wondering if it is really worth going through all this trouble just to support a few edge cases.
Personally, I usually don't even notice small mistakes like "1 directories" (or similar mistakes in my native language). Sometimes I will see the correct version somewhere and think "Oh, nice that they thought of that" but I definitely don't expect it.
Are the possible returns of having a "perfect" translation really high enough to justify investing in a much more complex system? I am sure translators who can code functions instead of just putting values into an Excel table will come at quite a premium as well...
You may not notice it. But there are other people who do.
That is the difference between a very well designed product and a product that just does the job.
That is the reason why engineers should not design interfaces; that should be left to UI/UX experts. Just the other day I, an engineer myself, complained to another engineer that his product does the job nicely and looks OK, but it was missing that little twist of a finished product that I'd really like to use. The interface was designed the way I would have done it myself, for lack of UI/UX knowledge.
Of course, I'm sure there are people who notice this. My question was whether something like this makes enough of a difference to justify the investment.
As a developer you only have a limited amount of funds and time to spend on your product. Your goal is to invest these in a way that gives you the biggest returns. Of course it's great to not just have a translation that "does the job", and if you can do better you definitely should. But if that takes a lot of work and only has a negligible effect on your sales, shouldn't you be prioritizing other things?
We have hundreds of thousands to millions of pageviews per day, in tens of languages, on the various pages of our site. The site grows through A/B testing, and I have to say that many of these small things do add up over time (they add up to measurable conversion in the funnel). There is the odd one that surprisingly does nothing or does worse, but generally, paying attention to language details did prove effective for us. I work on this stuff every day in a small team where we know more than 30 languages between us, and sometimes just two or three of these small changes more than pay for our annual wages in a single month.
> Of course, I'm sure there are people who notice this. My question was whether something like this makes enough of a difference to justify the investment.
Honestly - probably not. And that's exactly why we live in the world that is full of crappy stuff that works barely enough to get sold. I personally strongly appreciate when someone, however little economical sense that may have, goes the extra mile and makes their product polished. I'm willing to pay for it.
I don't know - in Russian, either "ты" (singular/informal) or "вы" (plural/formal) works fine for both genders. Now, for your example, the Google-translated result is "Вы уверены, что хотите выйти?", which seems fine to me!
"Wy" (plural you) don't work for single person in Polish, it sounds like "communist-speach" to us (almost like you called people "tovarishch") :) It was only used by soviet puppet politicians during communism (as carbon copy of Russian expression).
So it needs "Ty" (singular you), and with singular you "sure" translates to "pewien" for male recipients and "pewna" for female.
In Polish it should be "Jesteś pewny(or pewien)/pewna, że chcesz wyjść?"
So, it's Polish-specific, not Slavic-specific as I thought.
How is that typically handled in software? I can also imagine non-software examples where this might be hard (A sign reading "Warning: you are entering a restricted area", etc.)
I explained it poorly: verbs in the second person are gender-agnostic; adjectives are gender-dependent.
The "restriced area" is funny, standard version is "Nieuprawnionym wstęp wzbroniony" ~ "For non-priviledged-ones entry is forbidden", it's plural noun made from adjective, in nominative it would be different for male and mixed/female groups (uprawnieni/uprawnione), but fortunately in plural in dative case it's "uprawnionym" for both genders so it works out OK. I guess it's common and maybe that's why the dative case works that way?
In most dialogs in software, adjectives are the problem, especially "are you sure". Usually software that doesn't know your gender for other reasons just uses the male version.
In the non-software world, formal documents are often written with "he/she" in every gender-dependent place, often using the passive voice to cut down on the number of "/"s.
Also, if you 'localize' something using Google Translate[1], please let the user choose the language somewhere in the app.
For example, the Hostelworld iOS app[2] requires the user to change the language for the entire device. For something of a language perfectionist like me, that leaves the app virtually useless.
[1] Translating to English from other languages works fine for me.
Pluralization is a challenge, but we're able to solve this with some pretty simple HTML tags.
For example:
<div>I have <var pluralize="3">3</var> dogs!</div>
Localize.js identifies the <var> tag with the pluralize attribute, and pluralizes the phrase to any language (including languages like Arabic which can have 6 different plural forms).
Can you explain your example a little more? What would translators see in this case for Arabic; would they need to provide three translations? And if there were two variables, nine translations?
"Localization" also implies a lot more than just translation. Does Localize.js handle work like culture-specific number and date formatting? Different collation of records for different languages? How does it identify application-generated text versus user data (eg. on a blog, does it translate blog comments entered by readers, or just text like "Please enter a comment below")?
You don't solve the inflexion problem.
You would need something where the lemmatization/tokenization of the sentence is easily accessible to a grammar engine.
Something horrible like:
<div><lemm="personal pronoun">I</lemm> <lemm=verb>have</lemm> <case=accusative quant=3><lemm=noun quantifier /><lemm=...></div>
Lemmatization isn't necessary in this case, as there's no need for a grammar engine.
Here's how we handle pluralization on our backend (all abstracted away from the user): For a language with 2 plural forms (e.g. French), we create 2 different phrases, one singular and one plural. For languages like Arabic, we create 6 different phrases, one for each plural form. We then send each plural form of the phrase to a human translator, who translates each phrase independently.
This ensures proper plural forms for all variations, without the need for complicated grammar syntax.
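In rough sketch form (Python; the names and the per-language plural-form counts here are illustrative, not the actual Localize.js internals), that expansion might look like:

    # one copy of the phrase per plural form; each goes to a human translator
    PLURAL_FORM_COUNTS = {"fr": 2, "en": 2, "ar": 6}  # roughly per CLDR

    def phrase_variants(source_phrase, lang):
        return [
            {"lang": lang, "plural_form": i, "source": source_phrase}
            for i in range(PLURAL_FORM_COUNTS[lang])
        ]

    for job in phrase_variants('I have <var pluralize="3">3</var> dogs!', "ar"):
        print(job)  # six jobs, one per Arabic plural form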
I've long found that "externalized" translations in po files (or any equivalent) are more trouble than they're worth, for exactly this reason. Translations need to be functions, so they need to be written in a format that's good for writing functions - i.e. a programming language. What we want is a MessageSource interface, and a bunch of language-specific implementations.
Fortunately I work in Scala, so it's very easy to have an "embedded DSL" that's ordinary, first-class code but not much harder for non-technical translators to read or write than the .po format; we can write helpers for grammatical case or numbers or similar. But having the full power of a programming language there means that when you hit a case you haven't thought of (and you will), you can fall back to just an if/else.
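In spirit, the idea is something like the following (sketched in Python rather than Scala for consistency with the other examples here; all names are invented):

    class MessageSource:
        def files_found(self, nfiles: int, ndirs: int) -> str:
            raise NotImplementedError

    class EnglishMessages(MessageSource):
        def files_found(self, nfiles, ndirs):
            if nfiles == 0:
                return "Your query didn't match any files"
            files = "1 file" if nfiles == 1 else "%d files" % nfiles
            dirs = "1 directory" if ndirs == 1 else "%d directories" % ndirs
            return "Your query matched %s in %s" % (files, dirs)

    # A PolishMessages implementation is free to use arbitrary if/else
    # logic for the cases the .po format can't express.
    print(EnglishMessages().files_found(2, 1))  # Your query matched 2 files in 1 directory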
I don't agree with the article. The author goes about manually implementing localizations, eventually throwing out GNU gettext. But it DOES have excellent plural support, and a header in your PO file allows Chinese to use "nplurals=1; plural=0;", for example:
http://localization-guide.readthedocs.org/en/latest/l10n/plu...
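For comparison, here are the stock Plural-Forms headers for Chinese and Polish (these are the standard expressions from the gettext plural tables; Polish gets the usual three-form rule):

    "Plural-Forms: nplurals=1; plural=0;\n"                                                          (Chinese)
    "Plural-Forms: nplurals=3; plural=(n==1 ? 0 : n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2);\n"  (Polish)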
That might be ok for technical people who are used to this kind of "machine speak", but for end users it's often not acceptable. Consider the difference between:
You write tr("I scanned %1 directory.", "", count) and it takes care of applying the correct translation with the right plurals depending on the number.
How does that deal with the case where it should translate to "I didn't scan any directories"? According to the documentation, "In the translated version the variables must still appear." http://doc.qt.digia.com/4.2/linguist-translators.html
We version (localize) projects all the time, so I thought reasonable logic for checking whether a string is empty was to test whether its trimmed length was greater than two, allowing for the two stray characters that show up all the time in content. For example, someone who's not sure how to translate something would type "??" in a field. So, {if $slide.title|length>2} ... This worked for 4-5 languages.
Then our Chinese (Mandarin) division called and asked me to look into buggy behavior with their translations. Turns out a whole sentence got translated to ... two characters, and wasn't showing up.
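A safer check treats only known placeholder junk as untranslated, instead of relying on length (a sketch in Python; the "??" convention is the one from the workflow above):

    def is_translated(s):
        s = s.strip()
        # a complete Chinese sentence can be two characters long,
        # so never use length as a proxy for "still untranslated"
        return bool(s) and s not in {"?", "??"}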
I once suggested an idea: maybe instead of strings in tables one could use something better suited to the task at hand, like, say, code? Maybe let the tables store not only strings but functions as well, so that you could handle the more complex cases directly?
I remember being hit over the head with the gettext manual and told something about translators not knowing how to code.
Heck, I still think it's a neater idea than gettext.
I wonder if Lojban[1] could serve as an unambiguous, largely-context-independent way to store representations of the central concepts, which can then be translated into natural human language.
I saw most of this coming. I have studied a little Ancient Greek, which has the same problem as Arabic, and Polish, which is similar to Russian, and now I am living in Italy.
I guess it is one of those things that we as programmers just forget about too much, expecting translation to be a mechanical process in the last stage of the development cycle.
I wonder if it isn't better to generate the messages as an AST, and have a language generator (the back-end of a compiler, really) that generates strings for each language. I'm sure there would be fewer edge cases that way.
I'm sure someone smart will tell me what assumption I made that's invalid in some language I do not know, but this is already better, simpler, and more correct than a lot of the text-mapping-based internationalizations I've seen.
And I will be able to correct my S-expr to get rid of more assumptions, perhaps just by making it longer, but I won't need to write all the convoluted code prevalent in the article.
Also, with AST-based representation you can add whatever context you need. For example, were I to have used "pig" instead of "cat", this would have already worked fine, since we have (OANIMAL, "pig") and not (OPERSON, "pig"). This is trivial, but you can add whatever amount of context required.
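A toy illustration of the idea (Python; the node categories follow the commenter's S-expr notation, while the renderer is invented):

    from dataclasses import dataclass

    @dataclass
    class Noun:
        category: str  # "OPERSON", "OANIMAL", ...
        lemma: str
        count: int = 1

    def render_en(noun):
        # English back-end: naive pluralization; other back-ends could
        # consult the category and count for gender/case decisions
        word = noun.lemma if noun.count == 1 else noun.lemma + "s"
        return "%d %s" % (noun.count, word)

    print(render_en(Noun("OANIMAL", "pig", 3)))  # 3 pigs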
Now I'm just wondering about the lexicon, since the primitive tokens are things like "like". One potential problem is when two languages don't have words with enough semantic overlap to be comfortable using one as a translation for another. Another potential problem is when...
An analogy to my other comment on AST translation: in English you "like doing something" but in German you "do something gladly" (and again in English you "like a person" but in German you can "have a person dear", akin to English "hold dear"). If we expect that the AST can produce a translation using the single verb "like", we may be in for trouble if the target language doesn't do that (although maybe code can be written that uses the AST and that's aware of this complexity as part of the realization of the translation).
Another example could come from the problem of describing states of being or perception, like "I'm cold", "I'm tired", "I'm hungry", "I'm thirsty", "I'm sick", etc. In English we really like using "to be" plus adjectives for such situations, but other languages have other preferred strategies. For example Latin has specialized verbs for the actions of (at least) being hungry, thirsty, or sick (like esurio, sitio, aegroto); in Romance languages people often "have" hunger or thirst (Spanish "tengo hambre", lit. 'I have hunger'; Portuguese "estou com fome", lit. 'I am with hunger'); in German it is cold "to" a person ("mir ist kalt", not "ich bin kalt" 'I am a cold person').
If you imagine having your AST start from Latin, you may have a challenging story about how you could get from "sitisne?" to "are you thirsty?", "¿tienes sed?", "está com sede?" and "Hast du Durst?" -- not to deny that it may be achievable with enough work.
- While it may be more machine-readable, it's absolutely not human-readable
- Something like (OPRONOUN (3, 0)) assumes that pronouns work like in English. Want to represent text in a language for which it is not the case with this system? No dice.
In short, I'd rather have an improved gettext (without hacks to detect if you're gettext'ing printf'd text...) than your proposed AST solution.
This has been done: it's called ICU MessageFormat. Instead of being a general-purpose language, though, the markup is tailored specifically for localization. One project that supports ICU MessageFormat is http://l10ns.org
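For reference, an ICU MessageFormat pattern for the running files/directories example would look roughly like this (standard ICU plural syntax, not specific to l10ns):

    Your query matched {files, plural, =0 {no files} one {# file} other {# files}}
    in {dirs, plural, one {# directory} other {# directories}}.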
First, I am talking about generating text. In software. Translators would really only need to check the text for correctness, and when it isn't correct, they should file bug reports so that the programmers implementing the language back-end can fix them.
Second, an AST is nothing but the result of applying a grammar to some input, something translators are very familiar with (probably more so than programmers). It is utterly trivial to teach a translator a new textual representation of what they already know.
It's a tree of nodes connected by arrows. They learned the parsing stage in school. At least here in my country, everybody is required to decompose a phrase into its underlying grammatical constituents in the eighth grade. You just never draw it as a tree, but it's trivial to learn to draw it as a tree and read it from a tree.
This is by far my favorite bit of documentation. I have to go look it up every time a manager or client starts asking for localization to justify my high estimates on how long it'll take.
So the solution is to design and implement your interface in a Slavic language (presumably the most complex we've found so far) and translate down to other languages with less demanding rules?
> The %g slots are in an order reverse to what they are in English. You wonder how you'll get gettext to handle that.
I learned C/C++ after C# and this is one thing that really got to me. String formatting in C++ is extremely primitive, which would be forgivable if more recent iterations of the stdlib had something that wasn't so completely incompetent.
For those who don't use .Net, the Italian translation would have been: "In {1:g} directories contains {0:g} files match your query." It doesn't solve all the problems, but being able to specify indices in your template string does solve many.
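The same trick exists elsewhere: POSIX printf allows "%2$s"-style positional arguments, and Python's str.format takes indices too, for example:

    template_it = "In {1:g} directories: {0:g} files match your query."
    print(template_it.format(42, 7))  # In 7 directories: 42 files match your query.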
Makes me wonder why anyone would ever design a translation system that doesn't have the ability to reorder placeholders from the very first version. Perhaps if the author doesn't know anything about languages and their differences - but in that case they probably shouldn't write such a library ...
Until you need to display negative numbers¹. Or native numerals². Or need to know what the numbers even mean. :-)
__________
¹ I tend to set my minus sign to U+2212 to catch errors in code where we just use ToString() instead of ToString(CultureInfo.InvariantCulture). Almost as much fun as putting a Unicode character into your user name that isn't representable in the current legacy codepage on Windows.
² ۱۲ ۱۰ ۴ probably won't work as input to the application trying to parse that line ;)