Internationalisation for beginners (vincentsanders.blogspot.co.uk)
18 points by kyllikki on May 17, 2013 | 14 comments



Thanks for the write-up. A very interesting read. It's an area that is ripe for innovation, and a massively growing industry.

The XLIFF and TMX formats also offer flexibility in the handling of translated data, as with .po files, but there are many problems still to be solved, as contingencies mentions.

As you mention, "Real people are still required to do the translations and verify them", and the army of professional translators and agencies in the market is on hand to do that, but developers often work in formats translators are unfamiliar with.

The bulk of a freelancer's work is MS Office files, run through a CAT (computer-assisted translation) tool, with the resulting file (and translation memory, TM) delivered back. When a developer needs a bunch of strings translated, that strays into unfamiliar territory for the average freelancer.

Specialists are out there, but a common format approach would help here. Most professional CAT tools (costing from 200-1000+ of your local currency units) can process .po files, which is a bonus, but doesn't solve many of the remaining problems out there.

A multi-language translation memory (i.e. several source/target combinations) would be useful in many cases, as would a simple 'export translatables' button in the admin dashboards of apps.

I hope more HN readers dig in to the problems mentioned here, as technical solutions could have a big influence on the future of globalisation(-ization!).


"Finally the Java property file format was used (with UTF-8 encoding) which while having bugs in the import and export escaping these could at least be worked around."

The Java properties file format is ISO-8859-1, not UTF-8. I have to wonder if those are the bugs you hit. While you can have a file that is UTF-8, there are a couple of wrinkles in trying to use it with Java i18n.

See: http://docs.oracle.com/javase/6/docs/api/java/util/ResourceB...

... when you load a ResourceBundle, it tries to load a properties file, and it ends up calling this method:

http://docs.oracle.com/javase/6/docs/api/java/util/Propertie...

... which mentions the encoding.

There are a couple of ways around this: one is to write a bunch of code to change how ResourceBundles are loaded; the other is to use Java's native2ascii tool in your build to provide files that are correctly escaped.
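For illustration, the escaping that native2ascii performs can be sketched in a few lines of Python. This is a hypothetical stand-in for the real tool, limited to BMP characters (Java uses surrogate pairs above U+FFFF): anything outside ASCII becomes a \uXXXX escape, which is what lets a UTF-8-authored file survive the ISO-8859-1 reader used by Properties.load.

```python
def native2ascii(text: str) -> str:
    """Escape non-ASCII characters as \\uXXXX, roughly as Java's
    native2ascii tool does (BMP characters only in this sketch)."""
    return "".join(
        ch if ord(ch) < 0x80 else "\\u%04x" % ord(ch)
        for ch in text
    )
```

So a msgid like "héllo" comes out as "h\u00e9llo", which any ISO-8859-1 properties loader can read back correctly.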


Transifex have extended the format and allow resources to be UTF-8 encoded (see http://help.transifex.com/features/formats.html#java-propert... ). However, the importer does not correctly cope with single quote characters, backslash-n (newline) and several other characters being escaped when they ought to be (as per the document you referenced, which I also used to begin with ;-)

If you look at the script Vivek wrote http://git.netsurf-browser.org/netsurf.git/tree/utils/split-... he clearly documents the odd importer issues, which is why we called it the Transifex resource format and not the Java resource format ;-)


Transifex looks nice, thanks for the tip, but it seems like you have to add a lot of glue to connect your own version control to their proprietary version control via their API.

What I would really like is something like Weblate (http://weblate.org), that you can hook in directly to your code repo. Is there anything like that out there?


Disclaimer: I work at Transifex.

The Perl script written by the user in the article is about 100 lines of code. Doesn't seem like a lot of glue...

Another nice thing that Transifex provides which is not described in the blog is the Transifex client[1]. I wonder why he didn't use it.

[1]http://support.transifex.com/customer/portal/articles/960804...


I did not use it because it was an unverified Python script with a bunch of dependencies; as a rule, I do not like executing untrusted code without at least a basic review.

As you mentioned, the Perl script was 100 lines and, at that moment in time, was easier to integrate than the Python was to review.

Of course, once the Python client has been reviewed, perhaps it would be a more general solution.


What's wrong with Weblate, actually? First time I've seen this project.


Great blog post about l10n and i18n! I'm working on improving that process in our company, and I've currently chosen Zanata [0] as a (Java-based) translation platform: out of Transifex's no-longer-maintained community edition (how unfortunate!), Pootle and Zanata, Zanata's installation was actually painless, and the community around it is very responsive!

Too bad I didn't stumble upon Weblate [1] first though, it looks promising (thanks onemorepassword).

I've set up an independent "localization server" that executes the following process:

1) Regularly pulls new revisions of the code and updates to the latest revision.

2) A Mercurial hook [2] is then triggered, and the source strings are extracted from the code with xgettext [3], so that new gettext POT files are generated.

3) The POT files are finally pushed to Zanata's server via its API.
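As a rough sketch of step 2, the hook might assemble an xgettext invocation like the following. The source glob, keyword and output path are illustrative assumptions, not the poster's actual setup:

```python
def xgettext_cmd(sources, pot_path):
    """Build the xgettext command line for extracting translatable
    strings from source files into a POT template."""
    return ["xgettext", "--from-code=UTF-8", "--keyword=_",
            "--output", pot_path, *sources]

# Inside the hook, something along the lines of:
#   subprocess.run(xgettext_cmd(source_files, "locale/messages.pot"),
#                  check=True)
```

Keeping the command construction as a pure function makes the hook easy to test without xgettext installed.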

We currently do in-house translations for one locale, while others are managed by an external translation provider. Employees in our company can just log in (Zanata provides OpenID authentication) and collaboratively translate and review the application strings. Zanata can also be used to export resource files and push projects to our external translation provider's platform via their API.

But as others have said in this thread, l10n automation currently involves a lot of manual code gluing and adapting to your version control system. There's definitely potential here, since available solutions only address the translation problem and haven't gone very far into the whole process.

I'd be more than glad to compare notes on the subject with others who have gone through the same experience!

---

Links

[0] http://zanata.org/

[1] http://weblate.org/fr/

[2] http://mercurial.selenic.com/wiki/Hook

[3] http://www.gnu.org/software/gettext/manual/html_node/xgettex...


Thanks for the write-up. I deal with the same issue in our company, and while we do work in gettext with UTF-8 (which solves the most basic issues just fine), it seems every project that does i18n cooks it up in its own way, and I have not been able to find many references online. I will probably write an article describing our setup when I get around to polishing it.

The consensus around Transifex in #i18n@freenode seems to be that the open source version is old and unmaintained and should not be used. The SaaS offering is much newer and packs quite a few more features.

The "good" open source offering appears to be Pootle [0].

Honestly, I would be very worried about depending on a cloud service such as Transifex for something so deeply embedded in our (pretty continuous) development process. This requires automation, and all the time invested in integrating with release processes and continuous integration can easily be thrown overboard. Of course, if Transifex were seamlessly integrated with project management applications out of the box, it wouldn't be such a risky proposition.

----

An interesting point about i18n that is quite independent of tool selection is how you write your message identifiers. You can basically use labels (i.e., an ID for the string) or use the "original" string.

Here's the tradeoff: if you use an ID, translators must constantly reference the application to understand what the translation should say (and in any non-trivial application, this is a huge burden for translators), and there is either no string reuse (because places with the same intended content have used different IDs) or the need for an anal curator to go around chastising developers ("the OK button should always be ACTION_BUTTON_LABEL_OK!! fix it!!"). On the other hand, if you use original strings in English you will find that you experience language collisions (two places where the original string in English is the same, but the translated one is not), so you end up resorting to introducing artificial differences to make them unique (e.g. "Request (verb)" and "Request (noun)" instead of just "Request").

A hack that goes a long way, if your engineering team is based in a country that speaks a Romance language, is to use that language instead of English for the original strings. Romance languages are typically more inflected than English, so collisions are greatly reduced. Chances are your translation team is based in that country as well, so no harm done.

----

For those doing branchy development: I put together a wiki page [1] on the Mercurial wiki with a script I use to merge translation catalogs (.po) seamlessly when doing branch merges. It can easily be used with git as well.

----

Links

[0] http://pootle.translatehouse.org/

[1] http://mercurial.selenic.com/wiki/MergeGettext


> An interesting point about i18n that is quite independent from the tool selection is how you write your message identifiers. You can basically use labels (i.e, an ID for the string) or use the "original" string.

> Here's the tradeoff: if you use an ID, you must reference the application constantly to understand what the translation should say (and in any non trivial application, this is a huge burden for translators), and there is either no string reuse (because places with the same intended content have used different IDs), or the need for an anal curator to go around chastising developers ("the OK button should always be ACTION_BUTTON_LABEL_OK!! fix it!!"). On the other hand, if you use original strings in English you will find that you experience language collisions (two places where the original string in English is the same, but the translated one is not), so you end up resorting to introducing artificial differences to make them unique (i.e "Request (verb)" and "Request (substantive)" instead of just "Request").

The PO format has a "context" field (msgctxt) to differentiate among the various uses of a word/phrase. You should also add a comment for your translators in this case.
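In Python's gettext module, for example, the context goes in via pgettext (available since Python 3.8); with no catalog loaded, NullTranslations simply falls back to the source string:

```python
from gettext import NullTranslations

t = NullTranslations()  # stand-in for a real loaded translation catalog

# The same English word, disambiguated by context (msgctxt in the .po file):
#   msgctxt "verb"  msgid "Request"  -> e.g. French "Demander"
#   msgctxt "noun"  msgid "Request"  -> e.g. French "Demande"
verb = t.pgettext("verb", "Request")
noun = t.pgettext("noun", "Request")
# With no translation installed, both fall back to the untranslated string.
```

The French renderings in the comments are illustrative; the point is that the translator sees two distinct entries while the English UI shows one word.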

Also, using an ID as the msgid works against the PO format itself: e.g., falling back to the source string when a translation is missing will not work.

But there are other formats that are ID-based, like .properties in Java.


Whilst the key/value approach is solid, the 'industry standard' .po format (GNU gettext, https://en.wikipedia.org/wiki/Gettext) supports more features, like the complex plural and ordinal/cardinal number handling that is a requirement in some languages.
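Gettext's plural handling goes through ngettext, which in a real catalog applies the target language's plural rule (the Plural-Forms header of the .po file). A minimal Python sketch with no catalog loaded, where English's two-form rule applies:

```python
from gettext import NullTranslations

t = NullTranslations()  # a real catalog would apply the language's Plural-Forms rule

def files_message(n: int) -> str:
    # ngettext picks the form appropriate for n; English has two forms,
    # but e.g. Polish or Arabic catalogs define more.
    return t.ngettext("%d file", "%d files", n) % n
```

With a loaded catalog the same call site serves every language's plural rules, which is exactly what a flat key/value format cannot express.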

In addition, some of the biggest issues with internationalization in my experience (~exclusively i18n projects for 10+ years) are: missing/broken support in certain components (a great reason to contribute resources upstream to open source projects!), managing translations over time, cultural issues, right-to-left text, and differing program-level logic (e.g. maximum SMS message length varies with the character set required). Differing seasons, days of operation and holidays also bite. Calendars are of course a pain (though a solved one), as are timezones, for which a truly synchronized, global approach is frustratingly hard to deploy at the best of times.


The gettext PO file format does indeed provide many other features; I do not disagree. But there does seem to be an over-reliance on it within the platforms I looked at.

The format does have some pretty major drawbacks too, such as msgids becoming "fuzzy", which leads to a different set of issues around the unique keying between translations.

It also tends to lead to the developers' English (the C locale, if you like) being selected as the default language, and it turns out developers like myself are sloppy and sometimes produce barely parsable messages.

Your remaining points are really valuable to someone inexperienced in the field, like myself, so thanks for pointing those out.

It is interesting you call out cultural issues, did you have any specific examples?


> The format does have some pretty major drawbacks too, like the msgid can become "fuzzy" which leads to a differing set of issues related to the unique keying between translations.

I am not sure how much of an issue this is in practice. The main problem of the PO format AFAICT is that it is quite outdated. For instance, it has no support for genders and you cannot "mix" plural rules within a phrase.

> It is interesting you call out cultural issues, did you have any specific examples?

The wikipedia entry on l10n[1] has some examples.

The process of localization is not merely about translating some strings but about adapting them to a specific language and culture, which is the hardest part. For instance, your home page is one of the most important pages in your app and is geared to make as many people as possible sign up. Do you think a simple translation would have the same effect on British, French, Arabic and Japanese users?

[1]: https://en.wikipedia.org/wiki/Internationalization_and_local...


Cultural issues include: whether or not it is appropriate to bring up a particular point in the first place; visual design (e.g. cultural norms for button or multi-point statement/list ordering, spacing, typeface size, etc.); special overtones implied by the use of certain words or combinations thereof; in some cases, whether the use of text or icons is appropriate for a given translation; and whether certain phrases or features should be 'off limits' (e.g. due to cultural taboos, such as making it impossible to configure a system in a way that is disrespectful to one or more religious deities or real-world monarchs).



