Lessons From Linguistics: i18n Best Practices for Front-End Developers (shopify.engineering)
188 points by open-source-ux 11 months ago | 99 comments



In general a good short rule of thumb is to always always _always_ write out the full sentence you want to translate and use the tooling to interpolate everything you want to put in it. That way the translator always sees the full context and you make it harder (although not impossible) for yourself to shoot yourself in the foot. Another recommendation I would add is to use two meta locales in development in addition to whatever you need to support otherwise: id and pseudo. The id locale should be an identity function so you (and the translator and everyone else) can open up a page and see what keys are used on that page. The pseudo locale should be either random or pseudorandom text and is good for both ensuring you haven't accidentally left something hardcoded as well as checking how your layout plays with different length strings. These ideas alone will get you most of the way most of the time, and they have the added benefit that they're straightforward to teach to juniors.
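A minimal sketch of the two meta locales described above. The catalog shape, the key, and the locale names `identity` and `pseudo` are assumptions made for illustration, not part of any particular library.

```typescript
// Hypothetical source catalog; the key and string are illustrative only.
const sourceStrings: Record<string, string> = {
  "cart.items": "You have {count} items in your cart.",
};

// Accented look-alikes keep pseudo text readable while making it obviously fake.
const ACCENTS: Record<string, string> = { a: "á", e: "é", i: "í", o: "ó", u: "ú" };

// Accent vowels outside {placeholders}, pad by roughly 50%, and wrap in brackets
// so clipped or hardcoded strings stand out.
function pseudoLocalize(text: string): string {
  const accented = text
    .split(/(\{[^}]*\})/) // the capture group keeps placeholders as their own parts
    .map((part) =>
      part.startsWith("{") ? part : part.replace(/[aeiou]/g, (c) => ACCENTS[c] ?? c)
    )
    .join("");
  return `[${accented}${"~".repeat(Math.ceil(text.length / 2))}]`;
}

function translate(locale: string, key: string): string {
  if (locale === "identity") return key; // identity locale: show the key itself
  const source = sourceStrings[key] ?? key;
  if (locale === "pseudo") return pseudoLocalize(source);
  return source; // real locales would look up their own catalog here
}

console.log(translate("identity", "cart.items"));
console.log(translate("pseudo", "cart.items"));
```

Any string on the page that shows up without brackets in the pseudo locale is probably hardcoded.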


Common translation tools would benefit from the ability to input some domain information. Typing in an entire sentence is a sensible start, but oftentimes a lone sentence does not carry enough context. Example: in my language, the term "fresh water" translates to "fresh water" or "sweet water". "Sweet water" is the opposite of saltwater, while "fresh water" would only be used in a context like "water in a tank has been replaced". The translations in Cities: Skylines use the literal translation and this has always irked me; if the translator accepted "plumbing" as a context, it would have produced a correct translation.


Most common translation tools already support this. From GNU gettext where it's called context [0], to FormatJS where it's called a description [1], all translation tools worth their salt support providing additional information about the usage. The problem is generally not on the tooling side but on the awareness side: if the person implementing the feature only speaks one language, it's easy to overlook issues that wouldn't occur in their native language.

[0] https://www.gnu.org/software/gettext/manual/gettext.html#Con...

[1] https://formatjs.io/docs/intl/#message-descriptor
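To illustrate the idea of context, here is a rough sketch of a lookup keyed by context plus source text. It is not the gettext or FormatJS API: the function name, the context labels, and the separator are all assumptions, and the Portuguese strings are only illustrative.

```typescript
// Separator between context and source text; an arbitrary choice for this sketch.
const CONTEXT_SEPARATOR = "\u0004";

// Hypothetical Portuguese catalog: the same English source text gets a
// different translation depending on the context the developer supplies.
const ptCatalog: Record<string, string> = {
  [`salinity${CONTEXT_SEPARATOR}fresh water`]: "água doce",
  [`freshness${CONTEXT_SEPARATOR}fresh water`]: "água fresca",
};

// Look up a translation by context and source text; fall back to the source text.
function translateWithContext(context: string, sourceText: string): string {
  return ptCatalog[`${context}${CONTEXT_SEPARATOR}${sourceText}`] ?? sourceText;
}

console.log(translateWithContext("salinity", "fresh water"));
console.log(translateWithContext("freshness", "fresh water"));
```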


If you can tell me how to provide context in the Google translation API, I would be very happy. I don't think it exists, only in the web version.


For what it's worth, the term "fresh water" has each of those separate connotations in American English too, and it can be just as context-dependent. Both "These fish belong in fresh water, not salt water" and "I put fresh water in the ice trays because the old stuff was stale" are valid contextually and linguistically.


Yes, but it does not necessarily have those two connotations in other languages (which is the issue here).

In Portuguese, if you want to say "fresh water" (as in, "low-salinity water"), you use the term "sweet water", rather than "fresh water" ("fresh water" can only be interpreted either as "cold water" or "non-stale water", but never as "low-salinity water").


In English we have the term potable to designate drinkable water, which comes from an old PIEish verb meaning to drink that no one uses anymore and so it's often confused with "pottable" i.e., to be put into a pot, as in a game of pool (a non-aquatic table sport fyi)


What does PIEish mean?


Not op, but I think it stands for Proto-Indo-European language.


Implementing something like an "id" locale is a great idea. Just don't call it that; it's the locale code for Indonesian.

Using a word that's longer than three characters like "identity" will help ensure that it won't conflict with a real locale.


For special terminology an invaluable trick is to search for the wikipedia article in your language and switch the language if possible. Of course this has to be used with caution, but it can help you find the right variant of a word.


A third dev locale worth considering is one which consists of random emojis. Makes it very easy to find hardcoded strings.


A particularly tricky case of this is with usernames and user defined content.

Eg, a notification like "Alice is online" in some languages requires knowing Alice's gender. Which may be something that's not even stored anywhere in the system. There's probably some language out there that requires some other piece of personal info for a correct translation.

To make things tricky, try having a multitude of items that you refer to: "You're holding a dagger". Now you need to have a serious discussion with your translators, because this is going to get all kinds of tricky, as the maker of Obra Dinn discovered: https://www.youtube.com/watch?v=OMi6xgdSbMA

And to make things extra-tricky, allow users to create content. "Alice gave you a banana", where "banana" is a custom object Alice made herself.

Most translation efforts seem to give up at this point and resort to something stilted like "Alice: online"
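One way to avoid always giving up is a select-style message: one full sentence per gender, with the stilted form kept only as the fallback when the gender is not known. This is a hand-rolled sketch, not any library's API; the `Gender` type and the Spanish strings are assumptions for illustration.

```typescript
type Gender = "female" | "male" | "unknown";

// Hypothetical Spanish catalog entry: a complete sentence per variant,
// plus a neutral fallback for when the gender is not stored anywhere.
const isOnlineEs: Record<Gender, string> = {
  female: "{name} está activa",
  male: "{name} está activo",
  unknown: "{name}: en línea",
};

function isOnline(name: string, gender: Gender = "unknown"): string {
  // A replacer function avoids special handling of "$" sequences in user-supplied names.
  return isOnlineEs[gender].replace("{name}", () => name);
}

console.log(isOnline("Alice", "female"));
console.log(isOnline("Alice"));
```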


And then you "discover" that depending on context "Alice" can change as well. In Finno-Ugric languages there are no prefixes and the names themselves change. In Finnish:

Alice is online -> Alice on verkossa

Mail to Alice -> Postita Alicelle

And no, you can't use "%s" + "lle" either, it depends on the name:

Mail to John -> Postita Johnille
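A small sketch of why the suffix trick breaks, using only the two names from the examples above. The second function assumes the inflected form comes from somewhere else (a stored field or a morphology library, both hypothetical here).

```typescript
// Naive approach: glue the suffix "lle" onto whatever name comes in.
function mailToNaive(name: string): string {
  return `Postita ${name}lle`;
}

// Alternative sketch: the caller supplies the already-inflected form of the name.
function mailTo(inflectedName: string): string {
  return `Postita ${inflectedName}`;
}

console.log(mailToNaive("Alice")); // matches the Alice example above
console.log(mailToNaive("John")); // lacks the linking vowel of "Johnille"
console.log(mailTo("Johnille"));
```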


I think in most common languages you can choose between a combined adjective form like

Alice es activo(a)

Or Alice es activ@

But honestly as engineers, we should follow the advice of making the stilted constructions. We're already trained as users to expect that anyways from all the other apps.

If it's just a monolingual app. Sure go nuts, make it read super fluidly. If you have to localize, just do it the tried and true way


> But honestly as engineers, we should follow the advice of making the stilted constructions. We're already trained as users to expect that anyways from all the other apps.

I do the opposite.

Instead of training my users to accept a kludge, I take the time to make it correct, even if it's hard.

I have the luxury of full-time professional translators on staff, and I know that not everyone else does.

But I've always believed that computers are supposed to work for people, not the other way around.


In this particular case, only "está" (instead of "es") is correct. Otherwise it means that the person is an active person, instead of meaning that at the present period of time, the person can be found active/online.


> If it's just a monolingual app. Sure go nuts, make it read super fluidly.

The Polish IM app Gadu-Gadu was monolingual. The user's gender was part of the user profile. Now, virtually all female Polish names end in -a (foreign names aside, there's maybe one or two exceptions). There's a fairly popular male name (Kuba) that ends with -a, and GG did use the feminine verb when showing the notification that he's online.

It's usually a diminutive form, but some people have this as their official name, and people are likely to use a diminutive form in your contacts list.


Also, in Chinese, the logogram might change depending on the gender: 他/她, he/she, also part of his/her


To make things even worse, there is 祂 specifically for god-like characters (for example, Jesus), and 牠 for animals.


@ as a superposition of “a” and “o”; very nice.

If gender is how we sneak conditional expressions and nondet Prolog predicates into natural languages, then I’m in.


Seamlessly embedding user-generated content into template text is a losing battle. There are always going to be users who test the waters with emojis, zalgo text, l33tspeak profanity, SQL/XSS injection attempts, copypastas, and so on. In the face of such merry nonsense, linguistic idiosyncrasies like proper gendered forms are moot concerns.


That reminds me I need to add some Zalgo test cases to my layout code. I got a stern warning from dang for Zalgo-ing a HN comment once.


awesome video


Smells:

* If you're concatenating sentence bits, you're doing it wrong

* If you're formatting numbers, dates, times, or durations by hand, you're doing it wrong

* If you're formatting strings with placeholders and you don't know the gender and number of your placeholders, your translators are going to have a bad time

There are two more important rules that this article doesn't mention

* Write long descriptions of what the thing is that you're asking someone to translate (button label, menu item, dialog header...), and include a screenshot. Translating very short strings without context is very difficult.

* Ask your translators to do a global once-over QA pass once in a while to detect inconsistencies and weirdness. Once I dealt with a product that had three tabs, and two of the tabs were translated identically. Each tab header translation made sense on its own, but as distinct tab headers side by side, it made no sense to use the same word.
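For the "formatting by hand" smell, a short sketch using the built-in `Intl` API instead. The sample values and the locale list are arbitrary; the exact output depends on the locale data shipped with the runtime.

```typescript
const sampleAmount = 1234567.891;
const sampleDate = new Date(Date.UTC(2024, 0, 1)); // 1 January 2024, pinned to UTC for repeatability

// Let the runtime pick grouping and decimal separators for the locale.
function formatNumber(locale: string, value: number): string {
  return new Intl.NumberFormat(locale, { maximumFractionDigits: 2 }).format(value);
}

// Let the runtime pick month names and the order of day, month, and year.
function formatDate(locale: string, value: Date): string {
  return new Intl.DateTimeFormat(locale, {
    year: "numeric",
    month: "long",
    day: "numeric",
    timeZone: "UTC",
  }).format(value);
}

["en-US", "de-DE", "nl-NL"].forEach((locale) => {
  console.log(locale, formatNumber(locale, sampleAmount), formatDate(locale, sampleDate));
});
```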


> If you're concatenating sentence bits, you're doing it wrong.

This is completely impractical for anything other than the most static content. Take the most basic line from any single imagined game.

E.g. “Your [Abrams tank] has [fired] a [lead-tipped bullet] at a [green] [dragon].”

Ok, let’s say you have fifteen unit types, five attack types, thirteen types of ammo, thirty enemy type adjectives, and fifty enemy nouns. This is one announcement type out of hundreds in your game (Your green dragon was spotted by an enemy Lizardman! Your attack dirigible is running low on hydrogen!) You’re going to - what - precompose all of this? And pay to have your trillions of lines translated into a dozen languages?

At that point it would seem more cost effective to use your i18n budget to finance some kind of monolingual colony on Mars, and just sell your product there.

Yes, this is going to be hard to concat programmatically, and yes, you’ll likely make some mistakes and might need to rethink some phrasing. One language’s ‘blue’ is another language’s ‘wine dark’. One language’s adjective (he is tall) is another language’s verb (he towers). But “don’t concatenate” is simply not useful advice, because it is not actually actionable for most projects.

People act like i18n is some great moral-ethical challenge, and if you fail, you will mortally offend millions. You won’t. You might sound silly, but people are patient, and you’ll recover.

No matter how hard you try and plan, you will, at some point, get localisation wrong. Oh, you’ve thought of gender? That’s great! What about case? Ok, but that only works for languages with nominative-accusative alignment. What about languages with ergative-absolutive alignment? Or a tripartite system? And this is just nouns - wait until we get to verbs!


Interpolation is potentially problematic but frequently necessary. Concatenation is pretty much always flat out wrong.


I think "concatenate" is the key word here: it will lead to "your fire lead-tipped tank Abrams" in some languages. Token replacement should be far more okay ("Player enacted laissez-faire policy").


As others have said, I was talking about concatenation as opposed to interpolation. Interpolation still has challenges. It's hard to make arbitrary sentences with many placeholders work with agreement. But concatenation is almost always a bad choice.

Also am not talking about this as an absolute moral-ethical challenge. In the examples I gave, it's usually less work long term on the developer to do the right thing, and a more efficient use of translation budget because you're not spending money to get convoluted translations that sound confusing or don't sound idiomatic. Why bother translating if your users are going to use English anyways?
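A rough sketch of the distinction being drawn here: with named placeholders, each locale's template owns the word order, which code-side concatenation cannot offer. The templates and the `interpolate` helper are hypothetical, and the sketch does nothing for agreement (gender, case, number), which stays hard either way.

```typescript
// Hypothetical per-locale templates. Each translator owns the whole phrase,
// including where the placeholders go.
const phraseTemplates = {
  en: "{color} {noun}",
  es: "{noun} {color}",
};

// Replace each {name} with its value; leave unknown placeholders visible.
function interpolate(template: string, values: Record<string, string>): string {
  return template.replace(/\{(\w+)\}/g, (match, name: string) => values[name] ?? match);
}

const enPhrase = interpolate(phraseTemplates.en, { color: "green", noun: "house" });
const esPhrase = interpolate(phraseTemplates.es, { color: "verde", noun: "casa" });

console.log(enPhrase);
console.log(esPhrase);
```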


> Also am not talking about this as an absolute moral-ethical challenge.

I find myself frustrated by devs perpetually talking about localisation as an impossible, semi-mystical 'gotcha' that causes untold cultural offence if mishandled, and I took that frustration out on your comment, and I apologise.

I agree that keeping localisation in mind early is better than remembering about it too late.


I note the replies, but respectfully disagree that concatenation versus interpolation is in any way a meaningful distinction.

Analysing “green house” as either “green”.concat(“ house”) or “{} {}”(green, house) is not meaningfully different, and runs into the same i18n problem (the order should be reversed for many languages, like Spanish and French).


> Ask your translators to do a global once-over QA pass once in a while to detect inconsistencies and weirdness. Once I dealt with a product that had three tabs, and two of the tabs were translated identically. Each tab header translation made sense on its own, but as distinct tab headers side by side, it made no sense to use the same word.

Even the very big ones get this wrong. I believe AliExpress still uses the same translation in Dutch for the word “register” as they do for “login” (both can be translated as “aanmelden”). Kind of an important distinction.

Seems like this should be so easy to detect (look for duplicates on the right hand side of the translations where there are none on the left).
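The check suggested above could look roughly like this: group source strings by their translation and flag any translation shared by more than one source. The catalog shape and the Dutch entries are assumptions for illustration; a real check would probably want an allow-list for legitimate duplicates.

```typescript
// Return translations (normalized) that are shared by more than one source string.
function findCollisions(catalog: Record<string, string>): Map<string, string[]> {
  const byTranslation = new Map<string, string[]>();
  Object.keys(catalog).forEach((source) => {
    const normalized = (catalog[source] ?? "").trim().toLowerCase();
    const sources = byTranslation.get(normalized) ?? [];
    sources.push(source);
    byTranslation.set(normalized, sources);
  });

  const collisions = new Map<string, string[]>();
  byTranslation.forEach((sources, translation) => {
    if (sources.length > 1) collisions.set(translation, sources);
  });
  return collisions;
}

// Illustrative Dutch catalog based on the example in the comment above.
const nlCatalog: Record<string, string> = {
  "Register": "Aanmelden",
  "Log in": "Aanmelden",
  "Cancel": "Annuleren",
};

const nlCollisions = findCollisions(nlCatalog);
console.log(nlCollisions);
```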


Oh I’ve seen that AliEx one. Apparently “sign in” in Chinese and “sign up” in Japanese happen to share the same literal (“登録”), and it’s byte-for-byte identical in UTF-8, so some software might have no clue about the difference.


Another aspect to be aware of is that English is often much shorter than the equivalent translated text, especially on buttons with text labels. I remember many years ago we used a rough rule of thumb of always doubling the space used for English to ensure there was enough space for the translated text.


While developing, it's a good practice to have at least "lorem ipsum" or auto translated versions of the phrases in your product, if you're targeting multiple languages.

Likewise though, character based languages like Chinese CAN be a lot more compact than English. But counterpoint, those same languages have a different internet culture where they will put more information in headlines. Random website I looked up; https://cn.chinadaily.com.cn/ has a small header in a sidebar:

    数读中国 | 韧性强活力足信心稳 外贸外资稳中提质

But it translates to something that's three times as long in English:

    Data Reading China | Resilience is strong, vitality is sufficient, confidence is stable, foreign trade and foreign investment are stable and quality is improving

Another thing to keep in mind for development: right-to-left languages like Arabic, Hebrew, Persian, Urdu, Kashmiri, Pashto, Uighur, Sorani Kurdish, Punjabi, and Sindhi. I was lucky to be involved in a very international website that also had a Hebrew version; the CSS developers (another stroke of luck: we had dedicated CSS developers) even made the site layout right-to-left for RTL languages.


Years ago when I worked at Microsoft, they had a "pseudo localization" tool, at least for Visual Studio, that would inject backwards text, longer text, etc... It was gibberish, but it gave QA instant feedback on whether the UI would accommodate the localization process.

I did a quick web search, and it appears to be a well-understood practice, even covered in a different Shopify blog:

https://www.shopify.com/partners/blog/pseudo-localization.


I like to have a "double length" localization mode that doubles the English text; useful for fixing some layout issues while waiting for translations.

A separate toggle to underline everything going through the localization system (or adding some __delimiters__ around it if you don't support rich text) is great for spotting text that is not running through localization yet.

Once you have translations back, a "longest length" translation mode is more useful. It picks the longest translation for each token, no matter what the language. Confusing to look at, but great for seeing only the places where you actually have text-fit issues.
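A sketch of the "double length" and "longest length" modes, assuming catalogs are plain key-to-string maps (the shape and sample strings are made up). String length in code units is only a rough proxy for rendered width.

```typescript
type Strings = Record<string, string>;

// "Double length" mode: repeat the source text to simulate longer translations.
function doubleLength(source: Strings): Strings {
  const out: Strings = {};
  Object.keys(source).forEach((key) => {
    const text = source[key] ?? "";
    out[key] = `${text} ${text}`;
  });
  return out;
}

// "Longest length" mode: for each key, pick the longest string across all locales.
function longestPerKey(locales: Record<string, Strings>): Strings {
  const out: Strings = {};
  Object.keys(locales).forEach((locale) => {
    const strings = locales[locale] ?? {};
    Object.keys(strings).forEach((key) => {
      const candidate = strings[key] ?? "";
      if (candidate.length > (out[key] ?? "").length) out[key] = candidate;
    });
  });
  return out;
}

const sampleLocales: Record<string, Strings> = {
  en: { save: "Save", cancel: "Cancel" },
  de: { save: "Speichern", cancel: "Abbrechen" },
  nl: { save: "Opslaan", cancel: "Annuleren" },
};

console.log(doubleLength(sampleLocales["en"] ?? {}));
console.log(longestPerKey(sampleLocales));
```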


Another tip I've found in pseudo-localization is to add in lots of emoji to the text. That can help check some Unicode assumptions/encoding issues in your localization pipelines in ways English readers can better visualize.
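One of the assumptions emoji tend to expose is "string length equals character count". A small sketch, with an arbitrary sample string:

```typescript
const emojiSample = "Save 👍";

const codeUnits = emojiSample.length; // counts UTF-16 code units
const codePoints = Array.from(emojiSample).length; // counts Unicode code points

// Truncating by code units can cut the emoji's surrogate pair in half.
const naiveCut = emojiSample.slice(0, 6);
// Truncating by code points keeps the emoji whole.
const safeCut = Array.from(emojiSample).slice(0, 6).join("");

console.log(codeUnits, codePoints);
console.log(naiveCut === emojiSample, safeCut === emojiSample);
```

Code points are still not the same as user-perceived characters (some emoji are sequences of several code points), so this only covers the simplest case.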


One common problem on macOS that international users often complain about is how messages in their native language are really long so they get cut off, and they're never fixed despite being reported for years. But if even Apple, a multi-billion dollar company, struggles with these problems, I'm not sure there are any easy solutions.


That reminds me of the German version of TES Oblivion with its "Schw. Tr. d. Le.en.-W." for "small potion of healing"


Non-Latin fonts also have different glyph widths. For example, Cyrillic text might be the same "length", but glyphs in a system font might be wider, hence overall the string needs more space.


On Apple platforms, translators can not only translate strings but also adjust the actual layout of the UI. Sure that's adding more work but is probably worthwhile.


It's frustrating that the post does not provide any solution for some of the problems like declensions and gender. I internationalised a couple of applications, and it's incredible how i18n frameworks are still so limited in linguistic aspects that are so important for so many languages.

Finnish, for example, works with a ton of suffixes, and you end up having to rewrite the copy (to unnatural structures) to fit interpolation and declensions. Portuguese genders almost every subject in a phrase construction.

The web is killing (or creating artificial versions of) many languages because of the lack of tooling...


Good news is folks are working on better tooling, for example fluent[0] by mozilla. The bad news though is adoption.

---

0: https://projectfluent.org/


Fluent is terrible. I excitedly implemented it in my most recent app, and immediately ran into so many issues.

1) It can transclude variables in translation strings, but these variables cannot themselves be localised. This makes Fluent completely useless for constructions like “Your knight has killed a dragon with a crossbow” in any language with case or gender, unless you pre-translate every possible combination in advance. Which is absurd - the possible combinations almost immediately grow astronomical.

2) The parser is extremely sensitive, and it produces errors which are terse and difficult to debug. Something as basic as a localisation key appearing twice throws an unrecoverable exception.

3) The input files mandate a weird arrangement of new lines for even the simplest branching (e.g. 1 -> cat, 2+ -> cats), which makes them incredibly difficult to follow. I quickly lost track of my own reference English file - good luck giving them to a translator to figure out for any languages with greater grammatical complexity.

4) The documentation is too Spartan to know what happens in edge cases. The word “Fluent” does not restrict language-related web searches in any useful way.

I worked around issue 1 with some ugly nested localisation hacks, but I plan to rip Fluent out before release. It heralds itself to be the saviour of all i18n, but it’s literally worse than the mess that came before it.


Hi! Thank you for your critique!

> 1) “Your knight has killed a dragon with a crossbow”

We have a proposal for dynamic references to address this problem - https://github.com/projectfluent/fluent/issues/80 - it's non-trivial but I hope we'll see it solved in Fluent and/or in MessageFormat 2.

> 2) The parser is extremely sensitive

True. It's on purpose. We wanted to start with strict and loosen, rather than the opposite.

> 3) The input files mandate a weird arrangement of new lines for even the simplest branching

Same as above.

> 4) The documentation is too Spartan to know what happens in edge cases.

We're a small team :)

> It heralds itself to be the saviour of all i18n, but it’s literally worse than the mess that came before it.

I'm sorry to hear it doesn't work for you. I'm relieved that your criticism seems mostly subjective, except for one missing feature that no other l10n system has as of yet. We'll keep pushing, but if you encounter a better l10n system, please let me know! We're working on Unicode MessageFormat 2.0 based on Fluent and incorporating lessons learned.


I wish you guys the best, but I think you’re being a little self-congratulatory here.

The first feature is not optional - it has been a feature of i18n systems since the 1990s, possibly earlier. I’ve seen kludged-together in-house solutions that can do it without breaking a sweat. It is currently not feasible to use Fluent to localise any substantive, dynamic content in languages with case or gender - which is the main challenge an i18n package exists to solve. (I note the issue you link is five years old, dismisses the problem as not significant, and flat out states it is not being worked on.)

Translation files are generally made by translators, not programmers, and the fact that Fluent falls over in a slight breeze makes it difficult to imagine a translator being able to produce working Fluent files. This is not a ”subjective” problem. Translators do not, and should not, work for free. Using Fluent adds considerable (and needless!) complexity and therefore expense.

As you point out, you’re working on a new data format, so it’s unclear why anyone should adopt (and pay for translations in) the current moribund format.

I genuinely do wish you guys the best, and I apologise if I spoke too bluntly above, but it is not merely a matter of personal opinion that Fluent is de facto still in alpha.


i18next has that covered, as well as all the other issues mentioned: https://www.i18next.com/translation-function/context

(note that I don't fully understand gendered languages, the above may or may not be applicable)


With formatjs [0], you don't have to split the sentence for interpolation. The same example as in the article can be implemented as:

    const message = defineMessage({
      defaultMessage: 'Learn more about <a>supported images</a>.',
      description: 'Footer text containing a hyperlink',
    })

and the anchor element can be interpolated as:

    formatMessage(message, {
      a: (chunks: ReactNode) => <a href="#link">{chunks}</a>,
    })

[0]: https://formatjs.io


My weapon of choice is i18next, which elegantly handles inline markup, even nested translations really well (although this example has less than ideal keys)

    <Trans i18nKey="userMessagesUnread" count={count}>
      Hello <strong title={t('nameTitle')}>{{name}}</strong>, you have {{count}} unread message. <Link to="/msgs">Go to messages</Link>.
    </Trans>


My biggest issue with i18n is not the grammar. It's the sheer uncanny valley of it.

There are translations to "Hello" in Portuguese but I'd cringe at being greeted with them by a webmail client instead of the more formal Good Morning/Afternoon.

The formal/informal gradient is very culture-bound and even hard to pin to a scalar space of possibilities. In a work environment people will fluently code-switch too -- say, between ranks or in the middle of a tiresome meeting when everyone takes five minutes to kick back and comment on lighter matters. It's hard to situate a computer in this social context.


Good article. Knowing some Slavic, Latin or Asian language helps immensely when dealing with i18n.

I wrote an article on a similar subject (with some additional technical details about Android and iOS) a few years ago, with a few similar conclusions:

https://jakub.gieryluk.net/blog/reusing-software-translation...


Don't forget Right-To-Left languages, that also affects how UI elements are arranged (position within the page) and rendered (input widgets like sliders get reversed).


I think the currently blessed CSS solution is to only use {inline,block}-{start,end} https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_logical... in place of {left,right,top,bottom} which then automagically supports even vertical scripts like Traditional Mongolian, but most people probably don't think that far ahead when just starting out.
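To make the physical-to-logical correspondence concrete, here is a small helper that rewrites a single declaration. The mapping is a partial list, the helper name is made up, and the equivalence shown only holds for horizontal left-to-right text, which is exactly why the logical names are the safer default.

```typescript
// Partial mapping from physical to logical CSS property names.
// These pairs coincide only in a horizontal, left-to-right writing mode.
const PHYSICAL_TO_LOGICAL: Record<string, string> = {
  "margin-left": "margin-inline-start",
  "margin-right": "margin-inline-end",
  "margin-top": "margin-block-start",
  "margin-bottom": "margin-block-end",
  "padding-left": "padding-inline-start",
  "padding-right": "padding-inline-end",
  "border-left": "border-inline-start",
  "border-right": "border-inline-end",
};

// Rewrite a single declaration such as "margin-left: 8px".
function toLogical(declaration: string): string {
  const colon = declaration.indexOf(":");
  if (colon === -1) return declaration; // not a declaration; leave untouched
  const property = declaration.slice(0, colon).trim();
  const value = declaration.slice(colon + 1).trim();
  return `${PHYSICAL_TO_LOGICAL[property] ?? property}: ${value}`;
}

console.log(toLogical("margin-left: 8px"));
console.log(toLogical("padding-right: 1rem"));
```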


I know Traditional Mongolian comes up regularly in i18n contexts, but I've never taken the time to see if it's something an app should support, based on how actively it is used. It seems like it's too minor, mostly due to the fact that nobody could be bothered to support it in electronic formats.

Vertical Japanese is very common in the print magazines I read, and is RTL amongst the rest of the Japanese text which is LTR. I just don't see it much online because of the trouble of setting it.

If I ever have too much spare time I must try to support vertical Mongolian in my app.


I work in highly procedural game dev, where lots of text has unpredictable text inserted within it. E.g., a notification for how “(person 1) has left (room1) to perform (action) in (room2) with (item1)”.

So your “never do interpolation” trick is a bit of an over-simplification already. Not to mention all the ways to modify a verb or noun with surrounding words, e.g., preceding it with “the”: “I walked to Larry” vs “I walked to the couch”.

Our language systems for our game got pretty complex, pretty fast, and I find these simplified, hand-wavy articles pretty frustrating tbh.


Translation/internationalisation is one of the hardest problems, and it is not going to be solved by technology alone.

RTL, plurals that depend on the number, non-Latin character behavior, font issues, UI broken by longer translations, context-dependent translation, etc. Every time I start a project or think about it, I start sweating.


well, at least we are trying to solve i18n with tech https://github.com/inlang/inlang :D


I don’t see any real way to deal with it except overstaffing (not gonna happen), designing everything into discrete horizontal lines, or simply coming to terms with the idea that some locales will break more than others, depending on who you are and who your customer base is. And deciding what you’re gonna do about that when it arises.

Buttons are gonna look jacked up in German because all their words are 45 letters long. RTL plays havoc with the layout. Referring to Taiwan the wrong way is a Defcon 2 political crisis. And on and on and on, it never fucking ends.


Can I just rant for a second about how much I hate the whole `<starting-letter-of-a-word-><count-of-inner-letters><ending-letter-of-a-word>` trend that folks seem to love? This intentional sort of obfuscation makes it hard for juniors or students (the exact people who would be interested in an article like this), to engage in the material. The most egregious example is doing it for the word 'accessibility'!


> <starting-letter-of-a-word-><count-of-inner-letters><ending-letter-of-a-word>

Also known as <s73d> for short.


I think the Polish example is even a bit more complex than that: it's not really that Polish has a separate form for "a few". That's the regular plural form. It's that Polish uses the genitive plural with certain numerals, instead of the nominative. That is, instead of saying "5 dogs" you say "5 of dogs".

This doesn't, to my knowledge, apply if there is no numeral provided, even if we're talking about 1000 dogs, so it wouldn't be right to call it a plural.

Of course, the point of the article still stands.

Disclaimer: I don't speak Polish. I did learn some Czech though at some point (most of which I've forgotten).
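In practice, code usually does not model the grammar directly; it asks the runtime which CLDR plural category a number falls into and picks a full string per category. A sketch with `Intl.PluralRules`, where the catalog entry is hypothetical and the Polish word forms are illustrative:

```typescript
// Hypothetical Polish catalog entry with one variant per CLDR plural category.
const dogsPl: Record<string, string> = {
  one: "{n} pies",
  few: "{n} psy",
  many: "{n} psów",
  other: "{n} psa", // category used for fractional numbers
};

const plRules = new Intl.PluralRules("pl");

function dogs(n: number): string {
  const category = plRules.select(n);
  const template = dogsPl[category] ?? dogsPl["other"] ?? "{n}";
  return template.replace("{n}", String(n));
}

[1, 2, 5, 12, 22].forEach((n) => console.log(n, plRules.select(n), dogs(n)));
```

Note that 12 and 22 land in different categories, which is why a simple "singular vs plural" switch in code cannot work here.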


Not covered here: fonts and glyph appearances, which will nearly always end up displaying wrong in certain Asian languages -- https://heistak.github.io/your-code-displays-japanese-wrong/


Thank you for this. I tested it immediately on my site. Now, during testing I looked at the debug output before I browsed to a page, and the debug output was displaying the wrong characters, hanzi instead of kanji. But, when I browsed to the actual page, the characters were right, presumably because I'm setting the lang attribute in the HTML tag. Thank god I did that.

Still, testing this uncovered a bug in my Traditional Chinese code, so there's that to fix now :)


> The order of the words is hardcoded, with “added” preceding the date. This would be incorrect in many languages, from Dutch (“1 januari toegevoegd”)

This is simply not true. Since English and Dutch are both Germanic languages they largely work the same way. Saying "Toegevoegd: 1 januari" would be just fine. By using "1 januari toegevoegd" you're syntactically changing the sentence.


1 januari toegevoegd sounds like you're adding the date. If you're going to write it as a sentence it ought to have something like "Op 1 januari", and in that case it actually doesn't matter whether you put the toegevoegd before or after "op 1 januari". But I agree, nothing wrong with "Toegevoegd: 1 januari".


It's best practice to shorten "Internationalization (i18n)" to "I25)"


This is a good article, though as someone who prefers references/tables to prose for technical topics, the real find for me was the link out to the Unicode CLDR project (which sadly contains a LOT of broken links right now due to a data migration effort but I'll bookmark it & hopefully it'll be navigable in future).

As someone with a Polish partner, who also fluently speaks my own weird minority local language (Irish), I'm more than well aware of pluralisation pitfalls; Irish may have one of the most complex rulesets, so much so that I'm almost certain it isn't represented in CLDR (possibly can't be). But I see the plural pitfall brought up in so many of these guides - I've always been curious about other unexpected/unintuitive pitfalls across languages out there. Would love if there was a simple reference of the most interesting (starting with plurals I guess).


Remarkably, I'm in a similar situation: Irish, plus Polish learnt because of my wife and extended family. The codebase I have inherited needs a lot of i18n work, and I found it immensely helpful to have a background in two languages that are particularly tricky relative to English.

It helped in quickly scoping out what needs to be worked on and where the inherited setup clearly falls short. I don't think knowing a great deal about other languages is necessary for the same effect, though; you just need enough to smell that something might be a bit trickier, like knowing that the case system in language X shows up in different ways, or that verb + pronoun order is not predefined.


The CLDR plural rules for Irish are here https://www.unicode.org/cldr/charts/42/supplemental/language... Not quite as complex as the ones for Breton https://www.unicode.org/cldr/charts/42/supplemental/language... but of course they might be wrong.
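The same CLDR data is exposed by JavaScript runtimes through `Intl.PluralRules`, so the categories can be inspected without reading the charts. A quick sketch; the sample counts are arbitrary, and the result depends on the CLDR version bundled with the runtime.

```typescript
const gaRules = new Intl.PluralRules("ga");
const enRules = new Intl.PluralRules("en");

// Pair each sample count with the plural category the rule set assigns to it.
function describeCategories(rules: Intl.PluralRules, counts: number[]): string[] {
  return counts.map((n) => `${n}:${rules.select(n)}`);
}

console.log("en", describeCategories(enRules, [1, 2, 3, 7, 11]));
console.log("ga", describeCategories(gaRules, [1, 2, 3, 7, 11]));
```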


Both of these only cover case changes for the object being counted: cases are pretty common across non-English languages so this is simple enough on its own.

The main difference I see in Breton is the rules for 1-9 follow through for double-digits n1-n9 - I would've suspected this to be true for Irish but I just speak it, don't study it, so confidence in my grammatical knowledge is low.

Irish definitely doesn't have the exceptions for 71-79 & 91-99; this smells like French influence.

The main thing missing from both of these though is modifications to the actual words for numerals: they kind of get away with it by sticking to digits, but I'm not sure if that's always the case in the translated output of i18n libraries referencing this. If they're ever outputting words in place of numerical digits then these are both very incomplete.

Irish then also has an entirely different way of counting persons which isn't included.


Unrelated but I had a weird experience navigating to this article on my iPhone: the music in my headset switched to call mode. I can reproduce that about 50% of the time.

Is there something using the microphone somewhere? Feels really weird…

14 Pro Max with latest beta software


I would give my little toe to see what this person's opinion is on CCS/CCMS (Component Content Systems)

https://en.wikipedia.org/wiki/Component_content_management_s...

There's quite a bit more to be written here about natural language, formal language, and how constituents of each class interact with each other. Stuff that the initial architects of "component content" were not necessarily thinking about, because they were coming at the problem from an extremely limited corpus.


Also, number and percent formats are important too. I've seen many 'professional' websites and pieces of software that use only a standard 100.0% format, where neither the decimal separator nor the percent sign is localised.
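
A rough sketch of what localising that looks like in the browser or Node, using the built-in `Intl.NumberFormat` with `style: "percent"`. The helper name `formatPercent` and the choice of one fraction digit are assumptions for the example.

```typescript
// Sketch: format a ratio (1 = 100%) as a percentage for a given locale.
// formatPercent is a hypothetical helper, not a library function.
function formatPercent(ratio: number, locale: string): string {
  return new Intl.NumberFormat(locale, {
    style: "percent",
    minimumFractionDigits: 1,
    maximumFractionDigits: 1,
  }).format(ratio);
}

console.log(formatPercent(1, "en-US")); // "100.0%"
// German uses a decimal comma and puts a space before the percent sign.
console.log(formatPercent(1, "de-DE"));
```

The point is that the separator, the sign and its placement all come from the locale data rather than from a hardcoded template.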


tangent:

did you know that "institutionalization" also resolves to the English numerical contraction: "i18n"?

here's a tool to test for conflicts in other words (a11y, k8s, etc.):

https://encapsulate.me/writing/e25n.html
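
For reference, the contraction itself is trivial to compute. A minimal sketch (the function name `numeronym` is made up here, and it has nothing to do with how the linked tool is implemented):

```typescript
// Sketch: first letter + number of letters in between + last letter.
// Counts UTF-16 code units, which is fine for plain ASCII words.
function numeronym(word: string): string {
  if (word.length < 3) return word; // nothing to contract
  return `${word[0]}${word.length - 2}${word[word.length - 1]}`;
}

console.log(numeronym("internationalization")); // "i18n"
console.log(numeronym("institutionalization")); // "i18n" as well
console.log(numeronym("accessibility")); // "a11y"
console.log(numeronym("kubernetes")); // "k8s"
```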


I hate these numerical contractions. I have no intuitive knowledge of how many characters words have even if I'm seeing them spelled out, let alone trying to do it in reverse.


l1l


i am on a 2 year long rabbit hole to solve many i18n problems that devs face https://github.com/inlang/inlang

we are in our third (major) refactor because the problem is so complex and new requirements emerge regularly :/


The post is interesting as it exposes the problem statement.

Unfortunately, I expected a Shopify Engineering blog post to provide solutions to this problem, like a JS library for i18n.

Disclaimer: As I'm not a frontend developer I'm not familiar with the ecosystem solutions.


The article shows solutions and they're library agnostic, you can use the approaches shown in pretty much any modern intl library.


Is there some kind of Auto-i18n where the function sends a request to a server if there is no localization available? The server could in turn request a translation from a service and add it to the localization files


This does exist; there are a few translation-as-a-service companies that offer more or less what you want. It's not exactly what you are describing, but I have a pipeline set up for an app I built that will extract and machine translate strings at build time. It would be pretty trivial to just do the extraction part and send off the files to whatever endpoint you want (be it a human or a machine translating in the end), if you wanted to "roll your own".


I believe the library mentioned in the article (i18next) provides the necessary hooks to do this. See ‘saveMissing’ option [1].

Though you need a backend to save this and manage translations. The business model of locize, the maintainer of that library, is to provide such a service.

[1] https://www.i18next.com/how-to/extracting-translations
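
To make the idea concrete without tying it to a library: a minimal sketch of a lookup that falls back to the source string and records which keys were missing, so they can be shipped off for translation later. This is not i18next's API; `createTranslator`, `t` and `drainMissing` are hypothetical names, and where the drained keys get sent is left out on purpose.

```typescript
type Catalog = Record<string, string>;

// Sketch: a translator that remembers every key it could not resolve.
function createTranslator(catalog: Catalog) {
  const missing = new Set<string>();
  return {
    // Return the translation, or the fallback while noting the missing key.
    t(key: string, fallback: string): string {
      const hit = catalog[key];
      if (hit !== undefined) return hit;
      missing.add(key);
      return fallback;
    },
    // Hand back the missing keys (sorted) and reset the list, e.g. to post
    // them to a translation backend of your choice.
    drainMissing(): string[] {
      const keys = Array.from(missing).sort();
      missing.clear();
      return keys;
    },
  };
}

const fr = createTranslator({ "cart.title": "Votre panier" });
console.log(fr.t("cart.title", "Your cart"));
console.log(fr.t("cart.empty", "Your cart is empty"));
console.log(fr.drainMissing());
```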


There’s a really nice babel plugin as well that can do the extraction.


I've found that I need to support localization from the very start.

I never display a quoted string. I always use Apple's tokenization (or create my own, if doing server code).

Apple has terrific support for localization, which puts the onus on us, to honor it. I have some basic extensions that I use to support localization in my coding[0-2], but there's also just stuff I need to keep in mind, all the time.

There has been discussion of how to deal with things like word order in different languages. For example, in Germanic languages, the modifier usually precedes the subject, while in Romance languages, it tends to be the opposite.

Thankfully, Apple supports the "$" format for sprintf strings[3], so we can do stuff like this:

    import Foundation

    let localizationAssets = [
        (format: "The %1$@ %2$@", modifier: "white", subject: "horse"),
        (format: "Le %2$@ %1$@", modifier: "blanc", subject: "cheval")
    ]

    func localizedHorse(_ inLocalization: Int) -> String {
        String(
            format: localizationAssets[inLocalization].format,
            localizationAssets[inLocalization].modifier,
            localizationAssets[inLocalization].subject
        )
    }

    // English (Prints "The white horse")
    print(localizedHorse(0))
    // French (Prints "Le cheval blanc")
    print(localizedHorse(1))

[0] https://github.com/RiftValleySoftware/RVS_Generic_Swift_Tool...

[1] https://github.com/RiftValleySoftware/RVS_Generic_Swift_Tool...

[2] https://github.com/RiftValleySoftware/RVS_Generic_Swift_Tool...

[3] https://developer.apple.com/library/archive/documentation/Co...


The code example should rather be an example of what not to do. This method of concatenating different parts of sentences would fall short for any language where the adjective and noun change the form depending on each other and the context (ex. cases). Most Slavic languages make heavy use of this concept. The solution is to just have the full sentence as a single string and let the translation team write it out in its entirety. Avoid interpolating nouns, adjectives, verbs etc. as much as possible.
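
As a rough illustration of the alternative (a sketch, with made-up key names and a hypothetical `phrase` helper): each phrase is stored whole per locale, so the translator decides article, word order and agreement instead of the code.

```typescript
type Locale = "en" | "fr";
type PhraseKey = "animals.whiteHorse" | "animals.whiteCow";

// Whole phrases per locale: nothing is assembled from adjective + noun parts.
const phrases: Record<Locale, Record<PhraseKey, string>> = {
  en: {
    "animals.whiteHorse": "The white horse",
    "animals.whiteCow": "The white cow",
  },
  fr: {
    // Article and adjective agree with the noun's gender, which a
    // "the {modifier} {subject}" template cannot express.
    "animals.whiteHorse": "Le cheval blanc",
    "animals.whiteCow": "La vache blanche",
  },
};

// Hypothetical lookup helper for illustration.
function phrase(locale: Locale, key: PhraseKey): string {
  return phrases[locale][key];
}

console.log(phrase("fr", "animals.whiteHorse"));
console.log(phrase("fr", "animals.whiteCow"));
```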


It was just a silly example.

Of course, you are correct, but we don't always have the luxury of being able to hand an entire string to a localization team, and some programmatic interpretation often needs to be done.

The best way to deal with it, is to design code that allows entire sentences to be handed to translators (I do that, most of the time).

The other advantage of writing localized content, is that it can be easily modified by non-engineers, like marketing folks. Talking points and corporate glossaries are important. Even if we don't plan to localize, it doesn't hurt to use localization tokenized content.


Oh, BTW. This is how not to do it:

    print("The white horse")

I would gently suggest taking the spirit of our Ts and Cs into account, when critiquing others.

I was simply sharing a quick technique that may not be known by some folks (the "$" in sprintf). It can definitely have uses.

I've been writing localized software for over thirty years. I worked for a Japanese corporation that localized into more than 20 languages (including RTL and up/down). Pretty much everything I write is localized up the yin-yang. Even my test harnesses are usually localized, as I like to keep the habit of writing localized software. I've written systems that have been adopted worldwide, in many different languages, and are still in use (and expanding), many, many years later.

There's a teensy little chance that I may have something valid to contribute to the topic.


> There's a teensy little chance that I may have something valid to contribute to the topic.

That may be true, but the criticism of GP still stands.

Yes, in some special situations your approach might be the only viable one (just as how in every codebase you sometimes have to cut corners), but GP is right in pointing out that this is a problematic approach, especially without qualification ("only use this if ...").


Yes, but don’t you think it might be little more diplomatic to say something like:

> While this is an interesting way to allow a translator to arrange the order of terms, there’s a great deal more involved, like … , so we should probably avoid using this technique, if at all possible. It’s always a much better idea to allow translators to work on an entire sentence or paragraph.

See? We get to show off our expertise, without throwing shade on others.

BTW: I have learned, the hard way, that cultural and human sensitivity, as well as basic respect and kindness, are important, when localizing.

But I guess those values are kinda “last century,” and don’t really have any value, in today’s hyper-competitive world.


I don't know, I personally value directness more than a "compliment sandwich". And I don't consider criticism of code to be criticism of one's person or expertise. It may be a cultural thing (I'm German).

But you'll have to take that up with the person you originally replied to.


Just remove the first sentence of that post and it's fine. It's even in the HN guidelines:

    When disagreeing, please reply to the argument instead of calling names. "That is idiotic; 1 + 1 is 2, not 3" can be shortened to "1 + 1 is 2, not 3."


Eh, you're the one making wild claims about how your uncle works for Nintendo and Sonic and Mario are totally gonna fight. Er... I mean how you have dozens of years of experience at internationalizationizing the cyber and everyone but you is $100% wrong.

You could make a lot more impact by explaining why we're wrong, but I guess that's a last-century concept.


Why should I, when you're making it clear that you're only interested in using it as an attack vector?

You have no interest at all in my experience, and that is clear, from your combative and insulting approach.

BTW: They are not "wild claims," as, literally, five minutes, browsing my extensive online presence, will show (try following one of the links I provided, just for a start). I have an open and aboveboard presence on the Internet, and don't hide behind anonymity. It helps me to stay honest and respectful.

During my career, I worked with hundreds of people that put my experience and technical prowess to shame.

There really are a lot of us around here, but we aren't interested in playing status games.

This is exactly what I mean.


Nobody cares who you are or claim to be. Anyone can pick any user name here. We care what you say, or, in this case, fail to.


If you have so much experience, maybe you can share why

    print("The white horse")

is such a horrible example of code that can't be internationalized.


Nah, I’m done here. Have a great day!


I don't see how this generalizes. If this is supposed to work for any "the {modifier} {subject}" phrase, it won't, because if you want to say "the white cow" you'll end up with "Le vache blanche," which is wrong.

You could store `(format: "%3$@ %1$@ %2$@", modifier: "blanc", subject: "cheval", article: "Le")` but now you've got so many gender associations to keep track of, and a phrase that will still only work as a subject (because in German, all of the words will change if the white horse is the object of the sentence).

EDIT: Oh you're just talking about sprintf specifically, I see what you mean. I agree with your other comment that it's ideal to pass entire sentences to the translators when possible, which is what I'm getting at here too.


  I name them by component.context.phrase

  There's https://cldr.unicode.org/index .

  In Angular I liked Transloco [0] very much.
  For Vue I use vue-i18n, I don't think there's any alternative.
  For Go I like go-i18n [1] when doing SSR Go.
  For Svelte.. not sure if there's a best package.

  [0] https://github.com/ngneat/transloco
  [1] https://github.com/nicksnyder/go-i18n
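
A small sketch of how component.context.phrase keys can map onto a nested message file. This assumes a plain nested object as the catalog; `MessageTree`, `lookup` and the sample keys are made up for the example, and real libraries have their own resolution rules.

```typescript
// A catalog is a tree whose leaves are translated strings.
interface MessageTree {
  [segment: string]: string | MessageTree;
}

// Sketch: walk the tree one dotted segment at a time.
// Returns undefined if the key is missing or stops at a branch, not a leaf.
function lookup(tree: MessageTree, key: string): string | undefined {
  let node: string | MessageTree | undefined = tree;
  for (const segment of key.split(".")) {
    if (node === undefined || typeof node === "string") return undefined;
    node = node[segment];
  }
  return typeof node === "string" ? node : undefined;
}

const messages: MessageTree = {
  checkout: {
    summary: { total: "Total", shipping: "Shipping" },
  },
};

console.log(lookup(messages, "checkout.summary.total"));
console.log(lookup(messages, "checkout.summary"));
```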



