
Both problems are missing the point: you cannot handle Unicode correctly without locale information (which needs to be carried alongside as metadata outside of the string itself).

To a Swede or a Finn, o and ö are different letters, as distinct as a and b (ö sorts at the very end of the alphabet). A search function that mixes them up would be very frustrating. On the other hand, to an American, a search function that doesn't find "coöperation" when you search for "cooperation" is also very frustrating. Back in Sweden, v and w are basically the same letter, especially when it comes to people's last names, and should probably be treated the same. Further south, if you try to lowercase an I and the text is in Turkish (or in certain other Turkic languages), you want a dotless i (ı), not a regular lowercase i. This is extremely spooky if you try to do case-insensitive equality comparisons and aren't paying attention, because if you do it wrong and end up with a regular lowercase i, you've lost information and uppercasing again will not restore the original string.

There are tons and tons of problems like this in European languages. The root cause is exactly the same as the Han unification gripes: Unicode without locale information is not enough to handle natural languages in the way users expect.
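To make the Turkish and Swedish cases above concrete, here is a minimal Python sketch. The helper names (`lower_tr`, `upper_tr`, `search_key`) and the tiny hand-written letter table are assumptions made up for illustration, not a real library API; production code would more likely lean on something like ICU. The v/w equivalence is left out to keep the sketch short.

```python
import unicodedata

# Hypothetical helpers for illustration only; the rules here are hand-written
# and cover just the cases discussed in the comment above.


def lower_tr(text: str) -> str:
    """Lowercase using the Turkish dotted/dotless i rules."""
    # Map the two special capitals first, then let Python handle the rest.
    return text.replace("I", "ı").replace("İ", "i").lower()


def upper_tr(text: str) -> str:
    """Uppercase using the Turkish dotted/dotless i rules."""
    return text.replace("i", "İ").replace("ı", "I").upper()


# Letters treated as distinct base letters in Swedish and Finnish.
NORDIC_LETTERS = frozenset("åäöÅÄÖ")


def search_key(text: str, lang: str) -> str:
    """Build a search comparison key; `lang` is a language code like "en" or "sv"."""
    keep = NORDIC_LETTERS if lang in ("sv", "fi") else frozenset()
    out = []
    # NFC first, so "o" + combining diaeresis is seen as a single "ö".
    for ch in unicodedata.normalize("NFC", text):
        if ch in keep:
            out.append(ch)
        else:
            # Decompose and drop combining marks: "ö" becomes "o".
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out).casefold()


# Locale-blind lowercasing turns dotless-I words into dotted ones.
print("DIŞ".lower())             # diş
print(lower_tr("DIŞ"))           # dış
print(upper_tr("DIŞ".lower()))   # DİŞ (the original string is not restored)

# The same pair of strings matches or not depending on the language.
print(search_key("coöperation", "en") == search_key("cooperation", "en"))  # True
print(search_key("öl", "sv") == search_key("ol", "sv"))                    # False
```

Note that every function needs the language passed in from outside: nothing in the string itself says which rule applies.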




> which needs to be carried alongside as metadata outside of the string itself

Why not as data tagged with the appropriate language?

https://www.unicode.org/faq/languagetagging.html


If you mean in-band language tagging inside the string itself, the page you're linking to points out that this is deprecated. The tag characters are now mostly used for emoji stuff. If you only need to be compatible with yourself you can of course do whatever you like, but otherwise, I agree with what the linked page says:

> Users who need to tag text with the language identity should be using standard markup mechanisms, such as those provided by HTML, XML, or other rich text mechanisms. In other contexts, such as databases or internet protocols, language should generally be indicated by appropriate data fields, rather than by embedded language tags or markup.


The interesting question is why you agree. The deprecation by itself doesn't tell us much, and the quote doesn't explain anything either: the "appropriate data fields" might not exist for mixed content, which is rather common, and why resort to the full ugliness of XML just for this?

(And the fact that emoji have pushed apps toward better Unicode support would be a plus for using a tag.)


Most applications do not do anything useful with in-band language tags. They never had widespread adoption in the first place and have been deprecated since 2008, so this is unsurprising. If you're using them in your strings and those strings might end up displayed by any code you don't control, you'll probably want to strip out the language tags to avoid any potential problems or unexpected behaviors. Out-of-band metadata doesn't have this problem.

As I said though, if you're in full control and only need to be compatible with yourself, you can do whatever you want.
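For the "strip out the language tags" step, a rough Python sketch might look like the following. It assumes the in-band scheme described on the linked page: U+E0001 (LANGUAGE TAG) followed by a run of tag characters from the U+E0020..U+E007F range. The function names are made up for illustration. It deliberately only removes runs that start with U+E0001, so emoji tag sequences (which use the same tag characters but never U+E0001) are left alone.

```python
import re

# U+E0001 LANGUAGE TAG followed by any run of tag characters
# (U+E0020..U+E007E) or CANCEL TAG (U+E007F).
LANGUAGE_TAG_RUN = re.compile("\U000E0001[\U000E0020-\U000E007F]*")


def strip_language_tags(text: str) -> str:
    """Remove in-band language tag runs introduced by U+E0001."""
    return LANGUAGE_TAG_RUN.sub("", text)


def as_tag_chars(ascii_text: str) -> str:
    """Map ASCII characters to their tag-character counterparts (test helper)."""
    return "".join(chr(0xE0000 + ord(c)) for c in ascii_text)


# A string carrying an in-band "sv" language tag.
tagged = "\U000E0001" + as_tag_chars("sv") + "hej"
print(strip_language_tags(tagged))  # hej

# An emoji tag sequence (black flag + tag characters + cancel tag) is untouched.
flag = "\U0001F3F4" + as_tag_chars("gbeng") + "\U000E007F"
print(strip_language_tags(flag) == flag)  # True
```

A stray CANCEL TAG without a preceding U+E0001 is not handled here; removing it blindly would break the emoji sequences.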


In 2008, UTF-8 was only ~20% of all web pages! Again, that deprecation fact is not meaningful. A quick search shows that the RFC for tagging is dated 1999, so that's less than 10 years before deprecation, which is a tiny timeframe for such things. So I agree, it's not surprising there was no widespread use.

Out-of-band metadata has plenty of other problems besides the fact that it doesn't exist in a lot of cases.


> a search function that doesn't find "coöperation" when you search for "cooperation" is also very frustrating.

Look, we can just disregard The New Yorker entirely and the UX will improve.


Exactly! Thank you for giving a good explanation of why this whole post is founded on a fundamental misunderstanding.



