The ü/ü Conundrum

re · 2024-03-24T19:25:37 1711308337

> Can you spot any difference between “blöb” and “blöb”?

It's tricky to try to determine this because normalization can end up getting applied unexpectedly (for instance, on Mac, Firefox appears to normalize copied text as NFC while Chrome does not), but by downloading the page with cURL and checking the raw bytes I can confirm that there is no difference between those two words :) Something in the author's editing or publishing pipeline is applying normalization and not giving her the end result that she was going for.

  00009000: 0a3c 7020 6964 3d22 3066 3939 223e 4361  .<p id="0f99">Ca
  00009010: 6e20 796f 7520 7370 6f74 2061 6e79 2064  n you spot any d
  00009020: 6966 6665 7265 6e63 6520 6265 7477 6565  ifference betwee
  00009030: 6e20 e280 9c62 6cc3 b662 e280 9d20 616e  n ...bl..b... an
  00009040: 6420 e280 9c62 6cc3 b662 e280 9d3f 3c2f  d ...bl..b...?</

Let's see if I can get HN to preserve the different forms:

Composed: ü Decomposed: ü

Edit: Looks like that worked!

mgaunard · 2024-03-24T20:14:17 1711311257

I believe XML and HTML both require Unicode data to be in NFC.

fanf2 · 2024-03-24T20:28:20 1711312100

I don’t think so?

https://www.w3.org/TR/2008/REC-xml-20081126/#charsets

XML 1.1 says documents should be normalized but they are still well-formed even if not normalized

https://www.w3.org/TR/2006/REC-xml11-20060816/#sec-normaliza...

But you should not use XML 1.1

https://www.ibiblio.org/xml/books/effectivexml/chapters/03.h...

mbrubeck · 2024-03-24T20:39:51 1711312791

HTML does not require NFC (or any other specific normalization form):

https://www.w3.org/International/questions/qa-html-css-norma...

Neither does XML (though it XML 1.0 recommends that element names SHOULD be in NFC and XML 1.1 recommends that documents SHOULD be fully normalized):

https://www.w3.org/TR/2008/REC-xml-20081126/#sec-suggested-n...

https://www.w3.org/TR/xml11/#sec-normalization-checking

layer8 · 2024-03-24T20:36:53 1711312613

You believe incorrectly. Not even Canonical XML requires normalization: https://www.w3.org/TR/xml-c14n/#NoCharModelNorm

Eisenstein · 2024-03-24T19:40:39 1711309239

Perhaps the author used the same character twice for effect, not suspecting someone would use curl to examine the raw bytes?

mglz · 2024-03-24T21:21:45 1711315305

My last name contains an ü and it has been consistenly horrible.

* When I try to preemptively replace ü with ue many institutions and companies refuse to accept it because it does not match my passport

* Especially in France, clerks try to emulate ü with the diacritics used for the trema e, ë. This makes it virtually impossible to find me in a system again

* Sometimes I can enter my name as-is and there seems to be no problem, only for some other system to mangle it to � or or a box. This often triggers error downstream I have no way of fixing

* Sometimes, people print a u and add the diacritics by hand on the label. This is nice, but still somehow wrong.

I wonder what the solution is. Give up and ask people to consistenly use a ascii-only name? Allow everybody 1000+ unicode characters as a name and go off that string? Officially change my name?

makeitdouble · 2024-03-24T21:43:01 1711316581

The part I came to love about France in general is that while all of these are broken, the people dealing with it will completely agree it's broken and amply sympathize, but just accept your name is printed as G�nter.

Same for names that don't fit field lengths, addresses that require street numbers etc. It's a real pain to deal with all of it and each system will fail in its own way to make your life a mess, but people will embrace the mess and won't blink an eye when you bring paper that just don't match.

zokier · 2024-03-24T22:06:06 1711317966

Under GDPR people have the right to have their personal data to be accurate, there was a legal case exactly about this: https://news.ycombinator.com/item?id=38009963

makeitdouble · 2024-03-24T22:24:57 1711319097

That's a pretty unexpected twist, and I'm frilled with it.

I don't see every institution come up with a fix anytime soon, but having it clear that they're breaking the law is such a huge step. That will also have a huge impact on bank system development, and I wonder how they'll do it (extend the current system to have the customer facing bits rewritten, or just redo it all from top to bottom)

There is the tale of Mizuho bank [0], botching their system upgrade project so hard they were still seeing widespread failures after a decade into it.

[0] https://www.japantimes.co.jp/news/2022/02/11/business/mizuho...

ryandrake · 2024-03-24T23:06:53 1711321613

> I don't see every institution come up with a fix anytime soon, but having it clear that they're breaking the law is such a huge step.

It's excellent, but also sad that it takes legislation to motivate companies to fix their crappy legacy systems, and they will likely fight tooth and nail rather than comply.

rurban · 2024-03-25T06:30:27 1711348227

So it's time to finally ditch the POSIX string libc, and adopt u8 as universal string type. Which can finally find normalized.

All the coreutils still can not find strings, just buffers. Zero terminated buffers are NOT strings, strings are unicode.

https://perl11.github.io/blog/foldcase.html

This is not just convenience, it also has spoofing security implications for all names. C and C++11 are insecure since C11. https://github.com/rurban/libu8ident/blob/master/doc/c11.md Most other programming languages and OS kernels also.

actionfromafar · 2024-03-25T16:40:26 1711384826

Or ve kan fainali hav orxogrefkl riform!

jojobas · 2024-03-24T23:35:30 1711323330

> Does it mean Z̸̰̈́̓a̸͖̰͗́l̸̻͊g̸͙͂͝ǒ̷̬́̐ can finally have a bank account?

I wonder if this also means one can require a European bank have a name on file in Kanju, Thai script or some other not-so-well-known in Europe alphabet.

makeitdouble · 2024-03-25T01:38:53 1711330733

A bank can specially request it to be the name on a passport or domestic ID card. That's one way to make sure that the name falls within some parameters, though that can be tough on the customer in some conditions.

jojobas · 2024-03-25T02:15:31 1711332931

I guess every country has a technical document on what's allowed in names, but then say EU banks have to cater for full superset of EU rules.

As far as the passports go, ICAO 9303-3 allows for latin characters, additional latin characters, such as Þ and ß, and "diacritics", so something not too crazy, i.e. Z̷̪͘a̵͈͘l̷̹̃g̷̣̈́ő̶͍ would still be plausible.

pastage · 2024-03-25T14:54:46 1711378486

Since work on central ID in Europe moves slowly banks will only need to bother with local name rules atm since only local names are valid. I am guessing we will have normalization rules in the end and that looks completely unplausible.

badcppdev · 2024-03-25T16:21:58 1711383718

They might get the name to fit in that field but what are you going to do about date of birth??

Faaak · 2024-03-25T16:32:42 1711384362

Ahah, I can relate to that. My driving license doesn't spell my name correctly, and somehow nobody cares. I somehow like this "nah, who cares" attitude

zokier · 2024-03-24T22:24:48 1711319088

> * Especially in France, clerks try to emulate ü with the diacritics used for the trema e, ë. This makes it virtually impossible to find me in a system again

In Unicode umlaut and diaeresis are both represented by same codepoint, U+0308 COMBINING DIAERESIS.

https://en.wikipedia.org/wiki/Umlaut_(diacritic)

samatman · 2024-03-24T22:20:26 1711318826

The only solution is going to be a lot of patience, unfortunately.

Everyone should be storing strings as UTF-8, and any time strings are being compared they should undergo some form of normalization. Doesn't matter which, as long as it's consistent. There's no reason to store string data in any other format, and any comparison code which isn't normalizing is buggy.

But thanks to institutional inertia, it will be a very long time before everything works that way.

lmm · 2024-03-24T22:26:15 1711319175

> Everyone should be storing strings as UTF-8, and any time strings are being compared they should undergo some form of normalization. Doesn't matter which, as long as it's consistent. There's no reason to store string data in any other format, and any comparison code which isn't normalizing is buggy.

This will result in misprinting Japanese names (or misprinting Chinese names depending on the rest of your system).

earthboundkid · 2024-03-24T22:34:57 1711319697

Can we please talk about Unicode without the myth of Han Unification being bad somehow? The problem here is exactly the lack of unification in Roman alphabets!

lmm · 2024-03-24T23:06:25 1711321585

> Can we please talk about Unicode without the myth of Han Unification being bad somehow?

It's not a myth, as anyone living in Japan knows, and the "just use Unicode, all you need is Unicode" dogma is really harmful; a lot of "international" software has become significantly worse for Japanese users since it took hold.

> The problem here is exactly the lack of unification in Roman alphabets!

Problems caused by failing to unify characters that look the same do not mean it was a good idea to unify characters that look different!

jhanschoo · 2024-03-25T06:04:09 1711346649

> "just use Unicode, all you need is Unicode" dogma is really harmful; a lot of "international" software has become significantly worse for Japanese users since it took hold.

The alternative would be that the software used Shift_JIS with a Japanese font. If the software used a Japanese font for Japanese it wouldn't need metadata anyway.

There really isn't a problem with Han unification as long as you always switch to a font appropriate for your language; you don't need to configure metadata. If you don't you are always going to run into missing codepoint problems.

In cases where the system or user configures the font, properly using Unicode is still easier than configuring alternate encodings for multiple languages.

lmm · 2024-03-25T06:18:45 1711347525

> The alternative would be that the software used Shift_JIS with a Japanese font.

As far as I know all Shift_JIS fonts are Japanese; you would have to be wilfully perverse to make one that wasn't.

> If the software used a Japanese font for Japanese it wouldn't need metadata anyway.

If it just uses the system default font for that encoding, as almost all software does, then it will also behave correctly.

> There really isn't a problem with Han unification as long as you always switch to a font appropriate for your language

Right. But approximately no software does that, because if you don't do it then your software will work fine everywhere other than Japan, and even in Japan it will kind-of-sort-of work to the point that a non-native probably won't notice a problem.

> In cases where the system or user configures the font, properly using Unicode is still easier than configuring alternate encodings for multiple languages.

I'm not convinced it is. Configuring your software to use the right font on a Unicode system is, as far as I can see, at least as hard as configuring your software to use the right encoding on a non-Unicode system. It just fails less obviously when you don't, particularly outside Japan.

jhanschoo · 2024-03-27T04:04:57 1711512297

> Right. But approximately no software does that, because if you don't do it then your software will work fine everywhere other than Japan, and even in Japan it will kind-of-sort-of work to the point that a non-native probably won't notice a problem.

Most games that I know of that target CJK + English (and are either CJK-developed, or have a local publisher based in East Asia) do indeed switch fonts depending on language (and on TC vs. SC).

> I'm not convinced it is. Configuring your software to use the right font on a Unicode system is, as far as I can see, at least as hard as configuring your software to use the right encoding on a non-Unicode system. It just fails less obviously when you don't, particularly outside Japan.

I'm considering 3 scenarios:

1. You are configuring for the Japanese-speaking market. In which case, fix a font, or fonts.

2. You are localizing into multiple languages and care about localization quality. In which case, yes, you need to know that localization in Unicode is more than just replacing content strings, but this is comparable to dealing with multiple encodings.

3. You are localizing into multiple languages and do not care about localization quality, or Japanese is not a localization target. In which case Japanese (user input / replaced strings) in your app / website will appear childish and shoddy, but it is still a better experience than mojibake.

In any case, it seems to me that it is not a worse experience than pre-Unicode. It's just that people who have no experience in localization expect Unicode systems to do things it cannot do by just replacing strings. You indeed frequently run into issues even in European languages if you just think it's a matter of replacing strings.

GoblinSlayer · 2024-03-25T13:29:57 1711373397

Japanese programs aren't globalized and already rely on the system being fine tuned for Japanese, so default font is already correct.

lmm · 2024-03-25T22:37:19 1711406239

> Japanese programs aren't globalized and already rely on the system being fine tuned for Japanese

Right, because unicode-based systems don't work well in Japan. E.g. a unicode-based application framework that ships its own font and expects to use it will display ok everywhere that's not Japan. So Japan is increasingly cut off from the paradigms that the rest of the world is using.

GoblinSlayer · 2024-03-26T08:26:02 1711441562

Custom fonts are often a mistake for any language, especially google fonts often look wrong. Due to this browsers often have an option to force usage of system fonts and set minimum size to improve readability.

lmm · 2024-03-26T21:55:01 1711490101

> Custom fonts are often a mistake for any language, especially google fonts often look wrong.

Be that as it may, the overwhelming majority of unicode fonts are dramatically wrong for Japanese and not dramatically wrong for other languages.

> Due to this browsers often have an option to force usage of system fonts and set minimum size to improve readability.

Such options are shrinking IME. E.g. Electron is built on browser internals, but does it offer that option?

eviks · 2024-03-25T03:25:55 1711337155

Would it still be harmful if language tag were used?

lmm · 2024-03-25T03:37:59 1711337879

If the tag mechanism was used consistently and handled by all software, no. But in practice the only way that would happen is if the tag mechanism was required for many languages. Unicode is, in practice, a system that works the same way for ~every human language except Japanese, which makes it much worse than the previous "byte stream + encoding" system where any program written to support anything more than just US English would naturally work correctly for every other language, including Japanese.

renhanxue · 2024-03-25T12:51:47 1711371107

> Unicode is, in practice, a system that works the same way for ~every human language except Japanese

This is simply not true. As I've pointed out in a sibling comment, Unicode has a lot of surprising and frustrating behaviors with many European languages as well if you use it without locale data. The characters will look right, but e.g. searching, sorting and case-insensitive comparisons will not work as expected if the application is not locale aware.

lmm · 2024-03-25T22:40:15 1711406415

> The characters will look right, but e.g. searching, sorting and case-insensitive comparisons will not work as expected

This is quite a different situation from Japan. A lot of applications don't do searching, sorting, or case-insensitive comparisons, but virtually every application displays text.

earthboundkid · 2024-03-29T21:41:23 1711748483

> It's not a myth, as anyone living in Japan knows

I lived in Japan. It is a myth. :-¥

renhanxue · 2024-03-24T23:18:01 1711322281

Both problems are missing the point: you cannot handle Unicode correctly without locale information (which needs to be carried alongside as metadata outside of the string itself).

To a Swede or a Finn, o and ö are different letters, as distinct as a and b (ö sorts at the very end at the alphabet). A search function that mixes them up would be very frustrating. On the other hand, to an American, a search function that doesn't find "coöperation" when you search for "cooperation" is also very frustrating. Back in Sweden, v and w are basically the same letter, especially when it comes to people's last names, and should probably be treated the same. Further south, if you try to lowercase an I and the text is in Turkish (or in certain other Turkic languages), you want a dotless i (ı), not a regular lowercase i. This is extremely spooky if you try to do case insensitive equality comparisons and aren't paying attention, because if you do it wrong and end up with a regular lowercase i, you've lost information and uppercasing again will not restore the original string.

There are tons and tons of problems like this in European languages. The root cause is exactly the same as the Han unification gripes: Unicode without locale information is not enough to handle natural languages in the way users expect.

eviks · 2024-03-25T03:23:56 1711337036

> which needs to be carried alongside as metadata outside of the string itself

Why not as data tagged with the appropriate language?

https://www.unicode.org/faq/languagetagging.html

renhanxue · 2024-03-25T12:45:44 1711370744

If you mean in-band language tagging inside the string itself, the page you're linking to points out that this is deprecated. The tag characters are now mostly used for emoji stuff. If you only need to be compatible with yourself you can of course do whatever you like, but otherwise, I agree with what the linked page says:

> Users who need to tag text with the language identity should be using standard markup mechanisms, such as those provided by HTML, XML, or other rich text mechanisms. In other contexts, such as databases or internet protocols, language should generally be indicated by appropriate data fields, rather than by embedded language tags or markup.

eviks · 2024-03-25T14:39:46 1711377586

The interesting question is why you agree, the deprecation fact isn't telling much, the quote also doesn't explain anything, like, the "appropriate data fields" might not exist for mixed content, a rather common thing, and why resort to the full ugliness of XML just for this?

(and that emojis have had their positive impact in forcing apps into better Unicode support would be a + for the use of a tag)

renhanxue · 2024-03-25T16:50:10 1711385410

Most applications do not do anything useful with in-band language tags. They never had widespread adoption in the first place and have been deprecated since 2008, so this is unsurprising. If you're using them in your strings and those strings might end up displayed by any code you don't control, you'll probably want to strip out the language tags to avoid any potential problems or unexpected behaviors. Out-of-band metadata doesn't have this problem.

As I said though, if you're in full control and only need to be compatible with yourself, you can do whatever you want.

eviks · 2024-03-25T17:28:22 1711387702

in 2008 uft-8 was only ~20% of all web pages! Again, that deprecation fact is not meaningful, a quick search shows that rfc for tagging is dated 1999, so that's just 10 years before deprecation, that's a tiny timeframe for such things, so I agree, it's not surprising there was no widespread use.

Out-of-band metadata has plenty of other problems besides the fact that it doesn't exist in a lot of cases

hyperdimension · 2024-03-26T00:15:37 1711412137

> a search function that doesn't find "coöperation" when you search for "cooperation" is also very frustrating.

Look, we can just disregard The New Yorker entirely and the UX will improve.

earthboundkid · 2024-03-29T21:43:33 1711748613

Exactly! Thank you for giving a good explanation of why this whole post is founded on a fundamental misunderstanding.

RedNifre · 2024-03-24T22:37:10 1711319830

lmm · 2024-03-24T23:03:41 1711321421

Unicode reuses codepoints for characters that the committee decided were in some sense "the same", including Japanese and Chinese characters that are written differently from each other (different numbers of strokes etc.). This is a minor irritation for everyday text, but can be quite upsetting when it's someone's name that's getting printed wrong.

bouke · 2024-03-26T17:54:31 1711475671

No system will get support for unicode by just the passing of time. Software needs to be upgraded/replaced for that to happen. Reluctant institutions will not just do that, and need external pressure.

zokier · 2024-03-24T22:01:52 1711317712

Germans have of course a standard for this

> a normative subset of Unicode Latin characters, sequences of base characters and diacritic signs, and special characters for use in names of persons, legal entities, products, addresses etc

https://en.wikipedia.org/wiki/DIN_91379

em-bee · 2024-03-24T22:36:30 1711319790

and it's used in the passport too. so names with umlaut show up in both forms and it is possible to match either form

Chilko · 2024-03-24T23:28:34 1711322914

> Officially change my name?

My German last name also contains an ü, so when we emigrated to an English-speaking country and obtained dual-citizenship we used 'ue' for that passport and I now use 'ue' on a day-to-day basis. This also means I have two slightly different legal surnames depending by which passport I go.

hodgesrm · 2024-03-24T23:49:17 1711324157

At least German transliteration is 1-to-1. Slavic names among others often have multiple transliterations available. The Russian name Валерий can be rendered for example as Valery, Valeriy, or Valeri. It's very confusing for documents that require the person's name.

[0] https://en.wikipedia.org/wiki/Wikipedia:Romanization_of_Russ...

krab · 2024-03-25T05:27:06 1711344426

That's the English transliteration. Don't forget that other Slavic languages also transcribe according to their own rules.

For example in Czech, Валерий would be transliterated as Valerij because "j" is pronounced in Czech as English "y" in "you".

bobthepanda · 2024-03-25T02:48:01 1711334881

Also don't forget Chinese, which due to different romanizations or different dialects being used for the romanization, can result in different outputs depending on whether a person is from PRC, ROC, Macao, Hong Kong, or Singapore.

koliber · 2024-03-26T10:30:27 1711449027

Transliteration is a two way street. Non-Russian names get transliterated into Cyrillic inconsistently as well.

illiac786 · 2024-03-25T06:35:37 1711348537

There's an ISO standard for this. Can't find it but I am 100% sure for russian for example.

fsckboy · 2024-03-25T02:52:06 1711335126

just out of curiosity, can you port the ue back to Germany (or wherever) or will they automatically transform it to ü? (could you change your name in a German speaking country to Mueller et al?)

wongarsu · 2024-03-25T14:30:20 1711377020

In Germany, there are some names that use ue, ae or oe instead of ü, ä, ö, and you run into issues with some systems wrongly autocorrecting it to the umlaut. Usually not a big deal, but having the umlaut is less error prone than the transliteration in Germany.

Scharkenberg · 2024-03-25T15:55:56 1711382156

The most famous German poet is (probs) Goethe. Still written with oe to this day.

sib · 2024-03-25T16:12:15 1711383135

There are old houses in the US that have bronze placards on them that say, "George Washington slept here."

Goethe is so famous that in Heidelberg, Germany, there is a building with a placard that says, "Goethe almost slept here."

It was an inn and he was supposed to spend the night but was unable to.

lmm · 2024-03-24T22:32:12 1711319532

> Give up and ask people to consistenly use a ascii-only name?

> Officially change my name?

Yes. That's the only one that's going to actually work. You can go on about how these systems ought to work until until the cows come home, and I'm sure plenty of people on HN will, but if you actually want to get on with your life and avoid problems, legally change your name to one that's short and ascii-only.

em-bee · 2024-03-25T10:15:30 1711361730

a friend of mine in china had a character in his name that was not in the recognized set of characters. he refused to change his name and instead submitted the character to be added to unicode (which i believe eventually happened)

in the meantime he was unable to own the company he founded (instead made his wife the owner), had a national ID card with a different character, and i am not sure if he had a bank account, but i think the bank didn't care because laws that enforced the names to match the passport/ID only came later. i don't know how the ID didn't automatically imply a name change, but the IDs were issued automatically and maybe he filed a complaint about his name being wrong.

mavhc · 2024-03-25T16:19:46 1711383586

https://en.wikipedia.org/wiki/Spell_My_Name_with_an_S

aden1ne · 2024-03-25T15:45:50 1711381550

Names changes are only permitted in a very narrow set of conditions in my place of residence. And this would not be one of them. And I imagine that's the case in many nations.

taejo · 2024-03-25T08:29:45 1711355385

And then never move to Japan (or any other country where names are expected not to have Latin letters in them)

lmm · 2024-03-25T11:15:00 1711365300

Or rather, if you move countries, change your name to one that fits. It's pretty normal and really not that hard.

taejo · 2024-03-25T15:16:26 1711379786

Interestingly, it seems that Japan does have a procedure for foreigners to officially adopt a Japanese name. Changing your name is often very hard, and doing it in a country where you're not a citizen might be completely impossible, depending on the country.

lmm · 2024-03-25T22:35:30 1711406130

> Japan does have a procedure for foreigners to officially adopt a Japanese name.

Sort of but not really. The post-2012 residence cards do not display a registered alias anywhere, and since those cards are what banks are required to KYC you on, a lot of banks won't allow you to use a registered alias which in turn means it's hard to use it for anything else (credit cards, phone, pension...). It's very non-joined-up government.

enderstenders · 2024-03-25T15:38:50 1711381130

We clearly need to phase out name-based identification within software. "What's your name?" should never be a question heard from workers as any means of locating one's official identity in any system.

Some form of biometrics to pull up an ID in a globally agreed-upon system is certainly the way forward. Whether or not it is close to what a final solution should be, World ID is making some effort into solving global identification problems https://worldcoin.org/world-id

lupire · 2024-03-25T16:05:54 1711382754

"global identification" and "final solution" dot sit well together.

GardenLetter27 · 2024-03-25T16:24:57 1711383897

Or just standardise the alphabet...

_v7gu · 2024-03-24T22:27:22 1711319242

Can ü be printed on a passport rather than a u? I have a ş and a ç so I have been successfully substituting s and c for them in a somewhat consistent manner.

mschuster91 · 2024-03-24T23:33:42 1711323222

On the human-readable zone ("VIZ" in ICAO 9303) yes, see part 3 section 3.1 [1]. The MRZ however, not - it is limited to Latin alphanumeric only, see section 4.3. How to transliterate non-Latin characters is left to the discretion of the issuing government, and that has been a consistent source of annoyances for people who have identity cards issued by different governments (e.g. dual-nationals of Western European and Turkish, Arabic or Cyrillic-using Slavic countries).

[1] https://www.icao.int/publications/Documents/9303_p3_cons_en....

Tabular-Iceberg · 2024-03-25T16:37:41 1711384661

What’s the difference between the ë and ü diacritics? I would assume, like the French, that the two are interchangeable.

littlestymaar · 2024-03-25T16:56:28 1711385788

See this post [1] somewhere else in the comments.

[1]: https://news.ycombinator.com/item?id=39818435

gsich · 2024-03-25T17:14:05 1711386845

Passports have an entry like "corresponds to ..." for that.

benhurmarcel · 2024-03-25T09:11:20 1711357880

When my child was born, one of the requirements I had to choose his name was that it shouldn't have any accent (or character that's not in the 26 universal letters basically).

em-bee · 2024-03-25T10:02:47 1711360967

who made this requirement? in which country?

d1sxeyes · 2024-03-25T10:26:33 1711362393

Based on OP's comment history, he's Belgian or lives in Belgium. Seems that there's no such requirement in Belgium (https://be.brussels/en/identity-nationality/children/birth-f...) and in many countries I know that ü is explicitly allowed.

Potentially OP is talking about a set of requirements he imposed on himself?

Edit: or maybe France? Either way, it's free choice still theoretically. https://en.wikipedia.org/wiki/Naming_law#:~:text=Since%20199....

benhurmarcel · 2024-03-25T17:13:46 1711386826

Sorry for the confusion, it’s just a requirement I had for myself, to make my child’s life a little easier

ale42 · 2024-03-25T14:45:56 1711377956

Isn't it the OP him/herself? Maybe they just wanted to prevent the issue for their child...

em-bee · 2024-03-25T15:17:45 1711379865

ah, possibly. the way it is worded i didn't read it that way. but i get it.

we did something comparable to make sure our kids had names that transliterated nicely into chinese so that they could use the same or at least a similar name in english and chinese, instead of having two names like it is common for many expats and locals in china.

userbinator · 2024-03-24T21:41:16 1711316476

Everyone's name should just be a GUID. /s

BuyMyBitcoins · 2024-03-24T22:28:17 1711319297

Falsehoods Programmers Believe About Names, #41 - People have GUIDs.

https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-...

weinzierl · 2024-03-25T16:40:14 1711384814

This article is about a failure to do normalization properly and is not really about an issue with Unicode. Regardless what some comments seem to allude to, an Umlaut-ü should always render exactly the same, no matter how it is encoded.

There is, however, a real ü/ü conundrum, regarding ü-Umlaut and ü-diaeresis. The ü's in the words Müll and aigüe should render differently. The dots in the French word are too close to the letter. In printed French material this is usually not the case.

Unfortunately Unicode does not capture the nuance of the semantic difference between an Umlaut and a Tréma or Diaresis.

The Umlaut is a letter in its own right with its own space in the alphabet. An ü-Umlaut can never be replaced by an u alone. This would be just as wrong as replacing a p by a q. Just because they look similar does not mean they are interchangeable. [1]

The Tréma on the other hand, is a modifier that helps with proper pronunciation of letter combinations. It is not a letter in its own right, just additional information. It can sometimes move over other adjacent letters (aiguë=aigüe, both are possible) too.

Some say this should be handled by the rendering system similar to Han-Unification, but I strongly disagree with this. French words are often used in German and vice versa. Currently there is no way to render a German loan word with Umlaut (e.g. führer) properly in French.

[1] The only acceptable replacement for ü-Umlaut is the combination ue.

noodlesUK · 2024-03-24T19:48:45 1711309725

One thing that is very unintuitive with normalization is that MacOS is much more aggressive with normalizing Unicode than Windows or Linux distros. Even if you copy and paste non-normalized text into a text box in safari on Mac, it will be normalized before it gets posted to the server. This leads to strange issues with string matching.

ttepasse · 2024-03-24T20:14:23 1711311263

Unfun normalisation fact: You can’t have a file named "ss" and a file named "ß" in the same folder in Mac OS.

nradov · 2024-03-25T01:06:51 1711328811

There are people with the surname "Con" and it's impossible to create a file with that name in MS Windows.

https://learn.microsoft.com/en-us/windows/win32/fileio/namin...

bawolff · 2024-03-24T22:16:49 1711318609

That's less a normal form issue and more a case-insensitivity issue. You also can't have a file named "a" and one named "A" in the same folder.

samatman · 2024-03-24T22:40:29 1711320029

That would be true if the test strings were "SS" and "ß", because although "ẞ" is a valid capitalization of "ß", it's officially a newcomer. It's more of a hybrid issue: it appears that APFS uses uppercasing for case-insensitive comparison, and also uppercases "ß" to "SS", not "ẞ". This is the default casing, Unicode also defines a "tailored casing" which doesn't have this property.

So it isn't per se normalization, but it's not not normalization either. In any case (heh) it's a weird thing that probably shouldn't happen. Worth noting that APFS doesn't normalize file names, but normalization happens higher up in the toolchain, this has made some things better and others worse.

FinnKuhn · 2024-03-25T15:32:20 1711380740

That would only explain why "ß" and "ẞ" can't both be files in the same folder. "ß" and "ss" are different letter just like "u" and "ue" for example.

eropple · 2024-03-24T22:15:58 1711318558

This shows up in other places, too. One of my Slacks has a textji of `groß`, because I enjoy making our German speakers' teeth grind, but you sure can just type `:gross:` to get it.

thaumasiotes · 2024-03-25T02:53:49 1711335229

> a textji

This is a weird formation; "ji" means text. It's half of the half of "emoji" that means text: 絵文字, 絵 [e, "picture"] 文字 [moji, "character", from 文 "text" + 字 "character"].

https://satwcomic.com/half-human-half-scandinavian

adrianmonk · 2024-03-25T16:36:46 1711384606

It's weird, but it's also how language evolves sometimes. Once used in a certain way, words or parts of words take on that meaning.

For example, there's an apartment and office building complex on a site near a historic canal and dam. The building development was named after this site. Then in one of the apartments (CORRECTION: offices), a scandalous political event happened. The complex was called Watergate, the scandal was called Watergate too, and now the suffix -gate is used for scandals.

dragonwriter · 2024-03-25T16:43:39 1711385019

> Then in one of the apartments, a scandalous political event happened.

It was one of the offices, not one of the apartments (specifically, it was series of break-ins to and the wiretapping of the headquarters of the Democratic National Committee by people working for President Nixon’s re-election committee.)

adrianmonk · 2024-03-25T16:50:23 1711385423

Oops! I double-checked that detail, made a mental note to say it was an office, and then typed apartment anyway.

eropple · 2024-03-25T03:19:56 1711336796

Yes, but "reactji" is also weird and yet people use it for Slack reactions. It's fine.

yxhuvud · 2024-03-24T21:05:41 1711314341

So what happens if someone puts those two in a git repo and a Mac user checks out the folder?

staplung · 2024-03-24T21:25:41 1711315541

  git clone https://github.com/ghurley/encodingtest
  Cloning into 'encodingtest'...
  remote: Enumerating objects: 9, done.
  remote: Counting objects: 100% (9/9), done.
  remote: Compressing objects: 100% (5/5), done.
  remote: Total 9 (delta 1), reused 0 (delta 0), pack-reused 0
  Receiving objects: 100% (9/9), done.
  Resolving deltas: 100% (1/1), done.
  warning: the following paths have collided (e.g. case-sensitive paths
  on a case-insensitive filesystem) and only one from the same
  colliding group is in the working tree:

  'ss'
  'ß'

Twisol · 2024-03-24T21:41:52 1711316512

I have this issue on occasion with older mixed C/C++ codebases that use `.c` for C files and `.C` for C++ files. Maddening.

Athas · 2024-03-24T22:14:55 1711318495

I never understood the popularity of the '.C' extension for C++ files. I have my own preference (.cpp), but it's essentially arbitrary compared to most other common alternatives (.cxx, .c++). The '.C' extension is the only one that just seems worse (this case sensitivity issue, and just general confusion given how similar '.c' looks to '.C').

But even more than that, I just don't get how C++ turns into 'C' at all. It seems actively misleading.

keketi · 2024-03-24T23:19:28 1711322368

C++

is Incremented C

which is Big C

which is Capital C

Athas · 2024-03-24T23:26:15 1711322775

But C is already capital C! Even .d would have been a better extension.

imtringued · 2024-03-25T07:43:21 1711352601

He is clearly taking about the capital version of capital C.

basil-rash · 2024-03-25T00:17:34 1711325854

You can always reformat as APFS (Case Sensitive)

tzs · 2024-03-25T16:35:09 1711384509

I remember seeing quite a few things in the old days that would have both 'makefile' and 'Makefile'.

tetromino_ · 2024-03-24T21:30:30 1711315830

EEXIST

codesnik · 2024-03-24T20:01:23 1711310483

I was really surprised when realized that at least in hpfs cyrillics is normalized too. For example, no russian ever thinks that Й is a И with some diacritics. It's a different letter on it's own right. But mac normalizes it into two codepoints.

asveikau · 2024-03-24T20:34:38 1711312478

I dislike explaining string compares to monolingual English speakers who are programmers. Similar to this phenomenon of Й/И is people who think ñ and n should compare equally, or ç and c, or that the lowercase of I is always i (or that case conversion is locale-independent).

In something like a code review, people will think you're insane for pointing out that this type of assumption might not hold. Actually, come to think of it, explaining localization bugs at all is a tough task in general.

yxhuvud · 2024-03-24T21:12:30 1711314750

Or that sort order is locale independent. Swedish is a good example here as åäö are sorted at the end, and where until 2006 w was sorted as v. And then it changed and w is now considered a letter of its own.

iforgotpassword · 2024-03-24T21:10:35 1711314635

Well, I do like this behavior for search though. I don't want to install a new keyboard layout just to be able to search for a Spanish word.

NeoTar · 2024-03-24T21:26:28 1711315588

My brother recently asked for help in determining who a footballer (soccer player) was from a photo. Like in many sports, the jerseys have the players name on the rear, and this player’s was in Cyrillic - Шунин (Anton Shunin) - and my brother had tried searching for Wyhnh without success.

Anyway, my point is that perhaps ideally (and maybe search engines do this) the results should be determined by the locale of the searcher. So someone in the English speaking world can find Łódź by searching for Lodz, but a Pole may need to type Łódź. My brother could find Shunin by typing Wyhnh, but a Russian could not…

nradov · 2024-03-25T01:10:32 1711329032

Essentially you are asking for search engines to recognize "Volapuk" encoding.

https://en.wikipedia.org/wiki/Informal_romanizations_of_Cyri...

david-gpu · 2024-03-24T21:25:29 1711315529

Is the convenience of a few foreigners searching for something more important than the convenience of the many native speakers searching for the same?

Maybe we should start modifying the search behavior of English words to make them more convenient for non-native speakers as well. We could start by making "bed aidia" match "bad idea", since both sound similar to my foreign ears.

MrJohz · 2024-03-24T21:43:18 1711316598

In fairness, for search, allowing multiple ways of typing the same thing is probably the best choice: you can prioritise true matches, where the user has typed the correct form of the letter, but also allow for more visual based matches. (Correcting common typos is also very convenient even for native speakers of a language — and of course a phonetic search that actually produced good results would be wonderful, albeit I suspect practically very difficult given just how many ways of writing a given pronunciation there might be!)

david-gpu · 2024-03-25T11:07:40 1711364860

As a counterexample, conflating two different glyphs as if they were the same can lead to the inability to search for a particular term. E.g. in Spanish these two words (cono, coño) have very different meanings. If I'm searching for one I don't want to see results pertaining to the other one. It would be like searching for "sheet" and getting results for "shit".

MrJohz · 2024-03-25T13:35:45 1711373745

It depends on how the search is implemented exactly and what the context is, but assuming I've searched for "cono", I would expect results that directly match "cono" to come first, then results that also match "coño".

Similarly to how I'd expect to still get reasonable results if I type "beleive" instead of "believe".

That said, this is obviously pretty context-dependent, in some settings it will make more sense to do an exact-match search, in which case you'd want to differentiate n and ñ (while still handling different possible unicode variants of ñ if those exist).

makeitdouble · 2024-03-24T21:25:34 1711315534

Search probably needs both modes. A literal and a fuzzy one.

dmckeon · 2024-03-24T23:19:32 1711322372

For similar sounding names, this fuzzy match is pretty effective. https://www.archives.gov/research/census/soundex

nradov · 2024-03-25T01:25:59 1711329959

In terms of phonetic matching algorithms, Soundex is considered badly outdated. Most MDM products use more advanced alternatives.

koliber · 2024-03-26T10:40:46 1711449646

These are different letters for people who speak the language and treating them the same in some usage seems weird.

At the same time, sometimes words containing those letters might show up in context where the user is not familiar with that language. Such users might not know how to enter those letters. They might not even have the capability to type those letters with their installed keyboard layouts. If they are searching for content that contains such letters (e.g. a first name), normalizing them to the visually-closest ASCII is a sensible choice, even if it makes no sense to the speakers of the language.

It's important to understand a situation from different perspectives.

It's not about coming up with a single correct interpretation that makes logical sense. It about making a system work in least-surprising ways to all classes of users.

makeitdouble · 2024-03-24T21:24:22 1711315462

The general reaction I've see until now was "meh, we have to make compromises (don't make me rewrite this for people I'll probably never meet)"

Diacritics exacerbate this so much as they can be shared between two language yet have different rules/handling. French typically has a decent amount and they're meaningful but traditionally ignores them for comparison (in the dictionary for instance). That makes it more difficult for a dev to have an intuitive feeling of where it matters and where it doesn't.

bawolff · 2024-03-24T22:20:18 1711318818

Normalization isn't based on what language the text is.

NFC just means never use combining characters if possible, and NFD means always use combining characters if possible. It has nothing to do with whether something is a "real" letter in a specific language or not.

The whether or not something is a "real" letter vs a letter with a modifier, more comes into play in the unicode collation algorithm, which is a separate thing.

anamexis · 2024-03-24T20:33:02 1711312382

Well, there's no expectation in unicode that something viewed as a letter in its own right should use a single codepoint.

sorenjan · 2024-03-24T20:53:14 1711313594

I sometimes see texts where ä is rendered as a¨, i.e. with the dots next to the a instead of above it even though it's a completely different letter and not a version of a. I managed to track the issue down to MacOS' normalization, but it has happened on big national newspapers' websites and similar. I haven't seen it in a while, maybe Firefox on Windows renders it better or maybe various publishing tools have fixed it. It looks really unprofessional which is a bit strange since I thought Apple prides themselves on their typography.

aidos · 2024-03-24T21:53:54 1711317234

I have never see that on all my years on a Mac (though admittedly I’m not dealing in languages where I encounter it often). I’m assuming there’s an issue with the gpos table in the font you’re using so the dots aren’t negative shifted into position as they should be?

sorenjan · 2024-03-25T13:34:55 1711373695

Well the point is that ä is one character, not two. It shouldn't be "a with two dots on it", it should be ä. It's its own letter with its own key on Swedish keyboards. MacOS apparently normalizes it to be two characters, and then somewhere in the publishing chain it gets mangled and end up as a¨. I have no doubt that it looked ok on the author's Mac.

It's been a while since I last saw it, but it wasn't because of the font since it was published on a Swedish newspaper's website and other texts worked fine.

aidos · 2024-03-25T19:56:44 1711396604

A single Unicode codepoint could be represented in a couple of different ways (either decomposed into 2 or as 1). Assume it’s the single codepoint representation.

The font you’re using can (and probably will) rewrite it as 2 glyphs using the GSUB table. This makes sense because it’s a more efficient way to store the drawing operations. The GPOS table is then responsible for handling the offset to put things in their right place.

Main point is that it’s up to the font to move things about.

Now, that may not be what was going on in your case at all but it’s possible.

iforgotpassword · 2024-03-24T21:06:33 1711314393

I have that in gnome terminal. The dots always end up on the letter after, not before. At least makes it easy to spot filenames in decomposed form so I can fix them.

ogurechny · 2024-03-25T17:38:49 1711388329

Some old system fonts or old character rasterization engines had problems with certain diacritics, like breve, and they were moved to the space between or after characters. Some Wikipedia articles simply mention that

> Characters may not combine well on some computers.

It was easy to detect people typing or editing text on Apple devices because “their” characters appeared broken, unlike usual single codepoints.

dathinab · 2024-03-24T23:05:04 1711321504

While this (probably) still applies to Apple UI elements when they switched to APFS they stopped doing Unicode normalization on filesystem level.

So now on macOS you can have a very mixed bag with some programs normalizing, some not (it's a bug) and many expecting normalized file names.

So it's kinda like other Linux now except a lot of dev assuming normalization is happening (and in some cases still is when the string passes through certain APIs).

Worse due to normalization now being somewhat application/framework dependent and often going beyond basic Unicode normalization it can lead to quite not so funny bugs.

But luckily most users will never run into any of this bugs even if the use characters which might need normalization.

yxhuvud · 2024-03-24T21:03:54 1711314234

On the other hand, stuff written on macs are a lot more likely to require normalization in the first place.

creshal · 2024-03-24T20:02:43 1711310563

MacOS creates so many normalization problems in mixed environments that it's not even funny any more. No common server-side CMS etc. can deal with it, so the more Macs you add to an organization, the more problems you get with inconsistent normalization in your content. (And indeed, CMSes shouldn't have to second-guess users' intentions - diacretics and umlauts are pronounced differently and I should be able to encode that difference, e.g. to better cue TTS.)

And, of course, the Apple fanboys will just shrug and suggest you also convert the rest of the organization to Apple devices, after all, if Apple made a choice, it can't be wrong.

fauigerzigerk · 2024-03-24T20:22:04 1711311724

I'm not sure I understand. On the one hand you seem to be saying that users should be able to choose which normalisation form to use (not sure why). On the other hand you're unhappy about macOS sending NFD.

If it's a user choice then CMSs have to be able to deal with all normalisation forms anyway and shouldn't care one bit whether macOS sends NFD or NFC. Mac users could of course complain about their choice not being honoured by macOS but that's of no concern to CMSs.

creshal · 2024-03-24T22:06:56 1711318016

> On the other hand you're unhappy about macOS sending NFD.

Because MacOS always uses it, regardless of the user's intention, so it decomposes umlauts into diaereses (despite them having different meanings and pronunciations) and mangles cyrillic, and probably more problems I haven't yet run into.

kps · 2024-03-24T22:51:18 1711320678

Unicode doesn't have ‘umlauts’, and (with a few unfortunate exceptions) doesn't care about meanings and pronunciations. From the Unicode perspective, what you're talking about is the difference between Unicode Normalization Form C:

    U+00FC LATIN SMALL LETTER U WITH DIAERESIS

and Unicode Normalization Form D:

    U+0075 LATIN SMALL LETTER U
    U+0308 COMBINING DIAERESIS

Unicode calls these two forms ‘canonically equivalent’.

cozzyd · 2024-03-25T01:56:35 1711331795

For maximum pain, they should start populating folders with .DS_STÖRE

eviks · 2024-03-25T03:37:29 1711337849

But store decomposed form on Tuesdays!

zh3 · 2024-03-24T20:24:31 1711311871

Suspect you're getting downvoted because of the last sentence. However, I do sympathise with MacOS tending to mangle standard (even plain ASCII) text in a way that adds to the workload for users of other OS's.

creshal · 2024-03-24T22:07:49 1711318069

It adds to the workload of everyone, including the Apple users. The latter ones are just in denial about it.

jesprenj · 2024-03-24T19:40:28 1711309228

Should you really change filenames of users' files and depend on the fact that they are valid utf8? Wouldn't it be better to keep the original filename and use that most of the time sans the searches and indexing?

Why don't you normalize latin alphabets filenames for indexing even further -- allow searching for "Führer" with queries like "Fuehrer" and "Fuhrer"?

zeroCalories · 2024-03-24T20:02:36 1711310556

I generally agree that you shouldn't change the file name, but in reality I bet OP stored it as another column in a database.

For more aggressive normalization like that, I think it makes more sense to implement something like a spell checker that suggests similar files.

josephcsible · 2024-03-24T21:09:51 1711314591

IMO, it was a mistake for Unicode to provide multiple ways to represent 100% identical-looking characters. After all, ASCII doesn't have separate "c"s for "hard c" and "soft c".

renhanxue · 2024-03-24T23:06:50 1711321610

The problem in the linked article barely scratches the surface of the issue. You _cannot_ compare Unicode strings for equality (or sort them) without locale information. A simple example: to a Swedish or Finnish speaker, o and ö completely different letters, as distinct as a is from b, and ö sorts at the very end at the alphabet. A user that searches for ö will definitely not expect words with o to appear. However, to an American, a user that searches for "cooperation" when your text data happens to include writings by people who write like in The New Yorker, would probably to expect to find "coöperation".

This rabbit hole goes very, very deep. In Dutch, the digraph IJ is a single letter. In Swedish, V and W are considered the same letter for most purposes (watch out, people who are using the MySQL default utf8_swedish_ci collation). The Turkish dotless i (ı) in its lowercase form uppercases to a normal I, which then does _not_ lowercase back to a dotless i if you're just lowercasing naively without locale info. In Danish, the digraph aa is an alternate way of writing å (which sorts near the end of the alphabet). Hungarian has a whole bunch of bizarre di- and trigraphs IIRC. Try looking up the standard Unicode algorithm for doing case insensitive equality comparison by the way; it's one heck of a thing.

People somehow think that issues like these are only an issue with Han unification or something, but it's all over European languages as well. Comparing strings for equality is a deeply political issue.

bencelaszlo · 2024-03-25T15:36:56 1711381016

> Hungarian has a whole bunch of bizarre di- and trigraphs IIRC

Actually, there is just only one trihraph. "dzs" almost exclusively used for representing "j" from English and other alphabets, for example "Jennifer" is "Dzsennifer" in Hungarian or "jam" is "dzsem" in the same way.

Trigraph and digraphs actually make sense, at least as a native as these really mark similar sounds what you would think you will get by combining the given graphs. These letters doesn't cause too much issues in search in my opinion, but hyphenation is a form of art (see "magyar.ldf" for LaTeX as an example).

To complicate the situation even further we have a/á, e/é, i/í and o/ó/ü/ő and u/ú/ü/ű letters, all of those considered to be separate ones and you can easily type them in a Hungarian desktop keyboard. On the other hand, mobile virtual keyboards usually show a QWERTY/QWERTZ layout where you can only find "long vowels" by long pressing their "short" counterparts, so when you are targeting mobile users you maybe want to differentiate between "o" and "ö", but not between "o" and "ó" nor between "ö" and "ő".

rat87 · 2024-03-25T17:28:56 1711387736

That doesn't seem that strange Russian and I think Ukrainian (maybe some other languages that use cyrilic) have Дж as the closest thing to English J. Д is d and ж is transliterated as zh. Sometimes names are transliterated with dzh instead of j.

josephcsible · 2024-03-25T15:23:21 1711380201

> to an American, a user that searches for "cooperation" when your text data happens to include writings by people who write like in The New Yorker, would probably to expect to find "coöperation".

Unicode shouldn't be responsible for making such searches work, just like it's not responsible for making searches for "analyze" match text that says "analyse".

renhanxue · 2024-03-25T17:17:53 1711387073

My point was simply that the fact that there are multiple representations of characters that look the same is just a tiny part of the complexity involved in making text behave like users want. It's not that uncommon for people to think that "oh I'll just normalize the string and that'll solve my problems", but normalization is just a small part of quote-unquote "proper" Unicode handling.

The "proper" way of sorting and comparing Unicode strings is part of the standard; it's called the Unicode Collation Algorithm (https://unicode.org/reports/tr10/). It is unwieldy to say the least, but it is tuneable (see the "Tailoring" part) and can be used to implement o/ö equivalence if desired. I think it's great that this algorithm (and its accompanying Common Locale Data Repository) is in the standard and maintained by the consortium, because I definitely wouldn't want to maintain those myself.

fhars · 2024-03-24T21:36:56 1711316216

Unicode was never designed for ease of use or efficiency of encoding, but for ease of adoption. And that meant that it had to support lossless round trips from any legacy format to Unicode and back to the legacy format, because otherwise no decision maker would have allowed to start a transition to Unicode for important systems.

So now we are saddled with an encoding that has to be bug compatible with any encoding ever designed before.

striking · 2024-03-24T21:35:49 1711316149

If you take a peek at an extended ASCII table (like the one at https://www.ascii-code.com/), you'll notice that 0xC5 specifies a precomposed capital A with ring above. It predates Unicode. Accepting that that's the case, and acknowledging that forward compatibility from ASCII to Unicode is a good thing (so we don't have any more encodings, we're just extending the most popular one), and understanding that you're going to have the ring-above diacritic in Unicode anyway... you kind of just end up with both representations.

arp242 · 2024-03-24T22:14:27 1711318467

Everything can just be pre-composed; Unicode doesn't need composing characters.

There's history here, with Unicode originally having just 65k characters, and hindsight is always 20/20, but I do wish there was a move towards deprecating all of this in favour of always using pre-composed.

Also: what you linked isn't "ASCII" and "extended ASCII" doesn't really mean anything. ASCII is a 7-bit character set with 128 characters, and there are dozens, if not hundreds, of 8-bit character sets with 256 characters. Both CP-1252 and ISO-8859-1 saw wide use for Latin alphabet text, but others saw wide use for text in other scripts. So if you give me a document and tell me "this is extended ASCII" then I still don't know how to read it and will have to trail-and-error it.

I don't think Unicode after U+007F is compatible with any specific character set? To be honest I never checked, and I don't see in what case that would be convenient. UTF-8 is only compatible with ASCII, not any specific "extended ASCII".

adrian_b · 2024-03-24T22:51:59 1711320719

In my opinion, only the reverse could be true, i.e. that Unicode does not need pre-composed characters because everything can be written with composing characters.

The pre-composed characters are necessary only for backwards compatibility.

It is completely unrealistic to expect that Unicode will ever provide all the pre-composed characters that have ever been used in the past or which will ever be desired in the future.

There are pre-composed characters that do not exist in Unicode because they have been very seldom used. Some of them may even be unused in any language right now, but they have been used in some languages in the past, e.g. in the 19th century, but then they have been replaced by orthographic reforms. Nevertheless, when you digitize and OCR some old book, you may want to keep its text as it was written originally, so you want the missing composed characters.

Another case that I have encountered where I needed composed characters not existing in Unicode was when choosing a more consistent transliteration for languages that do not use the Latin alphabet. Many such languages use quite bad transliteration systems, precisely because whoever designed them has attempted to use only whatever restricted character set was available at that time. By choosing appropriate composing characters it is possible to design improved transliterations.

arp242 · 2024-03-25T00:25:39 1711326339

> It is completely unrealistic to expect that Unicode will ever provide all the pre-composed characters that have ever been used in the past or which will ever be desired in the future.

I agree it's unlikely this will ever happen, but as far as I know there aren't really any serious technical barriers, and from purely a technical point of view it could be done if there was a desire to do so. There are plenty of rarely used codepoints in Unicode already, and while adding more is certainly an inconvenience, the status quo is also inconvenient, which is why we have one of those "wow, I just discovered Unicode normalisation!" (and variants thereof) posts on the front-page here every few months.

Your last paragraph can be summarize as "it makes it easier to innovate with new diacritics". This is actually an interesting point – in the past anyone could "just" write a new character and it may or may not get any uptake, just as anyone can "just" coin a new word. I've bemoaned this inability to innovate before. That is not inherent to Unicode but computerized alphabets in general, and I that composing characters alleviates at least some of that is probably the best reason I've heard for favouring compose characters.

I'm actually also okay with just using composing characters and deprecating the pre-composed forms. Overall I feel that pre-composed is probably better, partly because that's what most text currently uses and partly because it's simpler, but that's the lesser issue – the more important one that it would be nice to move towards "one obviously canonical" form that everything uses.

adrian_b · 2024-03-25T00:42:03 1711327323

There is also another reason that makes the composing characters very convenient right now.

Many of the existing typefaces, even some that are quite expensive, do not contain all the pre-composed characters defined by Unicode, especially when those characters have been added in more recent Unicode versions or when they are used only in languages that are not Western European.

The missing characters can be synthesized with composing characters. The alternatives, which are to use a font editor to add characters to the typeface or to buy another more complete and more expensive version of the typeface, are not acceptable or even possible for most users.

Therefore the fact that Unicode has defined composing characters is quite useful in such cases.

arp242 · 2024-03-25T01:41:20 1711330880

Every avenue opens inconveniences for someone, but I'd rather choose the relatively rare inconvenience of font designers over the relatively common inconvenience of every piece of software ever written. Especially because this can be automated in font design tools, or even font formats itself.

zokier · 2024-03-24T22:28:18 1711319298

For roundtripping e.g. https://en.wikipedia.org/wiki/VSCII you do need both composing characters and precomposed characters.

kps · 2024-03-24T22:58:16 1711321096

> I don't think Unicode after U+007F is compatible with any specific character set?

The ‘early’ Unicode alphabetic code blocks came from ISO 8859 encodings¹, e.g. the Unicode Cyrillic block follows ISO 8859-5, the Greek and Coptic block follows ISO 8859-7, etc.

¹ https://en.wikipedia.org/wiki/ISO/IEC_8859

bandrami · 2024-03-24T22:35:19 1711319719

> Unicode doesn't need composing characters

But it does, IIRC, for both Bengali and Telugu.

arp242 · 2024-03-24T22:50:37 1711320637

Only because they chose to do it like that. It doesn't need to.

kbolino · 2024-03-25T02:00:03 1711332003

Considering that Unicode did not invent combining diacritics, it follows that simple compatibility with existing encodings demanded it. Now that Unicode's goals have expanded beyond simply representing what already exists, precomposed characters would be too limiting.

pavel_lishin · 2024-03-24T22:09:45 1711318185

It might not be ludicrous to suggest that the English letter "a" and the Russian letter "а" should be a single entity, if you don't think about it very hard.

But the English letter "c" and the Russian letter "с" are completely different characters, even if at a glance they look the same - they make completely different sounds, and are different letters. It would be ludicrous to suggest that they should share a single symbol.

josephcsible · 2024-03-24T23:17:11 1711322231

If they're always supposed to look the same, then Unicode should encode them the same, even if they mean different things in different contexts.

pavel_lishin · 2024-03-24T23:20:59 1711322459

Two counterpoints:

1. Unicode isn't a method of storing pixel or graphic representations of writing systems; it's meant to store text, regardless of how similar certain characters look.

2. What do you do about screen readers & the like? If it encounters something that looks like a little half-moon glyph that's in the middle of a sentence about foreign alphabets that reads "Por ejemplo, la letra 'c'", should it pronounce it as the English "see" or as Russian "ess"?

tzs · 2024-03-25T17:04:44 1711386284

> 1. Unicode isn't a method of storing pixel or graphic representations of writing systems; it's meant to store text, regardless of how similar certain characters look.

I'm not sure that that is really possible without something way bigger or more complicated than Unicode. Consider the string "fart". In English that means to emit gas from the anus. In Swedish it means speed. Does that mean Unicode should have separate "f", "a", "r", and "t" for English and Swedish?

> 2. What do you do about screen readers & the like? If it encounters something that looks like a little half-moon glyph that's in the middle of a sentence about foreign alphabets that reads "Por ejemplo, la letra 'c'", should it pronounce it as the English "see" or as Russian "ess"?

What would a human do if that was in a book and they were reading it aloud for a blind friend?

Izkata · 2024-03-25T00:18:20 1711325900

For 8 minutes of this (among other translation mistakes), you've reminded me of Peggy Hill's understanding of Spanish in the cartoon King of the Hill - https://www.youtube.com/watch?v=g62A1vkSxB0

(IIRC, she learned the language entirely from books so has no idea of the correct pronunciation and thinks she's fluent)

patal · 2024-03-25T08:06:13 1711353973

1. "graphic representation of writing systems" and "text" mean the same thing to me. Do you mean text as spoken?

2. I think the pronunciation should not be encoded into the text representation on a general scale. You would need different encodings for "though" and "through" in english alone. Your example leaves the meaning open, even if being read as text. If I was the editor, and the distinction was important, I'd change it to "For example, the cyrillic letter 'c'".

I understand that Unicode provides different code points for same-looking characters, mostly because of history, where these characters came from different code sheets in language-specific encodings.

pavel_lishin · 2024-03-25T12:54:06 1711371246

I mean text as in the platonic ideal of "c" and "с". Just because they look the same, does not make them the same character. If we're going to be encoding characters that happen to have pixel-identical renderings in certain fonts, the next logical step is to encode identical letters that look different in different fonts or writing styles as separate code points as well - for example, the English letter "g" is a fucking orthographic nightmare.

kps · 2024-03-25T16:23:45 1711383825

Imagine if, say, English people normally wrote an open ‘g’ and French normally wrote a looped ‘g’, and you have the essence of the Han Unification debates.

Joker_vD · 2024-03-25T00:03:03 1711324983

What about Latin "k" and Cyrillic "к"? Do they look the same in your font of choice? Should they?

ogurechny · 2024-03-25T20:35:37 1711398937

Heh.

“Cyrillic” isn't the same everywhere. Bulgarian fonts differ from Russian fonts, some letters are “latinized”, some borrow from handwritten forms:

https://bg.wikipedia.org/wiki/Българска_кирилица

Colored example has the third alternative for Serbian cursive.

So without some external lang metadata we don't even know how your message should look.

However, Russian “Кк” traditionally is different from Latin “Kk” in most recognized families. In the '90s, font designers regularly thrashed ad-hoc font localization attempts which ignored the legacy of pre-digital era, and blindly copied the Latin capital into capital and minuscule forms.

josephcsible · 2024-03-25T02:02:13 1711332133

Those look different, so I have no issue with them being different code points.

eviks · 2024-03-25T03:43:28 1711338208

But they don't "fundamentally" look different, it's font dependent(there are fonts where they look the same), just like the same Latin k will look different depending on a font, so you need a better rule to make your own simple Unicode

Joker_vD · 2024-03-26T20:03:44 1711483424

He's probably the guy who decided to add fraktur/double-strike/sans-serif/small-caps/bold/script/etc variants of Latin letters to the Unicode because, you know, they look different! so they should get their own special code points.

It was a joke, by the way.

rat87 · 2024-03-25T17:36:57 1711388217

What about Cyrillic T: Т? It looks the same uppercase (but not lowercase. And in italic/cursive, which I believe is not encoded in Unicode, it looks sort of like an m).

allarm · 2024-03-25T07:58:39 1711353519

The capitalized "K" and "К" look exactly the same though.

josephcsible · 2024-03-25T15:29:29 1711380569

When I look at your post, in "K", the lower diagonal line branches off of the upper diagonal line, slightly breaking horizonal symmetry, but "К" is horizontally symmetrical.

loa_in_ · 2024-03-26T07:16:56 1711437416

The latter glyph has a little bend on the top diagonal part

pavel_lishin · 2024-03-25T12:54:47 1711371287

Not in my font!

mzs · 2024-03-25T16:13:36 1711383216

C vs С is so strange to me. They look the same upper and lower case, italic, cursive, even are at the same location on keyboards. It's not like W is a different character in Slavic languages that use latin script even though the sound is completely different in English.

rat87 · 2024-03-26T23:21:48 1711495308

I was thinking of Russian letter г and Ukrainian letter г.

Or the whole eh/ye flip En/UK/Ru Eh/е/э Ye/є/е

г/е are unified and that's probably as it should be but there are downsides.

bawolff · 2024-03-24T22:24:21 1711319061

Maybe, but then you can no longer round trip with other encodings, which seems worse to me.

layer8 · 2024-03-24T17:36:47 1711301807

The more general solution is specified here: https://unicode.org/reports/tr10/#Searching

bawolff · 2024-03-24T22:25:38 1711319138

Collation and normal forms are totally different things with different purposes and goals.

Edit: reread the article. My comment is silly. UCA is the correct solution to the author's problem.

blablabla123 · 2024-03-24T20:15:54 1711311354

As a German macOS user with US keyboard I run into a related issue every now and then. What's nice about macOS is I can easily combine Umlaute but also other common letters from European languages without any extra configuration. But some (Web) Applications stumble upon it, while entering because it's like: 1. ¨ (Option-u) 2. ü (u pressed)

kps · 2024-03-24T22:41:45 1711320105

Early on, Netscape effectively exposed Windows keyboard events directly to Javascript, and browsers on other platforms were forced to try to emulate Windows events, which is necessarily imperfect given different underlying input systems. “These features were never formally specified and the current browser implementations vary in significant ways. The large amount of legacy content, including script libraries, that relies upon detecting the user agent and acting accordingly means that any attempt to formalize these legacy attributes and events would risk breaking as much content as it would fix or enable. Additionally, these attributes are not suitable for international usage, nor do they address accessibility concerns.”

The current method is much better designed to avoid such problems, and has been supported by all major browsers for quite a while now (the laggard Safari arriving 7 years from this Tuesday).

https://www.w3.org/TR/uievents

chuckadams · 2024-03-24T19:39:39 1711309179

Clearly the author already knows this, but it highlights the importance of always normalizing your input, and consistently using the same form instead of relying on the OS defaults.

makeitdouble · 2024-03-24T21:33:49 1711316029

The larger point is probably that search and comparison are inherently hard as what humans understand as equivalent isn't the same for the machine. Next stop will be upper case and lower case. Then different transcriptions of the same words in CJK.

mckn1ght · 2024-03-24T20:53:23 1711313603

Also, never trust user input. File names are user inputs. You can execute XSS attacks via filenames on an unsecured site.

userbinator · 2024-03-24T20:26:54 1711312014

its[sic] 2024, and we are still grappling with Unicode character encoding problems

More like "because it's 2024." This wouldn't be a problem before the complexity of Unicode became prevalent.

bornfreddy · 2024-03-24T20:52:36 1711313556

You mean this wouldn't be a problem if we used the myriad different encodings like we did before Unicode, because we would probably not be able to even save the files anyway? So true.

userbinator · 2024-03-24T21:38:01 1711316281

Before Unicode, most systems were effectively "byte-transparent" and encoding only a top-level concern. Those working in one language would use the appropriate encoding (likely CP1252 for most Latin languages) and there wouldn't be confusion about different bytes for same-looking characters.

deathanatos · 2024-03-24T22:40:54 1711320054

A single user system, perhaps.

I've worked on a system that … well, didn't predate Unicode, but was sort of near the leading edge of it and was multi-system.

The database columns containing text were all byte arrays. And because the client (a Windows tool, but honestly Linux isn't any better off here) just took a LPCSTR or whatever, it they bytes were just in whatever locale the client was. But that was recorded nowhere, and of course, all the rows were in different locales.

I think that would be far more common, today, if Unicode had never come along.

bawolff · 2024-03-24T22:29:20 1711319360

My understanding is way back in the day, people would use ascii backspace to combine an ascii letter with an ascii accent character.

kps · 2024-03-24T23:19:41 1711322381

ASCII 1967 (and the equivalent ECMA-6) suggested this, and that the characters ,"'`~ could be shaped to look like a cedilla, diaeresis, acute accent, grave accent, and raised tilde respectively for that purpose. But I've never once seen or heard of that method used.

ASCII also allowed the characters @[\]^{|}~ to be replaced by others in ‘national character allocations’, and this was commonly used in the 7-bit ASCII era.

In the 8-bit days, for alphabetic scripts, typically the range 0xA0–0xFF would represent a block of characters (e.g. an ISO 8859¹ range) selected by convention or explicitly by ISO 2022². (There were also pre-standard similar methods like DEC NRCS and IBM's EBCDIC code pages.)

¹ https://en.wikipedia.org/wiki/ISO/IEC_8859

¹ https://en.wikipedia.org/wiki/ISO/IEC_2022

bawolff · 2024-03-25T00:28:30 1711326510

Googling i saw people link to http://git.savannah.gnu.org/cgit/bash.git/tree/doc/bash.0 as an example of overstriking (albeit for bold not accents). The telnet rfc also makes reference to it. I also see lots of references in the context of APL.

I suppose in the 60s/70s it would be in the era of teletypewriters where maybe over striking would more naturally be a thing.

I also found references to less supporting this sort of thing, but seems to be about bold and underline, not accents.

kps · 2024-03-25T03:31:34 1711337494

nroff did do overstriking for underlining and bold. I don't remember if it did so for accents, but in any case it was for printer output and not plain text itself.

APL did use overstriking extensively, and there were video terminals that knew how to compose overstruck APL characters.

TheRealPomax · 2024-03-25T16:53:49 1711385629

SHIFT-JIS and EUC would like a word.

n2d4 · 2024-03-24T20:57:15 1711313835

You make it sound like non-English languages were invented in 2024

mschuster91 · 2024-03-24T23:44:14 1711323854

> This wouldn't be a problem before the complexity of Unicode became prevalent.

It was a problem even before then. It worked fine as long as you had countries that were composed of one dominant ethnicity that sharted upon how minorities and immigrants lived (they were just forced to use a transliterated name, which could be one hell of a lot of fun for multi-national or adopted people) - and even that wasn't enough to prevent issues. In Germany, for example, someone had to go up to the highest public-service courts in the late 70s [1] to have his name changed from Götz to Goetz because he was pissed off that computers were unable to store the ö and so he'd liked to change his name rather than keep getting mis-named, but German bureaucracy does not like name changes outside of marriage and adoption.

[1] https://www.schweizer.eu//aktuelles/urteile/7304-bverwg-vom-...

bawolff · 2024-03-24T22:26:47 1711319207

Combining characters go back to the 90s. The unicode normal forms were defined in the 90s. None of this is new at this point.

_nalply · 2024-03-24T19:24:55 1711308295

Sometimes it makes sense to reduce to Unicode confusables.

For example the Greek letter Big Alpha looks like uppercase A. Or some characters look very similar like the slash and the fraction slash. Yes, Unicode has separate scalar values for them.

There are Open Source tools to handle confusables.

This is in addition to the search specified by Unicode.