
One thing that is very unintuitive with normalization is that macOS is much more aggressive about normalizing Unicode than Windows or Linux distros. Even if you copy and paste non-normalized text into a text box in Safari on a Mac, it will be normalized before it gets posted to the server. This leads to strange issues with string matching.



Unfun normalisation fact: You can’t have a file named "ss" and a file named "ß" in the same folder in Mac OS.


There are people with the surname "Con" and it's impossible to create a file with that name in MS Windows.

https://learn.microsoft.com/en-us/windows/win32/fileio/namin...


That's less a normal form issue and more a case-insensitivity issue. You also can't have a file named "a" and one named "A" in the same folder.


That would be true if the test strings were "SS" and "ß", because although "ẞ" is a valid capitalization of "ß", it's officially a newcomer. It's more of a hybrid issue: it appears that APFS uses uppercasing for case-insensitive comparison, and also uppercases "ß" to "SS", not "ẞ". This is the default casing; Unicode also defines a "tailored casing" which doesn't have this property.

So it isn't per se normalization, but it's not not normalization either. In any case (heh) it's a weird thing that probably shouldn't happen. Worth noting that APFS doesn't normalize file names, but normalization happens higher up in the toolchain; this has made some things better and others worse.
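For anyone who wants to poke at the default casing behaviour described here, a minimal sketch using Python's built-in Unicode casing. The helper `insensitive_key` is a made-up name, and this is not a claim about what APFS actually runs internally; it only shows that default uppercasing and case folding both make "ß" and "ss" compare equal.

```python
import unicodedata

def insensitive_key(name: str) -> str:
    """Hypothetical comparison key: full case folding, then NFD normalization."""
    return unicodedata.normalize("NFD", name.casefold())

# Default (untailored) Unicode casing, as exposed by Python's str methods.
print("ß".upper())     # SS
print("ß".casefold())  # ss

# Under this key the two names collide, just like "a" and "A" do.
print(insensitive_key("ss") == insensitive_key("ß"))  # True
print(insensitive_key("a") == insensitive_key("A"))   # True
```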


That would only explain why "ß" and "ẞ" can't both be files in the same folder. "ß" and "ss" are different letters, just like "u" and "ue" for example.


This shows up in other places, too. One of my Slacks has a textji of `groß`, because I enjoy making our German speakers' teeth grind, but you sure can just type `:gross:` to get it.


> a textji

This is a weird formation; "ji" means text. It's half of the half of "emoji" that means text: 絵文字, 絵 [e, "picture"] 文字 [moji, "character", from 文 "text" + 字 "character"].

https://satwcomic.com/half-human-half-scandinavian


It's weird, but it's also how language evolves sometimes. Once used in a certain way, words or parts of words take on that meaning.

For example, there's an apartment and office building complex on a site near a historic canal and dam. The building development was named after this site. Then in one of the apartments (CORRECTION: offices), a scandalous political event happened. The complex was called Watergate, the scandal was called Watergate too, and now the suffix -gate is used for scandals.


> Then in one of the apartments, a scandalous political event happened.

It was one of the offices, not one of the apartments (specifically, it was a series of break-ins at, and the wiretapping of, the headquarters of the Democratic National Committee by people working for President Nixon’s re-election committee).


Oops! I double-checked that detail, made a mental note to say it was an office, and then typed apartment anyway.


Yes, but "reactji" is also weird and yet people use it for Slack reactions. It's fine.


So what happens if someone puts those two in a git repo and a Mac user checks out the folder?


  git clone https://github.com/ghurley/encodingtest
  Cloning into 'encodingtest'...
  remote: Enumerating objects: 9, done.
  remote: Counting objects: 100% (9/9), done.
  remote: Compressing objects: 100% (5/5), done.
  remote: Total 9 (delta 1), reused 0 (delta 0), pack-reused 0
  Receiving objects: 100% (9/9), done.
  Resolving deltas: 100% (1/1), done.
  warning: the following paths have collided (e.g. case-sensitive paths
  on a case-insensitive filesystem) and only one from the same
  colliding group is in the working tree:

  'ss'
  'ß'


I have this issue on occasion with older mixed C/C++ codebases that use `.c` for C files and `.C` for C++ files. Maddening.


I never understood the popularity of the '.C' extension for C++ files. I have my own preference (.cpp), but it's essentially arbitrary compared to most other common alternatives (.cxx, .c++). The '.C' extension is the only one that just seems worse (this case sensitivity issue, and just general confusion given how similar '.c' looks to '.C').

But even more than that, I just don't get how C++ turns into 'C' at all. It seems actively misleading.


C++

is Incremented C

which is Big C

which is Capital C


But C is already capital C! Even .d would have been a better extension.


He is clearly talking about the capital version of capital C.


You can always reformat as APFS (Case Sensitive)


I remember seeing quite a few things in the old days that would have both 'makefile' and 'Makefile'.


EEXIST


I was really surprised when I realized that, at least in HFS+, Cyrillic is normalized too. For example, no Russian ever thinks that Й is an И with some diacritic. It's a different letter in its own right. But the Mac normalizes it into two codepoints.
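The two-codepoint form can be reproduced with Python's standard `unicodedata` module. This is only a sketch of what canonical decomposition (NFD) does to the letter, not a reproduction of the filesystem's own code path.

```python
import unicodedata

composed = "\u0419"  # Й, CYRILLIC CAPITAL LETTER SHORT I
decomposed = unicodedata.normalize("NFD", composed)

# List each codepoint of the decomposed form with its Unicode name.
print([f"U+{ord(c):04X} {unicodedata.name(c)}" for c in decomposed])
# ['U+0418 CYRILLIC CAPITAL LETTER I', 'U+0306 COMBINING BREVE']

# NFC puts it back together as a single codepoint.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```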


I dislike explaining string compares to monolingual English speakers who are programmers. Similar to this phenomenon of Й/И is people who think ñ and n should compare equally, or ç and c, or that the lowercase of I is always i (or that case conversion is locale-independent).

In something like a code review, people will think you're insane for pointing out that this type of assumption might not hold. Actually, come to think of it, explaining localization bugs at all is a tough task in general.


Or that sort order is locale independent. Swedish is a good example here as åäö are sorted at the end, and where until 2006 w was sorted as v. And then it changed and w is now considered a letter of its own.
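A rough illustration of why sort order can't be locale independent, assuming a hand-written alphabet table rather than a real collation library. `SWEDISH_ALPHABET` and `swedish_key` are names invented for this sketch, and it ignores everything a proper collation implementation handles (accents on other letters, multi-level comparison, and so on).

```python
# Hypothetical table: Swedish alphabet order with å, ä, ö at the end.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"
RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}

def swedish_key(word):
    # Characters missing from the table sort after everything else
    # (a simplification for this sketch).
    return [RANK.get(ch, len(RANK)) for ch in word.lower()]

words = ["\u00f6", "z", "\u00e4", "a", "\u00e5"]  # ö, z, ä, a, å

print(sorted(words))                   # codepoint order: ['a', 'z', 'ä', 'å', 'ö']
print(sorted(words, key=swedish_key))  # Swedish order:   ['a', 'z', 'å', 'ä', 'ö']
```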


Well, I do like this behavior for search though. I don't want to install a new keyboard layout just to be able to search for a Spanish word.


My brother recently asked for help in determining who a footballer (soccer player) was from a photo. Like in many sports, the jerseys have the player's name on the rear, and this player’s was in Cyrillic - Шунин (Anton Shunin) - and my brother had tried searching for Wyhnh without success.

Anyway, my point is that perhaps ideally (and maybe search engines do this) the results should be determined by the locale of the searcher. So someone in the English speaking world can find Łódź by searching for Lodz, but a Pole may need to type Łódź. My brother could find Shunin by typing Wyhnh, but a Russian could not…


Essentially you are asking for search engines to recognize "Volapuk" encoding.

https://en.wikipedia.org/wiki/Informal_romanizations_of_Cyri...


Is the convenience of a few foreigners searching for something more important than the convenience of the many native speakers searching for the same?

Maybe we should start modifying the search behavior of English words to make them more convenient for non-native speakers as well. We could start by making "bed aidia" match "bad idea", since both sound similar to my foreign ears.


In fairness, for search, allowing multiple ways of typing the same thing is probably the best choice: you can prioritise true matches, where the user has typed the correct form of the letter, but also allow for more visual based matches. (Correcting common typos is also very convenient even for native speakers of a language — and of course a phonetic search that actually produced good results would be wonderful, albeit I suspect practically very difficult given just how many ways of writing a given pronunciation there might be!)


As a counterexample, conflating two different glyphs as if they were the same can lead to the inability to search for a particular term. E.g. in Spanish these two words (cono, coño) have very different meanings. If I'm searching for one I don't want to see results pertaining to the other one. It would be like searching for "sheet" and getting results for "shit".


It depends on how the search is implemented exactly and what the context is, but assuming I've searched for "cono", I would expect results that directly match "cono" to come first, then results that also match "coño".

Similarly to how I'd expect to still get reasonable results if I type "beleive" instead of "believe".

That said, this is obviously pretty context-dependent. In some settings it will make more sense to do an exact-match search, in which case you'd want to differentiate n and ñ (while still handling different possible Unicode variants of ñ if those exist).
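Roughly what I have in mind, as a sketch: exact matches (after NFC, so both Unicode variants of ñ count as exact) come first, then matches that only agree once diacritics are stripped. `fold` and `search` are hypothetical helpers, and matching whole strings instead of tokens inside documents is a simplification.

```python
import unicodedata

def fold(text):
    """Strip combining marks after NFD and case-fold, e.g. 'coño' -> 'cono'."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()

def search(query, documents):
    """Return exact matches first, then diacritic-insensitive matches."""
    q_exact = unicodedata.normalize("NFC", query)
    q_loose = fold(query)
    exact, loose = [], []
    for doc in documents:
        d = unicodedata.normalize("NFC", doc)
        if d == q_exact:
            exact.append(doc)
        elif fold(d) == q_loose:
            loose.append(doc)
    return exact + loose

# "co\u00f1o" is precomposed ñ, "con\u0303o" is n + combining tilde.
docs = ["co\u00f1o", "cono", "con\u0303o", "casa"]
print(search("cono", docs))       # plain "cono" first, then both ñ variants
print(search("co\u00f1o", docs))  # both ñ variants first, then "cono"
```

Note that this kind of folding only covers letters with a canonical decomposition. Something like Ł has none, so matching Łódź against Lodz would need an extra mapping table.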


Search probably needs both modes. A literal and a fuzzy one.


For similar sounding names, this fuzzy match is pretty effective. https://www.archives.gov/research/census/soundex
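For reference, a compact sketch of American Soundex as I understand the rules on that page: keep the first letter, code the remaining consonants as digits, collapse adjacent duplicates (h and w don't break a run, vowels do), and pad to four characters. The function name and the handling of non-ASCII input (it is simply dropped) are assumptions of this sketch.

```python
# Digit codes for consonants; vowels, h, w and y carry no code.
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for ch in letters:
        CODES[ch] = str(digit)

def soundex(name):
    letters = [c for c in name.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (result + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # R163 R163
print(soundex("Ashcraft"))                   # A261
```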


In terms of phonetic matching algorithms, Soundex is considered badly outdated. Most MDM products use more advanced alternatives.


These are different letters for people who speak the language and treating them the same in some usage seems weird.

At the same time, sometimes words containing those letters might show up in contexts where the user is not familiar with that language. Such users might not know how to enter those letters. They might not even have the capability to type those letters with their installed keyboard layouts. If they are searching for content that contains such letters (e.g. a first name), normalizing them to the visually closest ASCII is a sensible choice, even if it makes no sense to the speakers of the language.

It's important to understand a situation from different perspectives.

It's not about coming up with a single correct interpretation that makes logical sense. It's about making a system work in the least surprising way for all classes of users.


The general reaction I've seen until now was "meh, we have to make compromises (don't make me rewrite this for people I'll probably never meet)".

Diacritics exacerbate this so much, as they can be shared between two languages yet have different rules/handling. French typically has a decent number of them and they're meaningful, but it traditionally ignores them for comparison (in the dictionary, for instance). That makes it more difficult for a dev to have an intuitive feeling for where it matters and where it doesn't.


Normalization isn't based on what language the text is.

NFC just means never use combining characters if possible, and NFD means always use combining characters if possible. It has nothing to do with whether something is a "real" letter in a specific language or not.

Whether something is a "real" letter or a letter with a modifier comes into play more in the Unicode collation algorithm, which is a separate thing.
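To make the NFC/NFD distinction concrete, a small sketch with Python's `unicodedata`. `same_text` is a hypothetical helper name; the point is only that two strings can differ codepoint by codepoint while being canonically equivalent.

```python
import unicodedata

def same_text(a, b):
    """Compare two strings after bringing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

composed = "\u00e4"     # ä as one codepoint (what NFC produces)
decomposed = "a\u0308"  # a + COMBINING DIAERESIS (what NFD produces)

print(composed == decomposed)           # False: raw codepoints differ
print(len(composed), len(decomposed))   # 1 2
print(same_text(composed, decomposed))  # True: canonically equivalent
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
```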


Well, there's no expectation in unicode that something viewed as a letter in its own right should use a single codepoint.


I sometimes see texts where ä is rendered as a¨, i.e. with the dots next to the a instead of above it even though it's a completely different letter and not a version of a. I managed to track the issue down to MacOS' normalization, but it has happened on big national newspapers' websites and similar. I haven't seen it in a while, maybe Firefox on Windows renders it better or maybe various publishing tools have fixed it. It looks really unprofessional which is a bit strange since I thought Apple prides themselves on their typography.


I have never seen that in all my years on a Mac (though admittedly I’m not dealing in languages where I encounter it often). I’m assuming there’s an issue with the GPOS table in the font you’re using, so the dots aren’t negative-shifted into position as they should be?


Well, the point is that ä is one character, not two. It shouldn't be "a with two dots on it", it should be ä. It's its own letter with its own key on Swedish keyboards. MacOS apparently normalizes it to be two characters, and then somewhere in the publishing chain it gets mangled and ends up as a¨. I have no doubt that it looked OK on the author's Mac.

It's been a while since I last saw it, but it wasn't because of the font since it was published on a Swedish newspaper's website and other texts worked fine.


A single Unicode codepoint could be represented in a couple of different ways (either decomposed into 2 or as 1). Assume it’s the single codepoint representation.

The font you’re using can (and probably will) rewrite it as 2 glyphs using the GSUB table. This makes sense because it’s a more efficient way to store the drawing operations. The GPOS table is then responsible for handling the offset to put things in their right place.

Main point is that it’s up to the font to move things about.

Now, that may not be what was going on in your case at all but it’s possible.


I have that in gnome terminal. The dots always end up on the letter after, not before. At least makes it easy to spot filenames in decomposed form so I can fix them.
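If you'd rather not rely on the rendering glitch to spot them, here is a sketch of a check using `unicodedata.is_normalized` (Python 3.8+). Both function names are made up for this example, and it only reports names that are not in NFC; it does not rename anything.

```python
import os
import unicodedata

def find_non_nfc(names):
    """Return the names that are not already in NFC (e.g. decomposed ones)."""
    return [n for n in names if not unicodedata.is_normalized("NFC", n)]

def find_non_nfc_in_dir(directory):
    """Apply the same check to the entries of a directory."""
    return find_non_nfc(os.listdir(directory))

# "a\u0308" is a + combining diaeresis; "\u00e4" is the precomposed ä.
print(find_non_nfc(["a\u0308.txt", "\u00e4.txt", "plain.txt"]))
```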


Some old system fonts or old character rasterization engines had problems with certain diacritics, like breve, and they were moved to the space between or after characters. Some Wikipedia articles simply mention that

> Characters may not combine well on some computers.

It was easy to detect people typing or editing text on Apple devices because “their” characters appeared broken, unlike usual single codepoints.


While this (probably) still applies to Apple UI elements, when they switched to APFS they stopped doing Unicode normalization at the filesystem level.

So now on macOS you can have a very mixed bag, with some programs normalizing, some not (it's a bug), and many expecting normalized file names.

So it's kind of like Linux now, except that a lot of devs assume normalization is happening (and in some cases it still is, when the string passes through certain APIs).

Worse, because normalization is now somewhat application- and framework-dependent and often goes beyond basic Unicode normalization, it can lead to some not-so-funny bugs.

But luckily most users will never run into any of these bugs, even if they use characters which might need normalization.


On the other hand, stuff written on Macs is a lot more likely to require normalization in the first place.


MacOS creates so many normalization problems in mixed environments that it's not even funny any more. No common server-side CMS etc. can deal with it, so the more Macs you add to an organization, the more problems you get with inconsistent normalization in your content. (And indeed, CMSes shouldn't have to second-guess users' intentions - diaereses and umlauts are pronounced differently and I should be able to encode that difference, e.g. to better cue TTS.)

And, of course, the Apple fanboys will just shrug and suggest you also convert the rest of the organization to Apple devices, after all, if Apple made a choice, it can't be wrong.


I'm not sure I understand. On the one hand you seem to be saying that users should be able to choose which normalisation form to use (not sure why). On the other hand you're unhappy about macOS sending NFD.

If it's a user choice then CMSs have to be able to deal with all normalisation forms anyway and shouldn't care one bit whether macOS sends NFD or NFC. Mac users could of course complain about their choice not being honoured by macOS but that's of no concern to CMSs.


> On the other hand you're unhappy about macOS sending NFD.

Because MacOS always uses it, regardless of the user's intention, so it decomposes umlauts into diaereses (despite them having different meanings and pronunciations) and mangles cyrillic, and probably more problems I haven't yet run into.


Unicode doesn't have ‘umlauts’, and (with a few unfortunate exceptions) doesn't care about meanings and pronunciations. From the Unicode perspective, what you're talking about is the difference between Unicode Normalization Form C:

    U+00FC LATIN SMALL LETTER U WITH DIAERESIS
and Unicode Normalization Form D:

    U+0075 LATIN SMALL LETTER U
    U+0308 COMBINING DIAERESIS
Unicode calls these two forms ‘canonically equivalent’.


For maximum pain, they should start populating folders with .DS_STÖRE


But store decomposed form on Tuesdays!


Suspect you're getting downvoted because of the last sentence. However, I do sympathise with MacOS tending to mangle standard (even plain ASCII) text in a way that adds to the workload for users of other OS's.


It adds to the workload of everyone, including the Apple users. The latter ones are just in denial about it.



