
Case transformations are locale-dependent. That is, in French lower case of "I" would be "i", and upper case of "i" would be "I". In Turkish, which uses largely the same letters, lower case of "I" would be "ı", and upper case of "i" would be "İ". Also, in German, upper case of "s" is "S", but upper case of "ß" would be "SS", and you have to guess what lower case of "SS" would be.
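
To make that concrete, here is a minimal Python sketch. Python's built-in `str.upper()` and `str.lower()` apply the locale-independent Unicode default mappings, so the Turkish rules need special handling; `turkish_lower` and `turkish_upper` are hypothetical helpers written for this illustration, not a library API.

```python
# Default Unicode case mappings, with no locale involved.
print("ß".upper())   # SS
print("SS".lower())  # ss (the original "ß" cannot be recovered)
print("I".lower())   # i  (correct for French, wrong for Turkish)

# Hypothetical helpers: handle the dotted/dotless i pairs first,
# then fall back to the default mapping for everything else.
def turkish_lower(text: str) -> str:
    return text.replace("İ", "i").replace("I", "ı").lower()

def turkish_upper(text: str) -> str:
    return text.replace("i", "İ").replace("ı", "I").upper()

print(turkish_lower("ISPARTA"))   # ısparta
print(turkish_upper("istanbul"))  # İSTANBUL
```

A real application would more likely lean on a locale-aware library such as ICU than on hand-written replacements like these.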

Universal case insensitivity is hard if not impossible. It's best to preserve both a "canonical case" version and the raw data in a search index, if different.
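
A rough sketch of that idea in Python, assuming `str.casefold()` plus NFC normalization is an acceptable "canonical case" (it is locale-blind, so it will still be wrong for Turkish). The names `canonical_key`, `build_index` and `lookup` are made up for this example.

```python
import unicodedata
from collections import defaultdict

def canonical_key(text: str) -> str:
    # Simplified canonical form: full Unicode case folding, then NFC.
    return unicodedata.normalize("NFC", text.casefold())

def build_index(entries):
    # Map each canonical key to the raw strings that produced it,
    # so the original spelling is never thrown away.
    index = defaultdict(list)
    for raw in entries:
        index[canonical_key(raw)].append(raw)
    return index

def lookup(index, query):
    return index.get(canonical_key(query), [])

index = build_index(["Straße", "STRASSE", "strasse", "Strand"])
print(lookup(index, "Strasse"))  # ['Straße', 'STRASSE', 'strasse']
```

Keeping the raw strings next to the key lets you match loosely but still display, or re-rank by, the exact spelling the user typed.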




There is an uppercase ß, actually: https://en.wikipedia.org/wiki/Capital_%E1%BA%9E

Not that it has any impact on your point. But as a German who only learned about it fairly recently, I have rather ambivalent feelings about it. ;-)


Nice! German has funnier problems, though, because a vowel with an umlaut can be represented as that vowel + e, that is, "fuer" is a legit representation of "für". How do you normalize that? Turning to one canonical representation works most of the time, but sometimes you also need letter-to-letter correspondence.


It also gets interesting with proper names. It's perfectly legitimate to transliterate "Herr Schröder" to "Herr Schroeder" (and if you don't have ö at your disposal you have to), but a proper name that starts out with the digraph, like "Dr. Oetker", usually cannot be converted back to "ö".

In some cases that might be because the "oe" was never an "ö" in this case, even if people might have shifted to pronouncing it that way. But in other cases I imagine that it was intended and did start out as an "ö" sound, but people just decided to write the name with "oe".

(Same with the other Umlauts.)
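
A small Python sketch of why this only works in one direction. Expanding umlauts is mechanical, but the naive reverse mapping (the hypothetical `naive_contract` below) cannot tell a transliterated "ö" from a genuine "oe", or from an unrelated "ue" that just happens to appear in a word.

```python
import unicodedata

EXPANSIONS = {"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"}
TABLE = str.maketrans(EXPANSIONS)

def expand_umlauts(text: str) -> str:
    # Normalize to NFC first so decomposed umlauts (a + U+0308) are caught too.
    return unicodedata.normalize("NFC", text).translate(TABLE)

def naive_contract(text: str) -> str:
    # Deliberately naive: blindly turns every digraph back into an umlaut.
    for umlaut, digraph in EXPANSIONS.items():
        text = text.replace(digraph, umlaut)
    return text

print(expand_umlauts("Herr Schröder"))   # Herr Schroeder
print(naive_contract("Herr Schroeder"))  # Herr Schröder (happens to be right)
print(naive_contract("Dr. Oetker"))      # Dr. Ötker     (wrong: the name is spelled with "Oe")
print(naive_contract("neue"))            # neü           (wrong: "ue" here is not an umlaut)
```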


Is that really legit in German? I know Finnish umlauts (ä, ö) get sometimes mangled to ae or oe, but they are definitely not valid alternative spellings nor are they pronounced even close to similar.


This brings up something that is perhaps easy to forget: the interpretation of diacritics is far from universal. The diacritic in the character 'ä' can be one of two semantically different diacritics. It can be a diaeresis, a diacritic whose function is to mark that the vowel starts a new syllable rather than forming part of a diphthong; or it can be an umlaut, whose purpose is to indicate that it is a different vowel sound altogether. As far as any charset is concerned, though, despite those very different semantics, the two things are the same diacritical mark [1].
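
You can see this from Python's `unicodedata` module: decomposing a word with a diaeresis ("naïve") and a word with an umlaut ("Mädchen") yields the same combining character, U+0308. The helper name `combining_marks` is just for this sketch.

```python
import unicodedata

def combining_marks(word: str) -> set:
    # Decompose to NFD, then collect the names of all combining characters.
    decomposed = unicodedata.normalize("NFD", word)
    return {unicodedata.name(ch) for ch in decomposed if unicodedata.combining(ch)}

print(combining_marks("na\u00efve"))    # {'COMBINING DIAERESIS'}
print(combining_marks("M\u00e4dchen"))  # {'COMBINING DIAERESIS'}
```

Unicode names the mark "diaeresis" in both cases; nothing in the encoding records which of the two functions was meant.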

Even beyond the issue of two concepts sharing the same glyph, the interpretation of the same diacritics across different languages is inconsistent. English tends to drop diacritics, to the point that many people think English doesn't use them; German uses expansion (so ä becomes ae). As you mention, some languages become incomprehensible either way, so the diacritics need to be preserved. And sorting and collation are even more fun!

[1] This does mean that Unicode's insistence that characters represent semantic differences rather than graphical differences can come across as rather arbitrary. The original purpose of Unicode was to unify different character sets together, so it preserves character differentiation that existed in antecedent charsets but tends to otherwise unify characters in practice.


Pretty much the only place you see diaeresis in English is in the New Yorker, whenever they use a word like coördination.


Yeah, completely legit and not uncommon at all (though advances in locale support have probably made it less necessary in recent decades). As an official transliteration, documented and taught in school, it is definitely pronounced the same.

It seems like German could actually be at fault for your bogus transliteration issues in Finnish, then. Sorry about that! 8)


The handling of "ß" was recently changed; "ẞ" has been the official uppercase form of "ß" since 2017 (Unicode has had it since 2008, version 5.1).
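
For what it's worth, the default Unicode case mappings (as exposed by Python's `str` methods) still uppercase "ß" to "SS"; the capital letter only lowercases back to "ß". If you want the capital form you have to ask for it yourself, as in the hypothetical `upper_with_capital_eszett` below.

```python
import unicodedata

CAPITAL_ESZETT = "\u1e9e"  # ẞ

print(unicodedata.name(CAPITAL_ESZETT))  # LATIN CAPITAL LETTER SHARP S
print("ß".upper())                       # SS (default mapping is unchanged)
print(CAPITAL_ESZETT.lower())            # ß
print(CAPITAL_ESZETT.casefold())         # ss

def upper_with_capital_eszett(text: str) -> str:
    # Swap in the capital letter before applying the default uppercasing.
    return text.replace("ß", CAPITAL_ESZETT).upper()

print(upper_with_capital_eszett("Straße"))  # STRAẞE
```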



