While we're having fun with that:

    # Capitalising an eszett changes the string length.
    >>> "straße".upper()
    'STRASSE'
    
    # Without locale-aware casing, a round trip through upper/lower case
    # messes up the dotless i used in Turkic languages.
    >>> 'ı'.upper().lower()
    'i'


Up until a few years ago, the first conversion would have been the official way to write a German word that contains the ligature ß (a lower-case letter) in all caps, because there was no corresponding upper-case letter. However, in 2017, an upper-case variant [1] was introduced into the official German orthography, so the conversion cited here should no longer be necessary.

[1] https://en.wikipedia.org/wiki/Capital_%E1%BA%9E


The upper-case ẞ remains very unconventional, and the Unicode casing algorithm continues to specify upper-case conversion to SS.
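For what it's worth, this is easy to check from Python, whose str methods follow the Unicode casing data. A quick sketch (the variable names are mine, and the results depend on the Unicode version bundled with your interpreter, though these mappings have been stable for a long time):

```python
import unicodedata

small = "\u00df"    # ß, LATIN SMALL LETTER SHARP S
capital = "\u1e9e"  # ẞ, LATIN CAPITAL LETTER SHARP S

# Upper-casing the small letter still goes to "SS", not to the capital ẞ.
print(small.upper())                          # SS
# The capital letter does lower-case back to ß, so the mapping is one-way.
print(capital.lower() == small)               # True
# For caseless comparison, casefold() maps both letters to "ss".
print(small.casefold(), capital.casefold())   # ss ss
print(unicodedata.name(capital))              # LATIN CAPITAL LETTER SHARP S
```

So ẞ exists and lower-cases cleanly, but nothing in the default casing rules ever produces it.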


Correct. This raises the question of what the basis of the Unicode casing algorithm should be: what is commonly practiced by users of a specific language (how to measure this reliably in less clear cases than this one?) or what an official "specification" of the language defines (if such a thing exists, and is widely accepted, especially when the language is spoken in more than one country)?


IMO: the official specification. If it doesn't match common usage after some time, the spec should be revised.

Pretty sure the actual question would be: what if there are multiple conflicting official specs?


> If it doesn't match common usage after some time, the spec should be revised

Well, the problem with any kind of language spec change is that it can take decades until it gets accepted widely.

Young people are the first adopters as they get it force-fed by schools, but getting the age bracket 40+ to adopt is a real challenge. Germany's 1996 project led to years-long battles and partial reversals in 2004/06, and really old people to this day haven't accepted it.


It is extremely unusual, and the addition to Unicode was very controversial.

Basically, some type designers liked to play with capital ß and thought it would be cool to have it included in Unicode. There was a whole campaign for inclusion, and it was a big mistake.

Because even though existing use must be shown to merit inclusion, they only managed to find a single book (a dictionary) printed in more than a few copies. From East Germany. And that book only used the capital ß for a single edition (out of dozens of editions) and reverted straight back to double s. Somehow, that still was enough for Unicode.

Capital ß is a shit show, and it only confuses native speakers, because close to none of them have ever seen that glyph before.

It has no real historic lineage (like the long s, for example, that pretty much nobody under 80 knows, either), it is a faux-retro modern design, an idle plaything.


Unicode attempts to render all human-written documents. That's why U+A66E (ꙮ), a character found once in one single document, is still considered acceptable.

ẞ is not a good capital letter of ß, but if that one single book ever gets transcribed into digital text, a Unicode code point is necessary for that to happen.

I doubt systems that can't deal with the ß -> SS transcription will be able to deal with ẞ in the first place.


It's not as simple as you make it out to be. Klingon script was rejected, partly because of "Lack of evidence of usage in published literature", despite many web sites and more than a handful of published books using it.


ꙮ is a farce like many other Unicode inclusions. Good for a laugh but best forgotten. If you want to encode that level of stylistic choice then you really need to start encoding the specific font used too.

Next you are going to ask Unicode to include all fancy stylized initials which are of course specific to the paragraph in a single book they are used in? At that point just embed SVGs directly in Unicode because the fight for any semantic meaning has been lost.


Well, what can you do... amtlich is amtlich.

And the behavior of a function that converts ß to double upper-case S can certainly be discussed too, if only for the fact that a round-trip toUpper().toLower() will not give you back the original word.


> if only for the fact that a round-trip toUpper().toLower() will not give you back the original word.

It is an inherently destructive round trip, especially in a language that makes excessive use of upper case when compared to English. When you have the word "wagen", did it originally refer to a "car" or did it refer to someone "trying"?


I'm not a German speaker, but this is definitely not exclusive to German, nor is it caused by 'excessive' capitalization. The use of capital letters to denote nouns only helps to disambiguate in German, whereas in English this distinction is never clear just from the spelling of the isolated word alone. E.g. 'contract' can either mean 'an agreement' as a noun or 'to decrease in size' as a verb. There are plenty of other examples.

I agree though that this makes the round-trip inherently destructive.
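To make the loss concrete, here is a tiny check (just a sketch; survives_round_trip is a name I made up for illustration):

```python
def survives_round_trip(word: str) -> bool:
    """True if upper-casing and then lower-casing gives the word back."""
    return word.upper().lower() == word

for word in ["wagen", "Wagen", "straße", "contract"]:
    print(word, "->", word.upper().lower(), survives_round_trip(word))
```

The verb "wagen" comes back intact, the noun "Wagen" loses its capital, and "straße" loses its ß. An English word like "contract" survives only because it carried no case information to begin with.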


Note that the addition to Unicode came first, and the addition to the official German orthography only came nine years later (and might not have happened at all without the Unicode inclusion). In addition, it remains only a variant in the latter, alongside the existing SS.


> Capital ß is a shit show, and it only confuses native speakers, because close to none of them have ever seen that glyph before.

I don't think it's particularly confusing, but it is a great addition. As of 2017 the rules allow both "SS" and "ẞ".

Moreover, the current version of DIN 5008 (Rules for Writing and Design in Text and Information Processing) actually recommends preferring "ẞ" over "SS".


What's confusing for me is that the German government seems to change its mind radically on the question every few years. If I counted properly, there have already been two orthographic changes since I was schooled (German as a second language).


I'm not sure what radical changes you are referring to. There was an orthography reform in 1996 (with a transitional period until 2005). But other than that, only minor changes occurred.


The government isn't involved in orthography. There is an official orthography that public officials must use, but it follows the rules of the relevant non-governmental organizations.


No, the government set the rules in 1996 and modified them in 2006. That was new; before that, orthography was set by convention and common usage.

Ostensibly, the reform only applied to officials in government agencies, but that also included both teachers and students in schools, and today using the old orthography in university exams is marked as an error (although many lecturers don't care and won't mark it up). Just went through it a few months ago.

Today, the new orthography is not optional; at workplaces you will be corrected (and sometimes chastised) for using the old orthography. Just recently a colleague went through a document I wrote and changed every "daß" to "dass". Nothing else was changed.

The ship has sailed, though, since many cohorts of students have gone through it now. I would just like people to be tolerant of what we older people learned in school. I don't want to re-learn everything. Just leave me in peace.


As a conlanger, I much appreciated the addition of a capital ß to Unicode. I use this letter in my conlang, and several words start with it, so it'd be natural to have a capital version of it to start sentences with. I was relieved to learn there is a defined spot for this letter in Unicode.


As a non native German speaker I really don't understand all the fuß around the capitalneSS of ß


Heh, thank goodness I don’t have to deal with all that! This code is ASCII-only because it arose from working on the DNS. There are other protocols that are ASCII-case-insensitive, so it turns up a lot in the hot path of many servers.


This is presumably Rust's u8::to_ascii_lowercase rather than C's tolower, since tolower is locale-sensitive (which you don't care about) and has Undefined Behaviour (because of course it does, it's a C function, and who cares about correctness).

Or possibly u8::make_ascii_lowercase which is the same function but with in-place mutation.
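A rough Python rendering of what an ASCII-only lowercase does, in case it helps. This is my own sketch of the semantics, not the Rust standard library source; the helper name merely mirrors the Rust method:

```python
def to_ascii_lowercase(byte: int) -> int:
    """Map b'A'..b'Z' to b'a'..b'z' and leave every other byte value alone."""
    if 0x41 <= byte <= 0x5A:   # 'A' through 'Z'
        return byte | 0x20     # setting bit 0x20 gives the lowercase letter
    return byte

name = "Straße.Example".encode("utf-8")
lowered = bytes(to_ascii_lowercase(b) for b in name)
print(lowered.decode("utf-8"))   # straße.example
# Python's bytes.lower() is ASCII-only as well, so it agrees.
print(lowered == name.lower())   # True
```

No locale, no tables, and the two UTF-8 bytes of ß pass through untouched.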


There is a difference between strings used internally, usually as IDs, and text entered by a human. For the former you'd always use straight ASCII in 8-bit encoding, for the latter ... things get difficult. A straightforward example is DNS names: they can technically contain almost any Unicode, but that is always converted to a very limited subset of ASCII for actual DNS resolution, which in turn is case-insensitive.

There are of course things like programming languages with case-insensitive identifiers that support all human writing systems in Unicode. If that's what you're dealing with, you have my condolences.


On the wire DNS questions and answers are case preserving but not case sensitive which is important for correctness. DNS was designed a long time ago and doesn't have enough places to hide randomness (~30 bits at most are available) to protect it against a realistic attacker, so, for most names we do a terrible trick in the real world - we randomize the case during querying. This is called DNS-0x20 for obvious reasons.

Suppose a web browser wants to ask news.ycombinator.com AAAA? If bad guys know it's about to ask this (e.g. they use JS to force that lookup whenever they want), they can shove a billion bogus answers (one for every possible random value) onto the wire and have a great chance to trick the browser into accepting one of these answers, which seemingly is to the question it just asked. But if we instead pick random cases, we're asking about, say, NeWS.yCOmbinAtOR.cOM, and we can ignore answers for nEWS.yCOMBINATOR.cOM or news.ycombinator.com or NEWS.YCOMBINATOR.COM or any other spelling. Bad guys now need to do many orders of magnitude more expensive work for the same results.
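A minimal sketch of the idea in Python, assuming the resolver can see the question section echoed back in the answer. The function names are hypothetical, and a real resolver does this on wire-format labels rather than on strings:

```python
import secrets

def randomize_case(name: str) -> str:
    """Flip each ASCII letter with probability 1/2; other characters are untouched."""
    out = []
    for ch in name:
        if ch.isascii() and ch.isalpha() and secrets.randbits(1):
            out.append(ch.swapcase())
        else:
            out.append(ch)
    return "".join(out)

def extra_bits(name: str) -> int:
    """One bit of randomness per ASCII letter in the name."""
    return sum(1 for ch in name if ch.isascii() and ch.isalpha())

def accept_answer(sent: str, echoed: str) -> bool:
    """Accept only if the question echoed back has exactly the case we sent."""
    return sent == echoed

query = randomize_case("news.ycombinator.com")
print(query, extra_bits(query))       # some random spelling, then 18
print(accept_answer(query, query))    # True
```

For this name that is 18 letters, so 18 extra bits on top of what the protocol already offers. Short names with few letters get much less out of it.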


> For the former you'd always use straight ASCII in 8-bit encoding

why is that?

IMO internal strings should be treated as black-box byte arrays, i.e. the specific content does not matter to the software, except for checking equality between two strings. In this case it should not matter to the software whether it is Unicode or whatever.


> There are of course things like programming languages with case-insensitive identifiers that support all human writing systems in Unicode. If that's what you're dealing with, you have my condolences.

Fun times when an upgrade of the Unicode library used by your compiler changes the semantics of your program.


For what it’s worth, with IDNs you’re still going to do a kind of case folding using stringprep before doing the query, and that isn’t really better than the table GP linked. ASCII-only case-insensitive matching is indeed a thing, but it’s usually mutually exclusive with (non-programmer) user-entered data.
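Python's built-in "idna" codec is a handy way to see this, with the caveat that it implements the older IDNA 2003 rules (nameprep, which is a stringprep profile), so IDNA 2008 implementations may behave differently. A small sketch with example names I made up:

```python
# The built-in codec runs nameprep (case folding plus NFKC) on non-ASCII
# labels, then Punycode-encodes whatever is still non-ASCII with an
# "xn--" prefix.
for name in ["bücher.de", "BÜCHER.de", "straße.de"]:
    print(name, "->", name.encode("idna"))
```

The first two names come out identical, and under these older rules ß is folded to ss before the query ever leaves the machine.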


Note for the first example which transforms the German maße to MASSE that the German language has an uppercase Eszett as well: ẞ

This is as of now not widely deployed and few fonts support it, but in theory it is there now.



