Just a small detail that isn't mentioned in the article:
in NFC form, "base characters and modifiers are combined into a single rune whenever possible"
the interesting detail is "whenever possible": since NFC works by first decomposing and then recomposing, there are cases where characters remain decomposed even after NFC normalization
an example is 𝅘𝅥𝅮 (U+1D160), whose NFC form consists of three different code points
I tried to look at the algorithm for generating the composition table, and it seems to be generated from the decomposition table... if that's so, I can't understand how some code points can end up with an NFC form longer than one code point
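A quick way to see this (using Python's stdlib unicodedata here, since Go's normalization support lives in the external go.text packages):

```python
import unicodedata

s = "\U0001D160"  # MUSICAL SYMBOL EIGHTH NOTE
nfc = unicodedata.normalize("NFC", s)

# Even the "composed" form stays decomposed: notehead + stem + flag
print([f"U+{ord(c):04X}" for c in nfc])  # ['U+1D158', 'U+1D165', 'U+1D16E']
```

So NFC of a single code point can legitimately be three code points long.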
1. It's decompose, reorder, compose. So you can see some weird stuff like ḍ̇ = ḋ○̣ → NFD = d○̣○̇ → NFC = ḍ○̇
2. It's not compression, it's normalisation, so it doesn't compose everything it can. I can't tell you the exact algorithm off the top of my head, but:
the reason for U+1D160 is that it's in the CompositionExclusions list.
> When a character with a canonical decomposition is added to Unicode, it must be added to the composition exclusion table if there is at least one character in its decomposition that existed in a previous version of Unicode. If there are no such characters, then it is possible for it to be added or omitted from the composition exclusion table. The choice of whether to do so or not rests upon whether it is generally used in the precomposed form or not.
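Point 1 (decompose, reorder, compose) can be checked directly with Python's unicodedata; this is the same ḋ-plus-dot-below example as above:

```python
import unicodedata

s = "\u1E0B\u0323"  # ḋ (U+1E0B) followed by COMBINING DOT BELOW (U+0323)

# NFD decomposes the ḋ, then reorders: dot below (ccc 220) sorts before
# dot above (ccc 230)
nfd = unicodedata.normalize("NFD", s)
assert nfd == "d\u0323\u0307"

# NFC then recomposes d + dot below into ḍ (U+1E0D); the dot above stays
nfc = unicodedata.normalize("NFC", s)
assert nfc == "\u1E0D\u0307"
```

So the "weird" result is just the canonical ordering step showing through.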
Yeah, I get that. It's just that you might assume the strings functions would operate on character boundaries (as defined in the blog post) and not on runes (code points). Leaky abstractions and all that...
The purpose of the normalization package is to help you work with text under these constraints. I can't imagine many situations where strings.Replace would be sufficient for reliably manipulating natural language. The cafe example is there to demonstrate why you might need the package.
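The cafe problem is easy to reproduce; here's a sketch in Python (the same issue applies to Go's strings.Replace, which also compares code points byte for byte):

```python
import unicodedata

nfc = "caf\u00E9"                        # "café" with precomposed é
nfd = unicodedata.normalize("NFD", nfc)  # "café" with e + COMBINING ACUTE

print(nfc == nfd)                        # False: same text, different bytes
print(nfd.replace("caf\u00E9", "tea"))   # no-op: the needle never matches
```

Normalizing both sides to the same form before searching or replacing is what makes naive string functions usable again.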
I wasn't thinking that I'd really want to pluralize text like this, but maybe you'd want to turn people's names into links in HTML source or something. If someone's name ends with an accent, and if the unicode isn't normalized, strange things are bound to happen. The blog post is great at pointing this out, and it sounds like people are working on a go.text/search package to help, so that's good. I'm not saying Go is broken, just that this kind of stuff can be really surprising.
I don't mean to bash the parent comment, but I find it funny: since the beginning of time, 99.99% of languages have had horrific Unicode support (and 99.999% of programmers haven't got a clue in this area), and then suddenly…
This is useful, for example, to ensure that users can't spoof each other's usernames: simply create and store a skeleton string for each username and keep a unique constraint on it.
That confusables list is a good starting point, although you'll need to make additions, and probably scale back a couple of the over-zealous ones (e.g. rn -> m).
I'm coming at this from a comment spam point of view, not usernames, btw.
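For the curious, a rough sketch of what a UTS #39-style skeleton might look like. The `CONFUSABLES` table and `skeleton` function here are toy stand-ins I made up for illustration; the real confusables.txt mapping is far larger:

```python
import unicodedata

# Toy confusables table (hypothetical); the real UTS #39 data file has
# thousands of entries, including multi-character ones like "rn" -> "m".
CONFUSABLES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A looks like Latin "a"
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE looks like Latin "e"
    "rn": "m",      # the over-zealous pair mentioned above
}

def skeleton(username: str) -> str:
    """Sketch of a skeleton: casefold, decompose, map confusables."""
    s = unicodedata.normalize("NFD", username.casefold())
    for src, dst in CONFUSABLES.items():
        s = s.replace(src, dst)
    return unicodedata.normalize("NFD", s)

# A Cyrillic-а "аdmin" skeletonizes the same as the Latin "admin",
# so the unique constraint on the skeleton catches the spoof.
print(skeleton("\u0430dmin") == skeleton("admin"))  # True
```

Storing the skeleton alongside the display name, with the unique index on the skeleton column, is the usual arrangement.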
Yes and no. The Swiss would write the former; other German-speaking (writing) countries would write the latter. It is incorrect in Germany (after ie, au, eu, ... you must not write ss, unless it's a name, such as the city Neuss).
The upper case of weiß would be WEISS. But it's hard from the upper case WEISS to determine if the lower case is weiss or weiß. (This is why one should never write people's names in bibliographies in small caps.)
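The lossy round trip is easy to demonstrate (Python shown here; its str.upper applies the full Unicode case mapping, which is what expands ß):

```python
s = "weiß"
up = s.upper()      # full case mapping: ß -> SS
print(up)           # WEISS
print(up.lower())   # weiss -- the ß is gone; the round trip is lossy
```

Which is exactly why the upper-cased form can't tell you whether the original was weiss or weiß.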
For a normal ligature, if http://golang.org/src/pkg/unicode/letter_test.go?h=ToLower is anything to go by, the answer to your question is no, but your code works, just not the way you think it does. Which is to say, strings.ToUpper("\u0133") appears to produce "\u0132" as a result.
But \u00DF appears to be a special case, as there's no uppercase for it. If I had to guess, I'd say it should return \u00DF. I mean, if I uppercase "+", do I expect something else back? Doubtful.
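The ligature and the ß really do behave differently, and the reason is the distinction between simple (one-rune-to-one-rune) and full case mappings. Go's per-rune unicode.ToUpper can't expand one rune into two, so ß comes back unchanged there; a language that applies the full mapping expands it. Illustrated with Python, which uses the full mapping:

```python
# ĳ (U+0133) has a simple 1:1 uppercase mapping to Ĳ (U+0132)
print("\u0133".upper() == "\u0132")  # True

# ß (U+00DF) only has a *full* uppercase mapping, to the two letters "SS"
print("\u00DF".upper())              # SS
```

So "no uppercase for it" really means "no single-code-point uppercase for it" (at least until U+1E9E came along).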
• U+00DF (LATIN SMALL LETTER SHARP S ß) NamesList:
= Eszett
• German
• uppercase is "SS"
• in origin a ligature of 017F and 0073
→ (greek small letter beta - 03B2)
→ (latin capital letter sharp s - 1E9E)
("in origin a ligature of 017F and 0073" is not undisputed.)
U+1E9E (LATIN CAPITAL LETTER SHARP S ẞ) is not officially allowed in German orthography.
• U+1E9E (LATIN CAPITAL LETTER SHARP S ẞ) NamesList:
• lowercase is 00DF
→ (latin small letter sharp s - 00DF)
• Designated in Unicode 5.1
more details: http://stackoverflow.com/questions/17897534/can-unicode-nfc-...
does anyone know the cause behind this?