Text normalization in Go (golang.org)
99 points by enneff on Nov 27, 2013 | 26 comments



Just a small detail that isn't mentioned in the article:

in NFC form, "base characters and modifiers are combined into a single rune whenever possible"

the interesting detail is "whenever possible": since NFC works by first decomposing and then recomposing, there are some cases where, even after NFC normalization, the characters remain decomposed

an example is 𝅘𝅥𝅮 (U+1D160), whose normalized composed form is made of 3 different codepoints
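
you can watch it happen with the go.text norm package from the article (nowadays golang.org/x/text/unicode/norm):

  package main

  import (
      "fmt"
      "unicode/utf8"

      "golang.org/x/text/unicode/norm"
  )

  func main() {
      s := "\U0001D160" // MUSICAL SYMBOL EIGHTH NOTE, a single code point
      c := norm.NFC.String(s)
      fmt.Println(utf8.RuneCountInString(s)) // 1
      fmt.Println(utf8.RuneCountInString(c)) // 3: still decomposed after NFC
      for _, r := range c {
          fmt.Printf("U+%X ", r) // U+1D158 U+1D165 U+1D16E
      }
  }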

I tried to look at the algorithm for generating the composition table, and it seems to be generated from the decomposition table... if that's so, I can't understand how some code points can end up with an NFC form longer than one codepoint

more details: http://stackoverflow.com/questions/17897534/can-unicode-nfc-...

does anyone know the cause behind this?


1. It's decompose, reorder, compose. So you can see some weird stuff like ḍ̇ = ḋ + ◌̣ → NFD = d + ◌̣ + ◌̇ → NFC = ḍ + ◌̇ (sketched in code below)

2. It's not compression, it's normalisation. So it doesn't compose everything it can. I can't tell you the exact algorithm off the top of my head, but:

the reason for U+1D160 is that it's in the CompositionExclusions list.
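
Point 1 is easy to see in code, with the same norm package (golang.org/x/text/unicode/norm these days):

  package main

  import (
      "fmt"

      "golang.org/x/text/unicode/norm"
  )

  func main() {
      s := "d\u0307\u0323" // ḋ + combining dot below, marks in "wrong" order
      fmt.Printf("%+q\n", norm.NFD.String(s)) // "d\u0323\u0307": dot below reordered first
      fmt.Printf("%+q\n", norm.NFC.String(s)) // "\u1e0d\u0307": d + dot below compose to ḍ
  }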


Thanks, after looking up CompositionExclusions I discovered the rationale:

http://unicode.org/reports/tr15/#Primary_Exclusion_List_Tabl...

> When a character with a canonical decomposition is added to Unicode, it must be added to the composition exclusion table if there is at least one character in its decomposition that existed in a previous version of Unicode. If there are no such characters, then it is possible for it to be added or omitted from the composition exclusion table. The choice of whether to do so or not rests upon whether it is generally used in the precomposed form or not.


That "café" -> "cafeś" replacement is pretty scary. It looks like the built in strings.Replace function makes the same mistake:

  fmt.Println(strings.Replace("multiple cafe\u0301", "cafe", "cafes", 1)) // multiple cafeś
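
If you NFC the input first, this particular false match goes away (a sketch with golang.org/x/text/unicode/norm; though as the reply below shows, byte-level Replace still isn't semantics-aware):

  package main

  import (
      "fmt"
      "strings"

      "golang.org/x/text/unicode/norm"
  )

  func main() {
      s := norm.NFC.String("multiple cafe\u0301") // é becomes the single rune U+00E9
      fmt.Println(strings.Replace(s, "cafe", "cafes", 1))
      // multiple café: no substring match once the text is composed
  }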


  fmt.Println(strings.Replace("multiple cafeterias", "cafe", "cafes", 1)) // multiple cafesterias


Yeah, I get that. It's just that you might assume that the strings functions would operate on character boundaries (as defined in the blog post) and not based on runes (code points). Leaky abstractions and all that...
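
For what it's worth, the norm package does expose those boundaries, so you can count "characters" instead of runes; something like:

  package main

  import (
      "fmt"
      "unicode/utf8"

      "golang.org/x/text/unicode/norm"
  )

  func main() {
      s := "cafe\u0301"
      fmt.Println(utf8.RuneCountInString(s)) // 5 runes (code points)

      // Count segments between normalization boundaries instead:
      // c, a, f, e+◌́, the "characters" the blog post talks about.
      n := 0
      for t := s; len(t) > 0; {
          d := norm.NFC.NextBoundaryInString(t, true)
          t = t[d:]
          n++
      }
      fmt.Println(n) // 4
  }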


The purpose of the normalization package is to help you work with text under these constraints. I can't imagine many situations where strings.Replace would be sufficient for reliably manipulating natural language. The cafe example demonstrates why you might need the package.


I wasn't thinking that I'd really want to pluralize text like this, but maybe you'd want to turn people's names into links in HTML source or something. If someone's name ends with an accent, and if the unicode isn't normalized, strange things are bound to happen. The blog post is great at pointing this out, and it sounds like people are working on a go.text/search package to help, so that's good. I'm not saying Go is broken, just that this kind of stuff can be really surprising.


Yep, working with natural languages is scary. :-)


I'm not bashing the parent comment, but I find it funny. Since the beginning of time, 99.99% of languages have had horrific Unicode support (and 99.999% of programmers haven't got a clue in this area), and then suddenly...


Looks like this issue is pervasive in other languages as well. Out of curiosity I ran the same test in JavaScript and got the same result.

  s = "We went to eat at multiple cafe\u0301"
  "We went to eat at multiple café"
  s.replace('cafe', 'cafes');
  "We went to eat at multiple cafeś"
The interesting thing is that when the text is copy-pasted, backspacing first deletes the accent. At least in Chrome.


FYI:

Node.js - https://github.com/walling/unorm YMMV, but looks good.

It can also serve as a polyfill for the eventual http://people.mozilla.org/~jorendorff/es6-draft.html#sec-str...


I actually took this a step further a few months back and implemented Unicode's "Skeleton" algorithm: https://github.com/mtibben/tr39-confusables-go

This is useful, for example, for ensuring that users can't spoof each other's usernames: simply create and store a skeleton string for each username, and keep a unique constraint on it (rough sketch below).
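
Something like this (a toy sketch, not this package's actual API or the real TR39 algorithm; the replacer stands in for the full confusables.txt table):

  package main

  import (
      "fmt"
      "strings"
  )

  // A stand-in for a real TR39 skeleton function: fold case, then map
  // confusable sequences to a canonical form. The real table is huge;
  // these three entries are just for illustration.
  var confusables = strings.NewReplacer(
      "\u0430", "a", // CYRILLIC SMALL LETTER A looks like Latin a
      "\u043e", "o", // CYRILLIC SMALL LETTER O looks like Latin o
      "rn", "m",     // the over-zealous mapping mentioned downthread
  )

  func skeleton(username string) string {
      return confusables.Replace(strings.ToLower(username))
  }

  func main() {
      // Store skeleton("admin") with a unique constraint, and the
      // Cyrillic spoof collides at registration time:
      fmt.Println(skeleton("admin") == skeleton("\u0430dmin")) // true
  }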


That confusables list is a good starting point, although you'll need to make additions, and probably scale back a couple of the over-zealous ones (e.g. rn -> m).

I'm coming at this from a comment spam point of view, not usernames, btw.


This document is good, but it doesn't mention the case of ligatures. German's "ß" is a problem, and it is not obvious how Go handles it.

In javascript:

     "ß".toUpperCase().length !== "ß".length;
Does weiss == weiß ?


1) It is a ligature only in the historic sense, so it's not one;

2) Ligatures (e.g. the single codepoint U+FB03 ffi) are deprecated in Unicode;

3) weiss ≠ weiß in any sense

Edit: 4) x.toUpperCase().length ≢ x.length, upcasing can change length;

5) length in JS (and in 100000 other languages) counts code points (at best); it's useful for nothing here


> Does weiss == weiß ?

Yes and no. The Swiss would write the former; other German-speaking (writing) countries would write the latter. It is incorrect in Germany (after ie, au, eu, ... you must not write ss, unless it's a name, such as the city Neuss).

The upper case of weiß would be WEISS. But it's hard to tell from the upper case WEISS whether the lower case is weiss or weiß. (This is why one should never set people's names in bibliographies in small caps.)


Well, toUpperCase() is kind of a broken API. It should be something like "weiß".toUpperCase("de-DE") to distinguish it from "weiß".toUpperCase("de-CH").


You can write the upper case of weiß as WEIß. It is mandatory for taxes and other documents and is recommended by the Post.

Technically, Unicode has a capital sharp s since 5.1.0, so we could write WEIẞ.


Yes, you can do that. But that's evil and ugly (mixing uppercase and lowercase letters that way). I know it has to be done sometimes.

And I am glad that U+1E9E (LATIN CAPITAL LETTER SHARP S) is not an official part of German orthography.


> Does weiss == weiß ?

You need a case folding function/method to check for this.

E.g. in Perl, see the fc function - http://perldoc.perl.org/functions/fc.html

  fc("weiss") eq fc("weiß");   # true


For a normal ligature, if http://golang.org/src/pkg/unicode/letter_test.go?h=ToLower is anything to go by, then no for your question, but yes to your code, just not the way you think it works. Which is to say, strings.ToUpper("\u0133") appears to produce "\u0132" as a result.

But \u00DF appears to be a special case, as there's no uppercase for it. If I had to guess, I'd say it should return \u00DF. I mean, if I uppercase "+", do I expect something else back? Doubtful.
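
That guess is easy to check:

  package main

  import (
      "fmt"
      "strings"
  )

  func main() {
      // unicode.ToUpper applies simple one-rune-to-one-rune mappings,
      // and U+00DF has no simple uppercase mapping, so it passes through.
      fmt.Println(strings.ToUpper("wei\u00df")) // WEIß
      fmt.Println(strings.ToUpper("\u0133"))    // Ĳ (U+0132), the ligature case above
  }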


Unicode tells me:

• Special Casing: Lowercase: 00DF [ß]; Uppercase: 0053 0053 [SS]; Titlecase: 0053 0073 [Ss]

• NamesList: = Eszett • German • uppercase is "SS" • in origin a ligature of 017F and 0073 → (greek small letter beta - 03B2) → (latin capital letter sharp s - 1E9E)

("in origin a ligature of 017F and 0073" is not undisputed)

U+1E9E (LATIN CAPITAL LETTER SHARP S ẞ) is not officially allowed in German orthography • NamesList: • lowercase is 00DF → (latin small letter sharp s - 00DF) • Designated in Unicode 5.1


Not that there's anything wrong with it, but why are there so many HN articles about Go?


Is there an RSS feed for this blog? I didn't find it.


There is an atom feed:

    <link rel="alternate" type="application/atom+xml" title="blog.golang.org - Atom Feed" href="http://blog.golang.org/feed.atom"/>



