Text normalization in Go (golang.org)
99 points by enneff on Nov 27, 2013 | 26 comments



Just a small detail that isn't mentioned in the article:

in NFC form, "base characters and modifiers are combined into a single rune whenever possible"

the interesting detail is "whenever possible": since NFC works by first decomposing and then recomposing, there are some cases where, even after NFC normalization, the characters remain decomposed

an example is 𝅘𝅥𝅮 (U+1D160), whose normalized composed form is made of 3 different codepoints
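
you can watch it happen with the go.text norm package from the article (nowadays golang.org/x/text/unicode/norm):

  package main

  import (
      "fmt"
      "unicode/utf8"

      "golang.org/x/text/unicode/norm"
  )

  func main() {
      s := "\U0001D160" // MUSICAL SYMBOL EIGHTH NOTE, a single code point
      c := norm.NFC.String(s)
      fmt.Println(utf8.RuneCountInString(s)) // 1
      fmt.Println(utf8.RuneCountInString(c)) // 3: still decomposed after NFC
      for _, r := range c {
          fmt.Printf("U+%X ", r) // U+1D158 U+1D165 U+1D16E
      }
  }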

I tried to look at the algorithm for generating the composition table, and it seems to be generated from the decomposition table... if that's so, I can't understand how some code points can end up with an NFC form longer than one codepoint

more details: http://stackoverflow.com/questions/17897534/can-unicode-nfc-...

does anyone know the cause behind this?


1. It's decompose, reorder, compose. So you can see some weird stuff like ḍ̇ = ḋ + ◌̣ → NFD = d + ◌̣ + ◌̇ → NFC = ḍ + ◌̇ (sketched in code below)

2. It's not compression, it's normalisation. So it doesn't compose everything it can. I can't tell you the exact algorithm off the top of my head, but:

the reason for U+1D160 is that it's in the CompositionExclusions list.
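
Point 1 is easy to see in code, with the same norm package (golang.org/x/text/unicode/norm these days):

  package main

  import (
      "fmt"

      "golang.org/x/text/unicode/norm"
  )

  func main() {
      s := "d\u0307\u0323" // ḋ + combining dot below, marks in "wrong" order
      fmt.Printf("%+q\n", norm.NFD.String(s)) // "d\u0323\u0307": dot below reordered first
      fmt.Printf("%+q\n", norm.NFC.String(s)) // "\u1e0d\u0307": d + dot below compose to ḍ
  }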


Thanks, after looking up CompositionExclusions I discovered the rationale:

http://unicode.org/reports/tr15/#Primary_Exclusion_List_Tabl...

> When a character with a canonical decomposition is added to Unicode, it must be added to the composition exclusion table if there is at least one character in its decomposition that existed in a previous version of Unicode. If there are no such characters, then it is possible for it to be added or omitted from the composition exclusion table. The choice of whether to do so or not rests upon whether it is generally used in the precomposed form or not.


That "café" -> "cafeś" replacement is pretty scary. It looks like the built in strings.Replace function makes the same mistake:

  fmt.Println(strings.Replace("multiple cafe\u0301", "cafe", "cafes", 1)) // multiple cafeś
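
If you NFC the input first, this particular false match goes away (a sketch with golang.org/x/text/unicode/norm; though as the reply below shows, byte-level Replace still isn't semantics-aware):

  package main

  import (
      "fmt"
      "strings"

      "golang.org/x/text/unicode/norm"
  )

  func main() {
      s := norm.NFC.String("multiple cafe\u0301") // é becomes the single rune U+00E9
      fmt.Println(strings.Replace(s, "cafe", "cafes", 1))
      // multiple café: no substring match once the text is composed
  }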


  fmt.Println(strings.Replace("multiple cafeterias", "cafe", "cafes", 1)) // multiple cafesterias


Yeah, I get that. It's just that you might assume that the strings functions would operate on character boundaries (as defined in the blog post) and not based on runes (code points). Leaky abstractions and all that...
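
For what it's worth, the norm package does expose those boundaries, so you can count "characters" instead of runes; something like:

  package main

  import (
      "fmt"
      "unicode/utf8"

      "golang.org/x/text/unicode/norm"
  )

  func main() {
      s := "cafe\u0301"
      fmt.Println(utf8.RuneCountInString(s)) // 5 runes (code points)

      // Count segments between normalization boundaries instead:
      // c, a, f, e+◌́, the "characters" the blog post talks about.
      n := 0
      for t := s; len(t) > 0; {
          d := norm.NFC.NextBoundaryInString(t, true)
          t = t[d:]
          n++
      }
      fmt.Println(n) // 4
  }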


The purpose of the normalization package is to help you work with text under these constraints. I can't imagine many situations where strings.Replace would be sufficient for reliably manipulating natural language. The cafe example demonstrates why you might need the package.


I wasn't thinking that I'd really want to pluralize text like this, but maybe you'd want to turn people's names into links in HTML source or something. If someone's name ends with an accent, and if the unicode isn't normalized, strange things are bound to happen. The blog post is great at pointing this out, and it sounds like people are working on a go.text/search package to help, so that's good. I'm not saying Go is broken, just that this kind of stuff can be really surprising.


Yep, working with natural languages is scary. :-)


I'm not bashing the parent comment, but I find it funny. Since the beginning of time, 99.99% of languages have had horrific Unicode support (and 99.999% of programmers haven't got a clue in this area), and then suddenly...


Looks like this issue is pervasive in other languages as well. Out of curiosity I ran the same test in JavaScript and got the same result.

  s = "We went to eat at multiple cafe\u0301"
  "We went to eat at multiple café"
  s.replace('cafe', 'cafes');
  "We went to eat at multiple cafeś"
The interesting thing is that when the text is copy-pasted, backspacing first deletes the accent. At least in Chrome.


FYI:

Node.js - https://github.com/walling/unorm YMMV, but looks good.

It can also serve as a polyfill for the eventual http://people.mozilla.org/~jorendorff/es6-draft.html#sec-str...


I actually took this a step further a few months back and implemented Unicode's "Skeleton" algorithm: https://github.com/mtibben/tr39-confusables-go

This is useful, for example, for ensuring that users can't spoof each other's usernames: simply create and store a skeleton string for each username, and keep a unique constraint on it (rough sketch below).
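
Something like this (a toy sketch, not this package's actual API or the real TR39 algorithm; the replacer stands in for the full confusables.txt table):

  package main

  import (
      "fmt"
      "strings"
  )

  // A stand-in for a real TR39 skeleton function: fold case, then map
  // confusable sequences to a canonical form. The real table is huge;
  // these three entries are just for illustration.
  var confusables = strings.NewReplacer(
      "\u0430", "a", // CYRILLIC SMALL LETTER A looks like Latin a
      "\u043e", "o", // CYRILLIC SMALL LETTER O looks like Latin o
      "rn", "m",     // the over-zealous mapping mentioned downthread
  )

  func skeleton(username string) string {
      return confusables.Replace(strings.ToLower(username))
  }

  func main() {
      // Store skeleton("admin") with a unique constraint, and the
      // Cyrillic spoof collides at registration time:
      fmt.Println(skeleton("admin") == skeleton("\u0430dmin")) // true
  }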


That confusables list is a good starting point, although you'll need to make additions, and probably scale back a couple of the over-zealous ones (e.g. rn -> m).

I'm coming at this from a comment spam point of view, not usernames, btw.


This document is good, but it doesn't mention the case of ligatures. German's "ß" is a problem, and it is not obvious how Go handles it.

In javascript:

     "ß".toUpperCase().length !== "ß".length;
Does weiss == weiß ?


1) It is a ligature only in the historic sense, so it's not one;

2) Ligatures (e.g. the single codepoint U+FB03 ffi) are deprecated in Unicode;

3) weiss ≠ weiß in any sense

Edit: 4) x.toUpperCase().length ≢ x.length, upcasing can change length;

5) length in JS (and in 100000 other languages) counts code points (at best); it's useful for nothing here


> Does weiss == weiß ?

Yes and no. The Swiss would write the former; other German-speaking (writing) countries would write the latter. It is incorrect in Germany (after ie, au, eu, ... you must not write ss, unless it's a name, such as the city Neuss).

The upper case of weiß would be WEISS. But it's hard to tell from the upper case WEISS whether the lower case is weiss or weiß. (This is why one should never set people's names in bibliographies in small caps.)


Well, toUpperCase() is kind of a broken API. It should be something like "weiß".toUpperCase("de-DE") to distinguish it from "weiß".toUpperCase("de-CH").


You can write the upper case of weiß as WEIß. It is mandatory for taxes and other documents and is recommended by the Post.

Technically, Unicode has a capital sharp s since 5.1.0, so we could write WEIẞ.


Yes, you can do that. But that's evil and ugly (mixing uppercase and lowercase letters that way). I know it has to be done sometimes.

And I am glad that U+1E9E (LATIN CAPITAL LETTER SHARP S) is not an official part of German orthography.


> Does weiss == weiß ?

You need a case folding function/method to check for this.

E.g. in Perl, see the fc function - http://perldoc.perl.org/functions/fc.html

  fc("weiss") eq fc("weiß");   # true


For a normal ligature, if http://golang.org/src/pkg/unicode/letter_test.go?h=ToLower is anything to go by, then no for your question, but yes to your code, just not the way you think it works. Which is to say, strings.ToUpper("\u0133") appears to produce "\u0132" as a result.

But \u00DF appears to be a special case, as there's no uppercase for it. If I had to guess, I'd say it should return \u00DF. I mean, if I uppercase "+", do I expect something else back? Doubtful.
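
That guess is easy to check:

  package main

  import (
      "fmt"
      "strings"
  )

  func main() {
      // unicode.ToUpper applies simple one-rune-to-one-rune mappings,
      // and U+00DF has no simple uppercase mapping, so it passes through.
      fmt.Println(strings.ToUpper("wei\u00df")) // WEIß
      fmt.Println(strings.ToUpper("\u0133"))    // Ĳ (U+0132), the ligature case above
  }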


Unicode tells me:

• Special Casing: Lowercase: 00DF [ß]; Uppercase: 0053 0053 [SS]; Titlecase: 0053 0073 [Ss]

• NamesList: = Eszett • German • uppercase is "SS" • in origin a ligature of 017F and 0073 → (greek small letter beta - 03B2) → (latin capital letter sharp s - 1E9E)

("in origin a ligature of 017F and 0073" is not undisputed)

U+1E9E (LATIN CAPITAL LETTER SHARP S ẞ) is not officially allowed in German orthography • NamesList: • lowercase is 00DF → (latin small letter sharp s - 00DF) • Designated in Unicode 5.1


Not that there's anything wrong with it, but why are there so many HN articles about Go?


Is there an RSS feed for this blog? I didn't find it.


There is an atom feed:

    <link rel="alternate" type="application/atom+xml" title="blog.golang.org - Atom Feed" href="http://blog.golang.org/feed.atom"/>



