The working group came to the same conclusion. Whatever crap the client puts in the packet is inviolate. Don't change it. You're not smart enough to do that.
Instead, if you need to do comparisons, compare on the normalized strings. You create a "normalized" version of the string, and then use that for comparisons.
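A minimal sketch of "compare normalized, keep the original" in Python, using the stdlib unicodedata module. NFC is chosen purely for illustration; a real PRECIS profile prescribes its own preparation and normalization steps.

```python
import unicodedata

def equivalent(a: str, b: str) -> bool:
    """Compare two strings by their NFC-normalized forms.

    The originals are left untouched; only the comparison
    uses the normalized copies.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# "e with acute" as one code point vs. "e" + combining acute accent:
print(equivalent("\u00e9", "e\u0301"))  # True
```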
I wish the IETF precis working group had a "recommendations" document. It would make everyone's lives much simpler.
Having the luxury of only watching Unicode from afar, it looks like a right horror show. As a mathematician I respect how much harder it can be to work over equivalence classes (especially when you don't have obvious normal forms) than to work over mere instances. I am not sure all the Unicode players really appreciate this (though I definitely agree: you can't safely re-normalize externally supplied strings, you do have to work with equivalence classes at some point).
What I keep hoping is the Unicode committees will eventually pile on one bad idea too many and make string comparison as hard as a general instance of the Post correspondence problem http://en.wikipedia.org/wiki/Post_correspondence_problem and thus undecidable. They are presumably not there yet (due to unambiguity, reading direction, and bounded length), but I keep hoping.
The idea is that many characters in Unicode have more than one equivalent representation (and these representations can have different lengths). So, roughly, checking whether two sequences of code points represent the same string becomes a grouping problem:
For the sequence of integers 1...t, define an ordered sequence partition P as a sequence of contiguous sets of integers (P1,...,Ph) such that:
1) Each Pi is a contiguous sequence of integers u...v with 1<=u<=v<=t.
2) Union_i Pi = {1,...,t}
3) max(Pi) < min(Pj) for all 1<=i<j<=h.
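The three conditions above can be sketched directly as a check on a candidate partition given as a list of (u, v) ranges (an illustrative helper, not from the original text):

```python
def is_ordered_partition(parts, t):
    """True iff parts = [(u1,v1),...,(uh,vh)] is an ordered sequence
    partition of 1..t: contiguous ranges, exactly covering 1..t, in order."""
    covered = []
    for u, v in parts:
        if not (1 <= u <= v <= t):          # condition 1
            return False
        covered.extend(range(u, v + 1))
    # Conditions 2 and 3 together: an exact cover, in increasing order.
    return covered == list(range(1, t + 1))

print(is_ordered_partition([(1, 2), (3, 3)], 3))  # True
print(is_ordered_partition([(3, 3), (1, 2)], 3))  # False: out of order
```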
Check: for two code point sequences
a1,a2,...,am
and b1,b2,...,bn,
can the sequences of integers 1...m and 1...n be ordered-sequence-partitioned into A=(A1,...,Au) and B=(B1,...,Bu) such that, for each i=1...u, the code point groups a_{Ai} and b_{Bi} represent the same Unicode character?
What stops this from encoding generally hard (NP-hard) problems is the lack of ambiguity in the code-point -> character dictionaries, which ensures there is only one partition of each code point sequence in which every group forms a valid character. Thus we can find the unique partition of each code point sequence with a single left-to-right read, and then we just check whether the partitions match.
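A toy sketch of that left-to-right read, assuming a hypothetical (made-up) code dictionary mapping code-point groups to abstract characters. Unambiguity means that at each position exactly one group can match, so the greedy read recovers the unique partition:

```python
# Hypothetical dictionary: code-point groups -> abstract characters.
TOY_DICT = {
    (1,): "a",
    (2, 3): "b",   # "b" encoded as the two code points 2,3
    (4,): "b",     # an alternative one-code-point encoding of "b"
}

def decode(seq):
    """Greedily partition seq into dictionary groups, left to right.

    Returns the character sequence, or None if some position matches
    no group. Unambiguity of the dictionary makes the partition unique.
    """
    out, i = [], 0
    while i < len(seq):
        for group, char in TOY_DICT.items():
            if tuple(seq[i:i + len(group)]) == group:
                out.append(char)
                i += len(group)
                break
        else:
            return None
    return out

def same_string(xs, ys):
    """Two code point sequences are equivalent iff both decode
    and their decoded character sequences match."""
    return decode(xs) is not None and decode(xs) == decode(ys)

print(same_string([1, 2, 3], [1, 4]))  # True: both decode to ["a", "b"]
```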
So all we have to do to successfully encode NP-hard problems is trick the standards committee into introducing a sufficient number of ambiguous groupings (things like code-points "a1 a2 a3 a4" such that both "(a1 a2 a3) (a4)" and "(a1 a2) (a3 a4)" are valid groupings). This will kill the obvious comparison algorithms, and with some luck we get a code dictionary that will allow us to encode NP-hard problems as string equivalence.
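What ambiguity costs can be seen with a small made-up dictionary (a three-code-point analogue of the "a1 a2 a3 a4" example above) in which one sequence groups two different ways. Once several partitions are valid, the greedy left-to-right read no longer suffices and comparison becomes a search over groupings:

```python
# Hypothetical ambiguous dictionary: [1, 2, 3] groups as
# (1 2)(3) -> "x","y"  or as  (1)(2 3) -> "y","x".
AMBIG_DICT = {
    (1, 2): "x",
    (3,): "y",
    (1,): "y",
    (2, 3): "x",
}

def all_decodings(seq, i=0):
    """Return the character sequence of every valid partition of seq
    (exponential in the worst case -- that is the point)."""
    if i == len(seq):
        return [[]]
    results = []
    for group, char in AMBIG_DICT.items():
        if tuple(seq[i:i + len(group)]) == group:
            for rest in all_decodings(seq, i + len(group)):
                results.append([char] + rest)
    return results

# [1, 2, 3] now has two distinct decodings: ["x","y"] and ["y","x"].
print(all_decodings([1, 2, 3]))
```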
To get undecidable problems we just have to trick the committee into introducing enough "fix insertions", where "a1 a2 a3 a4" can be grouped into "(a1 x1 a2) (a3 a4 x2 x3)" by the insertion of the implied or "fix" code-points x1, x2, x3. Then, with some luck, we could build a code dictionary that encodes instances of the Post correspondence problem, making Unicode string comparison Turing complete and thus undecidable. (I need to check whether a single fixed code-book is enough, or whether we need some trick to allow in many different code-books.)
So I think all we need is some clever design (to actually get a dangerously expressive encoding, not just the suspicion there is one) and to stand up a stooge linguist or anthropologist to claim a few additional "harmless lexical equivalences" are needed to faithfully represent a few more languages.
----
What would I have liked instead of Unicode? Some sort of tagged string-region implementation. Each string-region would have to specify which lexical dictionary (and which version of the dictionary) it follows, and only one language/lexical context would be used for the whole string-region. So each section you are working with has a single set of rules meant to represent only one language (instead of hoping for a universal representation). Some care has to be taken that the region markers can never be confused with in-region content, to ensure we can identify the region boundaries before really looking into any region (perhaps indicating that region designations should not be in the string itself, much like when we added lengths to strings instead of relying on C-style zero termination).
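A minimal sketch of that tagged string-region idea, using hypothetical names throughout. Each region carries its lexical dictionary and version out-of-band, in the structure rather than in the character stream, so region boundaries are known before any region's contents are inspected (much like length-prefixed strings versus zero-terminated ones):

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class StringRegion:
    dictionary: str               # e.g. "latin-core" -- hypothetical name
    version: str                  # version of the lexical dictionary
    code_points: Tuple[int, ...]  # the raw, inviolate payload

def regions_summary(s: List[StringRegion]) -> List[str]:
    """Walk the region boundaries without interpreting any payload."""
    return [f"{r.dictionary}@{r.version}: {len(r.code_points)} code points"
            for r in s]

# A "string" is just a sequence of regions, each under one set of rules.
tagged = [
    StringRegion("latin-core", "1.0", (0x48, 0x69)),
    StringRegion("greek-core", "2.1", (0x3B1,)),
]
print(regions_summary(tagged))
```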
I spent the last 5 years trying to get i18n working in a protocol:
https://tools.ietf.org/html/draft-ietf-radext-nai-15