Never understood that either. Why is this so rare?
Even in technical discussions like this one, some people will look at you funny upon hearing this suggestion.
Navigating a UTF-8 string at the codepoint level is a fairly simple algorithm, since UTF-8 is self-synchronizing. This means it can easily be done without relying on external libraries or data files. It's also stable across Unicode versions--it always produces the same result regardless of which version of the Unicode tables you use.
Moving to grapheme cluster boundaries means that the algorithm may work incorrectly if you feed a string containing Unicode N+1 characters to an implementation that only supports Unicode N. It also makes the "increment character" function very complicated. In the codepoint-based UTF-8 version, it looks roughly like:
#include <stdint.h>

char *advance(char *str) {
    uint8_t c = (uint8_t)*str;
    /* Count the number of leading 1's in the byte; mask to 8 bits so the
       upper bits of the promoted int don't throw off the count */
    int num1s = __builtin_clz(~(uint32_t)c & 0xFF) - 24;
    if (num1s == 0) return str + 1;  /* ASCII byte */
    return str + num1s;              /* the lead byte encodes the sequence length */
}
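Going backwards is just as simple, and it's exactly the self-synchronizing property that makes it so: continuation bytes always match 10xxxxxx, so you can step back until you hit a byte that doesn't. A minimal sketch (the function name is mine, not from any library):

char *retreat(char *str) {
    do {
        str--;  /* assumes we're not already at the start of the buffer */
    } while (((uint8_t)*str & 0xC0) == 0x80);  /* skip continuation bytes (10xxxxxx) */
    return str;
}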
Grapheme-based indexing looks like this:
char *advance_grapheme(char *str) {
    for (;;) {
        uint32_t codepoint = read_codepoint(str);
        str = advance(str);
        uint32_t nextCodepoint = read_codepoint(str);
        /* The property lookup is typically a two-stage table, something like
           table[table2[codepoint >> 4] * 16 + (codepoint & 15)]; */
        GraphemeClusterBreak left = lookupProp(codepoint);
        GraphemeClusterBreak right = lookupProp(nextCodepoint);
        /* Apply the rules from UTR #29 to the (left, right) pair; stop once a
           boundary falls between the two codepoints */
        if (is_boundary(left, right))  /* placeholder for those rules */
            break;
    }
    return str;
}
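For the curious, the two-stage table mentioned in the comment is roughly this shape; the table names, block size, and contents below are placeholders for illustration--real tables are generated from the Unicode character database:

/* stage1 maps each block of 16 codepoints to a block index; stage2 holds the
   per-codepoint Grapheme_Cluster_Break properties for each distinct block.
   All values here are placeholders, not real UCD data. */
enum GraphemeClusterBreak { GCB_Other, GCB_Extend, GCB_ZWJ /* , ... */ };

static const uint8_t stage1[0x110000 / 16] = { 0 /* generated */ };
static const uint8_t stage2[][16]          = { { 0 } /* generated */ };

static enum GraphemeClusterBreak lookupProp(uint32_t codepoint) {
    return (enum GraphemeClusterBreak)
        stage2[stage1[codepoint >> 4]][codepoint & 15];
}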
See the vast difference between the two implementations? It's a lot of complexity, and it's worth asking whether that complexity needs to be built into the core library (strings are a fundamental datatype in any language). It's also questionable whether such a feature, implemented by default, would actually fix naive programmers' code--if you read UTR #29 carefully, you'll notice that something like क्ष consists of two grapheme clusters (क् and ष), which is arguably incorrect. Internationalization is often tied heavily to the GUI, and, especially for problems like grapheme clusters, it arguably makes more sense for toolkits to implement and deal with the problems themselves and provide programmers with primitives like a "text input widget", rather than encouraging users to try to implement it themselves.
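To make the क्ष example concrete, these are the codepoints and break properties involved (the array name below is just for illustration):

/* क्ष is three codepoints; UTR #29's default rules place a boundary before the
   ष, because no rule joins an Extend character to the following letter:
     U+0915 DEVANAGARI LETTER KA     Grapheme_Cluster_Break = Other
     U+094D DEVANAGARI SIGN VIRAMA   Grapheme_Cluster_Break = Extend
     -- boundary here: क् | ष --
     U+0937 DEVANAGARI LETTER SSA    Grapheme_Cluster_Break = Other  */
static const uint32_t kssa[] = { 0x0915, 0x094D, 0x0937 };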
History has shown that, when it comes to strings, developers have a hard time getting even something as simple as null-termination correct. If grapheme handling is complex, that's an argument for having it implemented by a small team of experts exactly once. The resulting abstraction might not be leak-proof, but then no abstraction is.
> (strings are a fundamental datatype in any language)
(Probably a bit "unfair" to "pounce" on an off-hand parenthetical like this, but I'm in a bit of a pedantic mood...)
This is not true for e.g. Haskell, where String is defined as [Char], i.e. a list of characters. (Of course the Haskell community is suffering from that decision, but that's another story.)
I'm not sure why strings would need to be a fundamental type, though. Sure, they would probably be part of the standard library for almost all languages, but they don't need to be "magical" in the way most fundamental types (int, etc.) are.