I don't understand the argument for "characters = grapheme clusters". From my pe...

Avernar · on June 20, 2016

We're pretty much on the same page here. When you want to slice a string (because you can only display or store a certain amount) or you want to do text selection and other cursor operations you can't so it by code point. That's where you want to break at character boundries which are graphemes or grapheme clusters.

For parsing it's easier to just scan for a byte sequence in UTF-8 because you know what you're looking for ahead of time. If you're looking for a matching quote, brace, etc. you just need to scan for a single byte in your text stream. Adding a smart iterator to the process that moves to the start of each code unit is not necessary and will slow things way down.

I just gave JSON and XML as examples and not an exhaustive list. If you know the code points you are scanning for it's way more effecient to scan for their code units. The state machine in a paraer will be operating at the byte level anyways.

I have yet to see a good example where processing/iterating by code point is the better choice (other than the grapheme code of the unicode library).

dietrichepp · on June 20, 2016

I'm not convinced that state machines will operate at the byte level. First of all, not all tokenizers are written using state machines. Even if that is the mathematical language we use to talk about parsers, it's still relatively common to make hand-written parsers. Secondly, if you take a Unicode-specified language and convert it to a state machine that operates on UTF-8, you can easily end up with an explosion in the number of possible states. Remember, this trick doesn't really change the size of the transition table, it just spreads it out among more states. On the other hand, you can get a lot more mileage out of using equivalency classes, as long as you're using something sensible like code points to begin with.

If you're curious, here's the V8 tokenizer header file:

https://github.com/v8/v8/blob/master/src/parsing/scanner.h

You can see that it works on an underlying UTF-16 code unit stream which is then composed into code points before tokenization. This extra step with UTF-16 is a quirk of JavaScript.

If you think that V8 shouldn't be processing by code point, feel free to explain that to them.

Avernar · on June 20, 2016

State machines would have to operate on the byte level. Otherwise each state would have to have have 65536 entries per state. The trick to handle UTF-8 would be to have 0-127 run as a state machine and > 127 break out to functions to handle the various unicode ranges that are valid for identifiers.

For languages that only allow non ascii in string literals a pure state machine would suffice.

Not sure why you're mentioning parsers. At that point you you're dealing with tokens.

As for UTF-16 it's an ugly hack that never should have existed in the first place. Unfortunately the unicode people had to fix their UCS-2 mistake.

Since Javascript is standardised to be either UCS-2 or UTF-16 it probably made sense to make the scanner use UTF-16.

dietrichepp · on June 22, 2016

State machines don't have to operate on the byte level because the tables can use equivalency classes. This will often result in smaller and faster state machines than byte-level state machines, if your language uses Unicode character classes here and there.

Avernar · on June 20, 2016

Looks like Javascript source code is required to be processed as UTF-16:

ECMAScript source text is represented as a sequence of characters in the Unicode character encoding, version 3.0 or later. [...] ECMAScript source text is assumed to be a sequence of 16-bit code units for the purposes of this specification. [...] If an actual source text is encoded in a form other than 16-bit code units it must be processed as if it was first converted to UTF-16.

dietrichepp · on June 22, 2016

Right, but the UTF-16 is read code point by code point, not code unit by code unit. At that point, it might as well be UTF-8 or UTF-32.

ridiculous_fish · on June 20, 2016

Walking over grapheme clusters is common in UIs, e.g. visually truncating a string.

Really I think you are arguing against the notion of "default iteration" altogether. As you say, the right type of iteration is context dependent, and it ought to be made explicit.

dietrichepp · on June 20, 2016

I'm not sure it is so common in UIs. Truncation is done by a single library function, so that's one case where it's used. Another case is for character wrapping, but that's fairly uncommon. I'm having trouble coming up with another case where it's used. Font shaping is done by a font shaping engine, which applies an enormous number of rules specific to the script in use. Text in a text editor isn't deleted according to grapheme cluster boundaries, and the text cursor doesn't fall on grapheme cluster boundaries either. These are all rules that change according to the script in use.