> *Folks start implying that code points mean something, and that O(1) indexing ...

chrismorgan · on June 26, 2022

I think you’ve misunderstood Manish’s point (which is made clearer later in the article—the wording of the sentence you quoted is poor).

Manish’s contention is that there is no virtue whatsoever in O(1) access to the nth code point, because code points aren’t a useful abstraction.

torstenvl · on July 6, 2022

I believe I understand his point; I just think he overstated his case. Consider the two forms of the following argument against UTF-32/UCS-4:

A: Although reasoning about code points is more complex when dealing with UTF-8, the complexity can usually be remedied by an inner loop that steps over code units, which is very simple in UTF-8, as (ch & 0b11000000) != 0b10000000 being true indicates the first byte of a code point (every continuation byte has the form 10xxxxxx). For this reason, the downside is limited and the upside - not having to convert to/from UTF-8, the most common encoding for Unicode text at rest - is significant.

B: There is no downside to working in UTF-8. If you find it easier to reason about UTF-32/UCS-4, you're just wrong. There's no virtue whatsoever in doing it that way.

Manish seems to be arguing B, and I simply can't agree. A strikes me as true enough without having to exaggerate like that.
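
For concreteness, the boundary test that argument A relies on might look like the sketch below. It assumes the input is already valid UTF-8, where every continuation byte has the bit pattern 10xxxxxx and every other byte starts a code point. The names `is_code_point_start` and `code_point_offsets` are made up for illustration.

```python
def is_code_point_start(byte: int) -> bool:
    """True for ASCII bytes and lead bytes, False for continuation bytes."""
    # Continuation bytes are exactly those whose top two bits are 10.
    return (byte & 0b11000000) != 0b10000000


def code_point_offsets(data: bytes) -> list[int]:
    """Byte offsets at which each code point begins (assumes valid UTF-8)."""
    return [i for i, b in enumerate(data) if is_code_point_start(b)]


text = "aé€".encode("utf-8")          # 1-byte, 2-byte and 3-byte code points
offsets = code_point_offsets(text)    # [0, 1, 3]
```

Finding the nth code point this way is a linear scan rather than O(1), which is the trade-off the two arguments disagree about.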

gaganyaan · on June 26, 2022

> The main time you want to be able to index by code point is if you’re implementing algorithms defined in the unicode spec that operate on unicode strings (casefolding, segmentation, NFD/NFC).

> But for application logic, dealing with code points doesn’t really make sense.

I think you're agreeing with him?

morelisp · on June 26, 2022

> if string slicing requires anything more than slicing at code point boundaries, you're going to need fast indexing to code points.

Why? I can safely slice anywhere in UTF-8 code units, and align to the next (or previous) grapheme cluster just as easily. And this is actually useful if I have some storage limitations, unlike codepoints.
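
A rough sketch of the storage-limit case: cut at an arbitrary byte offset, then back up to the nearest boundary. This version only aligns to code point boundaries and assumes valid UTF-8; aligning to grapheme clusters would additionally need the Unicode segmentation rules, which the Python standard library does not provide. `truncate_utf8` is a hypothetical helper name.

```python
def truncate_utf8(data: bytes, limit: int) -> bytes:
    """Longest prefix of at most `limit` bytes that ends on a code point boundary."""
    if limit >= len(data):
        return data
    cut = max(limit, 0)
    # If the byte just past the cut is a continuation byte (10xxxxxx),
    # the cut falls inside a code point, so back up to its lead byte.
    while cut > 0 and (data[cut] & 0b11000000) == 0b10000000:
        cut -= 1
    return data[:cut]


data = "aé€".encode("utf-8")      # 6 bytes in total
prefix = truncate_utf8(data, 5)   # the 3-byte "€" does not fit, so it is dropped
```

The loop backs up at most three bytes, so the cost does not depend on how long the string is.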

marcosdumay · on June 26, 2022

> without indexing into the code points to find their combining classes.

Did you mean without scanning the code points? There isn't a need to index them if you don't want them (and why would one ever want them?).
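
To illustrate scanning as opposed to indexing, here is a minimal sketch that walks UTF-8 input once, front to back, and reports each code point's canonical combining class. It uses Python's `codecs` incremental decoder and `unicodedata.combining`; the function name `scan_combining_classes` and the chunked-input shape are assumptions made for the example.

```python
import codecs
import unicodedata


def scan_combining_classes(chunks):
    """Yield (code point, combining class) pairs in a single forward pass."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    for chunk in chunks:
        # The decoder buffers a trailing partial sequence until the next chunk.
        for ch in decoder.decode(chunk):
            yield ch, unicodedata.combining(ch)
    # Flush; raises UnicodeDecodeError if the input ended mid-sequence.
    for ch in decoder.decode(b"", final=True):
        yield ch, unicodedata.combining(ch)


# "e" followed by U+0301 COMBINING ACUTE ACCENT, split in the middle of the accent.
pairs = list(scan_combining_classes([b"e\xcc", b"\x81"]))
```

Nothing here ever asks for "the nth code point"; each one is looked at as the scan passes over it.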