
> Folks start implying that code points mean something, and that O(1) indexing or slicing at code point boundaries is a useful operation.

It isn't possible to determine where to slice a string at grapheme cluster boundaries without indexing into the code points to find their combining classes.

The author's point here is self-contradictory, because if string slicing requires anything more than slicing at code point boundaries, you're going to need fast indexing to code points.

What the author appears to want to argue is that the trade-offs are worth it. However, it's possible to make that argument without the bad faith claim that there's no value to the other perspective.




I think you’ve misunderstood Manish’s point (which is made clearer later in the article—the wording of the sentence you quoted is poor).

Manish’s contention is that there is no virtue whatsoever in O(1) access to the nth code point, because code points aren’t a useful abstraction.


I believe I understand his point; I just think he overstated his case. Consider two forms of the same argument against UTF-32/UCS-4:

A: Although reasoning about code points is more complex when dealing with UTF-8, the complexity can usually be remedied by an inner loop that iterates over code units, which is very simple in UTF-8, as (ch & 0b01000000) || !(ch & 0b10000000) being true indicates the first byte of a code point. For this reason, the downside is limited and the upside - not having to convert to/from UTF-8, the most common encoding for Unicode text at rest - is significant.

B: There is no downside to working in UTF-8. If you find it easier to reason about UTF-32/UCS-4, you're just wrong. There's no virtue whatsoever in doing it that way.

Manish seems to be arguing B, and I simply can't agree. A strikes me as true enough without having to exaggerate like that.
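
To make the inner loop in A concrete, here is a minimal sketch in Python. It assumes the input is already valid UTF-8, and the helper names (is_lead_byte, code_point_offsets) are made up for illustration. A byte starts a code point unless it has the form 0b10xxxxxx (a continuation byte).

```python
def is_lead_byte(b: int) -> bool:
    # Continuation bytes look like 0b10xxxxxx; any other byte starts a code point.
    return (b & 0b11000000) != 0b10000000


def code_point_offsets(data: bytes) -> list[int]:
    # Linear scan over the code units: byte offset of each code point's first byte.
    # Assumes `data` is valid UTF-8 (no validation is done here).
    return [i for i, b in enumerate(data) if is_lead_byte(b)]


sample = "a\u00e9\u20ac".encode("utf-8")  # 1-byte, 2-byte and 3-byte code points
offsets = code_point_offsets(sample)      # one entry per code point
```

Finding the nth code point this way is O(n) rather than O(1), which is the trade-off being argued about.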


> The main time you want to be able to index by code point is if you’re implementing algorithms defined in the unicode spec that operate on unicode strings (casefolding, segmentation, NFD/NFC).

> But for application logic, dealing with code points doesn’t really make sense.

I think you're agreeing with him?


> if string slicing requires anything more than slicing at code point boundaries, you're going to need fast indexing to code points.

Why? I can safely slice anywhere in UTF-8 code units, and align to the next (or previous) grapheme cluster just as easily. And this is actually useful if I have some storage limitations, unlike slicing by code points.
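
A rough sketch of what that looks like for a byte budget, in Python. It assumes valid UTF-8 input and a non-negative limit, and the function names are hypothetical. The second function only approximates grapheme clusters by using nonzero canonical combining classes; real segmentation follows UAX #29 and also has to handle things like ZWJ sequences, variation selectors and Hangul, which this does not.

```python
import unicodedata


def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    # Cut to at most max_bytes without splitting a code point.
    if len(data) <= max_bytes:
        return data
    cut = max_bytes
    # Step back while the first dropped byte is a continuation byte (0b10xxxxxx).
    while cut > 0 and (data[cut] & 0b11000000) == 0b10000000:
        cut -= 1
    return data[:cut]


def truncate_keep_marks(data: bytes, max_bytes: int) -> bytes:
    # Approximation only: treats "base + marks with nonzero combining class"
    # as one unit. This is NOT full UAX #29 grapheme cluster segmentation.
    head = truncate_utf8(data, max_bytes)
    rest = data[len(head):].decode("utf-8")
    text = head.decode("utf-8")
    # If the first dropped code point is a combining mark, the cut fell inside
    # a base+mark sequence, so drop the kept marks and their base as well.
    if rest and unicodedata.combining(rest[0]) != 0:
        while text and unicodedata.combining(text[-1]) != 0:
            text = text[:-1]
        text = text[:-1]
    return text.encode("utf-8")
```

Both functions only scan backwards from the cut point; neither needs to know which code point number it is looking at.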


> without indexing into the code points to find their combining classes.

Did you mean without scanning the code points? There isn't a need to index them if you don't want them (and why would one ever want them?)



