
This is misleading. UTF-16 doesn't actually provide random access, because codepoints outside the basic multilingual plane are encoded with two UTF-16 code units (4 bytes). UTF-32 guarantees 4 bytes for every codepoint, which is quite wasteful, but even then, random access by codepoint is generally a bad idea because codepoints and graphemes aren't synonymous.
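To make the code unit vs. codepoint distinction concrete, here's a rough Python sketch. The helper names (`utf16_units`, `utf32_units`) and the sample strings are mine, purely for illustration. It counts how many fixed-size units each encoding needs for a BMP character, a character outside the BMP, and a base letter followed by a combining accent.

```python
def utf16_units(s: str) -> int:
    # Each UTF-16 code unit is 2 bytes; "utf-16-le" avoids emitting a BOM.
    return len(s.encode("utf-16-le")) // 2


def utf32_units(s: str) -> int:
    # Each UTF-32 code unit is 4 bytes, one per codepoint.
    return len(s.encode("utf-32-le")) // 4


samples = {
    "BMP letter": "A",
    "outside the BMP": "\U0001F600",      # emoji, needs a surrogate pair in UTF-16
    "letter + combining": "e\u0301",      # renders as one grapheme, two codepoints
}

for name, s in samples.items():
    # Python's len() on str counts codepoints, not graphemes.
    print(f"{name}: codepoints={len(s)} "
          f"utf16_units={utf16_units(s)} "
          f"utf32_units={utf32_units(s)} "
          f"utf8_bytes={len(s.encode('utf-8'))}")
```

The emoji is one codepoint but two UTF-16 code units, so indexing by code unit can land in the middle of it. The last sample is two codepoints in every encoding even though a reader sees one character, which is why even UTF-32 indexing doesn't give you "characters".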

> but you can barely perform random access on a utf8 string.

This really isn't true, or at least, isn't a problem in practice. If you need indices into UTF-8 strings, then you can record them by decoding the string. This is sufficient for most string-related algorithms except for the "give me the first N characters" variety, which actually turns out to be a relatively good thing, since "give me the first N characters" should require applying Unicode's grapheme cluster segmentation algorithm, which is never amenable to random access in UTF-8, UTF-16 or UTF-32.




> random access by codepoint is generally a bad idea because codepoints and graphemes aren't synonymous.

That's a good point.

>If you need indices into UTF-8 strings, then you can record them by decoding the string.

Sure but the context is that I was responding to "If you use a rope or tree representation (like most editors do these days), random access is O(log n) at best, in many implementations typically O(sqrt n) or even O(n), whereas concatenation string is O(1).".

Decoding the string is O(n), not O(1) (if GP meant "concatenation string" as in a string where the text is concatenated into a single buffer; if GP meant "concatenating a string in a rope or tree is O(1)" then I'm off on a wild tangent).


But you only need to decode it once (or otherwise receive those indices without even decoding), whereas random access to a rope/tree is always O(log n) or O(n).

Use case is everything.


>Use case is everything.

amen.



