
Not to disagree, but you can barely perform random access on a utf8 string. You need to explode it out to utf16 or utf32, which isn't what most languages have built in. Rust and Go largely work with utf8, while C and C++ love them byte arrays (not sure I've even seen std::wstring in the wild).



This is misleading. UTF-16 doesn't actually provide random access, because codepoints outside the basic multilingual plane are encoded with two UTF-16 code units (4 bytes). UTF-32 guarantees 4 bytes for every codepoint, which is quite wasteful, but even then, random access by codepoint is generally a bad idea because codepoints and graphemes aren't synonymous.
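To make both points concrete, here is a quick Python sketch. The sample strings are my own picks for illustration; any codepoint outside the BMP and any combining sequence would show the same thing.

```python
# Sample strings chosen for illustration only.
emoji = "\U0001F600"   # U+1F600, outside the Basic Multilingual Plane
accented = "e\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

# Each UTF-16 code unit is 2 bytes, so bytes // 2 gives the code unit count.
utf16_units = len(emoji.encode("utf-16-le")) // 2
utf32_bytes = len(emoji.encode("utf-32-le"))

# Python's len() counts codepoints, not user-perceived characters.
accented_code_points = len(accented)

print(utf16_units, utf32_bytes, accented_code_points)  # prints: 2 4 2
```

So one codepoint takes two UTF-16 code units, and one user-perceived character ("é" written with a combining accent) takes two codepoints even in UTF-32.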

> but you can barely perform random access on a utf8 string.

This really isn't true, or at least, isn't a problem in practice. If you need indices into UTF-8 strings, then you can record them by decoding the string. This is sufficient for most string-related algorithms except for the "give me the first N characters" variety, which actually turns out to be a relatively good thing, since "give me the first N characters" should require applying Unicode's grapheme cluster segmentation algorithm, and that isn't amenable to random access in UTF-8, UTF-16 or UTF-32.
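A minimal sketch of "record them by decoding", in Python over raw bytes. The helper names `code_point_offsets` and `slice_code_points` are made up for this example, and the scan assumes the input is already valid UTF-8 (it only skips continuation bytes, it doesn't validate).

```python
def code_point_offsets(data: bytes) -> list[int]:
    """Byte offset where each codepoint starts, plus len(data) as an end marker.

    Assumes `data` is valid UTF-8: continuation bytes look like 0b10xxxxxx,
    so every other byte begins a codepoint.
    """
    offsets = [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]
    offsets.append(len(data))
    return offsets


def slice_code_points(data: bytes, offsets: list[int], start: int, stop: int) -> bytes:
    """Return codepoints [start, stop) as UTF-8 bytes using the recorded offsets."""
    return data[offsets[start]:offsets[stop]]


# Sample input: "naive" with a diaeresis, a space, and an emoji.
text = "na\u00efve \U0001F600".encode("utf-8")
offsets = code_point_offsets(text)
print(offsets)  # prints: [0, 1, 2, 4, 5, 6, 7, 11]
print(slice_code_points(text, offsets, 2, 5).decode("utf-8"))
```

The scan is O(n) once; after that, each lookup by codepoint index is a plain list access. Note this still indexes codepoints, not graphemes.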


> random access by codepoint is generally a bad idea because codepoints and graphemes aren't synonymous.

That's a good point.

>If you need indices into UTF-8 strings, then you can record them by decoding the string.

Sure but the context is that I was responding to "If you use a rope or tree representation (like most editors do these days), random access is O(log n) at best, in many implementations typically O(sqrt n) or even O(n), whereas concatenation string is O(1).".

Decoding the string is O(n); not O(1) (if GP meant "concatenation string" as in a string where the text is concatenated into a single buffer - if GP meant "concatenating a string in a rope or tree is O(1)" then I'm off on a wild tangent).


But you only need to decode it once (or otherwise receive those indices without even decoding), whereas random access to a rope/tree is always O(log n) or O(n).
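One way to get indices without decoding is a plain byte search. Here's a rough Python sketch; the haystack and needle are invented for the example, and it assumes both are valid UTF-8, in which case a match can only begin and end on codepoint boundaries because lead bytes and continuation bytes never look alike.

```python
# Sample data invented for this example.
haystack = "caf\u00e9 \U0001F600 bar".encode("utf-8")
needle = "\U0001F600".encode("utf-8")

# bytes.find compares raw bytes and returns a byte offset; nothing is decoded.
start = haystack.find(needle)
end = start + len(needle)

# With valid UTF-8 on both sides, these offsets are safe slice points,
# so both halves decode cleanly.
before = haystack[:start].decode("utf-8")
after = haystack[end:].decode("utf-8")

print(start, repr(before), repr(after))
```

The byte offset can be stored and reused later for O(1) slicing, which is the "indices from a previous access" case.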

Use case is everything.


>Use case is everything.

amen.


Well, depends if you want the random access to be by bytes or code points - you might have the byte indices from a previous access.

Also, Factor (and IIRC Python too) keeps it at 8 bits per codepoint if all codepoints fit in 8 bits, 16 if they all fit in 16, and 32 otherwise.
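Roughly what that scheme looks like, as a Python sketch. I'm not claiming this is how Factor or CPython actually lay strings out internally; `pack_fixed_width` and `code_point_at` are hypothetical helpers that just illustrate picking the narrowest fixed width so that indexing by codepoint becomes simple arithmetic.

```python
def pack_fixed_width(text: str) -> tuple[int, bytes]:
    """Store every codepoint in 1, 2, or 4 bytes, whichever is the narrowest that fits all."""
    max_cp = max((ord(ch) for ch in text), default=0)
    if max_cp < 0x100:
        width = 1
    elif max_cp < 0x10000:
        width = 2
    else:
        width = 4
    data = b"".join(ord(ch).to_bytes(width, "little") for ch in text)
    return width, data


def code_point_at(width: int, data: bytes, index: int) -> str:
    """O(1) lookup: the codepoint at `index` lives at byte offset index * width."""
    start = index * width
    return chr(int.from_bytes(data[start:start + width], "little"))


# Sample strings invented for the example.
print(pack_fixed_width("abc")[0])          # prints: 1
print(pack_fixed_width("a\u20ac")[0])      # prints: 2
print(pack_fixed_width("a\U0001F600")[0])  # prints: 4
```

The trade-off is that a single wide codepoint widens the whole string, in exchange for constant-time indexing by codepoint.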



