UTF-8 is at odds with efficient array indexing. I like Python's approach, where bytes and strings are distinct types, though I have no idea what it's doing under the hood.
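Quick sketch of what that split looks like in practice (the example string is just something I made up): str indexes by code point, bytes by raw byte, so the same text has a different length and different element type depending on which one you hold.

    # str is indexed by code point, bytes by raw byte.
    s = "naïve"             # 5 code points
    b = s.encode("utf-8")   # 6 bytes, since 'ï' encodes to two bytes

    print(len(s), len(b))   # 5 6
    print(s[2], b[2])       # 'ï' vs 195, the first byte of its UTF-8 encoding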
Modern Python uses whatever representation is sufficient to guarantee one code unit per code point for a given string (which it can decide at creation, since strings are immutable). So per PEP 393 you get 1, 2, or 4 bytes per code point - effectively Latin-1, UCS-2 (UTF-16 minus the surrogate pairs), or UCS-4.
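You can see the per-string sizing from Python itself; a small sketch (exact byte counts vary by CPython version and platform):

    import sys

    # Each string is four code points long; only the widest code point differs.
    # CPython (PEP 393) sizes the whole buffer at 1, 2, or 4 bytes per code
    # point accordingly, so the total object size steps up.
    for s in ["abcd", "abc\u00e9", "abc\u20ac", "abc\U0001f600"]:
        print(f"{s!r:14} -> {sys.getsizeof(s)} bytes")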
This is great for high-level code, but painful to work with from native code, because calling into other libraries usually requires one specific encoding, most often UTF-8, so you need to re-encode all the time.
I actually had to work with Python strings at the C level recently, and their approach is pretty clever. IIRC, the runtime accepts any common form of Unicode and stores it. When you access that string from C, the accessor asks for a specific encoding, and the runtime converts if need be and then caches the result on the string object.
So it handles the (very) common case of needing the same encoding multiple times (e.g. for all file paths on Windows), while not introducing too much overhead in memory or speed.
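You can poke at that caching from Python itself via ctypes.pythonapi; treat this as a CPython-only sketch of the idea rather than a description of the exact mechanism:

    import ctypes

    # PyUnicode_AsUTF8 returns a pointer to a UTF-8 buffer cached on the string
    # object, so asking for the same encoding twice yields the same address.
    PyUnicode_AsUTF8 = ctypes.pythonapi.PyUnicode_AsUTF8
    PyUnicode_AsUTF8.argtypes = [ctypes.py_object]
    PyUnicode_AsUTF8.restype = ctypes.c_void_p  # raw pointer, not a copied bytes object

    s = "höllo wörld"
    first = PyUnicode_AsUTF8(s)
    second = PyUnicode_AsUTF8(s)
    print(hex(first), hex(second), first == second)  # same buffer both times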
I could be mistaken on exact details though, especially since I recall there being multiple implementations even within py3.x.