
It definitely gets a bit murky when dealing with MBCS (multi-byte character sets), where you care about characters that span multiple bytes rather than individual bytes.

I understand the topic is the strXxx() functions, which are ASCII-only, but it does need to be said that size != len for wide and multi-byte character sets.
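
To make the size != len point concrete for wide strings, here is a minimal sketch. The helper name wide_bytes is made up for illustration, and the actual byte count depends on the platform's sizeof(wchar_t) (commonly 2 or 4).

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: bytes needed to store a wide string,
 * including its terminating L'\0'. wcslen() counts wchar_t
 * elements, not bytes, so the two numbers differ whenever
 * sizeof(wchar_t) > 1. */
static size_t wide_bytes(const wchar_t *s)
{
    return (wcslen(s) + 1) * sizeof(wchar_t);
}
```

For L"abc", wcslen() reports 3 while wide_bytes() reports 4 * sizeof(wchar_t), which is the number you actually need when allocating.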




Yeah, that's an important observation, especially in today's Unicode world. It just strengthens my point that these "string" functions are really just bytes/memory functions in disguise.

Honestly, "string" is a very harmful word that we've all grown used to. As an abstraction it sits somewhere between raw bytes and properly encoded text handled with proper Unicode functions such as those provided by ICU. Python 3 finally forced people to start thinking about this stuff, and nobody liked it.


The str functions aren't ASCII-only; they work perfectly fine with multi-byte strings such as UTF-8-encoded strings. The "length" just isn't the number of "characters", but the definition of a "character" is itself murky, and bytes are what you're usually interested in anyways.
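
A rough sketch of that difference, assuming the input is valid UTF-8. The function utf8_codepoints is a name invented here, not something from the standard library, and it counts code points, which is only one of several possible definitions of "character".

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count UTF-8 code points by skipping
 * continuation bytes (those of the form 10xxxxxx).
 * Assumes s is valid, NUL-terminated UTF-8. */
static size_t utf8_codepoints(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

With the UTF-8 string "na\xc3\xaf" "ve" (the word "naïve"), strlen() returns 6 because the ï takes two bytes, while utf8_codepoints() returns 5. strlen() is still doing exactly what it promises; it just counts bytes.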


> and bytes are what you're usually interested in anyways.

Bytes are relevant when I have to allocate memory; otherwise some definition of "character" is often more relevant. Even if I trim text to fit in a buffer, I don't want to cut inside a "character"; I want the largest number of whole "characters" that fit. Now, "characters" are of course complicated, since grapheme clusters are what is most useful for human interaction ... but those are quite out of scope for a "simple" string library ...
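
For the buffer-trimming case, something like the following sketch would avoid cutting inside a UTF-8 sequence. Both function names are hypothetical, the input is assumed to be valid UTF-8, and it only respects code point boundaries, so a grapheme cluster made of several code points can still be split.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: largest byte length <= max_bytes that does
 * not end in the middle of a UTF-8 sequence.
 * Assumes s is valid, NUL-terminated UTF-8. */
static size_t utf8_trim_len(const char *s, size_t max_bytes)
{
    size_t len = strlen(s);
    if (len <= max_bytes)
        return len;

    size_t cut = max_bytes;
    /* s[cut] is the first byte that would be dropped. If it is a
     * continuation byte (10xxxxxx), the cut lands inside a sequence,
     * so back up to the start of that sequence. */
    while (cut > 0 && ((unsigned char)s[cut] & 0xC0) == 0x80)
        cut--;
    return cut;
}

/* Hypothetical helper: copy src into dst (capacity dst_size bytes,
 * including the terminator), truncating on a code point boundary.
 * Returns the number of bytes copied, excluding the terminator. */
static size_t utf8_copy_trunc(char *dst, size_t dst_size, const char *src)
{
    if (dst_size == 0)
        return 0;
    size_t n = utf8_trim_len(src, dst_size - 1);
    memcpy(dst, src, n);
    dst[n] = '\0';
    return n;
}
```

Copying "na\xc3\xaf" "ve" into a 4-byte buffer leaves room for 3 bytes of text, but the third byte would be the first half of ï, so the result is "na" (2 bytes) rather than a broken sequence.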



