
Hm, but it's only really related to the recent situation of Joel no longer blogging.

What he says is still very much valid. It's a nice intro into Unicode. Quite refreshing to read.




Actually, the only thing I miss is any mention of the downsides of UTF-8. Like that you need to go through the string to determine how many characters it has, due to their (potentially) variable length.
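To make the scanning cost concrete, here is a minimal sketch of counting code points in raw UTF-8 bytes. The function name count_code_points is made up for illustration; it assumes the input is valid UTF-8 and relies only on the fact that continuation bytes have the bit pattern 10xxxxxx.

```python
def count_code_points(data: bytes) -> int:
    """Count code points in valid UTF-8 by visiting every byte once."""
    # Continuation bytes look like 0b10xxxxxx; any other byte starts a code point.
    return sum(1 for b in data if (b & 0xC0) != 0x80)


text = "na\u00efve caf\u00e9"        # two of the ten characters need 2 bytes each
encoded = text.encode("utf-8")

print(len(encoded))                 # size in bytes, known without scanning
print(count_code_points(encoded))   # size in code points, needs the full scan
```

The byte length is available immediately, but the code point count is only known after looking at the whole buffer.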


I love how people then tend to bring up Win32 "wide strings", Java and .NET as alternatives, all of which use UTF-16, which is also a variable width encoding.
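A quick way to see this, as a sketch: count 16-bit code units for a character inside the Basic Multilingual Plane and for one outside it. The helper name utf16_units is hypothetical.

```python
def utf16_units(s: str) -> int:
    """Number of 16-bit code units needed to store s in UTF-16."""
    return len(s.encode("utf-16-le")) // 2


bmp_char = "\u00e9"         # inside the BMP: one code unit
astral_char = "\U0001F600"  # outside the BMP: a surrogate pair, two code units

print(utf16_units(bmp_char), utf16_units(astral_char))
```

So a "character" in UTF-16 takes either 2 or 4 bytes, and indexing by code unit can land in the middle of a surrogate pair.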


I'm still trying to find one valid use for the length of a string in Unicode characters. What one usually needs to know is the length of the string as it's rendered by some output device, which is not related to the count of Unicode characters in any useful way. Even for fixed-width fonts you can have glyphs that are composed of multiple Unicode characters, or characters whose glyphs occupy two consecutive positions.
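Both cases can be shown with a rough sketch using Python's unicodedata module. The function rough_columns is a made-up approximation, not a real terminal width algorithm: it treats combining marks as zero columns and East Asian wide or fullwidth characters as two columns, and ignores everything else that affects rendering.

```python
import unicodedata


def rough_columns(s: str) -> int:
    """Very rough estimate of terminal columns occupied by s."""
    columns = 0
    for ch in s:
        if unicodedata.combining(ch):
            continue  # combining mark: drawn on top of the previous character
        wide = unicodedata.east_asian_width(ch) in ("W", "F")
        columns += 2 if wide else 1
    return columns


decomposed = "e\u0301"   # 'e' plus a combining acute accent
kanji = "\u6f22\u5b57"   # two wide characters

print(len(decomposed), rough_columns(decomposed))  # more code points than columns
print(len(kanji), rough_columns(kanji))            # more columns than code points
```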


Twitter has a limit of 140 "codepoints". Not bytes. Not glyphs.


That's weird, I thought its limit was deliberately low enough to fit into an SMS message, which has a limit of 140 octets (160 characters in some 7-bit encoding GSM uses). Do they actually allow, say, 140 kanji?
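The arithmetic behind those numbers, as a small sketch. The UCS-2 line is an aside based on the 16-bit encoding SMS uses for text outside the GSM alphabet; the variable names are only for illustration.

```python
SMS_OCTETS = 140

# 140 octets repacked as 7-bit characters
septet_chars = SMS_OCTETS * 8 // 7

# 140 kanji counted as code points, then measured in UTF-8 bytes
kanji_message = "\u6f22" * 140
utf8_bytes = len(kanji_message.encode("utf-8"))

# The same 140 octets at 2 bytes per character (UCS-2)
ucs2_chars = SMS_OCTETS // 2

print(septet_chars, len(kanji_message), utf8_bytes, ucs2_chars)
```

So 140 code points of kanji would be well over 140 octets in any of these encodings.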



That post basically just says go look at this wiki page: https://twitterapi.pbworks.com/Counting-Characters

Why not link to that in the first place?


More importantly, you need to go through the string to determine where the nth character is. You can't jump to a character by index.
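A minimal sketch of what "go through the string" means for indexing, assuming valid UTF-8 input. The function name byte_offset_of is hypothetical.

```python
def byte_offset_of(data: bytes, n: int) -> int:
    """Return the byte offset where the n-th code point (0-based) starts."""
    seen = -1
    for offset, b in enumerate(data):
        if (b & 0xC0) != 0x80:   # not a continuation byte: a new code point starts
            seen += 1
            if seen == n:
                return offset
    raise IndexError(n)


data = "a\u00e9\u6f22b".encode("utf-8")  # widths in bytes: 1, 2, 3, 1

print([byte_offset_of(data, i) for i in range(4)])
```

The offsets are not multiples of anything, so there is no way to compute them without looking at the preceding bytes.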


Which you can fix by storing an unsigned long at the front of the string which holds the size.

If it would overflow, you just set the long to the max size, bite the bullet and read the entire string. If you are using a dynamic language, just throw whatever InfinitelyLargeNumber class it has in the size column and you're good.

If you're worried about RAM that much, you can just use plain strings when you need to.
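A sketch of that scheme, with assumptions spelled out: the prefix here is deliberately a small 16-bit field so the overflow path is easy to exercise, and SENTINEL, pack and char_count are made-up names, not an existing format.

```python
import struct

SENTINEL = 0xFFFF  # hypothetical: the max prefix value means "count not stored"


def pack(text: str) -> bytes:
    """Prefix the UTF-8 bytes with the code point count (little-endian uint16)."""
    count = len(text)
    stored = count if count < SENTINEL else SENTINEL
    return struct.pack("<H", stored) + text.encode("utf-8")


def char_count(blob: bytes) -> int:
    """Read the count from the prefix, scanning only if it overflowed."""
    (stored,) = struct.unpack_from("<H", blob)
    if stored != SENTINEL:
        return stored
    # Overflow case: bite the bullet and count non-continuation bytes.
    return sum(1 for b in blob[2:] if (b & 0xC0) != 0x80)


print(char_count(pack("h\u00e9llo")))
print(char_count(pack("\u00e9" * 70000)))
```

Note that the prefix only makes the count cheap. Finding where the nth character starts still needs a scan or a separate index.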


"Like that you need to go through the string to determine how many characters it has, due to their (potentially) variable length."

You have to do that with all Unicode encodings, since a semantic symbol can be composed of multiple combining characters.
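For example, here is a sketch showing that even a fixed-width encoding like UTF-32 does not give one unit per visible symbol. The helper utf32_units is a made-up name; the normalization call is the standard unicodedata.normalize.

```python
import unicodedata


def utf32_units(s: str) -> int:
    """Number of 32-bit code units needed to store s in UTF-32."""
    return len(s.encode("utf-32-le")) // 4


composed = "\u00e9"      # single precomposed code point
decomposed = "e\u0301"   # 'e' followed by a combining acute accent

print(utf32_units(composed), utf32_units(decomposed))

# Both spellings normalize to the same precomposed form under NFC.
print(unicodedata.normalize("NFC", decomposed) == composed)
```

Normalization helps here only because a precomposed form exists; many combining sequences have none, so counting symbols still means walking the string.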



