Substring operations exhibit similar problems, and those are used quite often. It's just that in this case the effect of seeing it fail is a little more dramatic (e.g., l̈, which doesn't even seem to render properly here).
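A minimal sketch of the failure in Python (the string literal is my own example; l̈ is just "l" followed by the combining diaeresis U+0308):

    s = "l\u0308a"          # renders as "l̈a": 3 code points, 2 graphemes
    print(len(s))           # 3, not 2
    print(s[:1])            # "l" -- the slice silently drops the diaeresis
    print(hex(ord(s[1])))   # 0x308 -- an orphaned combining mark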
Verdana apparently doesn't support U+0308 properly. It renders wrong (with that font) in Chrome, IE 10, Firefox, and Word 2010. Other operating systems might substitute a different font that works better, perhaps.
Right, but what's the count you want there? It's either a byte count or a grapheme cluster count. The .count() on most current languages' string types doesn't correspond to either of those, so it isn't really useful.
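To make the three counts concrete, here's a quick Python sketch (using the third-party regex module, whose \X pattern matches a grapheme cluster; the stdlib re module has no equivalent):

    import regex  # third-party: pip install regex

    s = "nai\u0308ve"                    # "naïve" with a combining diaeresis
    print(len(s.encode("utf-8")))        # 7 -- bytes in UTF-8
    print(len(s))                        # 6 -- code points, what len() gives
    print(len(regex.findall(r"\X", s)))  # 5 -- graphemes, what the user sees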
What do you use it for? Unless you have a monospaced font, the number of characters doesn't mean much. So unless you're implementing command-line tools or text editors, it shouldn't be that common.
Truncating with an ellipsis in the GUI of a desktop app. I can measure rendered length on the desktop, so I can truncate down to the desired number of pixels, round down to the nearest character, and then tack on "...". I would hate to see a semantically important accent mark lost this way.
I have a database field limited to 100 "characters" [1]. The user sent me a form submission with 150. I need to do something to resolve that. This is incredibly common. Truncation to a defined size is routine.
[1]: I'm leaving "characters" undefined here, because no matter what Unicode-aware definition you apply here, you've got trouble.
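If truncation ends up being the resolution, the least you can do is avoid cutting a grapheme in half. A sketch in Python, assuming the column limit counts code points (the helper name and the regex module's \X grapheme pattern are my choices, not anything prescribed):

    import regex  # third-party: pip install regex

    def truncate(s, max_code_points=100):
        # Cut to a code-point budget without splitting a grapheme cluster.
        out, used = [], 0
        for m in regex.finditer(r"\X", s):
            used += len(m.group())       # code points in this grapheme
            if used > max_code_points:
                break
            out.append(m.group())
        return "".join(out)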
This is a good real-world example and the response is an armchair programmer informing you that you are doing it wrong. The internet is rife with know-it-alls. "Just do X." Well, I cannot because I am contractually obligated to write the software as specified and not cowboy up and do whatever I like.
Maybe someone decided 100 characters was a reasonable cutoff, and that field isn't important enough to reject the submission over (read: increase the bounce rate) if someone manages to send too much.
Maybe the 100 characters is a short string generated from an unrestricted long string and cached on a separate server.
"I have a database field limited to 100 "characters"."
Well there's your problem right there...
"The user sent me a form submission with 150. I need to do something to resolve that."
Any software that defines "do something" here as "silently discard 1/3rd of the user input" is software I'm going to throw in the trash. If you must have fixed-length fields, surely telling the user "much characters, wow overflow" is better than just chopping the input.
Since this seems to be confusing people, I'm providing a small hypothetical example here.
"Any software that defines "do something" here as "silently discard 1/3rd of the user input" is software I'm going to throw in the trash."
You are reading far more into it than I put in. I merely said you need to resolve this somehow; you put a particular resolution in my mouth, then attacked it.
I did choose the web for one reason: you can't avoid this case there. You can try to limit the UI to generating 100 characters only (and I still haven't defined "characters"...), but it takes a user 15 seconds to pull open Firebug and smash 150 characters into your form submission anyhow. Somehow, you'd better resolve this; and as quickly as you mounted the high horse when faced with the prospect of mere truncation, throwing the entire request out will cause somebody else to mount an equally high horse....
What if it's a batch ETL process where there is no "user" to tell that it went wrong?
The point that worrying about string length is often an indicator of a separate problem is a good one. But some things really do need the ability to measure and truncate strings, and not every situation allows throwing the software in the trash as an option.
Have you checked how your database counts? Does it count code points, or does it try to count graphemes? I assume the former, but you would still have to cut at a grapheme boundary when truncating the input.
Ellipsizing text when it doesn't fit into a label, for example. And if you just remove code points from the end (instead of graphemes) until the string (including the ellipsis) fits, then you might just drop a diacritic.
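A sketch of the grapheme-level version in Python, with fits() standing in for whatever rendered-width measurement the toolkit provides (that callback and the helper name are hypothetical):

    import regex  # third-party: pip install regex; \X is a grapheme

    def ellipsize(s, fits, ellipsis="..."):
        # Drop whole graphemes, not code points, until the text fits.
        if fits(s):
            return s
        clusters = regex.findall(r"\X", s)
        while clusters and not fits("".join(clusters) + ellipsis):
            clusters.pop()
        return "".join(clusters) + ellipsis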
I've been waiting for someone to ask me to reverse a string in an interview, so I can tell them why the code I just wrote for them (using the XOR trick, which is what they're usually expecting) is wrong.
When I've asked people to reverse a char* in the past, it's just been to see if they understand the basics of pointers. The XOR "trick" hasn't been impressive since high school. :)
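For what it's worth, even a code-point-level reversal (never mind XOR on raw bytes) already mangles combining marks; a small Python illustration, with my own example string:

    import regex  # third-party: pip install regex; \X is a grapheme

    s = "ab\u0308c"        # "ab̈c": 4 code points, 3 graphemes
    print(s[::-1])         # "c̈ba" -- the diaeresis jumps from "b" to "c"
    print("".join(reversed(regex.findall(r"\X", s))))  # "cb̈a" -- correct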
Had such a case a few months back. Strings of single-byte characters are endian-agnostic, but multi-byte character encodings are affected by endianness. To cope with it, I read the sequence as single bytes, reversed it, changed it to the proper encoding, and reversed again. The data came from a binary dump where I only needed a section that contained a few strings.
I admit it's dirty but it was throwaway code for an isolated case.
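A Python sketch of that trick, assuming the dump was UTF-16 with the wrong byte order and contains only BMP characters (surrogate pairs would break it; the example data is mine):

    dump = "HELLO".encode("utf-16-le")     # bytes as the ROM might hold them
    text = dump[::-1].decode("utf-16-be")  # reverse bytes, decode properly
    print(text[::-1])                      # reverse again -> "HELLO"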
Edit: eh, guys, as I stated, the string came from a binary dump. I didn't get to choose the encoding; it came from the ROM of an embedded system with a different endianness. I had to figure out a way to make it human-readable.
Languages won't store Unicode strings internally as UTF-8. Yes, we use it for input and output, but in memory UTF-8 is terrible for random access to characters. Endianness is (normally) only an issue for input and output, not for internal storage; with UTF-16 and UTF-32 especially, you know exactly the size of the items.
UTF-16 is just as bad as UTF-8 in that code points are variable-width (surrogate pairs). The only thing you always get (unless you use a compression scheme like SCSU) is random access to code units. Only UTF-32 also allows random access to code points. However, even that is of questionable value, because when dealing with text you often want to handle graphemes, not code points, code units, or bytes.
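A quick illustration of the difference (Python string indexing is by code point, so the code-unit counts below come from encoding; the string is my own example):

    s = "a\U0001F600b"                      # "a😀b" -- the emoji is outside the BMP
    print(len(s))                           # 3 code points
    print(len(s.encode("utf-16-le")) // 2)  # 4 UTF-16 code units (surrogate pair)
    print(len(s.encode("utf-32-le")) // 4)  # 3 UTF-32 code units == code points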
I don't think I've ever written code to do that outside of homework assignments and interviews.