Substring operations exhibit similar problems, and those are used quite often. It's just that in this case the effect of seeing it fail is a little more dramatic (e.g., l̈, which doesn't even seem to render properly here).
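A minimal sketch of the failure in Python (the string literal is my own example; l̈ is just "l" followed by the combining diaeresis U+0308):

    s = "l\u0308a"          # renders as "l̈a": 3 code points, 2 graphemes
    print(len(s))           # 3, not 2
    print(s[:1])            # "l" -- the slice silently drops the diaeresis
    print(hex(ord(s[1])))   # 0x308 -- an orphaned combining mark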
Verdana apparently doesn't support U+0308 properly. It renders wrong (with that font) in Chrome, IE 10, Firefox, and Word 2010. Other operating systems might substitute a different font that works better, perhaps.
Right, but what's the count you want there? It's either a byte count or a grapheme cluster count. The .count() on most current languages' string types doesn't correspond to either of those, so it isn't really useful.
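To make the three counts concrete, here's a quick Python sketch (using the third-party regex module, whose \X pattern matches a grapheme cluster; the stdlib re module has no equivalent):

    import regex  # third-party: pip install regex

    s = "nai\u0308ve"                    # "naïve" with a combining diaeresis
    print(len(s.encode("utf-8")))        # 7 -- bytes in UTF-8
    print(len(s))                        # 6 -- code points, what len() gives
    print(len(regex.findall(r"\X", s)))  # 5 -- graphemes, what the user sees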
What do you use it for? Unless you have a monospaced font, the number of characters doesn't mean much. So unless you're implementing command-line tools or text editors, it shouldn't be that common.
Truncating with an ellipsis in the GUI of a desktop app. I can measure rendered length on the desktop, so I can truncate down to the desired number of pixels, round down to the nearest character, and then tack on "...". I would hate to see a semantically important accent mark lost this way.
I have a database field limited to 100 "characters" [1]. The user sent me a form submission with 150. I need to do something to resolve that. This is incredibly common. Truncation to a defined size is routine.
[1]: I'm leaving "characters" undefined here, because no matter what Unicode-aware definition you apply here, you've got trouble.
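If truncation ends up being the resolution, the least you can do is avoid cutting a grapheme in half. A sketch in Python, assuming the column limit counts code points (the helper name and the regex module's \X grapheme pattern are my choices, not anything prescribed):

    import regex  # third-party: pip install regex

    def truncate(s, max_code_points=100):
        # Cut to a code-point budget without splitting a grapheme cluster.
        out, used = [], 0
        for m in regex.finditer(r"\X", s):
            used += len(m.group())       # code points in this grapheme
            if used > max_code_points:
                break
            out.append(m.group())
        return "".join(out)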
This is a good real-world example and the response is an armchair programmer informing you that you are doing it wrong. The internet is rife with know-it-alls. "Just do X." Well, I cannot because I am contractually obligated to write the software as specified and not cowboy up and do whatever I like.
Maybe someone decided 100 characters was a reasonable cutoff, and that field isn't important enough to reject the submission over (read: increase the bounce rate) if someone manages to send too much.
Maybe the 100 characters is a short string generated from an unrestricted long string and cached on a separate server.
"I have a database field limited to 100 "characters"."
Well there's your problem right there...
"The user sent me a form submission with 150. I need to do something to resolve that."
Any software that defines "do something" here as "silently discard 1/3rd of the user input" is software I'm going to throw in the trash. If you must have fixed-length fields, surely telling the user "much characters, wow overflow" is better than just chopping the input.
Since this seems to be confusing people, I'm providing a small hypothetical example here.
"Any software that defines "do something" here as "silently discard 1/3rd of the user input" is software I'm going to throw in the trash."
You are reading far more into it than I put in. I merely said you need to resolve this somehow; you put a particular resolution in my mouth, then attacked it.
I did choose the web for one reason: you can't avoid this case there. You can try to limit the UI to generating 100 characters only (and I still haven't defined "characters"...), but it takes a user 15 seconds to pull open Firebug and smash 150 characters into your form submission anyhow. Somehow, you'd better resolve this; and as quickly as you mounted the high horse when faced with the prospect of mere truncation, throwing the entire request out will cause somebody else to mount an equally high horse....
What if it's a batch ETL process where there is no "user" to tell that it went wrong?
The point that worrying about string length is often an indicator of a separate problem is a good one. But some things really do need the ability to measure and truncate strings, and not every situation allows throwing the software in the trash as an option.
Have you checked how your database counts? Does it count code points, or does it try to count graphemes? I assume the former, but you would still have to cut at a grapheme boundary when truncating the input.
Ellipsizing text when it doesn't fit into a label, for example. And if you just remove code points from the end (instead of graphemes) until the string (including the ellipsis) fits, then you might just drop a diacritic.
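A sketch of the grapheme-level version in Python, with fits() standing in for whatever rendered-width measurement the toolkit provides (that callback and the helper name are hypothetical):

    import regex  # third-party: pip install regex; \X is a grapheme

    def ellipsize(s, fits, ellipsis="..."):
        # Drop whole graphemes, not code points, until the text fits.
        if fits(s):
            return s
        clusters = regex.findall(r"\X", s)
        while clusters and not fits("".join(clusters) + ellipsis):
            clusters.pop()
        return "".join(clusters) + ellipsis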
I've been waiting for someone to ask me to reverse a string in an interview, so I can tell them why the code I just wrote for them (using the XOR trick, which is what they're usually expecting) is wrong.
When I've asked people to reverse a char* in the past, it's just been to see if they understand the basics of pointers. The XOR "trick" hasn't been impressive since high school. :)
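For what it's worth, even a code-point-level reversal (never mind XOR on raw bytes) already mangles combining marks; a small Python illustration, with my own example string:

    import regex  # third-party: pip install regex; \X is a grapheme

    s = "ab\u0308c"        # "ab̈c": 4 code points, 3 graphemes
    print(s[::-1])         # "c̈ba" -- the diaeresis jumps from "b" to "c"
    print("".join(reversed(regex.findall(r"\X", s))))  # "cb̈a" -- correct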
Had such a case a few months back. Strings of single-byte characters are endian-agnostic, but multi-byte character encodings are affected by endianness. To cope with it, I read the sequence as single bytes, reversed it, changed it to the proper encoding, and reversed again. The data came from a binary dump where I only needed a section that contained a few strings.
I admit it's dirty but it was throwaway code for an isolated case.
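A Python sketch of that trick, assuming the dump was UTF-16 with the wrong byte order and contains only BMP characters (surrogate pairs would break it; the example data is mine):

    dump = "HELLO".encode("utf-16-le")     # bytes as the ROM might hold them
    text = dump[::-1].decode("utf-16-be")  # reverse bytes, decode properly
    print(text[::-1])                      # reverse again -> "HELLO"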
Edit: eh, guys, as I stated, the string came from a binary dump. I didn't get to choose the encoding; it came from the ROM of an embedded system with a different endianness. I had to figure out a way to make it human-readable.
Languages won't store Unicode strings internally as UTF-8. Yes, we use it for input and output, but in memory UTF-8 is terrible for random access to characters. Endianness is (normally) only an issue for input and output, not for internal storage; with UTF-16 and UTF-32 especially, you know exactly the size of the items.
UTF-16 is just as bad as UTF-8 in that code points are variable-width (surrogate pairs). The only thing you always get (unless you use a compression scheme like SCSU) is random access to code units. Only UTF-32 also allows random access to code points. However, even that is of questionable value, because when dealing with text you often want to handle graphemes, not code points, code units, or bytes.
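A quick illustration of the difference (Python string indexing is by code point, so the code-unit counts below come from encoding; the string is my own example):

    s = "a\U0001F600b"                      # "a😀b" -- the emoji is outside the BMP
    print(len(s))                           # 3 code points
    print(len(s.encode("utf-16-le")) // 2)  # 4 UTF-16 code units (surrogate pair)
    print(len(s.encode("utf-32-le")) // 4)  # 3 UTF-32 code units == code points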
I don't think I've ever written code to do that outside of homework assignments and interviews.