Do people really need to reverse strings in the real world?

I don't think I've ever written code to do that outside of homework assignments and interviews.




Substrings exhibit similar problems and those are used quite often. It's just that in this case the effect of seeing it fail is a little more dramatic (i.e., l̈ – which doesn't even seem to render properly here).
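
To make the failure concrete, here is a minimal Python sketch. The sample string and the helper name `reverse_keeping_marks` are made up for illustration, and grouping by `unicodedata.combining` is only a rough stand-in for real grapheme cluster segmentation (it ignores things like ZWJ emoji sequences).

```python
import unicodedata

s = "l\u0308ama"  # 'l' + COMBINING DIAERESIS (U+0308), then "ama"

# Reversing code points moves the diaeresis onto the wrong base letter.
naive = s[::-1]  # "ama\u0308l": the mark now follows the last 'a'

def reverse_keeping_marks(text):
    """Reverse text while keeping combining marks attached to their base.

    Assumption: a "cluster" is a base code point plus any following code
    points with a non-zero canonical combining class. This approximates,
    but does not implement, Unicode grapheme clusters.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch  # attach the mark to the preceding base
        else:
            clusters.append(ch)
    return "".join(reversed(clusters))

fixed = reverse_keeping_marks(s)  # "amal\u0308"
```

The same grouping step applies to substrings: cut between clusters, not between code points.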


"l̈" renders just fine for me, maybe your font does not include it.


Verdana doesn't seem to properly support U+0308, apparently. It's wrong (with that font) in Chrome, IE 10, Firefox and Word 2010. Other operating systems might substitute a different font that works better, perhaps.


Yes, I am running Debian without having installed the Microsoft core fonts, so DejaVu Sans is substituted for Verdana.


Maybe not reversing, but trimming a Unicode string to a certain character count is a close relative, and it is a very common operation.


Right, but what's the count you want there? It's either a byte count or a grapheme cluster count. The .count() on most current languages' string types doesn't correspond to either of those, so isn't really useful.
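
As a rough illustration of how these counts diverge, here is a small Python sketch. The variable names are arbitrary, and the last count only approximates grapheme clusters by skipping combining marks, which is an assumption made to stay within the standard library.

```python
import unicodedata

s = "e\u0301"  # 'e' + COMBINING ACUTE ACCENT, displayed as a single character

code_points = len(s)                           # 2: what Python's len() reports
utf8_bytes = len(s.encode("utf-8"))            # 3: one byte for 'e', two for U+0301
utf16_units = len(s.encode("utf-16-le")) // 2  # 2: one 16-bit unit each

# Approximation only: count code points that are not combining marks.
approx_graphemes = sum(1 for ch in s if not unicodedata.combining(ch))  # 1
```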


What do you use it for? Unless you have a monospaced font, the number of characters does not mean much. So unless you are implementing command line tools or text editors, it should not be that common.


Truncating with ellipsis in the GUI in a desktop app. I can measure rendered length on a desktop, so I can truncate down to the desired number of pixels, round down to the nearest char, and then tack on "...". I would hate to see a semantically-important accent mark lost this way.


I have a database field limited to 100 "characters" [1]. The user sent me a form submission with 150. I need to do something to resolve that. This is incredibly common. Truncation to a defined size is routine.

[1]: I'm leaving "characters" undefined here, because no matter what Unicode-aware definition you apply here, you've got trouble.


This is a good real-world example and the response is an armchair programmer informing you that you are doing it wrong. The internet is rife with know-it-alls. "Just do X." Well, I cannot because I am contractually obligated to write the software as specified and not cowboy up and do whatever I like.

Maybe someone decided 100 characters was a reasonable cutoff and that field is not important enough to reject the submission (read: increase bounce rate) if someone manages to send too much.

Maybe the 100 characters is a short string generated from an unrestricted long string and cached on a separate server.


"I have a database field limited to 100 "characters"."

Well there's your problem right there...

"The user sent me a form submission with 150. I need to do something to resolve that."

Any software that defines "do something" here as "silently discard 1/3rd of the user input" is software I'm going to throw in the trash. If you must have fixed-length fields, surely telling the user "much characters, wow overflow" is better than just chopping the input.


Since this seems to be confusing people, I'm providing a small hypothetical example here.

"Any software that defines "do something" here as "silently discard 1/3rd of the user input" is software I'm going to throw in the trash."

You are reading far more in than I put in. I merely said somehow you need to resolve this; you put a particular resolution in my mouth, then attacked.

I did choose the web for one reason, which is that you can't avoid this case; you can try to limit the UI to generating 100 characters only (and I still haven't defined "characters"...), but it's 15 seconds for a user to pull open Firebug and smash 150 characters into your form submission anyhow. Somehow, you better resolve this, and as quickly as you mounted the high horse when faced with the prospect of mere truncation, throwing the entire request out for that will cause somebody else to mount an equally high horse....


What if it's a batch ETL process where there is no "user" to tell that it went wrong?

The point that worrying about string length is often an indicator of a separate problem is a good one. But some things really do need the ability to measure/truncate strings, and not every situation allows just throwing the software in the trash as an option.


Have you checked how your database counts? Does it count code points or does it try to count graphemes? I assume the former, but I guess you would still have to cut the input at a grapheme border when truncating the input.
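
A sketch of what cutting at a border could look like, assuming the limit is counted in code points. The function name `truncate_code_points` is hypothetical, and the border test only checks for combining marks, which is a simplification of real grapheme cluster rules (it would still split ZWJ emoji sequences, for instance).

```python
import unicodedata

def truncate_code_points(text, limit):
    """Cut text to at most `limit` code points without orphaning marks.

    Assumption: a safe cut point is any position where the next code
    point is not a combining mark.
    """
    if len(text) <= limit:
        return text
    cut = limit
    # If the first dropped code point is a combining mark, the cluster
    # straddles the limit, so back up and drop the whole cluster.
    while cut > 0 and unicodedata.combining(text[cut]):
        cut -= 1
    return text[:cut]

truncate_code_points("abce\u0301", 4)  # "abc", not "abce" with the accent lost
```

The result can come out shorter than the limit, which is the price of not storing a base letter stripped of its accent.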


Ellipsizing text when it does not fit into a label, for example. And if you just remove code points from the end (instead of graphemes) until the string (including the ellipsis) fits, then you might just drop a diacritic.


You have a search query and you want to remove stopwords and normalize the query.


I've been waiting for someone to ask me to reverse a string in an interview, so I can tell them why the code I just wrote for them (using the XOR trick, which is what they're usually expecting), is wrong.
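
The comment does not say why it is wrong, but one likely reason is that a byte-wise reversal scrambles multi-byte encodings. A minimal Python sketch, assuming the buffer holds UTF-8 (the helper name `xor_reverse` is made up here):

```python
def xor_reverse(buf):
    """Reverse a bytearray in place using the XOR swap trick."""
    i, j = 0, len(buf) - 1
    while i < j:  # i != j, so the XOR swap never zeroes an element
        buf[i] ^= buf[j]
        buf[j] ^= buf[i]
        buf[i] ^= buf[j]
        i += 1
        j -= 1
    return buf

ascii_ok = xor_reverse(bytearray(b"abc")).decode("utf-8")  # "cba"

# "h" + U+00E9 is the bytes 68 C3 A9 in UTF-8. Reversed, the buffer starts
# with the continuation byte A9, which is no longer valid UTF-8.
broken = xor_reverse(bytearray("h\u00e9".encode("utf-8")))
try:
    broken.decode("utf-8")
    decoded_cleanly = True
except UnicodeDecodeError:
    decoded_cleanly = False
```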


When I've asked people to reverse a char* in the past, it's just been to see if they understand the basics of pointers. The XOR "trick" hasn't been impressive since high school. :)


Had such a case a few months back. Strings of single-byte characters are Endian-agnostic but multi-byte character encoding is affected by Endianness. To cope with it I read the sequence as single byte, then reversed, then changed the encoding to proper encoding and reversed again. The data came from a binary dump where I only needed a section that contained a few strings.

I admit it's dirty but it was throwaway code for an isolated case.

Edit: eh, guys, as I stated the string came from a binary dump. I didn't get to choose the encoding, it came from ROM in an embedded system with a different Endianness. I had to figure out a way to make it human readable.
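
The encoding in that dump is not stated, so purely as an assumption take it to be UTF-16. This Python sketch shows the byte order problem and one way to handle it without reversing twice; the sample text and variable names are invented.

```python
# Assumed for illustration: the device stored text as big-endian UTF-16.
raw = "Hi".encode("utf-16-be")  # bytes 00 48 00 69

# Decoding with the wrong byte order still "works" but yields other characters.
wrong = raw.decode("utf-16-le")  # U+4800 U+6900

# Naming the source byte order in the codec gives the intended text.
right = raw.decode("utf-16-be")  # "Hi"

# Equivalent manual fix: swap the two bytes of every 16-bit unit.
swapped = b"".join(raw[i:i + 2][::-1] for i in range(0, len(raw), 2))
also_right = swapped.decode("utf-16-le")  # "Hi"
```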


Use UTF8, no endian issues. That's yet another reason why UTF16 and UTF32 are broken.


Languages will not store Unicode strings internally as UTF8. Yes, we use it for input and output, but in memory UTF8 is terrible for random access to characters. Endianness is (normally) only an issue for input and output, not really an issue for internal storage, especially since with UTF16 and UTF32 you know exactly the size of the items.


UTF-16 is just as bad as UTF-8 regarding variable-width code points. The only thing you always have (unless using compression schemes like SCSU) is random access to code units. Only UTF-32 also allows random access to code points. However, that's still of questionable value because when dealing with text you often want to handle graphemes, not code points, code units or bytes.
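
A quick way to see the code unit versus code point distinction in Python. The sample string is arbitrary; the little-endian codec names are used only so that no byte order mark is prepended.

```python
s = "a\U0001F600"  # 'a' followed by U+1F600, which lies outside the BMP

utf8_units = len(s.encode("utf-8"))            # 5: 1 byte + 4 bytes
utf16_units = len(s.encode("utf-16-le")) // 2  # 3: 1 unit + a surrogate pair
utf32_units = len(s.encode("utf-32-le")) // 4  # 2: one unit per code point
code_points = len(s)                           # 2

# Only in UTF-32 does the code unit count match the code point count here.
```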


You cannot do random access at all in Unicode, not even UTF-32 (and absolutely not UTF-16), due to combining characters.


UTF16 is variable length just like UTF8 is (so don't assume 2 bytes == 1 character).


For this you do not need to reverse the string in a unicode aware way. You need to operate on the raw bytes.


The same mechanism that is used for reversing a string can be very useful though.

Think along the lines of Python's:

>>> 'abcd'[:-1]

'abc'
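
The same slice shows the pitfall discussed elsewhere in the thread as soon as the last visible character is built from more than one code point. A small sketch with a made-up sample string:

```python
word = "cafe\u0301"  # "café" written as 'e' + COMBINING ACUTE ACCENT

# Dropping the last code point removes only the accent, not the last letter.
trimmed = word[:-1]  # "cafe"

# Dropping the last visible character means dropping two code points here.
trimmed_cluster = word[:-2]  # "caf"
```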



