String types are fine. How about your code? (eteeselink.wordpress.com)
107 points by skrebbel on Nov 28, 2013 | 62 comments



Haskell does this somewhat nicely by shunning the built-in String type and having ByteString and Text types. All three (and others) can be created using string literals, though that can be dangerous. Text is a UTF-16 encoded, ICU-backed human-text monster type which handles upcasing ligatures and even more complex collation and sorting (which is, btw, how you solve the phone book issue, and it's just one C library away).

ByteString is a series of bytes that just may happen to be OK to print as human text for debugging. The system makes it hard for you to treat it otherwise by moving the "Char8-assuming" functions to different modules and packages which must be explicitly imported and carry warnings.

You convert between them using functions in the Text.Encoding module, some of which can fail, like "decodeUtf8'" and "encodeLatin1". There's also a slew of normalizing functions.
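
The analogous move exists on the JVM too, though nothing forces you to use it. A Scala sketch of making the decode step explicit and fallible:

    import java.nio.charset.{StandardCharsets, CodingErrorAction}
    import java.nio.ByteBuffer

    // Decoding bytes to text is a fallible step; report bad sequences
    // instead of silently replacing them:
    val decoder = StandardCharsets.UTF_8.newDecoder()
      .onMalformedInput(CodingErrorAction.REPORT)

    val bytes = ByteBuffer.wrap(Array[Byte](0x68, 0x69, 0xC3.toByte))
    // decoder.decode(bytes) now throws MalformedInputException on the
    // truncated UTF-8 sequence, much like decodeUtf8' returning a Left.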

I really encourage anyone interested in this problem to peruse the Text and Text.ICU documentation.

http://hackage.haskell.org/package/text

http://hackage.haskell.org/package/text-0.11.3.1/docs/Data-T...

http://hackage.haskell.org/package/text-icu

http://hackage.haskell.org/package/text-icu-0.6.3.7/docs/Dat...

http://hackage.haskell.org/package/text-icu-0.6.3.7/docs/Dat...


Misses the point, I think? There are plenty of times you must alter human text. For example, a word processor with a find/replace function must be able to replace the correct text and not move extra symbols around while doing so. There are thousands of similar examples.


Yes, but functionality that is about dealing with text for human consumption is properly the function of a library, not the core language, or even a single type in the standard library.

Human language and culture are too complex, fuzzy, and variable to hardwire their rules into a programming language specification. Character boundaries and character transformations are really only the beginning of it.

Consider finding the end of an English word. Is an apostrophe at the end an actual apostrophe denoting the possessive case (of a plural word) or meant as a closing single quote (even where there are different symbols, they are often mixed up when doing data entry)? How do you pluralize words? You have to consider exceptions to the usual rules (e.g., a dictionary), words that only exist in the plural, languages that don't have the concept of plural words, etc.


Nonetheless, that's what Wolfram is trying to do with his "programming" language, at least at that level: provide human-level rules and knowledge embodied in a computer programming language.


Entirely misses the point. The original document was discussing the problem of text normalization, which still remains.

If I have a database of employee bios and I write an app so HR can search that database, it's no problem for them to type in "management" and get a list of all employees with the word "management" in their bio. But when a division does the same thing in a non-English-speaking country where there's an é in the word, then that word can be composed of either two characters or one and searching one way will tend to miss any results which were composed of text the other way. The solution here is to normalize the data before it is stored in the DB and the search term before it is searched for.
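
On the JVM, for instance, that's one call to java.text.Normalizer (a sketch in Scala; the `canonical` helper is just a name I picked):

    import java.text.Normalizer

    // Normalize both the stored bio text and the incoming search term to
    // one canonical form (NFC), so "é" always has a single representation.
    def canonical(s: String): String =
      Normalizer.normalize(s, Normalizer.Form.NFC)

    // Composed "é" and decomposed "e" + combining acute now compare equal:
    canonical("r\u00e9sum\u00e9") == canonical("re\u0301sume\u0301")  // true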

It's a simple concept. Not all programmers are aware of it, however, and this guy totally missed the point.


Actually, one of the points that I tried to make is that the number of situations where you have to do that is remarkably small. You don't code a word processor every day. And, if you do need to do that, and you need to support more than a single language, you're talking about a rather complex, probably slightly fuzzy, major feature.

I bet the MS Word team that made internationalized search/replace spent a lot of time getting it even somewhat right.


The author argues that when string processing becomes complicated, it does not matter and incorrect results are usually OK.

Everybody knows about Bobby Tables, right? How about Mr. เจ้าพระยาบดินทรเดชา Unicode?

What if somebody told you that if certain exotic names are entered into the system by a certain input method, government officials can't find those people in their searches, including full-text search of a police report?


SQL injection is caused by exactly the confusion between human- and machine-readable text that the author is warning against.


> There are thousands of similar examples

Not really.


Text search, text replacement, ellipsizing text at the end or middle (as done commonly on iOS), fuzzy search, password word limits, password length limits, matching passwords (try using bcrypt on international languages without first normalizing), detecting and filtering rude words in public chat, ... Basically anywhere that text needs to be checked or changed.

Just off the top of my head. There are far more. You can't even have a native language user friendly login system without normalizing text.
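
For the bcrypt case, the fix is one normalization call before hashing. A sketch in Scala, assuming the jBCrypt library is available:

    import java.text.Normalizer
    import org.mindrot.jbcrypt.BCrypt  // assumption: jBCrypt on the classpath

    // Normalize before hashing, so a password typed on a platform that
    // emits decomposed characters still matches a hash made from the
    // composed form.
    def hashPassword(pw: String): String =
      BCrypt.hashpw(Normalizer.normalize(pw, Normalizer.Form.NFKC), BCrypt.gensalt())

    def checkPassword(pw: String, hash: String): Boolean =
      BCrypt.checkpw(Normalizer.normalize(pw, Normalizer.Form.NFKC), hash)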


> Text search,

search, that counts

> text replacement,

why?

> ellipsizing text at the end or middle (as done commonly on iOS),

you need to know width for that, and you don't, bad idea

> fuzzy search,

search

> password word limits, password length limits, matching passwords (try using bcrypt on international languages without first normalizing),

i18n password? very bad idea

> detecting and filtering rude words in public chat

search & replace, anyway you won't be able to, because f͌̊ͦuͥcͥ͑ͩ͛k͂͌̿ͣ̾


> you need to know width for that, and you don't, bad idea

You also need to not chop off accents. "café" cannot become "cafe…"
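
The fix is to cut on grapheme boundaries instead of Char indices; a sketch in Scala using java.text.BreakIterator:

    import java.text.BreakIterator

    val cafe = "cafe\u0301"  // "café" in decomposed form: 5 Chars, 4 graphemes
    cafe.take(4)             // "cafe" -- the combining accent is silently dropped

    // Truncate on user-perceived character boundaries instead:
    def truncateGraphemes(s: String, n: Int): String = {
      val it = BreakIterator.getCharacterInstance
      it.setText(s)
      var end, i = 0
      while (i < n && it.next() != BreakIterator.DONE) { end = it.current(); i += 1 }
      s.substring(0, end)
    }

    truncateGraphemes(cafe, 4)  // "café" -- the accent stays with its base char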

> i18n password? very bad idea

Because string types are so broken.


Finally someone who at least touches on one of my massive pet peeves: something I call suck typing (string-duck typing). Have a DateTime? ToString it. Have XML? IndexOf (this one really gets to me). Got some binary data? Base64 it, store in XML (use IndexOf and Substring to access it) and then put that XML in a binary field in the database. I am a strong advocate that suck typing should be a fireable offence.

While I think Unicode support in languages could be better, there is a lot of truth in this article around its core subject.


"String typing", or "stringly typed" interfaces (a parody of "strongly typed"), is a phrase in common use.

HTTP is a good example of something normally using string typing; most web servers treat headers, for example, as text, rather than as data, and so you end up with a lot of invalid stuff because of misunderstandings about what the values actually are or how they should be interpreted or written.

That's a trap that I'm dodging in rust-http; in there, (known) headers are data, not text, serialised to text only for transmission. (I'll freely admit, however, that this is an approach I would never take in Python; I'm confident it would lead to more errors than the alternative.)
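
In Scala terms, the idea looks roughly like this (a sketch of the concept, not rust-http's actual API):

    // Headers as data: parse once at the boundary, render once at the wire.
    sealed trait Header { def render: String }
    case class ContentLength(bytes: Long) extends Header {
      def render = s"Content-Length: $bytes"
    }
    case class Connection(keepAlive: Boolean) extends Header {
      def render = "Connection: " + (if (keepAlive) "keep-alive" else "close")
    }

    // Serialisation happens only here; everywhere else a Content-Length
    // is a Long and cannot hold "banana".
    def writeHeaders(hs: Seq[Header]): String =
      hs.map(_.render).mkString("\r\n")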


Have you looked at how Happstack leverages the type system?


I adopted the naming I once saw someone use for this, which is "stringly typed code".


You've seen code that uses basic string manipulation (IndexOf, Substring) to get values from XML? That's nice - although I guess it probably seems natural to those who insist on creating XML by string concatenation...


Imagine that there is a performance bottleneck there and substring is, like, 500x faster than engaging a proper XML parser; what would you do?

Obviously it is likely an effect of a global design flaw, but such things are very hard to fix.


In that case I would do whatever was necessary to get that required performance - but only after doing extensive performance testing to ensure that's really where the bottleneck is and providing copious comments and possibly supporting documentation to justify this approach.

99.9% of the time I have seen code do crazy things "because it is faster" it's not performance critical code anyway and there is no explanation provided.


The one time I've seen this particular problem in action, this was exactly the case.

Additionally, the XML in question was byte-for-byte identical until you hit the giant (100s of megs) base64 blob that was the content. The parser stripped X bytes from the start of the file and from the end, and de-base64ed the middle - which, if I recall, it then sent off to another parser, as the content was in some old but standard record format from the 80s.

Anyhow - I'd say using XML in this case was the abuse, not the substring. But we were in no position to get the vendor to change their data format so...


Find a different XML parser or adopt a different way of reading the XML (DOM vs. SAX; or just a different library that performs better). I see where you are coming from though. The problem with XML is that it is used to solve problems that it shouldn't be solving - it's a great technology when used correctly (XMPP is a great example of how XML can make other transfer formats look like a dress rehearsal). In most cases, as you said, it's a "global design flaw" - a good indicator that you are abusing XML is if you are not using xmlns attributes and if you do not have multiple namespaces (because in that case JSON is simpler, faster and makes more sense).


What's the advantage of XMPP? I find the one time when I don't hate xml is when it doesn't have namespaces or schemata, as then it's just a slightly more verbose JSON.


Not much is that special in terms of the XEPs (extension protocols) that they have defined. When you innovate with it though, man you really see the power of correct XML.

You can slap custom elements pretty much anywhere you want, as long as you have your own namespace (and it's recommended you only place them under <message> or <iq> elements). Say you have some proprietary technology in a client application; with XMPP you can throw an element under the <message> that your client can recognise and act on. For everyone else, provide a hyperlink within the <body> element and serve up a web page. If they are using your client: bam! Instant added functionality. But if they are on device X, which you do not support, they are not left out in the cold.


Do it abstractly. Nobody should know how your XML processor is getting its data, simply that it is an extremely brittle, extremely fast choice for one or two parse steps. If you can't pay for speed with brittleness then your bottleneck is unfixable.


Get/create a faster proper XML parser.

(See https://github.com/mpw/Objective-XML)


If you know that the XML has been generated in a particular way, you can often beat any compliant XML parser by employing knowledge of the exact structure.

E.g. you can sometimes skip over chunks of characters without ever accessing them, and get speedups of orders of magnitude over even "just" checking every byte in the input.

There's nothing a faster proper XML parser can do about a custom parser like that.

(Obviously this is a brittle solution and a last resort optimisation, and should be accompanied by ample warnings, but sometimes there are no alternatives)
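
For instance, if the producer is known to always emit a single flat <payload> element (hypothetical name), the "parser" can be two indexOf calls; a deliberately brittle Scala sketch:

    // Assumption: exactly one <payload> element, never nested, commented
    // out, or CDATA-wrapped. Under that contract we can jump straight to
    // the blob without tokenising anything.
    def extractPayload(xml: String): String = {
      val open  = "<payload>"
      val start = xml.indexOf(open) + open.length
      val end   = xml.indexOf("</payload>", start)
      xml.substring(start, end)
    }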


Surely the only way you could possibly do this is if you controlled both sides - in which case you could achieve the same speedup in a less brittle way by switching to a binary system like Protocol Buffers?


I have seen people store XML as VARCHAR in a database and send that very same raw XML string as a payload over a SOAP web service. That company also had their own GUI framework that consisted of generating XML by string concatenation and then doing XSLT transformations. I suppose that is what happens when you have team leaders, managers and a chief software architect who have little understanding of technology and its proper use.


Let me guess - the XML was sent over a SOAP web service that looked pretty much like:

   string DoStuff(string arg)
Where the service call takes a string that actually contains XML and returns a string that actually contains XML....

My own "Worst use of XML" was someone wanting to use quite complex XML documents as a key in a database....


I bet you would love Tcl.


Or NSIS, which I've worked in a lot—the string is the only type. Integer operations will parse it as a number, perform the operation and turn it back into a decimal once more for storage in the string.


Tcl originally did the same thing. Semantically it still does, but values in the Tcl interpreter were changed to a string/union pair so that conversions can be cached.


Talk about throwing out the baby with the bathwater...

The original article was wrong because it proposed replacing strings with arrays of code points. Clearly, that doesn't work.

This article is wrong because adding more string types just shuffles the problem around. There is nothing "machine consumable" about strings encoded in a certain Anglo-centric character encoding! Just don't even think that thought. "abc" is absolutely not more "machine consumable" than "東京". You don't "hash prose" by transliterating text into ASCII characters.

It's not impossible to fix the existing string types. Principle of least surprise holds. In cases when it doesn't, the locale is the tie-breaker. E.g. "Scheiße".upper(locale.DE) may be different from "Scheiße".upper(locale.RU).
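
The JVM already does this for case mapping, with the locale as the tie-breaker (Scala):

    import java.util.Locale

    "i".toUpperCase(Locale.ENGLISH)               // "I"
    "i".toUpperCase(Locale.forLanguageTag("tr"))  // "İ" -- dotted capital I
    "Scheiße".toUpperCase(Locale.GERMAN)          // "SCHEISSE" -- ß expands to SS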


Please tell me how to correctly implement search (ideally in scala), such that "Lodz" will match "Łódź". Heck, making "noël" match "noël" is nontrivial. I don't think existing string types are adequate for this (e.g. java.lang.String.equals() gives exceedingly surprising results).
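
The closest I've got is decomposing and stripping combining marks, which fixes the accents but still misses Ł (a Scala sketch):

    import java.text.Normalizer

    // NFD-decompose, drop combining marks, lowercase. Handles "noël" typed
    // either way, and ó/ź -- but Ł has no decomposition, so it survives.
    def fold(s: String): String =
      Normalizer.normalize(s, Normalizer.Form.NFD)
        .replaceAll("\\p{M}", "")
        .toLowerCase

    fold("noe\u0308l") == fold("no\u00ebl")  // true
    fold("Łódź")                             // "łodz", not "lodz" -- still no match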


You are looking for a distance metric on strings. If d(s1, s2) = 0, where s1 and s2 are strings, then they are equivalent.

"Łódź" may be the same as "Lodz" to you in the same way that "komputor" may be the same as "computer" to a Russian speaker. To someone else the names "Anderson" and "Andersson" are equivalent. Now you see the problem -- exact matching is futile and you should use fuzzy matching instead, like normalized Levenshtein distance, and rank the results based on similarity.

Even that is not enough if you want to support non-Latin alphabets, because they have different ideas about what a character is, but it should get you started.
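
For reference, the textbook dynamic-programming version is short (a Scala sketch; note it compares Chars, so it inherits all the code-unit caveats from this thread):

    def levenshtein(a: String, b: String): Int = {
      val d = Array.tabulate(a.length + 1, b.length + 1) { (i, j) =>
        if (j == 0) i else if (i == 0) j else 0
      }
      for (i <- 1 to a.length; j <- 1 to b.length)
        d(i)(j) = List(
          d(i - 1)(j) + 1,                                       // deletion
          d(i)(j - 1) + 1,                                       // insertion
          d(i - 1)(j - 1) + (if (a(i - 1) == b(j - 1)) 0 else 1)  // substitution
        ).min
      d(a.length)(b.length)
    }

    levenshtein("Anderson", "Andersson")  // 1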


I shouldn't have to implement this from scratch though. It's not like this problem is unique to my program; programming languages should have some support for solving this kind of problem in their standard libraries (or a readily available library).


Why would the strings match? Aren't they completely different? (What language are we talking about, anyway?)


> Why would the strings match? Aren't they completely different?

In some theoretical sense, yes. In terms of solving users' problems and providing business value, I need to make it possible for users to find the entry for "Łódź" without typing accents.

> (What language are we talking about, anyway?)

We're talking about Scala; it runs on the JVM and so java.lang.String is the string type.


This right here is where we start to mix up two totally different things by using the same words for them.

This thread is talking about two completely different types of Equals; we shouldn't be using the same word for them and certainly not the same function name in code:

- SomeString1.Equals(SomeString2) -> are all the bytes in array 1 equal to the bytes in array 2?

- HumanSimilarity(t1, t2) - given text t1 and text t2, give me a number that tells me how similar a person would perceive these strings to be. You could even go further:

SimilarityForReaderInLocale(locale, t1, locale1, t2, locale2) - for a human reader in locale, given text t1 written by a human in locale1 and text t2 written by a human in locale2, how similar would the reader perceive these two pieces of text?

When we talk about 'least surprise' in manipulations, what I think we really mean is that text manipulation should be defined by a locale; no, actually, by a superset of a locale: your typical human reader in that locale.
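
The JVM actually hints at this split already: String.equals is the code-unit one, and java.text.Collator is the locale-aware one (Scala):

    import java.text.Collator
    import java.util.Locale

    val c = Collator.getInstance(Locale.ENGLISH)
    c.setDecomposition(Collator.CANONICAL_DECOMPOSITION)
    c.setStrength(Collator.PRIMARY)  // compare base letters only

    "noël" == "noel"                 // false: code-unit equality
    c.compare("noël", "noel") == 0   // true: same base letters to a reader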


Oh, so you're coding around your users' inability to type the letters they actually want? That sounds basically impossible... you'd have to pick letters according to how similar they look instead of their meaning in any context. And, I meant human language! :)


"impossible" isn't good enough - I could write a bunch of hacks that would cover many of the cases, but the programming language (or ideally the unicode standard) should provide a standard answer, rather than each programmer having to implement this themselves.

Human language? English speakers talking about cities in a variety of countries (which should be correctly named, but searchable by english speakers).


Impossible does not stop many people from doing exactly this.


Search is always a hairy problem, you would probably store a normalized version of your searchable text and match a normalized input on that. What should actually match will be different depending on what you're doing, so I wouldn't rely on something like string.equals for it.


> Search is always a hairy problem, you would probably store a normalized version of your searchable text and match a normalized input on that. What should actually match will be different depending on what you're doing

Sure, but the language should offer support for this, even if it requires some level of configuration. It's not a problem we want every programmer solving anew for every program.

> I wouldn't rely on something like string.equals for it.

True enough. But I think the existing String.equals method is broken: it behaves very surprisingly and leads programmers to introduce bugs. Likewise e.g. String.substring (which can chop a character in half). These methods cannot be fixed (because existing code assumes their current functionality) and are very difficult to use effectively; they should be deprecated, which in practice means a new String type.
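
Concretely, with a code point outside the BMP (Scala on the JVM):

    val s = "a\uD834\uDD1Eb"       // "a𝄞b": U+1D11E takes two Chars
    s.length                       // 4, though a reader sees 3 characters
    s.substring(0, 2)              // "a" plus a lone high surrogate: garbage

    // Code-point-aware alternatives exist, but nothing forces their use:
    s.codePointCount(0, s.length)  // 3
    s.offsetByCodePoints(0, 2)     // 3 -- the index just past "a𝄞"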


>It's not a problem we want every programmer solving anew for every program.

Yeah, I completely agree. You can make an argument that this stuff belongs in a library (or even several libraries) rather than the language, simply because of how many decisions are involved:

- do you know what language your search input is in? can it be any of several languages? any at all? only one specific language, ever?

- what language is the text you're searching on? do you know? are you potentially searching across text in multiple languages at once?

- how exact does the match need to be? do you want phonetic matches? do you want to match characters that "look" similar but may sound different?

- do you need a binary or a fuzzy match? (e.g. are you doing a search ranked by relevance). Do you need to compute some sort of Levenshtein distance?

So the use cases can range from a simple byte-level exact comparison all the way to a full search server like Solr. How much of that should the language implement? (not a rhetorical question, I actually don't think there's an obvious answer).

No arguments on the existing String methods :)


The conclusion doesn't really match the title.

> If there is any takeaway from this entire discussion, it may be that there is a need for multiple string types in strongly-typed languages

Yes, yes there is. Until we get that, string types are not fine.


Good point!

It is my opinion, however, that string types are fine, just not perfect. Maybe I should have made that clearer.


Yeah, I was a bit confused when your conclusions seemed to match the original "strings are broken" post's conclusions.

The whole problem is that current string types enable broken unsafe behavior on Unicode ("human only" in your parlance) strings. Current string types are broken because they do not enforce the requirement that string operations are done only on plain ASCII ("machine only") strings.


What if I have to create strings with non-ASCII chars?

Calling ASCII "machine only" is totally wrong. I agree that encoding/decoding is a pain, but we have different string encodings for a reason.


> Similarly, why would you ever need to take a substring of someone’s name

In my experience, you need to do this any time you're displaying someone's name, a place name, an article title, or whatever. Often the display area just isn't that wide, and shifting around other content may not be an option. You need to display enough to let people know what's there, but eventually it needs to be cut off.


You need a designer if you ever plan to internationalize. What you're doing won't work - you'll make nonsense phrases.


The problems do not arise from strings; they are well understood computationally, and most languages provide sound functions for working with them. The problems arise when strings are held responsible for the side effects of other functions which manipulate the data within strings.

In particular, problems arise when some function renders a sequence of characters contained in a string [fundamental string operations such as concatenation tend not to be problematic]. These problems are due to the transition from the mathematical certainty of strings to the heuristics of text rendering. The compromises required to map the semantics of human writing systems onto strings via Unicode contribute to this problem... glyphs are not necessarily ordered sequentially, nor resolved without context.


The post makes a good point about the distinction between two types of text - symbols and human text. Sadly, this is made very complex in our world through the introduction of the markup (and markdown) languages which mix text targeting both machines and humans. As long as you're having to work with embedded structuring or formatting codes, woe betide you. You'll have to deal with script injection, sql injection and what not.


I disagree with the premise of the article. Text is for humans only. If you are connecting computers with no human interaction, binary is much more efficient.


Great, so now that he has explained to us how to use strings for human-readable text, the last missing bit is a blog post telling the author how to choose the colors for letters and background. What's the sense in optimizing string usage when half of the audience can't read the light-gray text on a slightly-lighter-gray background?


Hey, thanks for the feedback. I just picked a random Wordpress.com theme to get started fast (I started blogging 3 days ago). I'll put more effort into picking (or even, gasp, making) a better theme!


The colors pass all the accessibility checks on foreground vs. background that I've found. Maybe your monitor is a bit off.


> If you got this far, you’ll probably want to hire me as a consultant.

Nope.


Damn. I'll live, I guess. Thanks for reading till the end anyway!


I say leave it. Don't reduce your chance of getting paid work just because some smug clown on the internet wanted to score a pseudo-zinger by yelling at his television screen.

If anything, change 'probably' to 'might'. Confidence is good, but some people could take 'probably' as a slight against their ability to comprehend the subject matter. I'd always aim at "That was just the beginning. Whip out your chequebook and then you'll REALLY see what I have to offer" rather than "You sound like you're in trouble."


Hah :-)

Thanks for the support, but if I make an arrogant remark, I have to expect to get snarky responses, right :-)

The line was supposed to be taken as a joke, in reference to people signing off their blog posts with "If you read this far, you should probably follow me on Twitter" and the likes.

Thanks for your feedback though! I hadn't thought about how people could interpret it as a slight insult to their understanding of the subject matter, so next time I make an arrogant joke I'll try and take stuff like that into account.


Please consider replacing "may want" with "please consider" or nothing at all. Don't know if you want that...



