Should String Be An Abstract Class? (appsandsecurity.blogspot.com)
48 points by danbjson on May 25, 2013 | 52 comments



98% of this article is a discussion of the use of strings in HTTP headers. The last 4 or 5 sentences try to generalize the conclusion to strings at large.

I feel a bit misled by the title. It would have been nice to have 3 or 4 examples, and a much larger argument that strings in general should be abstract.

However, even with all of this underpinning, it doesn't work. This is like an argument against integers because integers can express so many things: age, number of arms, days of the year, and so forth. Base types can cover a lot of things! Film at eleven.

The discussion should have been about proper API construction, and when to subclass. Or even particular problems with HTTP Header APIs in Java -- I really liked that part. If the author had stuck with it, I wouldn't have felt a bit bamboozled.


> This is like an argument against integers because integers can express so many things: age, number of arms, days of the year, and so forth.

Yes, and just like the strings case it shows how useless most primitive types are from a program correctness and documentation standpoint. Languages like C++ and Java encourage programmers to treat them as mystical bags-of-holding and forget about their limitations or relation to the data being processed.


> This is like an argument against integers because integers can express so many things: age, number of arms, days of the year, and so forth.

And each of those uses should be given its own type, and the compiler should refuse to compile if you pass setPersonAge() a value of type ArmNumber, even if the relevant types are both implemented as 32-bit half-words at the machine level. If you are writing systems software or encryption software and need to care about machine-level details, that should be encoded into the type system as well. The semantic meaning of a type should be decoupled from its machine representation; for example, the type PersonAge could be implemented as a string in part of the program and still be the same type, because the compiler can handle auto-conversion like that without a problem, but the semantics of the value haven't changed.

Or, at least, that's one way to look at types.
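
To make that concrete, a minimal Java sketch (PersonAge, ArmNumber, and setPersonAge are just the names from this comment; everything else is made up for illustration):

    // Distinct value types share an int representation but cannot be
    // confused at compile time.
    final class PersonAge {
        final int years;
        PersonAge(int years) {
            if (years < 0) throw new IllegalArgumentException("negative age");
            this.years = years;
        }
    }

    final class ArmNumber {
        final int count;
        ArmNumber(int count) { this.count = count; }
    }

    class Person {
        private PersonAge age;
        void setPersonAge(PersonAge age) { this.age = age; }
    }

    // new Person().setPersonAge(new ArmNumber(2));   // does not compile
    // new Person().setPersonAge(new PersonAge(34));  // fine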


Gah. The very interesting question that the author asks is whether we should use the type system to help us enforce validation constraints, and in particular, security constraints -- a great idea that I've seen discussed elsewhere. But unfortunately he takes a wrong turn and assumes that this implies a subclassing relationship, where these validated string classes derive from an abstract string class. This makes no sense from an OO perspective; nothing about a string class is "abstract," and it's better to favor composition over inheritance.


Most "stringly-typed" systems seem inherently quirky to me. In something like Javascript, everything else is so loose, it's like why not use strings, but I've always wondered why things like HTTP headers couldn't be handled using enums. I think it would cover a lot of the cases mentioned in the article without having a weird-at-first-glance abstract string paradigm to deal with.

As a practical matter, I think enums couldn't be used in the Java spec because it was written before enums were available in the language. Enums also suffer from extensibility issues, but that could be solved by having stringly-typed overrides. I've also been contemplating the possible merits of extensible enums for solving issues like this. So for example you could have an enum type that extends the HttpHeader enum. The ordinality would be weird if there were multiple sub-enums, but it might not matter for most practical cases. I can think of a few language-level ways to deal with the ordinality issue too.
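
Roughly what I have in mind, as a sketch (KnownHeader and HttpResponse are made-up names, not the actual servlet API):

    // Well-known header names as an enum, plus a raw-string overload as
    // the "stringly-typed override" for extension headers.
    enum KnownHeader {
        CONTENT_TYPE("Content-Type"),
        CONTENT_LENGTH("Content-Length"),
        CUSTOM_LANGUAGE("Custom-Language");

        final String wireName;
        KnownHeader(String wireName) { this.wireName = wireName; }
    }

    interface HttpResponse {
        void addHeader(KnownHeader name, String value); // the common case
        void addHeader(String rawName, String value);   // escape hatch for unknown headers
    }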


While I agree with the author's sentiment, I think it is a dangerous practice to assert that "Nothing is any of 100,000 characters and anything between 0 and 2 billion in length." By all means, let's impose more rigorous structure on data items that require it, like HTTP headers, but why impose artificial restrictions on things that don't need it?

No single /thing/ might be any of 100,000 characters and between 0 and 2 billion in length, but a group of /thing/s that share enough similarities as to be functionally identical might very well subsume most of those 100,000 characters and have no intrinsic limits on length. If I have learned anything in my quarter century of developing software, it's that the moment I impose an artificial restriction on my data, I will find an item that violates the restriction and now requires special-case handling to do its job.


There's also a really obvious example of a thing that might be any of 100,000 characters and between zero and 2 billion characters in length: the contents of a plain text file.


Right, but is that a genuine use case? If you're writing an editor you probably want it to have a stronger notion of the data representations people might want to edit. If you're just considering "user-supplied free text" that's probably constrained away from certain characters.


I get the impression that the author doesn't really understand why he's doing what he's doing. The obvious solution in this case is to have addHeader validate its arguments. The reason why you might want to have a separate class that does validation on construction is if you were going to end up validating the string multiple times: then you can just validate once in the constructor, and use the type system to ensure that each header string has been validated at least once.
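
A validate-once wrapper could look roughly like this; the CR/LF check is just one plausible example of the validation in question, not necessarily what the article intends:

    final class HttpHeaderValue {
        private final String value;

        HttpHeaderValue(String value) {
            // Validate exactly once, at construction; reject header injection.
            if (value == null || value.contains("\r") || value.contains("\n")) {
                throw new IllegalArgumentException("illegal header value");
            }
            this.value = value;
        }

        @Override public String toString() { return value; }
    }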

But is that a realistic concern in this case? The author proposes replacing this:

    response.addHeader("Custom-Language", 
                   request.getParameter("lang"));
With this:

    response.addHeader(new HttpHeaderName("Custom-Language"), 
                   new HttpHeaderValue(request.getParameter("lang")));
This gives absolutely zero benefit over just doing the validation in addHeader. Every string is still being validated the same number of times, but now the programmer is less productive and the code is more cluttered because of the added boilerplate.

There might be a slight advantage in saving the constant header name somewhere, and reusing it:

    customLanguageHeader = new HttpHeaderName("Custom-Language");
    ...
    response.addHeader(customLanguageHeader, 
                   new HttpHeaderValue(request.getParameter("lang")));
But that's even uglier. What might be nice to have is a way to run the validation at compile time for static strings, so that you can save validation passes at runtime while still being able to use the literal "Custom-Language" at the location of the addHeader call. Automatic casting would also be nice, so that you can just pass the naked string for an HttpHeaderName parameter and the compiler will insert the constructor. But I don't think you can do these things in Java.

For this particular language and this particular problem, it may be best to have the Response object validate all headers while it's being serialized.


If you do want to do this kind of thing in Java, you can use the JSR 308 Checker Framework.


This reminds me quite a bit of this article: http://blog.moertel.com/posts/2006-10-18-a-type-based-soluti..., although Haskell's type system is likely much better for expressing this sort of constraint.


String shouldn't be abstract, just like ArrayList shouldn't be abstract. Why? Because you should use composition, not inheritance. You should introduce types for your concepts in your code to avoid these types of problems. But it's always tempting to avoid it in order to simplify the design. It's always a tradeoff, but personally I think more use of typing would be good in most projects that I've seen.


Every primitive type is a problem. General purpose programming languages are too "general-purpose" to be useful. Java will happily let you add two integers even though one may represent a quantity in metric and the other in imperial.

Invariably you should create your own DSL-like layer to organize your code and enforce consistency and correctness.


F# has this concept of allowing units on numbers (http://msdn.microsoft.com/en-us/library/dd233243.aspx). So instead of just having "5" you could have "5 meters" or "5 dollars". It doesn't fundamentally change the number type, it just gives you some metadata to work with.

Making a subclass for every string type strikes me as being a bit heavy, since usually you don't necessarily want new string behavior, you just want to classify it and define conversions.

I think it would be neat if (with language support obviously) instead of overloading the class, you could instead just specify "units" for a string, and conversions between those units.


>Making a subclass for every string type strikes me as being a bit heavy, since usually you don't necessarily want new string behavior, you just want to classify it and define conversions.

Many languages can optimize such a construct away so that it only exists at compile time, or have another construct that declares "represented as type X, but treated as a different type at compile time" (e.g. Haskell's newtype).


Making string abstract or creating new string subtypes does not solve the underlying problem: Your type system isn't good enough.

Now when I say this, let me be clear: I'm talking about everybody's type systems. That goes for you Haskell-ers too. And the post-Haskell rocket surgeons doing crazy advanced typing insanity.

Type systems are abstractions. The ones you are used to, like the Java-ish OOP ones or the ML/Haskell functional ones, are designed to detect and prevent a wide variety of programming errors, while also enabling analysis that will improve execution performance. However, there are many sorts of "type systems" that solve different problems.

- There are tree schemas for validation and data generation.

- There are database schemas for indexing and query planning.

- There are grammars for parsing and validating languages.

- There are contracts for ensuring preconditions, postconditions, and invariants.

- The list goes on for quite a while.

The idea that you can assign a single named type in a single kind of type system to a value and 100% verify the correctness of your software is just bogus. You need optional and pluggable type systems, so that you can bring a particular type system to meet a particular problem.


In Pascal, you could have user-typed primitives, with conversion code for casts. For example, you could have Celsius and Fahrenheit float types. If you used a Celsius value where a Fahrenheit was wanted, the conversion code would automatically be called.

I saw an article years ago (by Joel?) about applying this mechanism to html-encoding: so if you used an unescaped type where an escaped type was required, it was automatically converted. This provided security against injection attacks.

Similarly, you can have validated and unvalidated types. If your libraries/frameworks already used these, there's almost no work left for you to do.
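
Java can't do the automatic conversion, so the split has to be explicit, but the idea looks roughly like this (HtmlEscaped is a made-up name and the escaping shown is deliberately minimal):

    final class HtmlEscaped {
        private final String value;
        private HtmlEscaped(String value) { this.value = value; }

        // The only way to obtain an HtmlEscaped is through escape().
        static HtmlEscaped escape(String raw) {
            return new HtmlEscaped(raw.replace("&", "&amp;")
                                      .replace("<", "&lt;")
                                      .replace(">", "&gt;")
                                      .replace("\"", "&quot;"));
        }

        @Override public String toString() { return value; }
    }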

Of course, Java doesn't have user-typed primitives, nor type-conversion code. In XSD datatypes you can specify a string type in terms of a regular expression: http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/datatypes...


This reminded me of Joel's leaky abstractions. And you could hardly find anything leakier than strings.

Amusingly, the history of the evolution of C++ over time can be described as a history of trying to plug the leaks in the string abstraction. Why they couldn't just add a native string class to the language itself eludes me at the moment.


Integers are also rarely "just integers" -- checksums, hashes, bitmasks, counters, numbers -- you name it.



Should int be an abstract class?

Actually, in my recent C I've taken to wrapping most ints in one-member structs, so the compiler will catch it when I pass a foo id where I meant to pass a quantity...


Yes. Composition over inheritance (http://c2.com/cgi/wiki?CompositionInsteadOfInheritance). Oftentimes, these string-like things aren't really strings, but have a string representation that dominates their other uses. For example, HTTP headers are, AFAIK, (key, value) pairs that have the string representation "key: value". A good implementation should discriminate between such objects and their implementation.

If your language introduces needless friction when you try to use composition, I would blame the language before I blamed the string class.

[Aside: in Cocoa, NSString is an abstract class, but for a different reason (http://developer.apple.com/library/ios/ipad/#documentation/g...).

That approach is only possible in languages that allow abstract classes that are instantiable, such as Objective-C and Dylan (http://opendylan.org/books/dpg/modularity.html#abstract-conc...).]


I love this technique. It can catch so many errors. It's also easily extensible to slightly more complicated cases. For example, I was working on some gnarly code that was working with a bunch of times with different epochs (e.g. time since startup on the local computer, time since startup on a remote computer, and time since the UNIX epoch). Rather than try to remember what was what, or try to painfully encode it in variable names, I simply wrote a struct that contained the number of seconds and an enum indicating what epoch it used. Then any operation on a pair of times (deltas, comparisons) got factored into a function that asserted the time bases of the two times were compatible.

It's interesting how little use this seems to get in C in general.


That sounds more like a tagged union, which is great for dynamic typing when that's needed, but is a somewhat different technique.


Not at all. It's just a struct with two members:

    struct Time {
        double val;
        enum Epoch epoch;
    };
It works just like one-member structs, in that it can be treated as a single value, passed and returned by value when calling functions, and keeps you from accidentally mixing different kinds of values. Having the "epoch" field along for the ride just means you can add some additional smarts.


So yes, like I was thinking: a tagged union - the epoch field determines the interpretation of the val field. The full application of "wrap values in single-element structs for better static guarantees" would be to have a different time struct for every epoch. This is a (possibly quite useful) step back from that, since C's lack of polymorphism would mean a need to implement every time function for every epoch even when the logic is the same.


I guess that makes sense now that you explain it. Conceptually, each epoch value results in a different type for the other field, meaning it works like a union, even though it's actually implemented using the same primitive type for each.


Right, exactly. It's just the fact that all inhabitants happen to share a representation that lets you avoid the syntactic union.


Shouldn't the spec define a way to _encode_ the name/value in such a way that any string can be used as a header name/value? Then `addHeader` is responsible for handling encoding. For example, if I had a method that takes a name/value argument and converts them into "name=value" syntax to use in a URL query, I would consider it a bug if it didn't url-encode name and value.
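
For the URL case, something like this is roughly what I'd expect the method to do internally (Query and param are made-up names; URLEncoder is just the standard library class):

    import java.io.UnsupportedEncodingException;
    import java.net.URLEncoder;

    final class Query {
        // Any string is acceptable as a name or value because the method
        // owns the encoding step.
        static String param(String name, String value) throws UnsupportedEncodingException {
            return URLEncoder.encode(name, "UTF-8") + "=" + URLEncoder.encode(value, "UTF-8");
        }
    }

    // Query.param("lang", "en US&x=1")  ->  "lang=en+US%26x%3D1"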


While I'm not sure String should be abstract, I see what the author is implying. The idea is that having String abstract would force the programmer to come up with appropriate data structures more often than not. There's definitely some truth to it.


If you have another string type that has a limited character set, then it is not a subclass of string, because it breaks the Liskov substitution principle. In Java land, allowing user code to subclass String would create massive security issues in the sandbox. I think there is a good case for having different types of strings and providing runtime support so this is cheap, but this is not subclassing.


Definitively not. If you need a specialized type: make one instead of bodging existing basic types that are already complex enough as they are.


But I am a bit mystified - I assume that in Java I cannot subclass String? Which seems to be the whole argument.

Well, you know what my answer to that will be :-)

(To be fair I have just written a session-cache-management system for Python and did not subclass anything much. I suspect I should look into that :-)


No, the point of the argument was that people should be FORCED to subclass String. Which is kinda pointless, as its easy for lazy programmers to circumvent, and there ARE things that are fit to be represented by just Strings. But I do agree with the author that it should be either subclassed or run through a validator function in a lot of cases.


String is immutable for good reasons. It's explained in Joshua Bloch's Effective Java, which should be mandatory reading for all Java programmers.


Agreed. I cannot see anyone in this thread (blog post or comments) who has argued for the opposite, that "string should be mutable".


Well, immutable classes are final. Thus, suggesting to make it abstract does suggest the opposite, IMHO.


The problem in languages like Java, C#, etc. is that best practice would be to wrap up the string in a class, but that's a "heavyweight" solution and developers are lazy. But these "stringy" bugs can be nasty when the wrong string is passed into the wrong argument.

In some languages you can at least "typedef" or "type alias" strings so that in your type declarations and argument types, it's the typedef and not just a raw string that is being passed in.

e.g. (made up pseudocode)

    typedef ZipCode as string;
    State LookupState(ZipCode zip);

So with solutions that don't involve wrapping everything in a class, you can incrementally develop your type by first doing the "typedef" and then adding functions.

So that's the problem with languages like Java. You develop a "I don't need a whole class for this" attitude because you can't incrementally develop your type safety.


Short answer: no.

Yes, the concept of String is problematic. It's an overloaded one that people have variously mapped to:

    a. An array of bytes. (C char is a byte.)
    b. "Words", from a (possibly fuzzy) set of 2-100k specific strings from natural language. 
    c. Arbitrary arrays of characters. 
    d. Arbitrary arrays of *printable* characters.
    e. Compact representations of abstractions, e.g. regexes which represent functions on strings. 
These have conflicting needs. For (a), most seasoned programmers have learned the hard way of the need to separate byte[] from String as concepts, due to Unicode and encoding and various nasty errors you get if you confuse UTF-8 and UTF-16; but also because random access into a byte[] of known structure is often a fast way of getting information while random access into a String is generally inferior to regex matching.
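
A tiny Java illustration of (a):

    import java.nio.charset.StandardCharsets;

    class EncodingDemo {
        public static void main(String[] args) {
            // The same bytes decode to different Strings under different
            // charsets, which is why byte[] and String must stay separate.
            byte[] bytes = "é".getBytes(StandardCharsets.UTF_8);                // 0xC3 0xA9
            System.out.println(new String(bytes, StandardCharsets.UTF_8));      // é
            System.out.println(new String(bytes, StandardCharsets.ISO_8859_1)); // Ã©
        }
    }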

Regarding (b), what you sometimes end up wanting is a symbol type (or, in Clojure, keywords) that gives you fast comparison. You might also want something that lives at a language level (rather than runtime strings) like an enum or tagged union (see: Scala, Ocaml) to get various validation properties.

Regarding (e), I think everyone agrees that regexes belong in their own type (or class).

Where there's some controversy is (c)-(d). There are over a million supposedly valid code points in non-extended Unicode, but only about 150,000 of them are used, and some have special meanings (e.g. endian markers). UTF-8/16 issues get nasty quick if you don't know what you're doing. What all this means is that you can make very few assumptions about an arbitrary "string". You might not even have random access (see: UTF-8/16)! (Although a strong argument can be made that if you need random access into something, a string isn't what you want, but a byte[]. Access into strings is usually done with regexes, not positional indices, for obvious reasons.)

As messy as Strings are over all use cases, the thing about them is that they work, and also that they're a fundamental concept of modern computing in practice. We can't get rid of them. We shouldn't. Making them an abstract class I don't like, for the same reasons most people would agree that making Java's String final was the right decision. (Short version: inheritance mucks up .equals and .hashCode and breaks the world in hard-to-detect ways.)

What we do however need to keep in mind is that when we have a String, we're stuck with something that's meaningless without context. That's always true in computing, but easy to forget. What do I mean by "meaningless without context"? There's almost nothing that you know about something if it's a String.

On the other hand, if you have a wrapper called SanitizedString (some static-typing fu here) that immutably holds a String and the only way to get a value of that type is to pass a String through a SQLSanitize function, you know that it's been sanitized (or, at least, that the sanitizing function was run; whether it's correct is another matter). But this isn't a case of inheritance; it's a wrapper. You can use this to strengthen your knowledge about these objects (a function String -> Option[SanitizedString] returns Some(ss) only if the input string makes sense for your SQL work).
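
In Java terms, that wrapper might look roughly like this (the check inside sanitize() is a placeholder, not a real SQL sanitizer):

    import java.util.Optional;

    final class SanitizedString {
        private final String value;
        private SanitizedString(String value) { this.value = value; }

        // The only way to get a SanitizedString is through sanitize(),
        // which may refuse: String -> Optional<SanitizedString>.
        static Optional<SanitizedString> sanitize(String raw) {
            if (raw == null || raw.contains("'") || raw.contains(";")) {
                return Optional.empty();
            }
            return Optional.of(new SanitizedString(raw));
        }

        @Override public String toString() { return value; }
    }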

Inheritance I dislike because it tends to weaken knowledge. I think it's the wrong model, except for a certain small class of problem. What good there is in inheritance is being taken over by more principled programming paradigms (see: type classes in Haskell, protocols in Clojure).


We know some things for (a-d). For instance, a string is (in every case I've seen) a monoid: you can append to either end, and there exists an "empty" string which turns appending into an identity.

For (e) this may not be true if we want to treat the string as semantically equivalent to whatever it's representing---you cannot always arbitrarily append represented objects. That said, a string is not that represented object unless there's an isomorphism. In Haskell types, you'd want to have a `String -> Maybe a` function to capture failure-to-translate. This case includes things like `SQLSanitize`. Also HTTP header parsing, whenever the headers have an interpretation in the current level of the system.

Note also that this value of (e) does not depend upon a representation of strings as any kind of character-based thing. You can make a monoidal representation of a regular expression from a star-semiring combinator set (and see http://r6.ca/blog/20110808T035622Z.html for a great example).

The remaining semantic troubles seem to be around ideas of mapping or folding over a string. Length depends on this representation---do I want character length or byte length?---as do functions like `toUppercase`. And then there's the entire encoding/byte-representation business.

So they should perhaps be instances of an abstract class of "Monoid" plus a mixin specializing the kind of "character" representation. Or, the outside-in method of Haskell's typeclasses where String, ByteString(.Char8), and Text each specialize to a different use case but all instantiate `Monoid` and each have some kind of `mapping` and `folding` function which specialize to the kind of "character" intended. Finally, there are partial morphisms between each of them which fail when character encoding does.


> the thing about them is that they work

The extent to which you think strings need replacement is equal to the extent that you agree with this particular statement. The charitable view is they let you get up and running without needing to work out every little detail yourself. The uncharitable view is that they appear to work far more often than they actually work.

Because any invariants you may want them to have are ad-hoc, any error with those invariants can very easily slip by the programmer, the compiler, and the unit tests. Depending on what type of programmer you are, this can be The Father Of All Sins.

Strings are low-level. They're a half-step above binary streams. And because they're ubiquitous and all libraries work in terms of strings, there's active pressure against creating new types that represent the invariants and structure that your application may actually need, which often results in those invariants never even being thought through in the first place.

Again, it's subjective, and depends strongly on your opinion of Worse is Better. I agree with your thoughts on inheritance. I think this problem is difficult to solve in a way that's not strictly worse than the original problem. I would be interested to see a language that only defines strings as a Protocol (or whatever) with a handful of implementations in the stdlib. I honestly have no idea if it would be the Garden of Eden or a steaming pile of ass, but I think it's important that we find out.


The string is a stark data structure and everywhere it is passed there is much duplication of process. It is a perfect vehicle for hiding information. -- Allen J. Perlis


His name is "Alan." I'm not sure how you managed to copy-paste that improperly.


BTW: there is no such thing as a generic sanitized string. There may be a SQL-escaped string, HTML-escaped, JS-escaped, JS-in-HTML-in-SQL-escaped, etc. It always depends on context (I'm going to invent a format that uses ASCII 'a' in its escape sequences — sanitize that! ;)


Base64 encode a string and it's generically sanitized. They're a bit difficult to read with the naked eye though.


Unless it's in a URL. (This is why URL-safe Base64 versions exist... Which can then in turn be inappropriate for other places.)


And base64 can use the / character, so it's unsafe for POSIX filenames.


Until the next element in the pipeline chain decodes it and you can then have injection.


Sure. I was just using one example of sanitization: defense against Bobby Tables.


For me, xml and other types of config files can be hard to debug because everything is a string. IDE/smart editors help out a lot when you have types.


XML has data types. You just have to load the schema.


Yeah, the problem is that the values are strings.



