The Most Expensive One-byte Mistake (acm.org)
146 points by CowboyRobot on Aug 2, 2011 | 79 comments



In the article he says it is just one byte extra, but that's clearly not the case, since we probably want strings longer than 255 characters. To be practical, even in the short term, you'd probably need at least two bytes, and that's still a bit limiting. NUL-terminated strings keep you from having to worry about the size of the length specifier and what byte ordering to use when storing it.

He notes that other languages of the day didn't go with NUL-terminated strings. You could also read it the other way: given that C is still more widely used than the other languages of that era, doing something different may have been the right choice.


I read it as "one byte more than just using one byte for NUL". In other words, two bytes.


Even so, 2^16 chars is still absurdly limiting. Furthermore, address/length pairs complicate otherwise very simple programming tasks such as creating string tokenizers (with NULL-terminated strings, you just need to drop in a \0 where the delimiter was).
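Something like this minimal sketch of an in-place tokenizer (the next_token helper is made up for illustration; strtok/strsep do the same job in real code):

    #include <stdio.h>
    #include <string.h>

    /* In-place tokenizer: each call returns the next token and drops a
       '\0' where the delimiter was, so no copying and no length
       bookkeeping is needed. */
    static char *next_token(char **cursor, char delim)
    {
        char *start = *cursor;
        if (start == NULL || *start == '\0')
            return NULL;
        char *p = strchr(start, delim);
        if (p != NULL) {
            *p = '\0';              /* the delimiter becomes a terminator */
            *cursor = p + 1;
        } else {
            *cursor = start + strlen(start);
        }
        return start;
    }

    int main(void)
    {
        char buf[] = "one two three";
        char *cursor = buf;
        char *tok;
        while ((tok = next_token(&cursor, ' ')) != NULL)
            printf("[%s]\n", tok);  /* [one] [two] [three] */
        return 0;
    }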


I disagree that 2^16 chars is absurdly limiting. I remember coding in some language that had a 255-char String limit (maybe that was some kind of Pascal) and while that was somewhat limiting, it was not an issue in some 90% of the strings. 2^16 pretty much takes care of 99% of string usage, especially in the earlier days of programming languages. Anything over 2^16 and using a more specialized data structure for a buffer would've probably been more than acceptable.


2^16 bytes is a mere 64KB. Sure, you can get away with small strings if you have to, but in a world where you don't have to put up with that restriction, it would be quite frustrating.

For example, say you need to perform some sort of text-editing-style task and insert a few chars into the middle of a file. If one of the internal representations of the file happens to be one contiguous char*, then all you have to do is one quick memmove to make some room. With a length-prefixed representation the best-case scenario is you do the memmove as before, then also update the length (no biggie really, since you probably keep that around somewhere anyway). However, if you have a 2^16 restriction and a file larger than that, you suddenly can't use a contiguous piece of memory. This would complicate numerous things including searching, splitting, and (potentially) insertion. Not having a contiguous piece of memory also complicates the process of laying any number of data structures on top of the file data. Even further, it causes issues when you want to just mmap in a file, unless you want all your files to be prepended with the number of chars in them, which causes even more issues...
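Concretely, the insertion case is just this (a rough sketch, assuming the buffer already has enough spare capacity; the names are illustrative):

    #include <assert.h>
    #include <stdio.h>
    #include <string.h>

    /* Insert src (src_len bytes) into buf at pos.  The caller must have
       ensured capacity >= *len + src_len.  One memmove opens a gap, one
       memcpy fills it; updating *len is the only extra step a length-
       prefixed representation would add. */
    static void insert_bytes(char *buf, size_t *len, size_t capacity,
                             size_t pos, const char *src, size_t src_len)
    {
        assert(pos <= *len && *len + src_len <= capacity);
        memmove(buf + pos + src_len, buf + pos, *len - pos);
        memcpy(buf + pos, src, src_len);
        *len += src_len;
    }

    int main(void)
    {
        char buf[64] = "hello world";
        size_t len = strlen(buf);
        insert_bytes(buf, &len, sizeof buf, 5, ",", 1);
        printf("%.*s\n", (int)len, buf);   /* hello, world */
        return 0;
    }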


Right, so you'd need a BigString analogue to BigInt for that case.

Which wouldn't be that bad, really. I mean, a 64k string is plenty for your everyday string needs. And in cases where you're handling really long strings, you'd probably want a specialized data structure anyway. I mean, it's not like char* is exactly efficient when you need to insert something into the middle of a many-megabyte file.


For those kinds of tasks, a string doesn't seem all that appropriate to me. The BigString (that is, a Rope[1]) is more appropriate IMHO.

[1] http://en.wikipedia.org/wiki/Rope_(computer_science)


That's kind of what I mean. The fact that null-terminated strings can get arbitrarily long doesn't seem like a big advantage. I mean, if you're working with really long strings, you probably want to use something more sophisticated than a character array anyway.


> I mean, a 64k string is plenty for your everyday string needs.

No, not for my everyday needs.


But the question here isn't actually "What is the ONE TRUE STRING format that the language permits, all others to be rigidly banned by the compiler?", the question is "What shall the default string be in the core C APIs and functions?"

If you still needed NULL-terminated strings, you could have chosen them, and if you knew enough to make that choice, hopefully you knew enough to treat them like the dangerous tools they are. Meanwhile, the core C functions and API and UNIX could have been built around the much safer strings, which wouldn't have been all that hard to upgrade to 4 bytes (or more) later. Or we could have done a UTF-8-like size encoding, or turned the default strings into linked lists if they got large, etc. It would be OK, because raw expanses of memory would still be available to you, they just wouldn't be the default.

NULL-terminated strings are the wrong default, even though they should be available to those who really need them.


Multiple string schemes is exactly the sort of fragmentation this industry does not need. Citation: C++.


You mean, basic_string and charstar?


Hm, it can be a real problem if you have the joy of working with old C++ code: the initial codebase might have had its own string type, then some more string classes added by long-gone programmers who thought the earlier types were slow, and then newer code added by people who actually use std::string (well, you get the idea).

(much to my regret, that is a true story.)


It's not limiting on a PDP-11 though, since that could only address 64K of memory. Presumably when C/Unix was later moved up to VAXen and other 32-bit machines, the field would have been extended to 4 bytes, and today to 8 bytes.


What you would likely do is use an extension bit so there would be no fixed maximum length for the strings. This would of course add a bunch of overhead that you may or may not make up with the other advantages of knowing the length of strings.


* If you're appending data to a string and the variable-length prefix grows by a byte, will you have to memmove() the entire existing string down 1 byte?

* Is every programmer responsible for detecting this condition?

* (This will make manual string manipulation very complicated and dangerous.)

* Suppose you're concatenating two strings, such that the sum of the lengths requires an additional byte. This could cause a buffer overflow. How would a strcat() function avoid causing a buffer overflow here?

* Does every string need a maximum-length counter too?

* Can you access a random element in the string without having to dereference and decode the length?

On the other hand, if you use a constant-sized length,

* What happens when you overflow the maximum length?

* Can you erroneously create a shorter string by appending text?

* How should string libraries handle this condition? By abort()ing? By returning a special error code? Does every string manipulation need to be wrapped in an if() to detect the special error? How should the programmer handle this condition?

In either case,

* Can you tokenize a string in-place?

* Can an attacker read a program's entire address space by finding a memory location that begins with 0xffffffff and treating it as a string?


I think most of these are answered with the suggestion that no one would have made strings this way without making a matching library that handled the concerns you have here. Honestly char[] strings are much less useful in a UTF-8 world anyway.


This is what DER does to encode ASN.1, and it is basically a nightmare of epic proportions.

The 16-bit tag at the beginning of strings is a reasonable suggestion. 65k charstars are slow and unwieldy anyways.


From the article's comments:

"Rather than ptr+len, I would use a ptr+ptr format, that automatically ports to any size word/memory/address-space without any need for adjustment. -- Poul-Henning Kamp"

"Encode the size of the length value into itself. One way to do this is to set aside the high bit, giving you seven bits of length value storage in each byte. The high bit is set for all bytes of the length value, except the last one. If there is only one byte of length value, its high bit is not set. So for example, strings of length 127 or less would have one byte of length value. Strings of length 128 to 16383 would require two bytes of length value, etc. This way the length value can have arbitrary values, yet still consuming no more space than necessary. Loading and storing the length values could be efficient with hardware support. I would suggest storing the length little-endian. -- F"

"Length-prefixed strings do have many advantages, but I hate to think of the interoperability issues. These days you're probably safe with a 32-bit length, but there'd still be systems around using 16 bits, and you'd try to talk to them on a network and things would be all crazy. Not to mention endinanness issues. As well, null-terminated strings have some optimizations you can do by pointing into part of the string. e.g. you can have the strings "foo bar" and "bar" occupying the same memory space by pointing the latter at the middle of the former. It's common to do this when parsing a string, incrementing a pointer to the part you're interested in. An alternative might be to have length + pointer, pointing to a string somewhere else (which can still be null-terminated for compatibility), instead of length as a prefix. It's worth noting that the Lua language does something like this. You might also be interested to know that a buffer overflow exploit was indeed one of the earliest tricks hackers used to get into the PS3 system. A device known as PSJailbreak exploited such a vulnerability in the kernel's USB device handling to take over the system. (And getting into the console was sort of the first step of the PSN issues, since 1. Sony reacted very poorly and ticked off a lot of hackers, and 2. PSN was designed with the assumption that anything coming from a PS3 could be trusted.)"

And more.


There's another reason for using NUL-terminated strings: It means that a suffix of a string is also a string. Want to convert "hello world" into "hello" and "world"? Just replace the ' ' with a '\0'. With length-prefix strings, you would need to copy the "world" into a new memory allocation -- which is far too wasteful for 1970s programming.


Of course, assuming that unwanted space is there. With addr and len, you could split "helloworld" into "hello" and "world" by simply pointing to the two separate parts, each with len == 5. (This is actually really nice when parsing large strings into smaller strings because you can simply point all the smaller strings into a single, large string).


That assumes that the string characters are not adjacent to the string length field in memory...

(Edited to add) More explicitly: Grandparent specified length-prefix strings, where the string's length is stored in memory immediately before the string's characters. And his claim is correct in that case.

You seem to be assuming a string object which consists of a length and a pointer to the characters. What you suggest certainly would work there, but any straightforward implementation of this is going to require another pointer's worth of memory and two memory allocations for each string, too.


This is effectively what Java does when you call substring() (String objects contain a reference to a char array, a length, and an offset).


That's also because (to my knowledge) strings in Java are immutable. Same as in Ruby and a bunch of other languages.

So the optimisation that can be done is to put all of the string data into one big (immutable) array and have each string object just reference into it for its data.


And think of the classic code for copying a NUL-terminated string. Unless I'm missing some clever optimization, replacing it with a length-based version would require another counter variable -- meaning the algorithm would use another register (or possibly even two on an 8-bit machine) and require another operation for each character. Not such a big deal now, but it would have been a big advantage for NUL-termination back when C was designed.
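Roughly what I mean, as a sketch rather than a claim about what any particular 1970s compiler emitted:

    #include <stddef.h>

    /* The classic NUL-terminated copy: two pointers, no counter. */
    void copy_nul(char *dst, const char *src)
    {
        while ((*dst++ = *src++) != '\0')
            ;
    }

    /* A counted copy needs the length live in a register (or spilled to
       memory on a register-starved machine) plus a decrement-and-test
       per character. */
    void copy_counted(char *dst, const char *src, size_t len)
    {
        while (len-- > 0)
            *dst++ = *src++;
    }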


You can make your string pointer a "fat pointer" (basically a (ptr,size) tuple), instead of the char ptr you have now. Then substrings are still strings, you just have to make a new fat pointer to it rather than an ordinary pointer.
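A minimal sketch of that idea (the struct and function names here are made up):

    #include <stdio.h>
    #include <string.h>

    /* A "fat pointer": the string is (pointer, length), so a substring
       is just another fat pointer into the same storage -- no copying
       and no terminator needed. */
    struct str_slice {
        const char *ptr;
        size_t      len;
    };

    static struct str_slice slice(struct str_slice s, size_t start, size_t len)
    {
        struct str_slice out = { s.ptr + start, len };
        return out;
    }

    int main(void)
    {
        const char *data = "helloworld";
        struct str_slice all   = { data, strlen(data) };
        struct str_slice hello = slice(all, 0, 5);
        struct str_slice world = slice(all, 5, 5);
        printf("%.*s %.*s\n", (int)hello.len, hello.ptr,
                              (int)world.len, world.ptr);  /* hello world */
        return 0;
    }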


Sounds a bit like array slicing. ;-)


The 68K Macintosh used Pascal-style strings: Str255, with a count byte on the front.

I don't recall it ever noticeably improving code quality. And it made it really, really hard to deal with long strings; everyone invented their own formats and interoperability between libraries was a nightmare.


>Another candidate could be IBM's choice of Bill Gates over Gary Kildall to supply the operating system for its personal computer. The damage from this decision is still accumulating at breakneck speed, with StuxNet and the OOXML perversion of the ISO standardization process being exemplary bookends for how far and wide the damage spreads.

First off, IBM did put CPM/86 on the PC and sell it. Many people bought it. We had it at work. The reason CPM/86 failed relative to PC-DOS was simple - CPM/86 cost nearly $300 and PC-DOS cost $40.

There wasn't anything discernibly better about CPM/86 at the time, either its programming API or its user interface, and people did the sensible thing and bought the less expensive operating system. It was a no-brainer.

As to the idea that CPM/86 was somehow a secure operating system, I haven't heard that before. It's patently false.

What killed CPM/86, plain and simple, is it was way, way overpriced. I have no idea who made that decision.


pageman has a [dead] reply below linking to http://www.businessweek.com/magazine/content/04_43/b3905109_... — claiming that the price disparity was IBM's doing, intended to drive the cross-platform CP/M out of the market while still 'supporting' it.

It looks like pageman got hellbanned 2 years ago for participating in one of the erlang frenzies. I went to email him about it, but it looks like I'd already done so 1.5 years ago!


I thought the idea was that if this decision had gone the other way then Microsoft wouldn't have achieved its current position.


People should not forget that UNIX was an informal experiment, not a plan for what should dominate the computing world 40 years later. When writing informally-experimental software, you should have a bias toward doing things differently from the mainstream, so that you can put new ideas to the test.

There are plenty of options for people who don't want to write new code using NUL-terminated strings. They can make their own decision. There's no reason to blame a 40-year-old decision for errors today.


If you want your new code to interact with old code, there is a very significant advantage to using the old code's data format.


Whenever I've had to write non-trivial C, I try to use the bstring library (http://bstring.sourceforge.net/) if possible, which provides a length+blob approach with a nice API, and can be converted to/from NUL-terminated strings fairly easily.


DOS did not "invent" it's own path separator, directories were added to DOS long after command line flags, which already used slash, and COMMAND.COM already supported placing a flag immediately after a command name (without whitespace), making backwards compatibility with their own OS extraordinarily hacky to implement if not impossible.

Also please remember we're talking about the late 70s here. This is one of those annoyingly common vitriolic ideas about Microsoft, and its appearance here really tarnishes the article's reputation for being well researched.


The author mentioned this:

"IBM had decided to use the slash for command flags, eliminating Unix as a precedent, and the period was used between filename and filename extension, making it impossible to follow DEC's example."


DEC also used periods to separate filename from extension.

People forget that CPM's conventions (which were copied by CPM/86 and PC-DOS) were copied from DEC conventions. DEC operating systems were very, very popular at the time.


I wonder whether the most expensive mistake is not so much the NUL-terminated string design, but the persistent use of library string and memory manipulation functions that we now call "unsafe". The culture of constantly writing your own raw-memory-based string manipulation functions and working with memory so directly, without safety constraints.

I remember using strings and STL containers in C++ and wondering why anybody would go back to the clunky ways of malloc'ing, scanning for NULs, using memcpy, strcmp, and other things that might "run away" on you so easily. I remember feeling unsafe using strcpy. In contrast, it felt very safe to use C++ strings: I had to go out of my way to code a buffer overflow.

Better, but not best: "safe" replacements like strncpy still require you to maintain the correct buffer length on your own. Basically, there are just so many ways to shoot yourself in the foot.

The industry could've significantly addressed this by adopting a standard library for manipulating strings and memory, at least with length-restricted "safe" functions (strncpy, but one that always NUL-terminates). Potential language support and compiler support could've been added. All of this should've been done way back in the day, but it wasn't.


That subjective feeling of unsafety is excellent. I can only wish that other people felt that same tinge of doubt when invoking these operations, as one mistake can corrupt memory.

I also find it ridiculous how tolerant the OSS community is about these things. Major projects written in C (Pidgin) often find themselves fixing these sorts of mistakes over and over again. Why is this acceptable? To me it suggests a rushed design, and it potentially puts my information at risk. Remember, all of the speed gains to be had from C are lost the moment you core dump.

It isn't that C is inherently insecure, but we should require good reasons for it to be used, particularly with apps that interact with the network. Right now it seems like it has a bit too much geek cred as the language of alpha developers.



> (strncpy, but one that always NUL-terminates)

It is trivial to write a function like strncpy_term() that adds a terminating NUL and drop it into any C project you write.
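For example, one way such a helper might look (the strncpy_term name is from the comment above; the strlcpy-like semantics here are just one reasonable choice):

    #include <string.h>

    /* Copy at most size-1 bytes of src into dst and always NUL-terminate
       (unlike strncpy, which may leave dst unterminated and zero-pads
       the remainder). */
    char *strncpy_term(char *dst, const char *src, size_t size)
    {
        if (size == 0)
            return dst;
        size_t n = strlen(src);
        if (n >= size)
            n = size - 1;
        memcpy(dst, src, n);
        dst[n] = '\0';
        return dst;
    }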

> Potential language support and compiler support could've been added. All of this should've been done way back in the day, but it wasn't.

I would argue that since C became the de facto high-level lingua franca of low-level programming, it was better to keep it minimalistic so that programs written in C would be as portable as possible.


In the performance category, NUL termination also imposes a cost on string searches, unless you know the string length via some other mechanism. Algorithms like Boyer-Moore get their speed by skipping over some characters, but with null-terminated strings it's unsafe to skip any character without first checking whether it's the NUL. Granted, that's mostly a cost imposed on long strings being searched for longish substrings.


The case you describe as bearing the cost is literally what Boyer-Moore excels at, so it doesn't even have the saving grace of only hurting a pathological case.


> The damage from this decision is still accumulating at breakneck speed, with StuxNet and the OOXML perversion of the ISO standardization process being exemplary bookends for how far and wide the damage spreads.

Bashing Windows doesn't make you cool, it's just tiresome.


Yeah, but it does make you appear cool on Slashdot.


A C string is a convention, not a type, and I don't know why we still cannot get a string type in C that can live side by side with 'char *'. I mean, we got wchar_t duct-taped to C, so why did the C99 guys not introduce a 'string' type (addr/len)?


You can just do it yourself: typedef char* (or some other fancy array) as a string type and treat the first x bytes as the length of the string.

There's nothing stopping people from doing this themselves.
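A rough sketch of what that might look like, assuming a 4-byte length stored in native byte order just before the characters (the lpstr name and helpers are made up):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Home-grown length-prefixed string: the value you pass around points
       at the characters, so it still works with printf %s as long as a
       trailing NUL is kept; the length lives in the 4 bytes before it. */
    typedef char *lpstr;

    static lpstr lpstr_new(const char *src)
    {
        uint32_t len = (uint32_t)strlen(src);
        char *block = malloc(4 + len + 1);
        if (block == NULL)
            return NULL;
        memcpy(block, &len, 4);            /* length prefix */
        memcpy(block + 4, src, len + 1);   /* characters + NUL */
        return block + 4;
    }

    static uint32_t lpstr_len(lpstr s)
    {
        uint32_t len;
        memcpy(&len, s - 4, 4);            /* O(1), no scanning */
        return len;
    }

    int main(void)
    {
        lpstr s = lpstr_new("hello world");
        printf("%s has length %u\n", s, (unsigned)lpstr_len(s));
        free(s - 4);
        return 0;
    }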


> If the source string is NUL terminated, however, attempting to access it in units larger than bytes risks attempting to read characters after the NUL. If the NUL character is the last byte of a VM (virtual memory) page and the next VM page is not defined, this would cause the process to die from an unwarranted "page not present" fault.

I don't see how that could happen. Are there any machines out there where VM pages are not aligned to some multiple (even 1x) of the CPU's native word size?

If I'm correct in assuming there isn't, then by definition, if the NUL character is the last byte of a page, then it's word-aligned, and you'd read exactly to the end of the page anyway. Alternatively, if you're starting your read on an unaligned address but making your read size a multiple of the CPU word size (in which case you could read past the end of the page), you're not really gaining anything performance-wise.


The page fault problem seems unlikely, but SIMD-optimized versions of functions like strcspn() will read in 8-byte chunks, causing Valgrind's memcheck to complain about invalid reads when you do a strcspn() on a string where length % 8 != 0.

As for the original problem, I suspect x86 is relatively unique in its tolerance for unaligned memory accesses. The ARM processor I'm working with will return the wrong data if you try a 16-bit or 32-bit read that is not aligned on a 16-bit or 32-bit boundary. malloc() is supposed to return memory that is aligned to the platform's largest alignment requirement, so it takes a bit of deliberate work to create unaligned accesses. It seems unlikely that a NUL-terminated string will be accessed using unaligned multi-byte reads, since optimized libc functions will take alignment into consideration.


The choice of a null terminator vs len+adr is distinct from the issue of buffer overflows. A string implementation could use storage buffers sized with len+adr (with len as capacity) and range check them at runtime, but still use null terminators for the actual string length. This wouldn't necessarily be redundant; null termination has desirable semantics for algorithms that process text as a stream (e.g. scanners, regex) and saves having to keep track of distance to string end in a separate counter.

The deeper problem is lack of memory safety. The state of the art in static checks wasn't as advanced back then though (Pascal, in its original form, was not well received at the systems level), and dynamic checks would be considered too costly.


Interesting read. However, the article doesn't fully consider the costs of the alternative decision. One of the things I really hated about COM programming was BSTRs (the string's length is stored in memory immediately BEFORE the address the pointer points to). Granted, BSTRs are a kludge, but to me they illustrated the elegance of the NULL-terminated C string vs. the alternatives.


The other thing that came to mind to me was that you have now defined a maximum string length, unless you then had a meta-length number (describing the "length" of the length). And how long should that number be?

I could see where Ken & Co just looked at it and figured null-termination was a more elegant answer.


> The other thing that came to mind to me was that you have now defined a maximum string length, unless you then had a meta-length number (describing the "length" of the length). And how long should that number be?

size_t


You could use a variable-width size field similar to how UTF-8 works, but that's getting pretty complicated and inefficient.

e.g.

0xxx xxxx

10xx xxxx xxxx xxxx

110x xxxx xxxx xxxx xxxx xxxx

1110 xxxx xxxx xxxx xxxx xxxx xxxx xxxx

etc
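A sketch of a decoder for that scheme (the count of leading 1 bits in the first byte says how many extra length bytes follow; storing the length bits big-endian here is an arbitrary choice):

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Returns the number of prefix bytes consumed and writes the decoded
       length to *len_out. */
    static size_t decode_len(const uint8_t *p, uint64_t *len_out)
    {
        size_t extra = 0;
        uint8_t mask = 0x80;

        while (extra < 7 && (p[0] & mask) != 0) {  /* count leading 1 bits */
            extra++;
            mask >>= 1;
        }

        uint64_t len = p[0] & (uint8_t)(0xFFu >> (extra + 1));
        for (size_t i = 1; i <= extra; i++)
            len = (len << 8) | p[i];

        *len_out = len;
        return extra + 1;
    }

    int main(void)
    {
        const uint8_t a[] = { 0x05 };         /* length 5, 1-byte prefix   */
        const uint8_t b[] = { 0x81, 0x00 };   /* length 256, 2-byte prefix */
        uint64_t len;

        printf("%zu-byte prefix, ", decode_len(a, &len));
        printf("len=%llu\n", (unsigned long long)len);
        printf("%zu-byte prefix, ", decode_len(b, &len));
        printf("len=%llu\n", (unsigned long long)len);
        return 0;
    }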


And re-introduces the 'magic character' you're trying to eliminate.


The magic character is in a known location, there's no need to iteratively scan for it.

The use of "magic" characters is not the problem, as I see it.


No it's not. In the first line it's at [0], in the second line it's at [1], in the third line it's at [2]. You have to scan through to find where the '0' appears.


True, but it's O(log N) rather than O(N) to determine length.

BTW I'm not sure if it was clear or not but those are bits in my diagram, not bytes/characters. The "x"s are the bits of the length field, not the characters of the string.


But you could just as easily set aside the first 4 bytes of a string to be the length and then it would be O(1).

But does that really matter when you're doing operations on the string that iterate over the whole string anyway? Since iterating the whole string is O(n) anyway, you're not really gaining anything.


Operations to return string length or return characters from the end of the string become far less expensive.

You also have a fair idea of a reasonably sized buffer to move in duplication operations, avoiding single-character move loops.


Interestingly, it was also Ken Thompson who invented UTF-8.


If we could send one message back in time to Ken, Dennis, and Brian, I nominate this.


BSTR's are not only null-terminated, they're double null-terminated (two bytes of zeroes).


...but that's just a convenience to allow them to play (mostly) nice with wchar APIs. The BSTR itself could have avoided the trailing NULLs if it never had to worry about being treated as a simple wchar buffer. http://blogs.msdn.com/b/ericlippert/archive/2003/09/12/52976...


My point was that it's not fair to point to BSTR's as an example of how the len+data approach is bad because BSTR's force you to do both and (in my opinion) end up having the drawbacks of both approaches and the benefits of neither.


I would argue that a more expensive one-byte mistake was the computer industry failing to standardize on a newline representation (CR+LF versus LF).

In mixed Unix/Windows environments, I cannot even count the number of times that this discrepancy has created issues or slowed down work: config files accidentally saved with Windows newlines and causing cryptic error messages ("No such file or directory: filename" because "filename" from that config file has an invisible CR at the end), discovering that a tool had been silently converting newlines from one format to another when you wanted to preserve them but it is too late to fix because the files have been shipped to customers, tools stripping the LF but not the CR and corrupting text displayed on terminals, $ in a regex not matching the end-of-line because of the presence of CR, of course the pain of editing Unix files on Windows (the only reason why Wordpad exists), etc.


> the pain of editing Unix files on Windows (the only reason why Wordpad exists)

An FYI for anyone who might have this problem, Notepad2 is a drop-in replacement for notepad and can handle Unix/Windows line-ending conversions (plus a few other nice features for a small binary).


C++'s std::string solves this. Not only does it take care of all memory management for you, it also stores the string as a length and a pointer. There's also a common 'small string' optimisation that stores strings a few bytes long inside the class itself, so it isn't doing any dynamic allocation. O(1) to get the size and small strings don't allocate - nice! Still, people will continue to moan that C++ is rubbish or something. You can even build safe stack-allocated strings using templates, mimicking exactly how C does it but in a safe and straightforward way; that idea hasn't caught on, though - seems std::string is just fine.

To be fair C++ is best described as using both, since string literals are still null-terminated, and in many cases you have to use the std::string::c_str() method for backwards compatibility where a null-terminated string is expected (which happens regularly in practice).


"To be fair C++ is best described as using both"

I think that is the real reason people commonly complain about C++. It's stuck in some sort of weird twilight zone, halfway towards being a modern safe language, but retaining enough of C to make it dangerous. Arguably more dangerous than C since newcomers may not be aware of what they are getting themselves into.

Kind of like the difference between a pit of quicksand, and a pit of quicksand covered in palm leaves. ;)


I don't agree that it is more dangerous than pure C. First, you can do all your operations on strings in 100% C++, which will be aggressively optimized to essentially what you'd be doing with C anyway. Second, keeping the underlying representation as NUL-terminated allows you to use other APIs that consume char * C-style strings by calling c_str(). There is a compile-time const check that forces you to write a cast when calling APIs that do not consume a const char *. If the APIs that you are calling do not modify the string, but merely read it (which is the case for almost all uses), then you can have a C++ app that is completely safe from string buffer overflows.

This is called abstracting out an easy-to-make-tragic-mistakes-in problem into a small layer (C++ std::string) and using that layer everywhere.


> In other words, this could have been a perfectly typical and rational IT or CS decision, like the many similar decisions we all make every day; but this one had quite atypical economic consequences.

In other words, it is the exact same kind of pragmatic decision as all the rest of the author's examples, but is simultaneously different.


"Using an address + length format would cost one more byte of overhead than an address + magic_marker format, ..."

Actually, pedantically: assuming an unsigned length, one byte for strings < 256 characters, two bytes for up to 64K, etc.

Having at least heard Dennis talk about the development of UNIX, the notion that C was 'dangerous' and not for folks who didn't know what they were doing was both a conscious decision and an expedient one. I mean, c'mon, you can cast a constant into a function pointer. It really was just syntactic sugar over basic assembly.


It'd be one extra byte for 64K, etc., assuming you scrap the one-byte magic marker.


And yet, these days, performing such undefined operations may cause your compiler's optimizer to assume that such code is e.g. dead, and behave unexpectedly.


Oh absolutely. The original author was claiming that using a 'signature' character, NUL, to indicate the end of a string is the most dangerous one-byte 'feature', and speculating on whether or not the folks at Bell Labs thought that through.

In context, C was just a glorified version of assembler (but with better looping constructs) and that one could overrun strings, or randomly go into the weeds if the NUL was missing was 'understood' because that was no more dangerous than simply writing the assembler yourself. I mean 'JSR @R7,0x4bee' isn't really that much different :-)

I think the author was looking at the past through today's understanding of what the C compiler does, as opposed to the purpose it was originally set out for (which was to be more readable than assembly)


What always bothered me with nul termination is that it isn't a data structure, it's a convention.

And as with all conventions there's discipline involved. You can't unit test and get rid of a certain kind of bug once.

To this day I still think that (Borland) Pascal is a better language than C. It has sets, array indices, and, best of all, a real module concept.


Actually, null-terminated strings made a lot of sense when strings were short. In my 8-bit days, I used both approaches (because, sometimes, strings have to contain a NUL in them).


One OS which did not make this mistake was the Classic Mac OS. I had a thread on Ars Technica where I argued whether Copland would have been more secure than Mac OS X based on that.


>Another candidate could be IBM's choice of Bill Gates over Gary Kildall to supply the operating system for its personal computer. The damage from this decision is still accumulating at breakneck speed, with StuxNet and the OOXML perversion of the ISO standardization process being exemplary bookends for how far and wide the damage spreads.

As it happens, I researched that one for years, and I now think the root cause of that one is likely Bill Gates being an aggressive businessman who treated business as war.



