Even so, 2^16 chars is still absurdly limiting. Furthermore, address/length pair...

advisory5739f2 · on Aug 2, 2011

I disagree that 2^16 chars is absurdly limiting. I remember coding in some language that had a 255-char String limit (maybe that was some kind of Pascal), and while that was somewhat limiting, it was not an issue for some 90% of strings. 2^16 pretty much takes care of 99% of string usage, especially in the earlier days of programming languages. For anything over 2^16, using a more specialized data structure for a buffer would probably have been more than acceptable.

burgerbrain · on Aug 2, 2011

2^16 bytes is a mere 64KB. Sure, you can get away with small strings if you have to, but in a world where you don't have to put up with that, it would be quite frustrating.

For example, say you need to perform some sort of text-editing-style task and insert a few chars into the middle of a file. If one of the internal representations of the file happens to be one contiguous char*, then all you have to do is one quick memmove to make some room. With a length-prefixed representation, the best-case scenario is that you do the memmove as before, then also update the length (no biggie really, since you probably keep that around somewhere anyway). However, if you have a 2^16 restriction and a file larger than that, you suddenly can't use a contiguous piece of memory. This would complicate numerous things, including searching, splitting, and (potentially) insertion. Not having a contiguous piece of memory also complicates the process of laying any number of data structures on top of the file data. Even further, it causes issues when you want to just mmap in a file, unless you want all your files to be prepended with the number of chars in them, which causes even more issues...
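
A rough sketch of that insertion in C, assuming the buffer already has spare capacity and the length is tracked alongside it. `buf_insert` is a made-up helper name for illustration, not anything from libc:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: insert n bytes from src at offset pos of a contiguous
 * buffer that holds *len bytes and has room for cap bytes in total.
 * Returns 0 on success, -1 if the buffer has too little spare room. */
static int buf_insert(char *buf, size_t *len, size_t cap,
                      size_t pos, const char *src, size_t n)
{
    assert(*len <= cap && pos <= *len);
    if (n > cap - *len)
        return -1;
    /* Shift the tail right to open a gap of n bytes (regions may overlap). */
    memmove(buf + pos + n, buf + pos, *len - pos);
    memcpy(buf + pos, src, n);
    *len += n;  /* the one extra step when the length is stored explicitly */
    return 0;
}
```

The memmove is the same whichever string convention is used; only the bookkeeping line at the end differs. With a 16-bit length field, `*len` would simply be unable to exceed 65535, which is the restriction being argued about.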

Dove · on Aug 2, 2011

Right, so you'd need a BigString analogue to BigInt for that case.

Which wouldn't be that bad, really. I mean, a 64k string is plenty for your everyday string needs. And in cases where you're handling really long strings, you'd probably want a specialized data structure anyway. I mean, it's not like char* is exactly efficient when you need to insert something into the middle of a many-megabyte file.

dkersten · on Aug 2, 2011

For those kinds of tasks, a string doesn't seem all that appropriate to me. The BigString (that is, a Rope[1]) is more appropriate IMHO.

[1] http://en.wikipedia.org/wiki/Rope_(computer_science)
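
For anyone who hasn't met one, here is a very small sketch of the rope idea in C. The struct layout and the names `rope_leaf`, `rope_concat`, and `rope_at` are assumptions made up for this example; real rope implementations also rebalance and free nodes, which this one does not:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical minimal rope: a leaf points at a text fragment, an inner node
 * joins two subtrees. Illustrative only. */
typedef struct rope {
    struct rope *left, *right;  /* both NULL for a leaf */
    const char *text;           /* leaf only: fragment, not NUL-terminated */
    size_t len;                 /* total bytes in this subtree */
} rope;

static rope *rope_leaf(const char *text, size_t len)
{
    rope *r = malloc(sizeof *r);
    assert(r != NULL);          /* sketch: give up on allocation failure */
    r->left = r->right = NULL;
    r->text = text;
    r->len = len;
    return r;
}

/* Concatenation allocates one node and copies no text. */
static rope *rope_concat(rope *a, rope *b)
{
    rope *r = malloc(sizeof *r);
    assert(r != NULL);
    r->left = a;
    r->right = b;
    r->text = NULL;
    r->len = a->len + b->len;
    return r;
}

/* Fetch byte i by walking down from the root to the leaf that holds it. */
static char rope_at(const rope *r, size_t i)
{
    assert(i < r->len);
    while (r->left != NULL) {
        if (i < r->left->len) {
            r = r->left;
        } else {
            i -= r->left->len;
            r = r->right;
        }
    }
    return r->text[i];
}
```

Inserting into the middle then becomes a split followed by two concatenations, so no large tail of bytes has to be moved. The cost is that indexing walks the tree instead of being a single pointer offset.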

Dove · on Aug 2, 2011

That's kind of what I mean. The fact that null-terminated strings can get arbitrarily long doesn't seem like a big advantage. I mean, if you're working with really long strings, you probably want to use something more sophisticated than a character array anyway.

burgerbrain · on Aug 2, 2011

"I mean, a 64k string is plenty for your everyday string needs."

No, not for my everyday needs.

jerf · on Aug 2, 2011

But the question here isn't actually "What is the ONE TRUE STRING format that the language permits, all others to be rigidly banned by the compiler?", the question is "What shall the default string be in the core C APIs and functions?"

If you still need NUL-terminated strings, you could have chosen them, and if you knew enough to so choose, hopefully you know enough to treat them like the dangerous tools they are. Meanwhile, the core C functions and API and UNIX could have been built around the much safer strings, which wouldn't have been all that hard to upgrade to 4 bytes (or more) later. Or we could have done a UTF-8-like size encoding, or turned the default strings into linked lists if they got large, etc. It would be OK, because raw expanses of memory would still be available to you; it just wouldn't be the default.
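
One possible reading of a "UTF-8-like size encoding" is a variable-width length prefix, sketched below in C. The scheme (7 payload bits per byte, continuation flag in the high bit, similar to LEB128) and the names `len_encode` and `len_decode` are assumptions chosen for illustration; nothing like this exists in the C standard library:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical varint length prefix: 7 payload bits per byte, low bits first,
 * high bit set on every byte except the last.
 * Writes at most 10 bytes to out and returns the number written. */
static size_t len_encode(uint64_t len, unsigned char *out)
{
    size_t n = 0;
    assert(out != NULL);
    do {
        unsigned char b = len & 0x7F;
        len >>= 7;
        if (len != 0)
            b |= 0x80;          /* more bytes follow */
        out[n++] = b;
    } while (len != 0);
    return n;
}

/* Decode a prefix from at most avail bytes. Returns the number of bytes
 * consumed, or 0 if the prefix is truncated or longer than 10 bytes. */
static size_t len_decode(const unsigned char *in, size_t avail, uint64_t *len)
{
    uint64_t v = 0;
    assert(in != NULL && len != NULL);
    for (size_t i = 0; i < avail && i < 10; i++) {
        v |= (uint64_t)(in[i] & 0x7F) << (7 * i);
        if ((in[i] & 0x80) == 0) {
            *len = v;
            return i + 1;
        }
    }
    return 0;
}
```

Under this scheme a string shorter than 128 bytes pays one byte of overhead, the same as a NUL terminator, while a 65536-byte string needs a three-byte prefix, so no fixed width has to be chosen up front.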

NUL-terminated strings are the wrong default, even though they should be available to those who really need them.

burgerbrain · on Aug 2, 2011

Multiple string schemes are exactly the sort of fragmentation this industry does not need. Citation: C++.

tptacek · on Aug 2, 2011

You mean, basic_string and charstar?

copper · on Aug 3, 2011

Hm, it can be a real problem if you have the joy of working with old C++ code: the initial codebase might have had its own string type, and then some more classes added by long-gone programmers who thought the earlier types were slow, and then newer code added by people who actually use std::string (well, you get the idea).

(much to my regret, that is a true story.)

rwmj · on Aug 2, 2011

It's not limiting on a PDP-11 though, since that could only address 64K of memory. Presumably when C/Unix was later moved up to VAXen and other 32-bit machines, the field would have been extended to 4 bytes, and today to 8 bytes.