
Also it had sane strings. Strings with sizes, without the whole null-terminated madness of C, which still haunts us today.



Languages 10 years older than C already had proper strings. In a certain sense, Go and C share some design decisions when it comes to adopting features that were already common elsewhere.


> Also it had sane strings. Strings with sizes

Except they were length prefixes on the buffer, which was bad and not sane, and the one-byte prefix greatly limited the size of the original Pascal strings.

And technically they were UCSD strings, not standard Pascal; other implementations used e.g. padded strings.
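
For reference, the layout was roughly this -- a sketch in C for comparison, not Borland's or UCSD's actual declaration:

    typedef struct {
        unsigned char len;   /* one-byte length prefix: at most 255 characters */
        char data[255];      /* the characters themselves, not NUL-terminated */
    } ShortString;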


"Sane" strings that may waste 4 bytes of length field for unnecessary count, or may have a 1 or 2 bytes length field that proves to be insufficient. Or "sane" strings that alloc and dealloc and refcount like mad, bringing the application to a stall. "Sane" strings that discourage the developer from just coming up with a simple memory management scheme that fits the situation at hand.

"Sane" strings that lead to incredible bloat and incompatibility, because there is no one true "sane" string type, so every module that doesn't know better forces their own way onto the user.


If the length field is 4 bytes then only 3 bytes are "wasted" compared to C with its null-terminated strings and 1-byte chars. The difference drops if you have wider character types. Not to mention the time saved not having to scan every string to determine its length.
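
To make that concrete, a rough sketch (names and types are made up, not taken from any particular library):

    #include <stddef.h>
    #include <string.h>

    typedef struct {
        size_t len;    /* the few "wasted" bytes up front */
        char  *data;   /* not necessarily NUL-terminated */
    } LenString;

    size_t len_of(const LenString *s) { return s->len; }     /* O(1) */
    size_t len_of_c(const char *s)    { return strlen(s); }  /* O(n): walks to the NUL */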

I always find it weird when people fret about bytes but not cycles, especially cycles that have to be spent waiting for memory reads.


You are thinking as a developer in the 2020s, not one in the 1990s (or earlier). Memory was incredibly precious: 16-bit x86 had 64KB segments, so if your data didn't fit in one, access got a lot slower. People used nibbles (4 bits) because the extra instructions spent on bit twiddling were worth the cost.

Basically, no sane programmer in the 90s would be happy with a string type that wasted three bytes per object.


I think one of the other factors (I fear calling it "mitigating") is that fixed-width strings were a lot more common back in the day. Outside of serialization they're pretty much gone now, but we can see their mark in various oddball "string" functions of the C standard library which were never designed to operate on "C strings" (though string.h also has a lot of functions which are just plain garbage with no redeeming features).

For instance strncmp and (especially) strncpy make very little sense with C strings, but make sense for NUL-padded strings.
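
For example (an illustrative struct, not from any real codebase):

    #include <string.h>

    struct record {
        char name[8];   /* fixed-width, NUL-padded field, not a "C string" */
    };

    void set_name(struct record *r, const char *src) {
        /* strncpy copies at most 8 bytes and pads the remainder with NULs;
           if src fills the field, no terminating NUL is written */
        strncpy(r->name, src, sizeof r->name);
    }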


CPU cycles were also incredibly precious. It's a tradeoff. In the 80s and 90s you also had smaller caches so iterating over a string to determine its length was more expensive (more likely to hit RAM) than "just" reading its length parameter and carrying on with your life.


Sure, everything was more expensive, but not by the same factor. Main memory was smaller but also relatively faster compared to the CPU. Search for "386 SIMM memory" and you'll see 60 ns modules. Considering that the 386 debuted with a 12 MHz clock, 60 ns is faster than one CPU clock cycle!
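
(For the arithmetic: one cycle at 12 MHz is 1/12,000,000 s, about 83 ns, so a 60 ns memory access indeed completes within a single clock cycle.)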

In other words, "reading the whole string from memory" could be a performance problem, but a less serious problem for machines of those days, compared to using a few more bytes to store the length.


> Basically, no sane programmer in the 90s would be happy with a string type that wasted three bytes per object.

In general, maybe, but we are talking about Borland here, so business logic apps mostly. String size is not a problem there.


> Basically, no sane programmer in the 90s would be happy with a string type that wasted three bytes per object.

Though it was eventually eclipsed by C/C++/Objective-C on Apple platforms, I believe Pascal was the original application programming language for the Apple Lisa and Macintosh, and produced some revolutionary software in the 1980s.

Object Pascal/Delphi certainly enjoyed a fair amount of success in the 1990s.


Yet JOVIAL, NEWP, PL/I, PL/S, PL.8 among other Algol dialects managed it.


> JOVIAL, NEWP, PL/I, PL/S, PL.8 among other Algol dialects managed it.

… with all of them having been developed and run on 32-bit IBM mainframes and other big iron with «lots» of memory (e.g. 512 kB of RAM would have been considered huge in the early 70s).

C, on the other hand, was developed with the limitations of early PDP-11s in mind, machines often equipped with 56 kB of RAM, so null-terminated strings in C were a rationalised design decision and/or a trade-off. Besides, both UNIX and C started out as research and professional hobby projects, not fully fledged commercial products.

Since internetworking was all but non-existent at the time, remote code execution stemming from buffer overruns was not an issue, either.


You can start with the IBM 704 used for Fortran and Lisp in 1954, the TT 465L Strategic Air Command Control System in 1960, the B5000 in 1961, and the CDC 6600 in 1964, and then compare them with the capabilities of a 1964 PDP-7.

Read the DoD security assessment on Multics, https://multicians.org/b2.html

Afterwards you can read Dennis' own words,

> Although we entertained occasional thoughts about implementing one of the major languages of the time like Fortran, PL/I, or Algol 68, such a project seemed hopelessly large for our resources: much simpler and smaller tools were called for. All these languages influenced our work, but it was more fun to do things on our own.

Taken from https://www.bell-labs.com/usr/dmr/www/chist.html

So thanks to their fun, the world now suffers from C strings.


> Although we entertained occasional thoughts about implementing one of the major languages of the time like Fortran, PL/I, or Algol 68, such a project seemed hopelessly large for our resources: much simpler and smaller tools were called for […]

Precisely my point. The definition of fun is up to personal interpretation.


Scanning for the string length (e.g. strlen()) is asymptotically worse than reading a fixed-size integer, so obviously don't do that unless it's a good memory/speed tradeoff (i.e. when you know the string is, say, at most 16 bytes long).

Overall, it seems you didn't read my comment either. Or was I _that_ unclear?


Obviously C-style strings can still remain an option where they are needed, but in most cases using the 4 bytes for a length field is a sane default. How many buffer overflow attacks have been enabled by that four byte savings over the years?
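
The classic failure mode being alluded to looks roughly like this (illustrative only):

    #include <string.h>

    void greet(const char *name) {
        char buf[16];
        strcpy(buf, name);  /* trusts the NUL terminator, not the destination size:
                               overflows buf whenever name is 16 characters or longer */
        /* ... */
    }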

> "Sane" strings that discourage the developer from just coming up with a simple memory management scheme that fits the situation at hand.

"Sane" compilers that discourage the developer from considering the machine level instructions. It's turtles all the way down.

There's a reason that Python is so popular, and it's not performance.


This is not Python so that's a strawman.


By that I meant that there is obviously value in abstracting away tasks and details that the programmer would otherwise need to manage. That is why compilers exist. The value-add for abstracting a string is certainly more than the cost of 4 bytes of memory in the typical case.

Put another way: optimizing the management of strings in memory is almost never the best use of time to make progress toward an organization's objectives, and doubly so when that kind of micro-tuning can actually introduce security risks.


This never, ever bit me in the Pascal days. I suspect this was primarily because the stack I was using was either "provided by the Borland Pascal standard library" pieces, or it was my own Pascal or assembler code.

I had a limited number of calls into a library and a need to do a few things that escape me with regard to interacting with -- I think -- a 16550 UART[0] and its driver, but I don't recall them being particularly nasty to deal with. I mean, all things relative -- I was expecting these to be nasty to deal with because they often involved inline assembler, so the problem of "making it behave with the string" wasn't quite as pressing as "what the hell am I actually doing here?" :)

[0] My huge project was a bulletin board system in the 90s.


I hardly consider four bytes to track the length of a string "waste".

I also don't really know why you assume that "sane" means "doesn't let you manage memory effectively".

That said, how many applications really are bottlenecked by string processing in the first place? I don't care if processing Unicode graphemes is slow, as long as it's correct and doesn't mangle users' names.


Well, I for one optimized an authorization module of an enterprise application written in Delphi by getting rid of standard library strings. Speedups were 100x-1000x, accelerating application startup time from minutes to maybe 3 seconds.


He is probably thinking about the context of the 1980s, when a large amount of 3-byte waste (the 1-byte null char is a form of 'waste' itself) might actually have been a problem.


Packed structs with char fields of 8 bytes or so are still common.


I don't know the details, but I doubt that 4 bytes would be used for the length of a string in the 90s. 2 bytes would allow 64k characters, which is surely more than you'd need for the average string.

And surely if you malloc(..) a block of memory in C and store a string in it, the memory allocation system is going to store how large that block is anyway (even if it probably isn't visible to either C or Pascal). I know not all strings will be malloc'd but a lot will be. And we seemed to deal with this overhead fine in the 90s?
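
On glibc you can even peek at that bookkeeping, though it's non-portable and other allocators expose different or no such APIs:

    #include <malloc.h>   /* glibc-specific header for malloc_usable_size() */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void) {
        char *s = malloc(20);
        if (!s) return 1;
        strcpy(s, "hello");
        /* the allocator tracks at least 20 usable bytes, often rounded up */
        printf("usable size: %zu\n", malloc_usable_size(s));
        free(s);
        return 0;
    }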


That you have to weigh 2 vs. 4 byte lengths is exactly my point. There is also a case for 1-byte lengths, and for string implementations other than length + pointer, like for example rope data structure implementations. My point is that there is not one sensible string implementation, and acting like there is means, in many situations, trading short-term convenience for long-term pain in larger projects.

> I know not all strings will be malloc'd but a lot will be. And we seemed to deal with this overhead fine in the 90s?

It's a common but deeply flawed assumption that allocations and lifetimes are so random that every little object should be individually allocated and later deallocated with malloc()/free() or another generic allocator. I don't use malloc() to allocate string buffers - except in rare situations of laziness, knowing it will come back to bite me later. Not only performance concerns but also the practical impossibility of matching every malloc() with a free() forbids that. Systems like RAII help solve the latter issue, but I prefer to take the difficulty of matching everything up as an indication that the general approach is too complicated.

Instead I recommend allocating strings using a fixed-size field (e.g. char buf[16]) on a stack frame, or using a member in a struct, in order not to add any management overhead. Alternatively, for unbounded and dynamically sized strings, it's often a good idea to allocate them using linear allocators. For example, in a GUI renderer, it's a good idea to have a per-frame allocator that collects all allocations in a list of larger chunks, reducing the allocation overhead to a few chunks (KBs to MBs) that were individually requested from the system. With this, everything is freed with a single central function call after the frame has been rendered and the data is no longer needed.
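
A minimal sketch of such a linear (per-frame) allocator -- names and sizes are made up for illustration, and alignment is ignored for brevity:

    #include <stddef.h>

    typedef struct {
        char  *base;   /* one large chunk obtained from the system up front */
        size_t used;
        size_t cap;
    } Arena;

    void *arena_alloc(Arena *a, size_t n) {
        if (a->used + n > a->cap) return NULL;  /* a real one would chain in a new chunk */
        void *p = a->base + a->used;
        a->used += n;
        return p;
    }

    void arena_reset(Arena *a) {
        a->used = 0;   /* "frees" everything for the frame in one central call */
    }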


How is iterating over the string each time you want to do anything meaningful better? Also, there is the short-string optimization, where you can store short strings inside the string object itself; e.g. C++ does just that.
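
Roughly the idea behind the short-string optimization, sketched in C (actual std::string layouts differ between implementations):

    #include <stddef.h>

    typedef struct {
        size_t len;
        union {
            char  small[16];  /* short strings live inline, no heap allocation */
            char *heap;       /* longer strings spill to a heap buffer */
        } u;
    } SsoString;

    char *sso_data(SsoString *s) {
        return (s->len < sizeof s->u.small) ? s->u.small : s->u.heap;
    }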


You didn't read my comment right. My statement is there isn't one true string type. I didn't say you shouldn't use a length field.

Zero-terminated strings still make some sense of course - ease of reading when looking at the byte-level representation, and moderate cost savings in packed structs (4-, 8-, or 16-byte strings). The former is why I zero-terminate by default where possible, even when using a separate length field stored somewhere else (which is almost always).


The thing is, not every developer wants to care about memory management. A lot of us just want to solve user problems, and we don’t mind too much if we spend a couple of extra bytes to do so.


It's not primarily about the bytes. Heap-allocating RAII-style strings can absolutely kill performance. And they are baaaad for modularity. It's all in my OP, why do I even repeat myself?



