C was not created as an abstract machine (utcc.utoronto.ca)
195 points by pabs3 on Feb 2, 2023 | 216 comments



> The determined refusal of the specification to tie this abstract machine in concrete CPUs is the source of a significant amount of frustration in people who would like, for example, for there to be some semantics attached to what happens when you dereference an invalid pointer. They note that actual CPUs running C code all have defined semantics, so why can't C?

Because while all CPUs running C code have defined semantics for any given construct, not all CPUs have the same defined semantics.

Making C adhere to one CPU architecture for the sake of convenience would make implementations for other CPU architectures decidedly inconvenient. Rather than dereferencing an invalid pointer doing "whatever this CPU does", suddenly all C implementations now have to do whatever, for example, a VAX did.

It would mean that C code running on simple CPUs without memory protection would have to look out for invalid pointer derefs (how?) and simulate access violations if one were mandated by the spec. Or, if the spec said that the address space should be treated as a flat memory area all of which is accessible by any pointer, modern implementations would have to simulate that.


The problem is that dereferencing a null pointer does not actually have undefined semantics, it has system defined semantics. The compiler should compile source code in such a way as to produce machine code that does whatever that system does when a null pointer is dereferenced.

It should do this in part because very large volumes of code will be compiled that way due to the inability to detect in advance whether a pointer is null or not, and compiling it differently when it is known that a pointer is null makes for inconsistent behavior and vulnerabilities like this one.

This is the way relatively sane simple-minded compilers have worked for a long time, and no compiler should be allowed to pretend that certain constructs do not have reliable or at least consistent system defined behavior that will ensue if they are used.

Similarly if a compiler can tell that a signed addition will overflow in some cases it should be required to do whatever a signed addition overflow does on that architecture, and for the same reason. Issue the appropriate warnings about non-portable code, but do something reliable and reliably useful.
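
The classic casualty is an after-the-fact wraparound check, roughly like this (a sketch):

    /* Sketch: a wraparound check that an optimizer may delete outright,
       because x + 1 is assumed never to overflow for signed x. */
    int next_index(int x) {
        if (x + 1 < x)      /* may be folded to if (0) and removed            */
            return -1;      /* intended "do what the CPU would do" error path */
        return x + 1;
    }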


> The problem is that dereferencing a null pointer does not actually have undefined semantics, it has system defined semantics.

No it does not. Look in the standard.

Seriously though, I get what you mean, but what is the point of having a system-defined NULL dereference? The semantics of the NULL pointer (in C) is such that you are not meant to dereference it. If you dereference it, you have a bug no matter what the manifestation on a specific system is.

And there are probably actual systems where it would be hard to specify the behaviour. For example MMU-less systems, where you can dereference byte 0, but it might not be statically clear what kind of value is stored there?

Maybe C allows a systems-defined behaviour to be implemented as undefined behaviour? But then the distinction is kinda moot.


I think butlerm was saying that dereferencing a null pointer has system-defined semantics in machine code, and that C should inherit those semantics.

> If you dereference it, you have a bug no matter what the manifestation on a specific system is.

Sure, if you dereference a null pointer you have a bug. But all sufficiently large programs have bugs, and what happens after you trigger the bug is important. The closer the machine code matches your C code, the more likely you are to be able to diagnose the problem from its symptoms.


Yes, that is what I intended to say (in the first sentence), thanks for explaining it better than I did.


> Maybe C allows a systems-defined behaviour to be implemented as undefined behaviour

The C standard distinguishes undefined behavior, unspecified behavior, and implementation-defined behavior. The first is invalid and inconsistent, the second is valid and inconsistent, the third is valid and consistent (in scope of implementation).

It is rather hard to require invalid but consistent, as the platform itself may produce inconsistent behavior in invalid cases. So what some people want is something like "be no more inconsistent than a naive implementation on the given platform would be", but that is both vague and not really useful. And while some of these undefined-based optimizations seem egregious, some are necessary in order to have reasonably performant code (e.g. keeping variable values in registers instead of propagating them from/to memory after each step).
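
A toy sketch of the register point: because p cannot validly point at local (its address is never taken), the compiler is free to just return 1; if a stray store were defined to "write wherever the address says", it couldn't.

    int f(int *p) {
        int local = 1;   /* address never taken, so it can live in a register     */
        *p = 2;          /* if a wild store were defined as "write to whatever    */
                         /* address p happens to hold", it could hit local's      */
                         /* stack slot, and the constant return below would be    */
                         /* an invalid optimization                               */
        return local;
    }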


The point of having a system-defined NULL dereference (or whatever else is UB) is predictability both at compile-time and at run-time - given a particular system and a piece of high-level code, you can infer roughly what the compiled low-level code is; this means that given a bug, since odds are good you know what system the code is running on, you probably have a leg up in understanding what went wrong.

It's nice being able to strap your program into GDB and have it breakpoint at the point of a NULL dereference because it triggered a segfault, instead of needing to magically infer why results are wrong anywhere in the program because the compiler decided to inject nasal demons due to some subtle optimization pass.


The interrupt vector table in real mode on x86 is stored at the zero page. It is valid to dereference a pointer with value zero under DOS. Modern OSes explicitly avoid mapping page zero in order to trap NULL pointer dereferences. That does not mean you cannot access memory mapped at zero, just that the system might have mechanisms to catch errant programs. It is not the responsibility of the C compiler to make the distinction of whether or not accessing memory at zero is errant; rather, the architecture and the system determine how to respond to a process that attempts to do so.


Again, the C standard came later, and in many cases the standard is about consistency across platforms, while there are plenty of useful behaviors of specific platforms that you want to take advantage of. And the modern practice of optimizing compilers emitting nasal demons gets in the way of that.


Imagine you have a function that traverses a non-circular linked list. You implement it with a recursive function. The compiler uses Tail Call Optimization to convert it to a loop. Then your program hands this function a circularly linked list.

Your program now behaves differently after optimization. Before, your program crashed with a stack overflow. Now, your program loops forever.

Should the compiler have not done this? Or is it a bug in your program that you passed an invalid input to this function?
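
Concretely, something like this (a sketch):

    struct node { struct node *next; int value; };

    /* Sketch of the scenario: a tail-recursive traversal the optimizer may
       turn into a loop, so a circular list spins forever instead of
       overflowing the stack. */
    void visit_all(struct node *n, void (*visit)(int)) {
        if (n == NULL)
            return;
        visit(n->value);
        visit_all(n->next, visit);   /* tail call: may become a plain jump */
    }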


Looping forever is the same as if you ran the recursive version on a hypothetical machine with unlimited stack memory, so it didn't really change semantics.


But we aren't running on a hypothetical machine. That's the whole argument with the "define everything" crowd. Those two programs are different on my machine. Is that my fault or the compiler's fault?

People say "I just want my machine to do what my machine does when I dereference a pointer at address 0, why is my compiler making my program do something different?" Why can't I say the same thing in my case?


No, the argument against undefined behavior is that it makes errors silent. I don't care whether dereferencing a null pointer causes a segfault, an exception or something else, but it shouldn't cause the program to run as if nothing is wrong.


In a multithreaded program, one of the implementations I described crashes my program and one makes everything else but the stuck thread run just fine.


> The problem is that dereferencing a null pointer does not actually have undefined semantics, it has system defined semantics. The compiler should compile source code in such a way as to produce machine code that does whatever that system does when a null pointer is dereferenced.

Null pointer dereference is just a special case of invalid pointer dereference. And that does not have consistent results anyway (it may segfault or may just return garbage).


At the system level (not the abstract virtual machine in the current specification) it is reliably known whether a null pointer dereference will return a specified value (such as zeroes), return unspecified values (semi-random data), or result in a system exception (segfault) of some sort.

The compiler must produce code that does that because it cannot determine whether a pointer is null in advance in most cases. So letting it do something different when it knows that a pointer is null (due to some sort of coding mistake) vs. when it doesn't know that the pointer is null is gratuitously non-deterministic behavior.

The safe (if somewhat slower) thing to do is to emit the same code regardless, so that a null pointer dereference has the same effect even when inadvertently inserted into the program.

The same operation should produce the same result in the same program, as much as possible anyway, and if not the same result then a similar one. It is not a reasonable assumption that any real world program will never dereference a null pointer. It is helpful that the consequence of doing so be as stable and predictable as the underlying architecture provides for.
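
That inconsistency is exactly the shape of the well-known kernel bugs; a sketch, with a made-up struct:

    struct dev { int flags; };       /* made-up type, purely for illustration */

    int get_flags(struct dev *d) {
        int flags = d->flags;        /* the dereference happens first              */
        if (d == NULL)               /* so the compiler may delete this check:     */
            return 0;                /* d was already dereferenced, hence "cannot" */
        return flags;                /* be NULL on this path                       */
    }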


Why not just have it trap?


> Rather than dereferencing an invalid pointer doing "whatever this CPU does", suddenly all C implementations now have to do whatever, for example, a VAX did.

Dereferencing an invalid pointer doesn't do "whatever this CPU does," it does completely unpredictable things. I think that's part of the problem (to the extent it's a problem).


But that's the thing: dereferencing an invalid pointer is undefined behaviour, which means the compiler is allowed to assume it never happens; a C program executing undefined behaviour is _invalid C_. Thus, any time you dereference a pointer you also implicitly promise the compiler that this pointer will _never_ be an invalid pointer. Same with signed arithmetic: you are telling the compiler that your arithmetic is guaranteed to never overflow.

Whether this is a good or bad thing is of course a legitimate (and good!) question, but for writing C today that's how the language is specced and a reality the programmer needs to take care to avoid, just like a bunch of other things that C leaves to the programmer like remembering to clean up allocated resources when they're no longer needed.


> But that's the thing: dereferencing an invalid pointer is undefined behaviour, which means the compiler is allowed to assume it never happens

It would be more helpful for the compiler to assume that it does not know what will happen. This is how C actually worked for many years.

> a C program executing undefined behaviour is _invalid C_.

That was not formerly the case, and it is not always helpful to redefine C in this way. Sometimes you really are not trying to write portable code, and you really do want the behavior you know that the target machine will give you, even if the C spec doesn't require it.


> It would be more helpful for the compiler to assume that it does not know what will happen. This is how C actually worked for many years.

If we don't know what will happen that is Undefined Behaviour.

The contradiction you have within yourself is that you know what you want to happen, but that's not what the specification says. If you want specific behaviour you need to specify what it is - not mumble and make a vague wave of the hand about "behavior you know that the target machine will give you" when you've no promise of any such thing. That would come at a cost, and of course you don't want to pay that cost, but that means you can't have what it buys.


That is certainly one perspective that one can have. The point here is that the language and its usage precede the specification, and a pedantic, narrow-minded adherence to a certain interpretation of a document which was actually a post-hoc rationalization of existing practice has made the language less useful for certain applications.


The C standard could easily make dereferencing a null pointer implementation-defined behavior.

And even more critical: signed integer overflow should be implementation-defined, and each implementation should do something sane (different from assuming it doesn't happen). This would have saved us many security vulnerabilities and unnecessary program crashes.


If your program is crashing because of an overflow you’re lucky because it’s saving you from a security vulnerability.


> If we don't know what will happen that is Undefined Behaviour.

Implementation-defined behaviour is a thing. Not knowing what will happen is not an accurate description of undefined behaviour. What the compiler does is assume that undefined behaviour doesn’t happen. When it does happen, it results in a contradiction, and logically every sentence is a consequence of a contradiction (see e.g. “Bertrand Russell is the pope”). That produces all those infamous bugs. Because just like every sentence is a consequence of a contradiction, every program state can be a result of UB. This is untenable.


> What the compiler does is assume that undefined behaviour doesn’t happen

That is an incorrect assumption, as it clearly does.

It is also incorrect given the standard text.


> That is an incorrect assumption, as it clearly does.

That’s my entire point. The compiler is free to make incorrect assumptions.

> It is also incorrect given the standard text.

According to C11 standard, section 3.4.3, the standard imposes no requirements on undefined behaviour.


A compiler that makes incorrect assumptions is a bad compiler.

In fact, in the rationale of the original spec., I remember reading that the C standard was expressly designed to be a minimal spec, and that just being compliant with the spec was insufficient for the resulting compiler to be fit for purpose.

And of course the original spec did specify a range of acceptable behaviors, and that language is, in fact, still in the standard. It was just made non-binding. However, it is still there, and pretending it is not seems disingenuous at best.


> A compiler that makes incorrect assumptions is a bad compiler.

I agree, but that includes GCC and Clang. ¯\_(ツ)_/¯


Yep. The problem when you rely on free software is that you are not a customer.


That's OK, you can compile your program with -O0 if that's the behavior you want from your compiler.


Unfortunately, `-O0` doesn't actually disable all optimizations. It probably disables any that would affect this though.


There are many optimizations that a compiler can perform without relying on the optimization level to determine how to pervert your program that day. If different optimization levels produce different results, that is a bad thing, something to be avoided, not encouraged.

If it is really necessary to generate random code when some anomalous situation is encountered, that should be a special option to enable dangerous non-deterministic if-you-made-a-mistake-we-will-delete-parts-of-your-program type behavior. I wouldn't consider that optimization though, more like disabling all your compiler's safety features.


Which optimizations?


Loop unrolling for loops that have a static or range bounded number of iterations is a good example. Others include constant expression evaluation, dead code elimination, common subexpression elimination, and static function inlining.


If you fold float expressions at compile time you will get different results than at runtime if the program has changed the FPU control word.

People complain about dead code elimination all the time when we have these discussions.

Inlining breaks code that tries to read the return address off the stack frame or that makes assumptions about stack layout.

Loop unrolling might change the order of stores and load, which is visible behaviour if any of those traps.

I assure you that for each optimization, no matter how trivial, it will break someone's code.
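
The usual dead-store complaint looks roughly like this (a sketch; the two helpers are made up):

    #include <string.h>

    void read_password(char *buf, size_t n);     /* made-up helper */
    int  check_password(const char *buf);        /* made-up helper */

    int handle_login(void) {
        char password[64];
        read_password(password, sizeof password);
        int ok = check_password(password);
        memset(password, 0, sizeof password);    /* dead store: the buffer is never
                                                    read again, so the compiler may
                                                    drop the wipe of the secret */
        return ok;
    }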


> It would be more helpful for the compiler to assume that it does not know what will happen. This is how C actually worked for many years.

Can't you get that behaviour with -O0 or similar?


Looking at the GCC docs, it seems like it isn't possible to have zero optimizations at any point, even at the lowest optimization levels. To quote the docs "Most optimizations are completely disabled at -O0", so it seems you can't assume you can force correct behavior just by turning off optimization passes.

https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html


Part of the difficulty here is working out which transformations are specifically "optimisations". Some compiler passes are required for correct (or indeed any) code generation -- for example, in the compiler I was employed to work with, the instruction selector was a key pass for generating optimal code, but we only had one: if you "turned off optimisations" then we'd run the same instruction selector, merely on less-optimal input. So we'd disable all the passes that weren't required for correctness or completeness, but we'd not write deliberately non-optimal equivalents for the passes that were required.

Beyond that, you've got a contradiction in your statement -- you can't "force correct behaviour" from a compiler at any point. The compiler always tries to generate correct behaviour according to what the code actually says. If you lie to the compiler, it'll try its best to believe you.

C compilers are intended to accept every correct C program. But they can only do this by also accepting a wide range of incorrect C programs -- if we can prove that the program can't be correct then we can reject it, otherwise we have to trust the programmer. Contrast this with Rust, where the intent is to reject every incorrect Rust program. Again, not every program can be clearly judged correct or incorrect, but in this case we'll err on the side of not trusting the programmer. Of course, "unsafe" in Rust and various warnings that can be enabled in C mean you can tell the Rust compiler to trust the programmer and tell the C compiler to disallow a subset of possibly-correct but unprovable programs, but the general intent still stands.

So if you want to write in a language that's like C but with "correct behaviour" then ultimately you'll have to procure yourself a compiler to do that. Because the authors of the various C compilers try very hard to have correct behaviour, and just because you want to be able to get away with lying to their compilers doesn't magically make them wrong.


Always missing in this argument is the logic of how to go from "undefined" to "can never happen". If the spec did not want it to happen, they would have said "cannot happen" or "illegal". But no, it is undefined by the spec. The spec knows that it can and will happen; they just did not want to pin down the behavior of the compiler. So the compiler optimization team saying "we can assume this will never happen" is a blind, almost maliciously compliant viewpoint.


From the C standard §3.4.3:

    undefined behavior
    
    behavior, upon use of a nonportable or erroneous program construct or of erroneous data,
    for which this International Standard imposes no requirements
    NOTE Possible undefined behavior ranges from ignoring the situation completely with unpredictable
    results, to behaving during translation or program execution in a documented manner characteristic of the
    environment (with or without the issuance of a diagnostic message), to terminating a translation or
    execution (with the issuance of a diagnostic message).
    EXAMPLE An example of undefined behavior is the behavior on integer overflow
https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf

The important wording here is "this International Standard imposes no requirements"

i.e. an implementation is allowed to do literally anything in the case of undefined behaviour. It's not quite that the compiler writers are saying "this can never happen", it's more along the lines of "if this does happen, we can do anything at all, including acting as if the conditions were such that UB could not have happened."

So if you multiply two signed ints that the compiler knows are positive, the compiler can assume that the result can't overflow. Because, if it does overflow, the compiler can emit code that does absolutely anything in that case - including acting as if it didn't overflow. Therefore, it can elide checks for a negative result, because either there was no overflow in which case the check is redundant, or there was but the code is allowed to do anything at all - including not performing the check for a negative result.
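
Spelled out as code (a sketch):

    int area(int x, int y) {
        if (x < 0 || y < 0)
            return -1;
        int z = x * y;
        if (z < 0)       /* reachable only via signed overflow, which is UB,   */
            return -2;   /* so the compiler may treat this branch as dead code */
        return z;
    }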


I agree with this logic. But note there is one caveat: observable behavior that has already happened by the time the condition for UB is encountered should not then be affected. The C++ committee later clarified that UB can also affect prior observable behavior. It is unclear whether this applies to C, as the C committee never added this clarification, but compiler writers often apply this interpretation to C as well. In my opinion the C++ committee made a mistake here, as this makes UB more dangerous.


When do you consider that "the condition for UB" would be encountered, for any particular behaviour that lacks a definition? The optimiser is -- in general -- allowed to re-order operations and restructure code if it doesn't change the behaviour of the program. In doing this, it needs to trust the programmer that the program is valid. Otherwise it can't even re-order two signed additions, lest the first one overflow.

You might want a language that restricts this, but C is not that language. Or you might want a language that defines all its behaviours, but C is not that language either. Compiler writers put a lot of effort into making their compilers do exactly what the programmer tells them to do.

My personal take is that the correct response to the difficulties of ensuring your C program doesn't exhibit undefined behaviour is probably to avoid writing new code in C. But if you do still need to write C for whatever reason (which I do, occasionally) then it's only sensible to take as much care as the language design expects programmers to take: the compiler trusts the programmer to only attempt operations with defined results.


The condition is always stated in the C standard. "if ... the behavior is undefined". An optimizer is not in general allowed to re-order operations. It is allowed to do this only if it can prove that there is no change in observable behavior.


Indeed, but "no change in observable behaviour" -- along with every other suggestion of correctness from a C compiler -- is only guaranteed in the presence of a well-defined program.

Honestly, I think we'd all be better served by pushing the concept of "undefined behaviour" a bit further into the background. C has defined behaviours, and the standard helpfully makes explicit which behaviours fall outside the definitions. If you want a defined output then your program had better have a defined behaviour when presented with your input.

I'm not suggesting this is ideal -- far from it, I avoid writing new C code. But it's what C does. If you want to avoid needing to make sure that your program only attempts defined operations, switch to a language that doesn't impose that requirement.


This assumes that my computer isn't allowed to be a time machine. I don't see that in the spec anywhere.


Every technical text needs to be read using some common sense. Once you give this away, you can justify everything.


Sure, but the common sense I (and I think I can safely say the compiler writers) are applying is "when the spec says 'the program might do anything', then there is no meaningful difference to the user whether or not we guarantee that everything up to that point was executed correctly". Who cares whether we transferred money from account A to account B when the program is then going to transfer 5 times as much from account B to account A and gift our competitor half of our money while it's at it.

I'm not sure if I agree with your interpretation of the spec, but even if that's the technically correct interpretation, arguing that things went wrong because the compiler miscompiled the program and that it didn't do the things it was supposed to before it was allowed to do literally anything... just isn't an interesting argument. Things went wrong because your program was wrong.


The spec says there are no restrictions on the behavior. But then going on to say that "behavior" includes impossible things like time travel or magic, instead of something any actual machine could possibly do, seems far-fetched to me.


Regarding the second point: sure, the program went wrong because it was wrong. But the damage it can do when something goes wrong is much higher if that can affect previous behavior. Being able to prove partial correctness of a program is a useful feature (e.g. when a transaction completed correctly, you can be sure that an error in the logging function afterwards does not undo it).


> Always missing in this argument is the logic of how to go from "undefined" to "can never happen".

The reasoning is something like "if we assume UB doesn't happen, but it does happen, the resulting behavior is unpredictable. This is allowed by the standard, though, because UB allows for any behavior, including that produced by assuming UB doesn't happen."

In other words, major implementations treat UB as preconditions. Violating those preconditions gets you Interesting Results (TM), but that's allowed by the standard because "unpredictable results" really means unpredictable results.

For example, null pointer dereference is UB. If an implementation assumes null pointers can never be dereferenced, it can better optimize some code. If it turns out a null pointer is dereferenced, the argument is that whatever happens then is still permitted by the standard as the standard does not define any program semantics for programs containing UB.


I agree that this does not follow from the wording in the standard and I am relatively sure that this was originally not implied. But this viewpoint is repeated quite often nowadays. I think this is because prominent compiler developers promoted this point of view and used it for blaming the user ("Because you have UB in your program it is completely invalid. It is now ok that the compiler breaks it, and it is your fault alone."). The other response to your post is correct, but note that that explanation would not allow UB to affect prior observable behavior.


Ex falso quodlibet


...is taken as true but is a really lousy principle for modelling informal reasoning.


Compilers and compiler writers don't rely on informal reasoning when deciding whether an optimization is valid


This can be modelled formally: just drop the ex falso quodlibet axiom and its equivalents.


> dereferencing an invalid pointer is undefined behaviour, which means the compiler is allowed to assume it never happens;

Sure, but it didn't have to be like this. They could have said it is unspecified without allowing the compiler to assume it doesn't happen.

Would C have been better if the spec was different?


That's not what's happening. Because it's unspecified, transforms that are safe in the absence of that behavior are safe to apply, since they preserve the semantics.

Even something like register allocation requires knowledge of what pointers point to.


> is undefined behaviour, which means the compiler is allowed to assume it never happens

No it's not. Or let me rephrase that. The standard says the following:

Permissible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message), to terminating a translation or execution (with the issuance of a diagnostic message).

Which of these is "the compiler is allowed to assume it does not happen"?


> Which of these is "the compiler is allowed to assume it does not happen"?

I agree with where you're coming from, but "ignoring the situation completely with unpredictable results" sounds like it pretty much fits the bill. Pretending something doesn't happen sounds a lot like ignoring it completely to me.

What the standard obviously does exclude is nonsense like intentionally reformatting your drives.


> Pretending something doesn't happen sounds a lot like ignoring it completely to me.

Quite the opposite. Assuming it doesn't happen (not "pretending"), is very much not ignoring the situation, at least if you then act on that assumption that it does not happen.

Ignoring it just lets it happen when it does, so if the program specifies an out-of-bounds access, the compiler generates code for an out-of-bounds access, ignoring the fact that it is an out of bounds access.


> Assuming it doesn't happen (not "pretending"), is very much not ignoring the situation, at least if you then act on that assumption that it does not happen.

I'm not sure how assuming UB doesn't happen is distinguishable from a choosing to ignore the situation every time one comes up. You get the same result either way.

For example, a compiler can assume that null pointers are never dereferenced, or every time a null pointer is/may be dereferenced it can just "ignore the situation" with the dereference. I'm not seeing a functional difference here.

This is, of course, subject to the minor problem that "situation" is arguably underspecified. Compiler writers appear to interpret it as something akin to "code path" (so "ignoring the situation" means "ignoring code paths invoking UB"), while UB-goes-too-far proponents appear to interpret it more broadly, more like "the fact that UB can/will happen" (so "ignoring the situation" means "ignore the fact UB will/may happen").

> Ignoring it just lets it happen when it does, so if the program specifies an out-of-bounds access, the compiler generates code for an out-of-bounds access, ignoring the fact that it is an out of bounds access.

Why wouldn't this fall under "behaving during translation or program execution in a documented manner characteristic of the environment" instead?


I think it's quite clear how "assuming UB doesn't happen is distinguishable from a choosing to ignore the situation."

    strcpy(P, filename);
    free(P);
    if (P[0] == '.') {
        // hidden file
        // do something
    }
Obviously you shouldn't use the above code. However, for illustrative purposes, ignoring the situation [of undefined behavior] probably still results in doing something for hidden files. What people are complaining about is compilers finding that there's UB by static analysis and optimizing out the conditional entirely because they assume dereferencing the pointer to freed memory "doesn't happen."

There are good arguments for both sides, in my opinion. But let's not pretend they're the same thing. Deleting logic because it provably would result in UB is not the same as ignoring the UB.


I'm still not seeing the distinction between "ignoring the situation" and "assuming no UB". Under compiler writers' interpretation, I think there would be two situations in your snippet:

1. This code path is executed. UB will be invoked.

2. This code path is not executed. No UB occurs.

If the compiler "ignores the situation" with UB, code path 1 is ignored (i.e., dropped from consideration). This probably results in the removal of the snippet as dead code.

If the compiler assumes no UB occurs, code path 1 is eliminated as cannot-happen, and the compiler probably deletes the snippet as dead code.

Same result either way (with the obvious caveat that this is one possible interpretation of "ignoring the situation").

> ignoring the situation [of undefined behavior] probably still results in doing something for hidden files.

The problem is that this assumes a very specific definition of "ignoring the situation" which, while understandable, isn't the only interpretation permitted by the Standard.

In addition, there's the fact that such an interpretation would arguably fall under "behaving during translation or program execution in a documented manner characteristic of the environment" instead.

> Deleting logic because it provably would result in UB is not the same as ignoring the UB.

True, but "ignoring the UB" isn't what the Standard says. It says "ignoring the situation", and that's the problem - people can't agree on what "ignoring the situation" is supposed to mean. Compiler writers appear to take it to mean "ignore UB-invoking code paths", UB-goes-too-far proponents take it to mean "ignore the presence of UB".


> If the compiler "ignores the situation" with UB, code path 1 is ignored (i.e., dropped from consideration). This probably results in the removal of the snippet as dead code.

Ignoring something != killing something. There is no dialect of English in which ignoring something is compatible with eliminating it from existence.

EDIT: Note: I am not personally saying compiler writers are wrong. The body text says it imposes no requirements, so at the very least it's a reasonable interpretation to say the footnote text isn't binding and/or that "possible" behaviors are examples, not a full enumeration. But! On the narrow question of "ignoring" the behavior altogether... Hunting for UB via static analysis and then changing your output based on whether you find it simply is not what the word "ignoring" means.


The compiler is not removing or killing invalid code, you can still find it in the source file.

What it is doing, in the extreme case, is ignoring it and not generating asm statements for it. Then again, how could it? Code that will trigger UB has no meaning, so the compiler wouldn't know what code to generate.

Of course a compiler could assign meaning to some instances of UB.

For example, I'm pretty sure that in GCC dereferencing a null pointer is not treated as UB, but is expected to trap (because POSIX), with execution not continuing except via abnormal edges (exceptions or longjmp). This means that any code that can be proven to be reachable only through a null pointer dereference is effectively dead code, so in practice it can still introduce bugs if it didn't trap.


At least as far as one is unable to distinguish between the two, sure it is. A compiler that emits code as if UB-containing code paths are not there is essentially performing the dictionary definition of ignoring something, but it's functionally indistinguishable from a compiler that deletes UB-containing code paths.


No.

What you are describing is ignoring the code that has the undefined behaviour due to it having undefined behaviour.

That is not ignoring the undefined behaviour, it is the opposite.


Maybe it's not "ignoring the undefined behaviour", but why isn't it ignoring the situation? The situation is that this code (path) invokes UB. Ignoring "the situation" seems to allow simply not considering that path. Maybe that results in that path not being emitted.

Again, it all comes down to interpreting "the situation". Compiler writers construe it broadly; "Ignore the presence of UB" (i.e., construing "the situation" narrowly) is another possible interpretation, but I don't think it's the one and only definitive one.

In addition, why isn't "ignore the presence of UB" covered by "behaving during translation or program execution in a documented manner characteristic of the environment"? See a null pointer dereference? Just do the "characteristic thing" during translation and emit the dereference. Maybe implementations will need to add documentation somewhere, but that's not exactly the challenging part.


Because it is not ignoring the undefined behaviour. Simple as that.

What you are confusing is "being agnostic about something happening or not happening" and "assuming it cannot happen".

And sorry, the "situation" is pretty precisely scoped by "Permissible undefined behavior...". So what can be ignored is this instance of UB, not the fact that UB exists.

Otherwise, if you're going to arbitrarily expand the scope of what the situation is, then how about "the fact that a C spec exist?". That's a situation, after all, and it is the situation you are in.

Or maybe just ignore parts of the spec, like the ones that define what is UB and what is not UB.

Then everything becomes a trigger, and if I can expand scope like that, then I have a standards-compliant C-compiler for you:

    int main() { int a = *(int *)-1; }
(I am pretty sure you can make a smaller one)

I doubt anyone would accept this broadening of the scope.

Once again, ignoring something is not generally the same as assuming it doesn't exist. They could be the same if you assume it doesn't exist and then do nothing differently. However, if you use your assumption that it doesn't exist and act differently based on that assumption than if it did exist, then you are not ignoring the situation.

And the latter is clearly what is happening with today's optimising C compilers. They act very differently in the presence of UB than they would if the UB were not there, for example not translating code that they would have translated had the UB not been there, had it not been UB, or had they actually ignored the UB as they should have.

> just do the "characteristic thing" during translation and emit the [null pointer] dereference

These things overlap slightly, but I doubt that "just emitting the dereference" qualifies as a documented exception to normal processing of a pointer dereference due to UB. It is exactly the same thing it does when the pointer dereference is UB, so it is just ignoring the UB.

Another misinterpretation that seems to be common is to interpret "the environment" in "characteristic of the environment" to include the (optimising) compiler itself.


> Because it is not ignoring the undefined behaviour.

But it is ignoring the situation? At least, given the broader interpretation of "the situation". I understand there's a narrow interpretation as well.

> And sorry, the "situation" is pretty precisely scoped by "Permissible undefined behavior...".

Maybe? I think I understand the argument. Will need to think on it some more...

> So what can be ignored is this instance of UB, not the fact that UB exists.

I think I haven't been clear enough on this - I had been using "ignore the fact that UB exists" to essentially mean "ignore this instance of UB" - i.e., carry on as if there was no UB. I had been using "ignore code paths with UB" for the broader modern-compiler-style interpretation.

> Otherwise, if you're going to arbitrarily expand the scope of what the situation is, then how about "the fact that a C spec exist?". That's a situation, after all, and it is the situation you are in.

Sure, it's a situation, but I don't think anyone is exactly advocating for an arbitrary expansion of the scope of a situation. "The fact that a C spec exists" is a situation, but it doesn't even pretend to have anything to do with the Standard's permissible UB.

> Once again, ignoring something is not generally the same as assuming it doesn't exist. They could be the same if you assume it doesn't exist and then do nothing differently. However, if you use your assumption that it doesn't exist and act differently based on that assumption than if it did exist, then you are not ignoring the situation.

I'd agree that proceeding without considering UB at all would count as ignoring something.

However, I'd argue that that's not the only way to read "ignore" - dropping something from consideration, to me, certainly sounds like ignoring something. You had to choose to do so, but that doesn't make it not ignoring it. That also depends on framing, though - back to how broadly "the situation" should be read.

> I doubt that "just emitting the dereference" qualifies as a documented exception to normal processing of a pointer dereference due to UB. It is exactly the same thing it does when the pointer dereference is UB, so it is just ignoring the UB.

Sorry, I don't quite understand what you're trying to say with the first sentence - where did the concept of an exception to normal processing come from? The idea was that emitting a dereference is the characteristic translation behavior, so that phrase in the Standard would cover "ignoring the UB" and doing what may otherwise be expected.

> Another misinterpretation that seems to be common is to interpret "the environment" in "characteristic of the environment" to include the (optimising) compiler itself.

I had interpreted "the environment" as including semantics; i.e., the translation environment includes these rules for translation/program semantics, so a characteristic behavior could be "normal" semantics. This interpretation doesn't need to include the compiler since the characteristic behavior is derived from the environment, not the compiler.

Looking more closely at the Standard, though, I'm not too confident in this interpretation. Perhaps "behaving during [] program execution in a documented manner characteristic of the environment" could work, though it's admittedly not what I originally had in mind, and I'm still not sure it works.

----

I do have to admit, though, that after the discussions I've had with you I'm less confident about my understanding of this. It'd be nice to talk to an actual major compiler dev or some C89 committee members about this. Feel like I had run across such a thing at some point, but I don't remember where or when.


To respond to your edit:

> Hunting for UB via static analysis and then changing your output based on whether you find it simply is not what the word "ignoring" means.

Why not? A static analysis pass can flag a code path as containing UB, and future passes can then ignore that path. Sure sounds like "ignoring" to me.


Ignoring Y because it has X is the opposite of ignoring X.

It is fundamentally impossible to both ignore X and make decisions based on X at the same time.

Something cannot be both ignored and a key criterion.

Your argument is akin to saying that a company hiring process that "ignores race" means eliminating applicants based on their race. It is untrue. It is the opposite of true. And I think you know that, so please stop trolling.


> It is fundamentally impossible to both ignore X and make decisions based on X at the same time.

Of course you can - the former can be the action you take as a result of the decision. Choosing to not consider something is distinct from refusing to make a decision based on that something, but both are "ignoring" that something.

Again, this boils down to how "ignoring the situation" is interpreted. I can ignore situations with UB and proceed as if those situations aren't present, or I can ignore situations with UB and proceed as if the UB weren't present. The Standard's wording does not rule out one or the other.

In addition, why would the "ignore the existence of UB" not fall under "behaving during translation or program execution in a documented manner characteristic of the environment"? That seems to match "pretend the UB were not present" much more closely.

> Your argument is akin to saying that a company hiring process that "ignores race" means eliminating applicants based on their race.

No, it means that the hiring process makes decisions without considering what race-based effects that may have. If that happens to result in weird race-based outcomes, then that's what happens.


The paragraph you quote is a Note, which is non-normative.

The normative text that specifies the semantics of undefined behavior is:

behavior [...] for which this International Standard imposes no requirements.


1. It still exists. Pretending it doesn’t exist is disingenuous at best.

2. It used to be normative.


> [whatever] is undefined behaviour, which means the compiler is allowed to assume it never happens;

That interpretation is the root of the problem. Compilers authors use it to implement outright user hostile behavior in the name of elusive performance.

Why would you spend resources looking for zero days when you can have a few LLVM contributors plant them in every program "as an optimization"?


So why not use a different compiler?


Would you write it?

You can use the same compiler with a language whose design committee isn't deliberately user hostile, like Rust (where UB-like behaviors in safe code are considered soundness bugs).

https://runrust.miraheze.org/wiki/Undefined_Behavior


Rust makes guarantees for safe code that undefined behavior would violate. C(++) has no such mode and as such a comparison cannot be made.


> you are telling the compiler that your arithmetic is guaranteed to never overflow.

Any time you are dealing with data from real, physical sensors or third-party APIs, this is an impossible guarantee to give - they could literally break.


Of course it is possible. One must validate the input before performing arithmetic on signed integers.


The compiler is allowed to assume that the data will never overflow, so it is also allowed to get rid of overflow checks. Now imagine writing a complex validation routine where every value it rejects would, if allowed through, result in undefined behavior: a sufficiently smart compiler is allowed to simply remove your validation code, leaving you with no validation whatsoever.


It can get rid of checks that test whether the results of a previous operation have overflowed. It can't eliminate checks that test whether a subsequent operation will overflow and abort because that would be changing semantics.

C is an absolute minefield of undefined behavior, but let's be accurate about the things it does wrong.
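
A sketch of the distinction: checking before the multiply, so no path ever executes an overflowing operation and there is nothing the compiler can elide.

    #include <limits.h>

    int safe_area(int x, int y) {
        if (x < 0 || y < 0)
            return -1;
        if (y != 0 && x > INT_MAX / y)   /* the multiply below would overflow */
            return -2;
        return x * y;
    }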


There are several ways to do overflow checks safely (i.e. without undefined behavior), though the ergonomics are not always ideal.

C23 somewhat improves the situation with <stdckdint.h>, a standardized version of GCC’s __builtin_add_overflow and friends. That has ergonomics issues too due to its verbosity, but at least it’s hard to screw up.
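
Roughly (a sketch, assuming a C23 toolchain):

    #include <stdckdint.h>   /* C23 */

    int area_checked(int x, int y) {
        int z;
        if (ckd_mul(&z, x, y))   /* true if the mathematical result didn't fit in z */
            return -1;
        return z;
    }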


The compiler is allowed to assume that if x and y are signed ints:

   if (x < 0 || y < 0) return -1;
   return x * y;
Will not overflow. And if you try to check for overflow with:

   if (x < 0 || y < 0) return -1;
   if (x * y < 0) return -2;
   return x*y;
Then yes, the compiler is within spec to remove your check because the only situation in which you could hit that check would be after signed integer overflow, which it is allowed to assume won't happen.

One way to implement this check in GCC where the compiler will respect it would be:

   if (x < 0 || y < 0) return -1;
   int z;
   if (__builtin_smul_overflow(x, y, &z)) return -2;
   return z;


That doesn’t sound like sound logic to me. The compiler assumes that, at the point where you have the arithmetic, it won’t overflow. This hinges on all the prior state of the program. If it can prove that the given integers won’t overflow (e.g. due to a previous, redundant check) then it can indeed remove a conditional, but the compiler can’t change the observable behavior of the program.


It can't (correctly) remove checks that would have prevented the undefined behavior, because that is changing the semantics of a program that (as written) does not trigger undefined behavior.


CPUs do different things when they know a pointer is invalid. Often what happens is that the CPU sends some sort of trap/signal to the operating system; on Unix this would raise a segfault and dump core. Not all invalid pointers can be detected, and not all CPUs have the required hardware (an MMU) to detect invalid pointers at all.


Exactly. Even though you could specify what happens mechanically on any given system, the results are kind of uncontrollable on most real computers.

Whether things should be UB or implementation-defined requires a good value judgement. Per se there is no real need for UB (assuming that the behaviour of a machine can be specified), but on the other hand there isn't much value in specifying a situation where all control is lost to the point where implementation details leak (runtime structures are overwritten, etc.).

Since there is (I assume?) an expectation for "implementation-defined" to have an actual definition, pedantically an implementation leak would require all implementation details to be codified, and thus set in stone.


It is more that while we can define everything, there are costs. I can track every allocation in a hidden table, then before following a pointer verify that the pointer being followed is in the table. However, that is extremely expensive (don't forget that you might follow a pointer in one thread while a different thread deletes it, so there is complex race-condition logic needed here that is tricky to get right).

By making things undefined we can avoid a large amount of complex code to detect a situation that shouldn't happen.
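
Roughly what that bookkeeping would look like (a sketch; the table helpers are made up, and the threading problem is ignored entirely):

    #include <stdlib.h>

    void table_insert(void *p, size_t n);          /* made-up registry */
    int  table_contains(const void *p, size_t n);  /* made-up lookup   */

    void *checked_malloc(size_t n) {
        void *p = malloc(n);
        if (p) table_insert(p, n);
        return p;
    }

    int checked_read(const int *p) {
        if (!table_contains(p, sizeof *p))   /* the per-dereference cost */
            abort();
        return *p;
    }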


Exactly. While it is definitely harder with a low-level language, it is absolutely doable in terms of costs with managed languages. In Java, for example, even data races are well-defined to a degree (and OCaml’s new multithreaded mode defines data races with even stronger guarantees).


This could be indicated in a comment, perhaps using a *RFI-like notation.


C did do ‘whatever this CPU does’, and then the ANSI C committee broke that (probably by accident). That's what the headline post is about: “Those origins of C were there first”.


The ANSI C committee certainly did not break this. UB always meant that ANSI C simply did not define what happens in this specific case. This could mean "whatever the CPU does" or "whatever the compiler decides". What caused problems was that compilers started to aggressively exploit UB for optimization (and customers probably asked for such optimizations), treating programs with UB as invalid. When this went too far and caused problems, compiler vendors then blamed the user and the standard. But obviously it is completely at the compiler developer's discretion what the compiler does, so customers should just push back and request changes.


In the vast majority of cases I'm glad that my programs run fast. Sure, I have to do some work to avoid undefined behavior, but most code is correctly defined, and the exceptions are bad code that I need to fix anyway: even if the behavior were 100% defined by C, it would still be buggy, as in not doing what my users want.


It’s up to the platform in the end.

I recall IBM’s AIX used to put a zeroed-out block at address zero, so this code was guaranteed to always return 0:

  *(int *)NULL
The reason was that it allowed their optimizing compiler to speculatively dereference a pointer before executing a surrounding if condition.
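
The kind of transformation a readable zero page enables looks roughly like this (a sketch):

    int value_or_zero(const int *p) {
        int v = 0;
        if (p != NULL)
            v = *p;    /* the compiler may load *p unconditionally and then
                          select on (p != NULL), since reading address 0 is
                          harmless on that platform and just yields 0 */
        return v;
    }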


> It’s up to the platform in the end.

It’s not up to the platform. If it looks like a code path might dereference a null pointer, the compiler can and will make wild and bizarre optimizations in the surrounding code.


If code dereferences a NULL pointer, the spec imposes no constraints whatsoever on what happens after that point.

Yes, some compilers will make "wild and bizarre" optimisations (they are neither) based on the assumption that any pointer which is dereferenced is not NULL - because if it were, the code could have done the exact same thing anyway, because there are no constraints in that case.

But compilers don't have to do that. There are no constraints. A compiler can emit code to do anything at all in the event of a NULL pointer being dereferenced, including emitting the machine code you'd naively expect - which may trap on some CPUs, or always return 0 on reads and swallow writes on other CPUs. It may also emit code that wipes your hard disk (and still counts as a conforming implementation). Or make demons fly out of your nose.

A compiler vendor may choose to guarantee a specific behaviour for some invalid constructs. They are allowed to do this because there are no constraints on what they can do. It's just that most don't, because 99% of people comparing compilers care more about benchmarks than they do about what happens with invalid code.

So in some ways, it is up to the compiler/platform/CPU what actually happens with code that has invalid behaviour. But as a C author, you can't assume any particular behaviour, because it might change from one CPU to another. Or from one OS to another. Or from one compiler to another. Or from one version of one compiler to another.


This. In the case of AIX, IBM designed every component of the platform: the CPU, the operating system and the compiler. When they tell you that it’s safe to read up to 4K (or whatever) of zero bytes from NULL, they can make that promise. But obviously such code is not portable.


But this means the vast majority of C code isn't portable, since you can't know what it will do with a given platform and compiler combination, and the presence of even a single line of UB invalidates the whole program.

It basically means that the whole portability argument that is supposed to be in favor of C is just wrong because every compiler and platform actually brings its own C dialect and it is just sheer coincidence that it works at all.


Are you seriously suggesting that the vast majority of C code causes signed int overflows or dereferences invalid pointers?

Not sure about the C code bases you've been spending time in, but in the ones I've looked at, the overwhelming majority of the code has been well-defined by either the C standard or the implementation. In the cases where it hasn't, the vast majority of those cases were genuine bugs that needed fixing, and the fix made the code well-defined.

The number of codebases I've seen that actually relied on unspecified behaviour, or on whatever the current compiler/OS/CPU happened to do with undefined behaviour, is miniscule. (Or an entry in an obfuscated/underhanded coding contest.)


Yes. The nominal reality is that effectively all extant C code has undefined behavior, but the practical reality is that most of that code will work portably as long as some reasonable precautions are taken.


The compiler doesn't have to come to your home and actively haunt you. The behaviour is indeed up to the platform (by "platform" I understand "compiler + host system" or whatever is used to run the program).


Ah - I suppose I don’t think of the compiler as part of the platform. Most platforms I work with tend to use gcc or clang, and as a result everyone is subject to the gcc / clang optimizations.


So it is up to the platform, since the compiler is part of the platform.


I disagree that it is completely unpredictable. If it is a read operation, dereferencing an invalid pointer on every normal system will either return data with an unspecified value or will raise a low level exception that may be caught.

Since a compiler cannot predict in all cases in advance whether a pointer is invalid it should be required to compile a pointer dereference the same way in all cases, i.e. emit the same code regardless of whether it has determined the pointer is invalid or not. Otherwise it effectively emits random code depending on how the optimizer is feeling.

Non-deterministic behavior for the same operation on the same values in the same program is not very helpful behavior, whenever and to the degree it can be avoided.


It would solve a lot of problems if the leniency introduced into the ANSI spec to permit C implementations on the Burroughs large-systems line were only used in cases like that, instead of compiler vendors abusing it to eke out 0.2% performance gains on amd64 hardware at the cost of introducing priority-1 security bugs into operating system kernels.


Right but there definitely are things that were different in the past but are never really going to change in the future (because they are de facto standards).

A byte is now always 8 bits. Integers are always two's complement with wrapping arithmetic. Nobody is going to make a CPU that doesn't use those (at least nobody that wants people to actually use it) because no software would work on it.

For example RISC-V set the cache line size to 64 bytes because that's what everyone else does and going against the grain is too difficult now (and maybe there was no reason to in that case but still...)

The only thing I can think of that benefits from C's "nothing is well defined" approach is CHERI, but I bet a ton of code needs fixes to work with it.


So, another article supporting implementation dependent behavior when everybody is out complaining about undefined behavior.

Of course implementation-dependent behavior is a necessity. Undefined behavior too, in the very few cases that nobody complains about. This isn't one of them.


But to play the devil's advocate, where would we be if architectures followed higher-level abstractions to the point that they would be hard to differentiate? They'd be slower to run, maybe, but so much nicer to program for!


There was this idea of "Friendly C" which in part would have brought C semantics closer to the concrete computer semantics. But it fizzled out before it really even got started properly. Regehr's "post-mortem" post summarizes the problem: https://blog.regehr.org/archives/1287

While there is a "group of C users who are unhappy with aggressively optimizing C compilers", they are not unhappy enough to put in the effort to define and implement alternative semantics.


It also turns out that the optimizations are valuable. It's easy to think it's a bad deal when you are just complaining and don't actually have to face losing the optimizations.


If there was an optimization flag for sanity then I doubt people would complain.


The way I see it, Friendly C would have been more likely to be a perf win than a loss; it's more a replacement for -O0 (etc.) that would allow at least some optimizations on otherwise hopeless codebases.


turns out DWIM is not a spec :)


UNIX as a whole, and C in particular weren't any kind of well-designed projects. They are awful inside and out. Their staying power isn't due to the goodness of the design or their remarkable performance. They are with us because of the network effect. Same as IPv4 + NAT, Base64-encoded email attachments etc.

Plan 9 would be a similar attempt to this "Friendly C", or D language: these are attempts by people who were misguided to believe that the original technology was mostly good, and needed only a nudge in the right direction to fix a few problems here and there to make it perfect... but it turned out that the technology didn't succeed on technological merits, and improving the technology only so slightly isn't going to bring any new audience to the clone.

C has way too many problems, far beyond undefined behavior. None of that is important as long as the most popular operating system on the planet is written in C, and C89 at that...


> UNIX as a whole, and C in particular weren't any kind of well-designed projects. They are awful inside and out. Their staying power isn't due to the goodness of the design or their remarkable performance. They are with us because of the network effect.

Can you explain why Algol-68 and PL/I didn't generate any network effects?

C may have survived for over 50 years because of network effects. But it didn't get in the position to have network effects by being "awful inside and out". It got in the position to have network effects by being considerably better than the existing alternatives. (Better for actually writing programs, not better in any theoretical CS kind of way.)


I don't think it's fair to say that C was better by itself, as a language. It's more that it was better specifically in the niche of early microcomputers, in part because it had much less competition there. And then when that niche exploded - for other reasons - so did C.

In many ways, I think this parallels the rise of JavaScript, which was far from the best language in general, but happened to be the best language that would run in any browser - and so as popularity of the web grew, so did JS.


C was better than Pascal? Or Fortran? Puh-leeease!

Despite a small community and not a lot of people working on Fortran compilers (at least compared to C), Fortran programs usually still beat C on benchmarks.

And Pascal? -- Well, it has so much better grammar... like, it was specifically designed to be an unambiguous LL(1) language.

And these are only the two I can name off the top of my head that would straight-up win against C in almost every respect, or at least draw. And these two predate C.

If you look more closely at the history of UNIX and C, you realize that people who created it weren't guided by some great ambition to make a good product... they were just pricks who didn't like to study what others did before them, and thus invented their own square-wheel bicycle. They then also lacked the insight into how defective their bicycle was, but were really eager to sell it to those who knew even less about bicycles. It was through pure luck that UNIX took off and won the OS race. It has nothing to do with its engineering qualities.


> Fortran programs usually still beat C on benchmarks.

Sure... for the kinds of programs that were written in Fortran, which tended to be math-heavy. But nobody wanted to write text-processing programs in Fortran, or parsers, or operating systems, or memory managers. (I mean, seriously, think about writing "grep" in Fortran. You might be able to do it, but C, for all its flaws, is still a far better tool for that kind of task.)

Pascal... better grammar. Horrible to actually use, though, at least before the Turbo Pascal extensions. Text processing was incredibly painful, because there was no such thing as a variable-length string, which was a crippling limitation. I/O was also pretty broken. It was very much not better than C.

Neither Fortran nor Pascal would straight-up win against C for general-purpose programming, still less for system programming.

> they were just pricks who didn't like to study what others did before them

Feel free to cool it with the ad-hominems. They are against site rules.


> Sure... for the kinds of programs that were written in Fortran,

As someone who has to look into the code of util-linux, which is very much written in C, I can tell you that C... well, shouldn't have been used there. And the few (but still significant number of) times I had to make a trip into Linux kernel code, I can confidently tell you that C is a bad choice for that kind of program too.

There aren't good programs for C, or, to put it differently, C is not a good choice to solve any problem.

Just to give you an example of a bug in util-linux I faced very recently, to hopefully illuminate the problem further: there's a utility called mdadm (short for multiple devices admin). "Multiple devices" is the Linux name for RAID (basically, with minor differences). So, this utility must talk to the kernel a lot, especially to drivers such as raid0, raid1 etc.

The thing is, and due to historical mishaps... in the previous iteration of this communication protocol, tools were expected to use various ioctls to talk to the kernel. The system grew and grew, and outgrew itself. ioctls don't cut it anymore, as they need to carry too much information, much more than is plausible to stuff into the simple mechanism that they are. So, the new wave of kernel-utils communication goes through sysfs. But here comes the horror of every C programmer: parsing! If you talk to sysfs, all you get is file streams. These file streams have structured data in them, but there's no unifying format (so you cannot piggy-back on someone's hard work on creating a universal library to parse that stuff), nor are there any decent utilities in C to deal with extraction of data from file streams.

The result? -- the authors of mdadm discovered that in some circumstances using ioctl isn't going to work anymore, specifically when dismantling MD devices, but they also realized that parsing sysfs stuff is just too hard for them and... gave up. There's a "TODO" in their code, which has been there for many years, that says they should be using sysfs... and nothing has been done about it.
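For what it's worth, reading a single sysfs attribute from C isn't itself hard; a minimal sketch (assuming a typical md attribute path, not mdadm's actual code) might look like the snippet below. The pain described above starts when many such loosely formatted files have to be parsed and correlated.

    #include <stdio.h>

    /* Minimal sketch, not mdadm's code; the path is just a typical md
       attribute and may differ per system. */
    int main(void) {
        char state[64];
        FILE *f = fopen("/sys/block/md0/md/array_state", "r");
        if (!f) {
            perror("fopen");
            return 1;
        }
        if (fscanf(f, "%63s", state) == 1)   /* read one whitespace-delimited token */
            printf("array_state: %s\n", state);
        fclose(f);
        return 0;
    }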

I could blame the "lazy" mdadm programmers for it, but really, it's a fault of C. Even in some unappealing language like Python, this would've been a no-brainer...

Would Pascal win here? -- Absolutely. A lot of fears that C programmers have when it comes to dealing with strings are non-issues in Pascal. But, wait, Pascal has evolved, where C hasn't. Ada is in many ways a spiritual successor to Pascal.

> Feel free to cool it with the ad-hominems.

What you wrote is an ad-hominem attack; regardless of the site rules, I don't care about it. What you quoted, however, is based on memoirs and first-person impressions from people familiar with the subject. It gives a fair and appropriate description of the characters in question. While many of them are no longer with us, the remaining ones aren't likely to dispute the claim.


> Can you explain why Algol-68 and PL/I didn't generate any network effects?

The community was too small. The growth of the programming field was very rapid. I don't want to say "exponential" because I don't have the actual numbers, but you could see it because, well, you would almost never meet anyone with more than some 5 years of experience in almost any programming field for decades, i.e. in the 70's, 80's, 90's... I started my career in the 90's, and, so far, in real life, I have only met three programmers who started more than 10 years before my time.

The new generation of programmers simply didn't know much of what the previous generation did, and this repeated many times over, not just with Algol or PL/I.

> C may have survived for over 50 years because of network effects. But it didn't get in the position to have network effects by being "awful inside and out".

I don't know how much you know about UNIX history, but you are willfully misquoting me. Literally, the success of UNIX, and by proxy of C, is the network effect. In a more literal sense than you probably imagine. The appeal of UNIX was more or less this: the "real" computers of the day, the so-called "big iron", always had custom-made operating systems. Essentially, every different hardware model would ship with a different OS. UNIX was the first portable OS in the sense that you could install it on more than one CPU architecture (not from the very start, but that was the goal, and they succeeded at it). And the reason people wanted a portable OS was the idea behind how they wanted to build networks back in those days: have each "real" big-iron machine get a "side-kick" computer that handles the networking issues. The side-kicks would all run the same system, and serve as adapters to the "real" computers. UNIX, in a sense, was a glorified router modem... at least, that's what it was meant to be.

Quite soon programmers realized that instead of connecting "side-kick" computers to "real" computers, they might as well make the "real" computers run the same standardized OS. A much simpler one at that! Very little thought was put into why those systems on "real" computers had to be so complex and big. The same kind of enthusiasts who proclaimed that Emacs is huge but ended up using Eclipse, or who proclaimed that Ada is huge but ended up using C++, are the shortsighted programmers who promoted UNIX in its early days, wholeheartedly believing that complexity would somehow evaporate, that they were getting a simple tool to solve complex problems...

So, yeah, UNIX did become popular due to the network effect, literally and figuratively. C just piggy-backed on its success.


> I don't know how much do you know about UNIX history...

Perhaps more than you. I started a decade before you, so I was there for more of it than you were. (Not at the beginning, I admit.)

> but you are willfully misquoting me.

Not willfully - that takes intent. What, specifically, did I say you said that isn't what you said, or that was out of context? Having read your reply here, I still don't see what I'm misquoting.


Being there and studying it are very different things. It wouldn't surprise you to discover that the authors of UNIX were there, would it? And yet, somehow, I believe that I know better than they do -- how come?

You wouldn't believe it, but things often are easier to judge in hindsight, than when being involved with them...

> willfully

You pretend that I said that UNIX got into its position by being awful inside and out, but what I wrote is that it got into its position despite being awful inside and out. In other words, you pretend to misunderstand me, and then argue with something I didn't say.


You sure are free with accusations of bad faith on my part. That also is against site guidelines.


because they don't have any. C operates on headcanon


Early C largely inherited the data model and semantics of BCPL, which was built around "machine words" and not byte-addressable memory. The introduction of the C "char" allowed them to take advantage of PDP-11's byte addressability for special occasions, but "int" was still the default choice for storing everything from sizes, to file descriptors, to pointers.

In that sense, it was not exclusively built for a byte-addressable machine with registers of a varying size.


there are an awful lot of chars in the v6 unix code, i don't think it's really 'for special occasions'

also pointer arithmetic on a byte-addressed machine is different from int arithmetic, so you have to know if something is an int or an int pointer if you want to increment it

from the horse's mouth in https://www.bell-labs.com/usr/dmr/www/chist.html, dmr's hopl ii paper, with the advantage of 20 years of hindsight

> The machines on which we first used BCPL and then B were word-addressed, and these languages' single data type, the `cell,' comfortably equated with the hardware machine word. The advent of the PDP-11 exposed several inadequacies of B's semantic model. First, its character-handling mechanisms, inherited with few changes from BCPL, were clumsy: using library procedures to spread packed strings into individual cells and then repack, or to access and replace individual characters, began to feel awkward, even silly, on a byte-oriented machine.

> Second, although the original PDP-11 did not provide for floating-point arithmetic, the manufacturer promised that it would soon be available. Floating-point operations had been added to BCPL in our Multics and GCOS compilers by defining special operators, but the mechanism was possible only because on the relevant machines, a single word was large enough to contain a floating-point number; this was not true on the 16-bit PDP-11.

> Finally, the B and BCPL model implied overhead in dealing with pointers: the language rules, by defining a pointer as an index in an array of words, forced pointers to be represented as word indices. Each pointer reference generated a run-time scale conversion from the pointer to the byte address expected by the hardware.

> For all these reasons, it seemed that a typing scheme was necessary to cope with characters and byte addressing, and to prepare for the coming floating-point hardware. Other issues, particularly type safety and interface checking, did not seem as important then as they became later.


This confirms that it was conceived pretty much as a typeless language. The purpose of "char" was to be able to process byte-addressed strings character by character, and to do byte-sized pointer arithmetic without any runtime conversions. For most other purposes, "int" was the machine word. Hence the weak typing and the odd conversion and promotion rules between "char" and "int".


yes, b (like bcpl) was completely untyped, and in c both int and pointers were machine words (and in 6th edition c typing was very weak indeed), but you have the pointer arithmetic thing backwards

on a byte-addressed machine, byte pointer arithmetic works fine if you treat the byte pointers as integers and don't do any conversions at dereference time; that's what it means to be a byte-addressed machine usually (certainly in the case of the pdp-11)

it's pointers to larger-than-byte things (ints, pointers, and later floats, structs, and arrays) where runtime conversions rear their head; if you try to not distinguish between ints and pointers to ints, then for *(p+1) to refer to the int after *p (instead of one overlapping it, giving a bus error), you need to shift p left by one bit at dereference time, or two bits on a 32-bit machine (if its memory addresses identify 8-bit bytes, as on the 360, pdp-11, vax, and 8086). no such conversion is required for char pointers

hope this clarifies
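A small, hedged illustration of the scaling described above, as it surfaces in modern C on a byte-addressed machine: char pointer arithmetic steps by single bytes, while int pointer arithmetic is scaled by the element size, with the scaling folded into the address arithmetic at compile time (which is what typed pointers buy you).

    #include <stdio.h>

    int main(void) {
        int a[2] = {1, 2};
        int *p = a;
        char *c = (char *)a;

        /* int pointers advance by sizeof(int) bytes per step */
        printf("int*:  %p -> %p (step %zu)\n",
               (void *)p, (void *)(p + 1), sizeof *p);

        /* char pointers advance one byte at a time, matching the
           raw byte addresses of a byte-addressed machine */
        printf("char*: %p -> %p (step 1)\n",
               (void *)c, (void *)(c + 1));
        return 0;
    }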


Also worth noting that BCPL was originally designed as a means to bootstrap CPL, not to go around writing full systems with it, hence why it was so basic.


I think on x86 loading a char from memory took one more clock cycle than loading an int.


You mean one less cycle? I think that was only ever the case in the 8088 with its 8-bit bus.


> It's also not the case that C only succeeded in environments that were designed for it. In fact C succeeded in at least one OS environment that was relatively hostile to it and that wanted to be used with an entirely different language.

Well, I had to look into what was this.

And as expected it refers to C uptake against Pascal on the Mac OS.

Except what replaced Object Pascal was C++, not C, even though C-minded folks like to think otherwise.

MacApp was ported from Object Pascal into C++, and Metrowerks also added PowerPlant to the party.

MPW ultimately happened, because a few folks pushed for it.

https://en.wikipedia.org/wiki/Macintosh_Programmer%27s_Works...

https://en.wikipedia.org/wiki/MacApp

https://en.wikipedia.org/wiki/PowerPlant

After Object Pascal, the new kid on the block for Apple was C++, even if the Macintosh Toolbox exposed its API entry points as C-like.

Newton OS (Dylan lost to C++), Taligent and Copland were also mainly C++ based.

https://en.wikipedia.org/wiki/Newton_OS

https://en.wikipedia.org/wiki/Taligent

https://en.wikipedia.org/wiki/Copland_(operating_system)


Imagine a world where Apple did not latch onto C++ but rather... lisp.


So Dylan. They tried.


The world where Apple/Jobs never had their comeback, and became as much a part of the modern ecosystem as DEC?


It is more like they lost to Smalltalk, through Self and Objective-C.


I don’t think it makes sense to say C begat the modern OS/CPU or vice versa. These things tended to develop in tandem over time, which is kinda the author’s point. There are of course many such examples regarding OS’s and chip features, languages and frameworks/ecosystems, etc.

Also, not to be overly Neoplatonic, but is it not the case that C is basically what a smart person would come up with as a way to write portable-but-not-very-abstracted imperative code on a Von Neumann machine?


Yes, in fact one only has to look into what was happening during the late 50's in systems programming, starting with JOVIAL in 1958, or outside Bell Labs, to see it was indeed so.


But C is late 1960s / early 1970s, created after PL/I, Algol-68, and even Pascal.

Sadly, C went forth not only with hard-to-parse syntax, but also with stuff like undefined behavior, null-terminated strings, unchecked pointer arithmetic for array access, etc.

It was designed as a language for a confident kernel hacker working close to hardware, but ended up as a general-purpose programming language for the entire OS, and it's not the best fit for that role.


I try to always understand the current state of things in the context of how they were before the current state.

C and Unix came on to the scene - and they were much much better than what came before.

Users were meant to use the shell commands + awk and get a whole lot done with that. C programs were meant to be small. How small? Well - how much code would you be able to write using ed?

I think Unix intended for programmers to develop languages for users - use lex and yacc to come up with something to hand off to users.

The Unix operating system assumed a corporate org structure that just does not exist. Genius programmers and highly educated users.

https://youtu.be/tc4ROCJYbm0 <--- this is what they were expecting

If we went back in time to tell them how real corporations would be set up, I think C would have had safe defaults that could be disabled when needed and syntax closer to Go than C. In fact if we told them how things would be in 40 years, they would have settled on the erlang VM for stuff outside of systems and graphics programming.


C and Unix came on to the scene and they were free beer (for all practical purposes), so they won over the other stuff that cost lots of green paper.

Had AT&T been allowed to profit from their Bell Labs research projects, history would have been quite different, as proven when they went after Lions' commentary book and BSD shortly after being allowed to profit from their research again.


UNIX was completely irrelevant/unreachable for most people until Linux showed up.

By that time, C was already popular outside the UNIX world.


"Most people" don't matter here. The girl who bags your shopping doesn't need to know what UNIX® is for C to take over the world, but the guy who wrote Commander Keen (and who will go on to write Doom and Quake) does.

For example, in spring 1991, with Linux still just some C code Linus Torvalds was thinking of naming Freax if he got it working, JANET, the organisation providing network access to the UK's universities (via X.25, of course), decided to launch JIPS, an experimental IP network.

JIPS was huge. Why was JIPS huge? Because unlike X.25, you could just download BSD source, spin up a Unix with TCP/IP, and run everything you could think of or write your own software; you don't need anybody's permission - there's some guy at CERN who has written a "Web browser", which sounds pretty interesting, for instance. By the time Torvalds writes his Linux 0.0.1 announcement email, JIPS is the dominant use of JANET and X.25 is on its way to deprecation.

It was all bundled together with this free software I got, 100% legit. It seems to work pretty good, and unless you've got funding from somewhere to use something different I think we should use C / IP / Unix.


I don't know about most people. What may have mattered was easy access to Unix for students in universities. You would normally expect to find there some VAXen, Sun hardware, and even PDPs, all running Unix.

Had AT&T made Plan 9 available in equally relaxed terms, history might have gone differently, too.


University site licenses for UNIX were very inexpensive. UNIX/C was way more accessible than most competitors, and students took that experience out to the real world.


Nope, it was what most companies and universities were using in their compute centers, when not using MS-DOS with Novell Netware.


My understanding of computing history is that C was basically

1. A version of PL/I that doesn’t suck. From what I can tell it was way too broad and the implementations weren’t great.

2. An Algol that is designed in the context of “represents concepts that map cleanly to lower level semantics”. Which was basically just coming full circle from Algol being a way for computer scientists to have something more expressive than Fortran and COBOL, because people kept implementing Algol and realizing that it was missing things. In my own uneducated view, Algol (and many LISPs) are too structured around the concept of completing an evaluation of a program, which made sense in the computing world when they originated, but became out of date as computers started being used for more than just directly computing things.

3. Had good implementations of several “trendy” or cutting edge concepts of the time like preprocessing, recursion, and most importantly structs. Yeah most of this wasn’t technically new. But the prior art like Algol68 was horribly flawed for other reasons.

Because the underlying system allowed concepts like null-terminated strings and unchecked pointer arithmetic, it was fair game for C. I don’t see C as a “better or worse” thing compared to other languages but something that had/has to exist as a bridge between intriguing-but-flawed/limited high level languages and the more functional but unexpressive early languages that saw adoption outside of computer science.

Of course it didn’t need to be the case that the Unix ecosystem’s userspace was mostly C, but C was a huge step up for its time. Like try reading Fortran, COBOL, Basic, and Algol and tell me you’d rather work with that than C. Pascal was later and not that much better, plus computing was a lot more fractured/expensive and the internet was basically not a thing, so it’s not like one could always just start writing pascal on their Nix or vice versa. Even today Rust and C++ are basically the only things that can replace C in many contexts, and Rust is pretty new.


> It was designed as a language for a confident kernel hacker working close to hardware, but ended up as a general-purpose programming language for the entire OS, and it's not the best fit for that role.

C is a classic case where good enough was so good that many tries at perfect couldn't unseat it.

I never felt like C was a bad tool in the 80s and at least early 90s - in many cases it was better than the alternatives which would result in slow software or would impose difficult constraints (Pascal string limits and array semantics vs. C's strings and buffers). I never thought much about C as a "kernel hacking" language because I was writing software for MS-DOS where systems programming was calling bios routines or intercepting interrupts and doing unholy things. I guess I just saw C as better than Pascal, compiled BASIC, COBOL and slow interpreted languages. When I moved to Unix, C just was the low friction way...


C was a small enough language to allow compilation on a mid-1970s 16-bit minicomputer with limited memory and a small removable disk pack for storage.

PL/I compilers ran on mainframes.


Yeah, PDP-11 was worse than computers that came a decade before it. /s


Hard to parse? Not in my experience. I wrote a simple C compiler while in high school using a handwritten recursive descent parser. Parsing was relatively easy, and recursive descent made generating sensible parse errors possible (vs yacc). Register tracking and allocation was the hard part.


It is comparatively hard to parse, being context-sensitive (you need to keep track of types) as well as requiring an implementation of the preprocessor, which has arcane enough rules that most programmers don't really know them.

Also, there are a number of historic syntax quirks that a conforming compiler, at least pre-C23, has to understand. For example, function declarations that lack a return type specification (implicit int) or parameter lists.


Have you actually written a C parser? I will grant that C++ is incredibly hard, but C is relatively straightforward. Types are not at all difficult to parse in C, and the pre-processor isn't all that complex either when you have the spec in front of you. If you think it's hard, you probably didn't have the right documentation. I picked up the book "The Annotated ANSI C Standard" by Herbert Schildt sometime back in ~1993, and it made writing a recursive descent parser almost trivial. That book made C syntax actually make sense in a way that the K&R book and other introductory C books didn't.


It's "hard" compared to, say, Pascal, which has a grammar that's intentionally LL(1). The fact that parsing declarators is context-dependent in particular, requiring you to maintain and consult a symbol table at that point (before you even have an AST) already to determine whether a given identifier is a type or not, is a minor annoyance, but an annoyance nevertheless.


More complex from a theory view, but not hard. You need a symbol table regardless, and symbols tables are most decidedly not "hard" to implement. Sure, it's a layering violation, but that's how we were forced to write efficient code back when computing resources were scarce 30+ years ago. An entire C compiler and integrated development environment was only a couple of hundred KB of executable on computers back in the 1980s under MS-DOS and on the Amiga.


> Types are not at all difficult to parse in C

I haven't written a C parser, but a number of other parsers. I can confidently say that C syntax, as a result of historical development, ended up in a place where it is much more annoying to parse than a properly designed language with a LL(1) syntax. If your parser can parse the following, I both congratulate you for your persistence and ridicule you for the statement that this is "not at all difficult". C syntax is annoying at the very least.

        /* file-scope typedef: "Foo" is a type name from here on */
        typedef struct Foo Foo;

        void xx(void)
        {
                /* block-scope typedef: "Bar" names a type only inside xx */
                typedef int Bar;
                Bar x;
        }

        /* prototype with unnamed parameters */
        void foo(int, int, int);

        /* parameter names reuse "Foo" and "Bar"; the parser must consult
           its scoped symbol table to tell type names from identifiers */
        void bar(Foo Foo, int, int Bar);

        /* K&R-style definition with implicit int return type; "Bar" is a
           plain parameter name here, since the typedef inside xx has gone
           out of scope */
        baz(Bar, y, z)
                Foo *Bar;
                int y;
                double z;
        {
                return 1;
        }
I don't know what exactly is required from a conforming C compiler, but this is successfully compiled by gcc -std=c89.

The preprocessor, don't get me started. The D author, who is active on HN, has stated both that parsers are less than 0.1% of the work in a compiler, and that the C preprocessor is terrible and took him multiple attempts, I think spread over multiple years, to get right.


The way I handled it fell out as a result of how tokens were parsed. The token would be hashed, and that hash would be used to check if the token was a keyword in one hash table, then the same hash used to check in a symbol table. That made classification easy and low cost.

I don't think it's hard in practice if you use the right approach. More complex from a theory point of view, sure.

I am serious when I say that the Annotated ANSI C Standard book made this easy to understand. Without that book, parsing C types certainly did not make a lot of sense to me either. It can be found here: https://www.amazon.com/Annotated-ANSI-Standard-Programming-L...
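A toy sketch of the hash-once, look-up-twice idea described above (my own illustration, not the parent's compiler; the FNV hash and the collision-ignoring tables are placeholders):

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    enum tok_kind { TOK_KEYWORD, TOK_TYPE_NAME, TOK_IDENT };

    static uint32_t hash(const char *s) {
        uint32_t h = 2166136261u;                 /* FNV-1a offset basis */
        while (*s) { h ^= (unsigned char)*s++; h *= 16777619u; }
        return h;
    }

    #define NBUCKETS 64

    static const char *keywords[NBUCKETS];        /* keyword table */
    static const char *typedefs[NBUCKETS];        /* typedef names seen so far */

    static void add(const char **table, const char *name) {
        table[hash(name) % NBUCKETS] = name;      /* toy table: ignores collisions */
    }

    static enum tok_kind classify(const char *tok) {
        uint32_t h = hash(tok) % NBUCKETS;        /* hash computed once... */
        if (keywords[h] && strcmp(keywords[h], tok) == 0)
            return TOK_KEYWORD;                   /* ...checked against keywords... */
        if (typedefs[h] && strcmp(typedefs[h], tok) == 0)
            return TOK_TYPE_NAME;                 /* ...then against the symbol table */
        return TOK_IDENT;
    }

    int main(void) {
        add(keywords, "int");
        add(typedefs, "Foo");                     /* as if "typedef struct Foo Foo;" was seen */
        printf("%d %d %d\n", classify("int"), classify("Foo"), classify("x"));
        return 0;
    }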


C was created to write Unix on PDP-11 hardware. Prior to that operating systems were typically written in assembler. C was seen as a high level assembler and only later ported to other hardware.

Realistically, C should have been restricted to writing operating systems and device drivers. It is far too low-level for application programming. But since it was often the only portable high-level language, software vendors adopted it for purposes that it was never designed to handle. That was partly the reason that C++ and Objective C emerged.


Only portable language with free or vendor implementations.

It was initially tied to the rise of UNIX; being the C operating system meant that it had to have a C compiler, even if it wasn't always a part of the vendor install. That's why GNU set out to build a compiler before building an operating system.

Similar to how the microcomputer era spread interpreted BASIC as the language of choice.

If you wanted another language, you'd not only have to actively choose it but you'd have to pay money for it.


Relatively quickly we got Pascal, and the Borland dialect got widely adopted in academia. Unfortunately, because of teething problems (various dialects, TP only available for DOS/Windows, etc.), it eventually lost the battle with C. It's a pity, as we would have had to deal with far fewer bugs over those years.


Also note that the Borland dialect got inspiration from Apple's Object Pascal (TP 5.5) and UCSD (TP 4.0).

The switch in which customer base to target is what did more damage to Borland than anything else.

Delphi could still be a major language on the PC world at least.


C was created to port Unix V6 to PDP-11 hardware, after a couple of failed attempts with B.


v4 was the first kernel written in C, and the assembly kernel before that already ran on a PDP-11. I doubt anyone even considered using B for this.


I stand corrected, it was v4 and not v6, yeah.

For a while they used improved B versions, what Dennis calls New B, embryonic C and neonatal C in his account of the language's evolution.

https://www.bell-labs.com/usr/dmr/www/chist.html


C++ and Objective-C emerged because everybody had too much of the OOP Kool-Aid during the late 80's and early 90's.


Also Java and Python.


Also Java and Python... what? This sentence no verb.


Read it with the post they're responding to.


I guess this needed an example for people. That comment was an addendum, not a lone statement. It's read like this:

> That was partly the reason that C++ and Objective C emerged. Also Java and Python.


I don't think the C standard's inclusion of implementation-defined / undefined behavior is due to what OP calls a "documentation standard", documenting a variety of behavior already codified into various implementations...

Rather, it seems that the C standard tried (& succeeded) from the beginning in making it easy to implement a compiler that generated good code, even to the extent of including register allocation hints to the compiler. This also appears to be why the short/int/long types are defined the way they are - not as types with specific ranges, but rather only that long >= int >= short. This allowed a given implementation to map int to whatever was most appropriate to the target CPU, be it 16-, 32-, or 64-bit.
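A quick, hedged illustration of that ordering-only guarantee; the printed sizes vary by platform, which is exactly the latitude the standard preserves:

    #include <stdio.h>

    int main(void) {
        /* Only the ordering (long >= int >= short) and minimum ranges
           (short/int at least 16 bits, long at least 32) are guaranteed;
           the concrete widths are the implementation's choice. */
        printf("short: %zu bytes\n", sizeof(short));
        printf("int:   %zu bytes\n", sizeof(int));
        printf("long:  %zu bytes\n", sizeof(long));
        return 0;
    }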

In this vein, wanting to be fast on all targets, I don't think C wanted to assume the presence of an MMU or any hardware required to efficiently detect NULL pointers, so instead just left this as an instance of implementation defined behavior.

C isn't alone in this regard - AFAIK many languages only define the behavior of correct programs and leave the behavior of buggy ones as implementation defined.


> (There have also been actual C interpreters, some of which had strict adherence to the abstract semantics, cf (available online in the Usenix summer 1988 proceedings).)

I used one of these a little, Saber-C, circa 1990, on a Sun SPARCstation. It was glorious, and several steps beyond the Turbo C IDE on MS-DOS that I'd been using at home as a teen.

One of the big wins of Saber-C, before Purify and the later fancy open source memory checkers, was that it could quickly find memory problems. One of my mentors spent some evenings doing a Saber-C-powered memory-bug-search&destroy blitz through an open source Unix X11 game. (I think our own code had less need of help like that, so less Saber-C low-hanging-fruit fun.)

https://archive.org/details/1988-proceedings-summer-san-fran...


Some actionable takeaways about the abstract machine and undefined behavior of the standard.

* Just because something is undefined in the standard now does not mean that it has to be undefined forever. It might be an oversight, or priorities might shift. If you disagree with a particular undefined behavior and the ways it can manifest, you can raise it with the committee and/or compiler vendor.

* The behavior of a program is just as defined by the implementation now as it was before a standard existed. If you write code that contains operations whose behavior the standard does not define, the implementation might define it (but it doesn't have to). Consult the documentation of the implementations you care about. They have flags to enable certain extensions that define the behavior of some operations, at the expense of disabling some optimizations.
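One hedged example of such a flag: gcc and clang accept -fwrapv, which makes signed overflow wrap in two's complement instead of being undefined (and turns off optimizations that rely on overflow never happening). The snippet below is only meaningful when built with that flag; without it, the addition is undefined.

    #include <limits.h>
    #include <stdio.h>

    int main(void) {
        int x = INT_MAX;
        /* Well-defined only under -fwrapv (or an implementation that
           documents wrapping); prints INT_MIN there. Without such a
           flag, the standard assigns this addition no meaning. */
        printf("%d\n", x + 1);
        return 0;
    }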


The standard doesn't even need to change. The compilers can just stop interpreting the rules around UB with malicious compliance for the sake of 0.1% optimization in some other scenario.


You know what is commonly specified with abstract machine semantics now?

Assembly language. See e.g. the ARMv8 Architecture Reference Manuals, where code for an abstract machine is listed right there in every single instruction listing, and a great deal of appendix space is devoted to providing a library of helper routines for this machine, and also specifying things like the virtual memory system using the AM.

C programmers who think abstract machine semantics are bad are completely out of touch with reality. I put them in the same bucket as people who think all CPUs are 32-bit x86. They in turn are like the devs who maintain old COBOL systems, except less useful because backward compat and huge emulation advances have made it unnecessary to actually run such 32-bit x86 configurations, unlike the COBOL case.


Without trying to suggest the author is wrong (I'm confident he isn't) "C Minus-Minus" is interesting and does aim to be a "portable assembly language" which in turn is pretty much a target for an abstract machine which most modern compilers can deal with.

https://en.wikipedia.org/wiki/C--


This reminds me of how prestige dialects or formalizations of natural languages (e.g. Panini with Sanskrit) are perceived as the origin of a language, when in fact they are derivatives or refinements of vernacular.


Not at all a C partisan, but this blog is pretty amazing, with a bunch of interesting asides. A sort of hyperlinked version of somebody's entire worldview, and this is the entrypoint. Good job on the output


Ok, so this article is at least contributing to the good side of the discourse about undefined behavior, because it isn't actually delusional about reality. (I mean this in the sense of people being deluded/having the wrong notions about actual facts, by the way, rather than actual craziness. The people who are deluded about undefined behavior tend to say things like "C is definitely a portable assembler" or "compilers are trying to trick me" which are definitely false. A non-deluded person can be upset about the current state but understanding how C currently works is important if you want to discuss it.)

Anyways, to the actual content of the article, I agree with it but I think the frustration it accepts as reasonable is actually misguided. Here is my understanding of how things ended up for C (disclaimer: I was not born when most of this stuff happened.)

In the beginning, you had a C compiler for your computer, and it was basically just an assembler. This is what the "make C great again" people think the language really is under the hood, by the way. However, very quickly people realized that they wanted their C code to run elsewhere, and every computer does things differently, so they needed some sort of standard of what was approximately the lowest common denominator for most machines, and that became the C standard for what it is legal to do. The guarantee created at that point was that if you conform to the standard, every implementation of C has to run your program as the standard specifies. This was palatable to people because they had a bunch of machines with weird byte orderings or whatever, and it was obvious what would happen if the dumb compilers of the time translated their platform-specific code to a new architecture.

Later, the weird architectures started becoming rare. At the same time, though, a new architecture started growing: a virtual architecture, one where the compiler would actually "port" your code to the exactly same processor you were compiling for before, but the code would run faster. It would do this by starting to take latitude through intermediate transformations which it assumed it could do because your program should have been portable to the abstract machine.

Now, this completely weirded people out, because "I'm compiling to a new virtual architecture called 'x86-64 -O3' that is the same as 'x86-64 -O0' but faster and more restrictive" sounds really stupid. It's the same architecture, and they're not even real processors! But if you really think about it, compilers are really just taking advantage of the fact that your code is portable, because it works in the space of the C abstract machine, to do a "port" called "run a bunch of optimization passes". People understand when a port ends up causing a trap on another processor because of course it does that on the new platform. But getting people to understand that your unaligned accesses on the "-O0" machine are no longer valid on the "-O3" machine is much harder, because, again, the instructions that come out look awfully similar and straightforward most of the time, except for the weird times where a change "surprises" you because the transition between the two crossed through an invalid space. Kind of like a path that seems to have a "weird jump" because it normally crosses through 3D space and at some point someone found a shortcut through the fourth dimension.

Anyways, the performance virtual architecture is all well and good, but what I think will be interesting moving forward is the security virtual architecture, where overflows and out-of-bounds accesses and type confusions are focused on more. Right now as a side effect of performance optimization they end up causing headaches for people, but Valgrind/sanitizers are an interesting look into what compiling to "x86-64 for security testing" architecture looks like. The logical next step is even more exciting, because we're actually starting to deploy real architectures with security-focused features that will require ports that are every bit as concrete as any other physical architecture difference, which I think will "legitimize" this mindset to the people who I called deluded at the start of this now very rambly comment. Page protections mean that "const" is not something you can ignore. Pointer signing can mean that your "but they're the same bits underneath!" type confusions are no longer valid. ARM's Morello now means you can't play fast-and-loose with your pointers anymore; they're 128+ bits and you can't just decide you want to forge one out of an integer anymore without caring. Ports to these architectures absolutely rely on the existence of a C abstract machine, which has served pretty well considering that its existence is really just what a piece of paper says is legal or not, rather than something really planned beforehand.


Thank you for taking the time to explain this; I've seen numerous threads where people complain about C but lack enough understanding of it to know where to properly attribute the blame. C shares one common trait with assembly: the ability to access, interpret, and modify memory freely. It is also an intensely manual language and expects the programmer to be both discriminating and thorough when dealing with unexpected values. Beyond that, they should not be compared.

The language is not forgiving and as such has earned resentment from programmers who have had the benefit of using other languages that lessen the burden on the developer. Truthfully, all programming languages have an area where they excel (even ones we don't enjoy), and oftentimes the approach or requirements of a project determine the language that should be used. Many complain about C generally as being inadequate but fail to provide the context in which the language is employed, in which case it would be obvious that they should use another language.

I would also like to add that scale (as in LoC or project complexity) is an important factor to consider when selecting an appropriate language. C can be much easier to manage for smaller executables/libraries or projects that don't have many layers of abstraction. Simultaneously, modern software and application complexity has grown significantly since C's inception, and it is not commonly the most ideal solution. Discussions involving languages are enjoyable here when there is deliberation, but I loathe when they devolve into tribalistic posturing and whinging.


> C shares one common trait with assembly: the ability to access, interpret, and modify memory freely

In reality that is a trap; C makes you think that you might be able to poke memory in all sorts of ways, but in reality there are a lot of subtle restrictions around memory access. The whole discussion around pointer provenance is the tip of the iceberg here.

And that is pretty much at the core of the whole UB hullabaloo: the difference between what C seems to be (or to have been) and what the standard says.
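A hedged sketch of one such subtle restriction: the code below looks like harmless memory poking, but the standard only gives meaning to pointer arithmetic within a single object (plus one past its end), so whether p + 1 can be used to reach b is not a question the language answers, whatever the bytes in memory look like.

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        int a = 1, b = 2;
        int *p = &a;

        /* Comparing the raw addresses via uintptr_t is fine; actually
           dereferencing p + 1 to reach b would be undefined even if the
           addresses match, because p's provenance is the object a. */
        if ((uintptr_t)(p + 1) == (uintptr_t)&b)
            printf("p + 1 and &b share an address\n");
        else
            printf("a and b are not adjacent here\n");
        return 0;
    }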


To see what cross-platform C looked like in those early days, there's nothing like reaching for books like "A Book on C" from 1984 (Robert Edward Berry and B. A. E. Meekings), which has an implementation of RatC [0].

https://www.amazon.de/-/en/Robert-Berry/dp/0333368215/ref=sr...

Then there were Small-C and BDS C, two other major subsets in those early days.

[0] - Similar in ideas to Ratfor but applied to a K&R C subset


worth noting that bds c, which generates 8080 code, is now free software


I fully agree with this. For example, that assigning a freed pointer in C is UB is not because of optimization, but because there were real-world architectures with memory segmentation where loading such a pointer caused a run-time trap (e.g. 286 protected mode). Or that reading an uninitialized automatic variable is UB because it could cause a trap on architectures which could detect this (e.g. IA-64). There were also C versions with bounds checking etc. That the compilers which are popular today focus on exploiting UB for optimization instead of security is an implementation choice, not a fundamental problem of the language itself.


I like this description. It's a useful mental model.

But in summary, doesn't that just move the target from "The compiler is stupid, it shouldn't be doing this, it clearly should know what I mean and I didn't mean that!" to "This is a bad minimum common denominator: if any architecture really needs this guarantee or needs things to behave like this, then it should pay a performance penalty. We shouldn't all have to pay the portability price for this one thing that isn't an issue anywhere."

And to be honest, most of the UB hate I see is about the latter, not the former, no?


Most of the UB hate is that bugs that always existed only recently became exposed. "This code worked fine for years why is the compiler breaking it!" is always the rant, but it's misplaced. It should instead be "why didn't I get a sanitizer/linters/debug-whatever error first?"

The proliferation of optimization passes outpaced decent debuggability, and that's really the problem. Rants against UB are nearly always irrelevant or even just outright wrong. And worse still, those crusaders are harmful. You can see this in Rust as a perfect example. Signed integer overflow is defined as two's complement, much rejoicing from the "UB always bad!" crowd. Except wait a minute, in a debug build of Rust it's defined to be a panic. Why? Because signed integer overflow is 99.999% of the time a bug, and defining how it overflows doesn't actually help anyone. So instead you're left with the worst of both worlds - you both can't rely on how signed ints behave in Rust as a programmer because they have 2 extremely incompatible defined behaviors, and the optimizer/runtime then can't take advantage of them being undefined behavior in practice in release builds to optimize better.


It boils down to a culture problem: while communities around safer systems programming languages embrace having a panic on signed integer overflow, in the C languages world suggesting the use of -ftrapv (or similar) will make people reach for the pitchforks.

The linters and compiler security flags are there, the problem is getting them adopted.


Of course, a panic is also a failure. It may be a less serious one, or it may just make your rocket explode on take-off and kill anything down-range while the result of the incorrect computation would otherwise have been irrelevant.

One of the difficulties I've had with the 'safer systems programming languages' advocacy is that since something going wrong is inherent and unavoidable -- since the flaw is ultimately in the user's code -- there is a tendency to pretend that the panic isn't something going wrong. In my experience this has resulted in measurably lower-quality code from these communities, code which panics in slightly unexpected conditions -- while something written in C would not (yet may fail in a worse way when it does fail).

I don't think I've yet managed to download and run anything written in rust where it doesn't panic within the first 15 minutes of usage-- except the rust compiler itself and firefox (though I do now frequently get firefox crashes that are rust panics).

It may well be that the increased runtime sensitivity to programmer errors in these languages inherently means we should expect more runtime failures as previously benign mistakes are exposed, and that we ought to accept that software written in these languages may be less reliable on aggregate because when it does fail it's less likely to create security problems, and that this is a worthwhile tradeoff. (Python users sure seem to survive a near constant rate of surprising runtime failures...)

But to the extent and so long as language advocates pretend that panics aren't failures they can't really advocate for the trade-off, advance better static analysis to reduce the gap, and will continue to seem fundamentally dishonest to people who try to use the languages and software written in them and experience the frequent panics first hand.


The difference between Rust and Java is that Rust developers decided that panics shouldn't be recoverable except as an afterthought for C compatibility.

In principle most panics are amenable to retrying the operation, which would be the equivalent of catching exceptions in Java. So yes, you get an "error has occurred" warning, but your program doesn't terminate immediately. I don't think C has an edge over Java here.


How common is it for java code to handle exceptions in useful ways rather than just fail in even more inexplicable ways due to no one ever having conceived of much less tested those code paths being executed?


A panic makes an error situation visible; the C way can let an error situation go unnoticed for longer than expected, corrupting data in more unrecoverable ways than just crashing right there on the spot.

A bit like having warnings as errors, versus deciding to ignore warnings at the peril of what might come later, without the feedback of what those warnings were all about.


Yes, but visible at runtime. Depending on the situation you may well prefer* the silent failure. Many such silent failures are completely benign, e.g. the result of the wrong code (or whatever it corrupted) wasn't subsequently used.

*would prefer if you actually got to pick. But you don't get to pick because once you know of the bug you fix it either way.

Warnings as errors isn't a great example, because if you do it in code distributed to third parties it's an absolute disaster, as the warnings are not stable and there are constantly shifting false positives. It's perhaps not a good example even without distributing it, because it can lead to hasty "make it compile" 'fixes' that can introduce serious (and inherently warning-undetectable) bugs. It's arguably better to have warnings warn until you have the time to look at them and handle them seriously, so long as they don't get missed.

The parallel doesn't carry through to undefined behavior because the undefined behavior isn't logging a warning that you could check out later (e.g. before cutting a release).


However, culture results in artefacts. You mostly won't find American Football Stadiums in England's cities, because it's not part of their culture. If the English suddenly took to this game, such stadiums likely would take as much as several decades to become widespread.

C libraries like OpenSSL reflect what's culturally appropriate in that language, so even if you came to C from a language with a different culture, too bad it has the culturally appropriate API design and behaviour.


I think that OpenSSL has historically reflected a rather antiquated C culture that most software moved on from long ago, FWIW.

A clear example of this is OpenSSL intentionally mixing uninitialized memory into its randomness pool (because on some obscure and long forgotten platforms it was the only way they had to get any 'randomness'), resulting in any programs written using it absolutely spewing valgrind errors all over the place. (Unless your openssl has been compiled with -DPURIFY to skip that behavior, or had the debian "fix" of bypassing the rng almost completely :P ).


I think the OpenSSL situation you're talking about arises because of a mistake by a maintainer.

MD_Update(&m,buf,j);

Kurt Roeckx found this line twice in OpenSSL. Valgrind moaned about this code and Kurt proposed removing it. Nobody objected, so in Debian Kurt removed the two lines.

One of these occasions is, as you described, mixing uninitialized (in practice likely zero) bytes into a pool of other data, and removing it does indeed silence the Valgrind error and fixes the problem. The other, however, is actually how real random numbers get fed into OpenSSL's "entropy pool"; by removing it there is no entropy, and the result was the "Debian keys" - predictable keys "randomly" generated by affected OpenSSL builds.

I haven't seen OpenSSL people claim that the first, erroneous, call was somehow supposed to make OpenSSL produce random bits on some hypothetical platform where the contents of uninitialised memory doesn't start as zero, it looks more like ordinary C programmer laziness to me.


The odd thing with that incident is that the "PURIFY" define long predated it -- the correct fix in Debian should have been "just compile with -DPURIFY" -- I believe Red Hat was already doing so at the time.

> I haven't seen OpenSSL people claim that the first, erroneous, call was somehow supposed to make OpenSSL produce random bits on some hypothetical platform where the contents of uninitialised memory doesn't start as zero

I had an openssl dev explain (in person) to me when I complained about the default behavior: that there had been platforms that depended on that behavior, that they weren't sure which ones did, and so it didn't seem safe to eliminate it. (I'd complained because I couldn't have users with non -DPURIFY openssl code run valgrind as part of troubleshooting.) IIRC the use of uninitialized memory was intentional and remarked on in comments in the code.


They should make such considerations as:

- If the "uninitialized" data is actually somehow some kind of interference.

- In LLVM, using a "undef" value will not always do the same thing each time; however, the "freeze" command can be used to avoid that problem. (I don't know if this feature of LLVM can be accessed from C codes, or how the similar things are working in GCC.)

- If the code seems unusual, then you should write comments to explain why it is written in the way that it is. (You can then also know what considerations to make if you want to remove it.)

- Whether or not there is uninitialized data, you will need to make proper entropy too, from other proper entropy sources.


>So instead you're left with the worst of both worlds - you both can't rely on how signed ints behave in Rust as a programmer because they have 2 extremely incompatible defined behaviors, and the optimizer/runtime then can't take advantage of them being undefined behavior in practice in release builds to optimize better.

I don't understand how this is the worst of both worlds.

You can explicitly define overflow behavior in Rust. There are wrapper types and explicit checked or saturating and wrapping operations if those are necessary for the correctness of your program. If your program doesn't rely on them then checked overflow being the default in debug builds is the way to go and given enough confidence in the final product it makes sense to drop them in release builds and given enough processor advancements we can also do checked overflow in release builds.


I can do that in C/C++ where regular signed integer overflow is otherwise undefined behavior. The point is just Rust defining the behavior (no UB!) didn't do a damn thing to help anyone since if you actually want and expect overflow you need to use specific functions/wrappers to do that anyway. You also probably want the carry flag anyway so having regular addition be "defined behavior" is still useless.
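For reference, here's what the "specific functions/wrappers" route looks like in C today, using a GCC/Clang builtin rather than anything in the standard; the operation both wraps the result and reports the overflow, which is usually what callers actually want:

    #include <limits.h>
    #include <stdbool.h>
    #include <stdio.h>

    int main(void) {
        int sum;
        /* GCC/Clang builtin: stores the wrapped result in sum and
           returns true if the mathematical result didn't fit. */
        bool overflowed = __builtin_add_overflow(INT_MAX, 1, &sum);
        printf("overflowed=%d sum=%d\n", (int)overflowed, sum);
        return 0;
    }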


Yep, exactly this. There's a handful of undefined behavior that might actually be worth reconsidering, but almost all UB that people want turned into defined behavior are "yeah we had a bug let's make it do something about as bad but call it defined".


Wise man once said: "C is just memory, with syntax sugar."


One of the main consequences of how the abstract machine in the standard is defined is that it does not imply that there is one big flat memory. It requires pointers into arrays to behave as if there was, but that is it. In other words, taking the ptrdiff_t of two pointers that do not point into the same array is UB (6.5.6.10).

A trivial example of a platform where that is relevant is the 16-bit 8086 with anything but the tiny memory model (i.e. CS=DS=SS). Somewhat more relevant are various Harvard-ish embedded platforms with separate RAM/ROM address spaces. In both cases there are C implementations that are used in production applications.

Another reason for this rule (and probably the original one) is various platforms with segmentation-based fine-grained memory protection. That means either things like Burroughs large systems or running C on top of some kind of VM while preserving the underlying memory management. And there are C implementations for those kinds of environments (that most of the existing C code is not directly compatible with that kind of implementation is another matter).
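A minimal illustration of that rule; the "bad" line is left commented out because, although it will appear to work on a flat-memory machine, the standard gives it no meaning, and on segmented or capability machines it genuinely has none:

    #include <stddef.h>
    #include <stdio.h>

    int main(void) {
        int a[4], b[4];
        (void)b;                            /* only referenced in the commented-out line */

        ptrdiff_t ok = &a[3] - &a[0];       /* defined: both pointers are into a */
        /* ptrdiff_t bad = &b[0] - &a[0];      undefined: different objects */

        printf("%td\n", ok);                /* prints 3 */
        return 0;
    }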


Another thing that turns out to be compatible with the C memory model is "doing weird things with pointer bits", such as pointer authentication: https://www.qualcomm.com/content/dam/qcomm-martech/dm-assets...


for a definition of "just memory" that includes the strict aliasing rule. Certainly not a buffer of bytes, which is what most people would assume when they hear "just memory".


The strict aliasing rule that you point out is not part of the “just memory” part, but rather the “syntax sugar” part. Now you can debate if it’s the right kind of sugar, however.


No, the strict aliasing rule is part of the semantics of the memory. "Syntax sugar" is, by definition, part of the syntax.


Indeed. What the strict aliasing rule implies is that only well typed C programs have meaning. Which means that C is indeed not "just memory".

If C was just memory, the only operations allowed would be on and through memory addresses, and values wouldn't be first class.


"Memory" might not even be a concept in the C standard, rather it is what C programmers think about in practice. The culture around the language is that the language should get out of the way as much as possible while still providing a good amount of convenience, portability, and performance. The standard is a necessity, but is not the center of attention while working.


But then there's the memcpy escape hatch that lets you treat anything and everything as a raw sequence of bytes with no concern for types or aliasing. So arguably the fundamental memory model is still "just memory" (albeit not necessarily a single address space), and the rest is bolted on top and applies only to specific language constructs.
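A hedged sketch of that escape hatch (assuming the usual 32-bit IEEE-754 float): the cast version would violate strict aliasing, while the memcpy version is defined and typically compiles down to a single register move.

    #include <inttypes.h>
    #include <string.h>
    #include <stdio.h>

    int main(void) {
        float f = 1.0f;
        uint32_t bits;

        /* uint32_t bits = *(uint32_t *)&f;   -- undefined: wrong-type lvalue */
        memcpy(&bits, &f, sizeof bits);       /* defined: copies the raw bytes */

        printf("0x%08" PRIx32 "\n", bits);    /* 0x3f800000 for 1.0f on IEEE-754 */
        return 0;
    }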


Another one of my favorites of his: https://utcc.utoronto.ca/~cks/space/blog/programming/CTriump... (the C Juggernaut) I like it in part because I lived through some of that time period (getting code running on early Macs)

I can relate to people who just like programming in C. :-)


Curiously, C semantics influenced CPU instruction set architecture to support C better.


It's not unique in that regard; ARMv8.3 added an instruction to handle floating point conversion errors in a way that is more efficient for JavaScript.

https://twitter.com/gparker/status/1047246359261106176


The history of the C specification is irrelevant to the question of the abstract machine. Other languages developed this way are Rust, Scheme, and... yep, pretty much every single other language spec in existence.


"This is simultaneously true and false."

Why am I not surprised about this when talking about C.



