
> we're talking about what a standards conforming compiler can do

No. You may be, I am not.

I am talking about what a reasonable C compiler can do, and what the ANSI standards committee intended.

Adhering to the ANSI/ISO standard is at best a necessary condition for producing a useful C compiler. It is most definitely not a sufficient condition. As I've pointed out many times before, this was intentional.

And the existence of pre-standard C compilers that worked is, of course, clear evidence that it is not a necessary condition either. Or at least was not.

The C standard leaves a lot "undefined" or "implementation defined" that is actually well-defined on a concrete machine the compiled code runs on. If you seriously think the intention of all this was to allow demons to fly out of nostrils or for compilers to start mining bitcoins, which is all perfectly legal by the standard, well, I don't really know what to say.

Particularly because the creators of the standards clearly stated so. In the very standard itself. Now they didn't make that language mandatory, so yeah, you can make a "standards compliant" compiler that violates that intent. But it will be a sucky compiler.

C ≠ The ANSI/ISO C standard.




> The C standard leaves a lot "undefined" or "implementation defined" that is actually well-defined on a concrete machine the compiled code runs on.

This assumes there is a one-to-one correspondence between C constructs and machine code, which isn't true. C isn't a "portable assembler" and compilers are luckily able to choose whatever machine code they think will perform best under the assumption that there is no undefined behavior.

> But it will be a sucky compiler.

I don't think gcc or clang are sucky compilers at all. In fact, their ability to aggressively optimize valid code is extremely helpful for producing high-performance programs.

> C ≠ The ANSI/ISO C standard

So C is defined neither by the standard _nor_ by existing implementations and is instead defined to be whatever your headcanon is?


> This assumes there is a one-to-one correspondence between C constructs and machine code

No it does not.

> which isn't true.

Actually, it is true in a vast majority of cases.

> C isn't a "portable assembler"

Not sure why people keep repeating this despite it being so obviously and patently untrue:

"Committee did not want to force programmers into writing portably, to preclude the use of C as a “high-level assembler:”

https://www.open-std.org/JTC1/SC22/WG14/www/docs/n897.pdf

p10, line 39

"C code can be portable. "

line 30


> And the existence of pre-standard C compilers that worked is, of course, clear evidence that it is not a necessary condition either. Or at least was not.

The difference is that those compilers didn't have a spec to follow or disobey anyways. They can always just say "whatever happened is right" and you could argue that the compiler did something _unhelpful_ but the compiler broke no promise because it made no promise.

> Particularly because the creators of the standards clearly stated so. In the very standard itself.

They clearly stated the opposite:

""" * Undefined behavior --- behavior, upon use of a nonportable or erroneous program construct, of erroneous data, or of indeterminately-valued objects, for which the Standard imposes no requirements. Permissible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message), to terminating a translation or execution (with the issuance of a diagnostic message) """

from ANSI C89 (§1.6).

The compiler may ignore the situation completely, leading to unpredictable results. If the compiler ignores a situation, such as a signed integer overflow, and that leads to unpredictable results, such as "next time we see an if-statement, execute both the if-block and the else-block", then that's 100% conforming. That's exactly what was intended. This sort of thing can happen: suppose the evaluation of an if-condition ends up in some CPU flags, and the compiler knows the two flags are exclusive because the only way they wouldn't be is if the program had a signed integer overflow.
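To make the signed-overflow case concrete, here is a minimal C sketch. The function names `increment_is_larger` and `checked_add` are made up for illustration and come from neither the standard nor this thread. The first function is the kind of code where a compiler is permitted to assume overflow never happens; the second tests for overflow before adding, so the undefined case is never reached.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* The only input for which x + 1 > x could be false is INT_MAX, and there
 * the addition itself overflows (undefined behavior). A conforming compiler
 * may therefore compile this function as "return true". */
bool increment_is_larger(int x) {
    return x + 1 > x;
}

/* Hypothetical helper: check the operands first so the overflowing
 * addition is never evaluated. Returns false if a + b would overflow. */
bool checked_add(int a, int b, int *sum) {
    assert(sum != NULL);
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b)) {
        return false;
    }
    *sum = a + b;
    return true;
}
```

Calling `increment_is_larger(INT_MAX)` is exactly the situation under dispute; `checked_add(INT_MAX, 1, &s)` simply reports failure.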

FWIW, GCC 1.17, released in January 1988, would launch 'nethack' during the compile if it detected a #pragma that it didn't understand. The idea of interpreting it as "anything can happen" is neither incorrect nor new. (Technically in this case, an unknown pragma has implementation-defined behavior which is the same as UB plus a requirement that the behavior must be documented in the compiler's manual.) It was a bad idea though and they removed it because a compiler that does that is, well, not good. We call this property "quality of implementation", but it wasn't a correctness issue.


> didn't have a spec to follow or disobey anyways.

Exactly. Yet they were still C compilers. So the idea that "being a C compiler" is the same as "follows the spec" is clearly nonsense. You can follow the spec and not be a usable C compiler, and you can be a usable C compiler and not follow the spec.

And the C spec not being complete was intentional, because otherwise too many already existing C compilers would not have had a chance of becoming ANSI compliant, and thus the spec would have been meaningless.

> They clearly stated the opposite:

> [..]

> Permissible undefined behavior ranges from ignoring ...

Very clearly gives a list of "permissible behaviors". Now if you believe that one of those options is "do anything you please" when it both clearly doesn't say that (it just says "ignore with unpredictable results") AND it doesn't make logical sense, I'm not sure how to help you.

(The two other options for permissible behavior are clearly completely redundant if one of them permits you to do anything you want whatsoever).

> "next time we see an if-statement, execute both the if-block and the else-block", then that's 100% conforming

It is "conforming" to the spec that has made that part non-binding. It does not conform to "ignore the situation", because the unpredictable behavior mentioned in the spec is that of the environment, not the compiler.


> Very clearly gives a list of "permissible behaviors"

A statement like "we have products ranging from cake decorating to peanut crocheting to jousting lances" does not mean that this is a list of the only three types of items in the store. This construction is called a "false range" in English. When you have a range of something that does not have an order, it means "varied things, left unspecified". It's very clearly not supposed to be an exhaustive list, merely a few examples.

So the standard lists three examples. The first is that the runtime behaviour of the program may do any-unpredictable-thing. The second is that the compiler may 'behave in a documented manner' and maybe issue an error. They wanted these two examples because they didn't want any misunderstanding that UB was limited to what could be shown UB statically at compile time, nor that it was limited to only having effects on the program at runtime.

I'm not honestly sure why they bothered adding the third "oh, the compiler or the program terminates with an error message". I could speculate that this is what they wish would actually happen, and including it in the list improves the chance of that.

> unpredictable behavior mentioned in the spec is that of the environment, not the compiler.

I don't believe that's correct -- I think it appertains to the program not the environment or the compiler -- but it doesn't matter either way. The environment is responsible for supporting execution of the program, so if it's unpredictable then it follows that any unpredictable things can happen to your program -- it would be like trying to run on a CPU that's experiencing physical failures.


> A statement like "we have products ranging from cake decorating to ...

If you have a statement like that, that is likely true. However, this is not a statement like that.

1. It gives a range of 3 permissible options. If one of these "permissible" options is "you can do anything", what are the other 2 options doing on that list?

2. Even worse for your interpretation, the very word permissible only makes sense if there are things that are not permissible. So once again, "you can do anything" makes no sense.

Both of these are nonsensical. Now had they actually written "you can do anything", this would be easy: they wrote nonsense. But they didn't write "you can do anything". What they actually wrote is "ignoring the situation completely with unpredictable results". That this somehow (how?) means "I can do anything" is purely your interpretation. And your interpretation leads to nonsense. So clearly your interpretation is wrong, particularly when there is an alternative interpretation that does not lead to nonsense.

In addition, the "interpretation" that does not lead to nonsense is the one that takes the words literally. "Ignore the situation". Not "act on the situation and then do anything I damn well please".

3. Even a heterogeneous range restricts. Yes, "we have products ranging from cake decorating to peanut crocheting to jousting lances" does not mean that those are the only items in the store. But even with your somewhat odd choice of items, if you go into that store and ask for an aircraft carrier, you will get odd looks, because the range of items mentioned clearly restricts the items they stock to non-aircraft carriers.

> it would be like trying to run on a CPU that's experiencing physical failures.

No, it is like reading beyond the range of an array: the machine will attempt to read from that location, that may return a value, we don't know what value, or it may signal a fault. What it does is not defined by the standard, hence undefined behavior.
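A small C sketch of the array case, for concreteness. The helper name `read_at` is hypothetical. An expression such as `arr[len]` is the out-of-range read the standard leaves undefined; the helper below refuses the read instead of performing it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: reads arr[idx] only when idx is in range.
 * On an out-of-range index it returns false and leaves *out untouched,
 * so the undefined out-of-bounds read never happens. */
bool read_at(const int *arr, size_t len, size_t idx, int *out) {
    assert(arr != NULL && out != NULL);
    if (idx >= len) {
        return false;
    }
    *out = arr[idx];
    return true;
}
```

Whether the unchecked read would have returned garbage, faulted, or been optimized on the assumption that it cannot occur is the point the two sides here disagree on.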

It's not *that* hard.


> 1. It gives a range of 3 permissible options. If one of these "permissible" options is "you can do anything", what are the other 2 options doing on that list?

Before the list it clearly says, "behavior, [... when UB occurs elided ...], for which the Standard imposes no requirements."

How can it both impose no requirements, yet simultaneously impose a requirement that it come out of that list of three options?

> That this somehow (how?) means "I can do anything" is purely your interpretation.

Purely mine?

* https://devblogs.microsoft.com/oldnewthing/20140627-00/?p=63...

* https://blog.regehr.org/archives/213

* https://blog.llvm.org/2011/05/what-every-c-programmer-should...

* https://stackoverflow.com/questions/32132574/does-undefined-...

* https://en.wikipedia.org/wiki/Undefined_behavior#Examples_in...

* https://www.youtube.com/watch?v=ehyHyAIa5so

I must have an amazing number of sock puppet accounts.


> Purely mine?

Surely you've observed that bad ideas and even errors are not immune to spread.


That is also true.

And it turns out that the opinion "we can do anything we want" is based on the non-normativity of the section we were debating about, not on misinterpreting it.


You started the discussion by saying that we can understand what the committee meant by simply reading the text in the standard that they wrote. Remember? Why then does it matter if it's non-normative text (aka. informative)? If it's non-normative it's there to explain what they were thinking.

I double-checked my copy of the C89 standard (draft), and as far as I can tell the text is normative. Non-normative text includes footnotes and appendices and editor's notes in square brackets, but I didn't see any of those involved. Sometimes there's a section that states 'the following text is non-normative' but I didn't see any of that either. Why do you think it's non-normative?

If that text is non-normative, where is the normative definition of undefined behavior?

Regardless, I think I understand what's happened here. If I may? You've read the text and you've repeatedly commented that we have to discard the "it allows you to do anything" interpretation because that interpretation is nonsense.

I'll stipulate that when reading English, we regularly have to discard nonsense interpretations. "Out the window, the mountains looked over a beautiful lake", we discard the interpretation where the mountains have eyes and are looking out a window. This is a normal part of reading English.

I posit that you believe that the "anything can happen" interpretation is impossible so strongly that no possible wording would ever lead you to the interpretation intended by the committee.

So instead, how about I explain why the committee chose to define UB this way? How it isn't nonsense?

It's so that a C compiler could use the plain "add" instruction for an addition in C across all the crazy CPU designs. Here, let me make a simple example, this isn't a real CPU. Suppose the CPU has a status register which contains "signed overflow" as one of its bits. This bit is set or cleared when you do an ALU operation, including ADD. The same status register is reused when doing a memory operation, but that bit is reused to indicate whether you're going to the first or second bank of memory. The CPU authors think that this is great, if you're doing pointer math and you add a pointer and an integer then you transparently overflow from the first bank of memory into the second and it looks like a contiguous address space! The system integrator (or, motherboard designer, roughly) decides to use bank selection for a different purpose. There's no way the first bank of memory would ever be completely full (nobody buys or sells that much RAM) so they put the RAM on bank 0 and the I/O ports on bank 1. Their system has memory-mapped I/O! So far, everybody's done something that seems sensible to them. Third, the C programmer writes "*p = x + y;". What happens if x and y are signed ints and the addition overflows? The signed overflow bit gets set, then the STORE instruction accesses the I/O ports instead of the memory!

Is the C compiler buggy for not inserting an extra instruction that clears the signed overflow bit of the status register? The committee intentionally decided that no, this is how signed integers should work and if you want potentially slower integers with guaranteed semantics that you should use unsigned integers instead. I think that attaching well-definedness to signed/unsigned was a bad move, but this is what they did. (And it does what you've said you want in other comments: the + in C becomes whatever the machine ADD instruction does!)
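The "guaranteed semantics" of unsigned integers can be sketched in a few lines of C. Unsigned arithmetic is defined to wrap modulo `UINT_MAX + 1`, so the functions below have no undefined cases. The names `wrapping_add` and `add_with_carry` are made up for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Unsigned addition wraps modulo UINT_MAX + 1 by definition,
 * so this is well-defined for every pair of inputs. */
unsigned wrapping_add(unsigned a, unsigned b) {
    return a + b;
}

/* Hypothetical helper: stores the wrapped sum and reports whether a
 * carry out of the top bit occurred. A wrapped sum is always smaller
 * than either operand, which is what the comparison detects. */
bool add_with_carry(unsigned a, unsigned b, unsigned *sum) {
    assert(sum != NULL);
    *sum = a + b;
    return *sum < a;
}
```

For example, `wrapping_add(UINT_MAX, 1u)` is required to be `0u` on every conforming implementation, whereas the same expression on `int` at `INT_MAX` has no required result at all.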

The C committee invented undefined behavior as a way to ensure that the compiler really could ignore the situation. (FWIW, real CPU ISAs back then had all kinds of interesting ideas. We hadn't yet agreed that bytes are 8 bits. Or that we should use 2's complement. Some designers looked at division by zero as an invalid operation and thought that this was an excellent feature that should be brought over to other operations like add and mul, hence "trap values" in the C standard.)

C and Unix were a commercial success (CPU firms could skip writing their own OS every time), and starting then, CPU designers made ISAs where C could be easily lowered to efficient assembly on their machines. This notion that there was an efficient lowering from C to the CPU, at the time C was standardised, is an anachronism. C created UB to handle then-contemporary CPUs, then later CPUs created ISAs that matched C syntax. If you don't work in compilers or assembly or CPU design, this might be surprising, but if you aren't intentional about making your ISA work well in C, it's easy to accidentally make one which doesn't. Intel MMX famously couldn't be targeted from compilers because the compilers don't have sufficient information to solve where to put the necessary EMMS instructions. Oops!

Decades down the road, CPU designs evolved and started creating new patterns that don't match C well -- and the C language didn't evolve with them. What expression is PSHUFB in C? Or VPMASKMOVD? The CPU firms knew enough to make sure that the compilers could support their new instructions, so they added these as CPU-specific extensions. If you wanted them, you had to write non-portable code that only worked when compiling to target their CPU and not others.
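To illustrate why PSHUFB has no single C operator, here is a scalar sketch of what the 128-bit form computes: each output byte is either zero (if the top bit of the control byte is set) or the source byte selected by the low four bits of the control byte. The function name `pshufb_scalar` is hypothetical; this is a plain-C model of the semantics, not the intrinsic a compiler provides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of the 128-bit byte shuffle. For each of the 16 lanes:
 *   mask bit 7 set   -> output byte is 0
 *   mask bit 7 clear -> output byte is src[mask & 0x0F]
 * A temporary buffer is used so dst may alias src, as it does in the
 * real instruction where the destination register is also the source. */
void pshufb_scalar(uint8_t dst[16], const uint8_t src[16],
                   const uint8_t mask[16]) {
    assert(dst != NULL && src != NULL && mask != NULL);
    uint8_t tmp[16];
    for (size_t i = 0; i < 16; i++) {
        tmp[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
    }
    for (size_t i = 0; i < 16; i++) {
        dst[i] = tmp[i];
    }
}
```

A compiler would have to recognize a loop of this shape to emit the single instruction, which is part of why vendors exposed such operations as CPU-specific extensions.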

The compiler engineers believe in C being a portable language. If you write code using SSE or AVX and compile it to ARM with clang, it will compile and port the SSE builtins to ARM Neon vector extensions. Doing this required the compilers to be a whole lot smarter about how the code works, and is a large part of the source of modern complaints about compilers "exploiting" undefined behaviour. I counted 23 ADD instructions in contemporary x86-64, assuming you include fused instructions (LEA) and exclude things like OR (saturating add without carry between bits) and XOR (addition with 1-bit vector lanes, lol).

Finally, C defines an abstract machine. In this context, machine is a "term of art" in computer science, popular types of machines include finite automata (the machine side of the 'regular expression' language), push-down automata and Turing machines. If you've seen those before, you may know that they're usually pictured as a directed graph, with states drawn as nodes and state transitions as directed edges, the edges labelled with the circumstance under which this edge is taken. Now, C's machine and the three I listed have a key difference, those three are all decision machines, meaning they exist to either accept or reject an input string. The C machine is a functional machine, it describes a function that transforms an input to an output. (In this treatment, you may picture side-effects as being part of the output.) The C standard defines how such C abstract machines are written down (the C syntax) and what semantics they have: states, and state transitions.

The question is what happens when you are in a state and receive an input for which the standard does not define any particular state transition? In a finite automaton or PDA or Turing machine you have a single state named "error" (or "reject") at which point you reject the input string as not being a member of the set that the machine is deciding (aka., your input string fails to match the regex, and we're done). In a functional machine, we don't traditionally have such a state. By defining UB in the way they did, the C standard is stating that when no state transition is specified, it may go anywhere, including to states that aren't required to exist and on which the standard places no requirements.
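The "decision machine with an explicit reject state" idea can be sketched in C as a tiny finite automaton. This example decides the regular expression `a+b+`; the state names and the function `matches_a_plus_b_plus` are invented for illustration. Every (state, input) pair without a listed transition goes to `REJECT`, which is the well-defined fallback that the paragraph above says a functional machine traditionally lacks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* States of a finite automaton deciding the language a+b+. */
enum state { START, IN_A, IN_B, REJECT };

/* Transition function: any input with no listed edge leads to REJECT,
 * and REJECT has no way out. */
static enum state step(enum state s, char c) {
    switch (s) {
    case START: return c == 'a' ? IN_A : REJECT;
    case IN_A:  return c == 'a' ? IN_A : (c == 'b' ? IN_B : REJECT);
    case IN_B:  return c == 'b' ? IN_B : REJECT;
    default:    return REJECT;
    }
}

/* Runs the automaton over the whole string; IN_B is the only accepting state. */
bool matches_a_plus_b_plus(const char *input) {
    assert(input != NULL);
    enum state s = START;
    for (size_t i = 0; input[i] != '\0'; i++) {
        s = step(s, input[i]);
    }
    return s == IN_B;
}
```

Here a missing transition has one fixed, documented outcome. Under the reading argued above, a missing transition in the C abstract machine has no such fixed target.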


If it doesn't change whether the argument is correct or not, why did mpweiher bring it up?



