
>the problem is with compilers (and their developers) who think UB really means they can do anything

But that's exactly what undefined behavior means.

The actual problem is that programmers are surprised-- that is, programmers' expectations are not aligned with the actual behavior of the system. More precisely, the misalignment is not between the actual behavior and the specified behavior (any actual behavior is valid when the specified behavior is undefined, by definition), but between the specified behavior and the programmers' expectations.

In other words, the compiler is not at fault for doing surprising things in cases where the behavior is undefined; that's the entire point of undefined behavior. It's the language that's at fault for specifying the behavior as undefined.

In other other words, if programmers need to be able to rely on certain behaviors, then those behaviors should be part of the specification.




In some sense the language is the compiler and the compiler is the language; the language is much like a human language, used for its utility in expressing things (ideas, programs). You can tell if your human language words work by determining if people understand you. If people start being obtuse and refusing to understand you because of an arbitrary grammar rule that isn't really enforced, you'd be right to be upset with the people just as much as the grammar.

It in fact doesn't matter at all what the standard says if GCC and LLVM say something different, because you can't use the standard to generate assembly code.

The standard has nothing to say about what should happen on UB, so it's the compiler's responsibility to do the most reasonable, least shocking thing it can. If I were a GCC developer and GCC compiled one of these fairly mundane examples without error, and the resulting program then ran rm -rf / or stole your private RSA keys and posted them on 4chan, and I said "well, you can't be mad, it's undefined, it's the standard's fault," you'd probably punch me in the face after some quick damage control.

If it instead deletes an if branch or terminates a spinlock early, that's potentially even worse than those two examples.
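To make the "deleted if branch" concrete, here's a minimal sketch of the classic null-check case (read_flags is a hypothetical name; the spinlock case is the analogous removal of a side-effect-free infinite loop):

    #include <stddef.h>

    /* Dereferencing a null pointer is UB, so after the dereference the
     * compiler may assume p != NULL and drop the check below. */
    int read_flags(int *p) {
        int flags = *p;      /* UB if p == NULL */
        if (p == NULL)       /* GCC/LLVM may remove this branch entirely */
            return -1;
        return flags;
    }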


>In some sense the language is the compiler and the compiler is the language; the language is much like a human language, used for its utility in expressing things (ideas, programs). You can tell if your human language words work by determining if people understand you. If people start being obtuse and refusing to understand you because of an arbitrary grammar rule that isn't really enforced, you'd be right to be upset with the people just as much as the grammar.

The shortcoming of this interpretation is that programs are not (only) consumed by humans; they're consumed by computers as well. Computers are not at all like humans: there is no such thing as "understanding" or "obtuseness" or even "ideas." You cannot reasonably rely on a computer program, in general, to take arbitrary (Turing-complete!) input and do something reasonable with it, at least not without making compromises on what constitutes "reasonable."

Along this line of thinking, the purpose of the standard is not to generate assembly code; it's to pin down exactly what compromises the compiler is allowed to make with regards to what "reasonable" means. It happens that C allows an implementation to eschew "reasonable" guarantees about behavior for things like "reasonable" guarantees about performance or "reasonable" ease of implementation.

Now, an implementation may choose to provide stronger guarantees for the benefit of its users. It may even be reasonable to expect that in many cases. But at that point you're no longer dealing with C; you're dealing with a derivative language and non-portable programs. I think that for a lot of developers, this is just as bad as a compiler that takes every liberty allowed to it by the standard. The solution, then, is not for GCC and LLVM to make guarantees that the C language standard doesn't strictly require; the solution is for the C language standard to require that GCC and LLVM make those guarantees.

Of course, it doesn't even have to be the C language standard; it could be a "Safe C" standard. The point is that if you want to simultaneously satisfy the constraints that programs be portable and that compilers provide useful guarantees about behavior, then you need to codify those guarantees into some standard. If you just implicitly assume that GCC is going to do something more or less "reasonable" and blame the GCC developers when it doesn't, neither you nor they are going to be happy.


On the other hand, the expected and desirable behavior on one platform might be different from that on another. It's possible to overspecify and end up requiring extra code for ordinary arithmetic operations, or to lock yourself out of useful optimizations.


Which is exactly the motivation behind implementation-defined behavior. There's a broad range of "how much detail do you put in the specification" between the extremes of "this is exactly how the program should behave" and "this program fragment is ill-formed, therefore we make no guarantees about the behavior of the overall program whatsoever."
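For instance, right-shifting a negative value and converting an out-of-range value to a signed type are implementation-defined rather than undefined; a small sketch of the difference:

    #include <stdio.h>

    int main(void) {
        /* Both results are implementation-defined, not undefined: the
         * implementation must pick a behavior and document it, so the
         * program stays meaningful even though the output may differ
         * across platforms. */
        int shifted   = -8 >> 1;            /* right shift of a negative value */
        signed char c = (signed char)200;   /* out-of-range signed conversion */
        printf("%d %d\n", shifted, (int)c);
        return 0;
    }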


Implementation-defined behavior at best tells you that the behavior is deterministic for a given implementation (and sometimes not even that). You still cannot reason about the behavior of the program just by looking at the source.

And I'm not sure that optimizations such as those that rely on the strict aliasing rule would be possible if the behavior were simply implementation-defined.
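For what it's worth, here's a sketch of the kind of type-based alias analysis that rule enables (assuming the usual GCC/LLVM behavior at -O2):

    /* Because an int object may not be accessed through a float lvalue,
     * the compiler may assume the store to *f cannot change *i and fold
     * the return value to the constant 1. If every such access were
     * merely implementation-defined, the implementation would have to
     * commit to some documented result instead. */
    int alias_example(int *i, float *f) {
        *i = 1;
        *f = 2.0f;    /* UB if f and i actually refer to the same storage */
        return *i;    /* typically compiled to "return 1" */
    }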


The desire to reason about the precise behavior of a program and the desire to take advantage of different behavior on different platforms are fundamentally at odds. Like I said, there's a broad range of just how much of the specification you leave up to the implementation; it's an engineering trade-off like any other.


My point is that by leaving a behavior to be defined by the implementation you're not making it any easier to write portable code, but you may be making some optimizations impossible.


That's not entirely true. Regarding portability, the layout of structs for example is implementation-defined to allow faster (by virtue of alignment) accesses or more compact storage depending on the system, but it's perfectly possible to write portable code that works with the layout using offsetof and sizeof (edit: and, of course, regular member access :) ).
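A rough sketch of what that looks like in practice (names here are just illustrative):

    #include <stddef.h>
    #include <stdio.h>
    #include <string.h>

    struct record {
        char tag;
        int  value;   /* padding before this member is implementation-defined */
    };

    int main(void) {
        /* Portable despite the implementation-defined layout: sizeof and
         * offsetof report whatever padding and alignment this
         * implementation actually chose. */
        printf("sizeof = %zu, offsetof(value) = %zu\n",
               sizeof(struct record), offsetof(struct record, value));

        struct record r = { 'x', 42 };
        unsigned char buf[sizeof r];
        memcpy(buf, &r, sizeof r);   /* layout-agnostic copy of the in-memory form */
        return 0;
    }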

That said, I would agree that, on the whole, C leans too heavily on under-specified behavior of every variety. It's just not an absolute.


It's a stupid convention of compiler writers and standards writers at the expense of common sense and engineering standards. In fact there are many thousands of lines of C code that depend on compilers doing something sensible with UB. For example 0 is a valid address in many cases (even in some versions of UNIX). The decision to allow compiler writers to make counter-factual assumptions on the basis of UB is the kind of decision one expects from petty bureaucrats.


>For example 0 is a valid address in many cases (even in some versions of UNIX).

0 may be a valid address at runtime, but a NULL pointer is always invalid.

On such platforms, the compiler should handle pointers to address 0 correctly; the NULL pointer need not be represented by the value 0, and it must not compare equal to any valid pointer.

But the constant 0 or NULL, when converted to a pointer type, MUST result in a null pointer value, whose representation may be nonzero. Dereferencing such a pointer is UB.
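A sketch of that distinction, for a hypothetical platform where address 0 is real, mapped memory:

    #include <stddef.h>
    #include <stdint.h>

    int is_null(const void *p) {
        return p == NULL;   /* compares against the null pointer, whatever its bit pattern */
    }

    volatile uint32_t *reg_at_zero(void) {
        /* A runtime integer zero converted to a pointer: the result is
         * implementation-defined and need not be the null pointer. The
         * constant 0, by contrast, always converts to the null pointer. */
        uintptr_t addr = 0;
        return (volatile uint32_t *)addr;
    }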



