
IMHO the problem is with compilers (and their developers) who think UB really means they can do anything. What programmers usually expect is what the standard itself notes as one of the possible interpretations of UB: "behaving during translation or program execution in a documented manner characteristic of the environment".

Related reading:

http://blog.metaobject.com/2014/04/cc-osmartass.html

http://blog.regehr.org/archives/1180 and https://news.ycombinator.com/item?id=8233484




>the problem is with compilers (and their developers) who think UB really means they can do anything

But that's exactly what undefined behavior means.

The actual problem is that programmers are surprised-- that is, programmers' expectations are not aligned with the actual behavior of the system. More precisely, the misalignment is not between the actual behavior and the specified behavior (any actual behavior is valid when the specified behavior is undefined, by definition), but between the specified behavior and the programmers' expectations.

In other words, the compiler is not at fault for doing surprising things in cases where the behavior is undefined; that's the entire point of undefined behavior. It's the language that's at fault for specifying the behavior as undefined.

In other other words, if programmers need to be able to rely on certain behaviors, then those behaviors should be part of the specification.


In some sense the language is the compiler and the compiler is the language; the language is much like a human language, used for its utility in expressing things (ideas, programs). You can tell if your human language words work by determining if people understand you. If people start being obtuse and refusing to understand you because of an arbitrary grammar rule that isn't really enforced, you'd be right to be upset with the people just as much as the grammar.

It in fact doesn't matter at all what the standard says if GCC and LLVM say something different, because you can't use the standard to generate assembly code.

The standard imposes no requirements on UB, so it's the compiler's responsibility to do the most reasonable, least shocking thing possible. If I'm a GCC developer and you ran GCC on one of these fairly mundane examples and it compiled without error, then ran rm -rf / or stole your private RSA keys and posted them on 4chan, and I said "well, you can't be mad because it's undefined, it's the standard's fault", you'd probably punch me in the face after some quick damage control.

If it deletes an if branch or terminates a spinlock early, that's potentially even worse than those two examples.
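
To make the "deleted if branch" case concrete, here's a rough sketch (the function name is made up for illustration). The early dereference is what licenses the optimizer to assume the pointer is non-null and drop the later check as dead code:

  int length_plus_one(int *p) {
      int len = *p;      /* UB if p is NULL, so the optimizer may assume p != NULL */
      if (p == NULL)     /* ...and a compiler may then delete this whole branch */
          return -1;
      return len + 1;
  }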


>In some sense the language is the compiler and the compiler is the language; the language is much like a human language, used for its utility in expressing things (ideas, programs). You can tell if your human language words work by determining if people understand you. If people start being obtuse and refusing to understand you because of an arbitrary grammar rule that isn't really enforced, you'd be right to be upset with the people just as much as the grammar.

The shortcoming of this interpretation is that programs are not (only) consumed by humans; they're consumed by computers as well. Computers are not at all like humans: there is no such thing as "understanding" or "obtuseness" or even "ideas." You cannot reasonably rely on a computer program, in general, to take arbitrary (Turing-complete!) input and do something reasonable with it, at least not without making compromises on what constitutes "reasonable."

Along this line of thinking, the purpose of the standard is not to generate assembly code; it's to pin down exactly what compromises the compiler is allowed to make with regards to what "reasonable" means. It happens that C allows an implementation to eschew "reasonable" guarantees about behavior for things like "reasonable" guarantees about performance or "reasonable" ease of implementation.

Now, an implementation may choose to provide stronger guarantees for the benefit of its users. It may even be reasonable to expect that in many cases. But at that point you're no longer dealing with C; you're dealing with a derivative language and non-portable programs. I think that for a lot of developers, this is just as bad as a compiler that takes every liberty allowed to it by the standard. The solution, then, is not for GCC and LLVM to make guarantees that the C language standard doesn't strictly require; the solution is for the C language standard to require that GCC and LLVM make those guarantees.

Of course, it doesn't even have to be the C language standard; it could be a "Safe C" standard. The point is that if you want to simultaneously satisfy the constraints that programs be portable and that compilers provide useful guarantees about behavior, then you need to codify those guarantees into some standard. If you just implicitly assume that GCC is going to do something more or less "reasonable" and blame the GCC developers when it doesn't, neither you nor they are going to be happy.


On the other hand, the expected and desirable behavior in one platform might be different from that in another platform. It's possible to overspecify and end up requiring extra code when performing ordinary arithmetic operations, or lock yourself out of useful optimizations.


Which is exactly the motivation behind implementation-defined behavior. There's a broad range of "how much detail do you put in the specification" between the extremes of "this is exactly how the program should behave" and "this program fragment is ill-formed, therefore we make no guarantees about the behavior of the overall program whatsoever."
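
One concrete example of that middle ground, for what it's worth: right-shifting a negative signed value has an implementation-defined result in C (C11 6.5.7), not an undefined one, so the implementation has to document what it does and the rest of the program stays on solid ground:

  #include <stdio.h>

  int main(void) {
      int x = -8;
      printf("%d\n", x >> 1);   /* implementation-defined value, commonly -4 (arithmetic shift) */
      return 0;
  }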


Implementation-defined behavior at best just tells you that the behavior is guaranteed to be deterministic (or not). You still cannot reason about the behavior of the program by just looking at the source.

And I'm not sure if optimizations such as those that rely on strict aliasing would be possible if the behavior were simply implementation-defined.
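
For reference, a sketch of the usual textbook case of what strict aliasing buys (the function name is made up). Because a store through an int lvalue may be assumed not to touch a float object, the compiler can skip reloading *fp and just return the constant:

  float set_and_read(float *fp, int *ip) {
      *fp = 1.0f;
      *ip = 0;         /* under strict aliasing, assumed not to modify *fp */
      return *fp;      /* may be folded to 1.0f at -O2 */
  }

If the behavior were merely implementation-defined, an implementation documenting "stores are visible through any pointer type" would have to keep the reload.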


The desire to reason about the precise behavior of a program and the desire to take advantage of different behavior on different platforms are fundamentally at odds. Like I said, there's a broad range of just how much of the specification you leave up to the implementation; it's an engineering trade-off like any other.


My point is that by leaving a behavior to be defined by the implementation you're not making it any easier to write portable code, but you may be making some optimizations impossible.


That's not entirely true. Regarding portability, the layout of structs for example is implementation-defined to allow faster (by virtue of alignment) accesses or more compact storage depending on the system, but it's perfectly possible to write portable code that works with the layout using offsetof and sizeof (edit: and, of course, regular member access :) ).
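
Something like this toy example (struct and field names are just for illustration) stays portable even though the padding and offsets differ between implementations, because it asks the implementation rather than assuming a layout:

  #include <stddef.h>
  #include <stdio.h>

  struct record {
      char tag;
      double value;    /* padding before this member is implementation-defined */
  };

  int main(void) {
      printf("sizeof(struct record) = %zu\n", sizeof(struct record));
      printf("offsetof(value) = %zu\n", offsetof(struct record, value));
      return 0;
  }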

That said, I would agree that, on the whole, C leans too heavily on under-specified behavior of every variety. It's just not an absolute.


It's a stupid convention of compiler writers and standards writers at the expense of common sense and engineering standards. In fact there are many thousands of lines of C code that depend on compilers doing something sensible with UB. For example 0 is a valid address in many cases (even in some versions of UNIX). The decision to allow compiler writers to make counter-factual assumptions on the basis of UB is the kind of decision one expects from petty bureaucrats.


>For example 0 is a valid address in many cases (even in some versions of UNIX).

0 may be a valid address at runtime, but a NULL pointer is always invalid.

On such platforms, the compiler should handle pointers to address 0 correctly - and the NULL pointer need not have an all-zero representation there, and must not compare equal to any valid pointer.

But a 0 or NULL constant, when converted to a pointer type, MUST result in a NULL pointer value - whose representation may be nonzero. Dereferencing such a pointer is UB.
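
Roughly, in code (the function name is made up; this is just to illustrate the distinction):

  #include <stdint.h>

  void pointers(void) {
      char *np = 0;              /* the constant 0 converts to the NULL pointer,
                                    whose representation need not be all-bits-zero */
      uintptr_t zero = 0;
      char *p0 = (char *)zero;   /* runtime integer-to-pointer conversion is
                                    implementation-defined and may designate address 0 */
      (void)np;
      (void)p0;
  }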


There are no compiler writers throwing out

  if(undefined_behavior) {
    ruin_developers_day();
  }
It tends to be the effect of spec-valid optimizations making assumptions that would only be false in the presence of undefined behavior.


People have been a little sloppy with the terms, but there's a difference between implementation defined behavior and undefined behavior. Generally, the committee allows undefined behavior when it doesn't believe a compiler can detect a bug cheaply.

Of course, many programmers complain about how the committee defines "cheaply." Trying to access an invalid array index is undefined because the way to prevent that kind of bug would be to add range checking to every array access. So, each extra check isn't expensive, but the committee decided that requiring a check on every array access would be too expensive overall. The same applies to automatically detecting NULL pointers.
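
For what it's worth, a mandated check would look roughly like this (at is a hypothetical helper, not anything in the standard); the committee's judgment is that paying for that branch on every single access is too much to require of every implementation:

  #include <stdio.h>
  #include <stdlib.h>

  static int at(const int *a, size_t len, size_t i) {
      if (i >= len) {                  /* the branch the committee declined to mandate */
          fprintf(stderr, "index %zu out of range\n", i);
          abort();
      }
      return a[i];
  }

  int main(void) {
      int a[4] = {1, 2, 3, 4};
      printf("%d\n", at(a, 4, 2));
      return 0;
  }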

And the fact that the standard doesn't require a lot -- a C program might not have an operating system underneath it, or might be compiled for a CPU that doesn't offer memory protection -- means that the committee's idea of "expensive" isn't necessarily based on whatever platforms you're familiar with.

But it is certainly true that a compiler can add the checks, or can declare that it will generate code that acts reliably even though the standard doesn't require it. And it's even true that compilers often have command line switches specifically for that purpose. But in general I believe those switches make things worse: your program isn't actually portable to other compilers, and when somebody tries to run your code through a different compiler, there's a very good chance they won't get any warnings that the binary won't act as expected.


Why restrict yourself to one compiler if you can write portable code?

Clang and GCC provide flags that enable nonstandard behavior, and you can use static analysis and dynamic tools (ASan, UBSan) to detect errors in your code; it does not have to be hard to write correct code.
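
For instance, a toy program like this (the exact diagnostic text varies), built with something along the lines of cc -fsanitize=undefined, gets a runtime report pointing at the signed overflow instead of silently odd codegen:

  #include <limits.h>
  #include <stdio.h>

  int main(void) {
      int x = INT_MAX;
      printf("%d\n", x + 1);   /* signed overflow: UB, flagged by UBSan at runtime */
      return 0;
  }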


Strict aliasing and ODR violations are extremely difficult to detect; these are the poster children for "undefined behavior that's hard to avoid and could seriously ruin your day if the compiler gets wind of it."

There does appear to finally be a strict aliasing checker, but I have no experience with it.


In the main, people seem to be unfamiliar with what lies underneath C, so they never really get the idea that you might be able to (or want to) expect any behaviour other than that imposed by C's own definition.


Right. Except for a few optimizer edge cases, you generally know what "undefined behavior" is going to spit out on a particular machine. Signed integer overflow, for example, almost always happens exactly the way you'd expect.
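
One of those optimizer edge cases, as a sketch: the wraparound itself usually behaves like two's complement, but a check phrased in terms of the overflowed value can be folded away entirely, because the compiler may assume the overflow never happens:

  int will_not_overflow(int x) {
      return x + 1 > x;   /* gcc and clang at -O2 typically fold this to 1 */
  }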



