
> how insidious undefined behavior is.

Indeed. UB in C doesn't mean "and then the program goes off the rails", it means that the entire program execution was meaningless, and no part of the toolchain is obligated to give any guarantees whatsoever if the program is ever executed, from the very first instruction. A UB-having program could time-travel back to the start of the universe, delete it, and replace the entire universe with a version that did not give rise to humans, and thus did not give rise to computers or C, and thus never existed.

It's so insidiously defined because compilers optimize based on UB; they assume it never happens and will make transformations to the program whose effects could manifest before the UB-having code executes. That effectively makes UB impossible to debug. It's monumentally rude to us poor programmers who have bugs in our programs.




I'm not sure that's a productive way to think about UB.

The "weirdness" happens because the compiler is deducing things from false premises. For example,

1. Null pointers must never be dereferenced.

2. This pointer is dereferenced.

3. Therefore, it is not null.

4. If a pointer is provably non-null, the result of `if(p)` is true.

5. Therefore, the conditional can be removed.
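In code, that chain of reasoning plays out something like this (a hypothetical snippet, not from any particular codebase):

    int deref_then_check(int *p) {
        int v = *p;    /* step 2: p is dereferenced...            */
        if (p)         /* steps 3-4: ...so p "cannot" be null and */
            return v;  /* the check folds to true,                */
        return 0;      /* step 5: this fallback is deleted        */
    }

If p really is null at run time, the "safe" return 0 path is long gone by the time the crash happens.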

There are definitely situations where many interacting rules and assumptions produce deeply weird, emergent behavior, but deep down, there is some kind of logic to it. It's not as if the compiler writers are doing

   if(find_undefined_behv(AST))
      emit_nasal_demons()
   else
      do_what_they_mean(AST)


The C and C++ (and D) compilers I wrote do not attempt to take advantage of UB. What you got with UB is what you expected to get - a seg fault with a null dereference, and wraparound 2's complement arithmetic on overflow.

I suppose I think in terms of "what would a reasonable person expect to happen with this use of UB" and do that. This probably derives, again, from my experience designing flight critical aircraft parts. You don't want to interpret the specification like a lawyer looking for loopholes.

It's the same thing I learned when I took a course in high performance race driving. The best way to avoid collisions with other cars is to be predictable. It's doing unpredictable things that causes other cars to crash into you. For example, I drive at the same speed as other traffic, and avoid overtaking on the right.


I think this is a core part of the problem; if the default for everything were to not take advantage of UB, things would be better - and machines are fast enough now that we shouldn't NEED all these optimizations except in the most critical code, perhaps.

You should need something like

    gcc --emit-nasal-daemons
to get the optimizations that can hide UB, or at least horrible warnings that "code that looks like it checks for null has been removed!!!!".


AFAIK GCC does have switches to control these optimizations; the issues begin when you want to use something other than GCC. Otherwise you're just locking yourself to a single compiler - and at that point you might as well switch to a more comfortable language.
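For reference, these are real GCC flags (the exact set and defaults vary by version):

    gcc -O2 -fwrapv -fno-strict-aliasing -fno-delete-null-pointer-checks foo.c

which respectively make signed overflow wrap, disable type-based aliasing assumptions, and stop the compiler from removing null checks "proven" redundant by an earlier dereference. Clang accepts the same spellings; most other compilers don't.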


> What you got with UB is what you expected to get - a seg fault with a null dereference, and wraparound 2's complement arithmetic on overflow.

This is how it worked in the "old days" when I learned C. You accessed a null pointer, you got a SIGSEGV. You wrote a "+", then you got a machine add.


In the really old DOS days, when you wrote to a null pointer, you overwrote the DOS vector table. If you were lucky, fixing it was just a reboot. If you were unlucky, it scrambled your disk drive.

It was awful.

The 8086 should have been set up so the ROM was at address 0.


This is the right approach IMO, but sadly not all C compilers work like that even when they could (e.g. they target the same CPU). So even if one compiler guarantees it won't introduce bugs from an overzealous interpretation of UB, unless you plan to never use any other compiler, you'll still be subject to said interpretations.

And if you do decide that sticking to a single compiler is best then might as well switch to a different and more comfortable language.


This is the problem; every compiler outcome is a series of small logic inferences that are each justifiable by language definition, the program's structure, and the target hardware. The nasal demons are emergent behavior.

It'd be one thing if programs hitting UB just vanished in a puff of smoke without a trace, but they don't. They can keep spazzing out literally forever and do I/O, spewing garbage to the outside world; UB can't be contained even to the process at that point. I personally find it offensive and rude that tools get away with being so garbage that they can't even promise to help you crash and diagnose your own problems. One mistake and you invite the wrath of God!


> I personally find that offensive and rude that tools get away with being so garbage that they can't even promise to help you crash and diagnose your own problems.

This is literally why newer languages like Java, JavaScript, Python, Go, Rust, etc. exist. With the hindsight of C and C++, they were designed to drastically reduce the types of UB. They guarantee that a compile-time or run-time diagnostic is produced when something bad happens (e.g. NullPointerException). They don't include silly rules like "not ending a file with newline is UB". They overflow numbers in a consistent way (even if it's not a way you like, at least you can reliably reproduce a problem). They guarantee the consistent execution of statements like "i = i++ + i++". And for all the flak that JavaScript gets about its confusing weak type coercions, at least they are coded in the spec and must be implemented in one way. But all of these languages are not C/C++ and not compatible with them.
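For instance, this line is UB in C because i is modified twice with no sequence point in between, whereas Java pins down left-to-right evaluation (starting from i = 1, i always ends up 3):

    int i = 1;
    i = i++ + i++;  /* UB in C; fully defined and reproducible in Java */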


Yes, and my personal progression from C to C++ to Java and other languages led me to design Virgil so that it has no UB, has well-defined semantics, and yet crashes reliably on program logic bugs giving exact stack traces; but unlike Java and JavaScript, it compiles natively and has some systems features.

Having well-defined semantics means that the chain of logic steps taken by the compiler in optimizing the program never introduces new behaviors; optimization is not observable.


It can get truly bizarre with multiple threads. Some other thread hits some UB and suddenly your code has garbage register states. I've had someone UB the fp register stack in another thread so that when I tried to use it, I got their values for a bit, and then NaN when it ran out. Static analysis had caught their mistake, and then a group of my peers looked at it and said it was a false warning, leaving me to find it long afterwards... I don't work with them anymore, and my new project is using Rust, but it doesn't really matter if people sign off on code reviews that have unsafe{doHorribleStuff()}


On the contrary, the latter is a far more effective way to think about UB. If you try to imagine that the compiler's behaviour has some logic to it, sooner or later you will think that something that's UB is OK, and you will be wrong. (E.g. you'll assume that a program has reasonable, consistent behaviour on x86 even though it does an unaligned memory access). If you look at the way the GCC team responds to bug reports for programs that have undefined behaviour, they consider the emit_nasal_demons() version to be what GCC is designed to do.


> There are definitely situations where many interacting rules and assumptions produce deeply weird, emergent behavior

The problem is how, due to other optimisations (mainly inlining), the emergent misbehaviour can occur in a seemingly unrelated part of the program. This can make the inference chain very difficult to reconstruct, as you have to trace paths through the entire execution of the program.

The same issue occurs with other types of data corruption - it's why NPEs are so disliked - but UB's blast radius is both larger and less predictable.


I agree with the factual things that you said (e.g. "entire program execution was meaningless"). Some stuff was hyperbolic ("time-travel back to the start of the universe, delete it").

> [compilers] will make transformations to the program whose effects could manifest before the UB-having code executes [...] It's monumentally rude to us poor programmers who have bugs in our programs.

The first statement is factually true, but I can provide a justification for the second statement which is an opinion.

Consider this code:

    #include <stdio.h>

    void foo(int x, int y) {
        printf("sum %d", x + y);
        printf("quotient %d", x / y);
    }
We know that foo(0, 0) will cause undefined behavior because it performs division by zero. Integer division is a slow operation, and under the rules of C, it has no side effects. An optimizing compiler may choose to move the division earlier so that the processor can do other useful work while the division runs in the background. For example, the compiler can hoist the expression x / y above the first printf(), which is entirely legal. But then the program appears to crash before the sum was computed and the first printf() executed. UB time travel is real, and that's why it's important to follow the rules, not just draw conclusions from observed behavior.
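Concretely, the compiler is allowed to act as if foo had been written like this (a sketch of the transformation, not any particular compiler's actual output):

    void foo(int x, int y) {
        int q = x / y;  // hoisted above the call; may trap before any output
        printf("sum %d", x + y);
        printf("quotient %d", q);
    }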

https://blog.regehr.org/archives/232


...Why is the compiler reordering so much?

Look. I get it, clever compilers (I guess) make everyone happy, but are absolute garbage for facilitating program understanding.

I wonder if we are shooting ourselves in the foot with all this invisible optimization.


People like fast code.


In 2022, is there any other reasons to use C besides "fast code" or "codebase already written in C"?


No, and, in fact, the first one isn't valid - you can use C++ (or a subset of it) for the same performance profile with fewer footguns.

So really the only time to use C is when the codebase already has it and there is a policy to stick to it even for new code, or when targeting a platform that simply doesn't have a C++ toolchain for it, which is unfortunately not uncommon in embedded.


"codebase already written in C" includes both "all the as yet unwrapped libraries" and "the OS interface".


There isn't. Fast code is pretty important to a lot of people, though, while security isn't (games, renderers, various solvers, simulations, etc.).

It's great C is available for that. If you're OK with slow, use Java or whatever.


> Integer division is a slow operation, and under the rules of C, it has no side effects.

Then C isn't following this rule - crashing is a pretty major side effect.


The basic deal is that in the presence of undefined behavior, there are no rules about what the program should do.

So if you as a compiler writer see: we can do this optimization and cause no problems _except_ if there's division by zero, which is UB, then you can just do it anyway without checking.


Only non-zero integer division is specified as having no side effects.

Division by zero is in the C standard as "undefined behavior" meaning the compiler can decide what to do with it, crashing would be nice but it doesn't have to. It could also give you a wrong answer if it wanted to.

Edit: And just to illustrate, I tried in clang++ and it gave me "5 / 0 = 0", so some compilers in some cases do indeed make use of their freedom to give you a wrong answer.


To my downvoters, since I can no longer edit: I've been corrected that the rule is integer division has no side effects except for dividing by zero. This was not the rule my parent poster stated.


> I've been corrected

No you haven't. The incorrect statement was a verbatim quote from nayuki's post, which you were responding to. Please refrain from apologising for other people gaslighting you (edit: particularly, but not exclusively, since it sets a bad precedent for everyone else).


At the CPU level, division by zero can behave in a number of ways. It can trap and raise an exception. It can silently return 0 or leave a register unchanged. It might hang and crash the whole system. The C language standard acknowledges that different CPUs may behave differently, and chose to categorize division-by-zero under "undefined behavior", not "implementation-defined behavior" or "must trap".

I wrote:

> Integer division is a slow operation, and under the rules of C, it has no side effects.

This statement is correct: if the divisor is not zero, then division truly has no side effects and can be reordered anywhere; if the divisor is zero, the C standard says it's undefined behavior, so that case can be disregarded entirely. Hence we can assume that division never has side effects. It doesn't matter whether the underlying CPU has a side effect for div-zero; the C standard permits the compiler to completely ignore that case.


> I wrote:

> > Integer division is a slow operation, and under the rules of C, it has no side effects.

Yes, you did, and while that's a reasonable approximation in some contexts, it is false in the general case, since division by zero has a side effect in the form of invoking undefined behaviour. (Arguably that means it has every possible side effect, but that's more of a philosophical issue. In practice it has various specific side effects like crashing, which are specific realizations of its theoretical side effect of invoking undefined behaviour.)

vikingerik's statement was correct:

> [If "Integer division [...] has no side effects",] Then C isn't following this rule - crashing is a pretty major side effect.


> it is false in the general case, since division by zero has a side effect in the form of invoking undefined behaviour.

They were careful to say “under the rules of C,” the rules define the behaviour of C. On the other hand, undefined behaviour is outside the rules, so I think they’re correct in what they’re saying.

The problem for me is that the compiler is not obliged to check that the code is following the rules. It puts so much extra weight on the shoulders of the programmer, though I appreciate that using only rules which can be checked by the compiler is hard too, especially back when C was standardised.


> They were careful to say "under the rules of C,"

Yes, and under the rules of C, division by zero has a side effect, namely invoking undefined behaviour.

> The problem for me is that the compiler is not obliged to check that the code is following the rules.

That part's actually fine (annoying, but ultimately a reasonable consequence of the "rules the compiler can check" issue); the real(ly bad and insidious) problem is that when the compiler does check that the code is following the rules, it's allowed to do it in a deliberately backward way that uses any case of not following the rules as an excuse to break unrelated code.


Undefined behavior is not a side effect to be "invoked" by the rules of C. If UB happens, it means your program isn't valid. UB is not a side effect or any effect at all, it is the void left behind when the system of rules disappears.


Side effects are a type of defined behavior. Crashing is not a "side effect" in C terms.


> Indeed. UB in C doesn't mean "and then the program goes off the rails", it means that the entire program execution was meaningless, and no part of the toolchain is obligated to give any guarantees whatsoever if the program is ever executed, from the very first instruction.

This is the greatest sin modern compiler folks have committed in abusing C. The C language never says the compiler can change the code arbitrarily because of a UB statement. It is undefined. Most UB code in C, while not fully defined, has an obvious core of semantics that everyone understands. For example, an integer overflow, while not defined as to what the final value should be, is still understood to be an operation that updates a value. It is definitely not, e.g., an assertion on the operands on the grounds that UB can't happen.

Think about our natural language, which is full of undefined sentences. For example, "I'll lasso the moon for you". A compiler, which is a listener's brain, may not fully understand the sentence and it is perfectly fine to ignore the sentence. But if we interpret an undefined sentence as a license to misinterpret the entire conversation, then no one would dare to speak.

As computing goes beyond arithmetic and programs grow in complexity, I personally believe some amount of fuzziness is key. This current narrow view from the compiler folks (which has somehow been accepted at large) is really, IMO, a setback in the evolution of computing.


> It is definitely not, e.g., an assertion on the operand because UB can't happen.

The C specification says a program is ill-formed if any UB happens. So yes, the spec does say that compilers are allowed to assume UB doesn't happen. After all, a program with UB is ill-formed and therefore shouldn't exist!

I think you're conflating "unspecified behavior" and "undefined behavior" - the two have different meanings in the spec.


> C specification says a program is ill-formed if any UB happens. So yes, the spec does say that compilers are allowed to assume UB doesn't happen.

I disagree on the logic from "ill-formed" to "assume it doesn't happen".

> I think you're conflating "unspecified behavior" and "undefined behavior" - the two have different meanings in the spec.

I admit I don't differentiate those two words. I think they are just word-play.


The C standard defines them very differently though:

  undefined behavior
    behavior, upon use of a nonportable or erroneous program
    construct or of erroneous data, for which this International
    Standard imposes no requirements

  unspecified behavior
    use of an unspecified value, or other behavior where this
    International Standard provides two or more possibilities
    and imposes no further requirements on which is chosen in
    any instance
Implementations need not, but obviously may, assume that undefined behavior does not happen. However the program behaves when undefined behavior is invoked is simply how the compiler chose to implement that case.


"Nonportable" is a significant element of this definition. A programmer who intends to compile their C program for one particular processor family might reasonably expect to write code which makes use of the very-much-defined behavior found on that architecture: integer overflow, for example. A C compiler which does the naively obvious thing in this situation would be a useful tool, and many C compilers in the past used to behave this way. Modern C compilers which assume that the programmer will never intentionally write non-portable code are.... less helpful.


> I disagree on the logic from "ill-formed" to "assume it doesn't happen".

Do you feel like elaborating on your reasoning at all? And if you're going to present an argument, it'd be good if you stuck to the spec's definitions. It'll be a lot easier to have a discussion when we're on the same terminology page (which is why specs exist with definitions!)

> I admit I don't differentiate those two words. I think they are just word-play.

Unfortunately for you, the spec says otherwise. There's a reason there's 2 different phrases here, and both are clearly defined by the spec.


That's the whole point of UB though: the programmer helping the compiler to deduce things. It's too much to expect the compiler to understand your whole program well enough to know a+b doesn't overflow. The programmer might understand that it doesn't, though. The compiler relies on that understanding.

If you don't want it to rely on that, insert a check into the program and tell it what to do if the addition overflows. It's not hard.

Whining about UB is like reading Shakespeare to your dog and complaining it doesn't follow. It's not that smart. You are though. If you want it to check for an overflow or whatever there is a one liner to do it. Just insert it into your code.
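For instance, with the overflow-checking builtin that GCC and Clang provide (portable code can compare against INT_MAX/INT_MIN before adding instead):

    int checked_add(int a, int b, int *sum) {
        /* stores the wrapped result in *sum and returns nonzero
           if the mathematical sum doesn't fit in an int */
        return __builtin_add_overflow(a, b, sum);
    }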


> That's the whole point of UB though

No, the whole (entire, exclusive of that) point of undefined behaviour is to allow legitimate compilers to generate sensible and idiomatic code for whichever target architecture they're compiling for. Eg, a pointer dereference can just be `ld r1 [r0]` or `st [r0] r1`, without paying any attention to the possibility that the pointer (r0) might be null, or that there might be memory-mapped IO registers at address zero that a read or write could have catastrophic effects on.

It is not a licence to go actively searching for unrelated things that the compiler can go out of its way to break under the pretense that the standard technically doesn't explicitly prohibit a null pointer dereference from setting the pointer to a non-null (but magically still zero) value.


If you don't want the compiler to optimize that much then turn down the optimization level.


> If you don't want it to rely on it insert a check into the program and tell it what to do if the addition overflows. It's not hard.

Given that even experts routinely fail to write C code that doesn't have UB, available evidence is that it's practically impossible.


> So yes, the spec does say that compilers are allowed to assume UB doesn't happen.

They are allowed to do so, but in practice this choice is not helpful.


On the contrary, it is quite helpful–it is how C optimizers reason.


> Indeed. UB in C doesn't mean "and then the program goes off the rails", it means that the entire program execution was meaningless, and no part of the toolchain is obligated to give any guarantees whatsoever if the program is ever executed, from the very first instruction.

I don't think this is exactly accurate: a program can result in UB given some input, but not result in UB given some other input. The time travel couldn't extend to before the first input that makes UB inevitable.


They might be referring to e.g. the `_Nonnull` annotation being added to memset. The result is that this:

   if (ptr == NULL) {
      set_some_flag = true;
   } else {
      set_some_flag = false;
   }
   memset(ptr, 0, size);
Will never see `set_some_flag == true`, as the memset call guarantees that ptr is not null, otherwise it's UB, and therefore the earlier `if` statement is always false and the optimizer will remove it.

Now the bug here is changing the definition of memset to match its documentation a solid, what, 20? 30? years after it was first defined, especially when that "null isn't allowed" isn't useful behavior. After all, every memset ever implemented already totally handles null w/ size = 0 without any issue. And it was indeed rather quickly reverted as a change. But that really broke people's minds around UB propagation with modern optimizing passes.


False. If a program triggers UB, then all behaviors of the entire program run are invalid.

> However, if any such execution contains an undefined operation, this International Standard places no requirement on the implementation executing that program with that input (not even with regard to operations preceding the first undefined operation).

-- https://devblogs.microsoft.com/oldnewthing/20140627-00/?p=63...


Executing the program with that input is the key term. The program can't "take back" observable effects that happen before the input is completely read, and it can't know before reading it whether the input will be one that results in an execution with UB. This is a consequence of basic causality. (If physical time travel were possible, then perhaps your point would be valid.)


The standard does permit time-travel, however. As unlikely as it might seem, I could imagine some rare scenarios in which something seemingly similar happens -- let's say the optimiser reaches into gets() and crashes the program prior to the gets() call that overflows the stack.


Time travel only applies to an execution that is already known to contain UB. How could it know that the gets() call will necessarily overflow the stack, before it actually starts reading the line (at which point all prior observable behavior must have already occurred)?


It doesn't matter how it knows. The standard permits it to do that. The compiler authors will not accept your bug report.


If you truly believe so, then can you give an example of input-conditional UB causing unexpected observable behavior, before the input is actually read? This should be impossible, since otherwise the program would have incorrect behavior if a non-UB-producing input is given.


If it's provably input-conditional then of course it's impossible. But the C implementation does not have to obey the sequence point rules or perform observable effects in the correct order for invocations that contain UB, and it doesn't have to implement "possible" non-UB-containing invocations if you can't find them. E.g. if you write a program to search for a counterexample to something like the Collatz Conjecture, that loops trying successively higher numbers until it finds one and then exits, GCC may compile that into a program that exits immediately (since looping forever is, arguably, undefined behaviour) - there's a real example of a program that does this for Fermat's Last Theorem.
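The program in question looks roughly like this (paraphrased from memory, so treat it as a sketch):

    // Searches for a counterexample to Fermat's Last Theorem for n = 3.
    // The loop body has no side effects, so a compiler that assumes all
    // loops terminate may compile this as if it always "finds" one.
    int fermat(void) {
        const int MAX = 1000;
        int a = 1, b = 1, c = 1;
        while (1) {
            if (a*a*a == b*b*b + c*c*c)
                return 1;  // "counterexample found"
            if (++a > MAX) { a = 1; ++b; }
            if (b > MAX)   { b = 1; ++c; }
            if (c > MAX)   { c = 1; }
        }
    }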


> If it's provably input-conditional then of course it's impossible.

My entire point pertains to programs with input-conditional UB: that is, programs for which there exists an input that makes it result in UB, and there also exists an input that makes it not result in UB. Arguably, it would be more difficult for the implementation to prove that input-dependent UB is unconditional: that every possible input results in UB, or that no possible input results in UB.

> But the C implementation does not have to obey the sequence point rules or perform observable effects in the correct order for invocations that contain UB

Indeed, the standard places no requirements on the observable effects of an execution that eventually results in UB at some point in the future. But if the UB is input-conditional, then a "good" execution and a "bad" execution are indistinguishable until the point that the input is entered. Therefore, the implementation is required to correctly perform all observable effects sequenced prior to the input being entered, since otherwise it would produce incorrect behavior on the "good" input.
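A minimal sketch of that requirement (hypothetical program):

    #include <stdio.h>

    int main(void) {
        printf("before input\n");  // must be performed: "good" inputs exist
        fflush(stdout);
        int d;
        if (scanf("%d", &d) != 1)
            return 1;
        printf("%d\n", 100 / d);   // UB only for the input d == 0
        return 0;
    }

The run that reads, say, 7 is fully defined, so everything up to and including the read has to behave normally; only what happens after a 0 arrives is unconstrained.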

> E.g. if you write a program to search for a counterexample to something like the Collatz Conjecture, that loops trying successively higher numbers until it finds one and then exits, GCC may compile that into a program that exits immediately (since looping forever is, arguably, undefined behaviour) - there's a real example of a program that does this for Fermat's Last Theorem.

That only works because the loop has no observable effects, and the standard says it's UB if it doesn't halt, so the compiler can assume it does nothing but halts. As noted on https://blog.regehr.org/archives/140, if you try to print the resulting values, then the compiler is actually required to run the loop to determine the results, either at compile time or runtime. (If it correctly proves at compile time that the loop is infinite, only then can it replace the program with one that does whatever.)

It's also irrelevant, since my point is about programs with input-conditional UB, but the FLT program has unconditional UB.


How this might happen is that one branch of your program may have unconditional undefined behavior, which can be detected at the check itself. This would let a compiler elide the entire branch, even side effects that would typically run.
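Sketched out (lookup() is a hypothetical function that may return NULL):

    #include <stdio.h>

    int *lookup(int key);

    void report(int key) {
        int *p = lookup(key);
        if (p == NULL) {
            printf("missing key %d\n", key);  // side effect in the doomed branch
            *p = 0;                           // unconditional UB: null store
        }
    }

Every execution that enters the branch hits UB, so the compiler may treat p == NULL as impossible and delete the branch, printf and all.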


The compiler can elide the unconditional-UB branch and its side effects, and it can elide the check itself. But it cannot elide the input operation that produces the value which is checked, nor can it elide any side effects before that input operation, unless it can statically prove that no input values can possibly result in the non-UB branch.


That example doesn't contradict LegionMammal978's point though, if I understood correctly. He's saying that the 'time-travel' wouldn't extend to before checking the conditional.


Personally, I've found that some of the optimizations cause undefined behavior, which is so much worse. You can write perfectly good, strict C that does not cause undefined behavior, and then one pass of optimization and another together can CAUSE undefined behavior.

When I learned this (if it was and is correct), I felt that one could be betrayed by the compiler.


Optimizations themselves (except for perhaps -ffast-math) can't cause undefined behavior: the undefined behavior was already there. They can just change the program from behaving expectedly to behaving unexpectedly. The problem is that so many snippets, which have historically been obvious or even idiomatic, contain UB that has almost never resulted in unexpected behavior. Modern optimizing compilers have only been catching up to these in recent years.
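The canonical example of such a snippet (a hypothetical check; GCC and Clang fold the comparison to false at -O2 precisely because signed overflow "cannot happen"):

    int would_overflow(int a) {
        return a + 100 < a;  /* worked for decades on 2's complement
                                hardware; now optimized away to 0 */
    }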


There have been more than a few compiler bugs that have introduced UB and then that was subsequently optimized, leading to very incorrect program behavior.


A compiler bug cannot introduce UB by definition. UB is a contract between the coder and the C language standard. UB is solely determined by looking at your code, the standard, and the input data; it is independent of the compiler. If the compiler converts UB-free code into misbehavior, then that's a compiler bug / miscompilation, not an introduction of UB.


A compiler bug is a compiler bug, UB or not. You might as well just say "There have been more than a few compiler bugs, leading to very incorrect program behavior."


The whole thread is about how UB is not like other kinds of bugs. Having a compiler optimization erroneously introduce a UB operation means that downstream the program can be radically altered in ways (as discussed in the thread) that don't happen in systems without the notion of UB.

While it's technically true that any compiler bug (in any system) introduces bizarre, incorrect behavior into a program, UB supercharges the things that can go wrong due to downstream optimizations. And, incidentally, it makes things much, much harder to diagnose.


I just don't think it makes much sense to say that an optimization can "introduce a UB operation". UB is a property of C programs: if a C program executes an operation that the standard says is UB, then no requirement is imposed on the compiler for what should happen.

In contrast, optimizations operate solely on the compiler's internal representation of the program. If an optimization erroneously makes another decide that a branch is unreachable, or that a condition can be replaced with a constant true or false, then that's not "a UB operation", that's just a miscompilation.

The latter set of optimizations is just commonly associated with UB, since C programs with UB often trigger those optimizations unexpectedly.


LLVM IR has operations that have UB for some inputs. It also has poison values that act...weird. They have all the same implications of source-level UB, so I see no need to make a distinction. The compiler doesn't.


Any optimization that causes undefined behavior is bugged – please report them to your compiler's developers.


By definition an optimisation can't cause UB, as UB is a language-level construct.

An optimisation can cause a miscompilation. Those happen and are very annoying.


Miscompilations are rarer and less annoying in compilers that do not have the design behaviour of compiling certain source code inputs into bizarre nonsense that bears no particular relation to those inputs.


You realize these two statements are equivalent, right?

> compiling certain source code inputs into bizarre nonsense

> winning at compiled-binary-execution-speed benchmarks, giving fewer reasons for people to hand-write assembly code for the sake of speed (assembly code is much harder to read/write and not portable), reducing code size by eliminating unnecessary operations (especially -Os), reordering operations to fit CPU pipelines and instruction latencies and superscalar capabilities

If you don't like the complexity of modern, leading-edge optimizing compilers, you are free to build or support a basic compiler that translates C code as literally as possible. As long as such compiler conforms to the C standard, you have every right to promote this alternative. Don't shame other people building or using optimizing compilers.


> compiling certain source code inputs into bizarre nonsense

> winning at compiled-binary-execution-speed benchmarks, giving fewer reasons for people to hand-write assembly code for the sake of speed (assembly code is much harder to read/write and not portable), reducing code size by eliminating unnecessary operations (especially -Os), reordering operations to fit CPU pipelines and instruction latencies and superscalar capabilities

Mainstream C compilers actually make special exceptions for the undefined behaviour that's seen in popular benchmarks so that they can continue to "win" at them. The whole exercise is a pox on the industry; maybe at some point in the past those benchmarks told us something useful, but they're doing more harm than good when people use them to pick a language for modern line-of-business software, which is written under approximately none of the same conditions or constraints.

> Don't shame other people building or using optimizing compilers.

The people who are contributing to security vulnerabilities that leak our personal information deserve shame.


It's true that I don't like security vulnerabilities either. I think the question boils down to, whose responsibility is it to avoid UB - the programmer, compiler, or the standard?

I view the language standard as a contract, an interface definition between two camps. If a programmer obeys the contract, he has access to all compliant compilers. If a compiler writer obeys the contract, she can compile all compliant programs. When a programmer deviates from the contract, the consequences are undefined. Some compilers might cater to these cases (e.g. -fwrapv, GNU language extensions) as a superset of all standard-compliant programs.

Coming from programming in Java first, I honestly would like to see a lot of UB eliminated from C/C++, downgrading them to either unspecified behavior (weakest), implementation-defined behavior, or single behavior (best). But the correct place to petition is not compiler implementations; we have to change the language standard - the contract that both sides abide by. Otherwise we can only get as far as having a patchwork of vendor-specific language extensions.


> Coming from programming in Java first, I honestly would like to see a lot of UB eliminated from C/C++, downgrading them to either unspecified behavior (weakest), implementation-defined behavior, or single behavior (best). But the correct place to petition is not compiler implementations; we have to change the language standard - the contract that both sides abide by. Otherwise we can only get as far as having a patchwork of vendor-specific language extensions.

That feels backwards in terms of how the C standard actually gets developed - my impression is that most things that eventually get standardised start life as vendor-specific language extensions, and it's very rare for the C standard to introduce something and have the compiler vendors then follow.

And really in a lot of cases the concept of UB isn't the problem, it's the compiler culture that's grown up around it. For example, the original reason for null dereference being UB was to allow implementations to trap on null dereference, on architectures where that's cheap, without being obliged to maintain strict ordering in all code that dereferences pointers. It's hard to imagine how what the standard specifies about that case could be improved; the problem is compiler writers prioritising benchmark performance over useful diagnostic behaviour.


> If you don't like the complexity of modern, leading-edge optimizing compilers, you are free to build or support a basic compiler that translates C code as literally as possible.

Most optimizing compilers can do this already; it's just the -O0 flag.


I tried compiling "int x = 1 / 0;" in both the latest GCC and Clang with -O0 on x86-64 on Godbolt. GCC intuitively preserves the calculation and emits an idiv instruction. Clang goes ahead and does constant folding anyway, and there is no division to be seen. So the oft-repeated advice of using -O0 to compile the code as literally as possible, in hopes of diagnosing UB or making it behave sanely, is not great advice.
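That is, for a file along these lines:

    int main(void) {
        int x = 1 / 0;  // GCC -O0 emits an idiv; Clang folds it away anyway
        return x;
    }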



