New things in clang land (5.0.0) (productive-cpp.com)
75 points by mrich on Dec 7, 2017 | 44 comments



Hmm, the first example kind of bugs me. I get why they are doing it, but I feel like there should be a better solution. Instead of cramming as much optimization into undefined behavior as you can, maybe don't allow the undefined behavior in the first place? I think if I had something like this accidentally in my code then I would want it to tell me that something's wrong instead of just giving bizarre results.


> maybe don't allow the undefined behavior in the first place?

So your problem is with C and C++, not Clang?

> I think if I had something like this accidentally in my code then I would want it to tell me that something's wrong instead of just giving bizarre results.

Then you misunderstand the problem.

Compilers don't seek out UB and then generate zany instruction sequences to spite the programmer. Instead, they assume there is no UB in the program, and optimise accordingly. If the code is bad and contains UB, all bets are off.
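
A minimal sketch (mine, not from the article) of what that assumption does in practice:

    int deref_then_check(int *p) {
        int v = *p;        // if p were null this line would be UB,
                           // so the optimizer assumes p != nullptr...
        if (p == nullptr)  // ...and may fold this check away entirely
            return -1;
        return v;
    }

Nothing zany was generated; the compiler just followed the no-UB assumption to its conclusion.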

When I say 'all bets are off', I really mean it. Bizarre things happen in practice, legal by virtue of UB. I've read tales of bad C code printing that x is 0, and then evaluating that (0 == x) is false. (This was due to the compiler taking advantage of an 'is zero?' flag in the underlying CPU, which didn't perfectly align with the bits that stored the value of x.)

A good blog post on UB: https://blog.regehr.org/archives/213 To paraphrase:

> One might ask: Wouldn’t it be great if the compiler provided a warning or something when it detected [the UB]? Sure! But that is not the compiler’s priority.

This approach is part of what makes C, C. Very few other modern languages share this approach (the ones that do tend to be C based), and instead set out to have no UB. Java and C#, for instance, have no concept of UB.

It's impossible, even in theoretical terms, to build a perfect UB-detection program. This follows directly from Rice's theorem. The best we can do is better-than-nothing static-analysis tools, and dynamic-analysis tools. Neither can be perfect and exhaustive.

The good news is that these tools can be much better than nothing, even if we know they're always going to be imperfect.


> Instead of cramming as much optimization into undefined behavior as you can, maybe don't allow the undefined behavior in the first place?

1. that's what -fsanitize=undefined is for (a usage sketch follows this list)

2. because the code you see plainly written here could also be the result of multiple rounds of earlier optimisations (mostly inlining), which end up leaving a bunch of redundant or nonsensical null checks
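
To illustrate point 1, a hedged example of running the article's main.cpp under UBSan (the exact diagnostic text varies by clang version, so treat this output as approximate):

    $ clang++ -std=c++14 -g -fsanitize=undefined main.cpp -o main
    $ ./main
    main.cpp:10:5: runtime error: reference binding to null pointer of type 'int'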


I think clang-tidy can also do some UB checks without having to run the program.


> Instead of cramming as much optimization into undefined behavior as you can, maybe don't allow the undefined behavior in the first place?

A compiler that doesn't allow any undefined behavior in C or C++ code will reject most of the existing C or C++ code bases out there.

Idealistically speaking, it would be a good thing. But a C compiler that wouldn't compile the Linux kernel or Chromium or Firefox wouldn't be that great in practice.

Clang and LLVM are doing good work in detecting, reporting and rejecting undefined behavior but there's a lot of work to be done before rejecting such code is a realistic option.


Silently changing behaviour isn't great either, though. Because all the examples you list rely on said undefined behaviour being predictable.


Not necessarily. The issue at hand here is that of false positives: if you move something to the type system, you want no false negatives, so you commonly get false positives instead. The developer must prove to the type system's satisfaction that <incorrect behaviour> cannot occur; a rejection doesn't mean it could actually occur, just that the type system was not able to see that it can't.

A well-known modern example is Rust's borrow checker: there are cases it will reject even though the code is correct (there are no actual issues in it), because the borrow checker can't be sure.

At a fundamental level, "UB" simply means the compiler assumes it doesn't happen and carries on from there; expecting predictable behaviour from UB is nonsensical. And incidentally, while C is rife with UB, most languages have it to some extent. E.g. sorting assumes your comparator is correct/stable, and feeding a nonsensical comparator to a sorting routine is usually UB (neither the language nor the function defines what'll happen); likewise hashmaps and hash/eq being coherent. Of course the consequences are also more significant in C, given the language is also not memory-safe (and even then… "safe" Rust is memory safe, but if you create UB in unsafe code and leak it to safe Rust, all bets are off). And yes, Rust does have UB, though not that much compared to C: https://doc.rust-lang.org/beta/reference/behavior-considered...
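
To make the comparator example concrete, a hedged C++ sketch (names made up):

    #include <algorithm>
    #include <vector>

    int main() {
        std::vector<int> v(100, 42);
        // UB: `a <= b` is not a strict weak ordering (it isn't irreflexive),
        // which violates std::sort's precondition; in practice this can
        // crash or loop forever, with no pointer misuse anywhere in sight
        std::sort(v.begin(), v.end(),
                  [](int a, int b) { return a <= b; });
    }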


Historically, UB was _platform defined behaviour_, not ignored behaviour.


That's "implementation-defined behaviour", a different category from UB. Both exist in the C standard.


But it might not be undefined behaviour at compile-time. Consider if instead of just 'f(nullptr)', the call in the main() function had been 'f(getchar() == 'x' ? nullptr : &x)' (where x is some variable in scope). It cannot be known until run-time whether a NULL pointer is being dereferenced, but since the pointer inside of 'f()' is being dereferenced, the compiler can assume that it will not be NULL.
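
A compilable version of that sketch, assuming (as the clang-tidy output further down suggests) that the article's 'f' binds its parameter to a reference:

    #include <cstdio>

    void f(int *i) {
        int &r = *i;  // UB if i is null, so from here on the
        (void)r;      // compiler may assume i != nullptr
    }

    int main() {
        int x = 0;
        f(std::getchar() == 'x' ? nullptr : &x);  // unknowable until run time
    }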


Potentially unsafe optimizations, undefined behavior and hard-to-reproduce random crashes are normally not something one wants in production builds.

It's nice that UBSan will detect the problem at runtime, but that requires one to actually use UBSan in development and to have all relevant code paths triggered in a UBSan-instrumented build. Which is unrealistic at least when 3rd party code (libraries etc.) comes into play that is not entirely under the application's control. So I would really favor optimizations that do not make a program less safe, or that are only applied when it can be proven that they do not cause any harm.


3rd party code is either in source form, which you can use with UBSan (via invocation from your own code's tests), or already in binary form, where compiler optimizations have already occurred and are now just flat-out bugs in their build.


Sorry, I was referring to the program as a whole. I get why it did the optimization for that one function in particular, but in this case, it can be proven that the function call will result in undefined behavior and (in my opinion, not necessarily the way the C standard works) it should be flagged or not compiled.


> it should be flagged or not compiled.

It will be, if you use clang-tidy (or clang's static analyzer):

    $ clang-tidy main.cpp -- -std=c++14
    1 warning generated.
    main.cpp:10:5: warning: Forming reference to null pointer [clang-analyzer-core.NonNullParamChecker]
        f_unused_parameter(*i);
        ^
    main.cpp:19:7: note: Passing null pointer value via 1st parameter 'i'
        f(nullptr);
        ^
    main.cpp:19:5: note: Calling 'f'
        f(nullptr);
        ^
    main.cpp:10:5: note: Forming reference to null pointer
        f_unused_parameter(*i);
        ^


Do not put your faith in whole program optimization. If it can be proven that the function call will result in undefined behavior, the compiler will assume that the function will never be called. If it can be proven that it will always be called, the compiler is free to assume that the program will never be run, and optimize the whole program down to nothing.

That is, undefined behavior propagates not only forwards (if you dereference a pointer, from them on you can be sure that it's not null), but also backwards (if you dereference a null pointer, that's a contradiction since null pointers can't be dereferenced, and the contradiction can only be resolved if the statement is unreachable).
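
A hedged sketch of the backwards direction (function and message are made up):

    #include <cstdio>

    int backwards(int *p) {
        if (p == nullptr) {
            std::puts("p was null");  // may never print: the line below is UB
            return *p;                // when p is null, so the compiler may
        }                             // treat the whole branch as unreachable
        return *p;                    // and delete it, puts() and all
    }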


> If it can be proven that it will always be called, the compiler is free to assume that the program will never be run, and optimize the whole program down to nothing.

Does that happen when the compiler assumes you'll never take branches that lead to UB, and then finds a path through your branches that more or less does nothing? Or can it happen even if you're not branching at all?

Also, say my program is

    int main() {
        printf("hello world\n");
        do_some_UB();
    }
Is the compiler really allowed to produce a program that doesn't print?


> Does that happen when the compiler assumes you'll never take branches that lead to UB, and then finds a path through your branches that more or less does nothing? Or can it happen even if you're not branching at all?

Yes and yes. How far it will propagate these backwards depends on the compiler.

> Is the compiler really allowed to produce a program that doesn't print?

Yes. A program which contains UB is illegal; there are no guarantees about any of its behaviour, including behaviour which precedes any possible actual invocation of UB. The standard makes this point rather clearly:

> However, if any such execution contains an undefined operation, this International Standard places no requirement on the implementation executing that program with that input (not even with regard to operations preceding the first undefined operation).


Good heavens! :)


Yes, the compiler can remove the printf.


You can't "not allow undefined behavior in the first place" without a massive performance hit (like checking the validity of every pointer at run time). In this case I agree they should be able to detect it, given that the whole program is available and the deduction that the pointer is null is easy to reach via a fixed-point iteration. But it just cannot be done efficiently in general in a language that inherently allows unverifiable code to begin with, and the kind of performance hit you'd get would make the language slower than a verifiable language like C# (at which point you should be using that kind of language instead, since that's what they're for).
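
To make the cost concrete, a rough sketch (mine) of what a no-UB dialect would have to do at, conceptually, every pointer access:

    #include <stdexcept>

    // a check like this before every dereference, unless the compiler
    // can prove the pointer valid -- that's the performance hit
    int checked_load(const int *p) {
        if (p == nullptr)
            throw std::runtime_error("null dereference");
        return *p;
    }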


I'm not saying add any runtime checking, but if undefined behavior can be proven at compile time, then the compiler should flag it instead of doing something weird.


That's what clang's static analyzer is for. The analysis is pretty difficult and would slow down compilation unnecessarily.


No, that is what -Wall is for. Nobody runs the static analyzer, there are more people using ubsan fwiw.


The static analyzer is extremely useful. Why aren’t people running it?


Because reasons.

See Herb Sutter's talk at CppCon 2015. At a given moment he asks the audience if they use any kind of static analysis tools.

About 1% of the audience said they were using one.

Which is why anything that requires external tooling, instead of being part of the language, is easily ignored.


Would it help if the static analyzer were folded into clang itself and enabled with a flag (-Wanalyze?) rather than using a separate tool? That way it could just be a slight tweaking of build flags, and left enabled if it performs well on a given code base, rather than requiring extra work and thought.


Well, it helped when Xcode came out with integrated clang static analysis, vs. the largely ignored lint.


The main reason people avoid static analysis is excessive false positives.


> No, that is what -Wall is for. Nobody runs the static analyzer, there are more people using ubsan fwiw.

How do you know what other people are using? I personally use the clang static analyzer all the time. Also, it seems like you think these options are mutually exclusive; they are not.


Because I know a lot of software.

Everybody uses -Wall, some -Wextra and even -pedantic, a few asan or ubsan; nobody uses the clang-specific analyzer.


> nobody uses the clang-specific analyzer.

That's a pretty bold claim. I use it for all my C projects, FreeBSD uses it[1], Chromium uses it[2]. It seems you're basing this whole claim on personal anecdotes, which are by definition unreliable.

[1]: https://github.com/freebsd/freebsd-ci/tree/master/scan-build

[2]: https://chromium.googlesource.com/chromium/src/+/lkcr/docs/c...


Which other software probes for a static analyzer? Is there an autoconf or cmake probe? FreeBSD and Chromium are islands, unfortunately.


> The analysis is pretty difficult and would slow down compilation unnecessarily.

Are you imagining Clang actually did detect it and merely didn't tell you? That seems possible, but it seems more likely that it didn't.


So to take an example: signed integer arithmetic can't overflow (or it is UB). Would you like a compiler that flags every single arithmetic operation on int where it can't prove it won't overflow? Then we couldn't compile `int foo(int a, int b) { return a + b; }`, right?
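
For contrast, opting in to overflow checking where you actually want it is already cheap. A hedged sketch using __builtin_add_overflow (a GCC/Clang extension, not standard C++):

    int safe_add(int a, int b) {
        int out;
        // __builtin_add_overflow returns true if the mathematical result
        // does not fit in `out` (GCC >= 5 and recent Clang support it)
        if (__builtin_add_overflow(a, b, &out))
            out = 0;  // saturate, trap, log: the programmer decides
        return out;
    }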


Well... you can. But not in C++. Most modern languages don't have undefined behaviour.


As a C++ developer, stuff like that is what keeps me up at night. Undefined-behaviour optimizations (optimizations undertaken because the compiler proves that a variable cannot have a specific value, since if it did, that would be undefined behaviour) should probably only be applied within the scope of a single expression. The reasons: first, I shouldn't have to fear that editing code at one end of a function changes the semantics of unrelated code somewhere else; second, code is always written with a specific physical architecture in mind, and being unable to reason about what it will actually do is very problematic.

Said undefined behaviour is also only undefined within the scope of the C++ language; the machine defines the behaviour perfectly well. I can, for instance, install a segmentation fault handler and catch places where I dereference a null pointer. In fact, the null page could even be mapped to some hardware device, making null pointers valid. Right now I don't have a way to communicate that fact to the compiler. Worse, we don't have tools that can tell us when code changes semantics from what a trivial -O0 compilation would produce.
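
A hedged POSIX-only sketch of the signal-handler point (handler name and message are made up; the volatile keeps the optimizer from exploiting the UB and eliding the load):

    #include <csignal>
    #include <unistd.h>

    extern "C" void on_segv(int) {
        // only async-signal-safe calls belong in a handler; write() is one
        const char msg[] = "caught SIGSEGV (null dereference trapped)\n";
        write(2, msg, sizeof msg - 1);
        _exit(1);
    }

    int main() {
        std::signal(SIGSEGV, on_segv);
        volatile int *p = nullptr;
        return *p;  // UB in C++, but a well-defined trap on this machine
    }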


GCC has -fdelete-null-pointer-checks (and its -fno- variant) to choose null pointer dereferencing semantics.


This one is very simple: it's a bug. It should realize that you're not actually dereferencing i, because the result is immediately bound to a reference. Sans any bugs, this optimization should actually work just fine.


I shiver when I read the kind of replies you get... the ones that defend idiotic compiler behaviour justified by "optimizations".


To be fair, the point of having a set of "defined behavior" was in part so that compilers could optimize. I still believe this case is a compiler bug, but it is one that can only be invoked with undefined behavior, in which compilers are allowed to do absolutely anything. See also: DeathStation 9000



https://youtu.be/yG1OZ69H_-o?t=401

Thank you very much. Maybe the defenders of idiotic compiler behaviour can go now and hide under a rock.

As for the slide at https://youtu.be/yG1OZ69H_-o?t=508: correct and incorrect programs are defined in terms of the language. With an underspecified language like C you get what you deserve: -O2 being a different language than -O0.

I've seen enough of this video, where incorrect programs are being equated with undefined language behavior. I'll stop watching here.


Actually, AddressSanitizer has been ported to GCC, so it is no longer a clang-only thing. Last time I checked it did not support stack switching (cooperative scheduling). Has anything changed in this area?



