
> the compiler is allowed to assume the loop terminates precisely because signed overflow is undefined.

Just to be sure I understand the fine details of this -- what would the impact be if the compiler assumed (correctly) that the loop might not terminate? What optimization would that prevent?




If the compiler knows that the loop will terminate in 'x' iterations, it can do things like hoist some arithmetic out of the loop. The simplest example would be if the code inside the loop contained a line like 'counter++'. Instead of executing 'x' ADD instructions, the binary can just do one 'counter += x' add at the end.
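
A minimal sketch of that transformation (hypothetical function, not from the original article) — the compiler effectively rewrites the first form into the second:

    /* As written: one increment per iteration. */
    int count_up(int n) {
        int counter = 0;
        for (int i = 0; i < n; i++) {
            counter++;
        }
        return counter;
    }

    /* As optimized: the trip count is computed directly, no loop at all. */
    int count_up_opt(int n) {
        return n > 0 ? n : 0;
    }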


What I’m driving at is, if the loop really doesn’t terminate, it would still be safe to do that optimization because the incorrectly-optimized code would never be executed.

I guess that doesn’t necessarily help in the “+=2” case, where you probably want the optimizer to do a “result += x/2”.

In general, I’d greatly prefer to work with a compiler that detected the potential infinite loop and flagged it as an error.


> …what would the impact be if the compiler assumed (correctly) that the loop might not terminate?

Loaded question: the compiler is absolutely correct here, from two viewpoints. First, from the perspective of the C standard, the compiler implements the standard correctly. Second, if we have a real human look at this code and interpret the programmer’s “intent”, it is most reasonable to assume that overflow does not happen (or is not intentional).

The only case which fails is where N = INT_MAX. No other case invokes undefined behavior.

Here is an example you can compile for yourself to see the different optimizations which occur:

    typedef int length;
    int sum_diff(int *arr, length n) {
        int sum = 0;
        for (length i = 0; i < n; i++) {
            sum += arr[2*i+1] - arr[2*i];
        }
        return sum;
    }
At -O2, GCC 9.2 (the compiler I happened to use for testing) will use pointer arithmetic, compiling it as something like the following:

    int sum_diff(int *arr, length n) {
        int sum = 0;
        int *ptr = arr;
        int *end = arr + n;
        while (ptr < end) {
            sum += ptr[1] - ptr[0];
            ptr += 2;
        }
        return sum;
    }
At -O3, GCC 9.2 will emit SSE instructions. You can see this yourself with Godbolt.

Now, try replacing "int" with "unsigned". Neither of these optimizations happens anymore. You get neither autovectorization nor pointer arithmetic, just the original loop compiled in the dumbest way possible.
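
For reference, the slow variant being described is the same function with only the typedef changed (a sketch; the index type is the only difference):

    typedef unsigned length;   /* was: typedef int length; */
    int sum_diff(int *arr, length n) {
        int sum = 0;
        for (length i = 0; i < n; i++) {
            /* 2*i can now legally wrap around, so the compiler has to
               allow for a non-linear access pattern. */
            sum += arr[2*i+1] - arr[2*i];
        }
        return sum;
    }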

I wouldn’t read into the exact example here too closely. It is true that you can often figure out a way to get the optimizations back and still use unsigned types. However, it is a bit easier if you work with signed types in the first place.

Speaking as someone who does some numerics work in C, there is something of a “black art” to getting good numerics performance. One easy trick is to switch to Fortran. No joke! Fortran is actually really good at this stuff. If you are going to stick with C, you want to figure out how to communicate to the compiler some facts about your program that are obvious to you, but not obvious to the compiler. This requires a combination of understanding the compiler builtins (like __builtin_assume_aligned, or __builtin_unreachable), knowledge of aliasing (like use of the "restrict" keyword), and knowledge of undefined behavior.
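
For example, here is a sketch of the kind of hints being described (the function, the 32-byte alignment, and the multiple-of-8 length are made-up assumptions for illustration):

    #include <stddef.h>

    /* restrict says the arrays don't alias; __builtin_assume_aligned says
       both pointers are 32-byte aligned; __builtin_unreachable promises
       that n is a multiple of 8. Together these give the compiler enough
       information to vectorize the loop with aligned loads and stores. */
    void scale(float *restrict dst, const float *restrict src, size_t n) {
        float *d = __builtin_assume_aligned(dst, 32);
        const float *s = __builtin_assume_aligned(src, 32);
        if (n % 8 != 0)
            __builtin_unreachable();
        for (size_t i = 0; i < n; i++)
            d[i] = 2.0f * s[i];
    }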

If you need good performance out of some tight inner loop, the easiest way to get there is to communicate to the compiler the “obvious” facts about the state of your program and check to see if the compiler did the right thing. If the compiler did the right thing, then you’re done, and you don’t need to use vector intrinsics, rewrite your code in a less readable way, or switch to assembly.

(Sometimes the compiler can’t do the right thing, so go ahead and use intrinsics or write assembly. But the compiler is pretty good and you can get it to do the right thing most of the time.)


Thanks for the code, this is exactly the kind of concrete example I was looking for!

You're correct about how it behaves with "int" and "unsigned", very interesting. But it occurs to me that on x64 we'd probably want to use 64-bit values. If I change your typedef to either "long" or "unsigned long" that seems to give me the SSE version of the code! (in x86-64 gcc 9.3) Why should longs behave so differently from ints?

I very much agree that getting good numerics performance out of the optimizer seems to be a black art. But does the design of C really help here, or are there ways it could help more? Does changing types from signed to unsigned, or int to long, really convey your intentions as clearly as possible?

I remain skeptical that undefined behaviour is a good "hook" for compilers to use to judge programmer intention, in order to balance the risks and rewards of optimizations. (Admittedly I'm not in HPC where this stuff is presumably of utmost importance!) It all seems dangerously fragile.

> If you need good performance out of some tight inner loop, the easiest way to get there is to communicate to the compiler the “obvious” facts about the state of your program and check to see if the compiler did the right thing. If the compiler did the right thing, then you’re done, and you don’t need to use vector intrinsics, rewrite your code in a less readable way, or switch to assembly.

I strongly agree with the first part of this -- communicating your intent to the compiler is key.

It's the second part that seems really risky. Just because your compiler did the right thing this time doesn't mean it will continue to do so in future, or on a different architecture, and of course who knows what a different compiler might do? And if you end up with the "wrong thing", that may not just mean slow code, but incorrect code.


> But does the design of C really help here, or are there ways it could help more?

I’m sure there are ways that it could help more. But you have to find an improvement that is also feasible as an incremental change to the language. Given the colossal inertia of the C standard, and the zillions of lines of existing C code that must continue to run, what can you do?

What I don’t want to see are tiny, incremental changes that make one small corner of your code base slightly safer, and most people don’t want to see performance regressions across their code base. Between those two constraints, there isn’t a lot of room for innovation.

> It all seems dangerously fragile.

If performance is critical, you run benchmarks on CI to detect regressions.

> It's the second part that seems really risky.

It is safer than the alternatives, unless you write it in a different language. The “fast” code here is idiomatic, simple C the way you would write it in CS101, with maybe a couple builtins added. The alternative is intrinsics, which poses additional difficulty. Intrinsics are less portable and less safe. Less safe because their semantics are often unusual or surprising, and also less safe because code written with intrinsics is hard to read and understand (so if it has errors, they are hard to find). If you are not using intrinsics or the autovectorizer, then sorry, you are not getting vector C code today.

This is also not, strictly speaking, just an HPC concern. Ordinary phones, laptops, and workstations have processors with SIMD for good reason: it makes a real difference in usability for ordinary people doing ordinary tasks on their devices.

So if we can get SIMD code by writing simple, idiomatic, and “obviously correct” C code, then let’s take advantage of that.


I can certainly understand the value in allowing compilers to perform integer arithmetic using larger types than specified, at their leisure, or behave as though they do. Such allowance permits `x+y > y` to be replaced with `x > 0`, or `x*30/15` to be replaced with `x*2`, etc., and also allows for many sorts of useful loop induction.
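
A sketch of the two rewrites (hypothetical functions; both are only valid if the intermediate results are assumed not to overflow, or are evaluated at a wider width):

    /* With overflow assumed away, the compiler may simplify these: */
    int cmp(int x, int y) { return x + y > y; }   /* -> return x > 0; */
    int mul(int x)        { return x * 30 / 15; } /* -> return x * 2; */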

Some additional value would be gained by allowing stores to automatic objects whose address isn't taken to maintain such extra range at their convenience, without any requirement to avoid having such extra range randomly appear and disappear. Provided that a program coerces values into range when necessary, such semantics would often be sufficient to meet application requirements without having to prevent overflow.

What additional benefits are achieved by granting compilers unlimited freedom beyond that? I don't see any such benefits that would be worth anything near the extra cost imposed on programmers.


What should be relevant is not programmer "intent", but rather whether the behavior would likely match that of an implementation which gives the described behavior of such actions priority over the parts of the Standard that would characterize them as "Undefined Behavior".


You shouldn't even need compiler builtins; just perform undefined behavior on a branch:

  /* Assuming 'alignment' is a power-of-two byte count: if ptr is misaligned,
     this branch dereferences NULL, which is undefined behavior, so the
     compiler is entitled to assume ptr is suitably aligned. */
  if ((uintptr_t)ptr & (alignment - 1)) {
      char *p = NULL;
      printf("%c\n", *p);
  }



