Tracing JITs and modern CPUs part 3: A bad case (github.com/lukego)
51 points by nkurz on Aug 9, 2015 | 22 comments



This is necessary because there is a branch (if) in our code and traces are not allowed to have internal branches.

I (mostly) understand the performance of the assembly, but don't know much (if anything) about the limitations of tracing JITs. Predictable branches such as the one in this example are approximately free on modern CPUs, so it's not clear to me why the JIT couldn't just treat the branching assembly as part of the root trace. Is there a fundamental reason why a trace is not allowed to have internal branches, or is this just an implementation detail of the current trace selection algorithm?


A branch-free section of straight-line code is pretty much the definition of a trace. You can use something other than traces in a "tracing JIT" of course,[1] but your complexity goes up.

One nice property of straight-line traces is that it makes optimization extremely simple. Any instruction dominates all instructions that occur later in the trace. In the example with the "if" statement inside the loop, that is not true. Operations that happen in either branch do not dominate anything that comes after control flow merges back together after the if statement. So you have to do more sophisticated analysis to merge data flows at that point. It's doable--the point of SSA representation is to make that easier--but it's a lot slower and more complex than what you can do with straight line code.
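
A tiny sketch of that property (hypothetical C#-style straight-line code, not any real JIT's IR):

    // In a straight-line trace, each statement dominates everything below it,
    // so facts a forward pass proves about t1 hold for every later use:
    var t1 = a + b;
    var t2 = t1 * 2;
    // If an 'if' sat between these two lines, facts established inside one arm
    // would not hold after the merge point, forcing phi-node-style reasoning.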

[1] E.g. trees of traces: https://github.com/oleganza/iovm2/blob/master/doc/papers/Inc.... Trace trees avoid the above problem by basically doing aggressive tail duplication.


Thanks for the answer and the link. I think my confusion is/was that an "if" statement in the interpreted language might even be implemented as a conditional move in assembly. Thus it seems like it would be useful to distinguish "if (condition) a = a + 1" from "if (condition) frobnicate(a)" when defining a trace, even if both are written with an "if" in the interpreted language.
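
A minimal sketch of that distinction (C#-like; Frobnicate is a hypothetical side-effecting call):

    // Side-effect-free: can lower to a conditional move, so no real branch is needed.
    a = condition ? a + 1 : a;
    // Side-effecting: the call has to sit behind an actual branch.
    if (condition) Frobnicate(a);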


The big advantage in optimizing a trace at a time is that you don't have inbound control flow. In fact, the first transform that a tracing JIT does to the code (trace extraction) can be seen as removing every inbound CF edge except the loop head (leaving them in the deoptimized version of the trace).

One reason for that is that it makes optimization of the trace a lot easier, since every forward dataflow problem (e.g. constant propagation or type inference) becomes more precise:

  if a is a string && b is a string
     x = concat(a,b)
  else if a is an int && b is an int
     x = add_int(a,b)
  else ...
if you have knowledge (thanks to runtime instrumentation) that most of the time the second branch is taken, you can transform it to:

  bail if a is a string && b is a string
  bail if !(a is an int && b is an int)
  x = add_int(a,b)
  ...
(where bail goes to the full code at the right place)

This version is a lot easier to analyze because you don't have to worry about what might be true in alternate branches: for example, since we can assume we didn't bail out, at the end of this snippet x is always an integer.

There is no reason to my knowledge why they couldn't support unbiased branches by including them in a trace, but it would complicate the implementation of various (forward) optimizing passes which don't have to worry about control flow in a single-trace model.


By definition, a trace doesn't have internal branches.

The solution is to use Hyperblock Scheduling. This is an extra pass that merges multiple traces, e.g. the described root trace and its side trace. The result is a single trace with a predicated IR. This is amenable to most linear optimizations, with only minor limitations.

A predicated IR is the ideal representation to apply branch-free optimizations, using bit operations or SIMD tricks. If there are any predicates left in the IR, the compiler backend will either turn it into predicated machine code (on CPUs which support that to some extent, e.g. ARM32) or generate machine code with internal branches.
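
As a minimal illustration of that branch-free idea (plain C#, not LuaJIT's actual IR; assume c holds 0 or 1), a two-way select can be predicated with a mask:

    // Branchy form: if (c != 0) x = a; else x = b;
    int mask = -c;                     // all ones when c == 1, all zeros when c == 0
    int x = (a & mask) | (b & ~mask);  // selects a or b with no conditional jump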


I'm working on a live programming environment that is quite reactive, requiring a lot of bookkeeping to know when recomputations occur. Since code can be executed multiple times over time, I perform a trace-based optimization that caches bookkeeping work in a fast path as long as the objects used and branches taken don't change. So the fast path has to check that these invariants haven't changed, and its branching behavior is necessarily always the same as the slow path that preceded it (otherwise, another slow-path recomputation is triggered).

And branching behavior for the most part doesn't change: your condition is often a continuous function where small changes in input cause small changes in output, meaning the condition flips between true and false only very rarely. Nowhere continuous conditions like the one pointed out in the example (if i % 2 == 0) are very rare in real code! Modern CPUs also depend on coherent branching behavior, and conditions like this throw it off. I bet most of the 15X slowdown that the author is experiencing is not from the extra conditions, but from thrashing in the BPU (branch prediction unit)!


Modern CPUs also depend on coherent branching behavior, and conditions like this throw it off. I bet most of the 15X slowdown that the author is experiencing is not from the extra conditions, but from thrashing in the BPU (branch prediction unit)!

You might be underestimating the efficiency of branch prediction on modern CPUs. It's getting close to the point where if you (the human) can easily predict the pattern based on past history, the CPU will perfectly predict it as well. Manufacturers are usually a little cagey about the exact specifications, so one is often limited to empirical testing, but a simple pattern like the one in the example is going to be predicted perfectly after the first couple iterations of training. Here's what one authority who's done lots of testing has to say about Haswell, the CPU the post is concerned with:

  The Haswell is able to predict very long repetitive jump patterns with
  few or no mispredictions. I found no specific limit to the length of
  jump patterns that could be predicted. Loops are successfully predicted
  up to a count of 32 or a little more. Nested loops and branches inside
  loops are predicted reasonably well.
http://www.agner.org/optimize/microarchitecture.pdf


Looks like you guys are right, though I do have to execute the loop multiple times to find the same performance result. Weird.


This is wrong. The notion of predictability in modern branch predictors is much more sophisticated than simply "strongly taken"/"strongly not taken". i % 2 == 0 is periodic with period 2. If Agner is to be believed, we've been able to perfectly predict repetitive patterns with periods up to 5 since the Pentium II. A recent chip is considerably smarter than that.


Interesting. I do know that some BPUs can disable branch prediction if it fails a few times (say, given i % 2 == 0, it would disable branch prediction on the third or fourth iteration). I wasn't aware that they would actually go and find continuity in non-continuous branch behavior based on modulo alone, which isn't that common.

So let's run an experiment, shall we? Given Count = x.Length = 10000000 in C# (.NET, but the GC will never run):

Code:

        for (var i = 0; i < Count; i += 1)
          if (i % 2 == 0)
            x[i] = 0;
          else x[i] = 1;
Time: 55-80 milliseconds

Now the same functionality in two non-branching loops:

        for (var i = 0; i < Count; i += 2)
          x[i] = 0;
        for (var i = 1; i < Count; i += 2)
          x[i] = 1;
About 25-28 milliseconds. Now, to make it interesting, we write zero for the first half and one for the second half:

        for (var i = 0; i < Count; i += 1)
          if (i < Count / 2)
            x[i] = 0;
          else x[i] = 1;
40-42 milliseconds. Definitely, a coherent branch is much faster than an incoherent one, while no branching wins the day as always. (ok, not true, see edit).

Edit: so I was probably running into some jitter problems. Running the tests 20 times gives me 41 for the first, 21 for the second, and 39 for the last. So the first and last are comparable in time. Sorry for the sloppy benchmarking!

Edit 2: so if I take the Count/2 out of the loop, my numbers for the third test become fast again: 42 for the first, 19 for the second, 26 for the third (average time in milliseconds for doing the loops 10 times). The comparison is still hardly fair, since I'm not lifting the modulo out of the first loop. If I use a boolean that is continuously negated instead of the modulo, the result for the first is 29, which is much closer to the third.


Running the C equivalent of your first example under cachegrind with --branch-sim=yes, it shows 10,000,000 conditional branches at the if (i % 2 == 0) line, with 3 mispredicts.


Cachegrind simulates caching on a generic CPU, and thus does not tell you how the code will actually run on any particular modern CPU. In this case, the results agree, but I'm of the strong opinion that cachegrind's time has passed. Unless you want a 'theoretical' rather than a real answer, you are almost always better off today using the CPU's built-in performance counters. That said, I do appreciate that you are offering real numbers rather than guesswork like I am.
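
For example, on Linux, something like 'perf stat -e branches,branch-misses ./a.out' prints both counts straight from the hardware counters with negligible overhead.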


I think that basic branch-sims are still useful as a kind of lowest common denominator - if the simple-minded branch sim can predict it OK, it'll almost certainly go OK on a wide variety of real CPUs. Do you disagree?

Benchmarking on a particular CPU does give you more precise results, but at the risk of them being less generally applicable.


Sure, I think it's fair to say that if cachegrind finds the branching to be predictable, any modern processor will do fine on it. But if cachegrind predicts poor performance, I wouldn't suggest changing your code unless you've discovered that the actual performance is poor on a real processor. Overall, I think cachegrind's branch prediction is nowadays less accurate than the rule of thumb that "if there is a pattern, the CPU will find it". Since using performance counters is so simple and accurate (at least on x86/x64), I'm not sure there is a benefit to using the slower, less accurate, older method of emulation.

Your response encouraged me to check whether the branch predictor in cachegrind had been updated since I last looked at it. It doesn't look like it has. It's still pretty simple, about 20 lines of actual code: https://github.com/fredericgermain/valgrind/blob/master/cach...

I greatly appreciated the honesty in one of the comments: "TODO: use predictor written by someone who understands this stuff."

It is a valid question whether the consistency from processor to processor is any better than cachegrind or a rule of thumb. My experience was that for Nehalem/Sandy Bridge/Haswell things were more same than different (and only got better), but I don't know about other lines.


Here is a more exhaustive test that throws in a random boolean array (bb) for a total-mispredict scenario (always read, sometimes used).

First, oscillation (using a negated boolean in place of the modulo, per edit 2 above):

      var b = true;
      for (var i = 0; i < x.Length; i += 1)
      {
        var c = bb[i];
        if (b)
          x[i] = 0;
        else
          x[i] = 1;
        b = !b;
      }
Time: 36 milliseconds (averaged over 10 runs)

Case 2, no branch:

    for (var i = 0; i < x.Length; i += 2)
    {
      var c = bb[i];
      x[i] = 0;
    }
    for (var i = 1; i < x.Length; i += 2)
    {
      var c = bb[i];
      x[i] = 1;
    }
Time: 22 milliseconds

Case 3, coherent branching:

    var kk = x.Length / 2;
    for (var i = 0; i < x.Length; i += 1)
    {
      var c = bb[i];
      if (i < kk)
        x[i] = 0;
      else x[i] = 1;
    }
Time: 31 milliseconds

Case 4, totally random branching:

    for (var i = 0; i < x.Length; i += 1)
        if (bb[i])
          x[i] = 0;
        else x[i] = 1;
Time: 75 milliseconds (the slowness could come from bad branch prediction, or the fact that a load needs to complete before the condition can be verified!)
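
For what it's worth, case 4 can also be written branch-free (a sketch; whether the .NET JIT actually emits a conditional move here isn't guaranteed), trading the misprediction penalty for a plain data dependency on the load:

    for (var i = 0; i < x.Length; i += 1)
      x[i] = bb[i] ? 0 : 1;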


My point is just to reiterate that the 0,1,0,1,0,1,... pattern of i % 2 == 0 is actually exactly the kind of pattern that branch predictors excel at, which you seemed to be skeptical of.


Right, I was wrong. I'm wondering what the use case is for that kind of silicon... interpreters?


A long time ago, I did some experiments with LuaJIT and branching (well, more with the CPU). It was based on this article: http://igoro.com/archive/fast-and-slow-if-statements-branch-...

https://raw.githubusercontent.com/malkia/ufo/master/samples/...

  local ffi = require("ffi")
  local band = bit.band

  local function test(n, m)
     local count = 0
     for i = 1, n do
        if band(i, m) == 0 then count = count + 1 end
     end
     return count
  end

  local function timeit(m)
     local t = os.clock()
     test(0xFFFFFFF,m)
     local t = os.clock() - t
     print(string.format("%08X",m),t)
  end

  timeit(0x80000000)
  timeit(0xffffffff)
  timeit(1)
  timeit(3)
  timeit(2)
  timeit(4)
  timeit(8)
  timeit(16)

  --[[
  -- This is on OSX 10.7.2 MBP 2008 Jan build
    ./luajit samples/badif.lua
  80000000	14.067395
  FFFFFFFF	20.252955
  00000001	13.497108
  00000003	17.337942
  00000002	14.142266
  00000004	14.221376
  00000008	14.42377
  00000010	14.708237
  --]]
Results now, in 2015 (again a MacBook with OS X, much better than before, and also LuaJIT 2.0.4):

  80000000	0.247822
  FFFFFFFF	1.253168
  00000001	0.664586
  00000003	0.974336
  00000002	0.676823
  00000004	0.679609
  00000008	0.675207
  00000010	0.673961
Without the JIT, just the interpreter (pretty much the same results across the masks):

  $ src/luajit -joff ~/badif.lua 

  80000000	3.329909
  FFFFFFFF	3.553143
  00000001	3.440378
  00000003	3.484646
  00000002	3.424834
  00000004	3.424539
  00000008	3.616981
  00000010	3.598598


Perhaps this example is just meant to be illustrative, but if this is a real case, it seems easy to remove the branch:

    local counter = require("core.counter")

    local n = 1e9
    local c = counter.open("test")
    for i = 1,n do
       -- Add 9 for odd i, 0 for even.
       counter.add(c, 1 + (bit.band(i, 1) * 9))
    end


Yup. The link at the end goes to http://wiki.luajit.org/Numerical-Computing-Performance-Guide which mentions much the same: "Use bit.*, e.g. for conditional index computations."

It's certainly a real problem that can affect real code; this "issue" just falls a wee bit short of demonstrating how to fix it.


Arguably it's an optimization that the compiler could perform instead, and a method JIT most likely would. However, a tracing JIT's optimizations are confined to a linear trace, which I guess is exactly the limitation the author wanted to demonstrate.


I had read somewhere that LuaJIT specifically advises avoiding branches in loops.



