Fast and slow if-statements: branch prediction in modern processors (igoro.com)
37 points by adg001 on May 25, 2010 | 24 comments



This particular bit of information is worse than useless for the vast majority of developers. For the parts-per-million development efforts where this sort of optimization is necessary (a tiny subset even of kernel developers) there will be many orders of magnitude more developers for whom this information is actively harmful. Devs will spend their time trying to second-guess the compiler's optimizations of their if statements in order to eke out microseconds of performance improvement. Meanwhile, by focusing their efforts on the wrong thing they will take attention away from quality of design and execution, as well as macro-optimizations. Their code will be lower quality and, ironically, slower.

Micro-optimization is a silly diversion the vast majority of the time. Wait to optimize at that level until you have the tooling and the measurements that indicate you truly need it (more often than not you won't).


Developers should know how the layers beneath them work. Sometimes that means they have to be exposed to information that could be "dangerous" if they use it in the wrong context. Such is the risk of education.


Sure. I'm not objecting to the information itself, I'm objecting to paying more attention to it than is warranted (the amount that is warranted being so close to 0.00 for the average developer that it's hardly worth mentioning).

The best-case scenario for this information is for the vast majority of developers to ignore it, or to read about it and then do absolutely nothing about "the problem".

Even in the rare case where you happen to be that one guy who's writing kernel, library, or game code which needs to be heavily optimized at every level, this particular issue is extremely unlikely to be relevant. And even in the super rare cases where it is relevant, the given article doesn't provide very good advice on improving performance (hint: for a lot of cases it'd be better to rewrite your loop so you don't even have a conditional inside it).
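To be concrete about that hint, here's the kind of rewrite I mean. This is just a sketch (my own example, not code from the article): summing the large elements of an array with a data-dependent branch, and then the same thing done branchlessly with a mask.

    /* Sketch only: sum the elements >= 128, first with a branch,
       then without one. The branchless version trades the
       data-dependent branch for a bit of arithmetic, so there is
       nothing for the predictor to guess. */
    #include <stddef.h>
    #include <stdint.h>

    int64_t sum_branchy(const uint8_t *a, size_t n) {
        int64_t sum = 0;
        for (size_t i = 0; i < n; i++) {
            if (a[i] >= 128)            /* unpredictable on random data */
                sum += a[i];
        }
        return sum;
    }

    int64_t sum_branchless(const uint8_t *a, size_t n) {
        int64_t sum = 0;
        for (size_t i = 0; i < n; i++) {
            int64_t mask = -(int64_t)(a[i] >= 128);  /* all ones or all zeros */
            sum += mask & a[i];
        }
        return sum;
    }

(A decent compiler may well do this transformation for you with a conditional move, which is sort of the point: let the compiler and the hardware worry about it.)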

In short, this article is silly trivia, nothing more, and roughly as useful to the practice of programming as looking at pictures of cats with funny captions, maybe less so.

P.S. The real risk here is that devs become overly focused on a "problem", lose their situational awareness and ignore more serious problems they should be concerned with. It's the sort of thing that causes pilots to crash planes while they are distracted by a warning light they can't figure out: http://en.wikipedia.org/wiki/Eastern_Air_Lines_Flight_401


Maybe you'd like the message better if it was inverted: "Contrary to popular wisdom, with modern processors there's very little penalty for conditionals where the same branch is taken each time. It's only when the branches are random that slowdowns occur, and even so it's usually not a big deal."
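For anyone who wants to see the effect for themselves, a micro-benchmark along these lines shows it clearly (a rough sketch of mine, not code from the article; names and sizes are arbitrary):

    /* Rough sketch: time a loop whose branch always goes the same
       way against one whose branch depends on random data. Use the
       sum afterwards (or lower the optimization level), otherwise
       the compiler may delete the loops entirely. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 100000000L

    int main(void) {
        unsigned char *data = malloc(N);
        for (long i = 0; i < N; i++)
            data[i] = rand() & 0xFF;

        long sum = 0;
        clock_t t0 = clock();
        for (long i = 0; i < N; i++)
            if ((i & 0x80000000) == 0) sum += data[i];  /* always true here: predictable */
        clock_t t1 = clock();
        for (long i = 0; i < N; i++)
            if (data[i] < 128) sum += data[i];          /* ~50/50 on random data: mispredicts */
        clock_t t2 = clock();

        printf("predictable: %.3fs  random: %.3fs  (sum=%ld)\n",
               (double)(t1 - t0) / CLOCKS_PER_SEC,
               (double)(t2 - t1) / CLOCKS_PER_SEC, sum);
        return 0;
    }

On Linux, perf stat -e branch-instructions,branch-misses ./a.out makes the difference in misprediction counts visible directly.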

I disagree about the "one guy" theory. While it may be true that very few Ruby programmers would benefit from this knowledge, if you are bothering to use C or C++ you might as well take full advantage of the speed it has to offer. If not, why not use a higher level language? I'd certainly want the author of my interpreter to be fully aware of these issues: http://bugs.python.org/issue4753


That you consider basic computer architecture "silly trivia" is something I don't know how to respond to.


Any piece of information can be abused.

The article simply explains what branch prediction is and why CPUs implement it. No part of the article advocates that you attempt to optimize your code to exploit branch prediction. In fact, the conclusion of the article is that in the majority of real-world cases, the branch predictor will automatically do the right thing.


I disagree. If you are programming in C or C++ and you are not aware of this effect, you are probably doing something wrong, perhaps using the wrong language. And the article isn't suggesting that you should blindly optimize for it; rather, it's giving you some background and telling you which tools you can use to measure the effect of branch prediction.

I think that developers become better rather than worse when they start to understand what happens under the hood. Are we to protect people from the dangers of superficial levels of knowledge by keeping them ignorant? Teach them, and if that's not enough, then teach them more:

"Mispredicted branches can cause a conditional itself to run about 10 times slower. In very tight loops, this can be a problem, but most of the time this effect isn't even measurable. It's dwarfed by other effects, such as accesses to main memory, which are in turn dwarfed by disk access. If you read from memory in your routine more than a couple times, or touch disk even once, it's probably not worth worrying about branch prediction errors."

Or would you suggest that kernel programmers shouldn't be aware of these effects either? :)


@InclinedPlane

Hi!

That's true; there are other development tasks that the vast majority of developers should look at with greater priority.

Still, I'd like to suggest what the main takeaway from Igor's post is, if I may.

I would put the emphasis on the role that data-dependent branches play in timings, not on how developers should try to optimize performance by keeping the pipeline stages as busy as possible.

This has practical importance, for example, when implementing cryptographic or other secret-dependent algorithms, as the timing patterns can be exploited by side-channel attacks.
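As a toy illustration of what I mean (my own sketch, not taken from the papers below): an early-exit comparison branches on secret data, so its timing and branch-miss pattern leak how many leading bytes of a guess are correct, while the branch-free version always does the same amount of work.

    /* Toy example only, not production crypto code. */
    #include <stddef.h>
    #include <stdint.h>

    int compare_leaky(const uint8_t *secret, const uint8_t *guess, size_t n) {
        for (size_t i = 0; i < n; i++)
            if (secret[i] != guess[i])     /* secret-dependent branch */
                return 0;
        return 1;
    }

    int compare_constant_time(const uint8_t *secret, const uint8_t *guess, size_t n) {
        uint8_t diff = 0;
        for (size_t i = 0; i < n; i++)
            diff |= secret[i] ^ guess[i];  /* no data-dependent branch */
        return diff == 0;
    }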

Nothing new; it has already happened:

Predicting Secret Keys via Branch Prediction, by Onur Aciicmez, Jean-Pierre Seifert, and Cetin Kaya Koc: http://eprint.iacr.org/2006/288.pdf

On the Power of Simple Branch Prediction Analysis, by Onur Aciicmez, Cetin Kaya Koc, and Jean-Pierre Seifert: http://eprint.iacr.org/2006/351.pdf

Of course, Igor's post is not about cryptanalysis, agreed. Still, it provides some useful numbers relating mispredicted branches to cycles spent.

Cheers


The Linux kernel exploits that by using two macros (likely() and unlikely()) that give hints to the compiler using GCC's __builtin_expect:

    if (likely(condition)) dosomething();
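For reference, the definitions boil down to (paraphrasing the kernel's compiler.h from memory):

    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

__builtin_expect tells GCC which outcome to treat as the common case, so the expected path can fall straight through in the generated code.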


Even without likely/unlikely, branch prediction still occurs on processors that support it; the kernel macros just optimize for the first pass. Once the branch prediction state machine has aligned itself towards the more likely branch, subsequent executions (typically loop iterations or something like that) will be predicted accurately.
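For the curious, the "state machine" in simple textbook predictors is a 2-bit saturating counter per branch, something like this toy model (real hardware is much fancier and mixes in branch history):

    /* Toy 2-bit saturating counter: 0,1 predict not-taken; 2,3 predict
       taken. After a couple of iterations in one direction it aligns
       itself and stays right until the pattern changes. */
    #include <stdbool.h>

    static unsigned counter = 1;               /* weakly not-taken */

    bool predict_taken(void) { return counter >= 2; }

    void update(bool taken) {
        if (taken && counter < 3)  counter++;
        if (!taken && counter > 0) counter--;
    }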


:) That's what I came to say

http://kerneltrap.org/node/4705

I also have some other links:

These people saw a 10% gain in their Scheme interpreter: http://people.csail.mit.edu/jaffer/CNS/interpreter-branch

And this one concerns reducing power consumption using Branch Prediction Prediction! http://www.eecg.toronto.edu/~moshovos/research/iccd02.pdf


What the article fails to mention is that branch predictors in modern CPUs are generally correct >95% of the time on real-world benchmarks (usually SPEC).

Furthermore, each CPU uses a different approach to branch prediction, so a pattern that is bad for one may not be bad for another. Spending time trying to optimize your code in this way will only optimize it for a particular processor, which won't work if you're trying to make a distributable binary (i.e., different x86 processors use different branch prediction algorithms; you would basically be optimizing specifically for the Intel i7 or specifically for the AMD Athlon 3).


Interesting article. It has been a long time since I have worked this "close to the metal." Back when I coded at the machine level, there was no branch prediction performed by the processor to help keep the pipeline full. What you had were different opcodes for your conditional jumps: one for when the jump was likely, one for when it was not. It had amazed me (a bit) that this never got reflected in higher-level languages. Now that I see that the processor makes (educated?) guesses, I suppose I am no longer surprised.


Would systems that use trace trees (e.g., TraceMonkey) have the same sort of issues?

I have a feeling that they would notice repeated calls to the same if-statement and optimize accordingly. They may still have the same sorts of issues, but would the benchmark used (looping over and testing the if-statement n times) produce completely different results in a JIT system?


I'm astonished at the magnitude of the difference correct prediction can make in a scenario where the effect of a misprediction is to either increment a sum or not. How is that possible? How many clock cycles does an accurate prediction save in this context, and how many are used in moving through the loop and evaluating the condition?


Results from LuaJIT: http://gist.github.com/413482

    D:\p\badif>..\luajit\src\luajit badif.lua
    (i & 0x80000000) == 0 time: 1.057
    (i & 0xFFFFFFFF) == 0 time: 0.696
    (i & 0x00000001) == 0 time: 3.682
    (i & 0x00000003) == 0 time: 5.339
    (i & 0x00000002) == 0 time: 3.686
    (i & 0x00000004) == 0 time: 3.683
    (i & 0x00000008) == 0 time: 3.686
    (i & 0x00000010) == 0 time: 3.856


Just out of curiosity, how applicable is this to optimizing a higher-level language like python or ruby?


It holds just as true, but I can't think of a case where it would be worth worrying about. For example, accesses to main memory are much more expensive than mispredicted branches.

Unless you're also concerned with how Python or Ruby stores its values internally, you don't need to think about the speed of conditionals. It's really only an issue if you're writing the Python or Ruby interpreter itself.


It is not applicable unless you plan on running your code on only a specific CPU (and by specific CPU I don't mean that you plan on running only on x86, I mean you only plan on running on the Intel i7).

GCC lets you specify whether a branch is likely or unlikely (via __builtin_expect) so the compiler can optimize the code it generates accordingly, but in higher-level languages you would have to do this by hand. The performance gain from this optimization will be negligible compared to what you'd get by spending that time optimizing something else.


I think you are misinterpreting this. All branch-predicting CPUs will deal better with basic, regular patterns: 100 trues in a row followed by 100 falses. It's only in how well they handle different ABAB-type patterns that they differ. Modern and more expensive processors are better at detecting more complex patterns.

The macros you refer to are an even more minor effect: they only affect what happens the first time through the loop. It's a tiny optimization of an optimization. Branch prediction is a hardware feature, and is going to happen whether you use these macros or not. It's the logic of your code that's going to make the big difference, not the macros.


It's the first time through the loop after each time the loop is evicted from the instruction cache. If the loop is short or has been unrolled heavily (causing each branch to be predicted separately) it could matter.


Is the Branch Target Buffer synonymous with the instruction cache[1]? But yes, it certainly can matter. I thought the links that 'wendroid' offered above were interesting: a Scheme interpreter, presumably already fairly well optimized, saw about a 10% improvement through the hinting macros.

But I'd argue that if this level of optimization is required, the bigger gains are to be had by changing your algorithms to be more regular: for example, using fixed-length blocks in a compression scheme rather than checking for null termination [2] (a sketch of what I mean follows after the links). But if that is for some reason impossible, then use hinting!

[1] http://www.agner.org/optimize/microarchitecture.pdf

[2] http://www2008.org/papers/pdf/p387-zhangA.pdf
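As that sketch (my own toy example, not code from the paper): checking a data-dependent sentinel on every element versus processing fixed-size blocks, where the only branch is the perfectly predictable loop counter.

    /* Sketch only. The first loop branches on the data itself; the
       second processes fixed 64-byte blocks, so its branches depend
       only on the loop counter and are trivially predicted. */
    #include <stddef.h>

    size_t scan_until_zero(const unsigned char *p) {
        size_t n = 0;
        while (p[n] != 0)                  /* data-dependent branch */
            n++;
        return n;
    }

    unsigned sum_fixed_blocks(const unsigned char *p, size_t nblocks) {
        unsigned sum = 0;
        for (size_t b = 0; b < nblocks; b++)
            for (size_t i = 0; i < 64; i++)
                sum += p[b * 64 + i];      /* branches are counter-driven */
        return sum;
    }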


Thanks, I didn't know that branch predictions are now cached separately from instruction decodings. Hopefully evictions are a rarer problem.


This is the sort of thing a good VM's JIT should handle for you.



