The use of previous branch history and branch address as a "context" for prediction reminds me of the very similar technique used for prediction in arithmetic compression, as used in e.g. JBIG2, JPEG2000, etc. --- the goal being that, if an event X happens several times in context C, then whenever context C occurs again, X is more likely to happen.
Also, since modern CPUs internally have many functional units to which operations can be dispatched, I wonder if, in the case that the "confidence" of a branch prediction is not high, "splitting" the execution stream and executing both branches in parallel until the result is known (or one of the branches encounters another branch...), would yield much benefit over predicting and then having to re-execute the other path if the prediction is wrong. I.e. does it take longer to flush the pipeline and restart on the other path at full rate, or to run both paths in parallel at effectively 1/2 the rate until the prediction is known?
This wouldn't really buy you much, because after N branches you'd be pursuing 2^N possible execution paths, each of which requires its own resources throughout the CPU, to fetch, decode, rename, schedule, execute, and retire the instructions. Going any deeper than a few branches would be impractical, and you'd be spending most of your resources on computation that doesn't affect the final result. It also doesn't work for indirect branches, unless your predictor can produce a ranked list of possibilities, and you only execute the top few.
So, if you hit too many low-confidence branches in a short time, you would lose some performance. But that would happen anyway; you can't optimize low-confidence stuff.
The GP's question is still a good one. Is executing both branches in parallel better than going superscalar down the slightly more favored one? Is the low-confidence situation common enough that it's worth adding the extra circuitry to the CPU?
Yes, but functional units are only around ~6% of the power consumption of a modern superscalar processor. And with Moore's law squeaking out its last few iterations, that number will get even smaller. If you can remove a good part of that by using predication, then you actually win out on power and heat as well.
Pie-in-the-sky idea here, but only irreversible computations produce heat. Maybe in the distant future we can make chips that do many parallel computations of all branches reversibly, and only make the results irreversible once the correct branch is known?
In theory, irreversible computations have to produce heat, while reversible ones do not. In practice this is mostly irrelevant, because the heat involved is so minuscule that there will always be other, significantly larger sources of inefficiency. It's also somewhat questionable whether one could actually construct a physical realization of a useful logic primitive that is truly reversible.
People have tried this but, as others have pointed out, it doesn't get you very far.
Along the same lines is runahead execution, which is not directly related to branch prediction but follows the same idea you had: if you have all these functional units, you might as well try to figure out what you should start prefetching by speculatively executing the most likely sequence of instructions - even if you are waiting on data for those instructions!
> Also, since modern CPUs internally have many functional units to which operations can be dispatched, I wonder if, in the case that the "confidence" of a branch prediction is not high, "splitting" the execution stream and executing both branches in parallel until the result is known...
What killed the idea is that branch predictors are simply too good. You'd end up almost never speculating off the main spine the predictor gives you, and so all the other machinery needed doesn't provide nearly enough value for its cost.
You're thinking of predication, also sometimes called "if-conversion" or, more loosely, speculation.
It's very useful for getting rid of small control flow that doesn't really change the path of execution much, but doing it for too long is impossible due to the exponential complexity.
Maybe it's just an arbitrarily long time in the past: since there were no computers then, there is no history related to branch prediction, so it's a joke that makes the history sound longer-spanning than it actually is.
the BC made no sense to me and there's no reference to it.
If he'd made an illustration about hunters tracking their prey, where there was a fork in the road and they split into two groups, that could allude to branch prediction. I don't know.
One surprising thing that I discovered recently is that after Haswell, Intel processors got much, much better at predicting "interpreter loops", which are basically a while-true loop around a very large, seemingly unpredictable switch statement. It led to a dramatic improvement in micro-benchmarks and made some traditional optimizations involving computed goto and "indirect threading" obsolete.
It doesn't look like a series of comparisons from the CPU point of view. Normally switch statements are compiled like a series of "if" statements, but the interpreter loop style switch gets compiled into a table of jump targets that is indexed by bytecode. Same kind of indirect branch prediction features that were previously designed to help C++ "virtual" functions help here - a branch target buffer, etc.
The VM interpreter loop is mainly a bottleneck in languages that have rather low-level VM instructions and data types. In high-level VMs, the dispatch on operand type is the main bottleneck. That, too, benefits from indirect branch prediction.
It depends on whether the switched-on values are sparse. Contiguous ranges of bytecodes are more efficiently compiled to straight jump tables (and enable indirect branch prediction mechanisms in CPUs to work).
> PA 8000 (1996): actually implemented as a 3-bit shift register with majority vote
This actually seems interestingly different from the two-bit saturating counter. Like, it's not just a different way of implementing it; you can't realize the saturating counter as a "quotient" of the shift/vote scheme.
this kind of transformation is truly bread & butter in hardware; we regularly convert between binary counts, mask/number-of-set-bits, and one-hot representations for optimisation purposes.
That's the obvious attempt at an equivalence one would come up with on being told that they're the same, but, as I stated, it doesn't work. As an example: In the 2-bit saturating counter, if you start at 00, if you see a 1 and then a 0 you're back to where you started. Whereas in the bit-shift register, if you start at 000 and see a 1 and then a 0, you're now at 010, which would correspond to 01 rather than 00.
I seem to be missing something. When the two-bit scheme is introduced, it's said to be the same as the one-bit scheme except for storing two bits (seems logical), but then the index into the lookup table seems to involve both the branch address (already the case in the one-bit scheme) and the branch history (which, as far as I can see, is never introduced).
Is there any system out there that supports branch 'annotations', of a sort, so that the programmer or the compiler can just tell the CPU what the branch behavior is going to be?
Like -- it seems kinda silly for the CPU to do so much work to figure out if a loop is going to be repeated frequently, when the code could just explicitly say "fyi, this branch is going to be taken 99 times out of 100".
Or, if there's a loop that is always taken 3 times and then passed once, that could be expressed explicitly, with a "predict this branch if i%4 != 0" annotation.
Yes and no. I say yes because C and C++ both have likely() and unlikely() functions (well, technically it's part of a compiler extension rather than the language), which you wrap around the condition inside your if() statement like so:
for (int i = 10000; i < 100000; i++) {
if(unlikely(isPrime(i))) {
// do something special
}
else {
// do something boring
}
}
I say no because most modern compilers simply ignore the functions. Compilers have become so sophisticated over the years that developers trying to help them along or optimize often make things worse, whether that's by getting in the compiler's way or just writing code that's harder to read and debug.
I say no (or at least probably not) to your second question as well. No language I know of implements anything as sophisticated as a specification for regular intervals of branch switching. Many modern compilers have sophisticated branch prediction routines which can detect simple regular intervals like you describe.
To develop such a specification would optimize the branch prediction by a tiny margin which would be absolutely dwarfed by the overhead of learning the syntax for the specification, not messing it up, debugging it if you do mess it up, communicating the decision to other team members, and all of the other real-world stuff that gets in the way.
Computers are faster than ever and branch prediction algorithms are smarter than ever. Yes, you could help it along in theory but the portion of applications which really require you to do so is dwindling all the time.
Yes, a few ISAs have things like "branch.likely", but the CPU will just ignore it because it will do a better job anyways (the compiler is only as good as the test program used to guide its analysis).
GCC allows the programmer to guide "unlikely" branches too. I assume that means GCC moves that code away so it doesn't pollute the I$ and static predictors can predict them correctly.
There are some (usually small, embedded) processors that have "loop count" instructions. They are typically stateful and hard to make work in high-performance pipelines.
Disagree, or at least if there's a width that's too wide then I haven't found it yet. Let the user set the width they want in their browser, don't force your own max width on them.
Typography people generally agree that lines that are too long are harder to read and recommend fairly short lines, less than 15 words. They have a number of studies backing them up. Maybe you're an outlier.
For a long time I used to just disable CSS in Firefox (View, Page Style, No Style), which gives a readable page most of the time. I knew about Reader View, but because it only allowed low-contrast gray text, and because I already had a workable solution, I dismissed it as worthless.
I recently noticed that I was using No Style so frequently that it would be worth checking for a better solution. I found it's possible to fix the low contrast text in Reader View with custom userContent.css:
> Some modern CPUs have completely different branch predictors; AMD Zen (2017) and AMD Bulldozer (2011) chips appear to use perceptron based branch predictors. Perceptrons are single-layer neural nets[0].
It’s actually a fairly old technique, which has been applied in other processors before! I recall an ARM processor used the technique, as well as AMD’s Piledriver.
Top quality article. Now we need one with specifics of how to write code that's aware of this - for instance, when to use which compiler hints. Anyone have links or books?
There's a couple of things you can realistically do:
(1) Mark branches as likely/unlikely to be taken. eg. debug code might be marked as unlikely. The Linux kernel does this, and it's explained here (for GCC): https://stackoverflow.com/questions/109710/likely-unlikely-m... [There is some question about whether this is really worthwhile, but I trust the kernel developers ...]
There are also two negative things to avoid doing:
(3) Some things like threaded code (used by FORTH interpreters) and bytecode interpreters reuse the same piece of code followed by a branch, where the branch jumps to the next high level instruction. This code structure defeats branch prediction.
(4) Don't do strange stuff to the return stack, because modern processors keep track of the return stack and predict branch targets (ie. of RET instructions). A CALL with an unmatched RET or vice versa can defeat branch prediction.
I would say that doing anything else is too difficult, or means that you have to become a compiler writer, but I'm interested to know if there are any other practical techniques.
The compiler is (probably) smarter than you. Generally speaking, it will automatically decide which branches are most likely and arrange them accordingly (especially for things like for loops and while loops, where the biggest gains are). You'd likely gain more performance from algorithmic changes, and then from a number of other processor optimizations (like vectorization and prefetch hints), before this. Also, CPU manufacturers ignore the classic hints and don't really publish information on how to tune for their branch predictors. So while GCC/Clang do have a special intrinsic (`__builtin_expect`) for this in C/C++ - most other languages are too high-level for it to matter - it probably won't do much, and it's an insanely early optimization to consider making.
I noticed recently that there are conditional select vector instructions, so e.g. you can implement max(x,y) with an instruction instead of doing a CMP and JMP. When I tried it it was substantially faster than the CMP/JMP approach, like 20 times faster (on an Intel Skylake i7) even though the vector had only 4 values (4 * 64 bit double precision floats) - hence I was expecting 4x speedup at most.
I figured that, as far as the branch predictor is concerned, this is not a branch, and therefore branch mispredictions are totally avoided.
Unless you are targeting pre-SSE2 CPUs, no compiler should generate CMP+JMP for a max operation on floating point. `maxsd` is in SSE2, along with all the double-precision operations. Before that, double precision still ran on x87. (SSE only has single-precision operations.)
With AVX (Sandy Bridge and later), `vmaxpd` on `ymm` registers lets you operate on 4 double-precision values at once. If you observe a 20x speedup, it probably also comes from memory access optimization.
I'm playing with the new(ish) C#/dotnet vector/SIMD support e.g. Vector<double>.Max() generates a vmaxpd instruction, whereas Math.Max() generates a method call.
That method is implemented like so:
if (val1 > val2)
{
return val1;
}
if (double.IsNaN(val1))
{
return val1;
}
return val2;
This appears to not have an optimized implementation in the CLR (the dotnet VM) and the presence of 'if' statements is likely the reason why this was not inlined. So I think the 20x speedup is genuine, it's because the baseline version is horribly slow!
Whether or not it's an early optimization depends on when you add the __builtin_expect, no? Perhaps you already have identified a very hot branch in your code that the compiler fails to treat correctly even though you provided it with profile data.
Except that anything newer than 1995 probably ignores the hint provided by __builtin_expect, and you'd probably have more luck changing the algorithm or vectorizing the code.
Is this correct? "Without branch prediction, we then expect the “average” instruction to take branch_pct * 1 + non_branch_pct * 20 = 0.8 * 1 + 0.2 * 20 = 0.8 + 4 = 4.8 cycles"
other than branch_pct and non_branch_pct being reversed, this seems to be assuming that 100% of branches are guessed incorrectly. Shouldn't something like 50% be used, to assume a random guess? i.e. 0.8 * 1 + 0.2 * (0.5 * 20 + 0.5 * 1) = 2.9
This is talking about the case where there is no branch prediction. So every branch incurs the penalty of a bad prediction because it never tries to do the favorable action, instead it stalls and waits on the result to do the branch.
Yes. TAGE still does better than perceptron, but combining the two is likely to give you the best performance currently. Of course, all this is dependent on area allocated to the predictor and the workloads being run.