The use of previous branch history and branch address as a "context" for prediction reminds me of the very similar technique used for prediction in arithmetic compression, as used in e.g. JBIG2, JPEG2000, etc. --- the goal being that, if an event X happens several times in context C, then whenever context C occurs again, X is more likely to happen.
Also, since modern CPUs internally have many functional units to which operations can be dispatched, I wonder if, in the case that the "confidence" of a branch prediction is not high, "splitting" the execution stream and executing both branches in parallel until the result is known (or one of the branches encounters another branch...), would yield much benefit over predicting and then having to re-execute the other path if the prediction is wrong. I.e. does it take longer to flush the pipeline and restart on the other path at full rate, or to run both paths in parallel at effectively 1/2 the rate until the prediction is known?
This wouldn't really buy you much, because after N branches you'd be pursuing 2^N possible execution paths, each of which requires its own resources throughout the CPU, to fetch, decode, rename, schedule, execute, and retire the instructions. Going any deeper than a few branches would be impractical, and you'd be spending most of your resources on computation that doesn't affect the final result. It also doesn't work for indirect branches, unless your predictor can produce a ranked list of possibilities, and you only execute the top few.
So, if you hit too many low-confidence branches in a short time, you would lose some performance. But that would happen anyway; you can't optimize low-confidence stuff.
The GP's question is still a good one. Is executing both branches in parallel better than going superscalar down the slightly more favored one? Is the low-confidence situation common enough that it's worth adding the extra circuitry to the CPU?
Yes, but functional units are only around ~6% of the power consumption of a modern superscalar processor. And with Moore's law squeaking out its last few iterations, that number will get even smaller. If you can remove a good part of that by using predication, then you actually win out on power and heat as well.
Pie-in-the-sky idea here, but only irreversible computations produce heat. Maybe in the distant future we can make chips that do many parallel computations of all branches reversibly, and only make the results irreversible once the correct branch is known?
In theory, irreversible computations have to produce heat, while reversible ones do not. In practice this is mostly irrelevant, because the heat involved is so minuscule that there will always be other, significantly larger sources of inefficiency. It's also somewhat questionable whether one could actually construct a physical realization of a useful logic primitive that is truly reversible.
People have tried this but, as others have pointed out, it doesn't get you very far.
Along the same lines is runahead execution, which is not directly related to branch prediction but follows the same idea you had: if you have all these functional units, you might as well try to figure out what you should start prefetching by speculatively executing the most likely sequence of instructions - even if you are waiting on data for those instructions!
> Also, since modern CPUs internally have many functional units to which operations can be dispatched, I wonder if, in the case that the "confidence" of a branch prediction is not high, "splitting" the execution stream and executing both branches in parallel until the result is known...
What killed the idea is that branch predictors are simply too good. You'd end up almost never speculating off the main spine the predictor gives you, and so all the other machinery needed doesn't provide nearly enough value for its cost.
You're thinking of predication, also sometimes called "if-conversion" or, more loosely, speculation.
It's very useful for getting rid of small control flow that doesn't really change the path of execution much, but doing it for too long is impossible due to the exponential complexity.
Maybe it's just an arbitrarily long time in the past: since there were no computers then, there is no history related to branch prediction, so it's a joke that makes the history sound longer-spanning than it actually is.
the BC made no sense to me and there's no reference to it.
If he'd made an illustration about hunters tracking their prey, where there was a fork in the road and they split into two groups, that could allude to branch prediction. I don't know.
One surprising thing that I discovered recently is that after Haswell, Intel processors got much, much better at predicting "interpreter loops", which are basically a while-true loop around a very large, seemingly unpredictable switch statement. It led to a dramatic improvement in micro-benchmarks and made some traditional optimizations involving computed goto and "indirect threading" obsolete.
It doesn't look like a series of comparisons from the CPU point of view. Normally switch statements are compiled like a series of "if" statements, but the interpreter loop style switch gets compiled into a table of jump targets that is indexed by bytecode. Same kind of indirect branch prediction features that were previously designed to help C++ "virtual" functions help here - a branch target buffer, etc.
The VM interpreter loop is mainly a bottleneck in languages that have rather low-level VM instructions and data types. In high-level VMs, the dispatch on operand type is the main bottleneck. That, too, benefits from indirect branch prediction.
It depends on whether the switched-on values are sparse. Contiguous ranges of bytecodes are more efficiently compiled to straight jump tables (and enable indirect branch prediction mechanisms in CPUs to work).
> PA 8000 (1996): actually implemented as a 3-bit shift register with majority vote
This actually seems interestingly different from the two-bit saturating counter. Like, it's not just a different way of implementing it; you can't realize the saturating counter as a "quotient" of the shift/vote scheme.
this kind of transformation is truly bread & butter in hardware; we regularly convert between binary counts, mask/number-of-set-bits, and one-hot representations for optimisation purposes.
That's the obvious attempt at an equivalence one would come up with on being told that they're the same, but, as I stated, it doesn't work. As an example: In the 2-bit saturating counter, if you start at 00, if you see a 1 and then a 0 you're back to where you started. Whereas in the bit-shift register, if you start at 000 and see a 1 and then a 0, you're now at 010, which would correspond to 01 rather than 00.
I seem to be missing something. When the two-bit scheme is introduced, it's said to be the same as the one-bit scheme except for storing two bits (seems logical), but then the index into the lookup table seems to involve both the branch address (already the case in the one-bit scheme) and the branch history (which, as far as I can see, is never introduced).
Is there any system out there that supports branch 'annotations', of a sort, so that the programmer or the compiler can just tell the CPU what the branch behavior is going to be?
Like -- it seems kinda silly for the CPU to do so much work to figure out if a loop is going to be repeated frequently, when the code could just explicitly say "fyi, this branch is going to be taken 99 times out of 100".
Or, if there's a loop that is always taken 3 times and then passed once, that could be expressed explicitly, with a "predict this branch if i%4 != 0" annotation.
Yes and no. I say yes because C and C++ both have likely() and unlikely() functions (well, technically it's part of a compiler extension rather than the language), which you wrap around the condition inside your if() statement like so:
for (int i = 10000; i < 100000; i++) {
if(unlikely(isPrime(i))) {
// do something special
}
else {
// do something boring
}
}
I say no because most modern compilers simply ignore the functions. Compilers have become so sophisticated over the years that developers trying to help them along or optimize often make things worse, whether that's by getting in the compiler's way or just writing code that's harder to read and debug.
I say no (or at least probably not) to your second question as well. No language I know of implements anything as sophisticated as a specification for regular intervals of branch switching. Many modern compilers have sophisticated branch prediction routines which can detect simple regular intervals like you describe.
To develop such a specification would optimize the branch prediction by a tiny margin which would be absolutely dwarfed by the overhead of learning the syntax for the specification, not messing it up, debugging it if you do mess it up, communicating the decision to other team members, and all of the other real-world stuff that gets in the way.
Computers are faster than ever and branch prediction algorithms are smarter than ever. Yes, you could help it along in theory but the portion of applications which really require you to do so is dwindling all the time.
Yes, a few ISAs have things like "branch.likely", but the CPU will just ignore it because it will do a better job anyways (the compiler is only as good as the test program used to guide its analysis).
GCC allows the programmer to guide "unlikely" branches too. I assume that means GCC moves that code away so it doesn't pollute the I$ and static predictors can predict them correctly.
There are some (usually small, embedded) processors that have "loop count" instructions. They are typically stateful and hard to make work in high-performance pipelines.
Disagree, or at least if there's a width that's too wide then I haven't found it yet. Let the user set the width they want in their browser, don't force your own max width on them.
Typography people generally agree that lines that are too long are harder to read and recommend fairly short lines, less than 15 words. They have a number of studies backing them up. Maybe you're an outlier.
For a long time I used to just disable CSS in Firefox (View, Page Style, No Style), which gives a readable page most of the time. I knew about Reader View, but because it only allowed low-contrast gray text, and because I already had a workable solution, I dismissed it as worthless.
I recently noticed that I was using No Style so frequently that it would be worth checking for a better solution. I found it's possible to fix the low contrast text in Reader View with custom userContent.css:
> Some modern CPUs have completely different branch predictors; AMD Zen (2017) and AMD Bulldozer (2011) chips appear to use perceptron based branch predictors. Perceptrons are single-layer neural nets[0].
It’s actually a fairly old technique, which has been applied in other processors before! I recall an ARM processor used the technique, as well as AMD’s Piledriver.
Top quality article. Now we need one with specifics of how to write code that's aware of this - for instance, when to use which compiler hints. Anyone have links or books?
There's a couple of things you can realistically do:
(1) Mark branches as likely/unlikely to be taken. eg. debug code might be marked as unlikely. The Linux kernel does this, and it's explained here (for GCC): https://stackoverflow.com/questions/109710/likely-unlikely-m... [There is some question about whether this is really worthwhile, but I trust the kernel developers ...]
There are also two negative things to avoid doing:
(3) Some things like threaded code (used by FORTH interpreters) and bytecode interpreters reuse the same piece of code followed by a branch, where the branch jumps to the next high level instruction. This code structure defeats branch prediction.
(4) Don't do strange stuff to the return stack, because modern processors keep track of the return stack and predict branch targets (ie. of RET instructions). A CALL with an unmatched RET or vice versa can defeat branch prediction.
I would say that doing anything else is too difficult, or means that you have to become a compiler writer, but I'm interested to know if there are any other practical techniques.
The compiler is (probably) smarter than you. Generally speaking, it will automatically decide which branches are most likely and arrange them accordingly (especially for things like for loops and while loops, where the biggest gains are). You'd likely gain more performance from algorithmic changes, and then from a number of other processor optimizations (like vectorization and prefetch hints), before this. Also, CPU manufacturers ignore the classic hints and don't really publish information on how to tune for their branch predictors. So while GCC/Clang do have a special intrinsic (`__builtin_expect`) for this in C/C++ - most other languages are too high-level for it to matter - it probably won't do much, and it's an insanely early optimization to consider making.
I noticed recently that there are conditional select vector instructions, so e.g. you can implement max(x,y) with an instruction instead of doing a CMP and JMP. When I tried it it was substantially faster than the CMP/JMP approach, like 20 times faster (on an Intel Skylake i7) even though the vector had only 4 values (4 * 64 bit double precision floats) - hence I was expecting 4x speedup at most.
I figured that, as far as the branch predictor is concerned, this is not a branch, and therefore branch mispredictions are totally avoided.
Unless you are targeting pre-SSE2 CPUs, no compiler should generate CMP+JMP for a max operation on floating point. `maxsd` is in SSE2, along with all the double-precision operations. Before that, double precision still ran on x87. (SSE only has single-precision operations.)
With AVX (Sandy Bridge and later), `vmaxpd` on `ymm` registers lets you operate on 4 double-precision values at once. If you observe a 20x speedup, it probably also comes from memory access optimization.
I'm playing with the new(ish) C#/dotnet vector/SIMD support e.g. Vector<double>.Max() generates a vmaxpd instruction, whereas Math.Max() generates a method call.
That method is implemented like so:
if (val1 > val2)
{
return val1;
}
if (double.IsNaN(val1))
{
return val1;
}
return val2;
This appears to not have an optimized implementation in the CLR (the dotnet VM) and the presence of 'if' statements is likely the reason why this was not inlined. So I think the 20x speedup is genuine, it's because the baseline version is horribly slow!
Whether or not it's an early optimization depends on when you add the __builtin_expect, no? Perhaps you already have identified a very hot branch in your code that the compiler fails to treat correctly even though you provided it with profile data.
Except that anything newer than 1995 probably ignores the hint provided by __builtin_expect, and you'd probably have more luck changing the algorithm or vectorizing the code.
Is this correct? "Without branch prediction, we then expect the “average” instruction to take branch_pct * 1 + non_branch_pct * 20 = 0.8 * 1 + 0.2 * 20 = 0.8 + 4 = 4.8 cycles"
other than branch_pct and non_branch_pct being reversed, this seems to be assuming that 100% of branches are guessed incorrectly. Shouldn't something like 50% be used, to assume a random guess? i.e. 0.8 * 1 + 0.2 * (0.5 * 20 + 0.5 * 1) = 2.9
This is talking about the case where there is no branch prediction. So every branch incurs the penalty of a bad prediction because it never tries to do the favorable action, instead it stalls and waits on the result to do the branch.
Yes. TAGE still does better than perceptron, but combining the two is likely to give you the best performance currently. Of course, all this is dependent on area allocated to the predictor and the workloads being run.