Unexpected ways memory subsystem interacts with branch prediction

jazzyjackson · on Dec 29, 2023

https://web.archive.org/web/20231228221536/https://johnnyssw...

t8sr · on Dec 29, 2023

The irony of a blog about software performance going down after being on HN for 2 hours is too much. (I know, it’s just Wordpress, I’m just being a grump.)

The archive version seems to be missing the branchless versions of the algorithms - is that missing from the article itself as well? It’d be interesting to put the different versions into compiler explorer.

codebrrr · on Dec 29, 2023

I'm pretty sure the following optimization is invalid, since a is not a bool array, but one of positive/negative numbers (see top of article):

> You can use arithmetics to go branchless.

    if (a[i] > 0) {
       cnt++;
    }

> Rewriting using arithmetic takes advantage of the fact that the expression a[i] > 0 has an arithmetic value 1 if true and 0 if false. So the whole expression can be rewritten as:

    cnt += a[i]

Kwantuum · on Dec 29, 2023

Surely this was meant to be cnt += a[i] > 0

amelius · on Dec 29, 2023

But a[i] > 0 can still compile to a branch, depending on the compiler.

dkersten · on Dec 29, 2023

Really? In what case? It’s just an expression and the entire statement just uses the expression result.

Of course when optimising code like this, I think it’s important to look at the generated assembly anyway and then you can be sure that it does what you expect on the compilers you intend on supporting (doubly so when you want to generate conditional moves — I’ve found that to be a bit of a puzzle where I sometimes need to move things around as the obvious code still generated a branch), but at least GCC and Clang won’t generate a branch for using just a comparison. Maybe it’s not the compiler and instead the target architecture? In x86, comparisons set flags, so by themselves aren’t branches. In any case, I recommend using compiler explorer when working on code where this matters.

flohofwoe · on Dec 29, 2023

Setting the CPU flag after a comparison doesn't do anything useful yet, you also need to perform an addition with 1 or do nothing depending on the flag, and selecting between these two options is usually done with a conditional branch, unless the CPU can execute ALU instructions conditionally (ARM can do this, x86 only has conditional mov AFAIK).

gpderetta · on Dec 29, 2023

You can copy a flag in the al register then add that. That's usually what the compiler generates.

flohofwoe · on Dec 30, 2023

Also, apparently ARM64 did drop the conditional instructions (somehow I missed the memo):

https://offlinemark.com/2023/07/29/til-arm64-doesnt-include-...

avianlyric · on Dec 29, 2023

Can it? I understand it’s always possible to decompose that to

if (a[i] > 0) return 1 else return 0 end

But why would a compiler do that?

Comparison operations are basic primitives that usually store their result into a register. With branching comparison operators potentially also available, but if there’s a branching comparator operation in your ISA, then there’s almost certainly also a pure comparator operation in your ISA, because you can always compose a branching comparator from simple comparison operation followed by a basic equality comparison of the result.

So I guess my question is, while it’s technically possible for a compiler to compile this into a branching operations, under what circumstances would a compiler actually choose to do that, given there’s isn’t a clear benefit?

tylerhou · on Dec 29, 2023

In the branchless version, the CPU has to wait for the comparison to resolve before it can start executing the add for the next loop iteration. However, if the branch is predictable, the CPU can assume the result of the conditional and does not need to wait to add one or not. I wrote a more in depth comment about why this is true a few months ago: https://news.ycombinator.com/item?id=37245594

If I alter the code slightly to do `result += (a[i] == 0) * 2`, gcc emits a branch if the comparison is predictable: https://godbolt.org/z/df3fsoYK8

Here is a benchmark: https://quick-bench.com/q/NSGHu_wfhrMXp0-pZQp9qybCIok. Note how the branchless version takes the same time for the random and the zeroes vector, while the branch version is faster when the branch is predictable but slower when the branch is not predictable.

cesarb · on Dec 29, 2023

> Comparison operations are basic primitives that usually store their result into a register.

In one of the most common processor architectures (the x86 family), comparison operations do store their result into a register, but it's the flags register, which can't be used directly in arithmetic operations. So you have to follow a comparison operation with either a conditional branch or a conditional move (and earlier processors in the x86 family didn't have conditional moves).

> So I guess my question is, while it’s technically possible for a compiler to compile this into a branching operations, under what circumstances would a compiler actually choose to do that, given there’s isn’t a clear benefit?

It depends on the compiler heuristics and on the surrounding code; for instance, it might decide that "compare; conditional branch over next instruction; increment" is better than "copy to second register; increment second register; compare; conditional move from second register", because it uses one less register (the x86 family is register starved) and one less instruction (relevant when optimizing for size).

xoranth · on Dec 29, 2023

> So you have to follow a comparison operation with either a conditional branch or a conditional move (and earlier processors in the x86 family didn't have conditional moves).

The x86 family has the `setCC` instructions [^1] that move bits from the flag register to a general purpose one. Example from godbolt, see `setg`:

https://c.godbolt.org/z/MY37oP9vz

[^1]: https://www.felixcloutier.com/x86/setcc

gpderetta · on Dec 29, 2023

SETcc is indeed what GCC typically uses. You can also play tricks with the carry flag and ADC but I don't think I have ever seen GCC do it.

xoranth · on Dec 30, 2023

The latest version can [^1], though anecdotally I've seen clang/LLVM being smarter about it.

[^1]: https://c.godbolt.org/z/vP8edfen7

inetknght · on Dec 29, 2023

cnt += a[i]

No, you've changed the right-hand side expression. Now you're adding a[i] to cnt and nowhere has it stated that a[i] is 1. But a[i] > 0's result is always zero or one.

I suggest what you want is this:

    cnt += !!a[i]

Now, a[i] is not-notted -- the expression returns a bool indicating whether the value is converted to true.

or this, depending on your style:

    cnt += bool(a[i])

which does the same thing

082349872349872 · on Dec 29, 2023

Contrary to title, TFA got the results they should've already expected

sylware · on Dec 29, 2023

I am currently coding x86_64 and I am trying to favor as much as reasonably possible "non-predicted branch"/"branchless" code paths. Setcc and cmovcc instructions are really usefull, even though if I can think of real "branchless" algorithms, I will favor them.

x86_64 assembly for me is just the transition step before the actual RISC-V jump. This is "register-ization" of some code paths. Once done, it is kind of easy to do a port to another modern ISA. Bu then, I am thinking about all that branch prediction on RISC-V:

Will we have a way to hint a core/hart to dodge prediction for some branches, fine-grained and "cleanly"? I was thinking about implicit branch prediction exclusion via some "known" instruction fusions, but the 'implicit' here is scary.

celeritascelery · on Dec 29, 2023

> Will we have a way to hint a core/hart to dodge prediction for some branches, fine-grained and "cleanly"?

Why would you want to avoid the branch predictor? All it can do is speedup your code.

sylware · on Dec 30, 2023

If I did understand properly, in code paths with many branches near from each other, it would help to hint the predictor about which ones are not worth it.

nayuki · on Dec 30, 2023

I explored the idea of making code branchless by replacing if/else with bit masking arithmetic. I did this in the context of preventing timing attacks in cryptographic primitive algorithms. https://stackoverflow.com/questions/27865974/is-masking-effe...

ceeam · on Dec 29, 2023

BTW, there's no way, documented or not, to switch off branch predictor on a modern CPU, right?

NobodyNada · on Dec 29, 2023

AArch64 has the SB ("speculation barrier") instruction, which you could insert after both sides of a branch instruction if you really wanted to: https://developer.arm.com/documentation/ddi0596/2021-12/Base...

However, it's not specified to completely disable speculation, only "to the extent that such speculation can be observed through side-channels as a result of control flow speculation or data value speculation". So e.g. an implementation could decide to disable speculative memory access while still allowing arithmetic operations.

Preventing Spectre vulnerabilities is pretty much the only reason you'd want to disable branch prediction though (except for curiosity of course). Without prediction, performance would be as bad as if every branch was mispredicted, since the pipeline has to stop and wait at every single branch. The idea of branch prediction is that the "wasted" cycles waiting for the branch to resolve can instead be used to do some computation that may or may not be useful; if it turned out to be useful you saved some time, if not you didn't lose anything (besides maybe some electricity). So even a branch predictor that randomly guesses with 50% accuracy is a huge performance win over not speculating at all.

mgaunard · on Dec 29, 2023

Simply replace your processor by a 8086.

synergy20 · on Dec 29, 2023

you can also use likely and unlikely to help the branch prediction.

menaerus · on Dec 29, 2023

It's a common myth that I hear a lot - it has nothing to do with the branch-prediction per se. It's only a hint to the compiler itself to emit a more instruction-cache and CPU-decode friendly code layout. E.g. codegen for unlikely branch(es) in the binary itself will be placed farther away from the "likely" execution path.

NobodyNada · on Dec 29, 2023

There's a grain of truth to it -- when a CPU encounters a branch it has not seen before, some CPUs (but not most modern ones: https://stackoverflow.com/a/51848422) have static prediction that assumes a backwards branch is taken and a forwards branch is not. And compilers tend to put "unlikely" execution paths past the bottom of the function so that they're not interrupting the "hot path" of the function. So if the processor uses static branch prediction, or if it hasn't executed the function before and doesn't know there's a branch there until after it's been fetched, the branch predictor is more likely to follow the "hot path" than a forwards jump that skips ahead. But yeah, static branch prediction isn't common anymore, and "likely"/"unlikely" intrinsics have more to do with tuning the optimizer than the branch predictor.