"It seems like at least for this kind of large-scale, complex application, the cost of pervasive runtime bounds checking is negligible."
Right. The myth that bounds checking is expensive may have come from some terrible compilers in the early days. Berkeley Pascal was a notable example. Each bounds check was a subroutine call.
The common cases for bounds checks are:
- It's in an inner loop iterating over arrays. That's the case where the highest percentage of the time goes into the bounds check. It's also the case most likely to be optimized out (see the sketch after this list). This is the case people worry about.
- It's in code that doesn't do a lot of subscript operations. So it doesn't matter.
- It's in non-optimizable code that does a lot of subscript operations. That's unusual, but it does come up. A modern case might be Unreal Engine's Nanite meshes, which have lots of small offsets within the data stream. On the other hand, if you don't check that stuff, it's a great attack vector.
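To make the inner-loop case concrete, here's a minimal C++ sketch (illustrative only; whether the check actually disappears depends on the compiler and flags): `operator[]` emits no check at all, `.at()` checks every access, and a loop bound the optimizer can see, like `i < v.size()`, often lets it drop or hoist the check.

```cpp
#include <vector>
#include <cstddef>

long sum_unchecked(const std::vector<int>& v) {
    long s = 0;
    for (std::size_t i = 0; i < v.size(); ++i)
        s += v[i];        // no bounds check emitted at all
    return s;
}

long sum_checked(const std::vector<int>& v) {
    long s = 0;
    for (std::size_t i = 0; i < v.size(); ++i)
        s += v.at(i);     // checked access, but the visible i < v.size() bound
                          // lets many optimizers prove the check redundant
    return s;
}
```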
If even our performance-critical code moves to languages that always bounds-check, perhaps that will put pressure on ISA designers to add instructions for never-taken branches that just don't participate in any of the branch prediction logic. You'll always get a mispredict on failed bounds checks or final loop condition checks, but you'll avoid causing mispredictions elsewhere.
Yes, some architectures (including x86) have instructions that hint to the branch predictor, but I think they still end up influencing branch predictor state.
This is the sort of thing shamefully absent from RISC-V. Shameful because its designers were in a position to know better.
There are four versions of a conditional branch: to be predicted, almost always taken, almost never taken, and too random to waste a predictor on. Compilers nowadays have "likely" and "unlikely" intrinsics, but as yet offer none for the last. I think x86 now ignores its "likelihood" branch prefixes because, back when they were introduced, they were too often wrong; and now compilers don't emit them because they know they're ignored.
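For reference, a minimal sketch of the two hints that do exist, in GCC/Clang builtin form and the C++20 attribute spelling:

```cpp
#include <cstdio>

// GCC/Clang builtin: hint that the error path is almost never taken.
void check(int err) {
    if (__builtin_expect(err != 0, 0)) {
        std::fprintf(stderr, "error %d\n", err);
    }
}

// C++20 spelling of the same hint.
int clamp_nonnegative(int x) {
    if (x < 0) [[unlikely]]
        return 0;
    return x;
}
```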
The "too random to predict" would be good for sorting, where it could provide a 2x speedup. To get the effect today you need to use conditional-move instructions, which have become unfashionable. Getting your compiler to emit a cmov is tricky.
Intel added a special instruction (maybe in Skylake?) for spinning on an atomic flag update; it just halts the core until the cache line being watched gets clobbered by a message from some other cache. Compilers don't emit it, unfortunately, even when spinning on what they know is an atomic flag.
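For contrast, here's what spinning code typically looks like today, assuming x86 and a plain atomic flag. The compiler will happily emit this load loop with a PAUSE hint, but it won't reach for the monitor/wait-style instruction described above on its own.

```cpp
#include <atomic>
#include <immintrin.h>   // _mm_pause (x86)

// Typical spin today: a load loop with a PAUSE hint. The wait-on-cache-line
// instruction discussed above has to be requested explicitly via intrinsics
// or assembly; a compiler won't turn this loop into it.
void spin_until_set(const std::atomic<int>& flag) {
    while (flag.load(std::memory_order_acquire) == 0) {
        _mm_pause();     // spin-wait hint; reduces power and pipeline pressure
    }
}
```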
There are zillions of such instructions that were good ideas but were never taken up by the programmers they were meant for.
If the instruction is tagged to keep it from claiming a branch prediction slot, it does not thereby evict some other branch from its prediction slot.
A prediction slot is where statistics on past behavior of a branch at a particular address accumulate. If no past behavior is needed to predict whether the branch will be taken, the predictor needs no statistics to decide, and statistics on some other branch instruction may accumulate in that slot instead.
I'm far from an expert here, but don't most modern branch predictors actually keep a shift register of the outcomes of the last few conditional branches, hash that branch history with the instruction's address, use the result as the index into the predictor's state table, and assume hash collisions won't happen? (Hash collisions only have a material impact when two conditional branches in the same hot code path collide, which is much rarer than hash collisions in general.)
If my understanding is correct, slots aren't really "taken up"; rather, keeping a branch out of the predictor reduces the probability of hash collisions for other branches.
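A toy gshare-style model, just to make the "hash the history with the PC" idea concrete; real predictors are far more elaborate (and proprietary), and the sizes and hash here are arbitrary.

```cpp
#include <cstdint>

// Toy gshare-style predictor: index = hash(PC, recent outcomes), and
// collisions are simply tolerated.
struct GsharePredictor {
    static constexpr unsigned kBits = 12;
    uint8_t  counters[1u << kBits] = {};  // 2-bit saturating counters
    uint32_t history = 0;                 // shift register of recent branch outcomes

    uint32_t index(uint64_t pc) const {
        return (static_cast<uint32_t>(pc >> 2) ^ history) & ((1u << kBits) - 1);
    }
    bool predict(uint64_t pc) const {
        return counters[index(pc)] >= 2;  // taken if counter is in the upper half
    }
    void update(uint64_t pc, bool taken) {
        uint8_t& c = counters[index(pc)];
        if (taken  && c < 3) ++c;
        if (!taken && c > 0) --c;
        history = ((history << 1) | (taken ? 1u : 0u)) & ((1u << kBits) - 1);
    }
};
```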
I also have a question for the crowd: for indirect branch target prediction, are the BTB slots actually tagged with the source address? If you have a BTB miss, do you pre-emptively stall the pipeline? It's more power-efficient to do so, but maybe it's better to avoid tags and tag-checking, spend that transistor budget on more BTB slots, and just go ahead and mispredict if you get a collision on BTB slots.
Details of actual branch predictors are proprietary, so we must speculate.
You don't stall on branches you have no history of, but apply heuristics that don't depend on history. E.g., backward branches are usually taken, forward branches less so. I think newer chips can try both alternatives, to some degree, although this cannot be carried far as it blows up exponentially.
I was asking about indirect branches and the Branch Target Buffer. There are two main cases where you make indirect jumps or indirect function calls in ELF C/C++ executables. The first is shared-library function calls: an IP-relative call to a Procedure Linkage Table (PLT) entry, which contains an indirect jump through a Global Offset Table (GOT) entry. The other common case is C++ virtual function calls, where you load the vtable entry into a register and immediately make an indirect function call through that register. You'd like to be able to start speculatively executing that jump (library call) or virtual function call while the fetch of the GOT/vtable entry is still in progress.
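Concretely, the two call sites look like this in source (a sketch assuming ordinary dynamic linking with a PLT; the comments describe the usual codegen, not a guarantee):

```cpp
#include <cstdio>

struct Widget {
    virtual void do_my_thing() = 0;   // stand-in for the virtual call case
    virtual ~Widget() = default;
};

void two_indirect_call_sites(Widget* w) {
    // Shared-library call: typically `call printf@plt`; the PLT stub then
    // jumps indirectly through the GOT entry, and that jump is what the BTB
    // has to predict.
    std::printf("hello\n");

    // Virtual call: load the vtable pointer, load the slot, then make an
    // indirect call through a register.
    w->do_my_thing();
}
```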
Using only information resident in the architectural registers and the decoded instruction stream, there's no heuristic that can accurately tell you where printf (or this->do_my_thing()) lives in the address space. If the address for printf isn't in the BTB, do you just stall the pipeline while loading the proper GOT entry? I thought that in general they just hash the instruction address to get the BTB entry and blindly assume that entry holds the target address (and, of course, issue the GOT read and check the assumption as soon as the real GOT entry is read). The alternative would be to stall the whole pipeline until the GOT could be read anyway, so you might as well do something... unless you're really concerned about power consumption. Even in extremely power-limited situations, assuming hash collisions in the BTB are rare might be less power-hungry than checking the source address in the BTB entry on each and every indirect branch.
Storing both the source and destination addresses in the BTB entry means twice as much space used, and extra latency and power usage in checking the source address. Does anyone here have good knowledge of when it's worth it to store both the source and destination address in the branch target buffer, instead of just storing the target address and presuming hash collisions are rare?
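Here's a purely schematic model of the tradeoff I mean, tagged versus untagged lookup; nothing below claims to match any real design:

```cpp
#include <cstdint>

// Untagged: just return whatever target lives in the indexed slot; a collision
// silently yields another branch's target and costs a misprediction.
struct UntaggedBTB {
    uint64_t target[4096] = {};
    uint64_t predict(uint64_t pc) const {
        return target[(pc >> 2) & 4095];
    }
};

// Tagged: spend extra storage and a compare to detect the collision, so the
// front end can stall or fall back instead of fetching from a bogus target.
struct TaggedBTB {
    struct Entry { uint64_t tag = 0; uint64_t target = 0; };
    Entry entries[4096];
    bool predict(uint64_t pc, uint64_t& out) const {
        const Entry& e = entries[(pc >> 2) & 4095];
        if (e.tag != pc) return false;   // known miss
        out = e.target;
        return true;
    }
};
```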
On another side note: has anyone heard of an architecture that supports IP-relative indirect function calls? Such an addressing mode would allow making a function call through the GOT entry without having to load the PLT entry into the instruction cache. I'm guessing that since there's a fair overhead for shared library calls, the most performance-sensitive code already avoids dynamic library calls, so there's not much to be gained in exchange for the added complexity of the extra call addressing mode.
Interesting questions, arbitrarily hard to research.
Attention seems to be reserved for events that occur in the sorts of tight loops that affect benchmark results. Thus, the first encounter with any branch may stall indefinitely long without (market) penalty.
Game coders have learned to batch their indirect calls according to actual destination, to throw the branch predictor a bone.
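A hypothetical sketch of that trick: group objects by concrete type before dispatching, so the indirect call target stays stable for long runs. The Renderable/draw names are made up for illustration.

```cpp
#include <algorithm>
#include <typeinfo>
#include <vector>

struct Renderable {
    virtual void draw() = 0;
    virtual ~Renderable() = default;
};

// Sort objects so identical dynamic types are adjacent; the indirect call
// then hits the same destination for long stretches and predicts well.
void draw_all(std::vector<Renderable*>& objects) {
    std::sort(objects.begin(), objects.end(),
              [](const Renderable* a, const Renderable* b) {
                  return typeid(*a).before(typeid(*b));   // any stable grouping works
              });
    for (Renderable* r : objects)
        r->draw();   // same target for long runs => easy for the predictor
}
```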