Disclaimer: Going off memory of the presentations myself, not at all an expert, etc.
Ivan Godard's descriptions of the branch predictor sound quite simple, as far as I can remember: the predictor itself is completely detached from the compiler, apart from the instruction layout falling into EBBs (extended basic blocks). A point they've made about the prediction algorithm itself (I think it's on the Mill forum somewhere) is that they're running simulations with nearly the simplest known prediction algorithm and getting something like 90% success rates, compared to the 98% that state-of-the-art predictors see. The difference is entirely in what data is carried by the prediction. (I believe this is in the forum thread on switch statements, referenced in this video.) What the predictor stores and delivers as output is the logical equivalent of an instruction pointer plus where in the EBB the exit occurs, instead of a boolean (or an entire exit table). That's more state per entry than a taken bit, but nowhere near the entire state of the processor. And, as explained in this presentation, the benefit of getting (effectively) an instruction pointer from the branch predictor is that you know what to prefetch from memory. That pointer, being associated with some other EBB the predictor might know about, is also enough to ask the predictor for the next prediction, get a second prefetch, and so on.
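A minimal sketch of that chaining idea, assuming a predictor keyed by EBB entry address (all names here, ExitPrediction, predict, prefetch_code, and the toy history table, are illustrative, not Mill internals): because the predictor's output is itself an address, it can be fed straight back in as the next query, walking a chain of predicted EBBs and issuing a prefetch for each one.

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Hypothetical prediction entry: instead of a taken/not-taken bit, the
     * predictor hands back the address the EBB is expected to exit to, and
     * where in the EBB that exit sits. */
    typedef struct {
        uint64_t ebb_entry;    /* key: entry address of an EBB           */
        uint64_t exit_target;  /* predicted next EBB entry (like an IP)  */
        uint32_t exit_offset;  /* predicted exit point within the EBB    */
    } ExitPrediction;

    /* A toy "history table" standing in for the real predictor state. */
    static const ExitPrediction table[] = {
        { 0x1000, 0x2000, 24 },
        { 0x2000, 0x3000,  8 },
        { 0x3000, 0x1000, 40 },   /* a loop back to the first EBB */
    };

    static const ExitPrediction *predict(uint64_t ebb_entry)
    {
        for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
            if (table[i].ebb_entry == ebb_entry)
                return &table[i];
        return NULL;              /* predictor has no history for this EBB */
    }

    static void prefetch_code(uint64_t addr)
    {
        printf("prefetch code at %#llx\n", (unsigned long long)addr);
    }

    /* Chain predictions: each predicted exit is itself an EBB entry
     * address, so it can be looked up again, giving prefetches several
     * EBBs ahead of execution. */
    int main(void)
    {
        uint64_t ebb = 0x1000;
        for (int depth = 0; depth < 4; depth++) {
            const ExitPrediction *p = predict(ebb);
            if (!p)
                break;
            prefetch_code(p->exit_target);
            ebb = p->exit_target;  /* feed the output back in as the next key */
        }
        return 0;
    }

The only point of the sketch is the feedback loop in main: each predicted exit doubles as the key for the next lookup, which is what makes prefetching several EBBs ahead possible at all.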
The exit-based prediction also ties into a point Godard made in this presentation: some entire switch statements form only one "EBB" (though that's loose terminology itself), and thus have only one prediction associated with them. The prediction is of where the block will exit, not of whether any particular branch inside it is taken, so there is no case where the processor stalls multiple times before finding the correct control flow. It can only mispredict once before it determines the correct exit from the block (in general, at least...)
Which is, I believe, how they are expecting good results, rather than through anything peculiar about how they derive the prediction from history.
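To make the switch-statement point concrete, here is a hedged, conventional-C illustration (nothing Mill-specific in the code itself): a switch lowered to a compare-and-branch ladder gives the predictor several chances in a row to be wrong, whereas a switch lowered to a single block with one computed exit gives it exactly one thing to guess, the exit address, and so at most one miss before the real target is known.

    #include <stdio.h>

    /* As a ladder of conditional branches, each comparison below is its own
     * prediction, and an unlucky input can mispredict more than once before
     * control flow settles. */
    int ladder(int x)
    {
        if (x == 0) return 10;
        else if (x == 1) return 20;
        else if (x == 2) return 30;
        else return -1;
    }

    /* As a dense switch the compiler can emit a jump table: one block, one
     * computed exit. In the scheme described above, the predictor's job is
     * then to guess that single exit address, and it can be wrong at most
     * once before the real target is known. */
    int table_switch(int x)
    {
        switch (x) {
        case 0: return 10;
        case 1: return 20;
        case 2: return 30;
        default: return -1;
        }
    }

    int main(void)
    {
        printf("%d %d\n", ladder(2), table_switch(2));
        return 0;
    }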
I'm not sure I follow the benefits of this branch prediction scheme in the first place. I already have a branch target from the branch target buffer (assuming I'm on a conventional processor), so either the frontend can begin immediately loading code from the cache, or I stall while I load from main memory. I'm never prefetching anything (at least in terms of an initial target).
Let's assume, though, that the Mill can produce a prefetch chain that loads additional code segments beyond the initial branch target. Does this really benefit the processor? I don't understand how a static prediction of where I will branch next, beyond an initial branch target, is even useful. If I have already branched to that location before, the code will be sitting in the L1 cache waiting for me; if it isn't, well, how does the branch predictor know more about the behavior of my program than the L1 cache does?
To put it another way: if the code doesn't fit in the L1 cache, is there some predictable way to load and unload the cache based on a given branch target (one that in turn implies which branch targets will be needed next, with a total instruction footprint larger than the cache)? I don't think there is any way to predict the behavior of an entire program like this. Larger instruction segments and more branch targets imply more complex programs that are correspondingly harder to predict, so I'm always better off just relying on the instruction cache.
Assuming the Mill tries to load entire chains of instructions into the cache based on a simple static target, I would therefore be better off disabling this and relying on the instruction cache in almost all cases.
The target is not static; it is dynamically updated as part of history.
Prefetch chaining is to get code out of DRAM, and it runs DRAM-latency ahead of execution. Separately, fetch chaining is to get the code up to the L1, and it runs L2/L3-latency ahead of execution. Load chaining gets the lines from the L1 to the L0 micro-caches and the decoders, and runs decode-latency (3 cycles on the Mill) ahead of execution.
The Mill stages instructions like this because the further ahead in time an action is the more likely that the action is down a path that execution will not take. We don't prefetch to the L1 because we don't have enough confidence that we will need the code to be willing to spam a scarce resource like the L1. But avoiding a full DRAM hit is important too, so we stage the fetching. It doesn't matter at all in small codes that spend all their time in a five-line loop, but that's not the only kind of codes there are :-)
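A rough sketch of that staged chaining, with made-up lookahead distances and helper names (none of this is the actual Mill microarchitecture): three cursors follow the same predicted chain of EBBs, but at different distances ahead of execution, so the most speculative traffic only fills the L2 while the near-certain work feeds the decoders.

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative distances, in predicted EBBs ahead of execution; the real
     * hardware works in latencies (DRAM, L2/L3, and 3 decode cycles), not in
     * block counts. */
    enum {
        PREFETCH_AHEAD = 12,  /* DRAM -> L2: most speculative, cheapest to waste */
        FETCH_AHEAD    = 4,   /* L2/L3 -> L1: only moderately speculative        */
        LOAD_AHEAD     = 1    /* L1 -> L0 micro-caches/decoders: near-certain    */
    };

    /* Assumed helpers, stubbed out for the sketch. */
    static uint64_t predicted_ebb_at(uint64_t ip, int ahead)
    {
        /* stand-in for chaining 'ahead' predictions from the predictor */
        return ip + 0x40u * (uint64_t)ahead;
    }
    static void prefetch_to_l2(uint64_t a) { printf("  L2 <- %#llx\n", (unsigned long long)a); }
    static void fetch_to_l1(uint64_t a)    { printf("  L1 <- %#llx\n", (unsigned long long)a); }
    static void load_to_decode(uint64_t a) { printf("  L0 <- %#llx\n", (unsigned long long)a); }

    int main(void)
    {
        uint64_t ip = 0x1000;
        for (int cycle = 0; cycle < 3; cycle++) {
            printf("executing around %#llx\n", (unsigned long long)ip);
            /* the three stages run ahead of execution by different amounts */
            prefetch_to_l2(predicted_ebb_at(ip, PREFETCH_AHEAD));
            fetch_to_l1(predicted_ebb_at(ip, FETCH_AHEAD));
            load_to_decode(predicted_ebb_at(ip, LOAD_AHEAD));
            ip = predicted_ebb_at(ip, 1);  /* follow the predicted chain */
        }
        return 0;
    }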