Is there really anything wrong with conflating a hash table and an associative cache in this context? An associative cache uses a portion of the address as the key, and usually has only limited capability for handling collisions. From a software perspective this is pretty similar to how a hash table works, even if the circuit you visualize when hearing "hash table" is different.
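To make the analogy concrete, here's a minimal sketch of a set-associative lookup written the way you'd write a fixed-bucket hash table. The sizes, shifts, and field widths are made up for illustration, not taken from any real design:

    /* Illustration only: a 4-way set-associative lookup with made-up sizes.
     * The index bits of the address play the role of a hash, and the tag
     * compare plays the role of key equality -- which is why, from software,
     * it looks a lot like a fixed-bucket hash table. */
    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_SETS 256   /* hypothetical */
    #define WAYS     4     /* hypothetical */

    typedef struct {
        bool     valid;
        uint64_t tag;      /* upper address bits (the "key") */
        uint64_t target;   /* predicted branch target (the "value") */
    } way_t;

    static way_t table[NUM_SETS][WAYS];

    /* Returns true and fills *target on a hit; a miss is the analogue of a
     * failed hash-table lookup, and eviction within the set is the limited
     * collision handling mentioned above. */
    bool lookup(uint64_t addr, uint64_t *target) {
        uint64_t set = (addr >> 3) % NUM_SETS;   /* "hash": low index bits */
        uint64_t tag = addr >> 11;               /* remaining bits as the key */
        for (int w = 0; w < WAYS; w++) {
            if (table[set][w].valid && table[set][w].tag == tag) {
                *target = table[set][w].target;
                return true;
            }
        }
        return false;
    }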
If we assume that they really do mean "associative cache" for the predictor table, then that's obviously something that really works in hardware, and can be updated on mis-predicts. Flushing the predictor state out to some OS-accessible area when the task quits doesn't sound like magic to me. Pre-loading the cache with warm data when the task is launched next time would be very latency-sensitive and is thus more questionable, but I don't think you've adequately explained why it's impossible to pull off.
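As a rough sketch of what the save/restore could look like from the OS side (the table size and the memory-mapped view of the predictor are my own assumptions, not anything Mill has described):

    #include <string.h>
    #include <stdint.h>

    #define PRED_ENTRIES 4096   /* hypothetical table size */

    typedef struct { uint64_t key, target; } pred_entry_t;

    /* Stand-in for the live hardware predictor table; imagine this being a
     * memory-mapped view the OS is allowed to read and write. */
    static pred_entry_t hw_pred_table[PRED_ENTRIES];

    /* On task exit: copy the warm predictor state out to an OS-owned buffer. */
    void save_predictor(pred_entry_t *dst) {
        memcpy(dst, hw_pred_table, sizeof hw_pred_table);
    }

    /* On task launch: copy the saved image back in. This copy is what makes
     * pre-loading latency-sensitive -- it sits on the launch path before the
     * predictor is warm again. */
    void restore_predictor(const pred_entry_t *src) {
        memcpy(hw_pred_table, src, sizeof hw_pred_table);
    }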
>>If we assume that they really do mean "associative cache" for the predictor table, then that's obviously something that really works in hardware, and can be updated on mis-predicts. Flushing the predictor state out to some OS-accessible area when the task quits doesn't sound like magic to me.
That is quite a complex operation, though. How large is the associative table being flushed, and how often does the flushing occur? Assuming we are talking about something roughly the size of a BTB, loading this entire segment into the branch predictor on every context switch is possible (although how this would be implemented in silicon would have to be quite complex).
The data structure described in the paper must be quite a lot larger than a simple BTB, though, if the predictor is to work as described. It has to hold significantly more information than a BTB if Mill wants to replace all the other associated branch prediction machinery with a table that needs to predict every target. I don't believe a prediction table this size would be feasible to load on every context switch. It's possible they want a smaller table and plan to replicate some of the branch prediction machinery found in a modern CPU. But this seems to defeat the entire point of having a prediction table produced by a compiler and updated by successive runs of the program.
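To put rough numbers on it (every constant below is a guess on my part, not anything Mill has published), the amount of data you'd have to stream in per context switch grows quickly once the table has to cover every target:

    /* Back-of-the-envelope only; all constants are guesses, not a Mill spec.
     * The point is just that a "predict every target" table is an order of
     * magnitude more data to move per context switch than a typical BTB. */
    #include <stdio.h>

    int main(void) {
        const int btb_entries   = 4096;   /* roughly a high-end BTB */
        const int full_entries  = 32768;  /* guess for covering every EBB exit */
        const int bytes_per_ent = 16;     /* key + target + metadata, guess */
        const double bw_gbps    = 20.0;   /* guessed sustained fill bandwidth, GB/s */

        double btb_kb  = btb_entries  * bytes_per_ent / 1024.0;
        double full_kb = full_entries * bytes_per_ent / 1024.0;
        printf("BTB-sized image:  %.0f KiB (~%.1f us to stream in)\n",
               btb_kb,  btb_kb  * 1024 / (bw_gbps * 1e9) * 1e6);
        printf("Full-table image: %.0f KiB (~%.1f us to stream in)\n",
               full_kb, full_kb * 1024 / (bw_gbps * 1e9) * 1e6);
        return 0;
    }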
It's very hard to tell exactly what they are loading or what the table will actually do. That's why I'm not surprised they described it as a hash table. It's a simple explanation for how their solution works that avoids having to explain tricky implementation details.
>> but I don't think you've adequately explained why it's impossible to pull off.
I think it's theoretically possible in the sense that they could design silicon that does something similar to what they described. I totally understand what they are trying to do:
Have the compiler produce a complete set of predicted branch targets and improve the targets over time by instrumenting the process. It sounds great superficially: you don't have to bake any branch prediction logic into the CPU at all, and you can use all the information the compiler has, combined with how the program actually runs. The problem (performance-wise) is that most branches are executed many times and have dynamic behavior that can't be exploited by just producing a table of branches.
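To make the contrast concrete, here's a tiny sketch (names and layouts are illustrative only) of a static, compiler-produced target entry next to a conventional 2-bit saturating counter, which is the simplest form of the dynamic behavior a fixed table can't capture:

    #include <stdint.h>
    #include <stdbool.h>

    /* Static view: one fixed prediction per branch, baked in by the compiler
     * and only revised between runs of the program. */
    typedef struct { uint64_t branch_pc; uint64_t predicted_target; } static_pred_t;

    /* Dynamic view: a per-branch 2-bit counter updated on every outcome,
     * so the same branch can flip its prediction within a single run. */
    typedef struct { uint8_t counter; /* 0..3 */ } dyn_pred_t;

    bool dyn_predict(const dyn_pred_t *p) { return p->counter >= 2; }

    void dyn_update(dyn_pred_t *p, bool taken) {
        if (taken  && p->counter < 3) p->counter++;
        if (!taken && p->counter > 0) p->counter--;
    }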
Silicon-wise, I don't see any way they could integrate such a complex piece of standalone logic into the frontend of the CPU and still maintain acceptable latency. The branch predictors of modern CPUs are incredibly well engineered and optimized, and this just isn't in the same ballpark and can't perform as well.
> It has to hold significantly more information than a BTB if Mill wants to replace all the other associated branch prediction machinery with a table that needs to predict every target.
The exit table and the prediction cache they've described are only intended to be the common subset of prediction mechanisms that differ from a typical CPU; a high-end Mill could also incorporate many of the tricks you would expect on a desktop-class CPU, but those details were beyond the scope of the prediction talk.
High-end x86 CPUs already have pretty complex BTBs, often multi-level caches with different capacity for different branch types and limits on how many branches can be tracked per I$ line. The only extra thing the Mill is adding with its prediction cache is the length information on how many bytes of instructions to prefetch and decode to get to the expected branch point (whereupon the prediction cache will be consulted a second time to start on the next EBB). As with x86 cores, I don't think the prediction cache/BTB would always be the same buffer, or the same size, as the history pattern table or other model-specific prediction mechanisms.
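A guess at what such an entry would carry (the field layout is my own reading of the talk, not a published format):

    #include <stdint.h>

    /* The extra piece relative to a plain BTB entry is the EBB length, i.e.
     * how many bytes to fetch and decode before the predicted exit is
     * reached and the prediction cache is consulted again for the next EBB. */
    typedef struct {
        uint64_t ebb_entry;     /* address of the EBB being predicted */
        uint64_t exit_target;   /* predicted address of the next EBB */
        uint16_t ebb_length;    /* bytes to prefetch/decode up to the exit */
        uint8_t  kind;          /* call / return / branch, hypothetical */
    } pred_cache_entry_t;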
The exit table is further from the execution pipeline and isn't on the critical path for recovering from a mispredict. It gets feedback from the core but only directly feeds into the prefetcher. This is the cache that will have the save/restore mechanisms and will load data from RAM in batches. It sounds a lot like a TLB in these respects, though the context switch that triggers loading new data is a function call rather than only a process switch. It definitely doesn't need to predict everything. In the talk it was mentioned that the exit table would have victim buffers and has a 3-cycle latency to the prefetcher. (And remember that Mill clock cycles will probably be pretty long in initial implementations.)
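Here's a hedged sketch of that flow; the structure and function names are my guesses at the shape of it, not Mill's design:

    #include <stdint.h>
    #include <stddef.h>

    typedef struct { uint64_t ebb_entry, exit_target; } exit_entry_t;

    /* Hypothetical hooks: hand an address to the prefetcher, and read a batch
     * of a function's exit entries from the saved image in RAM (stubbed here). */
    static void prefetch_from(uint64_t addr) { (void)addr; }
    static size_t load_exit_batch(uint64_t func, exit_entry_t *out, size_t cap) {
        (void)func; (void)out; (void)cap; return 0;
    }

    /* A function call is the "context switch" that triggers a batch fill of
     * that function's exit entries, much like a TLB refill. */
    size_t on_call(uint64_t callee, exit_entry_t *table, size_t cap) {
        return load_exit_batch(callee, table, cap);
    }

    /* Chain predictions: follow predicted exits from EBB to EBB, feeding the
     * prefetcher without waiting on the decoder; a miss stops the run-ahead. */
    void run_ahead(uint64_t ebb, const exit_entry_t *table, size_t n, int depth) {
        for (int d = 0; d < depth; d++) {
            size_t i = 0;
            while (i < n && table[i].ebb_entry != ebb) i++;
            if (i == n) return;
            ebb = table[i].exit_target;
            prefetch_from(ebb);
        }
    }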