>> One of the perennial questions that I've never received a satisfactory answer to is: Where's the ILP?
Anyone with any experience in the field can come up with an instruction encoding structure that produces more instruction level parallelism in theory. It's usually a simple mental exercise to determine that such a CPU isn't implementable in the "real world." That's exactly what is happening here with Mill except they have really dived off the deep end and do so much handwaving that it's hard to tell exactly how mill is supposed to work.
Take branch prediction: Last time I checked they wanted the processor to load what they described as an "exit table" associated with a module of code that serves as a branch prediction unit. The problem is that this doesn't work at all. You can't have a CPU load a separate data structure that replaces its internal state like this.
First off, this is way too complex as a piece of front end logic for the processor. Trying to access this as code is running is basically impossible. Secondly, A branch predictor has to produce a prediction immediately, you can't just handwave and say you will ask an internal data store structure to produce a result. Their description of this is that it is in some way a "hash table" (http://millcomputing.com/wiki/Prediction). That's really interesting because you would never ever want to implement a hash table in a CPU. Either a) they described it analogously as acting like a hash table; which doesn't make a lot of sense here in a technical description b) they really want to implement a hash table lookup as a branch predictor which be horrendously complex and would never work latency wise or c) they are are calling an associative cache a hash table. They also want to somehow update this "hash table" with a correct prediction on a mispredict. So somehow the processor will update a separate data structure (using a hash table lookup) every time there is a misprediction. Which again, doesn't work latency wise.
By the way, all of this somehow has to be loaded and unloaded seamlessly into main memory.
Notice that there is no description of any advanced logic that is used in modern processors to dramatically increase the accuracy of predictors. I don't even understand how they would possibly implement any of that advanced logic with the branch prediction unit so complex and decoupled from the front end of the processor.
Of course, this kind of compiler driven prediction has been tried before and it doesn't work for obvious reasons. The compiler has a lot less information about the current state of the program than the processor does.
So to recap: they described a feature that doesn't work as described, if it did work would be horrendously complex and wouldn't properly work as a branch predictor and finally doesn't actually provide better predictions at all.
Disclaimer: Going off memory of the presentations myself, not at all an expert, etc.
Ivan Goddard's descriptions of the branch predictor sounds quite simple that I can remember; The predictor itself is completely detached from the compiler, outside of the instruction layout falling into EBBs. A point that (I think is on the Mill forum somewhere) they've said about the prediction algorithim itself, is that they're doing simulations with nearly the simplest known prediction algorithm, and getting something like 90% success rates from that compared to the 98% the state-of-the-art sees. The difference is entirely in what data is carried by the prediction. (I believe this is in the forum thread on Switch statements, referenced in this video) What the predictor stores and delivers as output is the logical equivalent to an instruction pointer, and where in the EBB they exit, instead of a boolean (or an entire exit table). Which is more state, but not replacing the entire state of the processor. And, as explained in this presentation, the impact of having (effectively) an instruction pointer from the branch predictor is you know what to prefetch from memory. And that pointer, being associated with some other EBB that the predictor might know about, is also enough to ask the predictor for the next prediction, get a second prefetch, and so on.
That also has implications to a point Goddard made in this presentation, in that some entire Switch statements have only one "EBB" (although that's loose terminology itself), and thus only one prediction associated with it. The prediction is to where the block will exit, not whether any specific prediction will succeed, thus there is no case where the processor will stall multiple times before finding the correct control flow. It can only fail once before it determines the correct exit to the block (in general, at least...)
Which is I believe how they are expecting good results, rather than anything peculiar about how they determine the prediction from history.
I'm not sure I follow about the benefits of this branch prediction scheme in the first place. I already have a branch target from the branch target buffer( assuming I'm in a conventional processor), so either the frontend can begin immediately loading code from the cache or I stall while I load from main memory. I'm never prefetching anything(at least in terms of an initial target)
Let's assume that Mill can produce a prefetch chain though that can load additional code segments beyond the initial branch target. Does this really benefit the processor? I don't understand how a static prediction of where I will branch to next after an initial branch target is even useful. If I have already branched to that location before the code will be sitting in the L1 cache waiting for me, if it isn't, well, how does the branch predictor know more about the behavior of my program than the L1 cache?
To put it another way: if the instruction set doesn't fit in the L1 cache is there some predictable way that I can load and unload the cache just based on a given branch target( that in turn implies other branch targets will be loaded next with a total instruction set larger than the cache). I don't really think there is any possible way to predict the behavior of an entire program like this. Larger instruction segments and more branch targets imply more complex programs that are correspondingly harder to predict so I'm always better off just relying on the instruction cache.
Assuming Mill tries to load entire chains of instructions into the cache based on a simple static target I would be much better off just disabling this and relying on the instruction cache in almost all cases therefore.
The target is not static; it is dynamically updated as part of history.
Prefetch chaining is to get code out of DRAM, and it runs DRAM-latency ahead of execution. Separately, fetch chaining is to get the code up to the L1, and it runs L2/L3-latency ahead of execution. Load chaining gets the lines from the L1 to the L0 micro-caches and the decoders, and runs decode-latency (3 cycles on the Mill) ahead of execution.
The Mill stages instructions like this because the further ahead in time an action is the more likely that the action is down a path that execution will not take. We don't prefetch to the L1 because we don't have enough confidence that we will need the code to be willing to spam a scarce resource like the L1. But avoiding a full DRAM hit is important too, so we stage the fetching. It doesn't matter at all in small codes that spend all their time in a five-line loop, but that's not the only kind of codes there are :-)
Perhaps you could revisit the prediction talk. Short summary:
Table reload is part of our support for the micro-process coding style. It is essentially irrelevant for typical long running benchmarks, especially those that discard the first billion instructions or so to "warm up the caches".
Table reload provides faster response for cold code (either new processes or running processes at a program phase boundary) than simply letting the predictor accumulate history experience. There are heuristics that decide whether execution has entered a new phase and should table-load; the table is not reloaded on every miss. Like any, the heuristics may be better or worse for a given code.
The historical prediction information is in the load module file and is mapped into DRAM at process load time, just like the code and static data sections. Table-load is memory-to-predictor in hardware and is no more difficult than any of the other memory-to-cache-like-structure loading that all cores use, such as loading translation-table entries to a TLB.
While a newly-compiled load module file gets a prediction table from the compiler, purely as being better than nothing, the memory image from the file is continually updated during execution based on execution experience. When the process terminates, this newly-augmented history is retained in the file, so a subsequent run of the same load module is in effect self-profiling to take advantage of actual execution history. Of course, programs behave differently from run to run and the saved profile experience may be inapt for the next run; there are heuristics that try to cope with that too, although we have insufficient experience as yet to know how well those work. However, we are reasonably confident that even inapt history will be better than the random predictions made by a conventional predictor on cold code.
As always in the Mill, we welcome in the Forum (millcomputing.com/forum) posts of the form "I don't understand how <feature> works - don't you have trouble with <problem>?". Unfortunately, time and audience constraints don't let us go as deep into the details in our talks as we'd like, but the details are available for you. If, after you have understood what a feature is for and how it works, you still see a problem that we have overlooked (as has happened a lot over the years; part of the reason it's been years) then we'd really welcome your ideas about what to do about it, too.
Is there really anything wrong with conflating a hash table and an associative cache in this context? An associative cache uses a portion of the address as the key, and has a usually limited capability for handling collisions. From a software perspective this is pretty similar to how a hash table works, even if the circuit that you visualize when hearing "hash table" is different.
If we assume that they really do mean "associative cache" for the predictor table, then that's obviously something that really works in hardware, and can be updated on mis-predicts. Flushing the predictor state out to some OS-accessible area when the task quits doesn't sound like magic to me. Pre-loading the cache with warm data when the task is launched next time would be very latency-sensitive and is thus more questionable, but I don't think you've adequately explained why it's impossible to pull off.
>>If we assume that they really do mean "associative cache" for the predictor table, then that's obviously something that really works in hardware, and can be updated on mis-predicts. Flushing the predictor state out to some OS-accessible area when the task quits doesn't sound like magic to me.
That is quite a complex operation though. How large is the associative table that is being flushed and how often does the flushing occur? Assuming we are talking about something in the general size of a BTB loading this entire segment into the branch predictor on very context switch is possible(although how this is implemented on silicon would have to be quite complex)
The data structure described in the paper must be quite a lot larger than a simple BTB though if the predictor is to work as described. It has to hold significantly more information than a BTB if Mill wants to replace all the other associated branch prediction machinery with a table that needs to predict every target. I don't believe a prediction table this size would be feasible to load on every context switch. It's possible they want a smaller table and to replicate some of the branch prediction machinery in a modern CPU. But this seems to defeat the entire point of having a prediction table produced by a compiler and updated by successive runs of the program.
It's very hard to tell exactly what they are loading or what the table will actually do. That's why I'm not surprised they described it as a hash table. It's a simple explanation for how their solution works that avoids having to explain tricky implementation details.
>> but I don't think you've adequately explained why it's impossible to pull off.
I think it's theoretically possible in the sense that they could design silicon that does something similar to what they described. I totally understand what they are trying to do:
Have the compiler produce a complete set of predicted branch targets and improve the targets over time by instrumenting the process. It sounds great superficially, you don't have to bake any branch prediction logic into the cpu at all and you can use the all the information the compiler has combined with how the program actually runs. The problem(performance-wise) is that most branches are executed many times and have dynamic behavior that can't be exploited by just producing a table of branches.
Silicon-wise I don't see any way they could integrate such a complex piece of standalone logic into the frontend of the CPU and still maintain acceptable latency. The branch predictor of modern CPUs is incredibly well engineered and optimized and this just isn't in the same ballpark and can't perform as well.
> It has to hold significantly more information than a BTB if Mill wants to replace all the other associated branch prediction machinery with a table that needs to predict every target.
The exit table and the prediction cache that they've described are only intended to be the common subset of prediction mechanisms that differ from a typical CPU, while a high-end Mill could also be incorporating many of the tricks you would expect on a desktop-class CPU, but those details were beyond the scope of the prediction talk.
High-end x86 CPUs already have pretty complex BTBs, often multi-level caches with different capacity for different branch types and limits on how many branches per I$ line. The only extra thing the Mill is adding with its prediction cache is the length information on how many bytes of instructions to prefetch and decode to get to the expected branch point (whereupon the prediction cache will be consulted a second time to start on the next EBB). As with x86 cores, I don't think the prediction cache/BTB would always be the same buffer or same size as the history pattern table or other model-specific prediction mechanisms.
The exit table is further from the execution pipeline and isn't on the critical path for recovering from a mispredict. It gets feedback from the core but only directly feeds into the prefetcher. This is the cache that will have the save/restore mechanisms and will load data from RAM in batches. It sounds a lot like a TLB in these respects, though the context switch that triggers loading new data is a function call rather than only a process switch. It definitely doesn't need to predict everything. In the talk it was mentioned that the exit table would have victim buffers and has a 3-cycle latency to the prefetcher. (And remember that Mill clock cycles will probably be pretty long in initial implementations.)
Anyone with any experience in the field can come up with an instruction encoding structure that produces more instruction level parallelism in theory. It's usually a simple mental exercise to determine that such a CPU isn't implementable in the "real world." That's exactly what is happening here with Mill except they have really dived off the deep end and do so much handwaving that it's hard to tell exactly how mill is supposed to work.
Take branch prediction: Last time I checked they wanted the processor to load what they described as an "exit table" associated with a module of code that serves as a branch prediction unit. The problem is that this doesn't work at all. You can't have a CPU load a separate data structure that replaces its internal state like this.
First off, this is way too complex as a piece of front end logic for the processor. Trying to access this as code is running is basically impossible. Secondly, A branch predictor has to produce a prediction immediately, you can't just handwave and say you will ask an internal data store structure to produce a result. Their description of this is that it is in some way a "hash table" (http://millcomputing.com/wiki/Prediction). That's really interesting because you would never ever want to implement a hash table in a CPU. Either a) they described it analogously as acting like a hash table; which doesn't make a lot of sense here in a technical description b) they really want to implement a hash table lookup as a branch predictor which be horrendously complex and would never work latency wise or c) they are are calling an associative cache a hash table. They also want to somehow update this "hash table" with a correct prediction on a mispredict. So somehow the processor will update a separate data structure (using a hash table lookup) every time there is a misprediction. Which again, doesn't work latency wise.
By the way, all of this somehow has to be loaded and unloaded seamlessly into main memory.
Notice that there is no description of any advanced logic that is used in modern processors to dramatically increase the accuracy of predictors. I don't even understand how they would possibly implement any of that advanced logic with the branch prediction unit so complex and decoupled from the front end of the processor.
Of course, this kind of compiler driven prediction has been tried before and it doesn't work for obvious reasons. The compiler has a lot less information about the current state of the program than the processor does.
So to recap: they described a feature that doesn't work as described, if it did work would be horrendously complex and wouldn't properly work as a branch predictor and finally doesn't actually provide better predictions at all.
This is basically Mill in a nutshell.