The reason I bring up the comparison is that both the Mill and this paper rely upon the existence of these sorts of dependency chains (I like to call them "picothreads" or "femtothreads") in general-purpose code in order to get good performance. This paper shows that such chains are far too wide and shallow to be useful for their intended purpose.
FWIW, someone (disjoint from the Mill team!) has actually made a model of the Mill's phasing and found that it fails to deliver.
And "stronger characteristics" means that it dynamically identifies dependencies.
Not really publication quality. Just a Sunday-morning effort to understand the ceiling on the phased-execution model. All resources (decode, number of pipes, cache, branch prediction) are unmodeled and assumed infinite. I'm sure anyone can improve on it, but I lost interest once the results proved disappointing.
gunzip the data, run it through the script and count the lines out vs. lines in. (I get "IPC" ~2.12.) For a bit more insight, the number on the left of each output line is how many instructions were fused. You can make a histogram like:
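Here's a minimal sketch of that histogram (it assumes the fuse count really is the first whitespace-separated field of each output line, as described above; file and script names below are placeholders):

    import sys
    from collections import Counter

    # Histogram of fuse widths: the number on the left of each
    # output line is taken as the first whitespace-separated field.
    hist = Counter(int(line.split()[0]) for line in sys.stdin if line.strip())
    peak = max(hist.values(), default=1)
    for width in sorted(hist):
        bar = "#" * max(1, hist[width] * 40 // peak)
        print(f"{width:4d} {hist[width]:10d} {bar}")

Feed it the script's output, e.g. gunzip -c data.gz | ./script | python3 hist.py.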
To think I messed around with Valgrind ;) This is much easier.
I look at the particulars of the grouping and think 'no, that's not right', and 'that can't be with this', and 'that can be with that', and so on. The results are far too back-of-the-envelope to be right. But the approach is excellent, and that's the main thing :)
I'm going to crunch through the IR recording and work out the ILP with/without speculation etc. Will leave pipelining and auto-vectorization off the table. As this is all public stuff I reckon the results can be made public.
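The core of the ILP number is just a longest-path computation over the dependence edges; speculation only changes which edges you count. A toy sketch, assuming the recording has already been reduced to a per-instruction list of the earlier instructions it depends on (that representation is my invention, not the actual IR format):

    # ILP ceiling from a dependence trace: with infinite resources and
    # unit latency, each instruction completes one step after its
    # latest input, so ILP = instruction count / critical-path length.
    def ilp(deps):
        depth = []  # deps[i] holds indices (< i) that instruction i depends on
        for preds in deps:
            depth.append(1 + max((depth[p] for p in preds), default=0))
        return len(deps) / max(depth, default=1)

    print(ilp([[], [0], [1], []]))  # chain of 3 plus 1 free op: 4/3 ~ 1.33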
Wanna work with me on this? I'd like you back. This time it might work. Mail me :)
> This paper shows that such chains are far too wide and shallow to be useful for their intended purpose.
But doesn't this paper model a machine that is substantially narrower and deeper than a high-end Mill is expected to be? And what you judge to be the depth of a dependency chain depends on whether you're looking through a 128-instruction reorder buffer or, like a compiler, at the entire function.
Figure 4 in the paper in particular seems to indicate that the failing of the dependence-based model is entirely in the front end, the least Mill-like part of the model.
> it shows that the reason for that anguishing front end is due to the stalled instructions.
Stalled by what, though? Stalled waiting for memory (partially alleviated by the Mill's deferred loads), stalled waiting for dispatch from the queues that the Mill doesn't have, or stalled at the head of the dispatch queue, ready to issue, because the assigned execution unit is busy while some other execution unit sits free?
The whole dispatch vs issue mismatch they're complaining about doesn't exist on the Mill or any other statically-scheduled machine (though there's a related problem in how many instruction slots end up as NOPs). There are at least three other features of the Mill that make the example in figure 6(a) inapplicable, and they didn't diagram the dependencies from the real code used to generate figure 6(b).
I don't think those scheduling constraints exist on the Mill. Certainly the ones illustrated in figure 6(a) don't. Add operation number 5 could issue during the empty slot on cycle number 5 instead of 6, thanks to deferred loads, or thanks to the fact that the Mill uses a multitude of specialized execution units rather than a handful of general-purpose ones: a busy load unit could never stall an independent add operation in the first place.
It looks like add operation number 5 can't use the empty cycle on lane 2 because compare number 3 would already be queued in that lane. The time-reversed, consumers-first scheduling heuristic that a Mill compiler uses wouldn't paint itself into this corner the way this model's dispatcher does.
Figure 6(a) for a Mill would look like an ALU pipeline doing an add operation every cycle without stall, a load pipeline issuing a load operation every cycle without stall, another ALU pipeline doing compare operations every cycle but lagging behind by several cycles instead of just one, and a branch unit doing the conditional jump every cycle immediately after the corresponding compare. And on a mid-level or high-end Mill there would be plenty of execution units to spare for a non-trivial loop body without reducing throughput, and the whole thing would be vectorized too.
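For anyone who hasn't met it, the consumers-first heuristic mentioned above is just list scheduling run time-reversed: each consumer is placed before its producers, so a producer always lands in a cycle its consumers can live with. A toy sketch with unit latencies and a single kind of slot (the simplifications and data structures are mine, not how a real Mill specializer works):

    from collections import defaultdict

    # Time-reversed (consumers-first) list scheduling: walk the ops
    # backwards so every consumer is placed before its producers.
    # Cycles are counted backwards from 0, then renumbered from 0.
    def schedule(deps, width):
        consumers = defaultdict(list)
        for i, preds in enumerate(deps):  # deps[i] = indices (< i) that i depends on
            for p in preds:
                consumers[p].append(i)
        slot_used = defaultdict(int)
        cycle = [0] * len(deps)
        for i in reversed(range(len(deps))):
            # must finish before the earliest consumer issues
            latest = min((cycle[c] for c in consumers[i]), default=0) - 1
            while slot_used[latest] >= width:
                latest -= 1  # slide earlier until a slot is free
            slot_used[latest] += 1
            cycle[i] = latest
        base = min(cycle)
        return [c - base for c in cycle]

    # Two dependent chains sharing two slots per cycle
    print(schedule([[], [0], [1], [], [3]], width=2))  # [0, 1, 2, 1, 2]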