> it shows that the reason for that anguishing front end is due to the stalled instructions.
Stalled by what, though? Stalled waiting for memory (partially alleviated by the Mill's deferred loads), stalled waiting for dispatch from the queues that the Mill doesn't have, or stalled when at the head of the dispatch queue and ready to issue when the assigned execution unit is busy but some other execution unit is free?
The whole dispatch vs issue mismatch they're complaining about doesn't exist on the Mill or any other statically-scheduled machine (though there's a related problem in how many instruction slots end up as NOPs). There are at least three other features of the Mill that make the example in figure 6(a) inapplicable, and they didn't diagram the dependencies from the real code used to generate figure 6(b).
I don't think those scheduling constraints exist on the Mill. Certainly the ones illustrated in figure 6(a) don't. Add operation number 5 could issue during the empty slot on cycle number 5 instead of 6 thanks to deferred loads, or thanks to the fact that the Mill uses a multitude of specialized execution units rather than a handful of general-purpose execution units—a busy load unit could never stall an independent add operation to begin with.
It looks like add operation number 5 can't use the empty cycle on lane 2 because compare number 3 would already be queued in that lane. The time-reversed/consumers first scheduling heuristic that a Mill compiler uses wouldn't paint itself into this corner the way the this model's dispatcher does.
Figure 6(a) for a Mill would look like an ALU pipeline doing an add operation every cycle without stall, a load pipeline issuing a load operation every cycle without stall, another ALU pipeline doing compare operations every cycle but lagging behind by several cycles instead of just one, and a branch unit doing the conditional jump every cycle immediately after the corresponding compare. And on a mid-level or high-end Mill there would be plenty of execution units to spare for a non-trivial loop body without reducing throughput, and the whole thing would be vectorized too.
Stalled by what, though? Stalled waiting for memory (partially alleviated by the Mill's deferred loads), stalled waiting for dispatch from the queues that the Mill doesn't have, or stalled when at the head of the dispatch queue and ready to issue when the assigned execution unit is busy but some other execution unit is free?
The whole dispatch vs issue mismatch they're complaining about doesn't exist on the Mill or any other statically-scheduled machine (though there's a related problem in how many instruction slots end up as NOPs). There are at least three other features of the Mill that make the example in figure 6(a) inapplicable, and they didn't diagram the dependencies from the real code used to generate figure 6(b).