Fun fact, the Mill's proposed method for hiding that latency, "deferred loads" has already been done by Duke's Architecture Group: http://people.duke.edu/~bcl15/documents/huang2016-nisc.pdf (warning PDF link).
The big gain? A measly ~8%.
IIRC the Mill's presentation about deferred loads predates this paper, though the paper is a lot more detailed and has simulations. It's not clear how the Mill's gain from deferred loads would compare (it differs in a lot of other ways that would interact).
8% for an unoptimized PoC is pretty substansial, or it could be nothing when all details are accounted for. If it comes with a simpler silicon as well it's better and surely worth evaluating for GP processors.
I'm pretty out of my depth here, but I thought the deferred loads latency was part of the 12% gained from dynamic scheduling, not the 88% that the Mill claims to tackle with its phasing thing. In that case, 8% doesn't sound measly at all.
But like I said, I wouldn't be surprised if I'm just not understanding what I'm reading/watching.
Fun fact, the Mill's proposed method for hiding that latency, "deferred loads" has already been done by Duke's Architecture Group: http://people.duke.edu/~bcl15/documents/huang2016-nisc.pdf (warning PDF link). The big gain? A measly ~8%.