Hacker News new | past | comments | ask | show | jobs | submit login

It seems to me that the main fundamentally new idea of the Mill is the deferred load mechanism described in the Memory talk. They claim that it can complete with OoO execution with a much simpler implementation. It does require some of the other Mill mechanisms for full efficiency, e.g. it uses the belt to defer loads across calls.

If this mechanism doesn't work as expected with real software, then there's no reason to think that the Mill will fare any better for general-purpose computation than any earlier VLIW/EPIC machine, for all the same reasons as before. And in that same talk they make some claims about OoO that seem a bit naive. The state-of-the-art has evolved a lot since the company was founded in 2003.




Copy paste:

Fun fact, the Mill's proposed method for hiding that latency, "deferred loads" has already been done by Duke's Architecture Group: http://people.duke.edu/~bcl15/documents/huang2016-nisc.pdf (warning PDF link). The big gain? A measly ~8%.


IIRC the Mill's presentation about deferred loads predates this paper, though the paper is a lot more detailed and has simulations. It's not clear how the Mill's gain from deferred loads would compare (it differs in a lot of other ways that would interact).


8% seems huge for an architectural change to me.


8% for an unoptimized PoC is pretty substansial, or it could be nothing when all details are accounted for. If it comes with a simpler silicon as well it's better and surely worth evaluating for GP processors.


I'm pretty out of my depth here, but I thought the deferred loads latency was part of the 12% gained from dynamic scheduling, not the 88% that the Mill claims to tackle with its phasing thing. In that case, 8% doesn't sound measly at all.

But like I said, I wouldn't be surprised if I'm just not understanding what I'm reading/watching.


Something equivalent to deferred loads is a part of AMD's GCN architecture, which has existed for many years now. (It's not exactly a deferred load; rather, there's an explicit wait instruction that can block a thread until a previous load has returned its results. But in the end that's almost the same thing.)

Of course, the characteristics of memory accesses are quite different in a GPU: very high bandwidth at the cost of very high latency.


Actually only ~12% of the performance obtained by OoO machines comes from its dynamic reaction capabilities (eg cache misses and branch mispredicts), ~88% comes from better schedules.

I encourage you to read McFarlin's publications, they're very good OoO statistics.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: