
> The Mill has a 10x single-thread power/performance gain over conventional out-of-order (OoO) superscalar architectures

It would be nice to know how they got that number, because it seems too good to be true.




I am pretty sure they are talking about per-cycle performance, since they can do 33 operations per cycle. IIRC the peak performance of an Intel chip at the moment is 6 FLOP per 2 cycles (or thereabouts).

Of course this is beyond ridiculous, since a 780 Ti can pull off 5 TFLOP/sec at a little under a 1 GHz clock; 5,000 FLOP per cycle is a little more than 33.

It seems like an interesting design, but comparing performance against what an x64 chip can do is a bit silly; you can't just pick numbers at random and call that the overall improvement.


A Haswell core can do 2 vector multiply-adds per cycle, which results in a peak of 32 single-precision FLOP per cycle per core or 16 double-precision FLOP per cycle per core.
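
For anyone who wants the arithmetic spelled out, here is a back-of-the-envelope sketch in C, just printing the well-known Haswell figures above (nothing here is measured):

    #include <stdio.h>

    int main(void) {
        /* Haswell per-core peak, using the figures above:
           2 FMA ports, 256-bit AVX2 vectors, FMA = 2 FLOP per lane. */
        int fma_ports    = 2;
        int vector_bits  = 256;
        int sp_lanes     = vector_bits / 32;   /* 8 single-precision lanes */
        int dp_lanes     = vector_bits / 64;   /* 4 double-precision lanes */
        int flop_per_fma = 2;                  /* one multiply + one add */

        printf("SP peak: %d FLOP/cycle/core\n", fma_ports * sp_lanes * flop_per_fma); /* 32 */
        printf("DP peak: %d FLOP/cycle/core\n", fma_ports * dp_lanes * flop_per_fma); /* 16 */
        return 0;
    }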


The Mill's 33 ops/cycle are all independent operations, i.e. not counting individual vector elements.


The instruction encoding talk starts with a comparison between the Mill, a DSP and Haswell, and tries to explain the basic math. The Mill is a DSP that can run normal, "general purpose" code better - 10x better - than an OoO superscalar. The Mill used in the comparison - one for your laptop - is able to issue 8 SIMD integer ops and 2 SIMD FP ops each cycle, plus other logic.
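
Purely to put the numbers quoted in this thread side by side: the Mill figures below are the ones summarised above, and the Haswell figures are the commonly cited 4-wide issue / 8 execution ports, not measurements. Raw slot counts across very different architectures are not directly comparable, so treat the ratio as hand-waving, not a benchmark:

    #include <stdio.h>

    int main(void) {
        /* Figures quoted in this thread / the encoding talk (not measurements). */
        int mill_gold_ops_per_cycle = 33;  /* "Gold" Mill: independent ops/cycle */
        int mill_laptop_simd_int    = 8;   /* mid-range Mill: SIMD integer ops/cycle */
        int mill_laptop_simd_fp     = 2;   /* mid-range Mill: SIMD FP ops/cycle */

        /* Commonly cited Haswell figures, for scale (assumed here, not measured). */
        int haswell_issue_width = 4;       /* fused uops renamed/issued per cycle */
        int haswell_exec_ports  = 8;       /* execution ports per core */

        printf("Mill Gold: %d ops/cycle; mid-range Mill: %d SIMD int + %d SIMD FP\n",
               mill_gold_ops_per_cycle, mill_laptop_simd_int, mill_laptop_simd_fp);
        printf("Haswell: %d-wide issue, %d execution ports\n",
               haswell_issue_width, haswell_exec_ports);
        printf("Naive slot ratio (Gold vs Haswell issue width): %.1fx\n",
               (double)mill_gold_ops_per_cycle / haswell_issue_width);  /* ~8.3x */
        return 0;
    }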


I was strictly replying to the Intel FLOPs claim of the parent comment. I have only a faint idea how the Mill CPU works, so I can't really compare against it.

From the little I have read, the Mill CPU looks like a cool idea, but I'm skeptical about the claims. I'd rather see claims of efficiency on particular kernels (this can be cherry-picked too, but at least it will be useful to somebody) than pure instruction decoding/issuing numbers. Those are like peak FLOPs: depending on the rest of the architecture they can become effectively impossible to achieve in reality. In any case, I'm looking forward to hearing more about this.
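
To make the peak-vs-achieved point concrete, here is a deliberately naive C sketch that times a scalar dot product and reports achieved GFLOP/s; compare what it prints on your machine against the peak you computed above and the gap is usually large (this kernel is memory-bound). It is a generic illustration, not anything Mill-specific:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 24)   /* ~16M elements */

    int main(void) {
        float *a = malloc(N * sizeof *a);
        float *b = malloc(N * sizeof *b);
        if (!a || !b) return 1;
        for (long i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 2.0f; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        double sum = 0.0;
        for (long i = 0; i < N; i++)
            sum += (double)a[i] * b[i];    /* one multiply + one add per element */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("sum=%.0f, achieved %.2f GFLOP/s\n", sum, 2.0 * N / secs / 1e9);
        free(a); free(b);
        return 0;
    }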


Apologies, I was replying to the thread in general and not your post in particular.

Art has now published the breakdown of the 33 pipelines on the "Gold" Mill here: http://ootbcomp.com/topic/introduction-to-the-mill-cpu-progr...

A key thing generally is that vectorisation on the Mill is applicable to almost all while loops, so it is about speeding up normal code (which is 80% loops with conditions and flow of control) as well as classic math.
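
For anyone wondering what "vectorising a loop with conditions" means in practice, here is a generic (not Mill-specific) C example. Given suitable flags, a vectorising compiler turns the branch into a per-lane mask/select, and how cheaply an architecture supports that kind of predication is what the claim above is about:

    #include <stddef.h>

    /* Data-dependent control flow inside the loop body.
       A vectorising compiler rewrites the if() as a per-element mask
       (select/blend): every lane computes, and the mask discards results
       where the condition is false. */
    float sum_above_threshold(const float *x, size_t n, float threshold) {
        float sum = 0.0f;
        for (size_t i = 0; i < n; i++) {
            if (x[i] > threshold)          /* becomes a mask, not a branch */
                sum += x[i];
        }
        return sum;
    }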


5,000 FLOP per cycle using 2,880 CUDA cores, so less than 2 FLOP/cycle/core.
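
Spelling the arithmetic out, with the 780 Ti figures roughly as quoted above (the clock is approximate): each CUDA core retires one FMA per cycle, so the architectural peak is 2 FLOP/cycle/core, and the "less than 2" comes from dividing the rounded 5 TFLOP/s by a clock slightly below 1 GHz.

    #include <stdio.h>

    int main(void) {
        /* GTX 780 Ti figures, roughly as quoted in the thread. */
        int    cuda_cores   = 2880;
        int    flop_per_fma = 2;      /* one fused multiply-add per core per cycle */
        double clock_ghz    = 0.875;  /* base clock, a little under 1 GHz */

        double flop_per_cycle = (double)cuda_cores * flop_per_fma;   /* 5760 */
        double peak_tflops    = flop_per_cycle * clock_ghz / 1000.0; /* ~5.0 */

        printf("%.0f FLOP/cycle total, %.2f FLOP/cycle/core\n",
               flop_per_cycle, flop_per_cycle / cuda_cores);
        printf("peak %.2f TFLOP/s at %.3f GHz\n", peak_tflops, clock_ghz);
        return 0;
    }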


But I bet if we compared core sizes that distinction would disappear; a CUDA core is incredibly compact, after all.


Well, at least they should tell us _which_ number they picked. They also mention power, so I would expect it's something more than just instructions per cycle.


Given that there's no publicly available simulator or compiler for the Mill, I hope they washed their hands after retrieving those figures. The GI tract is not a friendly place... ;)


They have been running simulations and have had working compilers for a while now. We cannot verify the numbers yet, but they don't seem to have just been pulled out of thin air.


I think their first through third talks go into more detail on this. IIRC it's something like operations per second per watt (and not a completely theoretical best case either: it's based on running realistic code in sim).



