
> The Mill has a 10x single-thread power/performance gain over conventional out-of-order (OoO) superscalar architectures

It would be nice to know how they got that number, because it seems too good to be true.




I am pretty sure they are talking about per-cycle performance, since they can do 33 operations per cycle. IIRC the peak performance of an Intel chip at the moment is 6 FLOP per 2 cycles (or thereabouts).

Of course this is beyond ridiculous, since a 780 Ti can pull off 5 TFLOP/sec at a little under a 1 GHz clock; 5,000 FLOP per cycle is a little more than 33.

It seems like an interesting design, but comparing performance against what an x64 chip can do is a bit silly; you can't just pick numbers at random and call that the overall improvement.


A Haswell core can do 2 vector multiply-adds per cycle, which results in a peak of 32 single-precision FLOP per cycle per core or 16 double-precision FLOP per cycle per core.
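
For anyone who wants the arithmetic spelled out, here is a back-of-the-envelope sketch in C, just printing the well-known Haswell figures above (nothing here is measured):

    #include <stdio.h>

    int main(void) {
        /* Haswell per-core peak, using the figures above:
           2 FMA ports, 256-bit AVX2 vectors, FMA = 2 FLOP per lane. */
        int fma_ports    = 2;
        int vector_bits  = 256;
        int sp_lanes     = vector_bits / 32;   /* 8 single-precision lanes */
        int dp_lanes     = vector_bits / 64;   /* 4 double-precision lanes */
        int flop_per_fma = 2;                  /* one multiply + one add */

        printf("SP peak: %d FLOP/cycle/core\n", fma_ports * sp_lanes * flop_per_fma); /* 32 */
        printf("DP peak: %d FLOP/cycle/core\n", fma_ports * dp_lanes * flop_per_fma); /* 16 */
        return 0;
    }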


The Mill's 33 ops/cycle are all independent operations, i.e. not counting individual vector elements.


The instruction encoding talk starts with a comparison between the Mill, a DSP and Haswell, and tries to explain the basic math. The Mill is a DSP that can run normal, "general purpose" code better - 10x better - than an OoO superscalar. The Mill used in the comparison - one for your laptop - is able to issue 8 SIMD integer ops and 2 SIMD FP ops each cycle, plus other logic.
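
Purely to put the numbers quoted in this thread side by side: the Mill figures below are the ones summarised above, and the Haswell figures are the commonly cited 4-wide issue / 8 execution ports, not measurements. Raw slot counts across very different architectures are not directly comparable, so treat the ratio as hand-waving, not a benchmark:

    #include <stdio.h>

    int main(void) {
        /* Figures quoted in this thread / the encoding talk (not measurements). */
        int mill_gold_ops_per_cycle = 33;  /* "Gold" Mill: independent ops/cycle */
        int mill_laptop_simd_int    = 8;   /* mid-range Mill: SIMD integer ops/cycle */
        int mill_laptop_simd_fp     = 2;   /* mid-range Mill: SIMD FP ops/cycle */

        /* Commonly cited Haswell figures, for scale (assumed here, not measured). */
        int haswell_issue_width = 4;       /* fused uops renamed/issued per cycle */
        int haswell_exec_ports  = 8;       /* execution ports per core */

        printf("Mill Gold: %d ops/cycle; mid-range Mill: %d SIMD int + %d SIMD FP\n",
               mill_gold_ops_per_cycle, mill_laptop_simd_int, mill_laptop_simd_fp);
        printf("Haswell: %d-wide issue, %d execution ports\n",
               haswell_issue_width, haswell_exec_ports);
        printf("Naive slot ratio (Gold vs Haswell issue width): %.1fx\n",
               (double)mill_gold_ops_per_cycle / haswell_issue_width);  /* ~8.3x */
        return 0;
    }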


I was strictly replying to the Intel FLOPs claim of the parent comment. I have only a faint idea how the Mill CPU works, so I can't really compare against it.

From the little I have read, the Mill CPU looks like a cool idea, but I'm skeptical about the claims. I'd rather see claims of efficiency on particular kernels (this can be cherry-picked too, but at least it will be useful to somebody) than pure instruction decoding/issuing numbers. Those are like peak FLOPs: depending on the rest of the architecture they can become effectively impossible to achieve in reality. In any case, I'm looking forward to hearing more about this.
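
To make the peak-vs-achieved point concrete, here is a deliberately naive C sketch that times a scalar dot product and reports achieved GFLOP/s; compare what it prints on your machine against the peak you computed above and the gap is usually large (this kernel is memory-bound). It is a generic illustration, not anything Mill-specific:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 24)   /* ~16M elements */

    int main(void) {
        float *a = malloc(N * sizeof *a);
        float *b = malloc(N * sizeof *b);
        if (!a || !b) return 1;
        for (long i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 2.0f; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        double sum = 0.0;
        for (long i = 0; i < N; i++)
            sum += (double)a[i] * b[i];    /* one multiply + one add per element */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("sum=%.0f, achieved %.2f GFLOP/s\n", sum, 2.0 * N / secs / 1e9);
        free(a); free(b);
        return 0;
    }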


Apologies, I was replying to the thread in general and not your post in particular.

Art has now published the breakdown of the 33 pipelines on the "Gold" Mill here: http://ootbcomp.com/topic/introduction-to-the-mill-cpu-progr...

A key thing generally is that vectorisation on the Mill is applicable to almost all while loops, so it is about speeding up normal code (which is 80% loops with conditions and flow of control) as well as classic math.
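
For anyone wondering what "vectorising a loop with conditions" means in practice, here is a generic (not Mill-specific) C example. Given suitable flags, a vectorising compiler turns the branch into a per-lane mask/select, and how cheaply an architecture supports that kind of predication is what the claim above is about:

    #include <stddef.h>

    /* Data-dependent control flow inside the loop body.
       A vectorising compiler rewrites the if() as a per-element mask
       (select/blend): every lane computes, and the mask discards results
       where the condition is false. */
    float sum_above_threshold(const float *x, size_t n, float threshold) {
        float sum = 0.0f;
        for (size_t i = 0; i < n; i++) {
            if (x[i] > threshold)          /* becomes a mask, not a branch */
                sum += x[i];
        }
        return sum;
    }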


5,000 FLOP per cycle using 2,880 CUDA cores, so less than 2 FLOP/cycle/core.
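
Spelling the arithmetic out, with the 780 Ti figures roughly as quoted above (the clock is approximate): each CUDA core retires one FMA per cycle, so the architectural peak is 2 FLOP/cycle/core, and the "less than 2" comes from dividing the rounded 5 TFLOP/s by a clock slightly below 1 GHz.

    #include <stdio.h>

    int main(void) {
        /* GTX 780 Ti figures, roughly as quoted in the thread. */
        int    cuda_cores   = 2880;
        int    flop_per_fma = 2;      /* one fused multiply-add per core per cycle */
        double clock_ghz    = 0.875;  /* base clock, a little under 1 GHz */

        double flop_per_cycle = (double)cuda_cores * flop_per_fma;   /* 5760 */
        double peak_tflops    = flop_per_cycle * clock_ghz / 1000.0; /* ~5.0 */

        printf("%.0f FLOP/cycle total, %.2f FLOP/cycle/core\n",
               flop_per_cycle, flop_per_cycle / cuda_cores);
        printf("peak %.2f TFLOP/s at %.3f GHz\n", peak_tflops, clock_ghz);
        return 0;
    }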


But I bet if we compared core sizes that distinction would disappear; a CUDA core is incredibly compact, after all.


Well, at least they should tell us _which_ number they picked. They also mention power, so I would expect it's something more than just instructions per cycle.


Given that there's no publicly available simulator or compiler for the Mill, I hope they washed their hands after retrieving those figures. The GI tract is not a friendly place... ;)


They have been running simulations and have had working compilers for a while now. We cannot verify the numbers yet, but they don't seem to have just been pulled out of thin air.


I think their first through third talks go into more detail on this. IIRC it's something like operations per second per watt (and not a completely theoretical best case either: it's based on running realistic code in sim).



