A Haswell core can do 2 vector multiply-adds per cycle, which results in a peak ...
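For reference, the arithmetic behind a per-core peak of this kind can be sketched as below. Only the "2 vector multiply-adds per cycle" comes from the comment. The 256-bit vector width and the convention of counting a fused multiply-add as 2 floating-point operations are assumptions, and `peak_flops_per_cycle` is a hypothetical helper name.

```python
# Rough peak-throughput arithmetic for one core, per clock cycle.
FMA_ISSUES_PER_CYCLE = 2   # from the comment: 2 vector multiply-adds per cycle
VECTOR_BITS = 256          # assumption: AVX2-width vector registers
FLOPS_PER_FMA = 2          # assumption: count the multiply and the add separately


def peak_flops_per_cycle(element_bits: int) -> int:
    """Peak floating-point operations per cycle for a given element size."""
    lanes = VECTOR_BITS // element_bits          # elements per vector
    return FMA_ISSUES_PER_CYCLE * lanes * FLOPS_PER_FMA


if __name__ == "__main__":
    print("single precision:", peak_flops_per_cycle(32))
    print("double precision:", peak_flops_per_cycle(64))
```

Multiplying by the clock frequency and core count gives the usual headline number, which is an upper bound rather than something real code is guaranteed to reach.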

infogulch · on Feb 8, 2014

The Mill's 33 ops/cycle are all independent operations, i.e. they do not count individual vector elements.

willvarfar · on Feb 8, 2014

The instruction encoding talk starts with a comparison between the Mill, a DSP and Haswell, and tries to explain the basic math. The Mill is a DSP that can run normal, "general purpose" code better - 10x better - than an OoO superscalar. The Mill used in the comparison - one for your laptop - is able to issue 8 SIMD integer ops and 2 SIMD FP ops each cycle, plus other logic.

pbsd · on Feb 8, 2014

I was strictly replying to the Intel FLOPs claim of the parent comment. I have only a faint idea how the Mill CPU works, so I can't really compare against it.

From the little I have read, the Mill CPU looks like a cool idea, but I'm skeptical about the claims. I'd rather see claims of efficiency on particular kernels (this can be cherry-picked too, but at least it will be useful to somebody) than pure instruction decoding/issuing numbers. Those are like peak FLOPs: depending on the rest of the architecture they can become effectively impossible to achieve in reality. In any case, I'm looking forward to hearing more about this.

willvarfar · on Feb 8, 2014

Apologies, I was replying to the thread in general and not your post in particular.

Art has now published the 33-pipeline breakdown for the "Gold" Mill here: http://ootbcomp.com/topic/introduction-to-the-mill-cpu-progr...

A key point is that vectorisation on the Mill applies to almost all while loops, so it speeds up normal code (which is 80% loops with conditions and flow of control) as well as classic math.
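To make "vectorising a while loop" concrete, here is a minimal sketch in plain Python of a loop with a data-dependent exit, processed several elements at a time. It is a generic illustration, not the Mill's actual mechanism, which this thread does not describe. `LANES`, `scalar_find` and `chunked_find` are hypothetical names, and the width of 4 is arbitrary.

```python
LANES = 4  # hypothetical vector width, chosen only for illustration


def scalar_find(data, target):
    """Plain while loop: advance until the target is found or data ends."""
    i = 0
    while i < len(data) and data[i] != target:
        i += 1
    return i


def chunked_find(data, target):
    """Same loop, but the exit test is evaluated for a whole chunk at once."""
    i = 0
    n = len(data)
    while i < n:
        chunk = data[i:i + LANES]             # final chunk may be shorter
        mask = [x == target for x in chunk]   # per-lane exit predicate
        if any(mask):
            return i + mask.index(True)       # first lane that wants to exit
        i += len(chunk)
    return n                                  # target not present


if __name__ == "__main__":
    values = [3, 1, 4, 1, 5, 9, 2, 6]
    print(scalar_find(values, 5), chunked_find(values, 5))
```

The hard part on real hardware is exactly what the mask handles here: lanes past the exit point have already been computed and must be discarded without side effects.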