
A Haswell core can do 2 vector multiply-adds per cycle, which results in a peak of 32 single-precision FLOP per cycle per core or 16 double-precision FLOP per cycle per core.
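The arithmetic behind those peak figures can be written out as a small sketch. It assumes 256-bit AVX2 registers and counts each fused multiply-add as 2 FLOP (one multiply plus one add). The constant and function names are illustrative only.

```python
# Peak FLOP per cycle per core = FMA units x SIMD lanes x FLOP per FMA.
VECTOR_BITS = 256   # assumed register width (AVX2 on Haswell)
FMA_UNITS = 2       # vector multiply-adds issued per cycle, as stated above
FLOP_PER_FMA = 2    # one multiply plus one add

def peak_flop_per_cycle(element_bits: int) -> int:
    """Return the peak FLOP per cycle per core for a given element width."""
    lanes = VECTOR_BITS // element_bits  # elements per vector register
    return FMA_UNITS * lanes * FLOP_PER_FMA

print(peak_flop_per_cycle(32))  # single precision: 8 lanes -> 32
print(peak_flop_per_cycle(64))  # double precision: 4 lanes -> 16
```

These are theoretical peaks; sustained throughput depends on keeping both FMA units fed every cycle.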



The Mill's 33 ops/cycle are all independent operations, i.e. not counting individual vector elements.


The instruction encoding talk starts with a comparison between the Mill, a DSP and Haswell, and tries to explain the basic math. The Mill is a DSP that can run normal, "general purpose" code better (10x better) than an OoO superscalar. The Mill used in the comparison (one sized for your laptop) is able to issue 8 SIMD integer ops and 2 SIMD FP ops each cycle, plus other logic.


I was strictly replying to the Intel FLOPs claim of the parent comment. I have only a faint idea how the Mill CPU works, so I can't really compare against it.

From the little I have read, the Mill CPU looks like a cool idea, but I'm skeptical about the claims. I'd rather see claims of efficiency on particular kernels (this can be cherry-picked too, but at least it will be useful to somebody) than pure instruction decoding/issuing numbers. Those are like peak FLOPs: depending on the rest of the architecture they can become effectively impossible to achieve in reality. In any case, I'm looking forward to hearing more about this.


Apologies, I was replying to the thread in general and not your post in particular.

Art has now published the 33-pipeline breakdown for the "Gold" Mill here: http://ootbcomp.com/topic/introduction-to-the-mill-cpu-progr...

A key point is that vectorisation on the Mill applies to almost all while loops, so it speeds up normal code (which is 80% loops with conditions and control flow) as well as classic math.
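To make the idea of vectorising a while loop with a data-dependent exit concrete, here is a generic sketch. It is not the Mill's actual mechanism, which is described in the talks. It only shows the general shape: test a whole chunk of elements at once, then locate the first lane where the exit condition holds. `LANES` and both function names are hypothetical.

```python
LANES = 8  # hypothetical vector width, in elements

def scalar_find_zero(data):
    """Plain while loop: advance until the first zero element."""
    i = 0
    while i < len(data) and data[i] != 0:
        i += 1
    return i

def chunked_find_zero(data):
    """Same search, testing LANES elements per step, as a vector unit would."""
    i = 0
    while i < len(data):
        chunk = data[i:i + LANES]
        mask = [x == 0 for x in chunk]   # one comparison per lane
        if any(mask):                    # exit condition hit somewhere in the chunk
            return i + mask.index(True)  # first lane that hit it
        i += len(chunk)
    return len(data)

sample = [5, 3, 9, 7, 1, 2, 8, 4, 6, 0, 11]
print(scalar_find_zero(sample), chunked_find_zero(sample))
```

Both functions return the same index; the chunked version takes roughly one step per `LANES` elements instead of one per element.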



