A Haswell core can do 2 vector multiply-adds per cycle, which results in a peak of 32 single-precision FLOP per cycle per core or 16 double-precision FLOP per cycle per core.
The instruction encoding talk starts with comparison between Mill, DSP and Haswell and tries to explain the basic math. The Mill is a DSP that can run normal, "general purpose" code better - 10x better - than an OoO superscalar. The Mill used in the comparison - one for your laptop - is able to issue 8 SIMD integer ops and 2 SIMD FP ops each cycle, plus other logic.
I was strictly replying to the Intel FLOPs claim of the parent comment. I have only a faint idea how the Mill CPU works, so I can't really compare against it.
From the little I have read, the Mill CPU looks like a cool idea, but I'm skeptical about the claims. I'd rather see claims of efficiency on particular kernels (this can be cherry-picked too, but at least it will be useful to somebody) than pure instruction decoding/issuing numbers. Those are like peak FLOPs: depending on the rest of the architecture they can become effectively impossible to achieve in reality. In any case, I'm looking forward to hearing more about this.
A key thing generally is that vectorisation on the Mill is applicable to almost all while loops, so is about speeding up normal code (which is 80% loops with conditions and flow of control) as well as classic math.