I am pretty sure they are talking about per-cycle performance, since they can do 33 operations per cycle. IIRC the peak performance of an Intel chip at the moment is 6 FLOP per 2 cycles (or thereabouts).
Of course this is beyond ridiculous, since a 780 Ti can pull off 5 TFLOP/sec on a little under a GHz clock. That works out to roughly 5,000 FLOP per cycle, which is a little more than 33.
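For anyone who wants to check that arithmetic, here is a rough sketch. It uses only the two round figures from the comment (5 TFLOP/sec and a 1 GHz clock as an upper bound); the function name `flop_per_cycle` is made up for illustration, and since the real clock is a bit under 1 GHz the true per-cycle figure would come out somewhat higher.

```python
# Rough per-cycle throughput from the round numbers quoted in the comment.
# Assumption: the clock is taken as exactly 1 GHz (the comment says "a little
# under"), so the result is a lower bound on FLOP per cycle.

def flop_per_cycle(flop_per_second: float, clock_hz: float) -> float:
    """Convert a peak FLOP/sec figure into FLOP per clock cycle."""
    return flop_per_second / clock_hz

gpu = flop_per_cycle(flop_per_second=5e12, clock_hz=1e9)
print(gpu)        # 5000.0
print(gpu / 33)   # how many times larger than the quoted 33 ops per cycle
```

Note this is a whole-chip figure, so it only shows how little a bare "ops per cycle" number says without knowing what is being counted.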
It seems like an interesting design, but comparing performance against what an x64 chip can do is a bit silly. You can't just pick numbers at random and call that the overall improvement.
A Haswell core can do 2 vector multiply-adds per cycle, which results in a peak of 32 single-precision FLOP per cycle per core or 16 double-precision FLOP per cycle per core.
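The numbers above fall out of a simple product, sketched below. The assumptions (not spelled out in the comment) are 256-bit AVX2 vectors and counting each fused multiply-add as 2 FLOP; `peak_flop_per_cycle` is just an illustrative helper name.

```python
# Back-of-the-envelope peak FLOP per cycle for a single core.
# Assumptions: 256-bit vector registers (AVX2) and one fused multiply-add
# counted as 2 FLOP (a multiply plus an add) per lane.

def peak_flop_per_cycle(fma_units: int, vector_bits: int, element_bits: int) -> int:
    """Peak FLOP per cycle = FMA issues per cycle * lanes per vector * 2."""
    lanes = vector_bits // element_bits
    return fma_units * lanes * 2

haswell_sp = peak_flop_per_cycle(fma_units=2, vector_bits=256, element_bits=32)
haswell_dp = peak_flop_per_cycle(fma_units=2, vector_bits=256, element_bits=64)
print(haswell_sp, haswell_dp)  # 32 16
```

These are peak figures per core; sustained throughput on real code depends on keeping both FMA units fed every cycle.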
The instruction encoding talk starts with a comparison between the Mill, a DSP and Haswell, and tries to explain the basic math. The Mill is a DSP that can run normal, "general purpose" code better - 10x better - than an OoO superscalar. The Mill used in the comparison - one for your laptop - is able to issue 8 SIMD integer ops and 2 SIMD FP ops each cycle, plus other logic.
I was strictly replying to the Intel FLOPs claim of the parent comment. I have only a faint idea how the Mill CPU works, so I can't really compare against it.
From the little I have read, the Mill CPU looks like a cool idea, but I'm skeptical about the claims. I'd rather see claims of efficiency on particular kernels (this can be cherry-picked too, but at least it will be useful to somebody) than pure instruction decoding/issuing numbers. Those are like peak FLOPs: depending on the rest of the architecture they can become effectively impossible to achieve in reality. In any case, I'm looking forward to hearing more about this.
A key point is that vectorisation on the Mill is applicable to almost all while loops, so it is about speeding up normal code (which is 80% loops with conditions and flow of control) as well as classic math.
Well, at least they should tell us _which_ number they picked. They also mention power, so I would expect it's something more than just instructions per cycle.
Given that there's not a publicly-available simulator or compiler for the Mill, I hope they washed their hands after retrieving those figures. The GI tract is not a friendly place... ;)
They have been running simulations and had working compilers for a while now. We cannot verify the numbers yet, but they don't seem to just be pulled out of thin air.
I think their first to third talks go into more detail on this. IIRC it's something like ops per second per watt (and not completely theoretical best-case either: based on running realistic code in sim).
It would be nice to know how they got that number, because it seems too good to be true.