How doing something can be faster than not doing it (msdn.com)
73 points by ingve on June 13, 2014 | 17 comments



Note that were the comparisons unsigned, we could use adc instead, which uses one register less, and as a result gets rid of quite a few instructions; the unsigned version is basically

     mov edx, [esp+8]
     xor eax, eax
     mov ecx, array_start
    L:
     cmp [ecx], edx        ; CF = (unsigned) [ecx] < edx
     adc eax, 0            ; eax += CF
     add ecx, 4
     cmp ecx, array_end
     jb L
     ret
and this version is 40% faster on my machine than the signed version.
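
In C, that unsigned loop corresponds to something like the following sketch (the function name and signature are placeholders, not from the article); a compiler that spots the pattern can use setb or adc instead of a branch:

    #include <stddef.h>

    /* Count elements strictly less than boundary using an unsigned compare.
       The comparison result is 0 or 1, so the add needs no branch. */
    unsigned count_below(const unsigned *array, size_t n, unsigned boundary)
    {
        unsigned count = 0;
        for (size_t i = 0; i < n; i++)
            count += array[i] < boundary;
        return count;
    }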


Nice. I think you might be able to shave another cycle off if you were to use [base + 4 * index] addressing with base = array_end and a negative index. That lets you get rid of the second compare and use 'add index_register, 4' followed by 'jnz L' for the loop test, exiting when the index reaches 0. This saves another instruction as well as a cycle of dependency.

Of course, you could also go all out and vectorize: vpcmpgtd, then vpaddd for per-column subtotal, then sum the elements at the end, then negate. Combined with the negative index trick, you might be able to get it down to single cycle per 8 ints. Ahh, the joys of micro-optimizing toy loops!
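
A sketch of that vectorized idea with AVX2 intrinsics (names and tail handling are assumptions, and it leaves the negative-index addressing to the compiler):

    #include <immintrin.h>   /* compile with -mavx2 */
    #include <stddef.h>

    /* vpcmpgtd yields -1 per lane where boundary > array[i]; vpaddd accumulates
       the -1s; a final horizontal sum plus negation gives the count. */
    int count_below_avx2(const int *array, size_t n, int boundary)
    {
        __m256i vbound = _mm256_set1_epi32(boundary);
        __m256i acc    = _mm256_setzero_si256();
        size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256i v    = _mm256_loadu_si256((const __m256i *)(array + i));
            __m256i mask = _mm256_cmpgt_epi32(vbound, v);
            acc = _mm256_add_epi32(acc, mask);
        }
        int lanes[8];
        _mm256_storeu_si256((__m256i *)lanes, acc);
        int count = 0;
        for (int k = 0; k < 8; k++) count -= lanes[k];    /* negate the subtotals */
        for (; i < n; i++) count += array[i] < boundary;  /* scalar tail */
        return count;
    }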


Has anyone made a processor that speculatively executes both branch options? I guess good simulations would show whether that or branch prediction is better.


I think this is called dual or multi-path execution if you want to Google for it.

http://people.cs.clemson.edu/~mark/eager.html says: "...I've seen it alleged that some mainframe processors in the 1960s and 1970s executed down both paths beyond a branch; but, as far as I'm aware, multiple-path execution has never been done by a commercial processor."


From what I recall reading about this in the academic literature, the primary concern here is that speculative operations on the alternate paths consume power just like any other operation.


I think there were some older mainframes (CDC? System 360?) that may have done this to some extent -- it'd only really be feasible up to a certain (very low) point though, since you have the potential for an exponential explosion in the number of possible execution paths if more branches come up before you've resolved the outcome of one you're speculating on (especially in the deeper pipelines of modern machines, with perhaps 3- or 4-wide issue and a dozen or so cycles before a branch gets resolved).



x86 has done this since the P6; in fact, the performance difference would be much worse if it didn't. The reason this particular case behaves the way it does is that there are actually two conditional branches - one for the loop itself, and the other for the check inside the loop. The CPU is able to execute both paths of one branch in parallel, but not the 4 possible paths of two branches in a row.

(I'm quite tempted to get out one of my P55Cs to try it out...)


Isn't that how Itanium is supposed to work? I mean, I think specific instructions would have to be sent in order to force it to happen. But, I seem to recall the line "predication, not prediction" as one of the features, if you want to call it that.


If indeed

  if (array[i] < boundary) count++;
is slower than

  count += (array[i] < boundary) ? 1 : 0;
is there any reason why compilers can't infer the latter?


Linus once wrote a nice (typically Linus) rant on why cmov is a terrible idea in most cases:

http://yarchive.net/comp/linux/cmov.html

> In contrast, if you use a predicated instruction, ALL of it is on the critical path. Calculating the conditional is on the critical path. Calculating the value that gets used is obviously ALSO on the critical path, but so is the calculation for the value that DOESN'T get used too. So the cmov - rather than speeding things up - actually slows things down, because it makes more code be dependent on each other.


Huh. I guess the reason it works in the article is that the value is not calculated, it's just a constant.


No particular reason. Clang generates identical code for both on my system (Arch Linux/x86_64). Compile flags: "-std=c11 -c -O3 -S -march=native -mtune=native", windows-specific stuff removed.

GCC creates different code for the variations, but the generated code is too complex for me to say which one is faster. edit: with -Os GCC generates identical code too.
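
For reference, a minimal pair of functions (a sketch with assumed names and signature, not the article's exact code) that can be fed to the compilers with the flags above to compare the output:

    #include <stddef.h>

    /* if-statement variant */
    size_t count_if(const int *array, size_t n, int boundary)
    {
        size_t count = 0;
        for (size_t i = 0; i < n; i++)
            if (array[i] < boundary) count++;
        return count;
    }

    /* ternary variant */
    size_t count_ternary(const int *array, size_t n, int boundary)
    {
        size_t count = 0;
        for (size_t i = 0; i < n; i++)
            count += (array[i] < boundary) ? 1 : 0;
        return count;
    }

Compiling both with -O3 -S and diffing the .s files is a quick way to see whether a given compiler treats the two forms identically.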


I've discovered the x86_64 opcode CMOVL -- a conditional move -- because of this.

    if (array[i] < boundary) count+=2;
I had to work hard to make GCC not produce optimal code. Even this

    if (array[i] < boundary) return count+1; else return count;
came out the same as all the others.

Avoiding branch misprediction is hardly news. Was the article's author even compiling with optimizations enabled?


>>Was the article's author even compiling with optimizations enabled?

Quite possibly not. Which, I guess, is reasonable, since the point was really about CPU instructions and not C compilers. What comes out is more predictable with optimizations off.

Note that the author mentions the optimizer in relation to the second version. I think it might be because that code generates a branch with optimizations off in Visual C++. gcc seems to treat ?: as inherently a cmov and VC++ seems to treat it no differently than an if.


It does seem weird that the compiler didn't catch that one.


The title of this is misleading, and so is this part:

>>The cost of a single increment operation is highly variable. At low boundary values, it is around 0.03 time units per increment. But at high boundary values, the cost drops to one tenth that.

The cost of an increment operation has nothing to do with it. It's very fast and doesn't change. What matters is the cost of flushing the pipeline on every mispredicted branch.

Maybe you're supposed to get that that part is "naive" and ignore it once you've read the whole article?
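
For anyone who wants to see the misprediction effect directly, here is a rough sketch of the kind of measurement the article describes (random data, sweeping the boundary); the sizes, timing mechanics, and the volatile trick to keep the branch are assumptions, not the article's code:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 100000

    static int array[N];

    int main(void)
    {
        for (int i = 0; i < N; i++) array[i] = rand() % 10;

        for (int boundary = 0; boundary <= 10; boundary++) {
            volatile int count = 0;        /* volatile keeps the branchy form */
            clock_t t0 = clock();
            for (int rep = 0; rep < 1000; rep++)
                for (int i = 0; i < N; i++)
                    if (array[i] < boundary) count++;
            clock_t t1 = clock();
            printf("boundary=%2d  increments=%d  ticks=%ld\n",
                   boundary, count, (long)(t1 - t0));
        }
        return 0;
    }

Mid-range boundary values make the branch unpredictable, so the time per increment is dominated by pipeline flushes rather than by the increment itself.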



