How doing something can be faster than not doing it (msdn.com)
73 points by ingve on June 13, 2014 | 17 comments



Note that were the comparisons unsigned, we could use adc instead, which uses one register less, and as a result gets rid of quite a few instructions; the unsigned version is basically

     mov edx, [esp+8]
     xor eax, eax
     mov ecx, array_start
    L:
     cmp [ecx], edx        ; CF = (unsigned) [ecx] < edx
     adc eax, 0            ; eax += CF
     add ecx, 4
     cmp ecx, array_end
     jb L
     ret
and this version is 40% faster on my machine than the signed version.
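
In C, that unsigned loop corresponds to something like the following sketch (the function name and signature are placeholders, not from the article); a compiler that spots the pattern can use setb or adc instead of a branch:

    #include <stddef.h>

    /* Count elements strictly less than boundary using an unsigned compare.
       The comparison result is 0 or 1, so the add needs no branch. */
    unsigned count_below(const unsigned *array, size_t n, unsigned boundary)
    {
        unsigned count = 0;
        for (size_t i = 0; i < n; i++)
            count += array[i] < boundary;
        return count;
    }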


Nice. I think you might be able to shave another cycle off if you were to use [base + 4 * index] addressing with base = array_end and a negative index. That lets you get rid of the second compare and use 'add index_register, 4' followed by 'jnz L' for the loop test, exiting when the index reaches 0. This saves another instruction as well as a cycle of dependency.

Of course, you could also go all out and vectorize: vpcmpgtd, then vpaddd for per-column subtotal, then sum the elements at the end, then negate. Combined with the negative index trick, you might be able to get it down to single cycle per 8 ints. Ahh, the joys of micro-optimizing toy loops!
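
A sketch of that vectorized idea with AVX2 intrinsics (names and tail handling are assumptions, and it leaves the negative-index addressing to the compiler):

    #include <immintrin.h>   /* compile with -mavx2 */
    #include <stddef.h>

    /* vpcmpgtd yields -1 per lane where boundary > array[i]; vpaddd accumulates
       the -1s; a final horizontal sum plus negation gives the count. */
    int count_below_avx2(const int *array, size_t n, int boundary)
    {
        __m256i vbound = _mm256_set1_epi32(boundary);
        __m256i acc    = _mm256_setzero_si256();
        size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256i v    = _mm256_loadu_si256((const __m256i *)(array + i));
            __m256i mask = _mm256_cmpgt_epi32(vbound, v);
            acc = _mm256_add_epi32(acc, mask);
        }
        int lanes[8];
        _mm256_storeu_si256((__m256i *)lanes, acc);
        int count = 0;
        for (int k = 0; k < 8; k++) count -= lanes[k];    /* negate the subtotals */
        for (; i < n; i++) count += array[i] < boundary;  /* scalar tail */
        return count;
    }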


Has anyone made a processor that speculatively executes both branch options? I guess good simulations would show whether that or branch prediction is better.


I think this is called dual or multi-path execution if you want to Google for it.

http://people.cs.clemson.edu/~mark/eager.html says: "...I've seen it alleged that some mainframe processors in the 1960s and 1970s executed down both paths beyond a branch; but, as far as I'm aware, multiple-path execution has never been done by a commercial processor."


From what I recall reading about this in the academic literature, the primary concern here is that speculative operations on the alternate paths consume power just like any other operation.


I think there were some older mainframes (CDC? System 360?) that may have done this to some extent -- it'd only really be feasible up to a certain (very low) point though, since you have the potential for an exponential explosion in the number of possible execution paths if more branches come up before you've resolved the outcome of one you're speculating on (especially in the deeper pipelines of modern machines, with perhaps 3- or 4-wide issue and a dozen or so cycles before a branch gets resolved).



x86 has done this since the P6; in fact, the performance difference would be much worse if it didn't. The reason this particular case behaves the way it does is that there are actually two conditional branches - one for the loop itself, and the other for the check inside the loop. The CPU is able to execute both paths of one branch in parallel, but not the 4 possible paths of two branches in a row.

(I'm quite tempted to get out one of my P55Cs to try it out...)


Isn't that how Itanium is supposed to work? I mean, I think specific instructions would have to be sent in order to force it to happen. But, I seem to recall the line "predication, not prediction" as one of the features, if you want to call it that.


If indeed

  if (array[i] < boundary) count++;
is slower than

  count += (array[i] < boundary) ? 1 : 0;
is there any reason why compilers can't infer the latter?


Linus once wrote a nice (typically Linus) rant on why cmov is a terrible idea in most cases:

http://yarchive.net/comp/linux/cmov.html

> In contrast, if you use a predicated instruction, ALL of it is on the critical path. Calculating the conditional is on the critical path. Calculating the value that gets used is obviously ALSO on the critical path, but so is the calculation for the value that DOESN'T get used too. So the cmov - rather than speeding things up - actually slows things down, because it makes more code be dependent on each other.


Huh. I guess the reason it works in the article is that the value is not calculated, it's just a constant.


No particular reason. Clang generates identical code for both on my system (Arch Linux/x86_64). Compile flags: "-std=c11 -c -O3 -S -march=native -mtune=native", windows-specific stuff removed.

GCC creates different code for the variations, but the generated code is too complex for me to say which one is faster. edit: with -Os GCC generates identical code too.
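
For reference, a minimal pair of functions (a sketch with assumed names and signature, not the article's exact code) that can be fed to the compilers with the flags above to compare the output:

    #include <stddef.h>

    /* if-statement variant */
    size_t count_if(const int *array, size_t n, int boundary)
    {
        size_t count = 0;
        for (size_t i = 0; i < n; i++)
            if (array[i] < boundary) count++;
        return count;
    }

    /* ternary variant */
    size_t count_ternary(const int *array, size_t n, int boundary)
    {
        size_t count = 0;
        for (size_t i = 0; i < n; i++)
            count += (array[i] < boundary) ? 1 : 0;
        return count;
    }

Compiling both with -O3 -S and diffing the .s files is a quick way to see whether a given compiler treats the two forms identically.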


I've discovered the x86_64 opcode CMOVL -- a conditional move -- because of this.

    if (array[i] < boundary) count+=2;
I had to work hard to make GCC not produce optimal code. Even this

    if (array[i] < boundary) return count+1; else return count;
came out the same as all the others.

Avoiding branch misprediction is hardly news. Was the article's author even compiling with optimizations enabled?


>>Was the article's author even compiling with optimizations enabled?

Quite possibly not. Which, I guess, is reasonable, since the point was really about CPU instructions and not C compilers. What comes out is more predictable with optimizations off.

Note that the author mentions the optimizer in relation to the second version. I think it might be because that code generates a branch with optimizations off in Visual C++. gcc seems to treat ?: as inherently a cmov and VC++ seems to treat it no differently than an if.


It does seem weird that the compiler didn't catch that one.


The title of this is misleading, and so is this part:

>>The cost of a single increment operation is highly variable. At low boundary values, it is around 0.03 time units per increment. But at high boundary values, the cost drops to one tenth that.

The cost of an increment operation has nothing to do with it. It's very fast and doesn't change. What matters is the cost of flushing the pipeline on every mispredicted branch.

Maybe you're supposed to get that that part is "naive" and ignore it once you've read the whole article?
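
For anyone who wants to see the misprediction effect directly, here is a rough sketch of the kind of measurement the article describes (random data, sweeping the boundary); the sizes, timing mechanics, and the volatile trick to keep the branch are assumptions, not the article's code:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 100000

    static int array[N];

    int main(void)
    {
        for (int i = 0; i < N; i++) array[i] = rand() % 10;

        for (int boundary = 0; boundary <= 10; boundary++) {
            volatile int count = 0;        /* volatile keeps the branchy form */
            clock_t t0 = clock();
            for (int rep = 0; rep < 1000; rep++)
                for (int i = 0; i < N; i++)
                    if (array[i] < boundary) count++;
            clock_t t1 = clock();
            printf("boundary=%2d  increments=%d  ticks=%ld\n",
                   boundary, count, (long)(t1 - t0));
        }
        return 0;
    }

Mid-range boundary values make the branch unpredictable, so the time per increment is dominated by pipeline flushes rather than by the increment itself.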



