Note that were the comparisons *unsigned*, we could use adc instead, which uses ...

nkurz · on June 14, 2014

Nice. I think you might be able to shave another cycle off if you were to use [base + 4 * index] addressing with base = array_end and a negative index. This allows you to get rid of the second compare and use 'add index_register, 1' 'jnz L' for the loop test, exiting when index == 0 (with the 4x scale in the addressing mode the index counts elements, so it steps by 1; step by 4 only if you drop the scale and index in bytes). This saves another instruction as well as a cycle of dependency.
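To make the negative index trick concrete, here is a minimal C sketch. It assumes the loop under discussion counts the array elements below some limit; the function name `count_less` and its parameters are made up for illustration, and whether a compiler actually emits the add/jnz pair depends on the compiler and flags.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: counts the elements of a[0..n) that are less than
 * limit, using a signed comparison. */
static size_t count_less(const int32_t *a, size_t n, int32_t limit)
{
    const int32_t *end = a + n;   /* base = one past the last element */
    ptrdiff_t i = -(ptrdiff_t)n;  /* negative index, counts up toward 0 */
    size_t count = 0;

    /* The only loop test is "i != 0", so the increment's zero flag can
     * double as the exit condition and no separate compare is needed. */
    while (i != 0) {
        count += (size_t)(end[i] < limit);  /* end[i] walks a[0] .. a[n-1] */
        i++;
    }
    return count;
}
```

For the array {5, -1, 3, 7, 0} and a limit of 3, this counts the two elements -1 and 0.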

Of course, you could also go all out and vectorize: vpcmpgtd, then vpaddd for per-column subtotals, then sum the elements at the end, then negate (each true compare contributes -1, so the raw sum is minus the count). Combined with the negative index trick, you might be able to get it down to a single cycle per 8 ints. Ahh, the joys of micro-optimizing toy loops!
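The lane arithmetic of the vectorized version can be modeled in plain C. This is only a sketch of the idea, not real AVX2 code: the inner loop stands in for one vpcmpgtd plus one vpaddd on eight 32-bit lanes, the name `count_less_lanes` is made up, and it again assumes the goal is counting elements below a limit. It also assumes fewer than 2^31 blocks, so the per-lane subtotals cannot overflow.

```c
#include <stddef.h>
#include <stdint.h>

enum { LANES = 8 };  /* eight 32-bit ints per 256-bit register */

/* Hypothetical helper: scalar model of the compare / add / sum / negate
 * scheme for counting elements of a[0..n) that are less than limit. */
static size_t count_less_lanes(const int32_t *a, size_t n, int32_t limit)
{
    int32_t subtotal[LANES] = {0};  /* per-column subtotals */
    size_t i = 0;

    for (; i + LANES <= n; i += LANES) {
        for (int lane = 0; lane < LANES; lane++) {
            /* Models vpcmpgtd: -1 (all ones) where limit > a[i], else 0. */
            int32_t mask = -(int32_t)(limit > a[i + lane]);
            /* Models vpaddd: accumulate the mask into this lane. */
            subtotal[lane] += mask;
        }
    }

    /* Sum the lanes once at the end, then negate to get a positive count. */
    int64_t total = 0;
    for (int lane = 0; lane < LANES; lane++)
        total += subtotal[lane];
    size_t count = (size_t)(-total);

    /* Scalar tail for the leftover elements that do not fill a block. */
    for (; i < n; i++)
        count += (size_t)(a[i] < limit);
    return count;
}
```

Keeping the negation out of the loop is the point of the scheme: the hot path is just a compare and an add per block, and the sign is fixed up once after the horizontal sum.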