Note that if the comparisons were unsigned, we could use adc instead, which needs one fewer register and, as a result, gets rid of quite a few instructions; the unsigned version is basically
Nice. I think you might be able to shave another cycle off by using [base + 4 * index] addressing with base = array_end and a negative index. That lets you drop the second compare and use 'add index_register, 1' 'jnz L' as the loop test, exiting when index == 0. (With the 4 * index scaling the index counts elements, so it steps by 1; 'add index_register, 4' would go with unscaled [base + index] addressing instead.) This saves another instruction and takes a cycle out of the dependency chain.
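For anyone who wants to see the negative-index idea outside of assembly, here is a rough C sketch. It assumes the loop in question counts array elements greater than some key, which is my guess at the task; the function name count_greater_negidx and its parameters are made up for illustration. Whether a given compiler actually emits the [base + 4 * index] form with a single add/jnz pair is not guaranteed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: count the elements of a[0..n) greater than key.
   The task itself is an assumption; only the indexing pattern matters. */
size_t count_greater_negidx(const int32_t *a, size_t n, int32_t key)
{
    const int32_t *end = a + n;   /* base = one past the last element */
    size_t count = 0;

    /* i runs from -n up to 0, so "i became zero" is the only loop test;
       no separate compare against an end pointer or length is needed. */
    for (ptrdiff_t i = -(ptrdiff_t)n; i != 0; ++i)
        count += (size_t)(end[i] > key);

    return count;
}
```

The point of the pattern is that the increment itself sets the zero flag, so the loop condition comes for free.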
Of course, you could also go all out and vectorize: vpcmpgtd, then vpaddd for a per-column subtotal, then sum the elements at the end, then negate (vpcmpgtd writes -1 into every lane where the comparison is true, so the subtotals come out negative). Combined with the negative index trick, you might be able to get it down to a single cycle per 8 ints. Ahh, the joys of micro-optimizing toy loops!
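A rough sketch of the compare / accumulate / sum / negate sequence, written in C with the GCC/Clang vector extension rather than hand-written assembly. Assumptions: the task is again "count elements greater than key", the name count_greater_vec is hypothetical, and the vector extension is not ISO C. With AVX2 enabled I would expect the lane-wise compare and add to become vpcmpgtd and vpaddd, but that is up to the compiler.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Eight 32-bit lanes via the GCC/Clang vector extension (not ISO C). */
typedef int32_t v8si __attribute__((vector_size(32)));

/* Hypothetical example: count the elements of a[0..n) greater than key. */
size_t count_greater_vec(const int32_t *a, size_t n, int32_t key)
{
    const v8si vkey = {key, key, key, key, key, key, key, key};
    v8si acc = {0};               /* per-lane subtotals, all start at 0 */
    size_t i = 0;

    for (; i + 8 <= n; i += 8) {
        v8si v;
        memcpy(&v, a + i, sizeof v);  /* unaligned load of 8 ints */
        acc += (v > vkey);            /* each true lane contributes -1 */
    }

    /* Sum the (negative) per-lane subtotals once at the end, then negate.
       Each 32-bit lane would wrap after 2^31 matches; ignored here. */
    int64_t sum = 0;
    for (int k = 0; k < 8; ++k)
        sum += acc[k];
    size_t count = (size_t)(-sum);

    /* Scalar tail for the last n % 8 elements. */
    for (; i < n; ++i)
        count += (size_t)(a[i] > key);

    return count;
}
```

Subtracting the mask inside the loop instead of adding it would make the final negate unnecessary; the sketch keeps the add to follow the sequence described above.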