
Note that if the comparisons were unsigned, we could use adc instead, which needs one fewer register and as a result gets rid of quite a few instructions; the unsigned version is basically

     mov edx, [esp+8]
     xor eax, eax
     mov ecx, array_start
    L:
     cmp [ecx], edx        ; CF = (unsigned) [ecx] < edx
     adc eax, 0            ; eax += CF
     add ecx, 4
     cmp ecx, array_end
     jb L
     ret
and this version is 40% faster on my machine than the signed version.
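
For anyone who wants to try the same idea from C, here is a minimal sketch of the branchless unsigned count. The function name and signature are made up for illustration, and whether a given compiler actually emits cmp/adc for it is an assumption, not something measured here.

```c
#include <stddef.h>
#include <stdint.h>

/* Count the elements of a[0..n) that are below x, compared as unsigned.
   The 0/1 result of the comparison is added directly, which mirrors
   "cmp [ecx], edx" followed by "adc eax, 0" in the assembly above.
   Hypothetical helper: name and signature are illustrative only. */
static size_t count_below_unsigned(const uint32_t *a, size_t n, uint32_t x)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += (a[i] < x);   /* adds 0 or 1, no branch on the data */
    return count;
}
```

Unlike the assembly loop, this version also handles an empty array, since the loop test runs before the first iteration.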



Nice. I think you might be able to shave another cycle off if you were to use [base + 4 * index] addressing with base = array_end and a negative index that counts up to zero. This allows you to get rid of the second compare and use 'add index_register, 1' 'jnz L' for the loop test and exit when index == 0 (if the index holds a byte offset instead, address with [base + index] and use 'add index_register, 4'). This saves another instruction as well as a cycle of dependency.
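
A rough C model of the negative index trick, assuming an element index rather than a byte offset. The function name is hypothetical, and a compiler is free to restructure the loop, so this only shows the shape of the idea.

```c
#include <stddef.h>
#include <stdint.h>

/* Same unsigned count, but indexed from the end of the array with a
   negative index that runs up to zero, so the loop test is just
   "increment, then check for zero" (add + jnz in the assembly).
   Hypothetical helper: name and signature are illustrative only. */
static size_t count_below_neg_index(const uint32_t *a, size_t n, uint32_t x)
{
    size_t count = 0;
    if (n == 0)
        return 0;                      /* the do/while body needs one element */

    const uint32_t *end = a + n;       /* base = array_end */
    ptrdiff_t i = -(ptrdiff_t)n;       /* index starts at -n */
    do {
        count += (end[i] < x);         /* [base + 4 * index] */
    } while (++i != 0);                /* exit when index == 0 */
    return count;
}
```

The end pointer never changes inside the loop, so the only loop-carried values are the index and the count.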

Of course, you could also go all out and vectorize: vpcmpgtd (a signed compare that yields -1 in each matching lane), then vpaddd for per-column subtotal, then sum the elements at the end, then negate. Combined with the negative index trick, you might be able to get it down to a single cycle per 8 ints. Ahh, the joys of micro-optimizing toy loops!
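
To make the compare/add/negate bookkeeping concrete, here is a plain C model of the vectorized version. It uses no intrinsics: the inner loop stands in for one 8-lane vpcmpgtd + vpaddd step, and the function name, the LANES constant, and the scalar tail loop are assumptions added for the sketch. It also assumes the array is short enough that a 32-bit lane subtotal cannot overflow.

```c
#include <stddef.h>
#include <stdint.h>

enum { LANES = 8 };   /* eight 32-bit lanes, as in a 256-bit register */

/* Scalar model of the signed, vectorized count of elements below x.
   Hypothetical helper: name and signature are illustrative only. */
static size_t count_below_lanes(const int32_t *a, size_t n, int32_t x)
{
    int32_t acc[LANES] = {0};          /* per-column subtotals */
    size_t i = 0;

    for (; i + LANES <= n; i += LANES) {
        for (int l = 0; l < LANES; l++) {
            /* vpcmpgtd: all ones (-1) where x > a[i + l], else 0 */
            int32_t mask = (x > a[i + l]) ? -1 : 0;
            acc[l] += mask;            /* vpaddd: subtotal goes negative */
        }
    }

    int64_t total = 0;                 /* sum the lanes at the end */
    for (int l = 0; l < LANES; l++)
        total += acc[l];

    size_t count = (size_t)(-total);   /* negate to get the real count */

    for (; i < n; i++)                 /* leftover elements, one at a time */
        count += (a[i] < x);
    return count;
}
```

The negate at the end exists only because each matching lane contributes -1; subtracting the mask instead of adding it would give a positive subtotal directly.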



