Would -march=native and -fstrict-aliasing make any difference?
It would be interesting to compare the compiler-generated asm with the hand-rolled one.
The code also has some potential improvements, though maybe the compiler is smart enough to find them on its own. For example, it reads pivot.key on every loop iteration even though the value doesn't change.
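To make the pivot.key point concrete, here is a rough sketch of the two variants. The Item struct and both function names are made up for illustration and are not taken from the code under discussion. When the loop writes to Item objects and the pivot is passed by reference, the compiler may have to assume a write could change pivot.key and reload it each time, unless it can prove the two never overlap.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical record type; the real struct may look different.
struct Item {
    int key;
    int value;
};

// Variant 1: reads pivot.key inside the loop. The swap writes Item objects,
// so the compiler may not be able to prove that pivot.key is unchanged.
std::size_t partition_reload(std::vector<Item>& items, const Item& pivot) {
    std::size_t store = 0;
    for (std::size_t i = 0; i < items.size(); ++i) {
        if (items[i].key < pivot.key) {
            std::swap(items[i], items[store]);
            ++store;
        }
    }
    return store;  // number of items with key below the pivot key
}

// Variant 2: copies pivot.key into a local before the loop, so no reload is
// needed. Assumes pivot is not an element that the loop itself moves.
std::size_t partition_hoisted(std::vector<Item>& items, const Item& pivot) {
    const int pivot_key = pivot.key;
    std::size_t store = 0;
    for (std::size_t i = 0; i < items.size(); ++i) {
        if (items[i].key < pivot_key) {
            std::swap(items[i], items[store]);
            ++store;
        }
    }
    return store;
}
```

Compiling both with g++ -O3 -S and diffing the output would show whether the compiler already hoists the load by itself.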
From the makefile:
GCCFLAGS = -O3 --std=c++11
MSFLAGS = /nologo /Ox /Ob2 /Ot /Oi /GL