There could be an instruction scheduler impact here as well. Intel processors are known for having an unusually deep out-of-order execution window.
It turns out that the Nth value of wyhash64_x doesn't depend on any of the multiplies in the (N-1)th iteration. It depends only on the addition of the constant increment.
So, with a sufficiently deep out-of-order window, the instruction scheduler can effectively have several of those wyhash iterations in flight at the same time, hiding nearly all of the hash's latency behind the work of the other iterations.
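To make the dependency structure concrete, here is a minimal sketch of the generator as it is commonly written. The three constants and the split into a separate wyhash64_mix helper are my assumptions, not something taken from the benchmark under discussion, and wyhash64_nth is a hypothetical helper added only to show that the Nth output can be computed straight from the seed. It relies on the __uint128_t extension available in GCC and Clang.

```c
#include <stdint.h>

/* Generator state: the only value carried from one call to the next. */
static uint64_t wyhash64_x;

/* Mixing step: two 64x64->128 multiplies, each folded with XOR.
   Constants are assumed from the commonly circulated version. */
static uint64_t wyhash64_mix(uint64_t x) {
    __uint128_t tmp = (__uint128_t)x * 0xa3b195354a39b70dULL;
    uint64_t m1 = (uint64_t)(tmp >> 64) ^ (uint64_t)tmp;
    tmp = (__uint128_t)m1 * 0x1b03738712fad5c9ULL;
    return (uint64_t)(tmp >> 64) ^ (uint64_t)tmp;
}

static uint64_t wyhash64(void) {
    /* Loop-carried dependency: a single 64-bit add. */
    wyhash64_x += 0x60bee2bee120fc15ULL;
    /* The multiplies read the state but never write back to it. */
    return wyhash64_mix(wyhash64_x);
}

/* Hypothetical helper: the n-th output (1-based) computed directly
   from the seed, without running the n-1 calls before it. */
static uint64_t wyhash64_nth(uint64_t seed, uint64_t n) {
    return wyhash64_mix(seed + n * 0x60bee2bee120fc15ULL);
}
```

Because the multiplies never feed back into wyhash64_x, the only thing one call has to wait on from the previous call is that add, which is what lets an out-of-order core overlap the multiply chains of neighbouring iterations.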
Indeed. Of course, the idea that this makes the benchmark invalid implies that "real" application code (whatever that is) would be designed to have a sequential dependency on a single wyhash64 result and to sit twiddling its thumbs waiting for it. Maybe, and maybe not. One can make up any argument one likes.
Such are the perils of micro-benchmarking.