If I'm reading it right, the Intel compiler just took advantage of the outer test loop by swapping the two loops. The code was supposed to loop over the data many times; instead, it looped over each item of the data many times. It got the same answer, but the branch predictor had a much easier job.
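To make that concrete, here is a rough sketch of the two loop orders. It is not the actual benchmark or the compiler's output: the function names, the `>= 128` test, and the `repeats` parameter are all assumed for illustration. The first function repeats the whole pass over the data; the second repeats the work on each item before moving on, so the branch goes the same way `repeats` times in a row.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical "as written" version: the outer loop repeats the whole pass,
// so the data-dependent branch sees a different element on every iteration.
long long sum_original(const std::vector<int>& data, int repeats) {
    long long sum = 0;
    for (int r = 0; r < repeats; ++r) {
        for (std::size_t i = 0; i < data.size(); ++i) {
            if (data[i] >= 128) {  // threshold is an assumed example value
                sum += data[i];
            }
        }
    }
    return sum;
}

// Hypothetical "interchanged" version: the repeat loop is now innermost,
// so the branch on data[i] has the same outcome for `repeats` iterations
// in a row, which is trivial to predict.
long long sum_interchanged(const std::vector<int>& data, int repeats) {
    long long sum = 0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        for (int r = 0; r < repeats; ++r) {
            if (data[i] >= 128) {
                sum += data[i];
            }
        }
    }
    return sum;
}
```

Both functions add up exactly the same terms, just in a different order, which is why the result is unchanged. Only the pattern of branch outcomes differs.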