My first thought was the uop cache as well. The fact that pre-Sandy Bridge processors show the intuitive performance behavior makes me more certain it's related.
One interesting experiment would be to realign the loop to each offset from 0 to 31 mod 32 and see whether, and how, the behavior changes.
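To plan that sweep, here is a rough Python sketch that computes how many padding bytes (e.g. NOPs) would be needed to put the loop at each offset mod 32, and how many 32-byte windows the loop body would then touch. The loop address `0x400537` and the 20-byte loop length are made-up placeholders, not values from the original question, and the helper names are my own. The 32-byte granularity is the assumption being tested here, so treat the window count as a hint for what to look for rather than a prediction.

```python
ALIGN = 32  # window size under test (assumption: 32-byte granularity matters)


def padding_for_offset(addr, target, align=ALIGN):
    """Padding bytes needed so that (addr + padding) % align == target."""
    return (target - addr) % align


def windows_spanned(start, length, align=ALIGN):
    """Number of align-byte windows touched by the range [start, start + length)."""
    first = start // align
    last = (start + length - 1) // align
    return last - first + 1


def sweep(loop_addr, loop_len, align=ALIGN):
    """For each target offset, return (offset, padding_bytes, windows_touched)."""
    return [
        (off, padding_for_offset(loop_addr, off, align), windows_spanned(off, loop_len, align))
        for off in range(align)
    ]


if __name__ == "__main__":
    # Hypothetical loop start address and length in bytes; substitute real ones.
    for off, pad, wins in sweep(0x400537, 20):
        print(f"offset {off:2d}: pad {pad:2d} bytes, loop touches {wins} window(s)")
```

You would then rebuild with each padding amount, time the loop, and check whether the slow cases line up with the offsets where the loop crosses a window boundary.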