The above comment originally said "32-byte boundary" everywhere it now says "64-byte". The conventional wisdom was always that the uop cache operated on 32-byte boundaries only, and this is almost certainly true for Haswell and earlier.
However, after a bit more testing I noticed that the critical boundary here is actually 64 bytes, not 32 bytes. That's odd, as I always understood the boundary to be 32 bytes on Skylake as well, but either (a) it changed to 64 bytes in Skylake, or (b) there is a second-order effect at 64 bytes, and the uop cache can actually deliver instructions from two different ways in a single cycle (seems hard!).
I note that Wikichip, which only has "so-so" accuracy, does note:
µOP Cache
instruction window is now 64 Bytes (from 32)
for Skylake-client https://en.wikichip.org/wiki/intel/microarchitectures/skylak...
So maybe that's a little-noticed change.
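To make the 32-byte vs 64-byte distinction concrete, here is a minimal sketch of the arithmetic involved: checking whether a block of code straddles an aligned window of a given size. The function name `crosses_boundary` and the example address are hypothetical, chosen only for illustration; this says nothing about how the hardware actually decides, it just shows which layouts would tell the two hypotheses apart.

```python
def crosses_boundary(start: int, length: int, window: int) -> bool:
    """Return True if the byte range [start, start + length) spans
    more than one aligned window of `window` bytes."""
    if length <= 0:
        return False
    first_window = start // window
    last_window = (start + length - 1) // window
    return first_window != last_window


# Hypothetical loop body: 16 bytes of code starting 24 bytes into a 64-byte line.
start, length = 0x401018, 16

print(crosses_boundary(start, length, 32))  # straddles a 32-byte boundary
print(crosses_boundary(start, length, 64))  # stays inside one 64-byte window
```

A loop placed like this crosses a 32-byte boundary but not a 64-byte one, so if its performance matches a fully aligned loop, that points to 64 bytes being the boundary that matters.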