Code alignment issues (dendibakh.github.io)
61 points by nkurz on Jan 21, 2018 | 12 comments



This is a uop cache issue. You can replace the entire body of the loop with some big NOPs, as long as they are the same total size as the existing instructions, and see the same effect: as soon as the loop spans two different 64-byte (!! see my reply below) aligned blocks, performance goes down. This holds even for larger iteration counts if your LSD is disabled by microcode: asymptotically, the fast loop approaches 1 cycle per iteration and the slow one 2 cycles per iteration.

This happens because the uop cache (also known as the DSB) can only deliver instructions from one cache set at a time, and 64-byte (??) blocks map to different sets (that is, the uop cache holds a totally opaque decoded format unrelated to the instruction encoding, but the cache organization still depends on the original instruction sizes and alignment in the binary).

So if your loop spans 2 different 64-byte blocks, it will never run faster than 2 cycles per iteration. Normally this is unlikely to be a problem: the vast majority of loops take more than 2 cycles per iteration anyway, so this won't be a bottleneck. Loops that do take 1 cycle have, by definition, only a few instructions, so they aren't that likely to span a 64-byte boundary. This loop happened to have very long VEX-encoded instructions with offsets and so took up 28 bytes, even though it's still capable of running in less than 2 cycles, so it had nearly a 50/50 chance of spanning a 64-byte boundary...
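The argument above can be sketched as a toy model (a simplification I'm assuming, not a full pipeline model: the uop cache delivers from one 64-byte-aligned region per cycle, so a loop spanning N blocks needs at least N cycles per iteration):

```python
def min_cycles_per_iteration(loop_start: int, loop_len: int, block: int = 64) -> int:
    """Lower bound on cycles/iteration for a tiny loop, assuming the uop
    cache can deliver uops from only one `block`-byte aligned region per
    cycle: a loop spanning N blocks needs at least N cycles."""
    first_block = loop_start // block
    last_block = (loop_start + loop_len - 1) // block
    return max(1, last_block - first_block + 1)

# A 28-byte loop starting at offset 0x20 fits in one 64B block -> 1 cycle/iter,
# while the same loop starting at offset 0x30 ends at 0x4B, crossing into the
# next block -> at least 2 cycles/iter.
```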

Now in principle compilers should come to the rescue here, but they are still busy aligning loops to 16-byte boundaries, which is the old advice from the pre-uop-cache days. Still, with 16-byte alignment you pretty much have a 75% chance of getting lucky: only the 64N + 48 starting offset ends up crossing a 64B boundary. So as it turns out, 16-byte alignment is still better than nothing: with no loop alignment at all, a 28-byte function often crosses a 64B boundary (about 42% of the time for randomly distributed starting bytes).
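Both percentages above can be checked with a quick count over starting offsets (a sketch using the 28-byte loop size from earlier in the thread):

```python
BLOCK, LOOP = 64, 28  # 64-byte uop cache blocks, 28-byte loop body

# With 16-byte alignment, the loop starts at offset 0, 16, 32 or 48 (mod 64).
crossing_16 = [s for s in range(0, BLOCK, 16) if s + LOOP > BLOCK]
# Only offset 48 crosses a block boundary: a 25% chance (75% chance of luck).

# With a uniformly random starting byte, count offsets mod 64 that cross.
crossing_any = sum(1 for s in range(BLOCK) if s + LOOP > BLOCK)
# 27 of 64 offsets cross, i.e. about 42%.
```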

The various other alignment effects are second-order issues, such as a split macro-fused branch slowing things down, or just changes in the number of nops that have to be executed.


The above comment originally used "32-byte boundary" everywhere it now uses 64-byte - the conventional wisdom was always that the uop cache operated on 32-byte boundaries, and this is almost certainly true for Haswell and earlier.

However, I noticed after a bit more testing that the critical boundary here is actually 64 bytes, not 32. That's odd, as I always understood the boundary to be 32 bytes on Skylake as well, but either (a) it changed to 64 bytes in Skylake, or (b) there is a second-order effect at 64 bytes, and the uop cache can actually deliver instructions from two different ways in a single cycle (seems hard!).

I note that Wikichip, which has only "so-so" accuracy, does note:

µOP Cache instruction window is now 64 Bytes (from 32)

for Skylake-client https://en.wikichip.org/wiki/intel/microarchitectures/skylak...

So maybe that's a little-noticed change.


>> Aligning the code means compiler will insert NOPs before the code you want to align. That increases binary size and might cost you performance if you insert a lot of nops in the hot path. In the end executing nops doesn’t come for absolutely free. You need to fetch and decode it.

This statement makes me wonder if a variant of NOP would be useful - igNOP - ignore ops until the next alignment. It would tell the CPU to treat every remaining instruction in the current block as a no-op. I somewhat doubt this would help, as I think the problem is not so much that extra nops need to be fetched and decoded, but some other bottleneck. A "nice" consequence would be that you could pack extra data after the igNOP and before the next alignment boundary, to be unpacked elsewhere. It would probably cause headaches for debugging and security concerns... Can anyone comment on this?


This functionality essentially already exists in the form of multi-byte NOPs: https://stackoverflow.com/questions/25545470/long-multi-byte.... Because of the way the decoder fetches instructions, any approach that requires the decoder to act conditionally on anything other than individual instruction lengths is likely impossible.
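For reference, here are the multi-byte NOP encodings recommended in Intel's optimization manual (lengths 1 through 9 bytes), plus a small helper - my own illustrative sketch, not anything a real assembler is required to do - showing how an assembler can build arbitrary padding from them:

```python
# Intel's recommended multi-byte NOP encodings, keyed by length in bytes.
MULTI_BYTE_NOPS = {
    1: bytes([0x90]),                                              # nop
    2: bytes([0x66, 0x90]),                                        # 66 nop
    3: bytes([0x0F, 0x1F, 0x00]),                                  # nop dword [eax]
    4: bytes([0x0F, 0x1F, 0x40, 0x00]),                            # nop dword [eax+0]
    5: bytes([0x0F, 0x1F, 0x44, 0x00, 0x00]),
    6: bytes([0x66, 0x0F, 0x1F, 0x44, 0x00, 0x00]),
    7: bytes([0x0F, 0x1F, 0x80, 0x00, 0x00, 0x00, 0x00]),
    8: bytes([0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00]),
    9: bytes([0x66, 0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00]),
}

def nop_padding(n: int) -> bytes:
    """Build n bytes of padding, greedily using the longest recommended NOPs
    so that as few instructions as possible are decoded."""
    out = b""
    while n > 0:
        k = min(n, 9)
        out += MULTI_BYTE_NOPS[k]
        n -= k
    return out
```

The point of preferring one long NOP over several `0x90` bytes is exactly the decode cost the article mentions: one 9-byte NOP is a single instruction through the front end.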

While in theory NOP decoding could be a bottleneck, I think it would be a really rare occurrence. Usually a hot loop is going to be fed from the LSD or DSB caches, so the NOPs will already have been removed. It would be interesting to see a benchmark that illustrates a case where excessive alignment actually causes a slowdown.


A whole other approach to alignment would be to strategically lengthen earlier instructions so that the desired alignment is achieved. This avoids adding any executed nops.

It's not as hard as it sounds: there is a lot of redundancy in the x86 encoding, so you can often add REX prefixes, make offsets longer, or add an offset where none exists, etc.
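As one concrete instance of that redundancy (byte values taken from the x86-64 encoding of `add rax, 1`; the names below are just for illustration): the same instruction has both an imm8 and an imm32 form, so an assembler can pick the longer encoding purely to shift subsequent code to a better alignment.

```python
# Two equivalent x86-64 encodings of `add rax, 1`:
ADD_RAX_1_SHORT = bytes([0x48, 0x83, 0xC0, 0x01])  # REX.W + 83 /0 ib  (imm8, 4 bytes)
ADD_RAX_1_LONG = bytes([0x48, 0x81, 0xC0, 0x01, 0x00, 0x00, 0x00])  # REX.W + 81 /0 id  (imm32, 7 bytes)

# Swapping the short form for the long one moves everything after it
# forward by 3 bytes without executing a single extra instruction.
SLACK = len(ADD_RAX_1_LONG) - len(ADD_RAX_1_SHORT)
```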


This is just a relative jump instruction, which on Intel can be as short as 2 bytes, although an igNOP might fit into one.

It wouldn't solve the icache problem, though. Also, writing to data inline with code would be super problematic performance-wise, since it modifies icache lines that are currently being executed from.


In my daily job I just can't afford, time-wise, to think about things like this - much less take the time to instrument, measure, and test a giant program at this level of granularity. This type of stuff needs to be handled by the language, compiler, or hardware.


Can anybody explain the difference between the baseline and aligned_function cases? As the article states, they contain the exact same code, and the only difference appears to be that in one case, the function starts at C0 while in the other it starts at A0.

As both are 32-byte aligned, none of the uarch features mentioned in the article should be able to explain the difference. If anything, the baseline has the higher alignment (to 64 bytes), and yet the baseline is slower.


See my reply (to myself) below - the effect actually occurs at 64-byte boundaries, not 32. So A0 has a different alignment from C0, modulo 64.
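In concrete terms, using the two start addresses from the article:

```python
baseline, aligned = 0xC0, 0xA0

# Both starts are 32-byte aligned, so nothing differs modulo 32...
assert baseline % 32 == 0 and aligned % 32 == 0

# ...but modulo 64 they land at different offsets within a 64-byte block,
# which is what matters if the critical boundary is 64 bytes.
offsets_mod_64 = (baseline % 64, aligned % 64)  # (0, 32)
```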


Micro benchmarks are usually misleading.


Indeed, for the most part x86 is pretty much insensitive to code alignment, and unless you have a specific case where an extremely tight loop is on the critical path, the extra NOPs are almost certainly going to cause an overall slowdown as they push other code out of the cache. Always benchmark total performance.


Keep in mind the first two examples ("baseline" and "no_foo") don't execute any nops - the only nops are outside the function bodies themselves and are never executed.

Microbenchmarks can still be interesting in realistic scenarios (e.g., the kernel of some decoding algorithm, a math kernel, whatever) - but this one doesn't really fit the bill, since the core loop iteration count is so low (4 iterations) that a lot of the interesting effect is probably just due to function call overheads, store-forwarding, and so on.

x86 still has a lot of ways in which it is sensitive to code alignment (in fact, the list is longer than it used to be), but yeah, they don't matter as much, especially since the uop cache was introduced. The uop cache still has lots of alignment-related rules, but the code has to be much more extreme to violate them.

When there was no uop cache, decoding restrictions, which were heavily related to alignment, were often a bottleneck, but those days are over on mainstream x86.




