Keep in mind the first two examples ("baseline" and "no_foo") don't execute any nops - the only nops are outside the function bodies themselves and are never executed.
Microbenchmarks can still be interesting in realistic scenarios (e.g., the kernel of some decoding algorithm, a math kernel, whatever), but this one doesn't really fit the bill: the core loop iteration count is so low (4 iterations) that a lot of the interesting effect is probably just due to function call overheads, store-forwarding and so on.
x86 is still sensitive to code alignment in a lot of ways (in fact, the list is longer than it used to be), but yeah, they don't matter as much, especially since the uop cache was introduced. The uop cache still has lots of alignment-related rules, but the code has to be much more extreme to violate them.
When there was no uop cache, decoding restrictions, which were heavily related to alignment, were often a bottleneck, but those days are over on mainstream x86.
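To make "sensitive to alignment" a bit more concrete, here is a rough sketch of the kind of arithmetic involved: counting how many aligned windows a block of code touches depending on where it starts. The 32-byte window size is an assumption on my part (it matches what is commonly described for the uop cache on several Intel cores, but check your own microarchitecture), and the function name, addresses and 30-byte loop size are made up for illustration.

```python
def windows_spanned(start: int, length: int, window: int = 32) -> int:
    """Count the aligned `window`-byte blocks touched by [start, start + length).

    `window` defaults to 32 bytes, which is an assumed granularity, not a
    value taken from the benchmark under discussion.
    """
    if length <= 0:
        return 0
    first = start // window                 # index of the block holding the first byte
    last = (start + length - 1) // window   # index of the block holding the last byte
    return last - first + 1


if __name__ == "__main__":
    # A hypothetical 30-byte loop body placed at two different addresses.
    for start in (0x1000, 0x1018):
        print(hex(start), windows_spanned(start, 30))
```

The same 30 bytes of code fit in one window when they start on a 32-byte boundary and straddle two when shifted by 24 bytes. Whether that difference costs anything depends on the specific rules of the core in question, which is the point above: the code usually has to be fairly extreme before it matters.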