Don't have a compiler handy (or for that matter anything more recent than a Core 2 Duo), but the thing to do would be to replace the call instruction with a same-length sequence of NOOPs to see if it's an instruction alignment issue. He mentions making sure the loop header is aligned, but there are other alignment issues in play.
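To make that concrete, such a test could look something like this (just a sketch, not code from the post; 'counter' and the iteration count stand in for the post's volatile counter and N, and a 5-byte NOP takes the place of the 5-byte call rel32):

    /* Same layout as loop_with_extra_call, but the 5-byte call rel32
       (E8 xx xx xx xx) is replaced by one 5-byte NOP (0F 1F 44 00 00),
       so only the call itself goes away, not the code layout. */
    volatile unsigned long counter = 0;              /* stand-in for the post's counter */

    void loop_with_same_length_nop(void) {
      for (unsigned j = 0; j < 400000000u; ++j) {    /* stand-in for the post's N */
        __asm__ __volatile__(".byte 0x0f, 0x1f, 0x44, 0x00, 0x00");  /* 5-byte NOP */
        counter++;
      }
    }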
For example, in Intel processors with a u-op cache, a 32-byte fetch group that generates more than 18 u-ops has to go through the x86 decoder and cannot stream from cache.[1] The call instruction, being a pretty big instruction with that 32-bit immediate displacement, might force the loop onto two 32-byte lines, reducing the u-op count in each line below 18 and allowing the loop to stream from the u-op cache instead of going through the decoders.
In general, if adding instructions makes a loop go faster, it's some sort of alignment (maybe code layout is a better term?) issue. Intel CPUs have limits on things like the number of branch predictions per fetch group, number of decoded instructions per fetch group, etc. Some x86 CPUs can only track 3 branches in a 16-byte group, while it might be possible to encode 4 in unusually dense code. In such cases, the extra branch just won't be tracked by the predictor. Seemingly superfluous code can force some of the branches into a different 16-byte group, improving overall performance. (This can't be the case here, because there's just one branch, but it can pop up in branch-heavy code.)
EDIT: Someone in the comments makes a good point that in any case, both loops should fit in the loop buffer, which is after the u-op cache, and so wouldn't suffer from the 18-uop limit mentioned above.

[1] See: http://www.realworldtech.com/sandy-bridge/4
> In general, if adding instructions makes a loop go faster, it's some sort of alignment issue.
Not necessarily. In the early days of the P6 core, I found a simple loop which would take either 4.0 or 5.33 cycles depending on the internal CPU state -- exactly the same code, but running with an iteration count of 10^6 would take either 4000000 cycles or 5333333 cycles.
I had the good fortune to talk to a member of the team which designed the P6 core, and we concluded that this was due to a quirk of the out-of-order scheduler: In order to reduce circuit depths, the mechanism for finding and dispatching runnable uops wouldn't always dispatch the earliest available uops, and so -- depending on how the incoming uops were aligned within an internal buffer -- the P6 core would sometimes take a stream of instructions which could be executed in-order without pipeline stalls and run them out-of-order with less ILP.
In short, OOO cores are weird and horribly complicated and completely untrustworthy where performance is concerned.
> In short, OOO cores are weird and horribly complicated and completely untrustworthy where performance is concerned.
Yep. There's a great talk about this by Cliff Click, called A Crash Course in Modern Hardware[1] that I would recommend to everyone. I am no hardware expert, so that talk really enlightened me.
Regarding the issue at hand, I remember Doug Lea saying[2] that some new Intel processors may recognize a loop as the OS's idle loop and power down the core. That's why he computes random numbers in busy-wait loops.
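For reference, that trick looks something like this (a sketch of the idea only, not Doug Lea's actual code; the xorshift step is just arbitrary junk work to keep the core busy):

    #include <stdint.h>

    volatile int ready;                 /* set to 1 by another thread */

    /* Spin until 'ready', doing PRNG work so the core never looks idle
       and (hopefully) never gets clocked down while we wait. */
    uint64_t spin_wait(void) {
      uint64_t x = 88172645463325252ULL;
      while (!ready) {
        x ^= x << 13;                   /* one xorshift64 step per spin */
        x ^= x >> 7;
        x ^= x << 17;
      }
      return x;                         /* returning x keeps the work from being optimized away */
    }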
Your last sentence is so important, and was basically what I was thinking the entire time reading that post, that I want to repeat it: In short, OOO cores are weird and horribly complicated and completely untrustworthy where performance is concerned.
GPUs are generally in order as far as I know - any out-of-order-style optimization happens at compile time. In general I like high-performance programming much more with CUDA than on the CPU - the performance is usually much more predictable, and as such the time investment tends to correlate pretty well with performance improvements.
GPUs are not only in order but extremely synchronous in general, with a restricted memory model that prevents memory stalls. Not only that, but multiple cores share the same control unit, which is why coherent branching across the cores is such a big deal (they all have to branch the same way or both branches have to be executed...).
Coherent branching is also important on the CPU to make use of your vector units; on the GPU the payoff is just much higher. Furthermore, you never have to do silly things like loop unrolling for vector-unit optimization, since you write all kernels as if they were executed scalar. Once coherent branching is no longer possible, you're usually dealing with random access over your data - even then the higher random-access memory bandwidth can get you a speedup over one CPU socket; compared to dual socket you're about even.
Bottom line is you just have to think about GPUs in a slightly different way - they do have disadvantages (limited register/cache/memory per thread since you need so many of them), but things like branching aren't that big of a deal.
My first thought was uop cache as well. The fact that pre-Sandy Bridge processors have the intuitive performance behavior makes me more certain it's related.
One interesting experiment to try would be to realign the loop to each offset 0..31 (mod 32) and see how/if the behavior changes.
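Something along these lines might do it (a sketch; PAD is a hypothetical compile-time constant, set with -DPAD=0 through -DPAD=31 and rebuilt for each value; check the disassembly to confirm where the NOPs actually land relative to the loop header):

    /* Shift the loop to a different offset mod 32 by emitting PAD one-byte
       NOPs just before it, then compare cycle counts across all 32 builds. */
    #define STR2(x) #x
    #define STR(x) STR2(x)

    volatile unsigned long counter;     /* stand-in for the post's counter */

    void tightloop_shifted(void) {
      __asm__ __volatile__(".rept " STR(PAD) "\n\tnop\n\t.endr");
      for (unsigned j = 0; j < 400000000u; ++j) {
        counter++;
      }
    }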
> both loops should fit in the loop buffer, which is after the u-op cache, and so wouldn't suffer from the 18-uop limit mentioned above
Both loops should fit, but I think the same limits for micro-op density apply. The Optimization Manual says the Loop Stream Detector requires that "All micro-ops are also resident in the Decoded ICache". I took this to mean if something can't be cached by the Decoded ICache, it can't be streamed by the LSD.
You wouldn't NOOP-align the loop header, but instead replace the CALL instruction in the middle of the loop with an equivalent sequence of NOOPs. This would increase the distance between the loop header and the backward jump.
Core2 had an 18 x86 instruction "loop stream detector".
Nehalem went to a 28 u-op (not instruction) LSD.
Sandy Bridge has a 1500 u-op "decoded instruction cache" plus a 28 u-op LSD.
Haswell is the same as SB, but doubles the u-op LSD to 56 if hyperthreading is disabled.
The confusing part is why there is still a tiny loop stream detector in addition to the much larger decoded iCache. Perhaps there is a small additional power savings for code that fits the smaller LSD?
The 18-uop limit comes from how Sandy Bridge (and Haswell) handle the decoded instruction cache. A 32-byte line of the instruction cache can map up to 3 lines in the u-op cache, each of which can hold 6 instructions. Any 32-byte line of instructions that decodes to more than 3x6 = 18 u-ops cannot be stored in the u-op cache.
I think it's because the branch target is a memory access, so the dcache causes the execution pipe to stall on the load. In the loop with a call, the branch target is the call, and there is time enough to pull from dcache before the load data is needed. I suspect that the i7 can't pull data from dcache immediately after the jne instruction, so you get a hiccup. Try adding a second noop into tightloop as the target for jne.
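If I read that right, the experiment would be roughly this (a sketch; whether the backward jne really lands on the NOP depends on what the compiler emits, so the disassembly needs checking):

    volatile unsigned long counter;     /* stand-in for the post's counter */

    void tightloop_nop_first(void) {
      for (unsigned j = 0; j < 400000000u; ++j) {
        __asm__ __volatile__("nop");    /* intended landing pad for the backward jne */
        counter++;
      }
    }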
The trace cache on that chip likely stores a u-op trace that includes the target of the call instruction, which means the memory subsystem shouldn't even need to fetch it separately since it'll come straight back from the trace cache right in line as part of the instruction sequence.
Here is a version of the code that runs even faster than the asm("call foo") one by inserting nops between the load and the use of the loaded value. However, it became even faster by inserting a few nops at the beginning of the loop too.
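The code itself isn't reproduced in the comment, but the shape being described is roughly this (a sketch, with the counter update written out as explicit load / NOPs / add / store in inline asm; it assumes the C global is visible to the assembler as 'counter', as on ELF targets):

    volatile unsigned long counter;

    void loop_with_nops_after_load(void) {
      for (unsigned j = 0; j < 400000000u; ++j) {
        __asm__ __volatile__(
            "movq counter(%%rip), %%rax\n\t"
            "nop\n\tnop\n\tnop\n\t"              /* padding between the load and its use */
            "addq $1, %%rax\n\t"
            "movq %%rax, counter(%%rip)"
            : : : "rax", "memory");
      }
    }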
The fastest version that I could find is to do a prefetch of counter at the end of the loop... Speaking of which, isn't that nopw that GCC uses for alignment a prefetch instruction too? Is the CPU tricked by the fake address used there?
I think I've discovered that the target of the prefetch is irrelevant, and that what matters is that the store-to-load forwarding does not try to execute in the same cycle. For Sandy Bridge, I'm finding that I can get slightly better performance with a dummy load of an unrelated volatile variable:
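The code isn't shown here either, but the idea is roughly this (a sketch; 'dummy' is a hypothetical unrelated volatile):

    volatile unsigned long counter;
    volatile unsigned long dummy;       /* hypothetical unrelated variable */

    void loop_with_dummy_load(void) {
      for (unsigned j = 0; j < 400000000u; ++j) {
        unsigned long t = dummy;        /* volatile read of an unrelated location */
        (void)t;
        counter++;
      }
    }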
The prefetch is only serving as 6-byte padding here. The nopw instruction performs no memory access; it does not double as a prefetch.
There are remarkably few memory accesses actually being performed in that loop. The CPU is using store forwarding to satisfy the 'memory' accesses out of the store buffer, which means most accesses are not even touching the L1 cache (if they were, we would not see such a low count of cycles per iteration).
The best result I've got is by inserting a 6-byte nopw right before the store:
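Written out as inline asm, that variant would look roughly like this (a sketch, not the commenter's exact code; 66 0F 1F 44 00 00 is the standard 6-byte nopw encoding):

    volatile unsigned long counter;

    void loop_with_nopw_before_store(void) {
      for (unsigned j = 0; j < 400000000u; ++j) {
        __asm__ __volatile__(
            "movq counter(%%rip), %%rax\n\t"
            "addq $1, %%rax\n\t"
            ".byte 0x66, 0x0f, 0x1f, 0x44, 0x00, 0x00\n\t"   /* 6-byte nopw */
            "movq %%rax, counter(%%rip)"
            : : : "rax", "memory");
      }
    }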
µ-op stands for micro-operation. Behind their CISC front end, modern [0] x86 (and x64) processors actually work like RISC machines: x86 opcodes are translated into these µops on the fly and then executed.
I'm not familiar with the caching mechanism, but here's an educated guess. There are potential optimizations in the CISC -> RISC translation, depending on the x86 opcode sequence (reordering operations so that some of them run in parallel, for example), and it is possible to cache the results, sparing cycles in the process, since the processor wouldn't have to analyze the code each time to perform the optimizations.
Edit: apparently, my guess was correct :-) Thanks to Symmetry for the confirmation.
--
[0] By modern, I mean "not ancient". The first processor to do that was the Pentium Pro.
x86 instructions are often big and complicated and hard to decode, so much so that the machine doesn't use them internally in the various buffers and so forth that make up its out-of-order engine. Instead, they're translated into a different set of instructions called uOps which only live inside the inner part of the CPU.
Translating them takes energy, and can potentially be the limiting factor in how fast certain programs can run. So recent Intel designs keep a small buffer of uOps that correspond to instructions the CPU is likely to see again soon. This can help a lot in loops.
I've run into something similar on a different benchmark, where inserting some 'nops' into the preamble of a function actually sped it up by as much as 14% because it made the function align better with a memory boundary, so the CPU could access it faster. Benchmarks, especially ones that don't control for the cases where 'luck'/alignment/register use/etc. can influence the outcome, are terrible testcases for exploring behaviour.
Don't you always have to pad back with nops (0x90), since the scheduler might be looking ahead for other instructions to execute (or maybe not, since it sees the jmp back)? Just wildly guessing here...
I'm surprised that GCC generates that code. Using read-modify-write instructions directly on memory has been better on x86 processors for quite a while now, since both AMD and Intel support fusing the memory and arithmetic operations into a single micro-op.
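For reference, the two shapes being compared, written here as inline asm rather than actual compiler output (a sketch):

    volatile unsigned long counter;

    /* Split form: roughly the load / add / store sequence gcc emits
       for the volatile increment. */
    void inc_split(void) {
      __asm__ __volatile__(
          "movq counter(%%rip), %%rax\n\t"
          "addq $1, %%rax\n\t"
          "movq %%rax, counter(%%rip)"
          : : : "rax", "memory");
    }

    /* Read-modify-write form: a single instruction operating on memory. */
    void inc_rmw(void) {
      __asm__ __volatile__("addq $1, counter(%%rip)" : : : "memory");
    }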
Curiously enough, using separate reads and writes, with appropriate nop padding, seems to do better than any read-modify-write variation I tried.
By the way, read-modify-write is not fused altogether: only the write part is fused, so it generates 2 unfused uops (read, add), plus the fused uop for the write + address generation.
EDIT: Intel's compiler also generates read-modify-write code, but it ends up being slower than both gcc and clang.
I also have gotten better results with separate read, modify, and write instructions than with combined. I think this is because the separate assembly statements allow for more explicit scheduling.
I was studying this difference between the two compilers for a different issue, actually. According to the standard, I don't see anything that prevents an increment of a volatile from being performed in a single instruction if the target permits it. Splitting it into a separate load+binop+store, as gcc does, seems like an over-pessimization.
- tightloop() does nothing useful and is not realistic assembly code, so modern CPU optimisations aren't targeted for it
- loop_with_extra_call() does nothing useful either, but might look like a more realistic instruction sequence - the call could modify 'counter', making it appear to be a realistic sequence which the CPU optimisations are targeted for.
End result: the CPU is better designed for handling cases like loop_with_extra_call(). In the real world this makes realistic code faster, and this benchmark is just a weird quirk.
Disclaimer: I am not qualified enough to say I know what I'm talking about.
Boy. I know more or less informed speculation is the best we can do, but it would sure be fun if detailed information about the CPU were public so such things could actually be tracked down.
I agree, but asking this of Intel is the equivalent of asking Google to bare the details of their search ranking algorithm so that we could understand some odd search result. The clever tricks and optimizations they do in the cores are their crown jewels, just as the guts of clever, highly-tuned algorithms are for many software companies. It just so happens that many people have a high-level knowledge of modern out-of-order CPUs (just as many people have read Google's PageRank paper) and so can speculate about what might be going on inside.
The real question is whether any of these compiler heroics are actually any good or serve any purpose other than killing time and curiosity. No one knows how "optimized" NOP'ped code will fare on post-i7.
A modern Intel CPU is the most complex proprietary trade secret in the world. That's all Intel wants us to know.
I know this may be silly to suggest, but isn't this just a processor pipelining problem? It seems extremely likely to me that the loop with the extra call just lets the processor carry on better than the tight loop, where it has to forcibly reload and store the variable (to be compliant with volatile).
My assembly is a little rusty but I would just assume that the processor is blocking on memory and the waits just add up.
I don't think this should be the case on the type of out of order superscalar pipeline that this core has, as far as I'm aware of it. If there's a bind on waiting for memory an instruction should sit in a reservation station until its data comes in and that should take the same amount of time in either case. (In addition, with trivial accesses like this, it is highly likely a prefetcher would predict the access and have it fetched from cache in time. Though perhaps prefetch behavior works better in the longer loop?)
My intuition is that this is more likely to be behavior related to the trace cache and rayiner's comment points out several good suggestions on what might be happening here.
On the other hand, given the likely difference in how these instructions may be represented in the trace cache, there may also be interactions between the order and groups of instructions dispatched from the trace cache and that may change how things occupy different slots in reservation stations. Which may change the amount of instruction level parallelism in the loop. It doesn't seem likely, but it may be the case.
Trace cache shouldn't be the issue because the loop is simply executing too slowly. Even with very pessimistic alignment, HSW should get well more than one instruction per clock from that loop.
The culprit is probably store-to-load forwarding. Memory ops cannot wait until the coast is clear like data ops do because the addresses are not known early enough in the pipeline, so they are dispatched well before it's even known that there will be a store-to-load hit. The system to maintain the memory model is very complex, not well understood outside Intel, and might conceivably include optimistic optimizations. In such cases, it's completely feasible that issuing another store in the pipeline might reduce the amount of work that is rolled back and needs to be redone.
> Even with very pessimistic alignment, HSW should get well more than one instruction per clock from that loop.
Why? In Haswell, a store needs an AGU (ports 2, 3, or 7) and the store data unit (port 4). Even with perfect store-load forwarding, you can only write one value per cycle into the store buffer.