Intel i7 loop performance anomaly (thegreenplace.net)
168 points by signa11 on Dec 3, 2013 | 61 comments



Don't have a compiler handy (or for that matter anything more recent than a Core 2 Duo), but the thing to do would be to replace the call instruction with a same-length sequence of NOOPs to see if it's an instruction alignment issue. He mentions making sure the loop header is aligned, but there are other alignment issues in play.

For example, in Intel processors with a u-op cache, a 32-byte fetch group that generates more than 18 u-ops has to go through the x86 decoder and cannot stream from cache.[1] The call instruction, being a pretty big instruction with that 32-bit immediate displacement, might force the loop onto two 32-byte lines, reducing the u-op count in each line below 18 and allowing the loop to stream from the u-op cache instead of going through the decoders.

In general, if adding instructions makes a loop go faster, it's some sort of alignment (maybe code layout is a better term?) issue. Intel CPUs have limits on things like the number of branch predictions per fetch group, number of decoded instructions per fetch group, etc. Some x86 CPUs can only track 3 branches in a 16-byte group, while it might be possible to encode 4 in unusually dense code. In such cases, the extra branch just won't be tracked by the predictor. Seemingly superfluous code can force some of the branches into a different 16-byte group, improving overall performance. (This can't be the case here, because there's just one branch, but it can pop up in branch-heavy code.)

EDIT: Someone in the comments makes a good point that in any case, both loops should fit in the loop buffer, which is after the u-op cache, and so wouldn't suffer from the 18-uop limit mentioned above.

[1] See: http://www.realworldtech.com/sandy-bridge/4


In general, if adding instructions makes a loop go faster, it's some sort of alignment issue.

Not necessarily. In the early days of the P6 core, I found a simple loop which would take either 4.0 or 5.33 cycles depending on the internal CPU state -- exactly the same code, but running with an iteration count of 10^6 would take either 4000000 cycles or 5333333 cycles.

I had the good fortune to talk to a member of the team which designed the P6 core, and we concluded that this was due to a quirk of the out-of-order scheduler: In order to reduce circuit depths, the mechanism for finding and dispatching runnable uops wouldn't always dispatch the earliest available uops, and so -- depending on how the incoming uops were aligned within an internal buffer -- the P6 core would sometimes take a stream of instructions which could be executed in-order without pipeline stalls and run them out-of-order with less ILP.

In short, OOO cores are weird and horribly complicated and completely untrustworthy where performance is concerned.


> In short, OOO cores are weird and horribly complicated and completely untrustworthy where performance is concerned.

Yep. There's a great talk about this by Cliff Click, called A Crash Course in Modern Hardware[1] that I would recommend to everyone. I am no hardware expert, so that talk really enlightened me.

Regarding the issue at hand, I remember Doug Lea saying[2] that some new Intel processors may recognize a loop as the OS's idle loop and power down the core. That's why he computes random numbers in busy-wait loops.

[1] http://www.infoq.com/presentations/click-crash-course-modern...

[2] http://emergingtech.chariotsolutions.com/2013/04/phillyete-s...


Your last sentence is so important, and was basically what I was thinking the entire time reading that post, that I want to repeat it: In short, OOO cores are weird and horribly complicated and completely untrustworthy where performance is concerned.


With Atom switching to OOO, are there any mainstream processors left that are in order?


Mainstream desktop processors? Not really. But there's tons in the embedded and mobile worlds. (All the ARM A7 cores, for instance.)

In-order comes back during strange times. Sun switched back to in-order for Niagara, for instance.


GPUs are generally in order as far as I know - any OOO-style optimization happens at compile time. In general I like high-performance programming much more using CUDA than on a CPU - the performance is usually much more predictable, and as such the time investment tends to correlate pretty well with performance improvements.


GPUs are not only out of order but extremely synchronous in general with a restricted memory model that prevents memory stalls. Not only that, but multiple cores share the same control unit, which is why coherent branching across the cores is such a big deal (they all have to branch the same way or both branches have to be executed...).


Edit: I meant to say "GPUs are not only in order", they are definitely not out of order.


Coherent branching is also important on CPUs to make use of your vector units; on GPUs the payoff is just much higher. Furthermore, you never have to do silly things like loop unrolling for vector-unit optimization, since you write all kernels as if they operated on a single scalar element. Once coherent branching is no longer possible you're usually dealing with random access over your data - even then the higher random-access memory bandwidth can get you a speedup over one CPU socket; compared to a dual socket you're about even.

Bottom line is you just have to think about GPUs in a slightly different way - they do have disadvantages (limited register/cache/memory per thread since you need so many of them), but things like branching aren't that big of a deal.


Many ARM processors are in order. Also Itanium.


Itanium may be in order, but if we're going for ease-of-reasoning, seek your fortunes elsewhere. The memory model gave me fits.


The PS3's Cell processor is in-order (both the PowerPC cores and the SPU co-processor cores) but that is getting to be a decade old.


The processor industry is a mess (especially x86). The micro-architectures are fine tuned for benchmarks, at the expense of clean and reliable design.


My first thought was uop cache as well. The fact that pre-Sandy Bridge processors have the intuitive performance behavior makes me more certain it's related.

One interesting experiment to try would be to realign the loop to 0..31 mod 32, and see how/if the behavior changes.


both loops should fit in the loop buffer, which is after the u-op cache, and so wouldn't suffer from the 18-uop limit mentioned above

Both loops should fit, but I think the same limits for micro-op density apply. The Optimization Manual says the Loop Stream Detector requires that "All micro-ops are also resident in the Decoded ICache". I took this to mean if something can't be cached by the Decoded ICache, it can't be streamed by the LSD.


That's already being discussed in the comments, and the author claims NOOP-aligning the code has no effect.


You wouldn't NOOP-align the loop header, but instead replace the CALL instruction in the middle of the loop with an equivalent sequence of NOOPs. This would increase the distance between the loop header and the backward jump.
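
For instance (just a sketch, assuming the call uses the usual 5-byte e8 rel32 encoding), the replacement could be a single long NOP of the same length, so every other byte in the loop stays at the same address:

    call   foo                   # e8 xx xx xx xx   (5 bytes)
    # same-length substitute:
    nopl   0x0(%rax,%rax,1)      # 0f 1f 44 00 00   (5-byte long NOP)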


Is the 18 u-op limit still the case for Haswell?


No, there is no such limit, but it's confusing.

Core2 had an 18 x86 instruction "loop stream detector".

Nehalem went to a 28 u-op (not instruction) LSD.

Sandy Bridge has a 1500 u-op "decoded instruction cache" plus a 28 u-op LSD.

Haswell is the same as SB, but doubles the u-op LSD to 56 if hyperthreading is disabled.

The confusing part is why there is still a tiny loop stream detector in addition to the much larger decoded iCache. Perhaps there is a small additional power savings for code that fits the smaller LSD?

But in general, you no longer have to worry about instruction decoding speed if your loops are less than 1500 u-ops. With some caveats: https://www-ssl.intel.com/content/dam/www/public/us/en/docum...


The 18-uop limit comes from how Sandy Bridge (and Haswell) handles the decoded instruction cache. A 32-byte line of the instruction cache can map to up to 3 lines in the u-op cache, each of which can hold 6 u-ops. Any 32-byte line of instructions that decodes to more than 3x6 = 18 u-ops cannot be stored in the u-op cache.


I think it's because the branch target is a memory access, so the dcache causes the execution pipe to stall on the load. In the tight loop with a call, the branch target is a call and there is time enough to pull from dcache before the load data is needed. I suspect that the i7 can't pull data from dcache immediately after the jne instruction, so you get a hiccup. Try adding a second noop into tightloop as the target for jne.
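
Roughly this shape (a sketch adapted from the gcc loop body quoted elsewhere in the thread), with the extra nop as the jne target so the load is no longer the first instruction after the branch:

    .L3:
        nop                            # new jne target
        movq    counter(%rip), %rdx
        addq    %rax, %rdx
        addq    $1, %rax
        cmpq    $400000000, %rax
        movq    %rdx, counter(%rip)
        jne     .L3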


The trace cache on that chip likely stores a u-op trace that includes the target of the call instruction, which means the memory subsystem shouldn't even need to fetch it separately since it'll come straight back from the trace cache right in line as part of the instruction sequence.


I think you might be right about stalls.

Here is a version of the code that runs even faster than the asm("call foo") one by inserting nops between the load and the use of the loaded value. However it became even faster by inserting a few nops at the beginning of the loop too.

The fastest version that I could find is to do a prefetch of counter at the end of the loop .... speaking of which isn't that nopw that GCC uses for alignment a prefetch instruction too? Is the CPU tricked by the fake address used there?

  void prefetch() {
    unsigned j;
    for (j = 0; j < N; ++j) {
      counter += j;
      __builtin_prefetch(&counter, 0, 3);
    }
  }

  void nop_wait() {
    unsigned j;
    for (j = 0; j < N; ++j) {
      unsigned x = counter;
      /* 4 or more nops seem right, 3 nops are slower */
      __asm__("nop");
      __asm__("nop");
      __asm__("nop");
      __asm__("nop");
      counter = x + j;
    }
  }

  tightloop:
   3,000,314,154 cycles                    #    0.000 GHz                      ( +-  0.01% )
   2,400,971,094 instructions              #    0.80  insns per cycle          ( +-  0.00% )

   0.788256283 seconds time elapsed                                          ( +-  0.02% )

  loop_with_call:
   3,000,314,154 cycles                    #    0.000 GHz                      ( +-  0.01% )
   2,400,971,094 instructions              #    0.80  insns per cycle          ( +-  0.00% )

     0.788256283 seconds time elapsed                                          ( +-  0.02% )

  nopwait:
   2,679,341,970 cycles                    #    0.000 GHz                      ( +-  0.37% )
   4,000,906,874 instructions              #    1.49  insns per cycle          ( +-  0.00% )

     0.704209471 seconds time elapsed 
vs

  prefetch:
   2,586,821,497 cycles                    #    0.000 GHz                      ( +-  0.54% )
   2,800,888,975 instructions              #    1.08  insns per cycle          ( +-  0.00% )

     0.679998564 seconds time elapsed 

This is on a Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz


[can't edit previous post]

In fact, replacing the nopw that is used for alignment with a prefetch instruction gives me something even slightly faster:

  00000000004004e0 <prefetchit>:
  4004e0:       31 c0                   xor    %eax,%eax
  4004e2:       0f 18 0d 87 04 20 00    prefetcht0 0x200487(%rip)        # 600970 <counter>
  4004e9:       48 8b 15 80 04 20 00    mov    0x200480(%rip),%rdx        # 600970 <counter>
  4004f0:       48 01 c2                add    %rax,%rdx
  4004f3:       48 83 c0 01             add    $0x1,%rax
  4004f7:       48 3d 00 84 d7 17       cmp    $0x17d78400,%rax
  4004fd:       48 89 15 6c 04 20 00    mov    %rdx,0x20046c(%rip)        # 600970 <counter>
  400504:       75 dc                   jne    4004e2 <prefetchit+0x2>
  400506:       f3 c3                   repz retq 
  400508:       0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
  40050f:       00 

     2,445,505,116 cycles                    #    0.000 GHz                      ( +-  0.52% )
     2,800,857,971 instructions              #    1.15  insns per cycle          ( +-  0.00% )

       0.643010038 seconds time elapsed


I think I've discovered that the target of the prefetch is irrelevant, and that what matters is that the store-to-load forwarding does not try to execute in the same cycle. For Sandy Bridge, I'm finding that I can get slightly better performance with a dummy load of an unrelated volatile variable:

  volatile unsigned dummy = 0;
  void loop_dummy_read() {
    IACA_START;
    unsigned j;
    unsigned dummy_read;
    for (j = 0; j < N; ++j) {
      dummy_read = dummy;
      counter += j;
    }
    IACA_END;
  }


That's almost 20% faster than the original. Not bad at all.


The prefetch is only serving as 6-byte padding here. The nopw instruction performs no memory access, it does not double as a prefetch.

There are remarkably few memory accesses actually being performed in that loop. The CPU is using store forwarding to cache the 'memory' accesses in the store buffer, which means most accesses are not even hitting the L1 cache (if they were, we would not see such a low count of cycles per iteration).

The best result I've got is by inserting a 6-byte nopw right before the store:

    .L3:
        movq    counter(%rip), %rdx
        addq    %rax, %rdx
        addq    $1, %rax
        cmpq    $400000000, %rax
        nopw    0x1(%rax,%rax,1)
        movq    %rdx, counter(%rip)
        jne     .L3


What trace cache?

That's a P4 thing, not a Core thing http://en.wikipedia.org/wiki/CPU_cache#Trace_cache

Edit: ok, Core has a uop cache http://en.wikipedia.org/wiki/Micro-operation_cache#Micro-ope...

You don't need a trace cache to do what you suggest; the CPU just predicts the branch and goes from there.


NOPs are removed by the scheduler. Depending on what precisely is happening, this might not help.


I tested on an Intel Xeon W3680, which is at least 2 years old, and the result is the opposite. Perhaps it only happens on newer CPUs?

  gcc -O2 -o main main.c
  perf stat -r 10 -e cycles,instructions ./main t
  
   Performance counter stats for './main t' (10 runs):
  
       2,408,573,034 cycles                    #    0.000 GHz                      ( +-  0.07% )
       2,401,351,221 instructions              #    1.00  insns per cycle          ( +-  0.00% )
  
         0.723734145 seconds time elapsed                                          ( +-  0.12% )
  
  perf stat -r 10 -e cycles,instructions ./main c
  
   Performance counter stats for './main c' (10 runs):
  
       2,802,522,431 cycles                    #    0.000 GHz                      ( +-  0.00% )
       3,201,523,974 instructions              #    1.14  insns per cycle          ( +-  0.00% )
  
         0.842082646 seconds time elapsed                                          ( +-  0.07% )


That's the Nehalem u-arch, which (unlike Sandy Bridge and Haswell) does not have a u-op cache. So that's an interesting data point.


I didn't know about u-ops and just looked them up. Could you explain it in more detail? How is it involved in this case?


µ-op stands for micro-operation. Behind their CISC front end, modern [0] x86 (and x64) processors actually work like RISC processors internally. x86 opcodes are translated to these µops on the fly, and executed.

I'm not familiar with the caching mechanism, but here's an educated guess: there are potential optimizations in the CISC -> RISC translation, depending on the x86 opcode sequence (reordering operations so some of them run in parallel, for example), and it is possible to cache the result, sparing cycles in the process, since the processor wouldn't have to analyze the code each time to perform the optimizations.

Edit: apparently, my guess was correct :-) Thanks to Symmetry for the confirmation.

--

[0] By modern, I mean "not ancient". The first processor to do that was the Pentium Pro.


x86 instructions are often big and complicated and hard to decode, so much so that the machine doesn't use them internally in the various buffers and so forth that make up its out-of-order engine. Instead, they're translated into a different set of instructions called uOps which only live inside the inner part of the CPU.

Translating them takes energy, and can potentially be the limiting factor in how fast certain programs can run. That's why recent Intel designs keep a small buffer of uOps corresponding to instructions the CPU is likely to see again soon. This can help a lot in loops.
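
As a rough illustration (not the exact breakdown or fusion behaviour of any particular core), a single memory-destination add turns into several uOps inside the machine:

    addq    %rax, counter(%rip)    # one x86 instruction, roughly these uOps:
                                   #   load   tmp  <- [counter]
                                   #   add    tmp  <- tmp + rax
                                   #   store  [counter] <- tmp   (address + data)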


I've run into something similar on a different benchmark, where inserting some 'nops' into the preamble of a function actually sped it up by as much as 14% because it aligned the function better with a memory boundary, so the CPU could access it faster. Benchmarks, especially ones that don't control for the cases where 'luck'/alignment/register use/etc. can influence the outcome, are terrible test cases for exploring behaviour.


Google's microoptimizer actually has "try inserting random NOOPs" as one of its optimization passes: http://code.google.com/p/mao/


He mentioned in the comments that he ensured everything is aligned to 64-bit boundaries


64 bit might not be the right alignment, and even aligned to 64 bit boundaries you can run into page and cacheline issues.
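
For what it's worth, cache-line alignment of a loop header can be requested in gas with a power-of-two alignment directive (a sketch, not necessarily what the author used):

    .p2align 6        # pad with nops up to the next 2^6 = 64-byte boundary
    .L3:              # loop header now starts on a cache-line boundary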


Just to be clear, I tried it with 64-byte alignment


Don't you always have to pad back with nops (0x90), since the scheduler might be looking ahead for other instructions to execute? (Or maybe not, since it sees the jmp back.) Just wildly guessing here...


Another interesting point is that only the gcc-produced binary has that behavior; the clang-generated one is the opposite. Check this comment about testing on Sandy Bridge with gcc and clang: http://eli.thegreenplace.net/2013/12/03/intel-i7-loop-perfor...


This begs for comparing assembly dumps of both.


gcc acts RISCy -- load memory into a register, add registers, store register back to memory.

clang acts CISCy -- use a read-modify-write instruction to add directly in memory.
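
Schematically (the gcc shape matches the disassembly quoted elsewhere in the thread; the clang shape is the single-instruction equivalent):

    # gcc: separate load / add / store
    movq    counter(%rip), %rdx
    addq    %rax, %rdx
    movq    %rdx, counter(%rip)

    # clang: read-modify-write directly on memory
    addq    %rax, counter(%rip)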


I'm surprised that GCC generates that code. Using read-modify-write instructions directly on memory has been better on x86 processors for quite a while now, since both AMD and Intel support fusing the memory and arithmetic operations into a single micro-op.


Curiously enough, using separate reads and writes, with appropriate nop padding, seems to do better than any read-modify-write variation I tried.

By the way, read-modify-write is not fused altogether: only the write part is fused, so it generates 2 unfused uops (read, add), plus the fused uop for the write + address generation.

EDIT: Intel's compiler also generates read-modify-write code, but it ends up being slower than both gcc and clang.


I also have gotten better results with separate read, modify, and write instructions than with combined. I think this is because the separate assembly statements allow for more explicit scheduling.


I was studying this difference between the two compilers for a different issue, actually. According to the standard, I don't see anything that prevents an increment of a volatile from being performed in a single instruction if the target permits it. gcc's splitting it into a separate load+binop+store seems like an over-pessimization.


Similar effects on a Haswell-era Xeon 1240v3:

    $ perf stat -r 10 -e cycles,instructions ./loop t
    
     Performance counter stats for './loop t' (10 runs):
    
         2,631,020,892 cycles                    #    0.000 GHz                      ( +-  0.34% )
         2,403,261,796 instructions              #    0.91  insns per cycle          ( +-  0.00% )
    
           0.713682287 seconds time elapsed                                          ( +-  0.39% )
    
    $ perf stat -r 10 -e cycles,instructions ./loop c
    
     Performance counter stats for './loop c' (10 runs):
    
         2,237,200,035 cycles                    #    0.000 GHz                      ( +-  0.31% )
         3,202,823,909 instructions              #    1.43  insns per cycle          ( +-  0.00% )
    
           0.606741377 seconds time elapsed                                          ( +-  0.29% )


My armchair speculation:

- tightloop() does nothing useful and is not realistic assembly code, so modern CPU optimisations aren't targeted for it

- loop_with_extra_call() does nothing useful either, but might look like a more realistic instruction sequence - the call could modify 'counter', making it appear to be a realistic sequence which the CPU optimisations are targeted for.

End result: the CPU is better designed for handling cases like loop_with_extra_call(). In the real world this makes realistic code faster, and this benchmark is just a weird quirk.

Disclaimer: I am not qualified enough to say I know what I'm talking about.


Could this be due to thermal throttling? Would the first benchmark generate more heat, causing the CPU to throttle down?


The instructions per cycle is the interesting number. It is independent of frequency.


Boy. I know more or less informed speculation is the best we can do, but it would sure be fun if detailed information about the CPU were public so such things could actually be tracked down.


I agree, but asking this of Intel is the equivalent of asking Google to bare the details of their search ranking algorithm so that we could understand some odd search result. The clever tricks and optimizations they do in the cores are their crown jewels, just as the guts of clever, highly-tuned algorithms are for many software companies. It just so happens that many people have a high-level knowledge of modern out-of-order CPUs (just as many people have read Google's PageRank paper) and so can speculate about what might be going on inside.


Looks like people have been there before:

https://code.google.com/p/mao/wiki/NOPIN

The real question is whether any of these compiler heroics are actually any good or serve any purpose other than killing time and curiosity. No one knows how "optimized" NOP'ped code will fare on post-i7.

A modern Intel CPU is the most complex proprietary trade secret in the world. That's all Intel wants us to know.


I know this may be silly to suggest, but isn't this just a processor pipelining problem? I feel like it's extremely likely that the loop with the extra call just lets the processor carry on better than the tight loop, where a variable has to be forcibly reloaded and stored (to be compliant with volatile).

My assembly is a little rusty but I would just assume that the processor is blocking on memory and the waits just add up.


I don't think this should be the case on the type of out-of-order superscalar pipeline this core has, as far as I'm aware. If there's a stall waiting on memory, an instruction should sit in a reservation station until its data comes in, and that should take the same amount of time in either case. (In addition, with trivial accesses like this, it is highly likely a prefetcher would predict the access and have it fetched from cache in time. Though perhaps prefetch behavior works better in the longer loop?)

My intuition is that this is more likely to be behavior related to the trace cache and rayiner's comment points out several good suggestions on what might be happening here.

On the other hand, given the likely difference in how these instructions may be represented in the trace cache, there may also be interactions between the order and groups of instructions dispatched from the trace cache and that may change how things occupy different slots in reservation stations. Which may change the amount of instruction level parallelism in the loop. It doesn't seem likely, but it may be the case.


Trace cache shouldn't be the issue because the loop is simply executing too slowly. Even with very pessimistic alignment, HSW should get well more than one instruction per clock from that loop.

The culprit is probably store-to-load forwarding. Memory ops cannot wait until the coast is clear like data ops do because the addresses are not known early enough in the pipeline, so they are dispatched well before it's even known that there will be a store-to-load hit. The system to maintain the memory model is very complex, not well understood outside Intel, and might conceivably include optimistic optimizations. In such cases, it's completely feasible that issuing another store in the pipeline might reduce the amount of work that is rolled back and needs to be redone.
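
Concretely, the per-iteration dependence runs through memory (the two memory instructions from the gcc loop body quoted earlier in the thread, annotated):

    movq    counter(%rip), %rdx    # load: must be forwarded from the store buffer entry...
    # ... (add, increment, compare) ...
    movq    %rdx, counter(%rip)    # ...written by this same store in the previous iteration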


> Even with very pessimistic alignment, HSW should get well more than one instruction per clock from that loop.

Why? In Haswell, a store needs an AGU (ports 2, 3, or 7) and the store data unit (port 4). Even with perfect store-load forwarding, you can only write one value per cycle into the store buffer.


Instruction per clock, not iteration per clock. Each iteration of the tight loop takes 6 cycles there.


Spent a few hours trying to look into this. The results so far suggest it has to do with the register renaming mechanism, but they're not completely conclusive. You can check out my notes here: http://blog.skanev.org/2013/12/call-performance-weirdness.ht...



