I'm really skeptical that the author's "optimizations" are actually doing what the author thinks they do. Read-after-write hazards do not always result in a stall: modern CPU pipelines have forwarding (bypassing) paths that can resolve RAW hazards without stalling. More likely the gains come from x86 processors being superscalar, with multiple ALUs and FPUs; they can dispatch several floating-point instructions at once, as long as there is no data dependence.
Yes, the author ironically doesn't actually understand what's happening in the pipeline for the processor they're writing about.
On a superscalar processor using reservation stations and Tomasulo's algorithm, a RAW hazard doesn't stall the pipeline. But instructions affected by RAW hazards will still slow down execution if they sit on the critical path. So paying close attention to instruction-level dependencies produces better results either way, since you can increase the ILP available in the instruction sequence. This helps a lot on code that might also run on in-order pipelines (a lot of mobile and embedded chips still use them).
That's just nit-picking the nomenclature. The author is simplifying things greatly, but he's essentially correct: the naive version with fewer operations ends up slower because there is a long chain of dependencies.
With out-of-order execution this isn't really a stall, because non-dependent instructions can still execute and retire alongside it using renamed registers, but it still delays progress of the dependent chain just as an in-order pipeline would. You might say: its progress is stalled.
No. A read-after-write hazard hasn't stalled a CPU in the last 10 years, so the article will mislead all the beginner architects out there. A clear explanation of dependencies would be much more appropriate.
There are CPUs released in the last 10 years that are stalled by RAW hazards, and they are here to stay.
That being said, I think the downvotes are unfair on your post. The author of the post is writing about CPUs that are indeed unlikely to be ever stalled by RAWs.
Sadly, this is only true on proper desktop CPUs. For an example of a CPU that does horribly on RAW hazards, look no further than the Xenon CPU in the Xbox 360.
If you need a result to perform the next operation, you have something more fundamental than a RAW hazard. A RAW hazard isn't guaranteed to stall, whereas a true dependency on the result, regardless of which register it lands in, can't be optimized away: it leaves no opportunity for register renaming or instruction reordering to work. It's probably going a little far to say the author doesn't understand the pipeline. I thought they were over-simplifying the pipeline stages, but the potential for register renaming and instruction reordering doesn't fundamentally change between AMD's and Intel's x86 CPUs, while it does change depending on how you write your C. The results speak for themselves.
> but the potential for register renaming and instruction reordering doesn't fundamentally change between AMD's and Intel's x86 CPUs
People are writing code to run on a lot more than x86_64 these days. The author's way of optimizing is actually more correct for an in-order pipeline than an out-of-order one and that is probably the correct approach for multi-architecture code.
But the author's approach is not particularly well informed by knowledge of what's actually happening in the pipeline of a modern x86 processor. Nor, in this case, does it particularly matter. That's why the author's assertion that knowing about the pipeline helps with code optimization is somewhat ironic: they are modeling their optimization against a completely different pipeline than the one the processor actually implements.
And more interestingly, that's probably the correct way to do it for code that you expect to run on more than one processor generation, or certainly for code that runs on more than one processor architecture.
He points out this is a simplification several times:
"For the sake of simplicity, imagine the CPU’s pipeline
depth is 3. At a given time, it can fetch, execute and
finish one instruction"
and later:
"I don’t know whether this scheduling is optimal for the
(incorrect) assumption of a 3-stage pipeline, but it
does look pretty good. Also, loading a0, a1 etc. from
memory hasn't been covered for the sake of simplicity."
This wouldn't be anywhere near as readable an article if it covered the gory details of multiple decode, dispatch and register renaming. The same analysis still applies reasonably well to an out-of-order pipeline.
The simplification the author points out is arbitrarily picking a pipeline depth of three, not that they simplified the entire concept of what the pipeline is actually doing.
As I've stated multiple times, it's a totally valid way to model it, but it is also useful and important to know it is not how the processor actually works.
To be clear, are you trying to point out that their charts look like in-order modelling on an out-of-order CPU?
Sequential dependencies, especially in looping calculations, turn out-of-order execution into in-order execution. There's no way to reorder the execution beyond some degree of dependency (the compiler term is a loop-carried dependency): an engineer will sooner compress a movie to one bit than enable reordering of certain kinds of sequential dependencies. There's nothing left to do out of order, because every eligible instruction is waiting on something in order.
This isn't even a loop. But yes, the critical path and its dependencies are the main factor determining the running speed of the instruction sequence, unless your processor is crazy enough to do value prediction. ;)
(Value prediction has never been successfully implemented in shipping hardware, but a few papers have been written on it. There are actually some interesting possibilities that essentially involve speculating on a value, forking an SMT thread, and finding out later whether to squash it.)
Run objdump and just look for the function names. Autovectorization is about the only curve-ball I'd expect, but I haven't seen compilers being especially smart about the pipeline or SIMD.
Even without superscalar execution it'd still help, since no sane architecture has floating-point arithmetic with an effective latency of 1 cycle. Current Intel cores have a latency of 3 cycles for FP add, and 4 cycles for FP multiply.
The CPU he was using is spec'd for a base frequency of 2.7GHz, but for microbenchmarks like these it will more likely run at the full turbo frequency of 3.4GHz. That's a 0.294ns instruction cycle, and the fastest result of his optimisation is 15.617ns/call, which is almost exactly 53 cycles/call. That's right in the middle of the timings for the FSINCOS instruction, which on this CPU model (http://www.agner.org/optimize/instruction_tables.pdf) runs in 20-110 cycles -- calculating both sin and cos, to 80 bits of precision. From my experience these FPU instructions take more cycles for edge cases, since they do more iterations to guarantee a specific precision, but for the majority of the input range they'll be closer to the lower end. So I don't think the author really optimised anything here.
The author optimized a Taylor series expansion that is way more generic than just a sin/cos function, and they explained why, at the atomic level of data dependency in the pipeline, their optimization worked.
He did, and then benchmarked it against sin(), so it looked like that was the goal rather than a general polynomial evaluation function, and I'm saying there are faster (and shorter) ways to do that.
1) FMA. The most important optimization, which GCC can do automatically on Haswell with -ffast-math. With FMA and AVX one can calculate 4 (!) doubles simultaneously in only 10 instructions!
2) Register spill. His final version uses 5 xmm registers, while sin3 uses only 3. The sine function is just a primitive, used in more complex calculations. If the final result can't fit in 16 XMM registers, each load/store will bite him. More spill -- more blood.
3) Unordered memory access. His final version accesses the polynomial coefficients in random order. In some cases the compiler may reorder static vars, but not in his benchmark. In synthetic tests all 8 coefficients stay in L1 cache, but in real HPC applications that situation is extremely rare.
If somebody writes
a + x(b + x(c + ...))
one should expect this code to be future-proof and work with any compiler. But if the code is mixed with inline assembly, the result will be doubtful, to say the least.
No need to be a prophet. The first CPU with AVX appeared 3 years ago. Haswell was announced 6 months ago. There are no AVX-512 parts for home users yet, but one can already write AVX-512-optimized code without any special knowledge.
Some people have asked about how these optimizations fare with different compilers. Do the "optimizations" hurt as one changes compilers? Do they help? Do any of them manage to vectorize?
Intel(R) Xeon(R) CPU E5-1620 0 @ 3.60GHz stepping 07
uname -a: Linux 3.5.0-41-generic #64-Ubuntu SMP
clang++ -v: clang version 3.2
icpc -v: icpc version 14.0.1
g++ -v: gcc version 4.7.2
Compiled with "-std=c++11" plus the options mentioned in the names. At a glance, all versions produced the same results to reasonable precision (no gross errors). "-ffast-math" only worked with g++, and did not offer substantial improvement, so I've omitted it from the tests.
So yes, these optimizations can still have a positive effect, but generally less than that of using a better compiler. In this case, that means if you care about performance, use Intel, or if that's not possible, use Clang.
But what about vectorization? Perhaps the compiler can do a better job if you let it make full use of modern SIMD:
Yes, it looks like Intel gains quite a bit from vectorizing the result with 256-bit AVX (4-wide for doubles). If you care about autovectorization, use Intel.
The performance here generally matched my expectations. I have no affiliation with Intel other than having a free academic license, but I find that in general their compiler offers better performance than GCC or Clang. In this case, Intel's best (sin6 with AVX) is 4x faster than GCC's best. I'd usually expect something more like a 20%-40% improvement, but vectorization is one of Intel's strong suits.
For those who wish to explore the reasons for the difference in performance, here are the results of 'objdump -C -d' for the functions in question:
Personally, I'd be very interested to see what an experienced x64 SIMD programmer could do to improve these further. My usual estimate is a further 2x, but I don't know how well that applies to this case. I'd also welcome analysis of the code produced by the compilers.
Your GCC and Clang are outdated, and the results are VERY UNFAIR in many respects.
The Intel compiler violates IEEE 754 by reassociating multiplication and addition. If you want to compare icc with other compilers, you should use the -Ofast option with the others.
Also, if you compile this code with Clang 3.3, you will get the same assembly as with ICC. And with GCC 4.8 you will get even more optimal assembly. Check http://gcc.godbolt.org/ to see the result with different compilers (permalink to code: http://bit.ly/1euyjXZ)
I appreciate your comment. The machine I ran this on has some strange issues that require us to keep around older versions of GCC and Clang, but I was able to install recent ICC. I hadn't known about the online compiler. That said, perhaps you could provide some numbers to back up what you are saying? I'd love to see what you think they should be.
Seeing as the article is about providing rough faster estimations using series expansions, I'm OK with Intel's choice of optimizations. For GCC, does "-Ofast" do anything that "-O3 -ffast-math" doesn't? Because I tried that and it did not help. Clang threw error messages when I tried, and since it wasn't my code I didn't bother deciphering them.
[Does godbolt.org give you any way to download the executable, or do you have to cut and paste the output and assemble that? I can't find a download feature, but it would seem very useful and I don't immediately see any security reasons that they wouldn't offer it.]
Entirely possible I'm doing something wrong. Maybe ICPC is inlining things into main() and not using the assembly I put on pastebin? But in any case, I think the ball is in your court.
> And with GCC 4.8 you will get even more optimal assembly.
Please prove it. Is it optimal because you tested it and it ran faster, or because you believe it should be? I used to have faith in GCC, but having looked at a fair amount of generated SIMD x64 assembly, I now believe that ICC generates better code in most cases.
The math is very simple: the register spill is the same, GCC generates 7 vaddsd and 9 vmulsd, while ICC generates 7 vaddsd and 15 vmulsd. addsd latency is 3 cycles, mulsd latency is 5. There are multiple add and multiply units on Intel CPUs, but this code loads them all, so ICC should be around 30% slower with this assembly.
> Maybe ICPC is inlining things
Check this. It looks like it generates vectorized code. I don't have the Intel compiler, so all I can do here is read the assembly.
> having looked at a fair amount of generated SIMD x64 assembly
There is no SIMD in any of your asm files. All these sin1-sin7 functions accept only one double and return only one result. They work with 128-bit operands, but only the first 64-bit lane contains values used in the operations.
> The math is very simple: register spill is the same, GCC generates 7 vaddsd and 9 vmulsd. ICC generates 7 vaddsd and 15 vmulsd. addsd latency is 3, mulsd latency is 5.
I strongly disagree that it's simple: figuring out the actual performance on a real processor without running the code is far from simple. In particular, the ordering of the instructions can be crucial. Yes, it can be done, but it takes a lot more than counting instructions. Intel's IACA is the most useful tool I've found for this (free), but otherwise I just spend a lot of time with Agner Fog's execution port info.
> There are no SIMD in any of your asm files.
You are absolutely correct. I put the objdump into pastebin, but did not actually look at this code. I just presumed it was vectorized from seeing the large jump in performance with '-mavx'.
Looking closer, it looks like Intel is using a dispatch function, and has a separate vectorized version that it uses when it can. I need to go to sleep, but I'll post the whole objdump so you can inspect: http://pastebin.com/B3EBi0Lq
And if you send me email (my address is in profile) I'll send you a binary to play with. Or if you tell me where to upload, I can do so.
-Ofast is just an option that works with icc, gcc, and clang 3.3. Maybe for gcc 4.8 it includes only "-O3 -ffast-math", but that can change at any moment.
If you care about SIMD, you should try the Intel SPMD Program Compiler (http://ispc.github.io/). Yes, it's a different language, but the result is also very different from normal C compilers. For example, the next thing an experienced SIMD programmer could do is to unroll the loop to process not just 2 doubles (SSE), 4 doubles (AVX), or 8 doubles (AVX-512), but a full cache line, and then reduce the result to a sum. ISPC, on the other hand, does this automatically.
Thanks, I'll check this out. I'm very interested in x64 SIMD. This is the sort of thing I've been doing with it: http://arxiv.org/abs/1401.6399
Although I'm talking up Intel's compiler, and do think it usually beats GCC, my actual experience is that I've largely given up on writing anything performant in C (much less C++). If you give a compiler an inch, it will find a way to hang itself. I'm having much better luck writing the whole function in inline assembly rather than using intrinsics.
Your advice isn't bad, but you should definitely try ISPC before rewriting everything in assembly. I'm currently optimizing the Cycles raytracer (without implementing an actual SIMD packet raytracer), and I'm a little tired of the avalanche effect in modern compilers too. The unusual thing about Cycles is that its code can be compiled by CUDA, OpenCL, and normal C++ compilers. GPU and CPU architectures are very different, so SIMD optimization is not trivial at all. Simple operation reordering can lead to a completely different stack map and a 10% performance drop, and you won't ever know about it until you test with every supported compiler (i.e. gcc, clang, and msvc).