I'm really skeptical that the author's "optimizations" are actually doing what the author thinks they do. Read-after-write hazards do not always result in a stall: modern CPU pipelines have forwarding (bypassing) paths that can resolve RAW hazards without stalling. More likely the gains come from x86 processors being superscalar, with multiple ALUs and FPUs; they can dispatch several floating-point instructions at once, as long as there is no data dependence.
Yes, the author ironically doesn't actually understand what's happening in the pipeline for the processor they're writing about.
On a superscalar processor using reservation stations and Tomasulo's algorithm, a RAW hazard doesn't stall the pipeline. But instructions affected by RAW hazards will still slow down execution if they sit on the critical path. So paying close attention to instruction-level dependencies produces better results either way, since you can increase the ILP available in the instruction sequence. This helps a lot on code that might also run on in-order pipelines (a lot of mobile and embedded chips still use them).
That's just nit-picking the nomenclature. The author is simplifying things greatly, but he's essentially correct: the naive version with fewer operations ends up slower because there is a long chain of dependencies.
With out-of-order execution this isn't really a stall, because non-dependent instructions can still execute and retire alongside it using renamed registers, but it still delays progress of the dependent chain just as an in-order pipeline would. You might say: its progress is stalled.
No. A read-after-write hazard hasn't stalled a CPU in the last 10 years, so the article will mislead all the beginner architects out there. A clear explanation of dependencies would be much more appropriate.
There are CPUs released in the last 10 years that are stalled by RAW hazards, and they are here to stay.
That being said, I think the downvotes are unfair on your post. The author of the post is writing about CPUs that are indeed unlikely to be ever stalled by RAWs.
Sadly, this is only true on proper desktop CPUs. For an example of a CPU that does horribly on RAW hazards, look no further than the Xenon CPU in the Xbox 360.
If you need a result to perform the next operation, you have something more fundamental than a RAW hazard. A RAW hazard isn't guaranteed to stall, whereas a true dependency on the result, regardless of which register it lands in, can't be optimized away: it leaves no opportunity for register renaming or instruction reordering to work. It's probably going a little far to say the author doesn't understand the pipeline. I thought they were over-simplifying the pipeline stages, but the potential for register renaming and instruction reordering doesn't fundamentally change between AMD's and Intel's x86 CPUs, while it does change depending on how you write your C. The results speak for themselves.
> but the potential for register renaming and instruction reordering doesn't fundamentally change between AMD's and Intel's x86 CPUs
People are writing code to run on a lot more than x86_64 these days. The author's way of optimizing is actually more correct for an in-order pipeline than an out-of-order one and that is probably the correct approach for multi-architecture code.
But the author's approach is not particularly well informed by knowledge of what's actually happening in the pipeline of a modern x86 processor. Nor, in this case, does it particularly matter. That's why the author's assertion that knowing about the pipeline helps with code optimization is somewhat ironic: they are modeling their optimization against a completely different pipeline than the one the processor actually implements.
And more interestingly, that's probably the correct way to do it for code that you expect to run on more than one processor generation, or certainly for code that runs on more than one processor architecture.
He points out this is a simplification several times:
"For the sake of simplicity, imagine the CPU’s pipeline
depth is 3. At a given time, it can fetch, execute and
finish one instruction"
and later:
"I don’t know whether this scheduling is optimal for the
(incorrect) assumption of a 3-stage pipeline, but it
does look pretty good. Also, loading a0, a1 etc. from
memory hasn't been covered for the sake of simplicity."
This wouldn't be anywhere near as readable an article if it covered the gory details of multiple decode, dispatch and register renaming. The same analysis still applies reasonably well to an out-of-order pipeline.
The simplification the author points out is arbitrarily picking a pipeline depth of three, not that they simplified the entire concept of what the pipeline is actually doing.
As I've stated multiple times, it's a totally valid way to model it, but it is also useful and important to know it is not how the processor actually works.
To be clear, are you trying to point out that their charts look like in-order modelling on an out-of-order CPU?
Sequential dependencies, especially in looping calculations, turn out-of-order execution into in-order execution. There's no way to reorder the execution beyond some degree of dependency (the compiler term is a loop-carried dependency): an engineer will sooner compress a movie to one bit than enable reordering of certain kinds of sequential dependencies. There's nothing left to do out of order, because every eligible instruction is waiting on something in order.
This isn't even a loop. But yes, the critical path and its dependencies are the main factor determining the running speed of the instruction sequence, unless your processor is crazy enough to do value prediction. ;)
(Value prediction has never been successfully implemented in shipping hardware, but a few papers have been written on it. There are actually some interesting possibilities that essentially involve speculating on a value, forking an SMT thread, and finding out later whether to squash it.)
Run objdump and just look for the function names. Autovectorization is about the only curve-ball I'd expect, but I haven't seen compilers being especially smart about the pipeline or SIMD.
Even without superscalar execution it'd still help, since no sane architecture has floating-point arithmetic with an effective latency of 1 cycle. Current Intel cores have a latency of 3 cycles for FP add, and 4 cycles for FP multiply.
The CPU he was using is spec'd for a base frequency of 2.7GHz, but for microbenchmarks like these it will more likely run at the full turbo frequency of 3.4GHz. That's a 0.294ns instruction cycle, and the fastest result of his optimisation is 15.617ns/call, which is almost exactly 53 cycles/call. That's right in the middle of the timings for the FSINCOS instruction, which on this CPU model (http://www.agner.org/optimize/instruction_tables.pdf) runs in 20-110 cycles -- calculating both sin and cos, to 80 bits of precision. From my experience these FPU instructions take more cycles for edge cases, since they do more iterations to guarantee a specific precision, but for the majority of the input range they'll be closer to the lower end. So I don't think the author really optimised anything here.
The author optimized a Taylor series expansion that is way more generic than just a sin/cos function, and they explained why, at the atomic level of data dependency in the pipeline, their optimization worked.
He did, and then benchmarked it against sin(), so it looked like that was the goal rather than a general polynomial evaluation function, and I'm saying there are faster (and shorter) ways to do that.
1) FMA. The most important optimization, which GCC can do automatically on Haswell with -ffast-math. With FMA and AVX one can calculate 4 (!) doubles simultaneously in only 10 instructions!
2) Register spill. His final version uses 5 xmm registers, while sin3 uses only 3. The sine function is just a primitive, used in more complex calculations. If the final result can't fit in 16 XMM registers, each load/store will bite him. More spill -- more blood.
3) Unordered memory access. His final version accesses the polynomial coefficients in random order. In some cases the compiler may reorder static vars, but not in his benchmark. In synthetic tests all 8 coefficients stay in L1 cache, but in real HPC applications that situation is extremely rare.
If somebody writes
a + x(b + x(c + ...))
one should expect this code to be future-proof and work with any compiler. But if the code is mixed with inline assembly, the result will be doubtful, to say the least.
No need to be a prophet. The first CPU with AVX appeared 3 years ago. Haswell was announced 6 months ago. There are no AVX-512 parts for home users yet, but one can already write AVX-512-optimized code without any special knowledge.
Some people have asked about how these optimizations fare with different compilers. Do the "optimizations" hurt as one changes compilers? Do they help? Do any of them manage to vectorize?
Intel(R) Xeon(R) CPU E5-1620 0 @ 3.60GHz stepping 07
uname -a: Linux 3.5.0-41-generic #64-Ubuntu SMP
clang++ -v: clang version 3.2
icpc -v: icpc version 14.0.1
g++ -v: gcc version 4.7.2
Compiled with "-std=c++11" plus the options mentioned in the names. At a glance, all versions produced the same results to reasonable precision (no gross errors). "-ffast-math" only worked with g++, and did not offer substantial improvement, so I've omitted it from the tests.
So yes, these optimizations can still have a positive effect, but generally less than that of using a better compiler. In this case, that means if you care about performance, use Intel, or if that's not possible, use Clang.
But what about vectorization? Perhaps the compiler can do a better job if you let it make full use of modern SIMD:
Yes, it looks like Intel gains quite a bit from vectorizing the result with 256-bit AVX (4-wide for doubles). If you care about autovectorization, use Intel.
The performance here generally matched my expectations. I have no affiliation with Intel other than having a free academic license, but I find that in general their compiler offers better performance than GCC or Clang. In this case, Intel's best (sin6 with AVX) is 4x faster than GCC's best. I'd usually expect something more like a 20%-40% improvement, but vectorization is one of Intel's strong suits.
For those who wish to explore the reasons for the difference in performance, here are the results of 'objdump -C -d' for the functions in question:
Personally, I'd be very interested to see what an experienced x64 SIMD programmer could do to improve these further. My usual estimate is a further 2x, but I don't know how well that applies to this case. I'd also welcome analysis of the code produced by the compilers.
Your GCC and Clang are outdated, and the results are VERY UNFAIR in many respects.
The Intel compiler violates IEEE 754 by reassociating multiplication and addition. If you want to compare icc with other compilers, you should use the -Ofast option with the others.
Also, if you compile this code with Clang 3.3, you will get the same assembly as with ICC. And with GCC 4.8 you will get even more optimal assembly. Check http://gcc.godbolt.org/ to see the result with different compilers (permalink to code: http://bit.ly/1euyjXZ)
I appreciate your comment. The machine I ran this on has some strange issues that require us to keep around older versions of GCC and Clang, but I was able to install recent ICC. I hadn't known about the online compiler. That said, perhaps you could provide some numbers to back up what you are saying? I'd love to see what you think they should be.
Seeing as the article is about providing rough faster estimations using series expansions, I'm OK with Intel's choice of optimizations. For GCC, does "-Ofast" do anything that "-O3 -ffast-math" doesn't? Because I tried that and it did not help. Clang threw error messages when I tried, and since it wasn't my code I didn't bother deciphering them.
[Does godbolt.org give you any way to download the executable, or do you have to cut and paste the output and assemble that? I can't find a download feature, but it would seem very useful and I don't immediately see any security reasons that they wouldn't offer it.]
Entirely possible I'm doing something wrong. Maybe ICPC is inlining things into main() and not using the assembly I put on pastebin? But in any case, I think the ball is in your court.
> And with GCC 4.8 you will get even more optimal assembly.
Please prove it. Is it optimal because you tested it and it ran faster, or because you believe it should be? I used to have faith in GCC, but having looked at a fair amount of generated SIMD x64 assembly, I now believe that ICC generates better code in most cases.
The math is very simple: the register spill is the same, GCC generates 7 vaddsd and 9 vmulsd, while ICC generates 7 vaddsd and 15 vmulsd. addsd latency is 3 cycles, mulsd latency is 5. There are multiple add and multiply units on Intel CPUs, but this code loads them all, so ICC should be around 30% slower with this assembly.
> Maybe ICPC is inlining things
Check this. It looks like it generates vectorized code. I don't have the Intel compiler, so all I can do here is read the assembly.
> having looked at a fair amount of generated SIMD x64 assembly
There is no SIMD in any of your asm files. All these sin1-sin7 functions accept only one double and return only one result. They work with 128-bit operands, but only the first 64-bit lane contains values used in the operations.
> The math is very simple: register spill is the same, GCC generates 7 vaddsd and 9 vmulsd. ICC generates 7 vaddsd and 15 vmulsd. addsd latency is 3, mulsd latency is 5.
I strongly disagree that it's simple: figuring out the actual performance on a real processor without running the code is far from simple. In particular, the ordering of the instructions can be crucial. Yes, it can be done, but it takes a lot more than counting instructions. Intel's IACA is the most useful tool I've found for this (free), but otherwise I just spend a lot of time with Agner Fog's execution port info.
> There are no SIMD in any of your asm files.
You are absolutely correct. I put the objdump into pastebin, but did not actually look at this code. I just presumed it was vectorized from seeing the large jump in performance with '-mavx'.
Looking closer, it looks like Intel is using a dispatch function, and has a separate vectorized version that it uses when it can. I need to go to sleep, but I'll post the whole objdump so you can inspect: http://pastebin.com/B3EBi0Lq
And if you send me email (my address is in profile) I'll send you a binary to play with. Or if you tell me where to upload, I can do so.
-Ofast is just an option that works with icc, gcc, and clang 3.3. Maybe for gcc 4.8 it includes only "-O3 -ffast-math", but that can change at any moment.
If you care about SIMD, you should try the Intel SPMD Program Compiler (http://ispc.github.io/). Yes, it's a different language, but the result is also very different from normal C compilers. For example, the next thing an experienced SIMD programmer could do is to unroll the loop to process not just 2 doubles (SSE), 4 doubles (AVX), or 8 doubles (AVX-512), but a full cache line, and then reduce the result to a sum. ISPC, on the other hand, does this automatically.
Thanks, I'll check this out. I'm very interested in x64 SIMD. This is the sort of thing I've been doing with it: http://arxiv.org/abs/1401.6399
Although I'm talking up Intel's compiler, and do think it usually beats GCC, my actual experience is that I've largely given up on writing anything performant in C (much less C++). If you give a compiler an inch, it will find a way to hang itself. I'm having much better luck writing the whole function in inline assembly rather than using intrinsics.
Your advice isn't bad, but you should definitely try ISPC before rewriting everything in assembly. I'm currently optimizing the Cycles raytracer (without implementing an actual SIMD packet raytracer), and I'm a little tired of the avalanche effect in modern compilers too. The unusual thing about Cycles is that its code can be compiled by CUDA, OpenCL, and normal C++ compilers. GPU and CPU architectures are very different, so SIMD optimization is not trivial at all. Simple operation reordering can lead to a completely different stack map and a 10% performance drop, and you won't ever know about it until you test with every supported compiler (i.e. gcc, clang, and msvc).