You can’t beat a good compiler…. right? (lbrandy.com)
104 points by mnemonik on June 7, 2010 | hide | past | favorite | 34 comments



You can easily beat a good compiler for many kinds of code. It just takes time, effort and a good deal of knowledge. If you're doing it outside of the hottest kernels of your code, you're probably wasting your time.


And it also requires that you pay attention to your evolving toolset. I've worked on x86 code that was hand-optimized back in 1995. Once I reverse-engineered it and handed it to Visual C++ the code gen'ned by VC killed it, by like a factor of 2x.

I emailed the code to the guy who originally wrote it and his response was brief: "The compiler wasn't that smart before". And yet the code sat there for like 10 years, untouched -- largely because people don't like to touch x86 assembly code that is working.

The moral: you may get a benefit today, but you forfeit the future benefits of the investment by MS, Intel, the GCC folks, and everyone else who keeps improving the compilers. Sometimes it is the right call to hand-optimize. But when you do, I suggest keeping a C/C++ version alongside it that you can test against whenever the compiler revs.


That's a great point. It's similar to why people often set up their problems as 3SAT or convex optimizations--you get to benefit from all future improvements to 3SAT solvers or optimizers.


An anecdote in a similar vein:

I was involved in writing (Fortran) code for Crays in the 90s. For the T3D, we sped up the code with the usual set of tricks -- loop unrolling, array padding and so on. With the later T3E, it turned out that not only could the compiler now do all that, but that code hand-optimised in this way performed worse.


Heh, we may know each other. I also developed code on the T3D and T3E. I figure there can't be too many people in the world who developed for both. I still have the big gray Cray binders with the documentation lying around somewhere.


Heh, I do like the "You can easily beat...." followed immediately by "It just takes time, effort..." - but you're absolutely right. I'm not sure that was his point though; coaxing the compiler into issuing better code by providing a few hints doesn't necessarily require too much time or effort.


Moral of the story - if you can make more assumptions about the input data than the compiler, you can beat it.

Or in general - if you are considering using a tool or rolling your own, what do you know about your data that the tool author doesn't?


This is really a matter of scale and reality. Let's think in survival terms. Some applications, say high-frequency trading, depend on getting the calculation done first. We are talking milliseconds or less to do a calculation that leads to placing, cancelling, or modifying an order. Do they use assembler? No. Problems of scale requiring massive hardware or clusters, do they use assembler? Not generally. Often they use Lisp, C++ or C.

So in the context of real-world problems, compilers win, because none of us has the energy or attention span to write something the size of a real-world program -- a compiler, for example -- in assembler.

We optimize small problems by hand, but problems of useful size we don't because the compiler does a better job on problems of true interest than we have the patience or time to do.

To put it another way, it is better to put energy into improving compilers than getting into a John Henry type of competition.


Your point is well made, but if the large, relevant problem can be broken into small sub-problems, and if one of those consumes a significant amount of compute time, then that might (just might!) be worth optimising in assembler.

My experience of problems requiring massive hardware or clusters (limited, I admit, to certain classes of scientific computing) is that such assembler coding of speed-critical components is actually often done.


John Henry won the race against the steam-hammer too, before he died. The point is not that it's impossible to beat a good compiler at micro-optimization, the point is that it's rarely worth the effort. More important than being able to beat the compiler at optimizing any given piece of code is building the tooling (and having the discipline and experience) to know when and where to best leverage your development efforts in optimization.

Rather than trying to carpet-bomb everywhere with micro-optimization hand-grenades, use laser guided precision to hit the targets that count. So instead of wasting your time manually unrolling loops in code that's just going to be refactored away before release anyway, find out exactly where your code is bogging down and concentrate on fixing that.


Wouldn't something like

    float temp_out = 0.0f;
    for (j = 0; j < 5; j++)
        temp_out += filter[j] * in[i + j];
    out[i] = temp_out;
solve the performance issue, and allow for out = in? (which I would think would be desirable, in cases where you don't need the original input).

Also, this is way outside my area of expertise, but when I hear convolution I think FFT. I would be curious if anyone knows whether there is an algorithmic improvement to be made here.


If you look at the code I put on github, you will see I tested this exact bit of code. It does produce good assembly but I left it out because the post was already long, and it obfuscates -why- this improvement improves performance.

Your second comment (about out == in), is a bit trickier. This is obviously a contrived example. In a normal gaussian filter, you'd want the filter centered so that the effective indexes of the filter are -2 .. 2, not 0 .. 5. This makes (out == in) a trickier problem to solve. I didn't really want to get into those details in the post.


I think that your solution is better in that it's simpler, but I'm not sure what the compiler will do in this case. Theoretically, in[i+j] could point to temp_out, but it could only be through very tricky means, since temp_out is on the stack, and the caller presumably doesn't know where it will be allocated.

An FFT wouldn't be faster in this case since the filter only has five taps. Generally there has to be a decent sized kernel for the FFT to be worth it -- the FFT is O(N log N), and this algorithm is only 5*N operations.


The compiler is free to optimize in this case, because there's no requirement that in[i + j] point to temp_out, even if you manage to make the stars align.


Better examples of good compiler optimization would be something other than math loops. Math is kind of shooting fish in a barrel. Where the compiler shines is when it optimizes through necessary abstraction. Sure, you can hand-code a flat struct, but you'd rather have objects with ctors etc. (insert favorite abstraction here).


Starts by saying how it's a huge myth that compilers generate better code than hand-made assembly... and ends with compiler-generated code, the author basically saying "I think I could do better".

It was interesting to see what the compiler generated, though it won't be a big surprise to experienced C coders.


> Starts by saying how it's a huge myth that compilers generate better code than hand-made assembly...

It's important to note that I never said this.

I was simply trying to separate two versions of the same rule of thumb... I said that using a compiler, a smart low-level programmer can generate the assembly code he wants. This is not "hand-made" assembly, but this is not just the magic of an optimizing compiler, either.


I do see your point and I found it interesting (I've had to optimize filters in the past and used the same process of coding C and looking at the assembly), but how can "beating the compiler" be interpreted as anything but writing better code than the compiler?


Writing optimized code generally involves looking at larger sections of the problem / code base and picking a different approach based on global conditions.

So, when you and the compiler are both looking at the same code base, but you find a better local approach based on a global condition, you have beaten the compiler. Even if all you did was tell it about that condition.


If you read it slightly more charitably as saying "you need to know a thing or two about how computers work, even if you have a nice optimizing compiler", it makes sense.


True, but it's easy to forget that many people don't use compiler/architecture-specific extensions or pragmas (and probably don't even need to). I think the article provides a gentle nudge into the "think about what exactly you are telling the compiler" mindset for such people.


Although there's no need to venture into unportable extensions here - as I pointed out at http://lbrandy.com/blog/2010/06/you-cant-beat-a-good-compile..., C99 has the "restrict" keyword.


I think the Euro-demo coders of the 80s made it clear that handwritten assembly can be optimized to a ridiculous point of efficiency.


Very true. But these were specifically intended to test the limits of coding. In other situations, you have to ask yourself, is this really adding value?


Good post.

I'm a big fan of looking at the assembler code generated from programs - even if you are not optimising - it helps give a depth of understanding which is simply not possible without doing it.

Not to mention that the "smart" compiler is a myth - I've seen some downright stupid asm come out when the compiler is configured wrong - and it gets missed, even with profiling-based optimisation targeting the area - because looking at assembler is actively discouraged.

We are spoiled with IDEs and high level languages...


This has some caveats that I pointed out at http://lbrandy.com/blog/2010/06/you-cant-beat-a-good-compile...; still, a nice enough article.

EDIT: note that the off-by-one error is MY mistake - it simply isn't there. My point about "restrict" is valid, though.


This is a terrifying post for me to make because I feel like I'm about to make an ass of myself... but...

I'm not seeing an off-by-one error. 'i' will be LENGTH-5 on the last iteration and we will add 'j==4' for in[LENGTH-1] not in[LENGTH].


You are actually right. Sorry! I shouldn't be trying to do N things at the same time - I somehow misread that as i <= LENGTH - 4.


Being unsure and not a douche should prevent you from making an ass of yourself. I don't see an error either.


I agree that checking assembly output is very good practice and even more important on not-so-common architectures. x86 and ARM compilers are very good now but if you're working on a different processor the compiler might not be so good.

Especially on DSPs and microcontrollers you may end up changing instruction order to avoid pipeline stalls or use on-chip RAM buffers for your often used data.

Also, if you start writing a routine in assembly from scratch it's often surprising how many different ways you can implement an algorithm since you're not as bound by the calling conventions and runtime assumptions made by the compiler.

Not that I would recommend that for anything other than the innermost loops.

By the way, are we talking about anything other than C or maybe Fortran here? I'm guessing Java or C# coders don't spend too much time looking at the IL, I know I've only done it for educational purposes.


Speaking of this, can anyone recommend any good resources for learning assembler, targeted at noobs who know some C but nothing more low-level?

I need to plug some holes in my skills, and this is one of them.


I wouldn't necessarily recommend this approach, but the way I learned to read x86 assembly (I've never needed to write it, but I do write other assembly languages) was by reading Michael Abrash's Graphics Programming Black Book and disassembling a Windows driver for a piece of hardware in order to create a Linux driver. Having a practical motivation other than just "learning assembly language" can be beneficial.

You could also try getting an embedded development kit, like the USB Bitwhacker, PICKit, or an AVR-based kit, and programming a microcontroller in assembly language.


http://www.amazon.com/Assembly-Language-Step-Step-Programmin...

It's kind of a conversational approach to learning Assembly.


In certain cases, this level of optimization makes perfect sense. In other cases, arguably more common when writing C++, it makes more sense to invest that time in understanding the object model, i.e. what actually happens when you pass a struct by value, return an object (NRV), where invisible object copying occurs (STL, I'm looking at you) or similar issues. Lippman's "C++ Object Model" is a good reference for that.



