Forcing code out of line in GCC and C++11

userbinator · on Jan 18, 2015

It's interesting to see techniques that are pretty common in Asm programming appearing in higher-level languages - putting the error-handling code "somewhere else" or otherwise trying to keep it out of the way is one of these. The other one related to this, which I haven't seen a compiler do yet, is "chaining the error path" which looks something like this:

        ; some code that sets CY on error
        jc some_error
        ...
        ; more code that sets CY on error
    some_error:
        jc some_error1
        ...
        ; etc.
    some_error1:
        jc some_error2
        ...
    some_error2:
        jc some_error3
        ...

    some_errorN:
        ; error handling code goes here

The reasoning behind this being that conditional jumps to nearby (within +/-127 bytes) targets are far shorter (2 bytes vs 6) and the error path is rarely taken. When the error-handling code is ahead, they are predicted to be not-taken with the common "always not taken" and "forward not taken" decisions, which agrees with their use of being taken only on error. A bit hard to express this pattern with C, however - and chained gotos, which would be somewhat equivalent, aren't so "high level" anymore.

ot · on Jan 18, 2015

There are a few things compilers can be really bad at, another is register allocation. A very interesting example is the LuaJIT2 interpreter, as explained by Mike Pall:

http://article.gmane.org/gmane.comp.lang.lua.general/75426

gumby · on Jan 18, 2015

Hah, funny, as the original motivation for register-rich RISC was that compilers can be more diligent than humans (in other words "the computer has more fingers than a person does").

But different languages are appropriate for different problems and there is no shame in dropping down into assembly code for pathological cases like the lua example you cite. In the end a good programmer understands the semantics and execution case far better (i.e. at a higher level) than the computer is able to.

In fact this article is a good example of how you can provide intentionality to a modern compiler so that you don't have to drop into assembly and provide the semantics by extension.

grundprinzip · on Jan 18, 2015

I'm not sure, but the assembly code generated when using __builtin_expect looks almost identical.

ot · on Jan 18, 2015

There's actually a huge difference: if you use __builtin_expect the slow path is still part of the function.

If the fast path is small enough, you probably want to inline the function, but the slow path makes it large enough that the compiler doesn't inline it (for good reasons).

If you force the slow path into a separate function, then your function becomes fast path + a call instruction, and it can be inlined.

I've often done this manually, and this is a very cool trick that I'm going to adopt immediately.

EDIT: plus, it's also beneficial for instruction cache/iTLB, as cornstalks pointed out.

Scaevolus · on Jan 18, 2015

Yes, unlikely()/likely() is a much better way to give the compiler hints about block linearization.

Here it is on godbolt: http://goo.gl/iPCiSy

gaze · on Jan 18, 2015

This is so strange! If something is unlikely to execute, it is unlikely to execute due to a condition, which you clue the branch predictor as unlikely and the layout thingy as out of line. Why would something be unconditionally unlikely to execute?! I don't understand what a lambda buys you here.

nkurz · on Jan 18, 2015

In theory, the lambda combined with the noinline attribute forces the compiler to use a call/ret to a function rather than making a local jump. Since the function usually will have alignment requirements, this often will mean the error handling is in a different 64-byte cache line. If this error never occurs, this cache line will never occupy space in the instruction cache (i-cache), and useful instructions can be held there instead.

In practice, I'd be surprised if you can come up with a case where this makes a significant difference. If you are on a hot enough path for this to matter, on a modern processor you probably are running out of the even lower level decoded µop cache, which doesn't cache µops for branches that are not taken. If you aren't in this cache, your efforts are probably better spent making this happen.

Edit: ot's comment about how this affects the size of the parent function and whether it will be inlined is a good point, and might well make a measurable difference in the cases where it is true.

mattgodbolt · on Jan 18, 2015

Author here. I use this trick for a general 'bail out with error code' throw macro. By definition it's the exceptional case and so I'm happy to take the hit of going out of line. It's a macro used all over a very large codebase and so it made sense to do this. And yes, in some cases it helps the compiler inline more aggressively where it does matter.

abdulla · on Jan 27, 2015

I'm guessing the reason for the !! in the macro is to force a boolean conversion. Is there a better explanation?

cornstalks · on Jan 18, 2015

It's similar, but the whole point of the trick here is to move the generated code so it doesn't "pollute" the instruction cache in the CPU. __builtin_expect still puts the error-handling instructions right next to the rest of the code. The lambda puts the error-handling instructions elsewhere.

Whether or not this will impact performance depends, and will need some careful profiling. But I can imagine some situations where keeping the "hot" instructions in the cache and the "cold" (error-handling) instructions out of the cache could be beneficial.

rlpb · on Jan 19, 2015

> __builtin_expect still puts the error-handling instructions right next to the rest of the code. The lambda puts the error-handling instructions elsewhere.

Surely that's a decision for the optimizer to make, in the case of __builtin_expect?

tacos · on Jan 18, 2015

Even the best coders often get this wrong. Let the profiler do it.

http://llvm.org/docs/BranchWeightMetadata.html

ot · on Jan 18, 2015

See my other comment (https://news.ycombinator.com/item?id=8906866)

The compiler can optimize the layout of the basic blocks according to its beliefs (or hints) on the branch probabilities, but it won't move a code path to a separate function. I don't know if this could be done in compliance with the standard, but I haven't seen any compiler do it.

EDIT: I stand corrected, looks like GCC 4.9 can do it (http://goo.gl/amM4Et). Still, it didn't do it when I needed it :)

DannyBee · on Jan 18, 2015

Function outlining can be done, if it's not being done, you don't have the right options or it's deciding it's not profitable.

(and FWIW, ICC has done this for at least 8 years)

repsilat · on Jan 18, 2015

Why put it into a separate function? You could just movie it to the end and jump there and back，saving the call overhead.

DannyBee · on Jan 20, 2015

Call overhead would be nil, since it created the function, and thus it does not have to follow normal argument passing behavior :)

In effect, it will just be a bunch of gotos.

Rusky · on Jan 18, 2015

It makes the function smaller, making it easier to inline. This might matter depending on how easy it is for the compiler to do partial inlining.

nkurz · on Jan 18, 2015

The link is helpful for background, but is it really useful to tell a programmer who is worried about i-cache contention to be guided by an automatic profiler rather than forcing the compiler to do what they want and then profiling that? I'm intrigued --- what prompts you to give this advice? It's the opposite of my instinct. Have you personally experienced issues with this?

tacos · on Jan 18, 2015

Programmers are horrible at guessing branch prediction. It's a famous truism. And the only thing worse than guessing the branch prediction wrong would be moving the code farther away then adding a jump to it and back.

The next level of mistake is gathering the profile data from the unit tests. (Hey! Free Input!) Ugh.

Hardcoding "Case 5: is most often" in a library also bit me. If you're giving me the source, let me guide the paths, thanks. Maybe you use Case 5, maybe my code doesn't. (And I just stepped through Clang doing the exact right thing on a switch statement with inlining.)

userbinator · on Jan 18, 2015

Programmers are horrible at guessing branch prediction.

Not for error cases like the ones described in the article.

enqk · on Jan 18, 2015

I think you should have stated upfront that you were asking library-writers not to do that, because it's hard to guess actual branching profiles there. I don't think your advice apply that much to an application writer.

gumby · on Jan 18, 2015

The article doesn't give any specific situations for doing this (apart from the example of an uncommon error case). You could indeed be doing this specifically from the result of looking at the profiler and then reading the assembly code output of a hotspot.

mattgodbolt · on Jan 20, 2015

That's a weakness of the article...sorry. I use it to handle error cases in otherwise very performant straight-line code.