Operation Costs in CPU Clock Cycles (ithare.com)
219 points by MattHarrington on Nov 11, 2016 | 67 comments



A syscall is a lot cheaper than shown here in terms of direct costs (~150 cycles), and also a lot more expensive, depending on the call, when you factor in the full cost of the cache that gets clobbered: more like the full cost of the context switch shown, around 10-30K cycles [1]. For this reason it's important to use system calls that allow you to amortize more work into one call, like preadv, pwritev, recvmmsg, and sendmmsg.

[1] http://www.cs.cmu.edu/~chensm/Big_Data_reading_group/papers/...
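
To make the amortisation point concrete, here's a minimal sketch using readv, the simplest of that family (fd is assumed to be open already, error handling omitted):

    #include <sys/uio.h>   // readv/preadv and friends (POSIX)
    #include <unistd.h>

    // One kernel crossing fills several buffers, instead of one read() per buffer.
    ssize_t read_three_buffers(int fd, char* a, char* b, char* c, size_t n) {
        iovec iov[3] = { {a, n}, {b, n}, {c, n} };
        return readv(fd, iov, 3);   // pays the syscall + cache cost once, not three times
    }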


>"and also a lot more expensive, depending on the call, when you factor in the full cost of the cache that gets clobbered."

But a syscall doesn't necessarily mean a context switch, right? It's just a mode switch if the kernel is servicing the syscall on behalf of the same process. The cache wouldn't get clobbered then, because the kernel is mapped into the top of every process's address space for just this reason.

Or am I misunderstanding what you are saying?


You are correct. The author is considering the cost of accessing kernel-mode code and data, which I don't think is fair.


It absolutely is fair. For a long time after a syscall your program will run slower thanks to the increased cache misses from the kernel clobbering your stuff in cache. That's a real price you pay for a syscall and it would be wrong not to count it. However, if you found some other way of doing the work, e.g. user code, kernel bypass, or an amortized bulk syscall, those would also have a (lesser) effect on the cache. So to be fair you compare a syscall against that, not against zero.


If you have enough CPU cache, the CPU will cache both kernel and user code/data for your process. If you run long enough and access enough user data that the kernel bits get evicted, then you'll take more of a performance hit, but the same thing applies to accessing enough user data that different parts of your user data get evicted.


Maybe this is more reflective of the kinds of software I work on, but I don't find that I have enough cache, so syscalls always come at a steep price. In fact I am tempted to break up one program into multiple processes and use the cache partitioning in Xeon v4 to prevent the different parts of my program from clobbering its own cache.


Oh I see, that makes sense. Agreed. Thanks.


Anyone have the presentation from an Intel guy on how CPU design focus has moved from cycles to cache misses handy?

Edit: never mind, it was not an Intel guy. And I actually had the thing bookmarked (and it still worked).

https://www.infoq.com/presentations/click-crash-course-moder...


There's a fantastic, massively upvoted StackOverflow post that can also provide some insight here. This may be a little more accessible, since it's such a significant runtime difference with very simple source code.

http://stackoverflow.com/questions/11227809/why-is-it-faster...


Thanks so much for sharing this, great read!

For those, like me, that want to play with what this stackoverflow talks about, here's a fiddle of it: https://jsfiddle.net/tbinetruy/Latkmk2q/1/ (code takes 2s to run and loads firebug for console logs).


The question is very interesting and well phrased, and the answer is better than many classes that students have had about processors and so on.


That's a great talk, brings together a lot of different things I've seen in one place.


Honestly, it's a lot more complicated than that, and the article itself presents a pretty complicated (and flawed) model. There's no substituting for benchmarks.

Really, if you want a quick and dirty heuristic, the smallest program is likely the fastest. Everything else will bite you in the ass.

Anti-exception stuff: the author repeats the old, harmful "exceptions for exceptional cases" bromide while overselling the cost of exception dispatch (5K cycles is cheap in the scheme of things) and underestimating (or rather, not mentioning) the icache cost of error checking.

Now that I'm looking, the author glosses over many other things as well, like modern system call interfaces (no exceptions!), cache pressure generally, and branch prediction efficacy. These factors are strong arguments against generating reams of template code using CRTP.

Like most analyses that purport to show the "true" cost of anything on modern superscalar virtual memory systems, it oversimplifies.


> Honestly, it's a lot more complicated than that, and the article itself presents a pretty complicated (and flawed) model. There's no substituting for benchmarks.

Of course, all models are inherently flawed, and in theory there is no substitute for benchmarks, but going along the lines of "all models out there are flawed, so we don't need ANY model and should just benchmark instead" leads to software which is badly architected to start with (and then no benchmarking will be able to heal it without throwing the whole thing away and rewriting it from scratch - which is not going to happen for most projects, for many practical reasons).

In other words, if we're already in the vicinity of the optimum (which BTW may happen to be a local one - but this is a very different story), then benchmarking is the way to go (that is, if those 20% we're chasing are worth the trouble - and TBH, they really are). However, at the earlier stages, being able to see the order of magnitude without spending time trying really badly architected things is extremely important.

Just one example: no optimisation will save a (non-GPGPU) program which tries to split the load over multiple threads in chunks of 100 cycles or so per thread (before it relinquishes control to another thread); I don't need to benchmark it to know that in at least 99.9999% of real-world projects it is suicide from a performance point of view - and this observation follows directly from the diagram.

Or another example: in most practical cases (actually, pretty much always, except for inserts into the middle without iterating through, and specialized stuff such as merges), vectors have an advantage over lists - and this advantage can easily reach over 100x(!) (see, for example, https://accu.org/index.php/journals/2268 ). This, once again, follows from the diagram (it is a direct consequence of a main memory read being over 100 cycles and an L1 read being 3-4 cycles).
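
A minimal sketch of that effect (my own toy example; the exact ratio depends on sizes, allocator behaviour and how scattered the list nodes end up):

    #include <chrono>
    #include <cstdio>
    #include <list>
    #include <numeric>
    #include <vector>

    // Summing N ints stored contiguously (vector: prefetcher-friendly, mostly cache hits)
    // vs. stored in per-node allocations (list: a pointer chase, potentially a main-memory
    // read, per element). Nodes allocated in one go still have some locality; lists that
    // have seen many inserts/erases behave far worse.
    int main() {
        constexpr std::size_t N = 10'000'000;
        std::vector<int> vec(N, 1);
        std::list<int> lst(N, 1);

        auto time_sum = [](const auto& c, const char* name) {
            auto t0 = std::chrono::steady_clock::now();
            long long sum = std::accumulate(c.begin(), c.end(), 0LL);
            auto t1 = std::chrono::steady_clock::now();
            auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
            std::printf("%s: sum=%lld, %lld ms\n", name, sum, (long long)ms);
        };
        time_sum(vec, "vector");
        time_sum(lst, "list");
    }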

This list of crazily inefficient things (which are still widely used in the industry for no good reason) can go on and on. Of course, going into any analysis more detailed than "order of magnitude" indeed doesn't make sense - and this is clearly stated in the OP. But for initial estimates when making architectural decisions (and to avoid premature pessimisation), such "order of magnitude" analysis can be very useful.


While this is a pretty decent overview, it is not very precise, and even a bit misleading in certain cases. Depending on where certain instructions are in the execution pipeline, you can be limited in what can be issued and executed, and the latencies are not deterministic. This is one of the biggest difficulties in optimizing scheduling for a complex architecture such as x86, and it gets even worse (determinism-wise) when you throw in cache misses and branch prediction.

For a more exhaustive instruction latency listing over a variety of microarchitectures, check out [0]. [1] is also a great resource for memory latencies across a variety of processors.

[0]: http://www.agner.org/optimize/instruction_tables.pdf

[1]: http://www.7-cpu.com


For people who make interactive applications, here is an interesting thing to add:

Making a user interpret an error message or notification of some kind (8 seconds) and have to dismiss a window (3 seconds) rather than be able to continue seamlessly:

44,000,000,000 operations, or 275,000,000,000 operations if we include the idling GPU.

Distance light travels in this time: around the world 77 times, all while we keep the user from being able to do anything.

Something to think about the next time you throw up a dialog box...


If you want to put it that way, you can think of information in the user's brain as being stored in peripheral memory at the end of a horribly slow, high-latency, unreliable bus. :)


This is really good phrasing!!! I love it.

Especially when you need the user to take an action: the API isn't well formed (it's a mix of visual cues, text, and UX behavior, for example), but at the end of the day, if you give the user something to do (for example, you want them to acknowledge and dismiss a notification, or make a choice), there are multiple ways of doing it, some much faster and others much slower.

This is especially interesting when you need the user to deal with your bugs (e.g. errors of various kinds). It's not really part of what they're trying to do - you're the one that needs them to react somehow.

Very interesting way of thinking about it!


I don't think we need to 'think of' it as such... this is precisely what it is.

... and don't forget it's not just the bus that is unreliable, but the data store too. :)


In places, this is quite simple-minded, though there is a nod to the distinction between throughput and latency. A function call may effectively cost nothing; if the function and its caller do useful work, a few store/load pairs (save and restore registers) and a correctly predicted return address may be a handful of microoperations merged in with a stream of other instructions. The idea that there is 20-50 cycles of actual cost from a function call is outlandish.

Similarly, the virtual function call number looks way out of whack as long as the indirect branch target is correctly predicted (which, if the use of virtual functions is substituting for a more straightforward approach, is likely - and if it's quite unpredictable where the indirect branch in your virtual function call would go, any other approach probably just moves the branch mispredict somewhere else).

The numbers are not overall "crazy-wrong", but there is a tone of precision and certainty to them that is a bit deceptive. Talking about how much an operation costs is pointless; build things one way, measure that, then make the smallest change possible and measure that. Most of these costs make sense only in context (if you're on a critical path, measure latency, if you're on oversubscribed execution resource, measure throughput, etc).


> may be a handful of microoperations merged in with a stream of other instructions.

It may indeed in theory, but most of the time in the real world it won't. And BTW, let's not forget about the implicit cost of being unable to inline (which can be huuuuuge). Speaking of the real world: we've just had our article accepted, presenting supposedly the fastest universal hashing function as of today (in spite of the math being heavier than in existing ones) - and the numbers in the OP are consistent with our real-world experience while we were optimising it (well, within an order of magnitude at least).

> but there is a tone of precision and certainty to them that is a bit deceptive.

OP: "Last but not least, a word of caution: all the estimates here are just indications of the order of magnitude".

> Talking about how much an operation costs is pointless; build things one way, measure that, then make the smallest change possible and measure that.

Sure. The problem is that almost nobody has time to do that in real-world projects. Which leads to even cruder estimates (such as "virtual function call costs are negligible, regardless of the number of times they're called") being used - causing lots of crazily inefficient programs. IMO, the OP is a reasonable middle ground between all-out quasi-stationary testing and even worse guesstimates ;-).


I have been working in this area for 10 years, made millions of dollars of revenue from a product that was largely dependent on getting this sort of thing right, sold a startup to Intel whose main business was in this area, and work for Intel. I'm not 100% sure you are well situated to lecture me about the 'real world' because you wrote an academic paper.

Congratulations on getting a paper accepted (and I would be interested in reading it, as I love hashing work), but your claim that "most of the time in the real world it won't" is nonsense. The typical x86 function call overhead is a correctly predicted call, some register saves (using an optimized stack engine), the real work, and some register restores (using the same optimized stack engine). This is not typically 15-30 cycles' worth of overhead. The loads and stores generally go to and from L1 cache, are generally predictable (if the function itself is predictable), and none of the operations that are part of a conventional function call are all that expensive. In 15-30 cycles you can do 30-60 loads and/or 15-30 stores (mileage varies by uarch) and 60-120 integer operations if you are very, very lucky. Compare this with the typical argument setup, function prolog/epilog, etc.

As you hint at, function call overhead generally comes from interrupting live ranges (forcing save/restore, or simply causing the register allocator to pick a worse strategy) and losing the opportunity to optimize across function boundaries - this cost can be enormous, nebulous, and not even constant (it imposes costs repeatedly throughout the program rather than as a fixed per-call charge). I have code where the cost of sticking a function call in and losing some constant-propagation information is millions of cycles per call, not 15-30. In other places the cost of a call is effectively zero.
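
A toy illustration of the lost-constant-propagation point (my own example, not the parent's code; GCC/Clang attribute, and an aggressive whole-program optimizer may still see through it):

    #include <cstddef>

    // When the divisor is visible as the literal 7, the compiler strength-reduces the
    // modulo into multiplies and shifts; behind a non-inlined call taking it as a runtime
    // argument, every iteration pays a call plus a real division.
    __attribute__((noinline))
    unsigned bucket_of(unsigned x, unsigned step) { return x % step; }

    unsigned histogram_opaque(const unsigned* v, std::size_t n) {
        unsigned acc = 0;
        for (std::size_t i = 0; i < n; ++i)
            acc += bucket_of(v[i], 7);   // call + runtime div each iteration
        return acc;
    }

    unsigned histogram_inlined(const unsigned* v, std::size_t n) {
        unsigned acc = 0;
        for (std::size_t i = 0; i < n; ++i)
            acc += v[i] % 7;             // constant divisor: no div unit involved
        return acc;
    }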

In still other places the cost of a function call is 'negative' - that is, it's cheaper to have the function exist as an independent unit than to inline it. This is typically an i-cache issue, but we've seen a host of weird effects here.

So - under some (unusual) circumstances, function call overhead can be practically free (i.e. the OoO engine has plenty of opportunities to slot in prolog/epilog instructions, no branch mispredicts, etc.). Typically it will be inherently cheaper than 15-30 cycles - but lost opportunities for optimization may take you to numbers that are insanely higher than that.

The reason I arced up over this is that it's just not useful to say "within an order of magnitude". If you say 15-30 is the magic number, you are creating folklore. We are burdened by folkloric stuff ("the interpreter penalty that results because branch predictors 'can't predict indirect branches'") that results in considerably worse designs. It's better to know that you don't know than to promulgate simplistic rules of thumb that misstate the real issues.

This is particularly true because a lot of this 'folkloric optimization' generally leads people away from a simple and direct expression of their ideas in their favorite coding style, and towards a heavily "optimized" form that's super-obscure because some "performance high priest" has declared it Performant. We've had a number of cheap laughs in our day re-rolling loops, replacing inline asm with C code, restoring sanity, and getting a performance win from doing it. I've been just as guilty of playing Performance High Priest as anyone else, mind you.


> made millions of dollars of revenue... I'm not 100% sure you are well situated to lecture me about the 'real world' because you wrote an academic paper.

So, we're going to discuss millions of dollars instead, sigh... By that logic, Trump should be one of the best programmers in the universe, I guess.

> Typically it will be inherently cheaper than 15-30 cycles - but lost opportunities for optimization may take you to numbers that are insanely higher than that.

And still, even from your own rant it follows that in the vast majority of cases the estimate of 15-30 cycles will be well within an "order of magnitude" (TBH, I haven't seen the millions myself, but that would have to be a really strange corner case).

> If you say 15-30 is the magic number you are creating folklore... We are burdened by folkloric stuff... that result in considerably worse designs.

And not having any such "folklore" results in even worse designs :-( (actually, MUCH worse ones). Using a list instead of a vector can easily give you a 100x penalty for absolutely zero reason (actually, up to 780x has been experimentally observed - and that's without swapping). Off-loading 100-cycle chunks (with a thread context switch back after calculating those 100 cycles) to a different thread will never work, at least on x64 (though I remember meeting some folks from Intel - I think they were representing an OpenMP team - who were seriously preaching otherwise, based on the utterly silly "how many cores we managed to utilise" metric, without realising that the whole thing became _slower_ after they parallelised it ;-( ). And so on and so forth.
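
To make the 100-cycle-chunks point concrete, a toy sketch (my own illustration; std::async stands in for whatever hand-off mechanism is used):

    #include <chrono>
    #include <cstdio>
    #include <future>

    // The work per chunk (a multiply and an add) is dwarfed by the cost of handing it
    // to another thread and getting the result back.
    static long long tiny_chunk(long long x) { return x * 2654435761LL + 1; }

    int main() {
        constexpr int N = 20000;
        long long acc = 0;

        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < N; ++i) acc += tiny_chunk(i);                   // in-thread: ns each
        auto t1 = std::chrono::steady_clock::now();
        for (int i = 0; i < N; ++i)
            acc += std::async(std::launch::async, tiny_chunk, i).get();    // off-loaded: us+ each
        auto t2 = std::chrono::steady_clock::now();

        using ms = std::chrono::milliseconds;
        std::printf("in-thread: %lld ms, off-loaded: %lld ms (acc=%lld)\n",
            (long long)std::chrono::duration_cast<ms>(t1 - t0).count(),
            (long long)std::chrono::duration_cast<ms>(t2 - t1).count(), acc);
    }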

Sure, the numbers are very rough. But saying "hey, it is not precise, so let's not even try to estimate" is even worse than that.


I don't think Trump made his money doing low-level performance programming for the past 10 years, so I'm not sure your analogy is valid.

However, since you, whoever you are, have not only written a hash table, but discovered profundities like 'Sometimes list costs 780x as much as vector for "absolutely zero reason"', and 'don't try to offload 100 cycles of work to another thread' I'm going to defer to your expertise. I recommend you stick a bone through your beard, pronounce yourself a performance guru, and make bank. Have fun.


> I recommend you stick a bone through your beard, pronounce yourself a performance guru, and make bank. Have fun.

:-) :-) I LOVE when my opponent has to resort to personal insults :-). Leaving aside any sarcastic remarks in this regard:

For Intel's sake I Really Hope that these "profundities" are indeed very well known to you - and believe it or not, they have been very well known to me too for at least 10 years. However, this is not the point; the point is that there are LOTS of developers out there who do NOT know them - and the OP is intended for them (and not for "performance gurus").

It really is THIS simple. Eliminating 90% of inefficiency does not require black magic or "performance gurus" who know exactly how the pipeline of a specific CPU works. And this is exactly what I'm arguing for: educating app-level developers and architects about this low-hanging fruit of 10x+ inefficiencies; I can assure you that it is very far from being universal knowledge in app-level development circles.


This is a fantastic resource; kudos to the author. But there is one thing in this reference that I found unexpected:

One further thing which is related to memory accesses and performance, is rarely observed on desktops (as it requires multi-socket machines – not to be confused with multi-core ones ... When multiple sockets are involved, modern CPUs tend to implement so-called NUMA architecture, with each processor (where “processor” = “that thing inserted into a socket”) having its own RAM

I thought that all Intel chips since Nehalem divided their SDRAM access into a NUMA configuration based on cores? Am I wrong about that?


NUMA typically affects multi-socket machines only. An exception would be high end Xeon chips since Haswell when used in a cluster-on-die configuration, but you won't find that in a desktop PC. Each socket in a multi-socket system has its own memory, and when a remote CPU accesses the memory of another CPU, it pays a fairly hefty latency penalty compared to accessing its own memory.
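
If you have such a box and want to see the penalty yourself, here's a rough sketch using libnuma (Linux, link with -lnuma; node numbers and sizes are placeholders, and this is only an order-of-magnitude probe):

    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <cstring>
    #include <numa.h>

    // Strided reads through a buffer far larger than the LLC, first from memory on our
    // own node, then from memory bound to the other socket's node.
    static long long sweep(volatile char* buf, std::size_t size) {
        long long sum = 0;
        for (int pass = 0; pass < 4; ++pass)
            for (std::size_t i = 0; i < size; i += 64) sum += buf[i];   // one cache line per access
        return sum;
    }

    int main() {
        if (numa_available() < 0 || numa_max_node() < 1) { std::puts("need 2+ NUMA nodes"); return 1; }
        const std::size_t SZ = 256u << 20;               // 256 MB
        numa_run_on_node(0);                             // keep execution on node 0
        char* local  = (char*)numa_alloc_onnode(SZ, 0);
        char* remote = (char*)numa_alloc_onnode(SZ, 1);
        std::memset(local, 1, SZ);                       // touch pages so they are actually placed
        std::memset(remote, 1, SZ);

        auto time_it = [&](char* buf, const char* name) {
            auto t0 = std::chrono::steady_clock::now();
            long long s = sweep(buf, SZ);
            auto t1 = std::chrono::steady_clock::now();
            std::printf("%s: %lld ms (sum=%lld)\n", name,
                (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count(), s);
        };
        time_it(local, "local node");
        time_it(remote, "remote node");
    }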


Just thinking about it, I guess this makes perfect sense - all memory accesses on a socket converge on a common L3 cache, and it would be just bizarre if somehow each core did a 'private write-through' to its own SDRAM.


I don't think there's much NUMA action on a single socket at the moment, but as CPU area increases and more of the transistors are not actually doing CPU work (to spread out the heat-making bits), which increases distances on a single die, this will change.


Unless there is core-specific RAM on the die, why? Isn't the essential aspect of NUMA the fact that there is some memory which is "near", and some which is "far"?


Yeah, as distance (latency) to RAM increases, the amount of on-die cache increases (another handy way to distribute heat with a performance bonus) and coherency becomes more costly, so in effect it becomes the core-specific RAM you mention.

(Oh, hi Kiko!)


Hey Jeff.. the nick fonts are really small on HN; I didn't see it was you!

I don't think latency to near RAM will increase; it would have too material a performance impact. Even in disaggregated designs like Rackscale there is definitely a concept of "near RAM", which is not cache, but which has very low latency.

However, your post made me realize that as the number of cores go up, as with KNL, they are likely to be organized hierarchically with some clustered sharing of cache, so indeed NUMA-style affinity of workload to core starts paying off there. IOW, if you have thousands of cores on a chip, they definitely aren't going to be all sharing the same L2 and L3.


Are core-specific caches write-through or write-back?


I think they're always write-back, but it gets complicated with the different levels, and it differs between Intel and AMD. My understanding is that Intel uses an 'inclusive' scheme where everything in a higher-level cache is also present in the lower levels - i.e. if something is in L1 it will also be in L2 and L3 - whereas with AMD's exclusive scheme, L1 or L2, when evicting a dirty line, can effectively 'write around' the intermediate level straight to a lower one.


The post states:

"in particular, tcmalloc and ptmalloc2 allocators can take as little as 200-500 CPU cycles for allocation/deallocation of a small object"

Does anyone know how many cycles the regular glibc malloc() takes?


I believe ptmalloc2 is glibc's malloc.


Yes, you are correct. Does anyone know how many cycles that is, comparatively?
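
One quick way to ballpark it on your own machine (a rough sketch only; allocator caching, turbo and timer granularity make this an order-of-magnitude number at best):

    #include <chrono>
    #include <cstdio>
    #include <cstdlib>

    int main() {
        constexpr int N = 1'000'000;
        static void* ptrs[N];

        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < N; ++i) ptrs[i] = std::malloc(64);   // small-object allocations
        for (int i = 0; i < N; ++i) std::free(ptrs[i]);
        auto t1 = std::chrono::steady_clock::now();

        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / N;
        std::printf("~%.0f ns per malloc+free pair (multiply by your clock in GHz for cycles)\n", ns);
    }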


I like this chart. In particular, I like that they break apart "thread context switch (direct costs)" and "thread context switch (total costs, including cache invalidation)". I've often heard that context switches are cheap because people only consider the direct costs.


Context switches are cheap because context switches are cheap. You can, with enough finagling, make the "indirect costs" of anything arbitrarily high.


I'm curious where this information comes from. A lot of it jars heavily against personal experience and measurements I can do for myself.

The idea that floating point division is that much more expensive than multiplication, for instance... the only difference AFAIK is latency, not throughput.

The idea that an indirect call and a virtual function call are so close is odd as well, given that the latter is a read followed by an indirect call - while the timings given for some of the reads are considerably greater than either. That's utter nonsense on inspection.

Take it with a great pinch of salt, and remember that the one correct way to judge timings is to measure them in context instead of guessing based on information that could well be wrong.

IMO this kind of article is harmful and misses the more important lesson: measure, measure, measure.


There is a "references" section at the bottom:

[Agner4] Agner Fog, “Instruction tables. Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs”

[Agner3] Agner Fog, “The microarchitecture of Intel, AMD and VIA CPUs. An optimization guide for assembly programmers and compiler makers”

[Intel.Skylake] “Intel® 64 and IA-32 Architectures Optimization Reference Manual”, 2-6, Intel

[Levinthal] David Levinthal, “Performance Analysis Guide for Intel® Core™ i7 Processor and Intel® Xeon™ 5500 processors”, 22

[NoBugs] 'No Bugs' Hare, “C++ for Games: Performance. Allocations and Data Locality”

[AlBahra] Samy Al Bahra, “Nonblocking Algorithms and Scalable Multicore Programming”

[eruskin] http://assemblyrequired.crashworks.org/how-slow-are-virtual-...

[Agner1] Agner Fog, “Optimizing software in C++. An optimization guide for Windows, Linux and Mac platforms”

[Efficient C++] Dov Bulka, David Mayhew, “Efficient C++: Performance Programming Techniques”, Amazon, p. 115

[Drepper] Ulrich Drepper, “Memory part 5: What programmers can do”, section 6.2.2

[TCMalloc] Sanjay Ghemawat, Paul Menage, “TCMalloc : Thread-Caching Malloc”

[Wikipedia.ProtectionRing] “Protection Ring”, Wikipedia

[Ongaro] Diego Ongaro, “The Cost of Exceptions of C++”

[LiEtAl] Chuanpeng Li, Chen Ding, Kai Shen, “Quantifying The Cost of Context Switch”


Multiplies are significantly cheaper than divisions in most recent processors.

First of all, latency is the most important parameter: after memory bandwidth, the latency of the longest dependency chain is typically the bottleneck, especially for floating point code.

For example, on Skylake float muls have a 4-cycle latency (same as adds and MADs) vs. a minimum of 14 cycles for divisions.

But even when only considering throughput, Skylake has two fully pipelined MAD units and can start 2 multiplies every clock cycle, while its single division unit is only partially pipelined and can start a new div only every fourth clock cycle (it is also, IIRC, only 128 bits wide, so 256-bit vector divs are more expensive still).

Avoiding divs (and mods) is still worth optimising for.
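
A small sketch of one common workaround: when dividing many values by the same (non-zero, loop-invariant) divisor, hoist one division out and multiply by the reciprocal (the rounding differs slightly from a true divide, so only where that's acceptable):

    #include <cstddef>

    void scale(float* out, const float* in, std::size_t n, float divisor) {
        const float inv = 1.0f / divisor;    // one ~14-cycle div...
        for (std::size_t i = 0; i < n; ++i)
            out[i] = in[i] * inv;            // ...then fully pipelined multiplies
    }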


I am extremely skeptical of "full" rows in the table, the ones that purport to measure the overall costs of cache invalidation. These costs are so workload specific that a single number is meaningless and likely to mislead. My own benchmarks show costs that are nowhere near the ones cited.


> My own benchmarks show costs that are nowhere near the ones cited.

This is heavily dependent on the benchmark. In the OP there is a ref to academic research showing these numbers - and from my own real-world experience (NOT artificial benchmarks), costs of 10K-100K cycles are very usual. From a completely different perspective: there should be a reason why nginx beats Apache performance-wise :-).


I'm surprised how fast a C function call is. I would have thought that creating a stack frame would be slower than that (and significantly slower than a floating point division), but I guess not.


That's because nothing is created. A function call just puts some registers on the stack and jumps somewhere else. Stack growth etc. is typically handled implicitly by the OS (if a write on the stack page-faults, more stack memory is allocated).

Contrast this with e.g. an interpreter, where a stack frame is usually a real, dedicated object that is allocated when a frame is needed.


Creating a stack frame is just updating a register and writing a couple of values to L1 cache.


Sometimes not even that: it might just update a single register and write nothing to cache or memory... and that cost might be hidden by out-of-order execution and the parallelism of different units in the CPU, making it effectively zero-cost.


Great overview! It would be even better as a general guideline if it included costs for HDD/SSD writes and network usage!


See the Jeff Dean numbers: https://gist.github.com/jboner/2841832


The small difference between direct C calls and virtual C++ calls surprised me, actually. I thought it would be much bigger.


As usual, memory access is the expensive operation. If the virtual function is already in L1 cache, a virtual call should be only a minor slowdown. If it's all the way off in main memory, it will be significantly slower.

Eric Brumer has a great talk on this: https://channel9.msdn.com/Events/Build/2013/4-329


Exactly. Calling a virtual function in a tight loop is almost indistinguishable from a direct (non-inlined) function call. Calling a virtual function every now and then is much more expensive, but also not as likely to matter much to the performance of your program as a whole anyway.

It's still worth noting that every VM like Java or .NET or LuaJIT will optimize for the case where a virtual call usually has only one or two commonly invoked target functions. They use a conditional fast path for the common case(s) and fall back to slower virtual calls or less optimized paths for megamorphic calls. I don't know if any C++ compilers do this kind of thing, but I would be surprised if they did not.


Some C++ compilers support "fast path" devirtualization via profiling feedback:

http://hubicka.blogspot.com/2014/04/devirtualization-in-c-pa...

As you suggest, it's a little misleading to talk about the runtime cost of a virtual or indirect function call in isolation. The most significant cost is often the resulting inability of the compiler to inline and perform further optimizations across the caller and callee.
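
The fast-path idea looks roughly like this when written out by hand (an illustrative sketch with made-up types; a real compiler compares the vtable pointer rather than using dynamic_cast, which has its own cost):

    #include <vector>

    struct Shape  { virtual double area() const = 0; virtual ~Shape() = default; };
    struct Circle : Shape { double r;
        double area() const override { return 3.14159265358979 * r * r; } };

    double total_area(const std::vector<const Shape*>& shapes) {
        double sum = 0.0;
        for (const Shape* s : shapes) {
            if (auto* c = dynamic_cast<const Circle*>(s))   // guard: "it's almost always a Circle"
                sum += 3.14159265358979 * c->r * c->r;      // known target, inlined, no indirect call
            else
                sum += s->area();                           // rare slow path: real virtual call
        }
        return sum;
    }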


Nice, I had a feeling they did that, thanks for the link.

I fully agree: the missed optimization opportunity from not being able to inline is usually the bigger effect, because loops dominate the runtime and that's where those optimizations count.


Virtual function / function pointer calls can still have a big performance impact, especially if it could be inlined away.

Compare C standard library qsort with C++ std::sort performance for sorting integers. The difference is huge in favor of std::sort. If you write your own qsort and put it in a header file so the indirect calls can be inlined, they should be equal.
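
For reference, the shape of that comparison (an illustrative sketch; the point is just where the comparator lives):

    #include <algorithm>
    #include <cstdlib>
    #include <vector>

    // qsort only sees the comparator through a function pointer, so every comparison is an
    // indirect call; std::sort gets the comparison as part of the type and can inline it.
    int cmp_int(const void* a, const void* b) {
        int x = *static_cast<const int*>(a), y = *static_cast<const int*>(b);
        return (x > y) - (x < y);
    }

    void sort_with_qsort(std::vector<int>& v)    { std::qsort(v.data(), v.size(), sizeof(int), cmp_int); }
    void sort_with_std_sort(std::vector<int>& v) { std::sort(v.begin(), v.end()); }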


I've seen virtual calls completely removed and inlined by GCC when there is enough information for it to do that and a simple enough use case... other compilers are probably similar.


Even then, the article and fellow commenters exaggerate the cost of virtual calls and the factors you should worry about. Unless you're optimizing specifically for an in-order CPU, the only cost of virtual functions that really matters in the general/high-level case is not being able to inline them.


An important factor is whether it can be branch predicted. If the virtual call nearly always goes to the same method implementation (e.g., always an OpenGL renderer or always a DirectX renderer), it can actually be pretty cheap. They get expensive mainly when the call is constantly jumping to implementations for different object types.


A virtual call in C++ is just a vtable lookup followed by a direct call - literally one pointer away (or perhaps one pointer plus one integer addition, for the offset).
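
Spelled out by hand, it's roughly this (illustrative names, single inheritance, no offset adjustment):

    struct VTable { double (*area)(const void* self); };

    struct Obj {
        const VTable* vptr;    // hidden first member the compiler maintains
        double radius;
    };

    double call_area(const Obj* o) {
        return o->vptr->area(o);   // load vptr, load the slot, then an indirect call
    }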


However, the address of this call has to be calculated, which means that branch prediction (and the cost of a potential misprediction) is involved. So the additional cost we're talking about is roughly the cost of an L1 read (3-4 clocks by itself), plus anywhere from 0 to the cost of a branch misprediction.


Is it that simple for virtual methods in complex class hierarchies? If the class hierarchy isn't linear, how can you assign simple absolute vtable indices for methods?


Yes it is. The C++ ABI defines how any permitted inheritance hierarchy is mapped into a surjective, ordered set of vtables. Also, calls into methods of other classes in the hierarchy offset the this pointer accordingly.

It feels counterintuitive that these two mechanisms are sufficient; but they are.


One pointer dereference can be as slow as a sqrt if it isn't in cache.


The cost of virtual calls (and calls in general) in the article is greatly exaggerated. Jumps, calls and indirect calls have an effective latency of zero, as they are resolved in the fetch stage (and speculated if the target is not known), and a throughput of one taken jump every one or two cycles.

In a virtual call, the offset computation and load are only used to compute the jump target (which is only needed later, at commit), which means they are not on the critical path [1].

They consume execution and load bandwidth, but that's not normally the bottleneck. The cost of virtual calls is usually due to missed optimizations and the possibility of misprediction.

[1] Technically, if the load is delayed long enough, the CPU might be prevented from committing the speculated instructions and eventually exhaust its reorder buffer and stall.


> The cost of virtual calls is usually due to missed optimizations and the possibility of misprediction.

Yep (plus one L1 read to get that VMT pointer). And the OP is pretty much consistent with this (at least within the declared "indications of the order of magnitude" precision).


No, the load is not on the critical path, so it is not normally part of the overhead.



