Callgrind is certainly dispensable; just use a profiler. They are much, much fas...

nickelpro · 2024-08-29T15:45:10 1724946310

perf does not provide me with the complete callstack, it's a sampling profiler.

In effectively all latency-sensitive contexts, sampling is worthless. 99.999999% of the time the program is waiting for IO, and then for a handful of microseconds there's a flurry of activity. That activity is the only part I care about and perf will effectively always miss it and never record it to completion.
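To put rough numbers on the sampling argument above, here is a back-of-the-envelope sketch. The 5 µs burst length and the 1000 Hz sampling rate are illustrative assumptions, not figures from this thread or defaults of any particular tool, and the function names are made up for the example.

```python
def expected_samples_per_burst(burst_seconds: float, sample_hz: float) -> float:
    """Average number of samples that land inside one burst of activity,
    assuming samples are taken at a fixed rate unrelated to when the burst starts."""
    return burst_seconds * sample_hz


def bursts_per_sample(burst_seconds: float, sample_hz: float) -> float:
    """How many bursts must occur, on average, before a single sample hits one."""
    return 1.0 / expected_samples_per_burst(burst_seconds, sample_hz)


if __name__ == "__main__":
    burst = 5e-6   # assumed: a 5 microsecond flurry of activity
    rate = 1000.0  # assumed: 1000 samples per second
    print(f"samples per burst: {expected_samples_per_burst(burst, rate):.4f}")
    print(f"bursts per sample: {bursts_per_sample(burst, rate):.0f}")
```

Under these assumptions a burst collects far less than one sample on average, so any single burst is almost never observed, let alone traced from start to finish. Raising the sampling rate helps the aggregate picture but still does not reconstruct one specific chain of events.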

I need to know the exact chain of events that leads to an object cache miss causing an allocation to occur, or exactly the conditions which led to a slow path branch, or which request handler is consistently forcing buffer resizes, etc.

I never need a profiler to tell me "memory allocation is slow" (which is what perf will give me). I know memory allocation is slow; I need to know why we're allocating memory.

mananaysiempre · 2024-08-30T13:37:38 1725025058

perf is of course a sampling profiler, but perf record -g most definitely does provide you with a complete callstack, provided you have all your debug info in place.

dennis_moore · 2024-08-29T07:47:13 1724917633

Which profilers in particular are you referring to? I've always thought that Callgrind is a profiler. perf?

Sesse__ · 2024-08-29T12:17:46 1724933866

perf or Intel VTune are the two standard choices AFAIK. Both have a certain learning curve, both are extremely capable in the right hands. (Well, on macOS you're pretty much locked to using Instruments; I don't know if Callgrind works there but would suspect it's an uphill battle.)

Callgrind is a CPU simulator that can output a profile of that simulation. I guess it's semantics whether you want to call that a profiler or not, but my point is that you don't need a simulator+profiler combo when you can just use a profiler on its own.

(There are exceptions where the determinism of Callgrind can be useful, like if you're trying to benchmark a really tiny change and are fine with the bias from the simulation diverging from reality, or if you explicitly care about call count instead of time spent.)
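As a loose analogy for the call-count case (this is not Callgrind, just the same idea in miniature), the sketch below uses Python's standard `cProfile`, a deterministic profiler, to show that call counts come out identical on every run even though timings would not. `helper`, `workload`, and `call_count` are hypothetical names for the example.

```python
import cProfile
import pstats


def helper(x: int) -> int:
    return x * 2


def workload(n: int) -> int:
    # Calls helper exactly n times.
    return sum(helper(i) for i in range(n))


def call_count(func_name: str, n: int) -> int:
    """Run workload(n) under cProfile and return how often func_name was called."""
    profiler = cProfile.Profile()
    profiler.runcall(workload, n)
    stats = pstats.Stats(profiler)
    # stats.stats maps (file, line, name) -> (primitive calls, total calls, tottime, cumtime, callers)
    for (_file, _line, name), (_cc, ncalls, _tt, _ct, _callers) in stats.stats.items():
        if name == func_name:
            return ncalls
    return 0


if __name__ == "__main__":
    first = call_count("helper", 1000)
    second = call_count("helper", 1000)
    print(first, second, first == second)
```

The counts are exact and repeatable, which is the property that makes a deterministic tool attractive for comparing tiny changes; the cost is the instrumentation overhead and the distortion of timings that come with it.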

rwmj · 2024-08-29T09:42:07 1724924527

perf on the whole system, with the whole software stack compiled with frame pointers and flamegraphs for visualisation, is an essential starting point for understanding real-world performance problems.