Callgrind is certainly dispensable; just use a profiler. Profilers are much, much faster, and more accurate as well. (Callgrind uses an idealized model of CPUs as they were around 1995, which doesn't match up all that well with how they work today.)
There are some situations where I find myself using Callgrind, in particular when stack traces are hard to extract using a regular profiler. But overall, it's a tool that I find vastly overused.
perf does not give me the complete call stack; it's a sampling profiler.
In effectively all latency-sensitive contexts, sampling is worthless. 99.999999% of the time the program is waiting for IO, and then for a handful of microseconds there's a flurry of activity. That activity is the only part I care about and perf will effectively always miss it and never record it to completion.
I need to know the exact chain of events that leads to an object cache miss causing an allocation to occur, or exactly the conditions which led to a slow path branch, or which request handler is consistently forcing buffer resizes, etc.
I never need a profiler to tell me "memory allocation is slow" (which is what perf will give me). I know memory allocation is slow; I need to know why we're allocating memory.
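To put rough numbers on the sampling concern above, here is a back-of-the-envelope sketch. The 1 kHz sampling rate and 5 microsecond burst length are illustrative assumptions, not figures from this thread, and the model treats the sampler as a fixed-rate wall-clock timer, which is a simplification of how perf actually triggers samples.

```python
# Rough model: a sampler firing at a fixed wall-clock rate, and one short
# burst of work. All numbers are illustrative assumptions.

def expected_samples_per_burst(sample_hz: float, burst_seconds: float) -> float:
    """Expected number of samples landing inside a single burst."""
    return sample_hz * burst_seconds


def bursts_needed(sample_hz: float, burst_seconds: float, wanted_samples: float) -> float:
    """Bursts to observe, on average, to collect `wanted_samples` samples."""
    return wanted_samples / expected_samples_per_burst(sample_hz, burst_seconds)


SAMPLE_HZ = 1_000       # assumed sampling rate
BURST_SECONDS = 5e-6    # assumed burst length (5 microseconds)

per_burst = expected_samples_per_burst(SAMPLE_HZ, BURST_SECONDS)
needed = bursts_needed(SAMPLE_HZ, BURST_SECONDS, wanted_samples=1_000)

print(f"expected samples per burst: {per_burst:.4f}")
print(f"bursts needed for ~1000 samples: {needed:,.0f}")
```

Under these assumptions a given burst is hit by a sample roughly once in 200 occurrences, so no single burst is ever recorded from start to finish; a profile of the burst only emerges in aggregate over many of them.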
perf is of course a sampling profiler, but perf record -g most definitely does provide you with a complete callstack for each sample, provided the binaries are built with frame pointers (or you unwind with --call-graph dwarf) and you have all your debug info in place.
perf and Intel VTune are the two standard choices AFAIK. Both have a certain learning curve, and both are extremely capable in the right hands. (Well, on macOS you're pretty much locked to using Instruments; I don't know if Callgrind works there but would suspect it's an uphill battle.)
Callgrind is a CPU simulator that can output a profile of that simulation. I guess it's semantics whether you want to call that a profiler or not, but my point is that you don't need a simulator+profiler combo when you can just use a profiler on its own.
(There are exceptions where the determinism of Callgrind can be useful, like if you're trying to benchmark a really tiny change and are fine with the bias from the simulation diverging from reality, or if you explicitly care about call count instead of time spent.)
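As a loose illustration of the call-count point, here is a minimal Python sketch. It is not Callgrind and says nothing about Callgrind's CPU model; it only shows that a call count is exactly reproducible from run to run, whereas a wall-clock timing generally is not. The `helper` and `workload` functions are hypothetical stand-ins, and the counting uses the standard `sys.setprofile` hook.

```python
import sys
from collections import Counter


def count_calls(func, *args):
    """Run func(*args) and count Python-level function calls by name."""
    counts = Counter()

    def tracer(frame, event, arg):
        # "call" fires for Python functions; C functions report "c_call".
        if event == "call":
            counts[frame.f_code.co_name] += 1

    sys.setprofile(tracer)
    try:
        func(*args)
    finally:
        sys.setprofile(None)
    return counts


def helper():
    # Hypothetical unit of work.
    return 1


def workload(n):
    # Hypothetical workload: calls helper() exactly n times.
    total = 0
    for _ in range(n):
        total += helper()
    return total


first = count_calls(workload, 50)
second = count_calls(workload, 50)
print(first["helper"], second["helper"])
```

Because the two counts are identical, a tiny change that adds or removes calls shows up as an exact difference rather than something that has to be separated from timing noise.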
perf on the whole system, with the whole software stack compiled with frame pointers and flamegraphs for visualisation, is an essential starting point for understanding real world performance problems.