Your #1 tool is simply a stopwatch. Write programs, time them, and then learn. See why some programs are faster or slower than others.
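A minimal stopwatch sketch of that idea: time two pieces of code that compute the same answer and compare. The function names and sizes here are just illustrative choices.

```python
# Stopwatch sketch: same result, two implementations, measure both.
import time

def sum_loop(n):
    total = 0
    for i in range(n):   # explicit Python loop
        total += i
    return total

def sum_builtin(n):
    return sum(range(n))  # work pushed into the C-level builtin

def stopwatch(fn, n):
    start = time.perf_counter()      # monotonic, high-resolution clock
    result = fn(n)
    elapsed = time.perf_counter() - start
    return result, elapsed

r1, t1 = stopwatch(sum_loop, 1_000_000)
r2, t2 = stopwatch(sum_builtin, 1_000_000)
assert r1 == r2   # same answer, usually very different times
print(f"loop: {t1:.4f}s  builtin: {t2:.4f}s")
```

Run it a few times and watch the numbers jitter; that variance is itself a lesson about measurement.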
Your #2 tool is hardware performance counters. It's much easier to understand the cache and branch prediction when you use the hardware to count how many times the cache is hit and when branches are taken.
Section 5 of Linux "perf" accesses these hardware performance counters for you.
Write a program and see how many cache misses it has. Write a slightly different program and the cache-miss count will be different. How did that change your #1 measurement (time to complete the program)?
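As a sketch of a pair of "slightly different programs": the two traversals below visit the same data and compute the same sum, but with different memory-access patterns. Running each variant under `perf stat` (a standard perf invocation, shown in the comment) is how you would compare counter readings; the matrix size is an arbitrary choice.

```python
# Two traversals, same data, different access patterns. To compare
# hardware counters you would run each variant under perf, e.g.:
#   perf stat -e cache-references,cache-misses python3 this_script.py
import random

N = 512
matrix = [[random.random() for _ in range(N)] for _ in range(N)]

def sum_row_major(m):
    # Walks each row in order: consecutive elements, cache-friendly.
    return sum(m[i][j] for i in range(N) for j in range(N))

def sum_col_major(m):
    # Jumps to a different row on every access: far worse locality
    # (the effect is much more dramatic in C than through Python's
    # interpreter overhead, but the measurement workflow is the same).
    return sum(m[i][j] for j in range(N) for i in range(N))

a = sum_row_major(matrix)
b = sum_col_major(matrix)
assert abs(a - b) < 1e-6  # same answer either way; the counters differ
print(a)
```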
Now do that a thousand times, with a thousand different programs trying to solve the same thing. Bam, now you're an expert.
It's not really hard. It's just a matter of experience. You'll get there if you work at it and use the right tools to "see" what is going on.
------
Why do people talk about cache hits and branch prediction? Because in THEIR experience, counting cache hits and looking at branch predictions results in performance gains.
Their experience won't necessarily match yours, but you can still learn from them in general.
While "perf" is great (especially for something like optimizing a shared library loaded into Python) I had a hard time using any other counter than the instruction retirement for anything meaningful.
"perf list" list 1300 things, and it seems those counters come and go between architectures.
Intel's tools are free now and give you a very nice graphical analysis tool. You don't need to buy or use the Intel compiler; you just need 1-2 GB of free disk space.
That's true, but "perf" is more portable between platforms.
For example, I can't use VTune at all, because I run a Threadripper 1950X. Personally, I use AMD uProf instead.
Without knowing the guy's setup, "perf" is the most portable option. The processor-specific tools will be more accurate, but also more complicated. "perf" is the Linux tool that tries to work across platforms: AMD, Intel, and even ARM.
---------
"Perf" will give you L1 / L2 cache hits, and accurate counts at that. The main issue is "branch prediction", which requires very complicated circuitry to really predict. You need to use Intel's trace analyzer for accurate branch prediction statistics.
For AMD, you need to use instruction-based sampling in uProf for good branch prediction statistics.
L1 / L2 cache stuff is the beginner level though, so Perf should be a good starting point in this whole mess.
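To make the branch-prediction discussion concrete, here is a sketch of a data-dependent branch whose predictability you could measure with a standard counter invocation like `perf stat -e branches,branch-misses`. The threshold and data size are arbitrary; the point is that sorted input makes the branch highly predictable while shuffled input does not, with no change to the answer.

```python
# A branchy loop: the predictor's success depends on the data order,
# not on the code. Measure with e.g.:
#   perf stat -e branches,branch-misses python3 this_script.py
import random

data = [random.randrange(256) for _ in range(100_000)]

def count_big(values, threshold=128):
    hits = 0
    for v in values:
        if v >= threshold:   # this is the branch the predictor learns
            hits += 1
    return hits

shuffled_count = count_big(data)   # random order: branch is a coin flip
data.sort()
sorted_count = count_big(data)     # sorted order: branch flips exactly once
assert shuffled_count == sorted_count  # same result; miss counts differ
print(sorted_count)
```

(The classic C version of this experiment shows a large wall-clock difference too; through the Python interpreter the effect on time is muted, but the counters still tell the story.)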
Brendan Gregg's perf page is a good reference for the available events: http://www.brendangregg.com/perf.html#Events