The current implementation of "default-perfevent" doesn't take into account perf counter multiplexing, so counters may be off if you have other perf measurements running in the background.
How does cycle counting interact with CPU frequency scaling? Do you just assume that the CPU is at a fixed frequency for performance comparisons & analysis? (I suppose this is accurate for short-lived benchmarks.)
I imagine you'd see some weird effects if the CPU changed its frequency during a benchmark, or if the process moved between cores, since (as I understand it) RAM effectively gets faster relative to the CPU when the CPU is clocked lower. So some operations would take fewer cycles when the CPU is running slower.
- default-perfevent: the kernel keeps track of counter values across core moves
- amd64-pmc: accesses a cycle counter through RDPMC; best to pin the benchmark binary to a specific core to avoid measurement issues when the task moves between cores, e.g. "taskset -c 1 <benchmark-binary>" (see the pinning sketch after this list)
- amd64-tsc, amd64-tscasm: RDTSC is an off-core counter, so it isn't influenced by cross-core moves ("On current CPUs, this is an off-core clock rather than a cycle counter, but it is typically a very fast off-core clock, making it adequate for seeing cycle counts if overclocking and underclocking are disabled.")
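To make the pinning concrete, here's a minimal sketch (my own, not code from libcpucycles; assumes x86-64 Linux with GCC or Clang) that does the taskset equivalent in-process and then reads the TSC around a measured region:

    /* Sketch only: pin to one core, then read the TSC directly. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>   /* __rdtsc() */

    int main(void) {
        /* Equivalent of "taskset -c 1": restrict this process to core 1. */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(1, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0)
            perror("sched_setaffinity");

        uint64_t t0 = __rdtsc();
        /* ... code under measurement ... */
        uint64_t t1 = __rdtsc();
        printf("TSC delta: %llu\n", (unsigned long long)(t1 - t0));
        return 0;
    }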
The cycle counters are just monotonically incrementing values, so polling them just lets you calculate the average effective frequency over the poll interval. You're right that stuff like I/O might happen in different numbers of cycles depending on the actual frequency, but those operations generally take the same amount of wall time either way, so it comes out in the wash.
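As a concrete sketch (my own, assuming x86-64 Linux; note that on CPUs with an invariant TSC this reports the fixed TSC rate, whereas a true cycle counter such as the RDPMC one above would show the core's actual average frequency):

    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>
    #include <x86intrin.h>   /* __rdtsc() */

    static double now_sec(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void) {
        uint64_t c0 = __rdtsc();
        double t0 = now_sec();

        for (volatile long i = 0; i < 100000000; i++) ;  /* some workload */

        uint64_t c1 = __rdtsc();
        double t1 = now_sec();

        /* counts / elapsed wall time = average effective frequency */
        printf("average effective frequency: %.3f GHz\n",
               (double)(c1 - c0) / (t1 - t0) / 1e9);
        return 0;
    }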
Yes, you must disable CPU frequency scaling in your BIOS if you're doing this kind of work (i.e. building cryptographic primitives that don't leak information via timing).
Given the extreme complexity of modern CPUs and the parallelism of their internal processes, and given that the “architecture” (the instruction set, registers, etc.) is now far removed from how things are done at the hardware level, does the notion of a “CPU cycle” still make sense?
I think your question is a little too broad to answer precisely, since it depends on your definition of "making sense". CPU cycles are still relevant in many respects, however.
Generally speaking, just because modern x86 CPUs use numerous abstractions for high-level instruction-set features does not mean they don't use cycles. (Clock) cycles are an inherent feature of synchronous logic, and all CPUs, modern as well as ancient, use synchronous logic. Yes, there might be some esoteric outliers, but 99.9% of CPUs are synchronous.
The difference from older CPUs is that modern CPUs don't necessarily retire (essentially, "execute") an instruction in every clock cycle. In any given clock cycle, a CPU might retire no instructions, just one, or several.
A typical application (where clock cycles are important) is looking at the number of retired instructions per cycle (IPC), which can give a rough overview of whether a program is frontend-bound or backend-bound. You can try this yourself using "perf stat" (part of the Linux perf tools): run different applications under perf stat and look for "insn per cycle" in the report. You will generally find that interpreted programs (Python, Node, etc.) have a mediocre IPC of well below 1 (due to poor cache utilization and branch mispredictions), while compiled, optimized programs might reach an IPC of 2. A higher IPC is not necessarily better, though; it's complicated^TM.
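If you want the same numbers programmatically rather than via perf stat, something like this works (a sketch using the raw perf_event_open syscall, which has no glibc wrapper; assumes Linux with perf_event_paranoid permitting user-space counting):

    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdint.h>
    #include <stdio.h>

    static int open_counter(uint64_t config) {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = config;
        attr.disabled = 1;
        attr.exclude_kernel = 1;
        attr.exclude_hv = 1;
        /* current process, any CPU, no group leader, no flags */
        return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    }

    int main(void) {
        int cyc = open_counter(PERF_COUNT_HW_CPU_CYCLES);
        int ins = open_counter(PERF_COUNT_HW_INSTRUCTIONS);
        if (cyc < 0 || ins < 0) { perror("perf_event_open"); return 1; }

        ioctl(cyc, PERF_EVENT_IOC_RESET, 0);
        ioctl(ins, PERF_EVENT_IOC_RESET, 0);
        ioctl(cyc, PERF_EVENT_IOC_ENABLE, 0);
        ioctl(ins, PERF_EVENT_IOC_ENABLE, 0);

        volatile double x = 1.0;                     /* some workload */
        for (long i = 0; i < 10000000; i++) x *= 1.0000001;

        ioctl(cyc, PERF_EVENT_IOC_DISABLE, 0);
        ioctl(ins, PERF_EVENT_IOC_DISABLE, 0);

        uint64_t cycles = 0, insns = 0;
        read(cyc, &cycles, sizeof(cycles));
        read(ins, &insns, sizeof(insns));
        printf("insn per cycle: %.2f\n", (double)insns / (double)cycles);
        return 0;
    }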
A professor of mine who worked on transactional memory and speculative execution in Intel processors said the notion of a clock doesn't generally hold any more. He has a story of showing up and asking for a timing diagram; they laughed, since one hadn't been possible for many years at that point. Cycles can take a variable amount of time depending on various non-deterministic attributes, including speculation and out-of-order execution under speculation. My understanding is that cycle counting degenerates to some normalization of time that makes instruction latencies roughly comparable and stable. But this isn't my area, so I may just be talking out my ass.
Given that a constant clock speed is practically a security vulnerability in terms of EM emissions, it's not so surprising that determinism is out the window even at such a low level. Thanks for sharing.
I don't have any experience with papi, but this looks like it could be a lighter-weight solution for those who don't want to pull in a larger dependency (hobby projects?). Just a guess based on a quick skim of the papi docs [0].
https://cpucycles.cr.yp.to/libcpucycles-20230105/cpucycles/d...
It needs to read time_enabled / time_running and scale the returned values:
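Something along these lines (a sketch of the scaling recipe from the perf_event_open(2) man page, not the actual libcpucycles code):

    #include <linux/perf_event.h>
    #include <stdint.h>
    #include <unistd.h>

    /* attr.read_format must include:
       PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING */

    struct read_format {
        uint64_t value;         /* raw counter value */
        uint64_t time_enabled;  /* ns the event was enabled */
        uint64_t time_running;  /* ns it was actually on the PMU */
    };

    static uint64_t read_scaled(int fd) {
        struct read_format rf;
        if (read(fd, &rf, sizeof(rf)) != (ssize_t)sizeof(rf)) return 0;
        if (rf.time_running == 0) return 0;      /* never scheduled */
        if (rf.time_running == rf.time_enabled)  /* no multiplexing */
            return rf.value;
        /* Counter was multiplexed: extrapolate to the full interval.
           The scaled value is an estimate, not an exact count. */
        return (uint64_t)((double)rf.value *
                          (double)rf.time_enabled / (double)rf.time_running);
    }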