Magic-trace: Diagnose tricky performance issues with Intel Processor Trace (janestreet.com)
170 points by trishume on Jan 27, 2022 | hide | past | favorite | 18 comments



The visualization tools presented look really nice, but they seem to present program execution as sequential and linear, which is a model that seems like it will really break down at these time scales (10s of cycles).

Modern processors will look hundreds of instructions into the future and try to start executing them as soon as possible. Branches are predicted far in advance of when they can actually be evaluated. Many instructions can be executing simultaneously. A clean tidy flame graph showing 1-3ns slices (~5 cycles) cannot help but be a vast simplification of what the CPU is really doing.

The linked page about Processor Trace says this:

> instruction data (control flow) is perfectly accurate but timing information is less accurate

The article mentions using magic-trace to detect changes in inlining decisions made by the compiler. This is a case where it will shine, since PT can perfectly capture the control flow, and it doesn't necessarily rely on having perfect timestamps for everything.


Hey - I wanted to say that I came across a comment of yours from more than a decade ago (https://news.ycombinator.com/item?id=2328627), and I was startled at how accurate it is as a prediction of how parsers and IDEs are combined today, about 11 years later. I'm glad you're still commenting on here (and what a criminal understatement it is for that page to characterise their tool's flaws as "timing information is less accurate" - that's bloody execution order that you're talking about!).

Anyway, I wanted to say how much I appreciate your comment of 10 years ago. I'm also a parser nerd, and a performance nerd, and I feel strongly that programmers have a professional responsibility to write code in a way that expresses our intent by a logical minimum of instructions/work. I strongly suspect that this will become important again in the future, not because the ratio of software-efficiency to hardware-power decreases again, but because climate concerns will drive us to measure our code in performance-per-watt rather than performance-per-dollar (depending on what action is taken on carbon pricing, it may be a distinction without a difference).

I look forward to the day when grossly inefficient software is rightly considered to be as unacceptable as grossly inefficient SUVs, and people in our profession are forced to take responsibility for the damage that their obscenely inefficient crap is doing. I hope Python 4 comes with a snorkel.


It seems like this is basically unavoidable on existing hardware, though, right?

If we imagine there existed some visualization that could more accurately represent the complexity of a core, I don't know how it would be possible to get the data, because AFAIK there are no methods to trace processor execution on modern processors at higher fidelity than this.

Even sampling profilers have similar issues with being limited to the model of sequential instruction streams, since each sample gives a single program counter, not the full view of everything the core has in flight.
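As a toy illustration of that limitation, here's a minimal signal-based sampling profiler sketch (Python; names like `busy_work` are made up for the example). Each sample captures one location in the nominal instruction stream and nothing about what the core is actually doing in parallel at that instant:

```python
import signal

samples = []

def on_sample(signum, frame):
    # Each sample is a single point in the sequential instruction stream:
    # one function and line number, nothing about what else is in flight.
    samples.append((frame.f_code.co_name, frame.f_lineno))

signal.signal(signal.SIGPROF, on_sample)
signal.setitimer(signal.ITIMER_PROF, 0.001, 0.001)  # sample ~every 1 ms of CPU time

def busy_work():
    total = 0
    for i in range(2_000_000):
        total += i * i
    return total

busy_work()
signal.setitimer(signal.ITIMER_PROF, 0)  # stop sampling

# Most samples land inside busy_work, but each one flattens the CPU's
# out-of-order parallelism into a single program-counter-style observation.
print(f"{len(samples)} samples")
```

Real sampling profilers like perf do the same thing with hardware interrupts instead of `SIGPROF`, but the model is identical: one PC per sample per thread.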


Yes, I agree that higher-resolution data is not readily available. LLVM MCA has a timeline view that attempts to visualize the overlapping execution of instructions (https://llvm.org/docs/CommandGuide/llvm-mca.html#timeline-vi...), but this is based on models of how the CPU works (not runtime-collected data), and these models are not perfect.

I also agree that sampling profilers have the same issue: instruction-level views of sampling profiles should be taken with a grain of salt.

My concern is that flame graphs with 1-3ns of resolution are presented as a selling point of the tool, without any mention of the caveats around how this model really breaks down at this time scale. I would like to know more details of how the PT data actually relates to the out-of-order execution. Does a branch's timestamp correspond to when that branch was retired? Do we actually know what the timestamp corresponds to, or is it not well-specified? Are there cases where the timestamp is known to be misleading about the true bottleneck?

I don't know the answers to these questions, but when I see a tool like this, I really want more information about the strengths and limitations of the data.


> A clean tidy flame graph showing 1-3ns slices (~5 cycles) cannot help but be a vast simplification of what the CPU is really doing.

Sure, the thinnest slices at the highest zoom are going to be misleading. They're also not what you generally will want to be looking at (though they may provide context that helps you identify the parts of higher-level functions taking a long time, or hints about cache contention etc).


Oh man, this is major. I would’ve loved to have something like that 10 years ago when CPU was a bit more precious. Still very useful today, just not to the same extent.


I wish more devs continued to realize that CPU is precious. One executable running in isolation is only limited by how fast it can consume clock cycles, but we now have hundreds of executables each running dozens or hundreds of functions all the time. One unoptimized IO function can cause stutters across the whole system as it runs. Thousands of unoptimized IO functions can....

Well, we all know. We use systems like that daily.
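The cost of one unoptimized IO path is easy to make concrete. Here's a small sketch (Python; the `CountingRaw` wrapper is hypothetical, just to make the syscall difference visible) comparing per-write syscalls against buffered writes:

```python
import io
import os

class CountingRaw(io.RawIOBase):
    """Hypothetical raw file wrapper that counts write() syscalls."""
    def __init__(self, path):
        self.fd = os.open(path, os.O_WRONLY)
        self.writes = 0
    def writable(self):
        return True
    def write(self, b):
        self.writes += 1
        return os.write(self.fd, b)
    def close(self):
        os.close(self.fd)
        super().close()

# Unbuffered: one syscall per 5-byte record, 1000 syscalls total.
raw1 = CountingRaw(os.devnull)
for _ in range(1000):
    raw1.write(b"data\n")
raw1.close()

# Buffered: the same 5000 bytes fit in one 8 KiB buffer,
# so they go out in a single flush at close.
raw2 = CountingRaw(os.devnull)
buf = io.BufferedWriter(raw2, buffer_size=8192)
for _ in range(1000):
    buf.write(b"data\n")
buf.close()

print(raw1.writes, "unbuffered syscalls vs", raw2.writes, "buffered")
```

Each of those extra syscalls is a kernel round-trip the rest of the system has to schedule around, which is exactly how one careless writer turns into system-wide stutter.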


Didn't know this existed; interesting, and it could certainly be useful at a forensic level. We've had tools to highlight slow-running multi-threaded code in apps for probably about 15 years now, but this takes it to a whole new level.

From the link it says: " it needs a post-Skylake Intel processor " https://en.wikipedia.org/wiki/Skylake_(microarchitecture)

Man page has a description https://www.man7.org/linux/man-pages/man1/perf-intel-pt.1.ht...

Don't know who Lauterbach in Germany are, but they have a training manual which goes back to 1989.

https://www2.lauterbach.com/pdf/training_ipt_trace.pdf

https://www2.lauterbach.com/pdf/trace_intel_pt.pdf

https://www2.lauterbach.com/pdf/debugger_x86.pdf

From their manuals "The Intel® Processor Trace (IPT) works similar to the LBR and BTS feature of Ix86 based cores (see “CPU specific Onchip Trace Commands” (debugger_x86.pdf)"

I knew about the debug trace on ARM CPUs like the RPi, but didn't know Intel had one. It's suggested AMD doesn't have one either, so there might be some security reason for that. It depends on whether the trace is output-only or whether it's possible to use things like SOIC clips to alter bits and bytes in real time (at slowed-down rather than normal CPU clock speeds).


> Don't know who Lauterbach in Germany are

They make a lot of debuggers for embedded targets. Often with tracing capability and such. Really good tools.


Really good tools and extremely expensive


That is true, but at least for the targets I've worked on it makes sense because the number of potential buyers is just very low.


tracy (https://github.com/wolfpld/tracy), also mentioned in this article, is for some reason criminally underused and little known in the wider community.


And LIKWID


Does LIKWID actually do such high resolution sampling? (I'm pretty sure it doesn't do the tracing, anyway.) I've not used it seriously, and it seems to be a collection of things I can do with other tools, though I may mis-judge it.


I looked at it cursorily, and I don't think it does that.


The best I've found so far is gperftools (in contrast to perf, you get good results even with -O3 optimization and heavily inlined code). This seems to be much more accurate, but I'm not sure we could handle that amount of data, because we usually take longer traces.


This says post-Skylake, but both my SKX workstation and my i5-6200U laptop have a 1 in /sys/bus/event_source/devices/intel_pt/caps/psb_cyc, which seems to be the relevant condition, though I haven't tried to use it.
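That capability flag can be checked programmatically too. A small sketch (assuming the standard sysfs layout; returns None where the file is absent, e.g. on non-Intel or virtualized hardware):

```python
from pathlib import Path

# Standard sysfs location for Intel PT capability flags on Linux.
CAPS = Path("/sys/bus/event_source/devices/intel_pt/caps")

def pt_cap(name):
    """Return an Intel PT capability value as a string, or None if unavailable."""
    f = CAPS / name
    try:
        return f.read_text().strip()
    except OSError:
        return None

# psb_cyc indicates support for cycle-accurate timing packets.
print("psb_cyc:", pt_cap("psb_cyc"))
```

On a machine without PT (or without the sysfs node), every lookup just comes back None rather than raising.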


Doesn’t VTune support processor trace? Some VMs support PT. And AWS has support also.



