I've been trying to tell people to use perf for years now. I think a lot of disservice was done by early versions that were buggy and crash- and freeze-prone. Still better than oprofile before it, though.
It only got better once they added tracepoints to perf events. Now I can simply trace all futex syscall entries to look for lock/synchronization contention.
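As a sketch of that kind of futex tracing (event names and options can vary by kernel and perf version, and this generally needs root or tracing privileges):

```shell
# List the futex-related static tracepoints on this kernel
perf list 'syscalls:sys_enter_futex*'

# Count futex syscall entries system-wide for 10 seconds
perf stat -e syscalls:sys_enter_futex -a -- sleep 10

# Record with call stacks, then report to see which code paths contend
perf record -e syscalls:sys_enter_futex -a -g -- sleep 10
perf report
```

A high futex entry rate alone isn't proof of contention, but the recorded stacks usually point straight at the lock in question.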
One extra tool I recommend is dynamic debug logging, which you can enable in the kernel build. Create a log mask (per file, line, or module), set a level, and you're off to the races. Invaluable in certain scenarios (like debugging bugs in the failure path of the fscache module).
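For reference, dynamic debug is driven through debugfs once CONFIG_DYNAMIC_DEBUG is set; a minimal sketch (needs root; fscache here is just the module from the example above):

```shell
# See which debug call sites are known for the module
grep fscache /sys/kernel/debug/dynamic_debug/control | head

# Enable pr_debug() output for a whole module...
echo 'module fscache +p' > /sys/kernel/debug/dynamic_debug/control

# ...or narrow the mask to one file, or even one line
echo 'file main.c line 123 +p' > /sys/kernel/debug/dynamic_debug/control

# Messages appear in the kernel log
dmesg | tail

# Turn it back off
echo 'module fscache -p' > /sys/kernel/debug/dynamic_debug/control
```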
Same here. I think the hard part is choosing what to control, how to measure, and what to capture. After all, performance testing is running an experiment. Generating pretty visualizations is 1% of the hard work, if you have the data you need...
These posts and talks by Brendan Gregg are awesome; they always prompt several hours of playing around with new tools and digging into kernel code. I'm trying to get familiar enough with these tools and methods that I can start using them at work and spreading the knowledge around.
Yes, all tracing tools have overhead. It's the observer effect. Nowadays I'm better at warning about this than I was in the past.
My company has a performance engineering team, but I imagine smaller companies don't have any performance staff, or time to deal with understanding this. Hence the value in creating canned tools for others to use, provided their warnings are made clear.
sysdig is simple, and has some awesome features, but also lacks many capabilities (eg, kernel dynamic tracing) that we currently need ftrace or perf_events for.
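Kernel dynamic tracing is, for example, what ftrace's kprobe interface gives you; a rough sketch (needs root; `do_sys_open` is just an illustrative probe point — check `available_filter_functions` for what exists on your kernel):

```shell
cd /sys/kernel/debug/tracing

# Create a dynamic probe on a kernel function
echo 'p:myprobe do_sys_open' >> kprobe_events
echo 1 > events/kprobes/myprobe/enable

# Let it collect for a bit, then look at the trace buffer
sleep 5
head -20 trace

# Clean up: disable and remove the probe
echo 0 > events/kprobes/myprobe/enable
echo '-:myprobe' >> kprobe_events
```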
Edit: I should add -- there are three examples in the original post that use ftrace. You can't currently do any of these with sysdig.
This is so cool. Netflix seems to have a damn fine eng team. It's too bad they don't have roles for new grads :P
Is there an authoritative beginner's guide to Linux performance analysis? A cursory glance at trace-cmd tutorials shows I have a lot to learn. Or maybe it would be more beneficial to start playing with a tool like sysdig?
One problem is just knowing what's out there. My tools diagrams on http://www.brendangregg.com/linuxperf.html should help, and my last LinuxCon talk on Linux Performance Tools is a good overview of all the stuff.
As for learning how to do all the stuff... If you're a programmer, a good approach is to write your own simple, poorly performing programs to analyze. Eg:
- burn CPU in a function
- do too much disk I/O in a function
- do lots of large network I/O from a function
- do heavy memory allocation from a function
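A throwaway sketch of a couple of these in shell (any language works — the point is that you wrote it, so you already know the answer):

```shell
# Toy target 1: burn CPU in a known function
burn_cpu() {
  i=0
  while [ "$i" -lt 200000 ]; do i=$((i + 1)); done
}

# Toy target 2: do too much disk I/O from a known function
churn_disk() {
  f=$(mktemp)
  n=0
  while [ "$n" -lt 10 ]; do
    # synchronous 1 MB writes, so the I/O actually hits the disk path
    dd if=/dev/zero of="$f" bs=1M count=1 conv=fsync 2>/dev/null
    n=$((n + 1))
  done
  rm -f "$f"
}

burn_cpu
churn_disk
echo "workload done"
```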
Now analyze these using the Linux toolset. Quantify the behavior (how much per second), show its effect on system resources (%utilized, or IOPS, or whatever), then drill down and identify the code path responsible: the function.
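That drill-down might look something like this with perf (a sketch; `./toyprog` is a stand-in for whichever toy program you wrote, and options vary by perf version):

```shell
# Quantify: basic counts and rates while the program runs
perf stat -- ./toyprog

# Drill down: sample on-CPU stacks at 99 Hz...
perf record -F 99 -g -- ./toyprog

# ...and identify the responsible function from the report
perf report --stdio | head -30
```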
This sounds easy, since you wrote the program to start with and already know the answer. But there's often a lot of quirky behavior with perf tools, and assumed knowledge to learn, so you want the target to be as easy as possible.
A mistake beginners make is to aim performance tools at complex production workloads, and then get lost and confused. It's like trying to run before knowing how to walk.
I think Julia Evans is going to cover this approach in an upcoming talk, so that should become a good reference.
You'll also want to develop a good understanding of software, systems, and kernels, so that you can reason about performance. That may take years to pick up, and the study of various textbooks. I tried to summarize all the content in my last book, Systems Performance.
Brendan's enterprise/cloud performance book is super solid. Not only does it give information on performance analysis, but it's a very solid primer on the systems involved. It's not specifically Linux — lots of Solaris/illumos stuff in it too — but it's definitely a starting point.
Not a ton on ftrace in it, though. Maybe we'll see a new book more focused on Linux specifically with some ftrace stuff?
Thanks, yes, my Systems Performance book focused on the how and why of performance analysis: background, concepts, architecture, methodologies. The implementation specifics (which are an application of the earlier content) were covered last in each chapter, and were Linux- and Solaris-based. I had some perf_events content, but not ftrace.
I've been thinking of how best to cover my new ftrace and perf_events content in book form...
I've really been enjoying the use of sysdig for performance and other analysis. It makes it quite easy to find what you need quickly. I do prefer built in tools though, so I'll have to look at this.
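For anyone curious, a quick sketch of the sysdig style of "finding what you need" (needs root; `nginx` is just an example process name):

```shell
# Top processes by CPU, using a built-in chisel
sysdig -c topprocs_cpu

# Filter the event stream: open() calls from one process
sysdig evt.type=open and proc.name=nginx
```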