Hacker News
Linux debugging tools I love (jvns.ca)
279 points by ingve on July 8, 2016 | 31 comments



Another great suite of debugging tools not mentioned is Valgrind ( http://valgrind.org/ ). The default application (Memcheck) can help find memory leaks and off-by-one errors. Cachegrind/callgrind are great supplements to Perf, and Massif can help get a better perspective on memory usage in your programs.

I feel reckless writing C/C++ code if I can't test it with Memcheck.
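For instance, here's a minimal hypothetical leak of the kind Memcheck's default leak check flags (the function name is mine, not from the thread):

```cpp
#include <cstdlib>

// Hypothetical example: an allocation that's never freed. Running the
// program under `valgrind --leak-check=full` should report the 40 bytes
// as "definitely lost", with a backtrace pointing at make_table().
int* make_table() {
    int* t = static_cast<int*>(std::malloc(10 * sizeof(int)));
    for (int i = 0; i < 10; ++i)
        t[i] = i * i;
    return t;  // the caller is expected to free() this, but nobody does
}
```

An off-by-one write past the end of `t` would likewise show up, as an "Invalid write" error on the offending line.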


I've grown fond of -fsanitize=address.


What's the advantage of using perf over valgrind? I've never really managed to get it working (it hooks into the kernel and inevitably some part of it fails for me) - but maybe I'm missing out on something?

I understand it probably handles multithreading better and probably has much better performance - but is that really it?


The performance is way better. Callgrind/cachegrind operate as a CPU-level emulator (i.e. software emulating individual CPU instructions), while perf is a sampling profiler. This means perf costs you essentially zero performance, while callgrind/cachegrind slow the run down by at least ~10x.

There is also the advantage that perf is really testing the actual code on the real hardware. This matters if you are trying to design code that is cache-aware. With cachegrind you can set the emulated cache parameters, but I'm not sure how well this models the real hardware, especially if you have multiple threads with cache conflicts.
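To make the difference concrete, here's a toy CPU-bound hot spot (a hypothetical example, not from the article; the function name and the constant are my own):

```cpp
#include <cstdint>

// Toy hot spot. Under `perf record ./a.out` followed by `perf report`,
// nearly all samples land in burn(), and the profiled run costs about
// the same as an unprofiled one. Under `valgrind --tool=callgrind ./a.out`
// the same run is an order of magnitude slower, because every
// instruction is emulated - but you get exact instruction counts and a
// full call graph instead of statistical samples.
std::uint64_t burn(std::uint64_t n) {
    std::uint64_t acc = 1;
    for (std::uint64_t i = 1; i <= n; ++i)
        acc = acc * 6364136223846793005ULL + i;  // cheap LCG-style mixing
    return acc;
}
```

The returned value exists only so the compiler can't optimize the loop away.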


It also samples the whole system, including the kernel. In embedded programming, a significant part of your application's time might be spent in some kernel routines, or servicing interrupts, and perf makes it extremely easy to figure it out. It can also trace arbitrary counters, even probes that the user can define. So for example, I can sample the number of bytes received over USB or other metrics like this. Tremendously useful.

edited: typo


valgrind has a bunch of tools like massif, callgrind and memcheck, but none of them (AFAIK) tells you how long a function call took or points out the area of code taking too much time, which perf does.


Years ago I made a little C++ class for this. You could declare

   {
     tracer tr(function, ...);
     timer ti(tr, milliseconds);
   }
The constructor of the timer would sample and record the time. The destructor would sample the time again, and if it was over the specified milliseconds, it would log a diagnostic into the tracer.
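A minimal sketch of what such a pair might look like (all names and details here are my guess at the commenter's design, not their actual code):

```cpp
#include <chrono>
#include <iostream>
#include <string>
#include <vector>

// Collects diagnostics for one traced call (assumed interface).
struct tracer {
    std::string name;
    std::vector<std::string> messages;
    explicit tracer(std::string n) : name(std::move(n)) {}
    void log(const std::string& msg) {
        messages.push_back(msg);
        std::cerr << name << ": " << msg << '\n';
    }
};

// RAII timer: samples the clock on construction, samples again on
// destruction, and logs a diagnostic into the tracer if the elapsed
// time exceeded the threshold.
class timer {
    tracer& tr_;
    long threshold_ms_;
    std::chrono::steady_clock::time_point start_;
public:
    timer(tracer& tr, long threshold_ms)
        : tr_(tr), threshold_ms_(threshold_ms),
          start_(std::chrono::steady_clock::now()) {}
    ~timer() {
        auto elapsed_ms = std::chrono::duration_cast<std::chrono::milliseconds>(
            std::chrono::steady_clock::now() - start_).count();
        if (elapsed_ms > threshold_ms_)
            tr_.log("took " + std::to_string(elapsed_ms) + " ms");
    }
};
```

The nice property of the RAII approach is that early returns and exceptions still trigger the destructor, so slow paths get logged no matter how the scope exits.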


Indeed, because this wouldn't make much sense given the tenfold slowdown. Callgrind collects information about how many instructions were executed and who called whom. It's just a different profiling metric.

I love valgrind/callgrind and it has proven extremely valuable. As always, choose the right tool for the job.


perf and valgrind have different purposes.


Instead of strace, check out sysdig. It's significantly more powerful so long as you're willing to install the associated kernel modules: http://www.sysdig.org/


There's also LTTng which is more mature and, arguably, more powerful too (and it does userspace tracing as well): http://lttng.org/

Full disclosure: I'm a contributor to the LTTng project.


Thanks, I'll check it out!


Can someone experienced with SystemTap or dtrace4linux comment on why these tools aren't more popular among Linux professionals?

I recently had to debug a memory usage issue at work, and SystemTap seemed like it would be a lifesaver in those and many other occasions. Unfortunately, both myself and my coworkers were inexperienced with it and there wasn't much documentation about it online, so we ended up using a standard profiler to track down the issue, which turned out to be a much slower process.


It's a complex question. For really advanced environments, advanced tools are in use (including ftrace, BPF, etc). If you mean more widespread usage, then there's the need for documentation, marketing, etc. Linux doesn't really have anyone focused on doing that (ie, marketing departments with large budgets).


> and there wasn't much documentation about it online, so we ended up using a standard profiler to track down the issue

I think you hit the nail on the head right there. Also, I think SystemTap requires a certain minimum kernel version, and some enterprise systems are still stuck at RHEL 6.5.


I made an account to say that the person who writes this blog sounds like a great person.


I think Julia is probably the most enthusiastic developer I've ever encountered. Which is quite a statement considering a lot of the subject matter is (IMHO) pretty mundane.


Did she do a talk at strange loop about system calls? Her Google interview story was hysterical, if it was her.

Edit, now that I'm not on mobile: I was randomly looking through strange loop videos and watched one of hers. Her enthusiasm and attitude are infectious. It's so fun to watch her be so excited that she gets tripped up trying to decide which awesome thing to tell you. I saw no picture of her on the website I viewed, but surmised it was her from the other comments here. Great video and seemingly an amazing person.

The google interview story is superb to hear if anyone hasn't watched yet.


Do you have a link?


I gave a talk at strange loop called "you can be a kernel hacker!": https://www.youtube.com/watch?v=0IQlpFWTFbM


Watching this now. I love it. Thanks for your awesome enthusiasm. I wish every high school kid would watch this. You make programming look so cool :)


She's also a great writer! Somebody should talk her into writing a book! It's not just that she is enthusiastic, she is able to inspire that same enthusiasm in her readers, too (well, at least in me).


Taking core dumps, heap dumps and traffic dumps are very powerful ways for debugging. The most powerful tool I have seen lately is rr, the replaying debugger from Mozilla. http://rr-project.org

Personally I also like to extract data from logs with UNIX tools and feed them into csv/tsv files, then process them with R.


I still love DDD (https://www.gnu.org/software/ddd/). It's really useful for when you want to see what your code is doing to some area in memory.


Julia's posts are always high quality! Keep up the good work.


I was introduced to Crash for actual kernel debugging today; you can dump your data and examine it in great detail.


opensnoop -> lsof?


lsof shows you what's currently open; opensnoop shows you what's being opened and closed, live, like a system-wide trace. Much handier if you're trying to figure out "what happened right when the bad thing happened?".


Julia is the coolest person.


Everyone forgets about ltrace: Like strace, but for calls to dynamic libraries.

It gets you something closer to what the program looks like at the source code level: You get calls to printf (well, __printf_chk on a modern Linux) instead of write, for example. The downside is that ltrace doesn't (and can't) know as much about every single function in every single dynamic library, so, while the names are there, the arguments are typically less convenient to work with and may be incorrect. (For example, it doesn't dereference pointers to print out nice strings, and it might not know how many arguments a function takes.)
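For example, given a trivial program (hypothetical; the function name is mine), the two tools show different layers of the same call:

```cpp
#include <cstdio>

// With `ltrace ./a.out` you'd see the library-level call, e.g.
//   puts("hello")          (or __printf_chk(...) if printf were used
//                           on a fortified glibc)
// while `strace ./a.out` shows the syscall underneath, e.g.
//   write(1, "hello\n", 6)
int greet() {
    return std::puts("hello");  // nonnegative return value on success
}
```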


This is anecdotal, and from a number of years ago, but I was using ltrace and it managed to kill the process I had attached to. I think ltrace has to be a little more aggressive in how it extracts data from a process. If you're in a situation where data loss is not allowed, consider the risk.



