Re. My Library Is Not Loading!: a pretty useful trick to find which files are missing is to extend your path env var with an empty directory and then grep the strace output for occurrences of that directory, e.g.
mkdir /tmp/empty
LD_LIBRARY_PATH+=:/tmp/empty # or PATH, or PERL5LIB, etc.
strace -f program |& grep /tmp/empty
I needed to track which files are used by which processes spawned by a program (to infer dependencies between the processes), and it seemed (and probably was) simpler to use ptrace directly rather than to parse strace output, so I wrote https://github.com/orivej/fptrace/ . It soon turned out to be useful to dump its data into shell scripts that reflect the tree of spawned processes and let you rerun an arbitrary subtree. I mostly use it to debug build systems, for example to trace configure and examine which env var affected a certain check in a strange way, or to trace make and rerun the C compiler with -dD -E to inspect an unexpected interference between includes.
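To illustrate that last step, a rough sketch (the source file, include path, and macro name here are hypothetical; the real ones would come from the traced make invocation):
gcc -dD -E -Iinclude src/foo.c > foo.i   # preprocess only, keeping the #define directives in the output
grep -n SOME_MACRO foo.i                 # find where the suspicious macro gets defined, undefined, or used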
This is a cool trick. On Linux, the ldd command also tells you what libraries a program wants to link to. You can also play around with LD_LIBRARY_PATH to manipulate what ldd tells you.
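For example (the library path and program name are placeholders; output differs per system):
ldd /bin/ls                              # list the shared libraries the binary resolves, and from where
LD_LIBRARY_PATH=/opt/mylibs ldd ./myprog # see how a different search path changes the resolution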
Whenever I've had to use it in prod (in heavy OLTP environments) the seriousness of the issue has always outweighed any performance concerns. Ditto tcpdump. Often it was used specifically to determine the cause of performance issues. In any case you generally only strace 1 process, and if your application stack depends on one process you're probably in other kinds of trouble... unless it's erlang :)
It's not a choice between strace or nothing. It's a choice between strace, ftrace, perf, or eBPF -- and that's just the Linux builtins. Many low overhead addons can also do syscall tracing (sysdig, LTTng).
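For a rough idea of the lower-overhead route with the perf builtin (the PID is a placeholder; available syscall names vary by kernel):
sudo perf trace -p 1234                  # strace-like live view of syscalls via perf
sudo perf trace -e openat -p 1234        # restrict the view to one syscall of interest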
I often run ftrace, perf, and eBPF on our production instances for syscall tracing. If I ran strace, the instance would suddenly become very slow, which would trigger Hystrix (and other) timeouts, and it would be removed from the ASG and auto-terminated. Our environment is fault-tolerant, so yes, we can run strace -- you just don't get much output, and the load vanishes from the instance you are looking at.
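If bcc is installed, the kind of low-overhead syscall summary I mean looks roughly like this (the install path varies by distro, and the PID is a placeholder):
sudo /usr/share/bcc/tools/syscount -p 1234 -i 5   # per-syscall counts for one process, printed every 5 seconds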
Last time I tried `perf trace` I realized how many things strace does that I take for granted. Things like file handle to filename resolution and pretty printing read() and write() buffers.
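For comparison, a hedged example of the strace conveniences in question (as I understand the flags: -y resolves file descriptors to paths, -s raises the printed buffer length):
strace -y -s 256 -e trace=openat,read,write cat /etc/hostname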
> I’m often asked in my technical troubleshooting job to solve problems that development teams can’t solve
This sounds like a really interesting job - consultancy? How did you get started in it?
Strace was always my go-to on Linux for solving error messages that fail to be a complete sentence: "connection refused" (to what?) "file not found" (where did you look?) and so on. Since I've moved to Windows the best alternative seems to be Process Explorer.
> This sounds like a really interesting job - consultancy?
I'm not OP but I do the same thing, solve the problems and bugs that the dev team can't. In my world, this is simply "Operations" or, in later years, "Systems Engineer." I'm really good at doing that but I am not at all skilled at large-scale software development even though I can read and understand already-written code as part of my troubleshooting.
Descriptions like the author's are why I'm dismayed that Ops has such a bad reputation in the modern computing industry, and why I'm not so keen on the combination of the two roles into "DevOps." Troubleshooting a working, in-motion system requires a different set of skills, in my experience, from writing and debugging code. It's also a set of skills that doesn't seem to overlap very often. The two roles go hand in hand, of course, but asking someone to do both in a large environment hasn't ended well (again, in my experience).
The anti-pattern / co-option of devops as a role overlaps almost entirely with companies undergoing digital transformations, advised by some CIO magazine, Big Four, or Gartner team that it's important for said transformation. Because the cultural transformation itself is already being branded as something new and idiosyncratic to the company, the term "devops" for the cultural change itself is dropped to avoid confusion.
In most large non-tech companies, operations is a cost center budgeted alongside facilities and other basic costs of doing business, and it is usually run by managers with an IT or traditional business background focused on cost efficiencies rather than a software background. I'm not sure whether this designation or Taylorism is fundamentally the root cause of why "devops" in the cultural sense is not really happening.
> Troubleshooting a working, in-motion system requires a different set of skills, in my experience, from writing and debugging code
I'm not convinced of this - or rather, if someone can't troubleshoot a system in motion, they're going to struggle to actually develop anything substantial. Debugging is a very underestimated skill of programming.
I think the conflation of "devops" is nothing to do with skills and everything to do with culture. When you have them as separate business units or teams it becomes adversarial: every deployment is a potential headache for Ops, so Ops try to prevent deployments or set up staging barriers to minimise this. Whereas developers may not appreciate the difficulties involved in deploying a large system vs the toy one they're developing on.
There are countless definitions of devops around, but a developer that also does the ops part is one of the worst.
Devops can also mean that you're developing automation to reduce manual maintenance or even remove some ops responsibilities entirely.
Think of self healing clusters, auto provisioning and so on.
I've also come across the definition of devops as a traditional ops dude who just worked for the dev team, keeping their development environment maintained. I'm still confused why they called that a devops position...
Heh, are you me? I too have the same problem with "DevOps" - I prefer to evangelise it as a culture of Developers and Operations/Engineering being able to work closely and collaboratively - two departments that have historically been at each other's throats, with a mentality of "toss it over the wall and forget about it".
I've moved on from there, but it was an internal function within a larger software dev house. In retrospect a pretty intense place to work, but I knew no different - it was effectively my first job out of uni. I got to that position after about 10 years.
To be fair, Perl emits a pretty good error message for the very common problem of a missing library and strace is generally not necessary to understand what's going on.
> /usr/bin/perl -mblahlib
Can't locate blahlib.pm in @INC (you may need to install
the blahlib module) (@INC contains:
/usr/lib/perl5/site_perl/5.26.1/x86_64-linux-thread-multi
/usr/lib/perl5/site_perl/5.26.1
/usr/lib/perl5/vendor_perl/5.26.1/x86_64-linux-thread-multi
/usr/lib/perl5/vendor_perl/5.26.1
/usr/lib/perl5/5.26.1/x86_64-linux-thread-multi
/usr/lib/perl5/5.26.1 /usr/lib/perl5/site_perl).
BEGIN failed--compilation aborted.
I remember a talk about strace (Julia Evans?). Still, it's rarely mentioned. It's quite useful, and I often find myself seeking this kind of tool/mindset. Basically: stop guessing into the void; use anything that pokes inside to give you evidence.
Offtopic: does your username mean "between moves" or "between trains"? I guess the former because iirc it would be "züge" for the plural of trains. Or perhaps it's just "between-move" as a noun, but I have no clue what that would mean.
Strace and Wireshark. If a program's output doesn't immediately tell me what the hell its issue is, I'll just dive into this depending on the kind of trouble.
The only thing I'd still like to have is something like strace but for general function calls. Often, the issue is within the application and does not show in syscalls. I guess this should be possible with gdb, but I haven't looked into it yet, also because any meaningful names are often stripped from the binaries.
You might enjoy using ltrace! It does pretty much what you're looking for. (Although it only shows calls into dynamic libraries and not other internal function calls, as far as I know. Maybe that's different if you have debugging symbols in the binary?)
The secret weapon has a (serious) flaw on Linux though: you can't run strace on programs that use strace. That's because the Linux ptrace call is not re-entrant.
Try running strace under strace, with each writing to its own log (say 1.log for the outer, 2.log for the inner), and observe in 1.log how the second strace uses the ptrace call.
The limitation is that a process cannot be ptraced by multiple processes at the same time. If you add -f to the first strace, it will start tracing the fork meant for ls before the second strace has a chance to set up its own tracing. That setup will fail, and the second strace will kill the fork instead of running ls. You can read all of this from 1.log!
If I'm not mistaken, a ptracing tool could detach from its grandchild at the moment it intercepts the child's attempt to trace that grandchild, letting that attempt succeed, but I don't know whether strace can do that.
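A sketch of the nested invocation being discussed (the log names follow the comment above; the exact flags are an assumption):
strace -o 1.log strace -o 2.log ls       # works: 1.log shows the inner strace making its own ptrace calls
strace -f -o 1.log strace -o 2.log ls    # with -f the outer strace grabs the fork first, and the inner strace's setup fails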
ltrace is a program that simply runs the specified command until it exits.
It intercepts and records the dynamic library calls which are called by the executed process and the signals which are received by that process.
It can also intercept and print the system calls executed by the program.
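A quick taste, to my understanding of the flags (-S also shows system calls; the traced binary is a placeholder):
ltrace -e malloc ls > /dev/null          # show only the malloc calls made by ls
ltrace -S -o out.txt ./myprog            # hypothetical binary: library calls plus syscalls, written to out.txt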
Also noteworthy: you can simply press the s key in htop to immediately attach strace to the selected process and inspect it, which has been handy for me many times.
I was in awe when I first used strace. The world of tracing is vast these days and filled with wonders. I can't possibly say it better than this guy, so I'll just include a link.
Note: on newer macOS builds you will likely have to boot into recovery and run:
csrutil enable --without dtrace
Otherwise you'll get a permissions error due to recent macOS code-signing protection... unless you copy the program you want to analyze to /tmp as a workaround.
Also, on OSX you have access to (mostly) the full power of dtrace, so you can do diagnostics on closed-source, debug-stripped running programs that make strace look small by comparison.
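For a flavour of that, assuming SIP has been relaxed as described above (the traced names are placeholders):
sudo dtruss -f ls                        # dtrace-backed rough equivalent of strace -f
sudo dtrace -n 'syscall:::entry /execname == "Safari"/ { @[probefunc] = count(); }'   # count syscalls by name for a running app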
strace is great. There are times it feels like the modern equivalent of printf() debugging, because working out why a syscall is failing sometimes involves a fair amount of tracing back to context. But as a foot in the door? It's ace.
A small script which, given a specification of a file handle (e.g. an IP address or a path), uses lsof to determine the PID of the owning process and the numeric value of the file handle, and then uses gdb to attach to that process and close it. It's a crude but simple way of exercising error-handling code.
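The manual version of that trick looks roughly like this (the path, PID, and fd number are placeholders you would read from the lsof output):
lsof -nP /var/log/app.log                          # hypothetical file: find the owning PID and the FD column
sudo gdb -p 1234 --batch -ex 'call (int)close(7)'  # attach to that PID and close fd 7 behind the program's back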