Strace – My Favourite Secret Weapon (2011) (zwischenzugs.com)
311 points by zwischenzug on Feb 11, 2018 | hide | past | favorite | 55 comments



Re. "My Library Is Not Loading!": a pretty useful trick to find which files are missing is to extend your path env var with an empty directory and then grep the strace output for occurrences of that directory, e.g.

  mkdir /tmp/empty
  LD_LIBRARY_PATH+=:/tmp/empty  # or PATH, or PERL5LIB, etc.
  strace -f program |& grep /tmp/empty
I had a need to track which files are used by which processes spawned by a program (to infer dependencies between the processes), and it seemed (and probably was) simpler to use ptrace directly rather than to parse strace output, so I wrote https://github.com/orivej/fptrace/ . It soon turned out to be useful to dump its data into shell scripts that reflect the tree of spawned processes and let you rerun an arbitrary subtree. I mostly use it to debug build systems, for example to trace configure and examine which env var affected a certain check in a strange way, or to trace make and rerun the C compiler with -dD -E to inspect an unexpected interference between includes.


If your only concern is libraries, you can ask the dynamic loader to print out what it loads:

  LD_DEBUG=libs command

You can check out all the valid options with:

  LD_DEBUG=help cat

or check out `man ld.so`.
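For instance, on a glibc system the library-resolution trace looks like this (a sketch; /bin/true is just a convenient command, and the diagnostics go to stderr):

```shell
# Print each shared library the dynamic loader searches for and resolves.
# Note: LD_DEBUG is ignored for setuid/setgid binaries.
LD_DEBUG=libs /bin/true 2>&1 | grep 'library'
```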

On macOS, there is something similar: look up the DYLD_PRINT_* variables in `man dyld`.


This is a cool trick. On Linux, the ldd command also tells you what libraries a program wants to link to. You can also play around with LD_LIBRARY_PATH to manipulate what ldd tells you.
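A minimal sketch of both (assuming a typical glibc system; /tmp/mylibs is a hypothetical directory):

```shell
# List the shared libraries a binary links against, with resolved paths
ldd /bin/sh

# Prepend a directory to the library search path and see whether
# resolution changes
LD_LIBRARY_PATH=/tmp/mylibs ldd /bin/sh
```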


Binaries can also use dlopen to open libraries that aren't listed by ldd.


Good point, dlopen is fully run-time, and the strace approach will catch those too, so it's more comprehensive.
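A sketch of catching those at run time (`program` is a placeholder, as above): dlopen() ultimately means the loader has to open the .so, so file-level tracing reveals it:

```shell
# ldd lists only link-time dependencies; strace also sees libraries
# pulled in via dlopen(), since the loader still opens the file
strace -f -e trace=file program 2>&1 | grep '\.so'
```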


IMHO,

  strace -e trace=file -f program
is much better in this case.
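Piping that through a filter for failed lookups makes missing files stand out immediately (a sketch; `program` is a placeholder). Recent strace versions also have a -Z/--failed-only flag that makes the grep unnecessary:

```shell
# Trace only file-related syscalls and keep the ones that failed with
# "No such file or directory"
strace -f -e trace=file program 2>&1 | grep ENOENT
```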


No mention of overhead. strace can bring down your production environment -- it is not safe. This is why I wrote this: http://www.brendangregg.com/blog/2014-05-11/strace-wow-much-...

"perf trace" has improved since I wrote that, and may well be mostly strace-equivalent (but without crippling overhead) in the latest Linux.


Whenever I've had to use it in prod (in heavy OLTP environments) the seriousness of the issue has always outweighed any performance concerns. Ditto tcpdump. Often it was used specifically to determine the cause of performance issues. In any case you generally only strace 1 process, and if your application stack depends on one process you're probably in other kinds of trouble... unless it's erlang :)


It's not a choice between strace or nothing. It's a choice between strace, ftrace, perf, or eBPF -- and that's just the Linux builtins. Many low overhead addons can also do syscall tracing (sysdig, LTTng).

I often run ftrace, perf, and eBPF on our production instances for syscall tracing. If I ran strace, the instance would suddenly be very slow, and it would trigger Hystrix (and other) timeouts and be removed from the ASG and auto terminated. Our environment is fault-tolerant, so yes, we can run strace -- you just don't get much output, and the load vanishes from the instance you are looking at.


Which of those alternatives are an option if the goal is to inspect and rewrite syscall arguments and return values, and do other things in between?


Last time I tried `perf trace` I realized how many things strace does that I take for granted. Things like file handle to filename resolution and pretty printing read() and write() buffers.

Do newer versions of `perf trace` expose these?


> it is not safe

Are you referring to that it makes syscalls slow, or is it about something else?


Many thanks!

Yours is a very superior treatment of the strace topic.


Thanks. Really enjoy your blog.


Nice to see something I wrote 7 years ago hasn't gone entirely to waste. Spot the 'ancient' technologies like Solaris and perl mentioned there...


> I’m often asked in my technical troubleshooting job to solve problems that development teams can’t solve

This sounds like a really interesting job - consultancy? How did you get started in it?

Strace was always my go-to on Linux for solving error messages that fail to be a complete sentence: "connection refused" (to what?) "file not found" (where did you look?) and so on. Since I've moved to Windows the best alternative seems to be Process Explorer.


> This sounds like a really interesting job - consultancy?

I'm not OP but I do the same thing, solve the problems and bugs that the dev team can't. In my world, this is simply "Operations" or, in later years, "Systems Engineer." I'm really good at doing that but I am not at all skilled at large-scale software development even though I can read and understand already-written code as part of my troubleshooting.

Descriptions like the author's are why I'm dismayed that Ops has such a bad reputation in the modern computing industry and why I'm not so keen on the combination of the two roles into "DevOps." Troubleshooting a working, in-motion system requires a different set of skills, in my experience, from writing and debugging code. It's also a set of skills that doesn't seem to overlap very often. The two roles go hand-in-hand, of course, but asking someone to do both in a large environment hasn't ended well (again, in my experience).


The anti-pattern / co-option of devops as a role overlaps almost entirely with companies undergoing digital transformations, advised by some CIO magazine, Big Four, or Gartner team that it's important for said transformation. Because the cultural transformation is itself already being branded as something new and idiosyncratic to the company, the term "devops" as the cultural change itself gets dropped to avoid confusion.

In most large non-tech companies, operations is a cost center budgeted alongside facilities and basic costs of business, and is usually run by managers with an IT or traditional business background rather than a software background, focused on cost efficiencies. I'm not sure whether this designation or Taylorism is the fundamental root cause of why "devops" in the cultural sense is not really happening.


> Troubleshooting a working, in-motion system requires a different set of skills, in my experience, from writing and debugging code

I'm not convinced of this - or rather, if someone can't troubleshoot a system in motion, they're going to struggle to actually develop anything substantial. Debugging is a very underestimated skill of programming.

I think the conflation of "devops" is nothing to do with skills and everything to do with culture. When you have them as separate business units or teams it becomes adversarial: every deployment is a potential headache for Ops, so Ops try to prevent deployments or set up staging barriers to minimise this. Whereas developers may not appreciate the difficulties involved in deploying a large system vs the toy one they're developing on.


There are countless definitions of devops around, but a developer that also does the ops part is one of the worst.

Devops can also mean that you're developing automation to reduce manual maintenance or even remove some ops responsibilities entirely.

Think of self healing clusters, auto provisioning and so on.

I've also come across the definition of devops as a traditional ops dude who just worked for the dev team, keeping their development environment maintained. I'm still confused why they called that a devops position...


Heh, are you me? I too have the same problem with "DevOps" - I prefer to evangelise it as a culture of Developers and Operations/Engineering being able to work closely and collaboratively - two departments that have historically been at each other's throats, with a "toss it over the wall and forget about it" mentality.


I'd love to get into this kind of work (currently in webdev, but I'm a good debugger). My contact details are in my profile - can we chat?


Process Explorer uses ETW under the hood, and you should too. Learn this and become a Windows debugging master.

Admittedly, ProcExp will make you seem amazing as it is, but harnessing the power of ETW will make you unstoppable.


I've moved on from there, but it was an internal function within a larger software dev house. In retrospect a pretty intense place to work, but I knew no different - it was effectively my first job out of uni. I got to that position after about 10 years.

You can read more about my background here IYI:

https://zwischenzugs.com/2017/10/15/my-20-year-experience-of...


To be fair, Perl emits a pretty good error message for the very common problem of a missing library and strace is generally not necessary to understand what's going on.

    > /usr/bin/perl -mblahlib
    Can't locate blahlib.pm in @INC (you may need to install
    the blahlib module) (@INC contains:
    /usr/lib/perl5/site_perl/5.26.1/x86_64-linux-thread-multi
    /usr/lib/perl5/site_perl/5.26.1
    /usr/lib/perl5/vendor_perl/5.26.1/x86_64-linux-thread-multi
    /usr/lib/perl5/vendor_perl/5.26.1
    /usr/lib/perl5/5.26.1/x86_64-linux-thread-multi
    /usr/lib/perl5/5.26.1 /usr/lib/perl5/site_perl).
    BEGIN failed--compilation aborted.


I remember a talk about strace (Julia Evans?). Still, it's rarely mentioned. It's quite useful, and I often find myself seeking this kind of tool/mindset: basically, stop guessing in the void; use anything that pokes inside to give you evidence.


Offtopic: does your username mean "between moves" or "between trains"? I guess the former because iirc it would be "züge" for the plural of trains. Or perhaps it's just "between-move" as a noun, but I have no clue what that would mean.



Of course Wikipedia holds the answer. I should have known.



In between moves. I play chess, and I started the blog as 'in-between moves' outside of work days.


Cool! Thanks for clarifying :)


Technically, the singular works as well, i.e. a train that leaves between two others (either physically or timetable-wise).

"Zug" is also a form of "pull" (ziehen). So it could also be interpreted as 'while pulling'. But that's pretty far-fetched, tbh.

I'm not the parent, however, and can't answer your question.


Strace and Wireshark. If a program's output doesn't immediately tell me what the hell its issue is, I'll just dive into this depending on the kind of trouble.

The only thing I'd still like to have is something like strace but for general function calls. Often, the issue is within the application and does not show in syscalls. I guess this should be possible with gdb, but I haven't looked into it yet, also because any meaningful names are often stripped from the binaries.


You might enjoy using ltrace! It does pretty much what you're looking for. (Although it only shows calls into dynamic libraries and not other internal function calls, as far as I know. Maybe that's different if you have debugging symbols in the binary?)


    $ whatis ltrace
    ltrace (1)           - A library call tracer
That seems nice indeed. It seems to make things a lot slower, but I guess that's to be expected from tracing at this level!


The secret weapon has a (serious) flaw on Linux though: you can't run strace on programs that use strace. That's because the Linux ptrace call is not re-entrant.


Actually you can strace strace! Try

  strace -o 1.log strace -o 2.log ls
and observe in 1.log how the second strace uses the ptrace call.

The limitation is that a process cannot be ptraced by multiple processes at the same time. If you add -f to the first strace, it will start tracing the fork meant for ls before the second strace has a chance to set up its tracing. That setup will fail, and the second strace will kill the fork instead of running ls. You can read all this in 1.log!

If I'm not mistaken, a ptrace tool may untrace its grandchild right when it intercepts a child attempt to trace that grandchild to make the attempt succeed, but I don't know if strace can.


A good complement to strace is ltrace(1):

  ltrace is a program that simply runs the specified command until it exits. 
  It intercepts and records the dynamic library calls which are called by the executed process and the signals which are received by that process. 
  It can also intercept and print the system calls executed by the program.
Also noteworthy: you can simply press the s key in htop to immediately attach strace to a process and inspect it, which has been handy many times for me.


I was in awe when I first used strace. The world of tracing is vast these days and filled with wonders. I can't possibly say it better than this guy, so I'll just include a link.

http://www.brendangregg.com/blog/2016-03-05/linux-bpf-superp...


Is there a substitute for mac/osx?


Note that on new macOS builds you will likely have to boot into recovery and run:

    csrutil enable --without dtrace
Otherwise you'll get a permissions error due to recent macOS code-signing protection... unless you copy the program you want to analyze to /tmp as a workaround.


Dtruss: https://opensourcehacker.com/2011/12/02/osx-strace-equivalen...

Also, on OSX you have access to (mostly) the full power of dtrace, so you can do diagnostics on closed-source, debug-stripped running programs that make strace look small by comparison.


Just tried it on iTunes and it doesn't seem to work at all.

I already allowed myself to debug my own system by disabling "System Integrity" in recovery mode.


A quick Google for `PT_DENY_ATTACH` and dtrace should solve that



This is my favorite strace resource: https://jvns.ca/strace-zine-v3.pdf


strace is great. At times it feels like the modern equivalent of printf() debugging, because working out why a syscall is failing sometimes involves a fair amount of tracing back through the context. But as a foot in the door? It's ace.


If you want to listen on a port under 1024 you require superuser access or special privileges.

https://www.w3.org/Daemon/User/Installation/PrivilegedPorts....


Is there a strace equivalent for windows?


As I said in a comment on the OP, lsof (LiSt Open Files - it does more than its name suggests) is pretty useful too.

fuser is useful too, and available on many Unixen. I've used "fuser -k" many times to find and kill rogue processes.

https://en.wikipedia.org/wiki/Lsof

https://en.wikipedia.org/wiki/Fuser_(Unix)

Edited for grammar.


One of my favourite hacks is this script (which might well not work - it's flaky):

https://bitbucket.org/twic/devtools/src/1b7a8f9ab849b36de70c...

Which, given a specification of a filehandle (e.g. an IP address or path), uses lsof to determine the PID of the owning process and the numeric value of the filehandle, and then uses gdb to attach to that process and close the handle. It's a crude but simple way of exercising error-handling code.


Interesting one.

Are you running Elixir in the last line? the iex?


That's this line, I think:

    sudo gdb -batch -n -iex "set auto-load off" -p $TARGET_PID -ex "call close(${TARGET_FD})"
The -iex is a flag which executes a command on startup:

    -init-eval-command command
    -iex command
       Execute a single GDB command before loading the inferior (but after loading gdbinit files). See Startup.
[1] https://sourceware.org/gdb/current/onlinedocs/gdb/File-Optio...


Got it now, thanks.



