Re. My Library Is Not Loading!: a pretty useful trick to find which files are missing is to extend your path env var with an empty directory and then grep the strace output for occurrences of that directory, e.g.
mkdir /tmp/empty
LD_LIBRARY_PATH+=:/tmp/empty # or PATH, or PERL5LIB, etc.
strace -f program |& grep /tmp/empty
I needed to track which files are used by which processes spawned by a program (to infer dependencies between the processes), and it seemed (and probably was) simpler to use ptrace directly rather than to parse strace output, so I wrote https://github.com/orivej/fptrace/ . It soon turned out to be useful to dump its data into shell scripts that reflect the tree of spawned processes and let you rerun an arbitrary subtree. I mostly use it to debug build systems, for example to trace configure and examine which env var affected a certain check in a strange way, or to trace make and rerun the C compiler with -dD -E to inspect an unexpected interference between includes.
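To illustrate that last step, a rough sketch (the source file, include path, and macro name here are hypothetical; the real ones would come from the traced make invocation):
gcc -dD -E -Iinclude src/foo.c > foo.i   # preprocess only, keeping the #define directives in the output
grep -n SOME_MACRO foo.i                 # find where the suspicious macro gets defined, undefined, or used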
This is a cool trick. On Linux, the ldd command also tells you what libraries a program wants to link to. You can also play around with LD_LIBRARY_PATH to manipulate what ldd tells you.
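For example (the library path and program name are placeholders; output differs per system):
ldd /bin/ls                              # list the shared libraries the binary resolves, and from where
LD_LIBRARY_PATH=/opt/mylibs ldd ./myprog # see how a different search path changes the resolution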
Whenever I've had to use it in prod (in heavy OLTP environments) the seriousness of the issue has always outweighed any performance concerns. Ditto tcpdump. Often it was used specifically to determine the cause of performance issues. In any case you generally only strace 1 process, and if your application stack depends on one process you're probably in other kinds of trouble... unless it's erlang :)
It's not a choice between strace or nothing. It's a choice between strace, ftrace, perf, or eBPF -- and that's just the Linux builtins. Many low overhead addons can also do syscall tracing (sysdig, LTTng).
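For a rough idea of the lower-overhead route with the perf builtin (the PID is a placeholder; available syscall names vary by kernel):
sudo perf trace -p 1234                  # strace-like live view of syscalls via perf
sudo perf trace -e openat -p 1234        # restrict the view to one syscall of interest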
I often run ftrace, perf, and eBPF on our production instances for syscall tracing. If I ran strace, the instance would suddenly become very slow, which would trigger Hystrix (and other) timeouts, and it would be removed from the ASG and auto-terminated. Our environment is fault-tolerant, so yes, we can run strace -- you just don't get much output, and the load vanishes from the instance you are looking at.
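If bcc is installed, the kind of low-overhead syscall summary I mean looks roughly like this (the install path varies by distro, and the PID is a placeholder):
sudo /usr/share/bcc/tools/syscount -p 1234 -i 5   # per-syscall counts for one process, printed every 5 seconds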
Last time I tried `perf trace` I realized how many things strace does that I take for granted. Things like file handle to filename resolution and pretty printing read() and write() buffers.
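For comparison, a hedged example of the strace conveniences in question (as I understand the flags: -y resolves file descriptors to paths, -s raises the printed buffer length):
strace -y -s 256 -e trace=openat,read,write cat /etc/hostname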
> I’m often asked in my technical troubleshooting job to solve problems that development teams can’t solve
This sounds like a really interesting job - consultancy? How did you get started in it?
Strace was always my go-to on Linux for solving error messages that fail to be a complete sentence: "connection refused" (to what?) "file not found" (where did you look?) and so on. Since I've moved to Windows the best alternative seems to be Process Explorer.
> This sounds like a really interesting job - consultancy?
I'm not OP but I do the same thing, solve the problems and bugs that the dev team can't. In my world, this is simply "Operations" or, in later years, "Systems Engineer." I'm really good at doing that but I am not at all skilled at large-scale software development even though I can read and understand already-written code as part of my troubleshooting.
Descriptions like the author's are why I'm dismayed that Ops has such a bad reputation in the modern computing industry, and why I'm not so keen on the combination of the two roles into "DevOps." Troubleshooting a working, in-motion system requires a different set of skills, in my experience, from writing and debugging code. It's also a set of skills that doesn't seem to overlap very often. The two roles go hand in hand, of course, but asking someone to do both in a large environment hasn't ended well (again, in my experience).
The anti-pattern / co-option of devops as a role overlaps almost entirely with companies undergoing digital transformations, advised by some CIO magazine, Big Four, or Gartner team that it's important for said transformation. Because the cultural transformation itself is already being branded as something new and idiosyncratic to the company, the term "devops" for the cultural change itself is dropped to avoid confusion.
In most large non-tech companies, operations is a cost center budgeted alongside facilities and other basic costs of doing business, and it is usually run by managers with an IT or traditional business background focused on cost efficiencies rather than a software background. I'm not sure whether this designation or Taylorism is fundamentally the root cause of why "devops" in the cultural sense is not really happening.
> Troubleshooting a working, in-motion system requires a different set of skills, in my experience, from writing and debugging code
I'm not convinced of this - or rather, if someone can't troubleshoot a system in motion, they're going to struggle to actually develop anything substantial. Debugging is a very underestimated skill of programming.
I think the conflation of "devops" is nothing to do with skills and everything to do with culture. When you have them as separate business units or teams it becomes adversarial: every deployment is a potential headache for Ops, so Ops try to prevent deployments or set up staging barriers to minimise this. Whereas developers may not appreciate the difficulties involved in deploying a large system vs the toy one they're developing on.
There are countless definitions of devops around, but a developer that also does the ops part is one of the worst.
Devops can also mean that you're developing automation to reduce manual maintenance or even remove some ops responsibilities entirely.
Think of self healing clusters, auto provisioning and so on.
I've also come across the definition of devops as a traditional ops dude who just worked for the dev team, keeping their development environment maintained. I'm still confused why they called that a devops position...
Heh, are you me? I too have the same problem with "DevOps" - I prefer to evangelise it as a culture of Developers and Operations/Engineering being able to work closely and collaboratively - two departments that have historically been at each other's throats, with a mentality of "toss it over the wall and forget about it".
I've moved on from there, but it was an internal function within a larger software dev house. In retrospect a pretty intense place to work, but I knew no different - it was effectively my first job out of uni. I got to that position after about 10 years.
To be fair, Perl emits a pretty good error message for the very common problem of a missing library and strace is generally not necessary to understand what's going on.
> /usr/bin/perl -mblahlib
Can't locate blahlib.pm in @INC (you may need to install
the blahlib module) (@INC contains:
/usr/lib/perl5/site_perl/5.26.1/x86_64-linux-thread-multi
/usr/lib/perl5/site_perl/5.26.1
/usr/lib/perl5/vendor_perl/5.26.1/x86_64-linux-thread-multi
/usr/lib/perl5/vendor_perl/5.26.1
/usr/lib/perl5/5.26.1/x86_64-linux-thread-multi
/usr/lib/perl5/5.26.1 /usr/lib/perl5/site_perl).
BEGIN failed--compilation aborted.
I remember a talk about strace (Julia Evans?). Still, it's rarely mentioned. It's quite useful, and I often find myself seeking this kind of tool/mindset. Basically: stop guessing into the void; use anything that pokes inside to give you evidence.
Offtopic: does your username mean "between moves" or "between trains"? I guess the former because iirc it would be "züge" for the plural of trains. Or perhaps it's just "between-move" as a noun, but I have no clue what that would mean.
Strace and Wireshark. If a program's output doesn't immediately tell me what the hell its issue is, I'll just dive into this depending on the kind of trouble.
The only thing I'd still like to have is something like strace but for general function calls. Often, the issue is within the application and does not show in syscalls. I guess this should be possible with gdb, but I haven't looked into it yet, also because any meaningful names are often stripped from the binaries.
You might enjoy using ltrace! It does pretty much what you're looking for. (Although it only shows calls into dynamic libraries and not other internal function calls, as far as I know. Maybe that's different if you have debugging symbols in the binary?)
The secret weapon has a (serious) flaw on Linux though: you can't run strace on programs that use strace. That's because the Linux ptrace call is not re-entrant.
Try running strace under strace, with each writing to its own log (say 1.log for the outer, 2.log for the inner), and observe in 1.log how the second strace uses the ptrace call.
The limitation is that a process cannot be ptraced by multiple processes at the same time. If you add -f to the first strace, it will start tracing the fork meant for ls before the second strace has a chance to set up its own tracing. That setup will fail, and the second strace will kill the fork instead of running ls. You can read all of this from 1.log!
If I'm not mistaken, a ptracing tool could detach from its grandchild at the moment it intercepts the child's attempt to trace that grandchild, letting that attempt succeed, but I don't know whether strace can do that.
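A sketch of the nested invocation being discussed (the log names follow the comment above; the exact flags are an assumption):
strace -o 1.log strace -o 2.log ls       # works: 1.log shows the inner strace making its own ptrace calls
strace -f -o 1.log strace -o 2.log ls    # with -f the outer strace grabs the fork first, and the inner strace's setup fails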
ltrace is a program that simply runs the specified command until it exits.
It intercepts and records the dynamic library calls which are called by the executed process and the signals which are received by that process.
It can also intercept and print the system calls executed by the program.
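A quick taste, to my understanding of the flags (-S also shows system calls; the traced binary is a placeholder):
ltrace -e malloc ls > /dev/null          # show only the malloc calls made by ls
ltrace -S -o out.txt ./myprog            # hypothetical binary: library calls plus syscalls, written to out.txt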
Also noteworthy: you can simply press the s key in htop to immediately attach strace to the selected process and inspect it, which has been handy for me many times.
I was in awe when I first used strace. The world of tracing is vast these days and filled with wonders. I can't possibly say it better than this guy, so I'll just include a link.
Note: on newer macOS builds you will likely have to boot into recovery and run:
csrutil enable --without dtrace
Otherwise you'll get a permissions error due to recent macOS code-signing protection... unless you copy the program you want to analyze to /tmp as a workaround.
Also, on OSX you have access to (mostly) the full power of dtrace, so you can do diagnostics on closed-source, debug-stripped running programs that make strace look small by comparison.
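For a flavour of that, assuming SIP has been relaxed as described above (the traced names are placeholders):
sudo dtruss -f ls                        # dtrace-backed rough equivalent of strace -f
sudo dtrace -n 'syscall:::entry /execname == "Safari"/ { @[probefunc] = count(); }'   # count syscalls by name for a running app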
strace is great. There are times it feels like the modern equivalent of printf() debugging, because working out why a syscall is failing sometimes involves a fair amount of tracing back to context. But as a foot in the door? It's ace.
A small script which, given a specification of a file handle (e.g. an IP address or a path), uses lsof to determine the PID of the owning process and the numeric value of the file handle, and then uses gdb to attach to that process and close it. It's a crude but simple way of exercising error-handling code.
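The manual version of that trick looks roughly like this (the path, PID, and fd number are placeholders you would read from the lsof output):
lsof -nP /var/log/app.log                          # hypothetical file: find the owning PID and the FD column
sudo gdb -p 1234 --batch -ex 'call (int)close(7)'  # attach to that PID and close fd 7 behind the program's back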