Why use strace in 2023? [video] (ccc.de)
143 points by r4um on Jan 8, 2024 | 92 comments



strace is basically my first go-to when a command immediately fails with an obscure message.

Run it under strace, then skim the last 4-5 pages of trace output before the error message is emitted, looking in particular for error return values (the failing call is often not right before the message because the program has already started to clean up). That figures out the problem 80% of the time within a couple of minutes.


Same here. I have a theory: the obscure errors are due to unexpected syscall results. Programmers are very aware of the states their programs can be in given the visible logic. They are often not aware of the states a syscall can end up in, or of its error modes. A pattern I see in Python programs is avoiding exceptions in favor of sys.exit because the authors don't want the user to see a stack trace. So they rely on conditionals and sys.exit calls, never considering or catching the exceptions that things like open and write can raise, which leads to exception traces anyway. This is one of the reasons strace is sometimes required. I have seen experienced programmers doing this pattern again and again and they are really convinced that it is helpful and consistent. This illustrates what I wrote above: programmers often ignore the way external dependencies work, in this case the syscalls.

Another common syscall-related issue strace helps with is permission errors. Most scripts just defend against the file existing and not against the operation being impossible, leading to obscure messages like "operation not permitted". Strace to the rescue in such cases.

It is one of my first lines of debugging, even if I have the source code, as sometimes the syscall that fails is buried deep in library code, but with the failed syscall and its arguments I can track down the line that made the wrong call.


> I have seen experienced programmers doing this pattern again and again and they are really convinced that it is helpful and consistent.

In a nutshell, this is why I hate working in go. While I'm sure your code, dear hn golang fan, doesn't suffer from this, much of the go code I've had to deal with extends the principle of “unaware of many common failure modes and swallow them into the exit code” to every function.


that's strange, I've been working professionally in Go for years and two things happen:

1) what are exceptions in many languages are plain errors in Go

2) a defer-recover block is usually put at the very top of a risky call

This covers 99% of sources of panics.

Now, the one that truly does mess you up is when it comes from CGO/calling out to C. That will tear down your application and there is nothing you can do about it.


I wish they panicked. I'm talking about trees of function calls with variations on “err = thing(); if err: return err” multiple times in a function, so receiving an error code in the root function tells you very little about what went wrong or where it went wrong.


Exactly! Throwing away the call stack before deciding how to handle the error is not ideal.


I've honestly never needed it when dealing with an error; the fact that errors are handled near where they occur means a stack is rarely needed.

If you bubble up errors a lot then errors.Is/As is made for that purpose.

You really only care about a stack trace with a panic.

In any case, runtime.Stack() iirc should return the stack.

But this sounds more like trying to write other languages in Go rather than writing in Go.


> Most scripts just defend against the file existing

To me, any code calling a `DoesFileExist(filename)` type of function is a code smell.

There is no point to checking if the file exists. None. Ever![1]

There is never a case where this is useful. The only times I see code that checks for the existence of a file is when, on subsequent lines, the file is used as if it exists (no error-checking on open), which is a bug.

A line in code that checks for the existence of a file correlates, in my experience, very strongly with using that particular file in that specific buggy manner.

[1] You might say that it's to provide a better user experience, such as "Your config file does not exist, do you want to create it?" or similar, but that's still buggy if you're determining whether the file exists by checking existence.


> There is no point to checking if the file exists. None. Ever![1]

> There is never a case where this is useful.

Only a Sith deals in absolutes. What about the use case when you change behavior when a sentinel file exists and don't care about the contents? ;) e.g. touch firstrun

It can also be useful if there's a whole bunch of prerequisites you're going to check before doing a bunch of work; yes, opening the descriptor and keeping it around might be nicer, but...

Also, what about if you want to not overwrite existing output if it exists? O_CREAT|O_EXCL could be better (and important to use if the overwrite check has security implications)... though that doesn't exist in fopen(3) and you'd need to use fdopen which is inconvenient and risky in its own ways.


> Only a Sith deals in absolutes.

Point taken :-)

> What about the use case when you change behavior when a sentinel file exists and don't care about the contents?

You'll still care that you can write that file, right?

> It can also be useful if there's a whole bunch of prerequisites you're going to check before doing a bunch of work; yes, opening the descriptor and keeping it around might be nicer, but...

But, after you've checked the prerequisites, and you start your main work, you're still going to have to handle the case in that main work that one of the files you checked has the wrong permissions/is locked/was deleted.

It's not unusual for users to accidentally open a program twice by misclicking when they only meant to open it once.

> Also, what about if you want to not overwrite existing output if it exists?

You answered this question below for the `open`/`openat` syscall. For `fopen` ...

> though that doesn't exist in fopen(3)

It has been in the standard since C11, I believe - mode `"x"` for exclusive create and open. You can use it with most C implementations today as most systems support C11.
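
For comparison, Python's built-in open() has the same exclusive-create mode, also spelled "x". A minimal sketch, assuming you want "create the output unless it already exists" as a single step (the helper name `create_exclusive` is made up for illustration):

```python
def create_exclusive(path, data):
    """Create `path` only if it does not already exist.

    Mode "x" makes open() fail instead of overwriting, so the existence
    check and the creation are one operation rather than two.
    """
    try:
        with open(path, "x", encoding="utf-8") as f:
            f.write(data)
    except FileExistsError:
        # Something else created the file first; leave it untouched.
        return False
    return True
```

There is no separate existence check here, so nothing can slip in between "does it exist?" and "create it".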


> You'll still care that you can write that file, right?

Sometimes you don't need it.

systemd itself just checks that `/etc/initrd-release` exists and runs in initrd mode changing its default target to boot into (IIRC you can also manually change the default target to `initrd.target` in the initrd, but this way the default systemd vendored files don't need to be touched).

https://github.com/systemd/systemd/blob/7f13af72f89452950226...


Okay..what is the alternative?


> Okay..what is the alternative?

Just ... open the file and handle the error?

What are you getting by checking that the file exists that you don't get from the 'file not found' error that the `openFile()` routine returns?

Because it doesn't matter if `doesFileExist()` returns true, you still have to handle the `file not found` error[1] when calling `openFile()` on the next line anyway.

[1] Just because the file exists when the program checked, that doesn't mean that the program can open it (permissions), that the file still exists (could have been removed between the call to check and the call to open), that the file is, in fact, a file that can be opened (and not a directory, for example), that the file is not exclusively locked (Windows) by some other process that opened it, etc. `doesFileExist()` tells you nothing that would change how the subsequent code is written.
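
A rough sketch of "just open it and handle the error" in Python. The function name `read_config` and the choice to treat a missing file as "no config yet" are assumptions for the example, not anything from a real program:

```python
from typing import Optional


def read_config(path: str) -> Optional[str]:
    """Return the file's contents, or None if the file is simply absent."""
    try:
        with open(path, encoding="utf-8") as f:
            return f.read()
    except FileNotFoundError:
        # The expected case: no config yet. No prior existence check needed.
        return None
    except OSError as e:
        # Everything else: wrong permissions, path is a directory, and so on.
        raise SystemExit(f"cannot read {path}: {e.strerror} (errno {e.errno})")
```

The error branches are needed whether or not an existence check runs first, which is why the check buys nothing.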


That makes sense.

I guess we’re accustomed to thinking that absence of a file is not “exception-worthy” because it is expected under normal circumstances. But the cases you raised make sense.


And if it's the initial configuration file?


> And if it's the initial configuration file?

You're still going to have to open it and handle the errors (in case it's the wrong permissions, or the filename is a directory, etc).

Checking if it exists doesn't make the subsequent code any easier or shorter - you still have to check for errors even if it exists.


This is why for my Python programs, while I swallow exceptions when possible and turn them into user-friendly error messages that name the likely cause, I also add a -d, --debug flag that ignores that and spits out everything. Because when you need to know, you need to know.
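
One way such a flag could look, as a minimal sketch with argparse. The `run` wrapper and its `action` callback are invented for the example; only the -d/--debug idea comes from the comment above:

```python
import argparse
import sys


def run(argv, action):
    """Run `action`, printing a short message on OSError unless --debug is set."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-d", "--debug", action="store_true",
                        help="show full tracebacks instead of short messages")
    args = parser.parse_args(argv)
    try:
        action()
    except OSError as e:
        if args.debug:
            raise  # let the full traceback through untouched
        print(f"error: {e.strerror}: {e.filename}", file=sys.stderr)
        return 1
    return 0
```

Without the flag the user gets one line on stderr and a non-zero exit status; with it, the original exception propagates with its traceback.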


People who need to be doing other things with their lives utter silent prayers of gratitude when they hope that code which does that exists, and it does.


I often use it to see what configuration files the thing reads, it is much more reliable than documentation.


Since you almost always need to open a file in order to do anything useful, `strace -vf -e openat` has helped solve a lot of issues, from simple configuration file lookup to ld.so library loading problems.


It is often useful to also watch for calls to the stat family of syscalls; some applications stat files that they wish to open rather than depending on the return code from open, so just looking for open will miss files that don't exist.


same, and I'm often surprised how many files are touched / read, and also how many times (trying to load local configs from dozens of places)..

some unix practices seem very inefficient


strace solved an issue for me today - logging in to a different account on my debian PC using 'su -l guest' would hang for 30s before letting me in. Interestingly, 'strace su -l guest' failed with a permission error during the trace so looked like I was stuck ('sudo' didn't help).

Until I opened a 2nd terminal window and did 'while [ 1 ]; do sudo strace -p $(pidof su); done' - that looped until I tried the login again, then attached, followed the calls and showed the delay was due to elogind and dbus; a quick 'apt purge elogind' then fixed the slow login.


And then IME, most of the time it's some file missing, or you can't figure out where it's getting its configuration from, so a grep for ^open is a good first try.


Isn't there a flag for getting only the opened filenames? (If not, why not)


You can use "-e trace=open" to trace only open(2) calls

Alternatively "-e trace=%file" to get all file-related system calls (will catch eg failing pre-emptive checks using access(3) -> stat(2)).


FTR, on modern-ish glibc-powered systems (in code that actually does use libc, and does not do its very own syscall-related thing instead), you will not find a single call to open(2) issued, in my experience. That's because the library functions shadowing these syscalls were rewired to use openat(2) under the hood.

    $ strace -e trace=open cat /dev/null 
    +++ exited with 0 +++


    $ strace -e trace=openat cat /dev/null 
    openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
    openat(AT_FDCWD, "/usr/lib/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
    openat(AT_FDCWD, "/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
    openat(AT_FDCWD, "/dev/null", O_RDONLY) = 3
    +++ exited with 0 +++


If you want to catch both `open` and `openat`, the opensnoop BPF[1] program is pretty nifty, especially if you are trying to figure out file stuff across several different programs ("which #$%^-ing program keeps modifying this file", for example).

[1] I've been dipping my toes into BPF recently, and while complicated (best to simply clone the bpftools repo and work off of that) there's a lot that can be done that tools like strace won't be able to match.


Ok, but then you will still need to parse the output to get the filenames. That's ok, but since it is something that is used a lot, you'd expect a flag.


Check out strace -y and strace -yy (--decode-fds)


You still need to pull out the paths?

A sprinkling of grep/perl (awk/sed/ruby/...) is mostly good enough eg:

    strace -e trace=%file cat /etc/passwd 2>&1 >/dev/null | grep ^open | grep -Po '(?<=").*(?=")'
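
If the grep gets unwieldy, the same extraction can be sketched in Python. This assumes plain default strace output, one call per line, with no `[pid N]` prefixes from -f and no timestamps; `OPEN_RE` and `opened_paths` are names made up for the example:

```python
import re

# Match lines that start with open( or openat( and capture the first
# double-quoted argument, allowing backslash escapes inside the quotes.
OPEN_RE = re.compile(r'^open(?:at)?\(.*?"((?:[^"\\]|\\.)*)"')


def opened_paths(trace_lines):
    """Return the quoted path from every open/openat line, in order."""
    paths = []
    for line in trace_lines:
        m = OPEN_RE.match(line)
        if m:
            # The path is left exactly as strace printed it (escapes intact).
            paths.append(m.group(1))
    return paths
```

Feed it the lines strace writes to stderr (or to a file via -o) and it returns the paths in the order they were opened.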


Is your example situation really all that common?

If so, what format do you expect for the output?

If it's one filename-per-line then how do you encode filenames with embedded newlines?

How do you encode non-UTF8 characters, or is the file meant to be parsed only in binary mode?

I don't know of any generally agreed upon spec for this, so no matter what you think is right, most people are going to have to write a special-purpose parser.

In which case you might as well parse the native strace output since one is about as complex as the other.


It can use the same format as the Unix find utility. This utility has a -print0 flag to separate filenames by NUL characters instead of newlines if desired.
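
To make the NUL-separated idea concrete, here is a small Python sketch of that encoding. The helper names are hypothetical; the only real assumption is that NUL is the one byte a Unix path cannot contain:

```python
import os


def encode_names(names):
    """Terminate each filename with a NUL byte, like find -print0 does."""
    return b"".join(os.fsencode(n) + b"\0" for n in names)


def decode_names(blob):
    """Split a NUL-terminated blob back into filenames."""
    return [os.fsdecode(part) for part in blob.split(b"\0") if part]
```

Embedded newlines survive the round trip, and because os.fsencode passes bytes through unchanged, filenames that are not valid UTF-8 can be written out as raw bytes too.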


That is a good point.

I still don't see it as common use case.


strace's -Z flag is quite useful to filter only syscalls that failed.

In a similar way, -k is useful to get tracebacks.


I should write that down; -Z would've been really useful when my SSH server was falling over in a heap and I was left scrolling through strace to work out what happened.

I wrote that experience up last week, for what it's worth: https://imrannazar.com/articles/linux-upgrade-fail


The widespread switch to epoll(7)-based I/O systems has taken a big bite out of strace's usefulness for me.

Back when most programs did most of their I/O synchronously or with select(2)/poll(2), it was really easy to use strace to track down slow network services. If you wanted to figure out why a process was taking too long, you could strace it, grab the file descriptor numbers that it was spending a long time reading/writing/selecting/polling, feed those into lsof and turn them into network addresses/ports, then shell into those remote hosts and see what it was doing with the slow client's connection (e.g. if the remote host was a MySQL database you could check what queries were running on its connection/session; if it was a RabbitMQ you could see what queues its connection was operating on).

Don't get me wrong: strace is still an incredibly useful tool every Linux developer, no matter where they are in the stack, should know (Julia Evans made that case better than I could[1]). I also understand why I/O systems switched to more "file descriptor opaque" tools like epoll/kqueue/io_uring, which makes sense and brings a lot of benefits. I just miss the transparency of the old APIs a little bit, you know?

1: https://jvns.ca/blog/2014/04/20/debug-your-programs-like-the...


I hated macOS for gating / crippling dtruss with SIP every time I need to find out what’s wrong with a random cmd.

Is there any alternatives on Mac?


Perhaps `trace`?

But yeah, I guess macOS is better suited for frontend/non system programming work, unlike Linux/BSD.


> Is there any alternatives on Mac?

Not "on", "to"

And the answer is "yes"

:-)


As a Mac user this stuff hurts to read and as a Linux user it never gets old.


I recently used Instruments.app to try to debug some filesystem calls. It seemed to have some relevant functionality without disabling SIP.


Yeah the fact that macOS doesn't have strace is a serious shortcoming


dtrace is a far more comprehensive tool than strace. The common complaint on MacOS is that SIP by default will not allow you to use it. The same issue occurs on Linux when you're running SELinux or similar.


SIP doesn’t block dtrace. You can use it just fine to e.g. trace dynamic library function calls.

What really is a problem on MacOS is that they stopped shipping kernel syscall symbols. So you can’t trace syscalls anymore in the default setup. I’ve read on some forum that you can download the symbols from some website and it works then, but didn’t bother with it at that point.

Ironically, despite DTrace originally being a Unix tool, I’ve had the most success using it on Windows. On Mac I can’t trace syscalls by default, on FreeBSD for some reason it didn’t notice that processes inside jails were opening files (despite DTrace running on the host)... On Windows on the other hand DTrace works perfectly for me.


> SIP doesn’t block dtrace.

  » dtrace echo
  dtrace: system integrity protection is on, some features will not be available
  dtrace: failed to initialize dtrace: DTrace requires additional privileges


You need to use sudo. I have SIP enabled and use DTrace regularly. Trust me, it works. (As long as you use it properly, i.e. not like an alias for strace the way you tried here.)


> You need to use sudo.

Sigh.

  » sudo dtrace echo hi
  dtrace: system integrity protection is on, some features will not be available

  dtrace: no probes specified
  (last command returned 1.)
  » sudo dtruss echo hi
  dtrace: system integrity protection is on, some features will not be available

  dtrace: failed to execute echo: Operation not permitted
  (last command returned 1.)
  »

There's a host of problems that come with that, though: how do you then execute the tracee? If your problem is nice & simple and works under root, that's great I suppose, but it always seems like some access to $HOME destroys that possibility for me.

Then, even ignoring all that … I've still yet to figure out a minimal example.

Give me a barebones minimal example. Every article I hit on this tool is right into space shuttle levels of complexity. That's what makes strace the winner.

> As long as you use it properly, i.e. not like an alias for strace the way you tried here.

… "you're holding it wrong."


On the many occasions where I've needed strace in macOS, on a number of them, I have tried to use dtrace. I have never once been successful with it; while I am sure there are dtrace experts merrily debugging anything and everything, the tool is nigh incomprehensible to the new user.


The analog to strace in macOS would be dtruss.

(AFAICT after some quick Googling, Linux strace is patterned after the old SunOS command, truss. SunOS had a trace(1) utility, too, but truss is the successor with features like fork following. So basically Linux strace == SunOS truss, and in the DTrace world strace == dtruss. If I have the history wrong, please correct me.)


The history is … interesting, I guess?

… but like, what we (anyone who has used strace on Linux) are looking for is an strace equivalent for macOS.

I've gone down the "maybe it's dtruss, and not dtrace? what's the difference?" branch in my attempts to get a working strace replacement on macOS, too, also without success.

dtruss, on macOS, emits the same errors as what I put in another comment on this subthread. Even as root, though that would open up a whole 'nother can of worms in attempting to get a working replacement. (And now I've gone down a whole new rabbit hole of "What is SIP? How does it work?" and … I'm still no closer to actually running something that looks like strace. 2024 has got to be the year of the Linux desktop, if macOS's UX is this.)


> dtrace is a far more comprehensive tool than strace.

Yes, but if you watch TFV you'll see that strace's advantage lies in large part in having decoders for things that DTrace and eBPF don't.


man fs_usage (report system calls and page faults related to filesystem activity in real-time :-)


Oh boy, strace is soooo useful, and not just for coding or sysadmin.

As a user, when e.g. my browser does funky stuff like freezing for no reason, my knee-jerk reflex is always strace.

I was initially a little put off by the Russian-style moody intro (the longish "why would you use such an antiquated thing as strace in 2023", which made me think he was going to tear down strace), but stick with the video past it: it is actually an ode to strace and it's well worth it.

Personally, I never realized strace had this many features.


> As a user, when e.g. my browser does funky stuff like freezing for no reason, my knee-jerk reflex is always strace.

I'm currently trying to figure out why the whole Chrome UI started freezing intermittently last two weeks, when scrolling, switching tabs, etc.

So what you're suggesting is to use dtrace (on macOS) and see what the output is around the time of a freeze?


Talking about strace per the title without watching; wow is strace ever helpful!

You can build an ordered tree of all the programs launched by a program, recursively, using strace. You can do the same for observing files opened. It's an amazing tool when you need to observe some software from the outside; in many ways it'll give you a much better understanding of a program than even reading its source code.


You should watch the video then, as it goes into more details about new strace features introduced in the last few years, like print only system calls with a specified return status (successful, failed, etc.).

I'm particularly intrigued about what I can do with system call tampering. I like the example of being able to make unlink() be a no-op, giving you easy access to a temporary file which is normally automatically cleaned up.


This would make an interesting GUI, possibly in a web browser, for cross-referencing effects between programs, such as the communication between a program and Wayland. Specifically, the parameters of a syscall create an effect, such as a write that blocks. Wireshark for syscalls. I think it already exists.

I kind of think of the interactions with the OS as the "effects" of the application like the IO Monad.

It's like a protocol too: function application in a certain order causes the effects of the software. The syscalls are a stream.

There is also an inner protocol in the buffers of binary protocols carried inside the syscalls, such as a network buffer.


In emacs you can use syslog-mode for analyzing strace output: https://github.com/vapniks/syslog-mode It allows you to easily navigate, filter, highlight lines, and lookup documentation.


The safer, lower-overhead alternative for tracing syscalls would be "perf trace" (and eBPF scripts), but I've found in the past that perf trace didn't decode enough syscall arguments (and just listed pointers to structs, where strace showed the info inside these structs). But I just ran a few simple "perf trace" commands on RHEL 9, and it looks like it has caught up somewhat.


He mentions all these other ways of doing tracing and I assume he means eBPF, but I've yet to see anything useful I can run from the CLI to use that.

So what is there out there? Because I'm a veteran Linux/Unix sysadmin for 25 years now and I love strace when I'm in a bind.

So what should I look at now based on eBPF instead?


Brendan Gregg wrote about it a lot:

https://www.brendangregg.com/ebpf.html

Ime it was hard to set up, idk if anything has changed recently. Looks like it's just packages now.


I have started using https://github.com/iovisor/bpftrace for things that I once used sysdig for, it works quite well and is not broken as often as sysdig.


another useful tool is sysdig with its curses gui csysdig which summarizes operations, you can filter on them etc.


Sysdig is very useful, but it's almost always broken on Fedora when I want to use it. I have recently started using https://github.com/iovisor/bpftrace instead, and so far it has covered the same use cases.


Sysdig is neat, but csysdig in my experience was extremely fragile and fickle, felt more like a demo than a usable tool. Which is a shame because it had some nice things to it.

Also at this point I'd say sysdig in general is just another also-ran together with systemtap and lttng (and others)


What are some options on macOS? Hard to do much with SIP enabled. Ran into the issue a few days ago and still no solution.


ktrace is somewhat similar, at least when it comes to syscall traces.


Just be aware that in prod strace can have real impacts and run-away side-effects please!


Such as?


Instead of explaining it poorly, try this admittedly older article from Brendan Gregg https://www.brendangregg.com/blog/2014-05-11/strace-wow-much...

tldr - syscalls and userland / kernel context switching


For those curious about the Windows equivalent, procmon is a system-wide strace tool; filtering mandatory.


Why use C in 2024?


strace works on anything that makes a system call - written in any language


Except it doesn't work on programs that call strace.


Why not? A simple "strace strace date" works on my machine.


OK, well, this used not to be the case.


It's always been the case, as far as I know. You may be thinking of the fact that you can't ptrace a process that's already being ptraced, which includes processes that are ptracing themselves.


Doesn't strace use the ptrace() system call under the hood?


Correct, which is why I'm mentioning it here. Stracing another instance of strace does not involve any of the processes ptracing themselves, though, and I don't think it ever has.


So they implemented recursion/reentrance as a special case?


This isn't recursion. "strace strace foo" first runs the "strace" binary (PID X), which attaches one ptrace to "foo" (PID Y) and runs an I/O loop printing retrieved trace data. The outer "strace" call launches as PID Z, and attaches a ptrace to PID X: the inner strace. The inner strace is not itself being ptraced until that happens, so there's no special case needed to allow "stracing strace".

However, you may be referring to a different scenario: multiple ptrace calls do hit the same process if you (for example) run "strace -p PID" for the same PID more than once simultaneously. In that case, strace invocations after the first one will fail with an error like 'strace: attach: ptrace(PTRACE_SEIZE, PID): Operation not permitted'.


The only constraint is that each process (or maybe each thread, I'm not sure off the top of my head) has at most one tracer. That constraint is not violated here.

In the case of strace -> strace -> date, the middle strace has one parent and one child. That's fine.

You can even have mutually recursive ptracing (occasionally used as an anti-debug strategy), since that doesn't violate the constraint either.
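
On Linux you can see the "at most one tracer" slot directly: /proc/<pid>/status has a TracerPid field, which is 0 when nothing is attached. A small Python sketch for reading it (the function names are made up for illustration; Linux only):

```python
def tracer_pid(status_text):
    """Return the TracerPid value from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("TracerPid:"):
            # The value follows the colon, separated by whitespace.
            return int(line.split(":", 1)[1])
    raise ValueError("no TracerPid field found")


def is_traced(pid="self"):
    """True if the given PID (default: this process) currently has a tracer."""
    with open(f"/proc/{pid}/status", encoding="utf-8", errors="replace") as f:
        return tracer_pid(f.read()) != 0
```

In the strace -> strace -> date case, the middle strace's TracerPid would be the outer strace, and date's would be the middle one.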


strace is language agnostic.

Any program that interacts with the OS - interacting with files, allocating memory, etc - will have to make system calls. Whether it's in C, Rust, Zig, Haskell, APL or Prolog.


I want to note that this is also true for “managed” languages like Java and Python.

However, because these languages have a far heavier runtime than the “systems” languages, it is usually possible to trace them at the runtime level rather than the syscall level, and doing that will typically give a better experience.


You troll, but I'll feed you. There is no universally better language for portable high-performance system programming for resource constrained platforms. Every single alternative has unacceptable trade-offs.


What is the unacceptable tradeoff of rust?


Not easily portable to most microcontroller targets: no SDK for our target SoCs and no official SoC/µC support. Increased code size (which is also a performance issue in many SoCs). High cost of retraining engineers. High cost of reengineering the existing code base and tooling. Lack of commercial support.

I'm optimistic about Rust, and it has taken great strides toward replacing C, but there are still many hurdles that prevent it from replacing C in microcontrollers and SoCs. Even if it weren't for the lack of a platform SDK and the cost of porting existing code, the risk is too great given the lack of official support for many commercial SoCs.


What SoC are you using out of curiosity?

ARM Cortex, RISC-V and Espressif all seem like they have first class support. Unless you are talking about peripherals, in which case, why would you expect the language to write/maintain low level drivers?

The hurdles to replace it in micros might be worth it depending on the team size and requirements. Personally, I wouldn't expect it to ever replace existing code bases, but could be a reasonable choice for greenfield designs.


In my case, wireless SoC chips (BT, BLE, DECT). Current generation are ARM M0 cores, but 16-bit RISC/CR16 architectures are widespread still. Not expecting language to provide platform support, but SoC providers don't either. They provide C frameworks only, if anything.


Thanks for the response, this provides valuable insight to those not working in the industry.


> Why use C in 2024?

Jesus Christ, it's 2024!

Enough with this constant insecurity that forces people to derail every damn unrelated thread with drive-by comments about how C programs should be rewritten in a different language.



