Awesome post, as always. I know you've taken a look at shark (https://github.com/sharklinux/shark) at one point, which seems similar in a way, except that you write probes (either perf_events or eBPF) in just Lua. Do you have any notes about how it compares vis-a-vis BCC?
This seems to be an example eBPF back- and front-end: https://github.com/sharklinux/shark/blob/master/samples/bpf/...
From the looks of it, the eBPF program is also specified in some C (dialect?). I wonder how it's similar to or different from BCC. Judging from shark's dependencies, LLVM is also involved.
EDIT: it seems they're encouraging people to download a precompiled llc-bpf for now: https://github.com/sharklinux/llc-bpf
EDIT2: (sorry for all the edits). My last question: you say "What's new in Linux 4.3 is the ability to print strings from Extended Berkeley Packet Filters (eBPF) programs.", and that you needed this for many tools. However, I can't see that in the biolatency source you pasted. It seems to be aggregating a map in kernel space and printing it out from user space once it's done, just as was possible in kernel 4.1+. What am I missing?
Author of BCC here... I hadn't seen shark before, but the mindset does appear similar.
A couple of comparison points I noticed while briefly investigating shark:
- C code is passed to clang+llc as external calls, whereas in BCC clang+llvm are statically linked in.
- Both support native (Lua/Python) bindings to the eBPF maps.
- In shark, I don't see an easy way to dereference kprobe'd function arguments, as is done in `bpf_trace_printk("1 W %s %d %d ?\\n", req->rq_disk->disk_name, ...)` in <bcc>/tools/biosnoop.
This should also answer your last question about where "%s" is used. tools/opensnoop also uses string printks.
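To make that concrete, here's a minimal BCC-style sketch in the spirit of biosnoop (not its actual source; the handler name and probed function are my own choices for illustration):

```c
#include <uapi/linux/ptrace.h>
#include <linux/blkdev.h>

// kprobe handler: the parameters after ctx map to the probed function's
// arguments, so req is the struct request * seen by the block layer.
// BCC's rewriter turns the req->rq_disk->disk_name walk into
// bpf_probe_read() calls behind the scenes.
int trace_req_start(struct pt_regs *ctx, struct request *req)
{
    // printing an in-kernel string via %s is the Linux 4.3 capability
    // discussed in the post
    bpf_trace_printk("disk: %s\n", req->rq_disk->disk_name);
    return 0;
}
```

You'd load this with the Python BPF() class and attach it via attach_kprobe() to a block I/O function such as blk_start_request (the exact symbol depends on your kernel version).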
Comparison points aside, I intentionally made sure that the clang legwork that is being done in BCC is wrapped with a C api, so any language bindings besides python should be trivial to implement. It would be ideal (in my mind) if shark could leverage libbcc and make both tools better in the process.
Can you outline how bcc translates that req->rq_disk->disk_name expression to bytecode? AIUI, there are no pointer-dereferencing bytecodes. Is it using the BPF_FUNC_probe_read?
Exactly, it is using bpf_probe_read. Internally, BCC uses clang's Rewriter functionality to mangle valid C (but invalid BPF) into valid C with bpf helper functions.
The req->rq_disk->disk_name expression would expand into a chain of bpf_probe_read() calls.
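A rough sketch of the rewritten form (paraphrased rather than the literal rewriter output; the intermediate variable name is mine):

```c
// The pointer dereference req->rq_disk is replaced by a bpf_probe_read()
// that copies the pointer value into a local variable.
struct gendisk *rq_disk = 0;
bpf_probe_read(&rq_disk, sizeof(rq_disk), &req->rq_disk);

// rq_disk->disk_name is a char array member, so it's just address
// arithmetic on the local pointer; on Linux 4.3+ bpf_trace_printk's %s
// can read and print the string at that kernel address.
bpf_trace_printk("%s\n", rq_disk->disk_name);
```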
If you are playing with the tools, the BPF() class takes an optional argument debug=, where bit 2 (0x4) will print the rewritten C output for your edification.
Shark looks promising... I haven't tried it yet so I don't have much of an opinion. Contributions seem a bit sparse: it's not in heavy development. Maybe that will change over time.
Yes, it would have been nice to have DTrace in Linux a while ago. But the reality is a bit more complex than it might look.
Many people I know haven't been aware that Linux has had in-built dynamic tracing capabilities for years, with ftrace, kprobes and later uprobes. These are much more difficult to use (which is why I wrote some front-ends: https://github.com/brendangregg/perf-tools). But if you really cared about performance, you could use them. So it's not that Linux has been completely missing out; it's been missing out on some specific features (e.g., in-kernel aggregations and variables) and a nice interface.
As for the unabated attempts to reinvent DTrace: FWIW, that's not really the genesis in the case of bcc/eBPF. Extended Berkeley Packet Filter (eBPF) was developed to create virtual network infrastructure. Things like this: https://www.iovisor.org/sites/cpstandard/files/pages/images/... . It provides a kernel virtual machine for executing sandboxed bytecode. As a bonus, it can be used for system tracing as well, adding custom capabilities that ftrace/perf_events were missing.
There are too many tracers for Linux already, so I'm pretty glad that eBPF is getting integrated into the Linux kernel: it should discourage anyone from inventing yet another Linux tracer, since the kernel will already have a powerful tracer in-built. We'll no longer need new tracers. We'll want front-ends, like bcc.
My beef was with exactly that: too many things that do system tracing, none as good as DTrace. But I will take your word for it that eBPF will finally make it possible to have something closer to a good tracer like DTrace, and that all the other incompetent attempts will die off.
Also: Thank you for working on Linux tracing - your blog posts have been very valuable to keep track of where it's all headed.
Can you explain, for someone who isn't familiar with the kernel, why there are so many different and apparently overlapping tracing tools built in, instead of just one? Are there plans to clean up and unify them under something like bcc?
Another example of duplication is BPF and iptables, but I suppose iptables isn't going away soon simply because it's so popular (and, compared to BPF, simple to use)?
There are at least 9 different tracers for Linux, but only 3 are built in:
- perf_events (the "perf" command), which is intended as the official end-user profiler/tracer. It's great at PMCs and sampling. It can do dynamic tracing and tracepoints.
- ftrace (currently being renamed to "trace", to bring its ambiguity on par with doing an internet search for "perf"), which is really a collection of custom lightweight tracing capabilities, developed to aid the real-time kernel work.
- eBPF, which is an engine that both perf_events and trace could make use of.
So you could say the plan is that there is one: perf_events. There's already work to bring eBPF to perf. perf already has a Python scripting interface, so it's not inconceivable that one day bcc will become part of perf.
Some (f)trace capabilities could be rewritten/improved in eBPF, which could mean some cleanup. But the ftrace implementation wasn't bulky to start with.
FWIW, I managed to compile SystemTap with DWARF debug info (using a kernel with CONFIG_DEBUG_INFO, CONFIG_KPROBES and CONFIG_UPROBES) on my Slackware-current box just recently, and have been using it to write process probes on shared libraries with success.
It's somewhat slow, the UI is clunky, and I had to comment out a rather anal sanity check related to module timestamps in the source, but it works. By far the biggest drawback is that it has no formal API; instead you have to wrangle subprocesses to automate it. It's pretty ugly to see things like the SystemTap initscript or the SystemTap toolkits executing scripts as interpolated strings through a template engine, though that's reality.
It has certainly improved from its former reputation as a routine kernel panic trigger, though fiddling with guru mode haphazardly will still hang or crash your system perhaps a bit too easily.
Yes, SystemTap is much better nowadays, but people are still spooked from the panics of years ago. (And I'm partly to blame, as I wrote about its panics in 2011.)
SystemTap has a grammar and extensive tapsets, so one interesting possibility would be for SystemTap to use eBPF as a backend: compile to eBPF bytecode. You'd then get safety, a grammar, and all the tapsets, which cover many tracing odds and ends that you sometimes run into and that SystemTap solved years ago.
Although at times I think the tapsets have gone too far, and become too big and untested (one of the reasons I started writing lightweight tools https://github.com/brendangregg/systemtap-lwtools).
Honestly, I see BPF being used for way more than tracing. Having a generic kernel virtual machine is game-changing for a lot of kernel/user interfaces, if you can secure it a little more.
Yeah, totally. It needs more time to bake in general. I was mainly referring to the fact that its current requirement of CAP_SYS_ADMIN is kind of a non-starter for more general use at the moment.
Yeah, in its current form it absolutely should require CAP_SYS_ADMIN. A restricted subset (and probably some extensions) would be required to remove that requirement. There's a lot of benefit to doing that work, though. Think AIO programs that kick off new work on completion, kind of like S/360 channel programs. Or emulation of regular devices that approaches the speed of paravirtualized devices in KVM. But you need to find a (safe) way to allow non-privileged users to access this functionality for it to make sense.
I think the licence is not the only problem here. The way some things were implemented within the Linux kernel has made things the way they are today. I mean, DTrace could maybe be ported to Linux, I don't know, but the best thing to do would be to reach out to Bryan Cantrill and ask him; I think he would give you a pretty reasonable answer. He spoke about some of this on the BSD Now podcast.
So, the DTrace language specification and other public interfaces are well-documented. GNU has already reimplemented a bunch of other stuff, so why not DTrace too?
Am I the only one who thought for a moment this might be about Bruce Evans' C compiler (bcc)? I believe Linux used bcc in the early days. I know they used the assembler (as86) and I think they still do. Could be wrong.