Awesome post, as always. I know you've taken a look at shark (https://github.com/sharklinux/shark) at one point, which seems similar in a way, except that you write probes (either perf_events or eBPF) in just Lua. Do you have any notes about how it compares vis-a-vis BCC?
This seems to be an example eBPF back- and front-end: https://github.com/sharklinux/shark/blob/master/samples/bpf/...
From the looks of it, the eBPF program is also specified in some C (dialect?). I wonder how it's similar to or different from BCC. Judging from shark's dependencies, LLVM is also involved.
EDIT: it seems they're encouraging people to download a precompiled llc-bpf for now: https://github.com/sharklinux/llc-bpf
EDIT2: (sorry for all the edits). My last question: you say "What's new in Linux 4.3 is the ability to print strings from Extended Berkeley Packet Filters (eBPF) programs.", and that you needed this for many tools. However, I can't see that in the biolatency source you pasted. It seems to be aggregating a map in kernel space and printing it out from user space once it's done, just as was possible in kernel 4.1+. What am I missing?
Author of BCC here... I hadn't seen shark before, but the mindset does appear similar.
A couple of comparison points I noticed while briefly investigating shark:
- C code is passed to clang+llc as external calls, whereas in BCC clang+llvm are statically linked in.
- Both support native (Lua/Python) bindings to the eBPF maps.
- In shark, I don't see an easy way to dereference kprobe'd function arguments, as is done in `bpf_trace_printk("1 W %s %d %d ?\\n", req->rq_disk->disk_name, ...)` in <bcc>/tools/biosnoop.
This should also answer your last question about where "%s" is used. tools/opensnoop also uses string printks.
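To make that concrete, here's a minimal BCC-style sketch in the spirit of biosnoop (not its actual source; the handler name and probed function are my own choices for illustration):

```c
#include <uapi/linux/ptrace.h>
#include <linux/blkdev.h>

// kprobe handler: the parameters after ctx map to the probed function's
// arguments, so req is the struct request * seen by the block layer.
// BCC's rewriter turns the req->rq_disk->disk_name walk into
// bpf_probe_read() calls behind the scenes.
int trace_req_start(struct pt_regs *ctx, struct request *req)
{
    // printing an in-kernel string via %s is the Linux 4.3 capability
    // discussed in the post
    bpf_trace_printk("disk: %s\n", req->rq_disk->disk_name);
    return 0;
}
```

You'd load this with the Python BPF() class and attach it via attach_kprobe() to a block I/O function such as blk_start_request (the exact symbol depends on your kernel version).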
Comparison points aside, I intentionally made sure that the clang legwork that is being done in BCC is wrapped with a C api, so any language bindings besides python should be trivial to implement. It would be ideal (in my mind) if shark could leverage libbcc and make both tools better in the process.
Can you outline how bcc translates that req->rq_disk->disk_name expression to bytecode? AIUI, there are no pointer-dereferencing bytecodes. Is it using the BPF_FUNC_probe_read?
Exactly, it is using bpf_probe_read. Internally, BCC uses clang's Rewriter functionality to mangle valid C (but invalid BPF) into valid C with bpf helper functions.
The req->rq_disk->disk_name expression would expand into a chain of bpf_probe_read() calls.
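A rough sketch of the rewritten form (paraphrased rather than the literal rewriter output; the intermediate variable name is mine):

```c
// The pointer dereference req->rq_disk is replaced by a bpf_probe_read()
// that copies the pointer value into a local variable.
struct gendisk *rq_disk = 0;
bpf_probe_read(&rq_disk, sizeof(rq_disk), &req->rq_disk);

// rq_disk->disk_name is a char array member, so it's just address
// arithmetic on the local pointer; on Linux 4.3+ bpf_trace_printk's %s
// can read and print the string at that kernel address.
bpf_trace_printk("%s\n", rq_disk->disk_name);
```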
If you are playing with the tools, the BPF() class takes an optional argument debug=, where bit 2 (0x4) will print the rewritten C output for your edification.
Shark looks promising... I haven't tried it yet so I don't have much of an opinion. Contributions seem a bit sparse: it's not in heavy development. Maybe that will change over time.
Yes, it would have been nice to have DTrace in Linux a while ago. But the reality is a bit more complex than it might look.
Many people I know haven't been aware that Linux has had in-built dynamic tracing capabilities for years, with ftrace, kprobes and later uprobes. These are much more difficult to use (which is why I wrote some front-ends: https://github.com/brendangregg/perf-tools). But if you really cared about performance, you could use them. So it's not that Linux has been completely missing out; it's been missing out on some specific features (e.g., in-kernel aggregations and variables) and a nice interface.
As for the unabated attempts to reinvent DTrace: FWIW, that's not really the genesis in the case of bcc/eBPF. Extended Berkeley Packet Filter (eBPF) was developed to create virtual network infrastructure. Things like this: https://www.iovisor.org/sites/cpstandard/files/pages/images/... . It provides a kernel virtual machine for executing sandboxed bytecode. As a bonus, it can be used for system tracing as well, adding custom capabilities that ftrace/perf_events were missing.
There are too many tracers for Linux already, so I'm pretty glad that eBPF is getting integrated into the Linux kernel: it should discourage anyone from inventing yet another Linux tracer, since the kernel will already have a powerful tracer in-built. We'll no longer need new tracers. We'll want front-ends, like bcc.
My beef was with exactly that: too many things that do system tracing, none as good as DTrace. But I will take your word for it that eBPF will finally make it possible to have something closer to a good tracer like DTrace, and that all the other incompetent attempts will die off.
Also: Thank you for working on Linux tracing - your blog posts have been very valuable to keep track of where it's all headed.
Can you explain, for someone who isn't familiar with the kernel, why there are so many different and apparently overlapping tracing tools built in, instead of just one? Are there plans to clean up and unify them under something like bcc?
Another example of duplication is BPF and iptables, but I suppose iptables isn't going away soon simply because it's so popular (and, compared to BPF, simple to use)?
There are at least 9 different tracers for Linux, but only 3 are built in:
- perf_events (the "perf" command), which is intended as the official end-user profiler/tracer. It's great at PMCs and sampling. It can do dynamic tracing and tracepoints.
- ftrace (currently being renamed to "trace", to bring its ambiguity on par with doing an internet search for "perf"), which is really a collection of custom lightweight tracing capabilities, developed to aid the real-time kernel work.
- eBPF, which is an engine that both perf_events and trace could make use of.
So you could say the plan is that there is one: perf_events. There's already work to bring eBPF to perf. perf already has a Python scripting interface, so it's not inconceivable that one day bcc will become part of perf.
Some (f)trace capabilities could be rewritten/improved in eBPF, which could mean some cleanup. But the ftrace implementation wasn't bulky to start with.
FWIW, I managed to compile SystemTap with DWARF debug info (using a kernel with CONFIG_DEBUG_INFO, CONFIG_KPROBES and CONFIG_UPROBES) on my Slackware-current box just recently, and have been using it to write process probes on shared libraries with success.
It's somewhat slow, the UI is clunky, and I had to comment out a rather anal sanity check related to module timestamps in the source, but it works. By far the biggest drawback is that it has no formal API; instead you have to wrangle subprocesses to automate it. It's pretty ugly to see things like the SystemTap initscript or the SystemTap toolkits executing scripts as interpolated strings through a template engine, though that's reality.
It has certainly improved from its former reputation as a routine kernel panic trigger, though fiddling with guru mode haphazardly will still hang or crash your system perhaps a bit too easily.
Yes, SystemTap is much better nowadays, but people are still spooked from the panics of years ago. (And I'm partly to blame, as I wrote about its panics in 2011.)
SystemTap has a grammar and extensive tapsets, so one interesting possibility would be for SystemTap to use eBPF as a backend: compile to eBPF bytecode. You'd then get safety, a grammar, and all the tapsets, which cover many tracing odds and ends that you sometimes run into and that SystemTap solved years ago.
Although at times I think the tapsets have gone too far, and become too big and untested (one of the reasons I started writing lightweight tools https://github.com/brendangregg/systemtap-lwtools).
Honestly, I see BPF being used for way more than tracing. Having a generic kernel virtual machine is game-changing for a lot of kernel/user interfaces, if you can secure it a little more.
Yeah, totally. It needs more time to bake in general. I was mainly referring to the fact that its current requirement of CAP_SYS_ADMIN is kind of a non-starter for more general use at the moment.
Yeah, in its current form it absolutely should require CAP_SYS_ADMIN. A restricted subset (and probably some extensions) would be required to remove that requirement. There's a lot of benefit to doing that work, though. Think AIO programs that kick off new work on completion, kind of like S/360 channel programs. Or emulation of regular devices that approaches the speed of paravirtualized devices in KVM. But you need to find a (safe) way to allow non-privileged users to access this functionality for it to make sense.
I think the licence is not the only problem here. The way some things were implemented within the Linux kernel has made things the way they are today. I mean, DTrace could maybe be ported to Linux, I don't know, but the best thing to do would be to reach out to Bryan Cantrill and ask him; I think he would give you a pretty reasonable answer. He spoke about some of this on the BSD Now podcast.
So, the DTrace language specification and other public interfaces are well-documented. GNU has already reimplemented a bunch of other stuff, so why not DTrace too?
Am I the only one who thought for a moment this might be about Bruce Evans' C compiler (bcc)? I believe Linux used bcc in the early days. I know they used the assembler (as86) and I think they still do. Could be wrong.