A reminder that on the platforms where eBPF is most commonly used, verifier bugs don't matter much, because unprivileged code isn't allowed to load eBPF programs to begin with. Bugs like this are thus root -> ring0 vulnerabilities. That's not nothing, but for server-side work it's usually worth the tradeoff, especially because eBPF's track record for kernel LPEs is actually pretty strong compared to that of the kernel as a whole.
The PoC uses eBPF maps as its out-of-bounds pointer, but it sounds like the bug would also be exploitable via non-extended (classic) BPF programs loadable via seccomp, since it's just improper scalar value range tracking, and seccomp doesn't require any privileges on most platforms.
And, of course, root -> ring0 is less of a barrier with unprivileged user namespaces, where you can make yourself "root", as we've seen in every eBPF bug PoC since distros started turning that on (most have since turned it back off).
Ok, that's fair. seccomp_check_filter actually has a more restrictive list than just "BPF with no backwards jumps", and in particular doesn't allow BPF_IND loads, so you can't read out of bounds because you can't use a dynamic displacement... but BPF_STX is allowed, so you can probably write out of bounds? The absolute BPF_W loads read from seccomp_data, and the control flow they show for computing incorrect scalar ranges doesn't require any backwards jumps...
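For anyone who hasn't written one recently, here's roughly what a classic-BPF seccomp filter looks like from C (a minimal illustrative sketch, unrelated to the PoC; the syscall and return policy are arbitrary). The only memory read is the absolute BPF_W load from seccomp_data, and both jump targets are forward:

    #include <stddef.h>
    #include <errno.h>
    #include <linux/filter.h>
    #include <linux/seccomp.h>
    #include <sys/prctl.h>
    #include <sys/syscall.h>

    int install_filter(void)
    {
        struct sock_filter filter[] = {
            /* A = seccomp_data.nr (the syscall number) */
            BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                     offsetof(struct seccomp_data, nr)),
            /* forward jump only: fail ptrace with EPERM, allow the rest */
            BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_ptrace, 0, 1),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        };
        struct sock_fprog prog = {
            .len = sizeof(filter) / sizeof(filter[0]),
            .filter = filter,
        };

        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
            return -1;
        return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
    }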
Let's not forget that we can also give CAP_BPF to containers. With things like Cilium on the rise, the attack vector of landing in a container environment that has CAP_BPF is more and more realistic.
I don't believe shared-kernel container systems are real security boundaries to begin with, so, to me, a container running with CAP_BPF isn't much different than any other program a machine owner might opt to run; the point is that you trust the workload, and so the verifier is more of a safety net than a vault door.
That pessimistic view is not shared by everyone who is working on namespaces, cgroups, etc., so I think that's a pretty unproductive comment in this context.
It reminds me of early days in hypervisors when someone would get an exploit to break out of the isolation and someone would dismiss it because “virtual machines aren’t real isolation anyway”.
Look, I get it and I frankly agree with you in the current state of the world, but this is the time to shut up and get out of the way of people trying to make forward progress. Breakouts of containers are a big deal for people pushing the boundary there.
I don't know who you're really talking to (it's not me), but all I'm saying is that CAP_BPF doesn't bother me much, because it's problematic only for a security boundary that is already problematic with a much lower degree of difficulty for attackers than the eBPF verifier.
> it's problematic only for a security boundary that is already problematic
I’m absolutely talking to you because you’re dismissing an issue in a space where people are actively working to make it not “already problematic”.
“I don’t care about hypervisor vulnerabilities because they are only problematic for a security boundary that is already problematic. Smart people are bare metal only.”
My point is you don’t dismiss something as unworthy of attention because it’s in a larger area that needs active attention.
I've seen that pattern repeated for decades, from "using credit cards to buy things online" to "cryptography protecting websites" to "hypervisors providing security".
It's pointlessly negative and doesn't contribute to meaningful technical discussion. It's a useful opinion if you're advising on which tech stacks to adopt today or whatever, but it's not interesting for the state of the art.
In Spanish, it's common for double negatives to not actually be double negatives. For example, if you wanted to say "there's nothing here", you'd say "no hay nada aquí", which word-for-word means "there's not nothing here".
Checking out the Royal Spanish Academy, here's what they say about it:
> The so-called "double negation" is due to the obligatory negative agreement that must be established in Spanish, and other Romance languages, in certain circumstances (see New Grammar, § 48.3d), which results in the joint presence in the statement of the adverb no and other elements that also have a negative meaning.
> The concurrence of these two "negations" does not annul the negative meaning of the statement.
Same in French: "Je ne sais pas" means I do not know, not I do not not know (aka I know).
In any case, the meaning of the sentence above, "uno no es ninguno", is clearly "one is not zero", or "one is not none", or "one is different from none".
"Uno no es nada" could be "one is nothing" or "one is not nothing". It all depends on the frame of reference (in this case English), but for this sentence the "one is not none" reading is correct, IMO. I would never even do a second pass on that sentence, as a native Spanish speaker (appeal to authority, I know).
The one time I tried to use eBPF it wasn't expressive enough for what I needed.
Does the limited flexibility it provides really justify the added kernel-space complexity? I can understand it for packet filtering, but some of the other stuff it's used for, like sandboxing, just isn't convincing.
There are other technologies for this, such as DTrace. The kernel's choice isn't eBPF or nothing, it's eBPF or something else like it.
You may not use it much, but some people use it all day. I think FAANG engineers have said that they run tens (hundreds?) of these things on all servers, all the time. And that's excluding one-offs. And FAANG has full time kernel coders on staff, so they're also funding this complexity that they use.
But also yes, I've solved problems by using eBPF. Problems that are basically unsolvable by non-kernel-gurus without eBPF. I rarely need it. But when I need it, there's nothing else that does the trick.
In some cases, even for kernel gurus, it's a choice between eBPF or maintaining a custom kernel patch forever.
> There are other technologies for this, such as DTrace. The kernel's choice isn't eBPF or nothing, it's eBPF or something else like it.
To add to this point: I successfully used SystemTap a few years ago to debug an issue I was having.
Before going further: keep in mind that my point of view (at the time) was that of somebody working as a devops engineer, debugging some annoyances with containers (managed by Kubernetes) going OOM. I'm no kernel developer, and I have a basic-to-good understanding of the C language based on first-year university courses and geekiness/nerdiness. So in this context I'm a glorified hobbyist.
Learning SystemTap is easier, in my opinion. I followed a tutorial by Red Hat to get the hang of the manual parts, but after that I remember it being fairly easy:
1. Try to reproduce the issue you're having (fairly easy for me)
2. Skim the source code of the Linux kernel around the part that you think might be relevant (for me it was the OOM killer)
3. Add probes in there, see if they fire when you reproduce the issue
4. Look back at the source code of the kernel and see what chain of data structures and fields you can follow to reach the piece of information you need
5. Improve your probes
6. If successful, you're done
7. Goto 4
I think it took like one or two days between following the tutorial and getting a working probe.
DTrace and eBPF are "not so different" in the sense that DTrace programs / hooks are also a form of low-level code / instruction set that the kernel (the dtrace driver) validates at load. It's an "internal" artifact of DTrace though, https://github.com/illumos/illumos-gate/blob/master/usr/src/... and to my knowledge, nothing like a clang/gcc "dtrace target" exists to translate more-or-less arbitrary higher-level code "to low-level DTrace".
The additional flexibility eBPF gets from this is amazing, really, while DTrace is a more targeted (and, for its intended use cases, in some situations still superior to eBPF) but also less general tool.
Yes, thank you. Long before eBPF existed, we spent a ton of time on the safety of DTrace[0][1] -- there's a bunch of subtlety to it. The proof is in the pudding, however: thanks to our strict adherence to the safety constraint, we have absolute confidence in using DTrace in production.
I'm curious which part of these tenets you feel would have prevented the bug demonstrated, besides "oh, we tried harder"? I don't see any of them that seem unique to DTrace, other than limiting where probes can be placed.
Well, we didn't merely "try harder" -- we treated safety as a constraint which informed every aspect of the design. And yes, treating safety as a constraint rather than merely an objective results in different implementation decisions. From the article:
> This working model significantly increases the attack surface of the kernel, since it allows executing arbitrary code at a high privilege level. Because of this risk, programs have to be verified before they can be loaded. This ensures that all eBPF security assumptions are met. The verifier, which consists of complex code, is responsible for this task.
> Given how difficult the task of validating that a program is safe to execute is, there have been many vulnerabilities found within the eBPF verifier. When one of these vulnerabilities is exploited, the result is usually a local privilege escalation exploit (or container escape in containerized environments). While the verifier's code has been audited extensively, this task also becomes harder as new features are added to eBPF and the complexity of the verifier grows.
DTrace was developed over 20 years ago; there have not been "many vulnerabilities" found in the verifier -- and we have not grown the complexity of the verifier over time. You can dismiss these as implementation details, but these details reflect different views of the problem and its constraints.
No, like, the bug that was demonstrated seems to be fairly fundamental to running any sort of bytecode in the kernel: they need to verify all branches, and this is potentially slow, so they optimize it (which is where the bug is). What are you doing differently? It seems to me that you’re either not going to optimize this or you are?
The DTrace instruction set is more limited than that of the eBPF VM; eBPF is essentially a fully functional ISA, whereas DTrace was (if I'm remembering this right) designed around the D script language. An eBPF program is often just a clang C program, and you're trusting the kernel verifier to reject it if it can't be proven safe. Further: eBPF programs are JIT'd to actual machine code; once you've loaded and verified an eBPF program, it has conceptually all the same power as, say, shellcode you managed to load into the kernel via an LPE.
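To make that concrete, here's a minimal sketch of what "just a clang C program" means (libbpf-style, built with clang -target bpf; the probed kernel function and all names are purely illustrative):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    /* attach at the entry of a kernel function via a kprobe */
    SEC("kprobe/do_sys_openat2")
    int trace_openat(void *ctx)
    {
        const char msg[] = "openat called\n";

        /* a helper call; the verifier checks the pointer and length */
        bpf_trace_printk(msg, sizeof(msg));
        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";

The verifier walks the compiled bytecode and either proves every memory access and helper call safe or refuses to load the program; only then is it JIT'd.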
That's not to say that security researchers couldn't find DTrace vulnerabilities if they, for instance, built DIF/DOF fuzzers of 2023 levels of sophistication for them. I don't know that anyone's doing that, because DTrace is more or less a dead letter.
For those who read this thread - DTrace is in use in Solaris and in Illumos, and various of us who use Illumos for our production use cases (like Oxide does) still very much use DTrace.
I appreciate the rest of tptacek's comment which is informative. I also acknowledge that there may not be fuzzers written that have been disclosed.
Oh, sorry, totally fair call-out. There's like a huge implicit "on Linux" thing in my brain about all this stuff.
I'd also be open to an argument that the code quality in DTrace is higher! I spent a week trying to unwind the verifier so I could port a facsimile of it to userland. It is a lot. My point about fuzzers and stuff isn't that I'm concerned DTrace is full of bugs; I'd be surprised if it was. My thing is just that everything written in memory-unsafe kernel code falls to Google Project Zero-grade vulnerability research at some point.
That's true of the rest of the kernel, too! So from a threat perspective, maybe it doesn't matter. I think my bias here --- that's all it is --- is that neither of these instrumentation schemes are things I'd want to expose to a shared-kernel cotenant.
For context, DTrace's D/DIF probe code is much more restricted:
- it cannot branch backwards (this is also true of eBPF)
- it can only do ternary operator branches
- it cannot define functions
- functions it can call are limited to some builtin ones
- it can only scribble on the one pre-allocated probe buffer
- it can only access the probe's defined parameters
If the verifier can prove to itself that a loop is bounded, it'll accept it. A good starting intuition for eBPF itself: if a normal ARM program could do it, eBPF can do it. It's a fully functional ISA.
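For example, something like this sketch verifies fine (the attach point is arbitrary; since roughly kernel 5.3 the verifier accepts loops whose bound it can prove, and older kernels required #pragma unroll):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    SEC("tracepoint/syscalls/sys_enter_write")
    int bounded_loop(void *ctx)
    {
        int total = 0;

        /* the bound is a compile-time constant, so the verifier can walk
           every iteration and prove the loop terminates */
        for (int i = 0; i < 64; i++)
            total += i;

        /* a loop bounded by a value the verifier can't pin down would be
           rejected instead */
        return total & 1;
    }

    char LICENSE[] SEC("license") = "GPL";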
It depends on what you're using it for. If you want to expose this to untrusted code, yes, but I wouldn't be comfortable doing that with DTrace either.
There are two untrusted-code cases here: untrusted DTrace scripts / users, and untrusted targets for inspection. The latter has to be possible to examine, so the observability tools (like DTrace) have to be secure for that purpose. This means you want to make it difficult to overflow buffers in the observability tools.
There's also a need to make sure that even trusted users don't accidentally cause too much observability load. That's why DTrace has a circular probe buffer pool, it's why it drops probes under load, it's why it pre-allocates each probe's buffer by computing how much the probe's actions will write to it, it's why it doesn't allow looping (since that would make the probe's effect less predictable), etc.
Bryan, Adam, and Mike designed it this way two plus decades ago, and Linux still hasn't caught up.
Linux has a different design than DTrace; eBPF is more capable as a trusted tool, and less capable for untrusted tools. It doesn't make sense to say one approach has "caught up" to the other, unless you really believe the verifier will reach a state where nobody's going to find verifier bugs --- at which point eBPF will be strictly superior. Beyond that, it's a matter of taste. What seems clearly to be true is that eBPF is wildly more popular.
It's really hard to bring a host to its knees using DTrace, yet it's quite powerful for observability. In my opinion it's better to start with that, then add extra power where it's needed.
I understand the argument, but it's clear which one succeeded in the market. Meanwhile: we take pretty good advantage of the extra power eBPF gives us over what DTrace would, so I'm happy to be on the golden path for the platform here. Like I said, though: this is a matter of taste.
And I should say that DTrace probe actions can dereference pointers, but NULL dereferences do not cause crashes, and rich type data is generally available.
DTrace does not have arrays on purpose, because they reasoned that those bounds checks would not be secure. And DTrace provides seamless tooling up into user-space scripts: kernel, libc, scripts.
Until eBPF came around and said we can now prove it to be secure.
Until the sidechannel hackers came around to prove the opposite.
> I've solved problems by using eBPF. Problems that are basically unsolvable by non-kernel-gurus without eBPF. I rarely need it.
Would you mind giving some examples? I recently started learning about eBPF from Liz Rice's book and am curious about what makes eBPF the correct choice in a particular scenario.
I'm not sure "Google engineers use it" is a very good counter argument. They have a very high tolerance for complexity and like most large corporations what actually gets built and used tends to be driven more by internal politics than technical merit.
I don't mean it as a counter argument, or I don't think the way you mean it, at least.
You may not use it at your smaller scale. But there are millions of machines out there that do use it, and the alternative for the same functionality is much worse.
I bet you never use SCTP sockets either. eBPF is used much more than SCTP.
And its users "fund" its development, so it's not a burden to those who don't use it.
But are you sure your systems don't use it? Run "bpftool prog" to see. Whatever you see there, someone thought it was better than the alternative.
Wouldn't even a classic loadable kernel module be a better choice than a custom patch or eBPF? I know they're unsafe, but people who deal with them know that the power comes with responsibility.
No? SREs roll eBPF programs on the fly just in the process of debugging problems; if you tried to do that with an LKM, you'd almost certainly blow up your system. People who write Linux kernel code routinely crash their systems in the process of development.
In the setting eBPF is used today, most of the value of the verifier is that it's hard to accidentally crash your kernel with a bad eBPF program. That is comically untrue about an ordinary LKM.