A reference counting bug which leads to local privilege escalation in io_uring

ec109685 · on June 22, 2021

It's still strange that there is zero testing of these functions (e.g. that fix doesn’t come with any tests).

Because there isn’t any requirement for testing, these functions are allowed to become super complex, which makes it harder to see where errors could occur.

staticassertion · on June 22, 2021

Linux is extremely optimized for happy path coding.

pjmlp · on June 22, 2021

Why is this strange? I've never worked on a project where fixes required tests beyond some QA guy/gal stating "it works, done".

onei · on June 22, 2021

From [1]

> MariaDB includes test cases for all fixed bugs.

It's not something I've practiced much myself, but writing a test case to reproduce the bug and then fixing the bug seems like a more reasonable form of TDD in my head.

I've seen it once when reporting a bug to another team I worked alongside, where I was stress testing a new feature and found said bug. Instead of having to run the stress test, they wrote a unit test to reproduce it at a much smaller scale and then had a much smaller feedback loop to ensure it worked after they fixed it.

[1] https://mariadb.com/kb/en/mariadb-vs-mysql-features/
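The reproduce-then-fix workflow described above can be sketched as a small regression test. This is only an illustration: `split_batches` and the "dropped final batch" bug are hypothetical stand-ins, not anything taken from MariaDB or the kernel.

```python
import unittest


def split_batches(items, batch_size):
    """Split items into consecutive batches of at most batch_size elements.

    Hypothetical function under test. The assumed original bug dropped the
    final partial batch; the slicing below is the fixed behaviour.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


class TestSplitBatchesRegression(unittest.TestCase):
    def test_final_partial_batch_is_kept(self):
        # Small-scale reproduction of a failure first seen under stress:
        # 5 items in batches of 2 must end with a batch of 1.
        self.assertEqual(
            split_batches([1, 2, 3, 4, 5], 2),
            [[1, 2], [3, 4], [5]],
        )

    def test_exact_multiple_has_no_extra_batch(self):
        self.assertEqual(split_batches([1, 2, 3, 4], 2), [[1, 2], [3, 4]])
```

Saved to a file, this can be run with `python -m unittest <file>`. The first test would fail against the buggy version and pass after the fix, which gives the short feedback loop mentioned above.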

pjmlp · on June 22, 2021

Sure, but that is actually more the exception than the rule.

jjav · on June 23, 2021

In many projects a test case is required which reproduces the bug and then shows that it's fixed.

tptacek · on June 21, 2021

Further grist for the mill about the effectiveness of seccomp-style filtering for multitenant Docker, since it's unlikely anyone was filtering out `io_uring_setup`.

catblast01 · on June 22, 2021

People can do whatever they want with seccomp-bpf obviously, but is it really that uncommon to use it for whitelisting? As for kernel vulnerabilities being a weakness of sandboxing in general, if anyone still doesn’t understand that by now it must be willful and I don’t know if they can be helped.
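To make the allowlist/denylist distinction concrete, here is a toy model in Python. It does not install a real seccomp-bpf filter; it only models the decision such a filter would make. The syscall numbers are the x86-64 values, and the two example policies are assumptions for illustration.

```python
# Toy model of seccomp-style filtering decisions (no real filter is installed).
# Syscall numbers below are the x86-64 values.
SYS_READ = 0
SYS_WRITE = 1
SYS_PTRACE = 101
SYS_EXIT_GROUP = 231
SYS_IO_URING_SETUP = 425


def denylist_allows(syscall_nr, denied):
    """Denylist policy: everything is allowed unless explicitly listed."""
    return syscall_nr not in denied


def allowlist_allows(syscall_nr, allowed):
    """Allowlist policy: everything is rejected unless explicitly listed."""
    return syscall_nr in allowed


# Hypothetical policies written without io_uring in mind.
DENIED = {SYS_PTRACE}
ALLOWED = {SYS_READ, SYS_WRITE, SYS_EXIT_GROUP}

# A syscall the policy author never considered passes the denylist...
denylist_result = denylist_allows(SYS_IO_URING_SETUP, DENIED)
# ...but is rejected by default under the allowlist.
allowlist_result = allowlist_allows(SYS_IO_URING_SETUP, ALLOWED)
```

Under these assumed policies, `io_uring_setup` gets through the denylist and is blocked by the allowlist. Whether a given real deployment's allowlist includes it is a separate question.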

tptacek · on June 22, 2021

No matter how you mask off attack surface for the kernel, you're not super likely to want to disable io_uring, is the point I'm making. It's easy to find recent threads here with people sticking up for shared-kernel multitenant isolation.

(Be forewarned that I'm talking my book a bit here, since we have a commercial thingy built on multitenant VMM isolation).

touisteur · on June 22, 2021

BTW while on the topic, what do you think about having a heavy host kernel with a guest VMM attached to the network with a hardened Firecracker and a dedicated network interface? Would you feel it's 'better' than shared kernel/OS + namespaces? Or is it 'smallest hardened root hypervisor or no go'? Not sure I'm making sense...

tptacek · on June 22, 2021

The heavyweight host (which is the normal state of affairs) is problematic attack surface; moving the workload into a hardened VMM on that improves security regardless.

touisteur · on June 22, 2021

Thanks Thomas for the insight.

CodesInChaos · on June 25, 2021

Isn't the standard pattern dropping privileges after the setup is finished?

catblast01 · on June 22, 2021

> sticking up for shared-kernel multitenant isolation.

Seems like willful snake oil.

sva_ · on June 21, 2021

For a moment, I thought 'escalati' (in the title of the submission) was some kind of professional term that had so far evaded me. It sounds pretty elegant. But of course, the title was just cut off. Almost disappointing.

hospadar · on June 21, 2021

Escalati: the secretive guild of hereditary escalator engineers who maintain the escalators in the Illuminati's secret volcano lair (escalator reliability engineering is a major concern when world leaders are frequently escalating over giant cauldrons of molten lava)

spartanatreyu · on June 21, 2021

Escalati: Cousins to the Air Conditioning Repairmen: https://www.youtube.com/watch?v=KrcY6PXkGuE

froh · on June 22, 2021

What a puzzling fragment of American culture you just unearthed for us! It says s03e06, so it survived a surprising while. Was this popular among the HN crowd? Was it all so absurd?

rocqua · on June 22, 2021

This is really a minor part of the show "Community". Whilst Community definitely deserves a watch, it is a weird comedy about community college. It is not focused on technology in any way.

Give the show a try, it is good! But don't expect a tech focus.

microtherion · on June 21, 2021

The Escalati - a secret society controlling the world by means of privilege escalation.

988747 · on June 21, 2021

As opposed to the Illuminati, who try to do the same with smart lightbulbs?

loopz · on June 21, 2021

Have we lightened up yet?

Lammy · on June 21, 2021

You got it backwards. Remember that when people say “illuminati” they are speculating about occultists, not about illumists.

dcminter · on June 21, 2021

The same - though I read it as being a tongue-in-cheek plural for escalation in a security context. Perfect for high-falutin' conference papers!

edoceo · on June 21, 2021

pwn2own: escalati the boxen!

_ofdw · on June 22, 2021

>Escalati

Found the name for my next CTF team.

jtbayly · on June 21, 2021

escalati: plural of escalatum

OR

escalati: The beings who control the illuminati

edoceo · on June 21, 2021

Here's the CVE

https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-2022...

For some reason the article doesn't link there :(

e12e · on June 21, 2021

Thank you. But CVE seems to disagree with the headline?

> The highest threat from this vulnerability is to data integrity, confidentiality and system availability.

Or is this more of a "read/modify /etc/shadow or /sbin/su" kind of thing?

tptacek · on June 21, 2021

As I read it: it's a kernel UAF; memory corruption, in the context of the kernel. There's a secondary attack vector related to the refcount mishandling, where you can obtain control of file table entries after an `execve`, even if you exec a SUID, which is also bad.
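As a rough illustration of how refcount mishandling turns into a use-after-free, here is a toy model in Python. It is not the kernel code and does not reproduce the specifics of this CVE; `RefCounted`, `stash`, and `demo` are hypothetical names, and "freeing" is simulated with a flag.

```python
class UseAfterFree(Exception):
    """Raised when a freed object is touched (toy stand-in for memory corruption)."""


class RefCounted:
    def __init__(self, name):
        self.name = name
        self.refs = 1           # the creator holds the first reference
        self.freed = False

    def _check(self):
        if self.freed:
            raise UseAfterFree(self.name)

    def get(self):
        """Take an additional reference."""
        self._check()
        self.refs += 1
        return self

    def put(self):
        """Drop a reference; the object is freed when the count hits zero."""
        self._check()
        self.refs -= 1
        if self.refs == 0:
            self.freed = True   # a real kernel would hand the memory back here

    def read(self):
        self._check()
        return self.name


def stash(obj, take_ref):
    """Keep a long-lived pointer to obj. Skipping get() is the modelled bug."""
    return obj.get() if take_ref else obj


def demo(take_ref):
    f = RefCounted("file")
    stashed = stash(f, take_ref)
    f.put()                     # the original owner drops its reference
    try:
        return stashed.read()   # later use through the stashed pointer
    except UseAfterFree:
        return "use-after-free"
```

With `take_ref=True` the stashed pointer keeps the object alive and `demo` returns `"file"`; with `take_ref=False` the object is freed while still reachable and `demo` returns `"use-after-free"`. In a real kernel the freed memory may already have been reused for something else, which is what makes this class of bug exploitable.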

secondcoming · on June 21, 2021

The actual code bug:

https://bugzilla.redhat.com/show_bug.cgi?id=1873476#c16

amerine · on June 21, 2021

Any idea what the diagrams were generated with? It looks graphviz-y to me.

jandrese · on June 21, 2021

Strangely, the Red Hat bug tracker listed in the CVE has this issue closed with "NOTABUG". I guess it's not technically Red Hat's problem?

https://bugzilla.redhat.com/show_bug.cgi?id=1873476

wereHamster · on June 21, 2021

> The affected code was not introduced into any kernel versions shipped with Red Hat Enterprise Linux making this vulnerable not applicable to these platforms.

Might explain the strange status.

saagarjha · on June 21, 2021

It would be nice if the title mentioned what was affected, perhaps something like "CVE-2021-20226: io_uring privilege escalation via reference counting bug".

dang · on June 21, 2021

That's easy. We don't need CVE numbers in titles. The information is trivially available to anyone who needs it.

(Submitted title was "CVE-2021–20226 a reference counting bug which leads to local privilege escalati".)

hsbauauvhabzb · on June 21, 2021

If anything, [GNU/Linux] would be more relevant.

NewJazz · on June 22, 2021

What does this have to do with GNU? Afaict it affects any Linux kernel of the relevant version.

hsbauauvhabzb · on June 22, 2021

You’re probably right that this is the one time where GNU is unrelated to the problem. Congratulations.

mhh__ · on June 21, 2021

So HN should be optimized for people who don't click the link?

marshray · on June 21, 2021

Perhaps the titles at least should be optimized for people deciding whether to click the link.

jrockway · on June 22, 2021

I think these articles are more aimed at the postmortem aspect; reading about a bug that happened so you can try to avoid it when you're designing a similar system. So it doesn't really matter that it affects Linux, or what io_uring is, etc. The lesson is relevant even if you use Foobian GNU/OpenBSD emulated under Windows 11 on an M2 Mac.

If you are just looking for notifications that you should patch your system, you probably want a method other than HN for that -- you will miss a lot of critical patches.

junon · on June 22, 2021

Man. I've been on the io_uring train since basically the beginning.

Since following, I've seen reason after reason not to ever use it. Between the skewed performance tests, the dubious funding (coming from Facebook), and the several security risks (including this one), I just don't see it taking off.

touisteur · on June 22, 2021

Skewed performance tests? Dubious funding? I'm not sure what you mean. Can't Facebook fund any technical work?

I see some reasons not to use it (like it's a Vulkan-like low-level-only API, or portability issues, or some missing APIs in my case(s)) but 'it was funded by Facebook' and 'it has security risks'? I mean I'm not sure there's an area in the kernel without corporate funding/support, or without past security bugs. We keep finding vulnerabilities in IPv6, SCTP... Yes, the Linux kernel dev process is lacking, but likely for different symptoms & causes?

junon · on June 22, 2021

The project has a long history of skewing the performance benchmarks and making wild claims like "60% faster than epoll" and the creator gets very defensive when anyone questions it.

There are claims about SQPOLL that simply cannot be reproduced. Several ongoing threads about it on the liburing tracker.

There have been privilege escalation problems from the start that seem like they aren't being addressed.

It feels as though Facebook wants to have something revolutionary at the cost of quality and they're using the Linux kernel to do so, though I realize that's my own hot take.

Again, I've been following this (and writing code for it) since I could get my hands on the dev branches. It's only really good for filesystem I/O in its current form as that's the biggest focus they have for it. They (Facebook) care less about other resource types (e.g. sockets).

You can choose to believe me or not, I suppose.

touisteur · on June 22, 2021

Thanks for taking the time to expand a bit more, on your experience with the code and the community around it. I'll have a look.

Right now I'm more interested in the chaining aspect than raw performance, but I know some of my high-throughput network workloads behave far better in latency with the complete syscall removal on recv and send, though keeping up with the completion queue is hard-ish. I still prefer DPDK right now for this kind of network stuff, but only because my use case is perfectly adapted for it (no fragmentation, no complex protocol, constant data stream...), and DPDK ain't no party either.