I discovered the vulnerability, and I'm not entirely sure that Trevor Jay fully understands the issue (though to be fair, the easiest way of exploiting it is using ptrace(2) which is blocked by most default security policies). You don't need to use ptrace(2) or CAP_SYS_PTRACE to exploit the vulnerability.
You just need proc_fd_access_allowed() to succeed. I've not checked if ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS) calls into SELinux hooks (it probably does, and if it doesn't then resolving further files probably does too) but neither seccomp profiles (unless you're blocking open(2)) nor blocking CAP_SYS_PTRACE can help you here.
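To make the /proc angle concrete, here is a minimal sketch of the interface involved: reading the symlinks under /proc/<pid>/fd for a process. It is not an exploit and does not involve any race. `list_open_fds` is a hypothetical helper name, and whether each call succeeds against another process depends on kernel checks such as the one named above plus any LSM policy. Assumes Linux with /proc mounted.

```python
import os


def list_open_fds(pid):
    """Map each fd number of `pid` to the target of its /proc/<pid>/fd symlink.

    Hypothetical helper for illustration only. Raises PermissionError if the
    kernel refuses access to the directory.
    """
    fd_dir = f"/proc/{pid}/fd"
    targets = {}
    for name in os.listdir(fd_dir):
        try:
            targets[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The fd was closed between listdir() and readlink(), or the
            # kernel denied reading this particular link.
            continue
    return targets


if __name__ == "__main__":
    # Inspect our own descriptors; a process can always do this.
    for fd, target in sorted(list_open_fds(os.getpid()).items()):
        print(fd, "->", target)
```

Opening one of those paths with open(2) hands you the underlying file rather than a copy of the symlink, which is why no ptrace(2) call is needed once access is allowed.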
Now, the LXC exploit used ptrace in order to stop the process from closing its file descriptors. I'm not sure how you would reliably hit the race in this issue (something with SIGSTOP presumably?).
In any case, SUSE's update has additional fixes which also fix the issue even when you give a container CAP_SYS_PTRACE (the released patch does _not_ protect containers that have CAP_SYS_PTRACE enabled). The patches will be merged upstream ASAP, but Docker didn't want them in the patchset sent to its customers (preferring instead to update their vendored runC once they are merged upstream).
Sorry for not seeing your comment until now. Amazingly great vuln BTW. It's early in 2017, but this is probably going to be one of this year's best.
It's very important for everyone to understand my advice is RHEL/Fedora specific, which is---I think---the source of the misunderstanding here.
Putting aside `ptrace` being the best way to guarantee a race win, the reason for my focus on `CAP_SYS_PTRACE` is that with SELinux enabled there is no other way to exploit having access to the file descriptors. Even if you explicitly try to pass a containerized process an external file descriptor "legitimately" (e.g. with `sendmsg`) SELinux will still ultimately block the access due to the type restrictions. This means that with `setenforce 1` you need to use something like code injection to get the external process to access the file descriptors on your behalf.
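For readers who haven't seen the "legitimate" route, this is roughly what passing a descriptor with `sendmsg` looks like: the fd travels as `SCM_RIGHTS` ancillary data over a UNIX socket. It is a sketch of the mechanism only. `send_fd` and `recv_fd` are hypothetical helper names, and nothing here models the SELinux type enforcement that, per the comment above, would still deny use of the received descriptor inside a container.

```python
import array
import os
import socket


def send_fd(sock, fd):
    """Send one file descriptor over a UNIX socket as SCM_RIGHTS data."""
    fds = array.array("i", [fd])
    # At least one byte of ordinary data must accompany the ancillary data.
    sock.sendmsg([b"x"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])


def recv_fd(sock):
    """Receive one descriptor sent with send_fd(); returns the new fd number."""
    fds = array.array("i")
    _, ancdata, _, _ = sock.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
    for level, kind, data in ancdata:
        if level == socket.SOL_SOCKET and kind == socket.SCM_RIGHTS:
            # Ignore any trailing partial integer in the control message.
            fds.frombytes(data[: len(data) - (len(data) % fds.itemsize)])
            return fds[0]
    raise RuntimeError("no file descriptor received")


if __name__ == "__main__":
    a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    r, w = os.pipe()
    send_fd(a, r)            # hand the read end of the pipe to the peer
    received = recv_fd(b)    # a new fd number referring to the same pipe
    os.write(w, b"hello")
    print(os.read(received, 5))
```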
> Sorry for not seeing your comment until now. Amazingly great vuln BTW. It's early in 2017, but this is probably going to be one of this year's best.
Thanks. :D
> Putting aside `ptrace` being the best way to guarantee a race win, the reason for my focus on `CAP_SYS_PTRACE` is that with SELinux enabled there is no other way to exploit having access to the file descriptors. Even if you explicitly try to pass a containerized process an external file descriptor "legitimately" (e.g. with `sendmsg`) SELinux will still ultimately block the access due to the type restrictions. This means that with `setenforce 1` you need to use something like code injection to get the external process to access the file descriptors on your behalf.
Ah okay, yeah I suspected that's what you meant (on _RHEL_ xyz is the case). Thanks for clarifying.
> [...] you meant (on _RHEL_ xyz is the case) [...]
>
Totally on me. We fight against it, but it's hard not to have the implicit context of RHEL/Fed be omnipresent on the Red Hat bugzilla. In fact, when I wrote the comment in question I had just finished lighting my incense to the sīla of `systemd`... :)
Did not realize you were in Sydney. AU truly has the best hackers.
"This is an extremely difficult to exploit flaw on standard RHEL and Fedora systems.
I checked the 1.10.3 and 1.12.5 builds on Brew. Both drop the `CAP_SYS_PTRACE` capability by default. 1.10.3 blacklists `ptrace` calls under the default seccomp profile. Thus, this flaw only comes into play for containers that already have elevated privileges.
Even if `ptrace` is available, the proposed exploit scenario of quickly attaching to a process joining the container space and using its file descriptors is not possible under the default SELinux configuration. The containerized PID 1 will have a type of `container_t` or a similar SELinux type and thus will be blocked by standard type enforcement from accessing any resources that haven't already been made available to containerized processes."
For anyone who is annoyed by SELinux enough to drop it, which is most people, and anyone who would ever like to ptrace a process, which is most people, or anyone who doesn't use distros like these with these default protections for container processes, it's still viable.
Every once in a while I'll say something along the lines of "Wouldn't it be nice if there was <describes SELinux>" or "wouldn't it be nice if you could <describes using strace>" etc. and get enthusiastic nods, disbelief or "if only" and have to break it to someone that they don't know what the hell they're doing and they simply overlooked a huge feature of their production platform.
Oh good, yet another vulnerability from the model of retroactively changing the execution environment of a process after it's been created. We had a thread about setuid binaries a week ago, which is the most common case of this design: https://news.ycombinator.com/item?id=13312722
We would all be better off if we designed systems such that some helper process, already running with the right environment / config / privileges, spawns the process for you and proxies input/output to your terminal.
And (as I mentioned in the other thread) this helper process could be literally sshd. Instead of having sudo, ssh root@localhost. No weird process trees with confusing things like effective UIDs. Instead of having runc exec, ssh root@container. No file descriptors get passed that aren't explicitly forwarded over the SSH connection.
Patching sshd to run over UNIX sockets without encryption and to use getpeername() for authentication is left as an exercise to the reader.
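As a rough idea of what the authentication half could look like: the comment above names getpeername(), but on Linux the usual way to learn who is on the other end of a UNIX socket is the `SO_PEERCRED` socket option, so that is what this sketch uses instead. `peer_credentials` is a hypothetical helper name, the code is Linux-specific, and it says nothing about how sshd itself would need to change.

```python
import os
import socket
import struct


def peer_credentials(sock):
    """Return (pid, uid, gid) of the peer of a connected AF_UNIX socket.

    Linux-only: reads struct ucred via SO_PEERCRED. The values are those
    recorded by the kernel when the connection (or socketpair) was created.
    """
    fmt = "iII"  # pid_t, uid_t, gid_t as laid out in struct ucred on Linux
    raw = sock.getsockopt(socket.SOL_SOCKET, socket.SO_PEERCRED,
                          struct.calcsize(fmt))
    return struct.unpack(fmt, raw)


if __name__ == "__main__":
    # A socketpair stands in for a client connecting to the helper daemon.
    server_side, client_side = socket.socketpair(socket.AF_UNIX,
                                                 socket.SOCK_STREAM)
    pid, uid, gid = peer_credentials(server_side)
    print(f"peer pid={pid} uid={uid} gid={gid}")
```

The point of the design is that the helper decides what to run based on credentials the kernel vouches for, rather than on anything the caller's environment can influence.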
No patches necessary :) I use this quite successfully in my ~/.ssh/config currently. With some options bolted on, you could probably disable encryption too:
Hah, that is so close to the design I'm advocating, except that lxc-attach is yet another command in the mold of runc exec or sudo :-)
That design certainly lets lxc-attach permit only running sshd, though, which is a benefit. (And I forgot about sshd -i, which you could presumably point at a UNIX socket using socat or something.)
I think that is closer to how you do things in Windows, first set up the process environment and finally call CreateProcess or something like that. Maybe someone who knows Win32/NT internals better could comment?
I've often wondered what it would take to have a linux environment with setuid completely turned off, and this is actually a really interesting thought on how to achieve part of that. Nice. :)
The Go ssh library works fine over unix sockets, and you can easily customise auth, so it is pretty easy to make quick custom clients and servers with it.
How would you implement e.g. ping? I mean obviously you could have e.g. user accounts which were "sufficiently" locked down, but that seems like an even more likely source of problems.
1. ping hasn't needed to be setuid since Linux 3.0 (and isn't on most distros): the kernel lets you call socket(AF_INET, SOCK_DGRAM, IPPROTO_ICMP) without privileges, which lets you send ping packets and nothing else. Looks like Mac OS X and FreeBSD, at least, also support the same interface. This approach has successfully been used to eliminate other setuid binaries like pt_chown, which fixes ownership of a tty (the kernel now just sets the ownership correctly when you call open).
2. Use something like ForceCommand to allow "ssh root@localhost /sbin/ping", and make /bin/ping a shell script that does that. This carries strictly less complexity than making a setuid binary; any attack that applies to it also applies to setuid binaries, but the execution environment for setuid binaries is more open to the attacker's control.
3. Make a little ping server that you can request to conduct pings for you. For ping in particular this is probably silly, but for things like updating utmp (traditionally you make every program setgid utmp, or you use a helper setgid binary called utempter), there's probably some existing daemon like logind that can grow some small APIs.
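Option 1 is easy to try. Below is a minimal sketch, assuming Linux: it builds an ICMP echo request and attempts to open the datagram ICMP socket. The helper names are hypothetical. One caveat from my side that is not in the comment above: on Linux the unprivileged socket is gated by the `net.ipv4.ping_group_range` sysctl, so socket creation can still fail on systems where your group is outside that range, and the sketch returns `None` in that case.

```python
import socket
import struct

ICMP_ECHO_REQUEST = 8  # ICMP type for an echo request ("ping")


def icmp_checksum(data):
    """RFC 1071 Internet checksum over `data` (padded to an even length)."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_echo_request(ident, seq, payload=b""):
    """Build an ICMP echo request: type, code, checksum, id, sequence, payload."""
    header = struct.pack("!BBHHH", ICMP_ECHO_REQUEST, 0, 0, ident, seq)
    csum = icmp_checksum(header + payload)
    return struct.pack("!BBHHH", ICMP_ECHO_REQUEST, 0, csum, ident, seq) + payload


def open_ping_socket():
    """Try to open an unprivileged ICMP datagram socket; None if not permitted."""
    try:
        return socket.socket(socket.AF_INET, socket.SOCK_DGRAM,
                             socket.IPPROTO_ICMP)
    except OSError:
        # Typically EACCES when the sysctl excludes the caller's group.
        return None


if __name__ == "__main__":
    sock = open_ping_socket()
    if sock is None:
        print("unprivileged ICMP sockets are not available to this user")
    else:
        sock.settimeout(1.0)
        sock.sendto(build_echo_request(1, 1, b"ping"), ("127.0.0.1", 0))
        try:
            reply, addr = sock.recvfrom(1024)
            print("reply from", addr[0], "type", reply[0])
        except socket.timeout:
            print("no reply within 1s")
        finally:
            sock.close()
```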
File capabilities are certainly better than setuid, but they still have the same problem of elevating privileges in a potentially-attacker-controlled environment. If setuid ping has a vulnerability that lets you get root, CAP_NET_RAW ping would also have a vulnerability that lets you read all traffic into the machine and spoof packets from privileged ports or existing connections. That's an uncomfortably large amount of access, even if it isn't quite root.
CoreOS engineers started deploying patches across all channels for this CVE minutes after it was made public. More info here: https://coreos.com/blog/cve-2016-9962.html