One thing that really excites me about getting this into runC is that now we can work on making other parts of container orchestration and management run as an unprivileged user.
Huge props to the Cloud Foundry team who already have taken rootless containers and have some experimental support for them[1]. It'd be awesome if we could do something similar to Kubernetes so that you could start clusters as an unprivileged user (in my mind the networking is the hardest part and I think the only way right now is to implement pseudo-bridge interfaces in userspace). But I'm pretty excited about the possibilities. :P
While it'd be really cool to go to said offices for a merge party, I feel like that might be a bit too self-indulgent. Though meeting some of the Pivotal team in Sydney might be fun. :P
You're welcome to drop by most Tuesdays for lunch or breakfast at 8:30 any day of the week, we're at 155 Clarence. Drop me a line at abhih at pivotal.io if you'd like that :)
Hey, another Sydneysider here (currently based in China), did some early implementations with LXC, eg. https://github.com/globalcitizen/lxc-gentoo (in US) and talked with the IBM guys at that time.
Nitpicks: About ~9min in your talk on a user namespace slide you talk about device creation restrictions implying it is linked to user namespaces but I am pretty sure that is the device cgroup's job and it's possible to bugger that up if you're not careful. Similarly at 19:50 or so I believe the statements about things not working only apply to some container systems, eg. IIRC mknod can be allowed and controlled on a per-major:minor basis via the device cgroup.
Another security area I looked at was docker grsec incompatibilities, eg. https://github.com/docker/docker/issues/20303 though didn't get closure despite suggesting what IMHO seems a decent fix. Meh.
Re: ~32:50 + crystal balling: My personal impression is that the whole container based infrastructure / development trend will continue to snowball into larger devops solutions that automatically secure systems in many ways (via cgroup configurations, kernel security toolkit policies such as syscall and device whitelisting, readonly path lockdowns, network traffic and bandwidth restrictions, etc.) over the next few years. Existing tools like atime, fsnotify and network traffic dumps should make this pretty easy. IMHO it should be a natural progression for obtaining low hanging security enhancements currently unused due to configuration complexities. It will be basically enabled through decent workflow (ie. CI/CD + testing), and may result in a "safe CFLAGS"-like set of aggressiveness presets for new services. Thought this out and implemented to some extent while at Kraken (~2011-2015). (Now working on a more hardware/mech eng. related business and am no longer spending time in the area. Down for a drink next trip though, email in profile!)
> About ~9min in your talk on a user namespace slide you talk about device creation restrictions implying it is linked to user namespaces [...]
No, mknod() is also gated by a capable(CAP_MKNOD) check if the node is a character or block device (see vfs_mknod). Which you can't have if you've created an unprivileged namespace. Devices cgroup is an additional restriction but you hit capable(CAP_MKNOD) long before that check.
> Similarly at 19:50 or so I believe the statements about things not working only apply to some container system
The entire talk is about rootless containers, so I'm specifically talking about the case where you have only mapped a single user and don't have the ability to modify cgroups. I'm well aware how to grant access to devices if you have root -- the whole point of rootless containers is that you don't have root. ;)
> Re: ~32:50 + crystal balling: [...]
Yeah, I wrote that slide about 10 minutes before my talk. My point was that I'm really hoping one day we see containers (or specifically the sandboxing capabilities of containers) be integrated into normal applications and extended to the point where every user uses this stuff. And as you said, the main benefit is that you can then apply a bunch of useful security profiles that knock out the low-hanging fruit.
> No, mknod() is also gated by a capable(CAP_MKNOD) check if the node is a character or block device (see vfs_mknod). Which you can't have if you've created an unprivileged namespace. Devices cgroup is an additional restriction but you hit capable(CAP_MKNOD) long before that check.
Namespaces (of which there are many, so 'unprivileged' must be qualified) != cgroup controllers (of which there are similarly many) != capabilities (of which there are also many).
IIRC these are separate and no use of one implies the other. Therefore I do not understand your comment and believe it may be in error, as it seemed both in your talk and in this reply that you are implying a link is now implied at the kernel level. I strongly suspect that your impression comes from container userland runtime system specific code (in which my "for some of these comments it depends which container runtime you are using" comment stands), and I strongly doubt it is kernel code, but am always more than willing to be re-educated.
> Namespaces (of which there are many, so 'unprivileged' must be qualified) != cgroup controllers (of which there are similarly many) != capabilities (of which there are also many).
You're correct that namespaces != cgroups. But capabilities are actually part of namespaces (well, specifically the user namespace but a lot of permission checks in the kernel depend on other namespaces and having capabilities in pinned namespaces). So while you could say that capabilities != namespaces it's not really accurate since they are quite inter-related. Not to mention that every process is in a set of cgroups and namespaces and has a set of capability sets.
As for the specific example of mknod, here's a link to the relevant kernel source[1]. You're not correct in your assumption.
cgroups and namespaces are separate when you are actually administrating them, and they are wholly separate subsystems. However, at the end of the day any security policy will have to be applied to whatever syscalls they apply to. capable(CAP_MKNOD) is a namespace check, and devcgroup_inode_mknod is a cgroup check.
> I strongly suspect that your impression comes from container userland runtime system specific code
I read the kernel code for namespaces and cgroups all the time when debugging issues. I'm well aware how containers work from both the userspace abstraction side as well as the kernel side (and have written kernel code for containers too).
> and I strongly doubt it is kernel code, but am always more than willing to be re-educated.
See the link[1] for the example of mknod. My point was that both cgroups and namespaces gate the requirements for creating new devices. And that makes logical sense because any user can create a new user namespace with full capabilities, but the device cgroup rules aren't changed by doing that (so it would allow for unprivileged users to do all sorts of bad things as a result).
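To make the "both gate it" point concrete, here is a toy model of the two checks in the order described above. It is only a sketch and not kernel code: `Caller`, `may_mknod` and the two boolean fields are made-up names, and the booleans stand in for the results of capable(CAP_MKNOD) and devcgroup_inode_mknod.

```python
import stat
from dataclasses import dataclass


@dataclass
class Caller:
    # Hypothetical inputs standing in for the two kernel-side checks.
    has_cap_mknod: bool     # outcome of the capable(CAP_MKNOD) check
    devcgroup_allows: bool  # outcome of the devices cgroup check


def may_mknod(caller: Caller, mode: int) -> str:
    """Return which gate (if any) rejects a mknod() with this mode."""
    is_device = stat.S_ISCHR(mode) or stat.S_ISBLK(mode)
    if not is_device:
        # FIFOs, sockets and regular files are not gated by either check here.
        return "ok"
    if not caller.has_cap_mknod:
        # The capability gate is hit first.
        return "denied: CAP_MKNOD"
    if not caller.devcgroup_allows:
        # The devices cgroup is an additional restriction on top.
        return "denied: devices cgroup"
    return "ok"
```

In this model a rootless caller would be `Caller(has_cap_mknod=False, devcgroup_allows=...)`, so a character device request stops at the first gate no matter what the devices cgroup says.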
Yes, obviously mknod is capability checked in the kernel, as there is a capability with that very name. Where else would it occur? I never suggested otherwise.
My point was that in your talk and comments you make some suggestions which I feel are misleading or inaccurate. In some cases this is probably because you are discussing your own implementation. In other cases this is because I believe your use of terminology is incorrect or misleading, eg. capabilities are not 'part of' namespaces, any more than block devices are 'part of' the filesystem. Similarly, the phrase 'unprivileged namespace' comes across as exceptionally obtuse, given that most security checks occur in other kernel subsystems, and a holistic conception is implied by the word 'unprivileged'. Naming is hard, but kernel terminology should already be consistent.
I can't see us getting anywhere further with this discussion but an honest thanks for taking the time to respond.
> Yes, obviously mknod is capability checked in the kernel, as there is a capability with that very name.
Yes, and it's checked against the root user namespace, not your current one. Which means that it's not as simple as "do I have this capability". It's "do I have this capability in this user namespace". Which is what I said originally, so I'm not sure what you're debating here.
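A rough way to picture "do I have this capability in this user namespace" is the simplified model below. It is a sketch of the idea only, not the kernel's implementation: `UserNS`, `Cred`, `ns_capable` and `capable` here are illustrative Python names, UIDs are treated as kernel-global values, and the example UID 1000 is an assumption.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

CAP_MKNOD = 27  # numeric value from <linux/capability.h>


@dataclass(eq=False)
class UserNS:
    parent: Optional["UserNS"]  # None for the initial (root) user namespace
    owner_uid: int              # UID of whoever created the namespace


@dataclass
class Cred:
    user_ns: UserNS             # namespace the process lives in
    euid: int
    effective: FrozenSet[int]   # effective capability set, as cap numbers


def ns_capable(cred: Cred, target_ns: UserNS, cap: int) -> bool:
    """Does cred hold cap *with respect to* target_ns?"""
    ns = target_ns
    while ns is not None:
        if ns is cred.user_ns:
            # Same namespace: the effective set decides.
            return cap in cred.effective
        if ns.parent is cred.user_ns and ns.owner_uid == cred.euid:
            # The creator of a child namespace holds all caps over it.
            return True
        ns = ns.parent
    # Walked past the initial namespace without finding ours.
    return False


def capable(cred: Cred, init_ns: UserNS, cap: int) -> bool:
    """Plain capable() in this model: always asks about the root namespace."""
    return ns_capable(cred, init_ns, cap)


# A rootless container: namespace created by UID 1000, full caps inside it.
init_ns = UserNS(parent=None, owner_uid=0)
child_ns = UserNS(parent=init_ns, owner_uid=1000)
rootless = Cred(user_ns=child_ns, euid=1000, effective=frozenset({CAP_MKNOD}))
```

Here `ns_capable(rootless, child_ns, CAP_MKNOD)` is true while `capable(rootless, init_ns, CAP_MKNOD)` is false, which is the distinction being made about mknod.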
> My point was that in your talk and comments you make some suggestions which I feel are misleading or inaccurate. In some cases this is probably because you are discussing your own implementation.
Can you give some examples? If I'm wrong about something, I'd love to see an example of it so I can correct it in the future. In particular, I'd like to know what parts of my discussion are related to runC's implementation of containers.
> capabilities are not 'part of' namespaces, any more than block devices are 'part of' the filesystem
Capabilities are effectively 4 bitmasks that are scoped to user namespaces. Yes, they aren't technically part of "struct user_namespace_t" but the way they are used makes them quite intricately related. Much more intricately related than block device inodes in a filesystem.
> given that most security checks occur in other kernel subsystems, and a holistic conception is implied by the word 'unprivileged'.
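If you want to look at those bitmasks for a process, they show up as the `Cap*` lines of `/proc/<pid>/status` (newer kernels also print an ambient set there). A small sketch of decoding them follows; `parse_cap_sets`, `has_cap` and the `SAMPLE` text are illustrative and not taken from any real system.

```python
from typing import Dict

CAP_MKNOD = 27  # numeric value from <linux/capability.h>

# Made-up excerpt in the format of /proc/<pid>/status.
SAMPLE = (
    "Name:\tsh\n"
    "CapInh:\t0000000000000000\n"
    "CapPrm:\t0000000008000000\n"
    "CapEff:\t0000000008000000\n"
    "CapBnd:\t000001ffffffffff\n"
)


def parse_cap_sets(status_text: str) -> Dict[str, int]:
    """Collect every 'Cap*' field as an integer bitmask."""
    sets = {}
    for line in status_text.splitlines():
        name, _, value = line.partition(":")
        if name.startswith("Cap"):
            sets[name] = int(value.strip(), 16)
    return sets


def has_cap(mask: int, cap: int) -> bool:
    """Test a single capability bit in a mask."""
    return bool((mask >> cap) & 1)
```

On a live system you would pass `open("/proc/self/status").read()` instead of `SAMPLE`. Keep in mind the masks only say what the process holds in its own user namespace.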
But those security checks generally will use either ns_capable (so they're all hooked up to the user namespace scoping logic), basic UID/GID checks (which are also hooked up to user namespace mapping logic) or some other security framework that is effectively an additional ACL (and isn't the purview of namespaces anyway).
The main complaint I would understand from the use of "unprivileged namespace" is that the kernel doesn't actually mark namespaces as being privileged or unprivileged, it just so happens that unprivileged namespaces act slightly differently because of how they are set up. But AFAICS that's not the argument you're making.
> Another security area I looked at was docker grsec incompatibilities, eg. https://github.com/docker/docker/issues/20303 though didn't get closure despite suggesting what IMHO seems a decent fix. Meh.
I don't like Docker's chrootarchive any more than the next person, but your fix is not trivial (unless the fix you're referring to was to disable the security features). Part of the point of chrootarchive is that it protects against a malicious archive modifying the host filesystem through dodgy paths -- something that using a chroot prevents. Otherwise chrootarchive would have to implement all of the chroot path handling code the kernel normally deals with (which is not good fun, take it from me).
The security issues that grsec is trying to protect users from do not apply to chrootarchive because (in principle) the binary being run in chrootarchive is safe (it is `tar -x` so shouldn't actually be trying to access or otherwise mess with disks for example).
The general 'application resolution' of grsec features is system-wide, although with additional configuration some features may be limited to certain users/processes.
This in a nutshell is why the incompatibility is significant: systems may be running more than just a docker workload requiring these security features, so saying "it's secured in the normal docker case anyway", even if true, doesn't really help to solve the problem.
The preferred suggested fix was "mknod and chmod during image builds to be executed from outside of any chroot". In siding with docker in not considering this fix, you seem to be alluding to last year's http://seclists.org/fulldisclosure/2016/Oct/96 vulnerability, now fixed. A more intelligent fix would be 'require a modern/secure archive extraction tool'.
The second fallback was "Docker could check for the existence of /proc/sys/kernel/grsecurity/chroot_deny_chmod and /proc/sys/kernel/grsecurity/chroot_deny_mknod and, if present, ensure the values of both are 0."
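For reference, the detection half of that fallback could look something like the sketch below. It only reads the two sysctls named above and never writes to them; the function names are made up, and it assumes (not verified here) that the files contain a plain integer.

```python
from pathlib import Path
from typing import Dict, Optional

GRSEC_DIR = "/proc/sys/kernel/grsecurity"
FLAGS = ("chroot_deny_chmod", "chroot_deny_mknod")


def read_chroot_flags(base: str = GRSEC_DIR) -> Dict[str, Optional[int]]:
    """Read each flag; None means the file is missing or unreadable."""
    result: Dict[str, Optional[int]] = {}
    for name in FLAGS:
        try:
            result[name] = int((Path(base) / name).read_text().strip())
        except (OSError, ValueError):
            result[name] = None
    return result


def chroot_restrictions_active(flags: Dict[str, Optional[int]]) -> bool:
    """True if any flag is present and non-zero."""
    return any(value not in (None, 0) for value in flags.values())
```

The `base` parameter exists only so the check can be pointed at a test directory. What a tool should do once it sees a non-zero value is a separate question.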
> so saying "it's secured in the normal docker case anyway", even if true, doesn't really help to solve the problem.
I don't think you read my comment properly. I'm saying that the security feature doesn't apply to Docker, but disabling it would be bad because it will affect other applications on the system.
> The preferred suggested fix was "mknod and chmod during image builds to be executed from outside of any chroot".
Which as I said, wouldn't be a good idea because of how chrootarchive works. In short it would lead to path traversal attacks, as with the issue you mentioned. The fix was to properly chroot things IIRC.
> A more intelligent fix would be 'require a modern/secure archive extraction tool'.
I agree. Are you willing to write a patch for it (or add patches to GNU tar?), because I'm afraid I don't have the time to do it at the moment (and I don't contribute to Docker that much these days).
> The second fallback was [...]
Which would be a bad thing to do because it would purposely disable security features that will affect other applications.
I suppose we'd better return this sculpture we had made then... If you're ever over London way (where most of the CF Garden team are), we'd love to host you for something! Otherwise hopefully some conference overlap.
You can't both use iptables and have access to host network interfaces (as an unprivileged user), if that's what you mean. You can either create a network namespace and manage network interfaces (but not have a bridge to the host) or just deal with what access you're given. In the future my plan is to write a CNI plugin that implements pseudo-bridges in userspace so that you could get both (though I'm not sure if it'll work at the moment).
But if you're an administrator you can restrict it like any other process.
As shown in the linked thread, this Hacker News submission was promoted via an image of the submitted link in /newest. This is a modern form of vote manipulation (although in this case perhaps unintentionally) which seems to be the rage nowadays, especially on a certain other social-voting website. (The content of the submission is good/important regardless, but others should keep in mind that this form of voting manipulation isn't clever)
I'm not sure I understand how this is vote manipulation? Surely they are just replying to the "somebody should post this to HN" comment by showing that they did? Why does an image get illegitimate votes?
Posting an image makes it obvious what to upvote. (as opposed to replying "I posted on Hacker News")
But even then that would technically be vote manipulation because it's drawing attention to an immature submission. (and there is little genuine reason to do so. The only counterexample I can think of is using HN as a comments section)
There were 15 participants in the GitHub issue. Hardly enough people to manipulate anything, so obviously that was not the intention. Even though some 300 people are watching the repo as a whole I doubt that many of them would see that particular comment. As the parent to your comment said, a screenshot is a reasonable way to inform the others in the thread that the link has been posted.
I am against vote-manipulation of course but I don't think that this is an instance of it.
This thread has 71 points in 3 hours. Having 15 people upvote it all at once could get it onto the front page, where it would then keep getting upvoted naturally.
@minimaxir OP here. I don't really understand why / how this is vote manipulation, and I also wouldn't care either. I'm not a regular HN user. Wouldn't it be the same to just reply to the thread by just sharing the thread link? Just thought that sharing the image would generate a nice impact.
> Wouldn't it be the same to just reply to the thread by just sharing the thread link?
That is still against the HN rules, and HN downweights votes made through direct links for that reason. (highlighting a submission without using a direct link is the primary voting manipulation game on the said other site, and there have been some unique innovations)
<3 The one thing I've really learned from all of this is just how many different things can break if you take away root privileges from a process. I'm kinda excited to see how much of cri-o and kubernetes we can make run as an unprivileged user (especially given how Cloud Foundry has already taken advantage of this).
I've been nagging anyone who will listen that this ought to be done. (Easy for me: I don't have to do it. I will of course take credit to the fullest extent the law allows.)
If a rootless container process runs as the root user and can't be switched, is it considered to be "root" as far as the kernel is concerned? As in, does it have access to root-only kernel features (like the root keychain)?
No, the kernel knows what its "real" UID and GID are. It even knows what unmapped UIDs and GIDs are. I haven't tried to access the root keychain inside a rootless container, but if it does work I would consider that to be a vulnerability in the kernel.
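You can see the mapping the kernel uses in `/proc/<pid>/uid_map`, where each line is "inside-start outside-start length". Here's a minimal sketch of translating an in-container UID back to the outer one; the helper names and the sample mapping (container root mapped to host UID 1000) are assumptions for illustration.

```python
from typing import List, Optional, Tuple

# Made-up single-user mapping, as a rootless container would typically have.
SAMPLE_UID_MAP = "         0       1000          1\n"


def parse_uid_map(text: str) -> List[Tuple[int, int, int]]:
    """Parse uid_map lines into (inside_start, outside_start, length)."""
    entries = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, length = (int(field) for field in line.split())
            entries.append((inside, outside, length))
    return entries


def to_outer_uid(uid_map: List[Tuple[int, int, int]], inner_uid: int) -> Optional[int]:
    """Translate a UID seen inside the namespace; None means unmapped."""
    for inside, outside, length in uid_map:
        if inside <= inner_uid < inside + length:
            return outside + (inner_uid - inside)
    return None
```

With the sample mapping, UID 0 inside translates to 1000 outside, and every other inner UID is unmapped.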
I'd compare runC to LXC as opposed to rkt. It just takes a path and configuration and starts a container, without managing images or networking topologies. rkt is on the same level as cri-o (or containerd) since it also implements image handling as well as other related things.
Also, rkt (by default at least) uses systemd-nspawn to create containers. In runC we have our own implementation called libcontainer.
[1]: https://github.com/cloudfoundry/garden-runc-release/blob/dev...