In my experience, multi-tenant Linux container security is still in its infancy due to Docker's unwillingness to build out container security models, the complexity of configuring namespace permissions in the kernel, and the lack of namespace awareness among key driver providers like NVIDIA. The most innovative work I've seen done in this space actually comes from the lxc/lxd group at Canonical, who for many years have been doing the heavy lifting in getting cgroup and namespace functionality working properly and securely - and without any of the glamorous publicity that Docker and CoreOS get. They use AppArmor extensively as an "application firewall for syscalls", which may sound ugly but is the most practical security solution I've seen in this space.
> They use AppArmor extensively as an "application firewall for syscalls", which may sound ugly but is the most practical security solution I've seen in this space.
Seems like a good use case for Capsicum[1]. Too bad Capsicum hasn't been merged into the Linux Kernel yet[2].
Kinda. You could implement sandboxing with SELinux, but it's going to take much more effort. SELinux is a Mandatory Access Control framework and Capsicum is a hybrid Unix capabilities system. While there is overlap in what you can do with them, they really are different sets of tools.
Mandatory access controls are for system administrators to lock down and control access on a machine. Capsicum is for application authors to sandbox their application. Ideally they should be used together.
Back to my point: while MAC (SELinux) can be used for sandboxing, it will take more effort than using something like Capsicum or pledge. With MAC you have to think of all the things you don't want an application to do. With Capsicum or pledge you only have to think of the things you want your application to do, and everything else will be blocked automatically. For example, sandboxing Chromium with Capsicum took 100 LOC, while sandboxing Chromium with SELinux took 200 LOC. That's not bad, but the SELinux version doesn't stop IPC primitives. So half the code and more protection.
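The distinction drawn here, listing what to forbid versus listing what to permit, can be shown with a toy model. This is only a sketch of the two mindsets as the comment frames them, not how SELinux, Capsicum, or pledge are actually implemented (real enforcement happens in the kernel), and every operation name below is made up for illustration.

```python
# Toy model only. The operation names are hypothetical and do not
# correspond to real syscalls, capabilities, or policy rules.

def denylist_allows(operation, denied):
    """Deny-list mindset: permit anything that was not explicitly forbidden."""
    return operation not in denied


def allowlist_allows(operation, allowed):
    """Allow-list mindset: forbid anything that was not explicitly permitted."""
    return operation in allowed


# Policy author listed the things they did NOT want the app to do.
denied = {"open_network_socket", "write_home_dir"}

# Policy author listed the things the app DOES need to do.
allowed = {"read_fd", "write_fd"}

# An operation nobody thought about when either policy was written.
unforeseen = "sysv_shm_attach"

print(denylist_allows(unforeseen, denied))    # True: slips through
print(allowlist_allows(unforeseen, allowed))  # False: blocked by default
```

The unforeseen operation is the interesting case: under the deny-list it goes through because nobody listed it, while under the allow-list it is blocked for the same reason.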
I'm of the impression that SELinux is much more operationally complex than AppArmor--so much so that it lends itself to security holes by way of accidental misconfiguration. This is all hearsay, but I'd like to hear some good discussion about this.
> ... it lends itself to security holes by way of accidental misconfiguration.
I'm not sure I understand how SELinux would introduce additional security issues. It doesn't permit any behavior that wouldn't be allowed if it weren't in use.
> ... I'd like to hear some good discussion about this.
Some people seem to have -- for whatever reasons (previous issues or whatever) -- sworn off SELinux completely which, IMO, is a mistake.
Unfortunately, there are way too many tutorials and how-to guides on the Internet where step 1 is often "turn off SELinux". Instead of taking the time to figure out what the problem is and fixing it, they just take the easy route and get rid of it completely.
I am the opposite. Most of my servers run RHEL/CentOS and I insist that it be enabled on all of them. This is primarily because I've seen, firsthand, SELinux prevent a compromise (well, kind of).
In my career, I've had one machine that I was responsible for (even though it was "shared responsibility") get compromised (that I know of, of course). It was because of a web application (used by Marketing) where a vulnerability had been found and a patch had been released but not yet installed. I was on vacation at the time and, when I returned, just mistakenly assumed that this had been addressed. An exploit existed and was being used "in the wild" and this machine was hit.
The first thing that exploit did was to reach out and download the rest of its toolkit via TFTP. Fortunately, that attempt was prevented by SELinux and logged. Thanks to that log entry, we began investigating, figured out what happened, and promptly updated the machine.
So, yeah, SELinux saved our ass once, many years ago, and so now I make sure that it is enabled everywhere possible. It didn't technically prevent a compromise but it certainly prevented it from being much, much worse. We were able to figure out that nothing else happened after the TFTP transfer failed. The attacker didn't "try again" and we got it patched and back online without any other issues.
SELinux is complex and requires administrators to learn new things, but it's nothing that someone running Linux servers isn't used to -- it's just one more thing you need to learn.
Meh, I don’t consider it bad. It’s in FreeBSD. You want it? Run FreeBSD. Waiting for Linux to implement their “version” of Capsicum, poorly, might be a while. Plus you have other great tech like CloudABI in FreeBSD as well, which makes it even more compelling.
It would be wonderful for FreeBSD as well as Linux if both operating systems supported Capsicum. It would mean that both would have the same sandboxing API, which would make it more likely for application authors to use Capsicum, which in turn increases the number of applications using it. Also, why do you assume Linux's implementation would be poor? There's nothing to indicate that it is. Capsicum is already implemented for Linux and has been for a couple of years; it's just not in Linus's tree[1].
Because almost everything that Linux implements is poorly done. Let’s take the subject at hand: containers. They never created containers. Instead they came up with Docker, and never designed it to be secure. Hence all these efforts to figure out how to secure Docker. They keep trying to retroactively bolt security on, which is never going to secure it. Eventually someone’s going to have to throw it out and redo it with the initial goal of making it secure. It should have been designed with security as the main feature, like Zones or Jails were.
Also, for sandboxing apps you really should look at CloudABI. Ed has put a ton of work into it.
The last time I looked at capsicum for Linux it wasn’t complete. That’s changed?
Also, Notary from Docker can prevent MITM injection in the layers as they are being uploaded and downloaded (a surprisingly easy task actually, I think someone did it without having to break SSL).
Yes! Especially when you compile your kernel with VIMAGE and use ZFS as a filesystem. I should imagine that Illumos zones are pretty awesome too because they have the same features and Illumos is the source of ZFS, but unfortunately both my servers' motherboards were not supported by SmartOS, so besides messing around in a VM, I didn't really get to do much.
What is the state of Docker container security? Is it still the case that root inside of a container trivially equals root outside of the container, or has that been addressed? I thought it had been. I've been finding it hard to find an answer to this.
You'll have a hard time finding people who will say you are secure, but there are no known vulnerabilities allowing breaking out of a default-configuration docker container at this time.
JessFraz has been running https://contained.af/, which is much more locked down than the default container, for ages, and no one has broken out and captured the flag yet.
If you do some legwork, it's not too bad currently.
It has been partially addressed: you can enable user namespacing in Docker, but it is not the default. Additionally, per-tenant user namespacing is not yet possible with Docker (though it is in the kernel).
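One way to see whether root inside a container is really root on the host is to look at `/proc/self/uid_map`, which lists the UID mapping for the current user namespace as `inside outside length` triples. What follows is a minimal sketch assuming a single level of namespace nesting; the helper names and the two sample mappings are illustrative, with the second resembling what a remapped setup (for instance Docker's `userns-remap` daemon option) might produce.

```python
def parse_uid_map(text):
    """Parse uid_map content into (inside_start, outside_start, length) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            entries.append(tuple(int(f) for f in fields))
    return entries


def root_maps_to_host_root(entries):
    """Return True if UID 0 inside the namespace maps to UID 0 outside it."""
    for inside, outside, length in entries:
        if inside <= 0 < inside + length:
            return outside - inside == 0
    return False  # UID 0 is not mapped at all in this namespace


# Illustrative samples, not captured from a real system.
NO_REMAP = "         0          0 4294967295\n"
REMAPPED = "         0     100000      65536\n"

print(root_maps_to_host_root(parse_uid_map(NO_REMAP)))  # True
print(root_maps_to_host_root(parse_uid_map(REMAPPED)))  # False

if __name__ == "__main__":
    try:
        with open("/proc/self/uid_map") as f:
            print(parse_uid_map(f.read()))
    except OSError:
        print("no /proc/self/uid_map here (not Linux?)")
```

Run inside a container, the last part prints the mapping the process actually has. An identity mapping means container root and host root share UID 0, and isolation rests on the other mechanisms (capabilities, seccomp, MAC profiles).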
MySQL Ubuntu packages pull in AppArmor by default - BY DEFAULT - and include a profile that doesn't let you put your data dir anywhere except for one particular location. And it's virtually impossible to tell why MySQL can't write to THAT LOCATION RIGHT THERE where it clearly has the appropriate permissions.
Isn't that one of the main functions of a Mandatory ACL tool like AppArmor/SELinux - not allowing a program to operate outside of the normal confines?
I'd be happy to know it was working as intended, and a short Google search would lead you to the answer. In fact, Digital Ocean's documentation is the first result and it's right there in the steps.
It is the purpose of AppArmor. But in normal land, we just expect documented MySQL options to work.
It really was insane for MySQL package maintainers to make AppArmor the default experience. It ought to have been a separate package, where people that wanted AppArmor could have AppArmor.
The error messages you get are NOT helpful in allowing you to figure out what just went wrong, so if you don't know about it there's a huge cognitive gap because things that SHOULD work just don't for no valid reason.
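For what it's worth, the denial usually does get recorded, just not where the application's error message points you: AppArmor denials typically show up in the kernel log or audit log as `apparmor="DENIED"` records. Below is a rough sketch of pulling the useful fields out of such a line. The sample line only follows the general shape of those records; its path, pid, and timestamp are made up, and the helper names are my own.

```python
import re

# Illustrative line in the general shape of an AppArmor audit record.
# The path, pid, uid values, and timestamp are invented for this example.
SAMPLE = (
    'audit: type=1400 audit(1500000000.000:42): apparmor="DENIED" '
    'operation="open" profile="/usr/sbin/mysqld" '
    'name="/data/mysql/ibdata1" pid=1234 comm="mysqld" '
    'requested_mask="rw" denied_mask="rw" fsuid=111 ouid=111'
)

# key="quoted value" or key=bare_value
FIELD = re.compile(r'(\w+)=("([^"]*)"|\S+)')


def parse_denial(line):
    """Return the key/value fields of an AppArmor denial line, else None."""
    if 'apparmor="DENIED"' not in line:
        return None
    fields = {}
    for key, raw, quoted in FIELD.findall(line):
        fields[key] = quoted if raw.startswith('"') else raw
    return fields


def summarize(fields):
    """Turn the parsed fields into a one-line human-readable summary."""
    return "{} (profile {}) was denied '{}' on {}".format(
        fields.get("comm", "?"),
        fields.get("profile", "?"),
        fields.get("denied_mask", "?"),
        fields.get("name", "?"),
    )


print(summarize(parse_denial(SAMPLE)))
# mysqld (profile /usr/sbin/mysqld) was denied 'rw' on /data/mysql/ibdata1
```

This doesn't fix the real complaint, which is that the application itself only sees a generic permission error, but it does make the "why can't it write THERE" question answerable once you know to check the log.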
Denials by AppArmor (or any other ACL addition) really should have a more verbose diagnostic message AND a dedicated ERRNO that is different from the standard ones.
This can also cause a bunch of fun errors (on installation no less) when you have custom PAM plugins. At least last time I had to deal with that (might have been as long ago as 12.04).
I agree with much of what you said, but I think the word “infancy” is pretty strong. Containers are actually older than VMs (going back to the old timeshare days of the 60s, but also in Linux right after 2001). Multi-tenant containers are relatively new, but I think this NIST guideline is pretty thorough.
I like how there is an implied influence (e.g. dockerism) about containers being immutable/ephemeral... meanwhile I've been incredibly happy moving most of my VM deployments over to nspawn because many standards of systems management were anti-patterns in dockerland.
Oh hey, another nspawn user. I've had really good luck using it for isolation purposes as well, and it's really handy for testing clean builds (arch-nspawn comes to mind, but it's also really useful for simulating installs without using a full VM).
I'm curious what extra tools, if any, you've used for nspawn (automation or otherwise). In my experience, it's been pretty easy to manage just with the default tools but it seems images can be a bit fiddly if you deviate too far from the host OS or anything without systemd--although I can't imagine that'd be useful outside experimentation or testing.
nspawn looks cool. Seems like a copy of LXC. I agree with you; containers don't have to be ephemeral. Using them like lightweight, snapshottable VMs on a bridged network is good for quick and dirty hosting.