> Every Google application, even search, runs on Kubernetes-managed containers.
I don't think this is true at all. They run on Borg, which is Kubernetes' spiritual predecessor, but a distinctly different, and much more mature, piece of software.
Google is mainly using Borg, but will gladly talk about the few internal services that they do run on top of Kubernetes, in addition to everything that runs on GKE (their hosted Kubernetes solution).
I like them when I use them the same way I use BSD jails: providing a very thin layer of high-level resource isolation. And that's it, because that's all they are actually good at. That's not, however, the trend in how they're used, and given the overwhelming, please-fund-us hype of the container movement, I'm not surprised you're unimpressed. Containers emphatically do not replace virtualization (which, while it certainly has some drawbacks, at least works as a developer will expect it to), and yet the tooling all desperately wants you to believe they do, because that's how you'll buy into their ~platform~ and their ~vision~ and be so very monetized on their behalf.
(Five point releases after promising this wouldn't be the case anymore, Docker still doesn't use user namespaces, so I really hope your Dockerfiles aren't using root! Have fun.)
The lack of user namespaces in Docker is just crazy. People talk about them being insecure, but what is more secure about running all of your containers as the root user? Docker's security story is consistently lacking, it's like they actively try to do the worst thing possible.
It's not the worst thing in the world, because a competent administrator/infrastructure engineer who's comfortable with the underlying tools can account for this and mitigate it. That said, Docker markets consciously and directly to people who are not competent, with an undercurrent of "you don't need competent people to use this!". And, as in most computer-y cases, it's not the fault of the incompetent and the ignorant that they trust the providers of tools; implicit in actually providing them is that they are generally safe with the defaults that are given. Which in Docker's case they're emphatically not. That hurts people, and very little gets me saltier than hurting people unnecessarily.
(I haven't ruled out that it might be a lack of a culture of security, too, on the Docker team's part; there was that whole "oh, yeah, put desktop applications in Docker containers, never mind that now you're running Chrome as root and putting the unprivileged X socket in the container, letting it pump messages to any other application" thing from a Docker core contributor. Maybe they just don't know, themselves? Woof.)
Docker containers do not need to run as root and the documentation does recommend running containers as non-root users. Yet, unless something is default, it's hard to make users follow best practices. For that reason, user namespaces are important. Now, while they're not in a stable release yet, user namespaces have landed in Docker's master branch. Don't expect to wait long for this feature.
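For what it's worth, the non-root recommendation boils down to a couple of lines in a Dockerfile. A minimal sketch (the base image and the "app" user name are just illustrative choices):

```dockerfile
FROM debian:jessie

# Create an unprivileged user; "app" is a made-up name for illustration.
RUN useradd --create-home --shell /bin/false app

# Everything after this line, including the container's main process,
# runs as "app" instead of root. Note that without user namespaces this
# is still a real uid on the host, just not uid 0.
USER app

CMD ["id", "-u"]
```

The catch, as discussed above, is that nothing forces image authors to do this, so root remains the effective default.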
If you feel there are further actionable steps to move best practices into default behaviors in Docker today, or to improve security, please feel free to submit a PR or open a thread on the mailing list. (If it's a vulnerability, however, email security@docker.com.)
> Docker containers do not need to run as root and the documentation does recommend running containers as non-root users. Yet, unless something is default, it's hard to make users follow best practices. For that reason, user namespaces are important.
Definitely, but, in my experience, it gets worse than that, which is why this is really important and not just important. Most containers are built on top of full OS images, with a full OS image's worth of vulnerabilities (some inert because services aren't running, but setuid is still respected--and I wonder how many setuid'd tools are actively tested in a containerized environment!), and it's rare in my experience that those base images actually get updated after the end user selects one and a version to go along with it. Which means that even a properly-built container that runs its internals as a non-root user may--and a pessimist might say "probably does"--have user-escalation bugs squirreled away somewhere that the user not only doesn't know about, but isn't capable of auditing, or isn't even aware might exist.
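If you want to see this for yourself, a quick way to gauge that setuid surface is to list the setuid/setgid binaries inside the image in question (GNU find syntax; run it inside the container):

```shell
# List setuid (4000) and setgid (2000) files on the root filesystem.
# Each one is code that escalates privileges by design, whether or not
# any service in the image is actually running.
find / -xdev -type f -perm /6000 2>/dev/null
```

On a typical full-OS base image this turns up a few dozen binaries (su, passwd, mount helpers, and so on), each one a potential escalation path if it goes unpatched.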
This, not "users don't read docs", is why I am so very, very salty about this in Docker, and while I'm glad that now Docker users (and, full disclosure, I'm not one anymore in part because of the poor security story--monolithic do-everything daemon running as root, no user namespaces--and in part just not wanting to be the stooge for somebody's platform play) don't have to "expect to wait long", this was promised in something like Docker 1.4, a year ago.
Have you considered using rkt as an alternative to Docker? It tries to avoid a lot of the security-related problems Docker has. Specifically, there's no daemon which runs as root; instead rkt is invoked directly, so the only time rkt requires root is the actual execution of a container, not downloading or verifying, for example. Next, rkt by default won't run unsigned or untrusted images, meaning by default you must trust the image/author. And finally, rkt has the ability to run your container in a lightweight VM using lkvm, getting you all the benefits of VM-level isolation while letting you use the same container tooling for it all, and decreasing the overhead by optimizing for the container use case.
Recently rkt also got support for logging different events into the TPM audit log, making it possible to have tamper-proof audit trails of what containers have run on your system. User namespaces are also implemented, but I'm not sure how well tested they are.
Solving the problem of using and creating large full-OS images is quite difficult. As a stepping stone, we have also created a tool which scans container images on our Quay.io registry looking for images affected by CVEs. This should hopefully help until we can properly solve creating functional minimal containers easily, but that's unfortunately not quite as easy as just telling people to use buildroot (you still need a way to update those images and know when apps need security updates).
> Have you considered using rkt as an alternative to docker?
I have, and I respect it a lot more--and CoreOS in general has struck me for quite some time as being the adults in the room in this space, I think the value prop remains very shaky but I appreciate that CoreOS seems to give a damn--but I don't have a lot of use for rkt in general. The only place I have any multi-tenancy, I'm using FreeBSD and jails. (I'd like to use Illumos or OpenIndiana, as I better understand that stack than I do FreeBSD, but cperciva has done a great job with FreeBSD AMIs on AWS and I don't have the bandwidth to maintain an OpenIndiana AMI.)
> Most containers are built on top of full OS images, with a full OS image's worth of vulnerabilities
This is one of the problems DockerSlim (http://dockersl.im) is trying to address. You take those containers built on full OS images and remove everything your app is not using, reducing the attack surface.
Don't you need root access to even start the container? I think there is a disconnect here. There is a difference between running a container as root and the process within the container running as root.
Sure, you don't need to have the process running as root in the container, but you need to have root-equivalent access to start the container. For a (large) subset of use-cases, this just isn't an option.
I should be able to allow any given user of a system the ability to start a docker container and be confident that they won't be able to break the host system. Until this is the case, you don't have security in Docker.
The key is user namespaces. Unprivileged users can create containers easily via user namespaces. Once the user creates a user namespace, they have root in that namespace and are free to unshare the rest of the namespaces. This is how I wrote 'guix environment --container'[0], a tool for creating isolated development environments using the GNU Guix package manager. The big caveat is that unprivileged users do not have the setuid/setgid capability, so the number of uids/gids in the container is limited to 1, but I believe that even this is being dealt with in Linux now.
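Assuming a kernel with unprivileged user namespaces enabled, the core trick is visible with nothing but util-linux's unshare, no daemon and no sudo:

```shell
# As an ordinary user: create a new user namespace and map our uid to
# root inside it. This "root" only has power over the namespace's own
# resources, not the host's.
unshare --user --map-root-user id -u
# prints 0, i.e. uid 0 inside the namespace
```

That mapping of exactly one uid is the single-uid caveat mentioned above: without setuid helpers (like newuidmap) you can map your own uid to one uid in the namespace, and nothing more.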
You also need to be root to manage your init process and in production, this is how Docker is treated. That said, I've seen work being done to add RBAC, given there are a set of users that desire this.
Docker 1.9 supports all other namespaces, cgroups, pivot_root, cap drop, selinux, apparmor, uid/gid drop.
You can also sign and verify all images with a built-in Notary/TUF implementation, and we partnered with Yubico to support hardware signing out of the box. I'm hoping we can make image signing the default in the near future, and make it mandatory within the year.
At this point I'm comfortable saying Docker's security story is strong (although not perfect of course). But if you have specific suggestions for improvements we are interested!
EDIT: I got it wrong, userns is still experimental and will land in 1.10.
Going by discussions like https://bugs.archlinux.org/task/36969 (and the RHEL ticket it links in turn), it seems USER_NS in Linux isn't quite ready for production use anyway.
But yeah, I guess we should totally fund container startups, because… something.
(They seem good enough for local development, at least.)
There are definitely vulnerabilities being discovered in user namespaces, but they're a significant layer of defense as-is that will only improve.
To be honest, I don't think containers are even particularly good for local development. I'd rather run something that acts like my production environment, which means a virtualized system that's version- and patch-equivalent to my prod environment. I use Vagrant and scaled-down virtual machines for that.
I love containers, but I'm also using jails on FreeBSD. With jails I can put a different application in each jail and give each jail its own network stack, firewall and ZFS filesystem. That way I can isolate the applications from each other and give quotas to each container. I can turn on filesystem encryption for my database container, and use ZFS snapshot and ZFS send to back up my database container. If you need to scale up your application, you can snapshot the different application containers and ZFS send those containers to different servers. To update applications, you update your local copy of the container, then snapshot and send it to the production server; if something goes wrong with the update, just ZFS rollback the container.
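The update-and-rollback workflow described here looks roughly like this (a sketch; `tank/jails/db` is a hypothetical dataset name and `prod` a hypothetical host):

```shell
# Snapshot the database jail's dataset before touching anything.
zfs snapshot tank/jails/db@pre-update

# Replicate the jail to the production box. Once a common snapshot
# exists on both sides, incremental sends (-i) keep transfers small.
zfs send tank/jails/db@pre-update | ssh prod zfs receive -F tank/jails/db

# If the update goes sideways, roll the dataset back atomically.
zfs rollback tank/jails/db@pre-update
```

Since snapshots are copy-on-write, the pre-update snapshot costs almost nothing until the datasets diverge.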
Then, if you throw NanoBSD into the mix, you can create server images that are read-only except for the application containers. You end up with a single server image for your application, already set up, that you can just boot from or upload to some cloud service.
And now that FreeBSD has a 64-bit Linux emulator and a Docker port, everything just gets better.
I went to an event on new cloud computing technology in San Francisco. Half the presentations on containers were given by VCs who had some big stake in either CoreOS, Mesosphere, or Docker. Seems like a lot of the momentum behind containers is driven by the Silicon Valley investment community.
Containers have a rather narrow use case: efficient, decoupled deployment of Linux-hosted applications, particularly as part of a CI/CD pipeline. I would not expect most in "IT" to care too much about it, it's more of a "DevOps"/developer/SRE thing.
Containers are far from being only for Linux applications; FreeBSD and illumos both have containers, they're just called jails and zones. Also, FreeBSD and illumos sysadmins use containers for application isolation because both jails and zones were designed to be secure.
There are configuration and package management systems (e.g. Salt or Ansible, and apt or pip). Containers provide that plus isolation, but I bet in the majority of cases no one cares if two services are walled off from each other or not - as long as they don't interfere. Usually, they rarely do - and when it happens (say, an ancient blob wants an older libc), in most cases a mere chroot does the trick, without the need for cgroups/namespaces/extras.
There are lots of companies doing a lot more with containers than just using them in a pipeline. Google, for instance, runs all of its apps (including search) in containers (using Google's own container framework, LMCTFY), managed by Kubernetes.
I love containers, as I can finally give up OS X because Docker makes developing with virtual machines so painless. I can now once again buy Windows hardware and just do my dev using a project's Docker container.
Wait until one of those dev containers tries to create a symlink in the folder docker-machine shares with your host. (Hint: it fails, hard, in an unfixable way.)
Not sure what you're on about. I just symlinked a file on the container inside a folder that was mapped to the host and nothing bad happened. Of course the host couldn't see the content of the file, but I was able to remove the file with no errors.
If I recall correctly this is a problem with virtualbox and not docker and it's only if the symlink is made in a specific way, like how npm makes symlinks.
Another abstraction layer on top of another abstraction layer. Some day I hope someone will clean up this mess. But right now, most of those XonXen/hypervisor or OSv/unikernel approaches aren't that promising.
LXC has supported user namespaces since 2013 and it works well enough. You could run unprivileged containers from Ubuntu 14.04. I think Nspawn supports it now too.
This is different from running Docker from a user account. With Docker you are interfacing with the Docker daemon from a non-root account, but the container is still running as root. With unprivileged containers, thanks to user namespaces, containers are launched and run from the user account.
But unprivileged containers need to be a simple, no-fuss experience, and this can only get addressed in the kernel. On top of that, the most popular userland container implementation, Docker, chooses to run containers without an init. Since most apps you want to run in a container are not designed to work in an init-less environment and will require daemons, services, logging, cron and, when run beyond a single host, ssh and agents, just managing the basic process of running apps and their state adds tons of additional complexity for users. Integrating user namespaces on top of this is going to be non-trivial.
Contrast that with LXC containers, which have a normal init and can manage multiple processes, enabling your VM workloads to move seamlessly to containers without any extra engineering. Any orchestration you already use will work, obviating the need for reinvention. That's a huge win, but if you listen to the current container narrative and the folks pushing a monoculture and container standards, it would appear there are no alternatives and that running init-less containers is the only 'proper' way to use containers, never mind the complexity.
A lot of problems related to unprivileged containers are in kernel namespaces and cgroups which are not going to be solved in user land. Cgroups are not namespace aware and only root users can manage them. Access to resources like mounts and networking require privileges and that won’t change. These are probably not 'sexy or fundable’ so the problems remain unsolved, and instead complexity deriving from niche use cases whether security or micro services is foisted on everyone.
A container is just a Linux process in its own namespace that anyone can create with unshare, ip link and chroot or pivot_root. An 'immutable container' is nothing but launching a copy of a container enabled by overlay filesystems like aufs or overlayfs; a 'stateless' container is a bind mount to the host. Using words like stateless, immutable or idempotent just obscures simple underlying technologies and prevents wider understanding of core Linux technologies that need to be highlighted and supported. But we choose to focus on wrappers, and the narrative becomes about funding and standards. Without more focus on and support for the filesystems, namespaces and other critical enabling technologies, end users will not get a consistent experience. How much support do they have beyond the occasional article on LWN? These devs and projects work in obscurity with little support. This does not seem to be a sustainable development model.
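That "just a process in its own namespace" point is easy to demonstrate with stock util-linux, assuming unprivileged user namespaces are enabled:

```shell
# A new user namespace (so no root is needed) plus a new PID namespace.
# --fork is required because the calling process can't enter the new
# PID namespace itself; only its children can.
unshare --user --map-root-user --pid --fork sh -c 'echo "I am pid $$"'
# prints "I am pid 1": the shell is init of its own tiny world
```

Everything else -- the image format, the overlay mount, the bind mounts -- is layered on top of primitives this small.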
That would be great. It's surprising to me that every Docker startup is focused on running large clusters, instead of a few focusing on running very few images on small embedded devices. I could see a really nice continuous integration pipeline for the devices, and it would speed up deployment for new users wanting to explore IoT.
Trusted computing is good. But TNO (Trust No One) is the only way to really be safe. Think of the Dell cert fiasco and how that sort of thing would break this.
In a trusted environment, I have found managing software updates, and verifying that the updates are safe, challenging. I'll be curious to know how regular Linux kernel/OS updates and runtime updates are managed by CoreOS. Will there be someone verifying that the OS and runtime updates are safe and not compromised?
This is definitely a step in the right direction in the adoption of trusted computing.
This is a split responsibility between the OS vendor and the customer, just as with a trusted environment that's not backed by crypto/TPM.
The normal benefits of frequently releasing code are at play here, just at the OS/kernel level instead of a webapp. Testing can be completed against the different channels of CoreOS in staging environments as well. It's recommended to run some beta machines mixed into a fleet of stable machines to catch any issues specific to your environment.
A unique feature of CoreOS is that it ships an upstream kernel that doesn't have tons of backports and bugfixes. This means the upstream testing/performance infrastructure is leveraged for more visibility into the release.