Kata Containers – The speed of containers, the security of VMs

jeremyjh · on Dec 7, 2017

They don't seem to have written any code yet. [1] So what we have at this point is a marketing website about their ambition and goals?

[1]https://github.com/kata-containers/runtimes

tecleandor · on Dec 7, 2017

It is composed of many parts that have already seen development, such as:

Kata Agent: https://github.com/kata-containers/agent

Kata Shim: https://github.com/kata-containers/shim

Kata Proxy: https://github.com/kata-containers/proxy

KSM Throttler: https://github.com/kata-containers/ksm-throttler

And some forks to cover their own needs, I suppose, such as:

Linux Kernel: https://github.com/kata-containers/linux

QEMU: https://github.com/kata-containers/qemu

paulfurtado · on Dec 7, 2017

The code comes from Intel's Clear Containers and hyper. The interesting bit is that the tech is now part of the openstack foundation, under the name Kata Containers. At Kubecon yesterday, they did a demo, showing a fork bomb taking out a container, but not the host. It actually seems nearly ready to use.

redtuesday · on Dec 7, 2017

Can't you just combat fork bombs with, e.g.:

  docker run --pids-limit=64

paulfurtado · on Dec 8, 2017

Yes, there are several ways to combat fork bombs (ulimits or pid namespaces). This was purely for the sake of the live demo, which needed a kernel crash example; there are certainly other ways to combat it.
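
For the ulimit route mentioned here, a minimal Python sketch, assuming a Linux host. The helper name `run_with_proc_limit` is made up for illustration. It caps RLIMIT_NPROC in the child before exec, which is roughly what `ulimit -u` does in a shell. Keep in mind that this limit is counted per user rather than per container, and privileged processes are typically exempt, so this is an illustration and not a hardening recipe.

```python
import resource
import subprocess


def run_with_proc_limit(cmd, max_procs):
    """Run cmd with RLIMIT_NPROC capped at max_procs (hypothetical helper)."""

    def limit():
        # Runs in the child between fork() and exec().
        _, hard = resource.getrlimit(resource.RLIMIT_NPROC)
        # Never try to raise the limit above the existing hard limit.
        if hard == resource.RLIM_INFINITY:
            soft = max_procs
        else:
            soft = min(max_procs, hard)
        resource.setrlimit(resource.RLIMIT_NPROC, (soft, soft))

    return subprocess.run(cmd, preexec_fn=limit, capture_output=True, text=True)
```

Something like `run_with_proc_limit(["./untrusted-job"], 64)` (the job name is a placeholder) would then start the job with the cap in place. The `--pids-limit` flag above takes a different route and counts processes per container.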

mugsie · on Dec 7, 2017

There does seem to be a fair bit left to write (or clean up and open source from Hyper.sh and Intel)

e.g. the entire testing repo is empty

I would suggest we are looking at a conference-driven development release.

aleeland · on Dec 7, 2017

I am a product manager for Intel's Clear Containers and am also working in this community with Kata. We are still under development and merging Intel Clear Containers (CC) and Hyper.sh runV. Our 1.0 release is scheduled for March timeframe, at which point we plan to have a migration path for customers using runv or CC. We launched this week so that we can build our community and continue to merge the code in the open!

reacharavindh · on Dec 7, 2017

Impressive backing by the big name companies.

The idea of treating containers as secure and isolated as VMs is enticing for non-ephemeral services. Are these strictly tuned to exploit Intel hardware features, or would they consider supporting the equivalent features in, say, AMD?

On the other hand, isn't this the realm of mainline distributions like RHEL, Debian and the like, to support such isolation facilities? I always thought Clear Linux was an Intel playground for proofs of concept which would eventually be upstreamed to major Linux distributions. Is that not true?

I guess my question is why a separate project like this, instead of RedHat Enterprise Containers or Debian containers?

perlgeek · on Dec 7, 2017

> On the other hand, isn't this the realm of mainline distributions like RHEL, Debian and the like?

At least Debian doesn't develop isolation solutions on its own; it tends to package software that's already out there. And if it's popular enough, it might be integrated fairly tightly into the distribution.

odiroot · on Dec 7, 2017

Interesting that nearly half of the backers are Chinese companies.

supermatt · on Dec 7, 2017

Not when you consider that half of all companies are Chinese companies.

For comparison: USA: ~30m China: ~80m

hacknat · on Dec 7, 2017

At the Kubecon demo yesterday they cracked a joke about promising to support more architectures. They seem to sincerely want to, which is why they donated the project to OpenStack. Honestly it seems really cool, the spin up time is damn impressive.

0xbadcafebee · on Dec 7, 2017

> Impressive backing by the big name companies.

From someone who works at big name companies: this should not impress anyone. Big companies love slapping their name on things that give them "innovation" credibility. It's like Pepsi sponsoring the X Games.

> why a separate project like this

This is an OpenStack project, so it's not vendor-specific. It's also supposed to be a new "standard container", which I highly doubt will happen because they're just slapping together two other projects.

FooBarWidget · on Dec 7, 2017

> The idea of treating containers as secure and isolated as VMs is enticing for non-ephemeral services

Are you saying that security and isolation is not enticing for ephemeral services? I know that an ephemeral container is reset after a restart but I think that it's a bit naive to think that that is a good enough replacement for true isolation.

reacharavindh · on Dec 7, 2017

> Are you saying that security and isolation is not enticing for ephemeral services?

Didn't mean to imply the reverse logic of my statement. I believe Linux Containers (and hence Docker) depend only on Kernel namespaces to provide isolation. In my admittedly naive eyes, they were not good enough/mature to replace my KVM VMs yet. Too much to trade off for little convenience/performance.

However, if Linux containers matured and offered the same isolation facilities that something like KVM does, then I could think about switching to them in the future and enjoy the performance boost.

>I know that an ephemeral container is reset after a restart but I think that it's a bit naive to think that that is a good enough replacement for true isolation.

If I'm looking to run an application for which I care about solid isolation of resources, I'd spend my time running it as VM. But, if I'm running a one-time script that chews some data and I don't care much about it bothering other workloads in the system or other workloads bothering it, then I'd fall back on the isolation facilities offered by namespaces by using Containers. Nothing wrong with that.

The security view on these is another argument. If I can't afford the application escalating its view and looking into other workloads on that system, I just wouldn't run it in containers today.
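
To make the namespace part concrete, a small Python sketch, assuming Linux with /proc mounted. The function name is made up. It lists which namespaces the current process lives in; a containerized process would usually show different identifiers than the host for things like pid, mnt and net, while still calling into the same kernel.

```python
import os

NS_DIR = "/proc/self/ns"


def current_namespaces(ns_dir=NS_DIR):
    """Map namespace name -> identifier such as 'pid:[4026531836]'."""
    if not os.path.isdir(ns_dir):
        return {}  # not Linux, or /proc is not mounted
    result = {}
    for name in sorted(os.listdir(ns_dir)):
        try:
            # Each entry is a symlink whose target names the namespace instance.
            result[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            continue  # skip entries we are not allowed to read
    return result


if __name__ == "__main__":
    for name, ident in current_namespaces().items():
        print(f"{name:20s} {ident}")
```

Two processes that print the same identifier for a given name share that namespace.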

devonkim · on Dec 7, 2017

Sounds like you are looking for something closer to LXD or perhaps Rkt.

reacharavindh · on Dec 7, 2017

But, AFAIK, LXD and rkt are similar to Docker as container runtimes. They all share the host kernel, and if one container is hosed/tainted, your host kernel becomes the attack vector. If I read correctly, hypercontainer/Kata Containers lets you bring your own kernel for your containers and isolates it from the host using Intel hardware features (the same ones that KVM leverages). That's where it gets interesting to me.

bonzini · on Dec 7, 2017

Kata Containers uses KVM; QEMU, which is the userspace KVM client, is configured so that it looks like you are running on a container.

However, what you get is indeed a virtual machine. It is simply impossible for "real" containers to provide the same isolation as a virtual machine, because the attack surface is that of the shared kernel; a hypervisor presents a much more constrained interface to a VM than the full kernel, even if you add QEMU to the mix.
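
Since this all rests on KVM, a rough Python sketch of checking whether a Linux host could run it at all. The function names are made up. It looks for the `vmx` (Intel VT-x) or `svm` (AMD-V) CPU flags in /proc/cpuinfo text and checks that /dev/kvm is accessible. It assumes an x86 host; other architectures report their features differently.

```python
import os


def hw_virt_flags(cpuinfo_text):
    """Return the hardware virtualization flags found in /proc/cpuinfo text."""
    found = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            flags = line.split(":", 1)[1].split()
            # vmx = Intel VT-x, svm = AMD-V
            found.update(f for f in flags if f in ("vmx", "svm"))
    return found


def kvm_usable(dev="/dev/kvm"):
    """True if the KVM device node exists and is readable and writable."""
    return os.path.exists(dev) and os.access(dev, os.R_OK | os.W_OK)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as fh:
            print("CPU flags:", hw_virt_flags(fh.read()) or "none found")
    except OSError:
        print("CPU flags: /proc/cpuinfo not readable")
    print("/dev/kvm usable:", kvm_usable())
```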

pstuart · on Dec 7, 2017

Silly question: if the KVM is using para-virtualized drivers and there is a vulnerability in same, then the host kernel would still be vulnerable?

bonzini · on Dec 7, 2017

Many of KVM's paravirtualized drivers run in QEMU, and in turn that is usually running heavily confined, for example using SELinux.

So it's true that, as in the famous Theo de Raadt rant, virtualization overall adds to the attack surface compared to containers. But it would also be stupid to ignore that it also introduces very important bottlenecks: to get to the hypervisor, you have to break KVM which is only ~60000 lines of code; getting remote code execution in QEMU might be easier, but then you also have to break the host kernel from a process that has access to almost nothing on the system.

There is also vhost, which is implementing PV drivers in the host kernel. This however is also a small amount of code, and it is generally used only for networking and AF_VSOCK sockets.

mugsie · on Dec 7, 2017

Separate projects like this are how a lot of these "RHEL Enterprise $FOO" products are actually made.

RedHat / Suse / Ubuntu / $Vendor take the upstream project, tidy it a bit, package it, get it integrated in their ecosystem, and add an easy installer.

Having it in a vendor-neutral foundation means that all the vendors can collaborate, and no one group has a massive advantage or complete control over the roadmap.

bonzini · on Dec 7, 2017

There are hundreds of engineers working on RHEL (disclaimer, that includes me), so it's not as simple as you put it...

mugsie · on Dec 8, 2017

No, it is not - I oversimplified the process a lot. (/me used to work at a vendor, and at a different large Linux distro :) )

I didn't mean to undermine the work that goes into turning a project like this, OpenStack, Kubernetes, Cloud Foundry etc into a real product that users can download and install on random hardware in random configurations, and get a working system - it is a ton of work, and is massively important for getting actual users to install what are very complex distributed systems.

perlgeek · on Dec 7, 2017

One thing that isn't mentioned, on the front page at least, is the management aspect.

Docker became popular because it was pretty easy to use, and to publish and reuse existing containers. Whatever competes with it only stands a chance if it can either reuse the existing container ecosystem, or offer something roughly as good.

paulfurtado · on Dec 7, 2017

Sat through the talk at KubeCon yesterday - an important goal of theirs is to not compete with Docker. They said it was compatible with Docker, containerd, and cri-o. I believe with Docker, it sits at the runc level, so to the end user, you're using Docker in the standard fashion, but the underlying isolation mechanism is different. They also said it can be chosen per container, so different containers on the same host can use different isolation mechanisms.
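
A sketch of what "chosen per container" could look like from the Docker side, in Python. Docker lets you register extra OCI runtimes under the `runtimes` key in daemon.json and pick one with `docker run --runtime`. The runtime name "kata" and the binary path below are placeholders I made up, not confirmed install locations, and the helper names are hypothetical.

```python
import json


def daemon_config(runtime_name, runtime_path):
    """Build a Docker daemon.json fragment registering an extra OCI runtime."""
    return {"runtimes": {runtime_name: {"path": runtime_path}}}


def docker_run_args(image, runtime_name=None):
    """Build a `docker run` argv, optionally selecting a non-default runtime."""
    args = ["docker", "run", "--rm"]
    if runtime_name is not None:
        # --runtime picks the OCI runtime for this one container only.
        args.append(f"--runtime={runtime_name}")
    args.append(image)
    return args


if __name__ == "__main__":
    # "kata" and the path are assumptions for illustration.
    print(json.dumps(daemon_config("kata", "/usr/bin/kata-runtime"), indent=2))
    print(" ".join(docker_run_args("busybox", "kata")))  # VM-backed runtime
    print(" ".join(docker_run_args("busybox")))          # default runtime
```

The point is only that the image and the CLI stay the same; the runtime underneath is what changes.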

perlgeek · on Dec 7, 2017

That makes a lot of sense, and probably the road that makes adoption easiest. Thanks!

kuschku · on Dec 7, 2017

It doesn’t have to replace Docker – just becoming a better container engine as backend for Kubernetes will be very useful.

jchw · on Dec 7, 2017

If it's using Hyper runV, my guess is that their intent is to be compatible with OCI and Docker.

mnd999 · on Dec 7, 2017

The British Indian Ocean territory really is becoming a tech hub.

dmytrish · on Dec 7, 2017

So there is no country called "Input/Output", what a bummer.

oblio · on Dec 7, 2017

I'm not sure I get the connection. Also: https://en.m.wikipedia.org/wiki/Depopulation_of_Chagossians_...

CapacitorSet · on Dec 7, 2017

>I'm not sure I get the connection.

.io is the TLD for the British Indian Ocean Territory, technically speaking.

techiferous · on Dec 7, 2017

That place has some sad history: https://en.wikipedia.org/wiki/Depopulation_of_Chagossians_fr...

e_d_e_v · on Dec 7, 2017

How is this better than using rkt with an lkvm stage1[1], which also uses the work done by the Clear Containers team? It looks like Kata packages QEMU as well, which seems a bit overkill.

[1]https://coreos.com/rkt/docs/latest/running-kvm-stage1.html

bonzini · on Dec 7, 2017

> a bit overkill

They also said the same about Xen, that a special purpose microkernel was a better choice than Linux as a hypervisor...

e_d_e_v · on Dec 8, 2017

Right, in many cases, small is beautiful! I think that's what contributed so heavily to the massive success of the Xen platform. Is that what you mean?

bonzini · on Dec 8, 2017

Then why does Kata Containers use KVM?

Xen was successful because it was innovative, and because it worked around the fact that x86 was not virtualizable at the time. But after ten years of healthy competition, the only reason to prefer Xen to KVM would be things like QubesOS.

chungy · on Dec 7, 2017

It's kind of interesting that it's only in the Linux world that containers cannot be thought of as isolated or secure. Seeing it from a jails and zones perspective, rather sad, actually :)

jchw · on Dec 7, 2017

FreeBSD jails are known to not be silver bullets. I've heard many instances of breaking out of a FreeBSD jail.

Generally, treating any OS-level technology as a silver bullet is a huge mistake. Any serious developer would make multiple levels of security that _should_ be sound.

GalacticDomin8r · on Dec 7, 2017

That's quite true. Any serious FreeBSD user will readily acknowledge as much (e.g. https://www.freebsd.org/doc/handbook/jails.html), but the project does try to default to sensible security settings for its containers, e.g. no raw sockets.

While not applicable to FreeBSD alone, this polemic thread:

https://marc.info/?l=openbsd-misc&m=119318909016582

is a pretty accurate description of container level security and not much has changed. Stuff built on a foundation is always subject to the foundation's qualities.

X86BSD · on Dec 7, 2017

This is the most blatant and clearly incorrect... FUD?..lie?... I have ever heard to date about jails.

Jails are secure. As are SmartOS zones. Whoever you heard that there are “many instances of breaking out of a jail” from is full of sh47. And you would be wise to never listen to them ever again. No really, EVER.

And no, breaking the ps4 was not a jail exploit. The attacker already had elevated privileges. So you would be sunk no matter what.

jchw · on Dec 7, 2017

Sheesh, no need to get so emotional about it. I said instances of breaking out, not instances of jail exploits. I don't know of any jail-specific exploits.

But when we say "elevated privileges", are we talking root inside of a jail? Because if that breaks jails, then a large class of Docker exploits also wouldn't classify as 'exploits' under those criteria. One of the biggest problems with Linux namespaces is the band-aid put over root, via capabilities.

As far as I know, though, the PS4 exploit was more Sony's fault. IIRC, they broke out of the jail by exploiting custom syscalls not in stock FreeBSD. Bugs in syscalls in FreeBSD aren't unheard of though, even if less commonly found than Linux.

My entire point is that good security implies not treating any solution as a panacea, lest you find yourself in a digital Titanic scenario. Multiple layers of solid security beats one layer of solid security.

chatmasta · on Dec 7, 2017

> Bugs in syscalls in FreeBSD aren't unheard of though, even if less commonly found than Linux.

Dangerous assumption.

More likely, there are fewer people looking for vulnerabilities in BSD than in Linux.

jchw · on Dec 7, 2017

Well, I did say

>less commonly found

rather than less common. Impossible to know with 100% certainty what's literally less common.

If I had to guess, I'd guess FreeBSD has fewer bugs in general, just because the surface is generally smaller and the system is more homogeneous.

benmmurphy · on Dec 7, 2017

I believe there was an exploit by another team which used badiret, which is pretty hilarious because badiret had been patched ages ago but FreeBSD never told anyone they fixed it.

GalacticDomin8r · on Dec 7, 2017

You mean like this?

https://www.freebsd.org/security/advisories/FreeBSD-SA-15:21...

benmmurphy · on Dec 7, 2017

yeah it was fixed in 2014 and there wasn't an advisory until 2015. https://reviews.freebsd.org/rS275833

hn discussion: https://news.ycombinator.com/item?id=10093862

benmmurphy · on Dec 7, 2017

yeah there are probably not many 'jail' exploits specifically targeted for getting out of jail/exploiting jail primitives. but people just use normal kernel exploits to get out of jail/zones. i would say jails/zones are about as secure as linux containers. ie: about as secure as the linux kernel is.

X86BSD · on Dec 7, 2017

And you would be wrong.

helper · on Dec 7, 2017

The person you are replying to has discovered multiple exploitable bugs in Illumos via DTrace from inside zones:

Here are the first two that pop up if you google his name. http://www.zerodayinitiative.com/advisories/ZDI-16-168/ http://www.zerodayinitiative.com/advisories/ZDI-16-274/

He gave a talk at DTrace conf 2016 about all the security vulnerabilities he personally found in DTrace in SmartOS. Here are the slides: http://slides.com/benmurphy/deck

oblio · on Dec 7, 2017

Does Windows have anything like this?

kuschku · on Dec 7, 2017

Yes – HyperV containers (which Kata is actually inspired by) are much more secure than Linux’ namespaces.

tilpner · on Dec 7, 2017

> Kata Containers combines technology from Intel® Clear Containers and Hyper runV

but I can't find a mention of Hyper-V anywhere (which doesn't mean there was no inspiration). Maybe you confused Hyper runv and Hyper-V here (the naming certainly doesn't help)?

kuschku · on Dec 7, 2017

I might have just been confused due to the naming, but, as far as I can see, they’re using the exact same underlying technology, based on AMD’s and Intel’s virtualization extensions, to replace the sandboxing that is currently handled by kernel namespaces, jails, or HyperV containers (and, in some of these implementations, already uses this technology)

cbzbc · on Dec 7, 2017

runV is an OCI-compatible drop-in replacement for runC that can execute containers on a number of backend virtualisation environments, including Hyper-V and KVM.

jchw · on Dec 7, 2017

That's coincidence, though. runV wasn't inspired by Hyper-V.

ohthehugemanate · on Dec 7, 2017

Funny, I never thought about it that way. Namespaced processes being a linux kernel feature in the first place. That's where the whole container thing CAME from in the first place. It's only because the Windows and OSX kernels _don't_ support namespacing, that we have to run docker et al inside a virtual machine on those environments. It is not the container implementation, but the virtual machine, which makes containers "secure" on those platforms.

So put more finely: containers are not secure, anywhere. Virtual machines are. So you should run your containers inside virtual machines if security is important to you. Environments that can't run containers natively are forced into the more secure configuration.

If you're interested in the ongoing work to make containers more secure, Jessie Frazelle has very clear posts on the subject [1][2]. The Bubblewrap project also has a great summary of various approaches being used to "jail" container processes properly. [3]

[1] https://blog.jessfraz.com/post/containers-zones-jails-vms/ [2] https://blog.jessfraz.com/post/getting-towards-real-sandbox-... [3] https://github.com/projectatomic/bubblewrap

qaq · on Dec 7, 2017

"containers are not secure, anywhere" Zones are very secure although you might or might not consider them "containers".

benmmurphy · on Dec 7, 2017

Zones have the same problem that linux containers have which is a massive attack surface in the form of a kernel. And if you think zones are secure: Which OS do you think had more kernel exploits that could be used to escape container/zone in the last 2 years? I think the answer is much closer than you might think.

qaq · on Dec 7, 2017

I am not tracking this closely, to be honest, since I have not been working with Illumos-based distros for five years or so. When we were using OmniOS I do not remember things being too bad. I'm not sure what percentage of the vulnerabilities are Oracle Solaris specific; given that the majority of the original Sun engineers left a long time ago, Illumos might be in much better shape than Solaris.

helper · on Dec 7, 2017

http://www.zerodayinitiative.com/advisories/ZDI-16-168/

http://www.zerodayinitiative.com/advisories/ZDI-16-274/

These were both in Illumos found by the person you are replying to.

qaq · on Dec 7, 2017

Good point. It does not look too horrible as far as track record goes, though :) Fewer than the number of Docker vulnerabilities, and about the same as VMware, which is an actual VM.

helper · on Dec 8, 2017

Please don't take the first two urls I found with a single google search as a comprehensive survey of Illumos kernel vulnerabilities over the last few years.

thinkpad20 · on Dec 7, 2017

> containers are not secure, anywhere. Virtual machines are.

Can you (or someone else) ELI5 what makes containers insecure? Not a low level Linux or security expert.

tyingq · on Dec 7, 2017

I'm sure there's more, but the most obvious is that they share one running kernel. So, one kernel exploit in one container means you now have all the running containers.

thinkpad20 · on Dec 7, 2017

So containers are as vulnerable as the operating system? It seems like if your kernel has been pwned you’re already SOL? Or couldn’t someone in that position just as easily pwn a VM, or run the same exploit on multiple vms? I’m not sure if I’m missing something

cestith · on Dec 7, 2017

Let's say you break out of a web app on a VM, then as the local user you exploit the kernel. That kernel belongs to the VM, so you have root on the VM. The VM is running on a full virtualization platform, though, so you'd still need to break out of the VM to hit other guests or the hypervisor.

Linux containers run a new environment on top of the host's kernel. It's the same kernel in one container as another and the same as in the host. If you manage to break out of the namespace or otherwise exploit the kernel, you're already in some other container's business. Worse, there's a good chance you've exploited the kernel in a way that you can get all the other containers and the host all at once with one exploit.
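
A quick way to see the shared kernel for yourself, as a Python sketch (the function name is made up). Run it on the host and again inside an ordinary Linux container and you would expect the same release and version strings, because both are asking the same kernel. Inside a VM, or a VM-backed container, it reports the guest kernel instead.

```python
import os


def kernel_identity():
    """Return (release, version) of the kernel this process is running on."""
    u = os.uname()
    return u.release, u.version


if __name__ == "__main__":
    release, version = kernel_identity()
    print("kernel release:", release)
    print("kernel version:", version)
```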

tyingq · on Dec 7, 2017

Assuming the attacker already knew the IP of every VM, and they were all running the same kernel with the same exploit, and they were all exposed to the internet, it would be similar...although you would have to repeat the steps 'N' times. Or, assuming you could somehow escalate your VM kernel exploit to a higher level hypervisor exploit and get to the top.

That's not usually the case, though.

With containers, you get the kernel exploit, and you're in, for the most part.

tripue · on Dec 7, 2017

Another alternative is using hyper container

tpetry · on Dec 7, 2017

This is a combined work of the people behind Intel Clear containers and Hyper containers.

acobster · on Dec 7, 2017

> It is designed to be architecture agnostic, run on multiple hypervisors and be compatible with the OCI specification for Docker containers

In what sense is this "OCI compatible"? Do they implement the runtime, image format spec, or both? My understanding of containerization and OCI runtimes is that they're fundamentally different from hardware-level virtualization.

bergwolf · on Dec 8, 2017

Both. The hardware virtualization related settings are configured outside of the OCI spec, but the runtime accepts the OCI spec and acts on it accordingly. As for image format, Kata runs unchanged Docker images.
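
For readers who have not looked at what "accepting the OCI spec" means in practice: an OCI runtime is handed a bundle containing a root filesystem and a `config.json`. Below is a pared-down Python sketch of such a config, using field names from the OCI runtime spec. It is not a complete or validated config, the helper name and values are made up, and how a VM-based runtime maps fields such as the namespace list onto a guest is not something this sketch claims to know.

```python
import json


def minimal_oci_config(args, rootfs="rootfs"):
    """Build a pared-down OCI runtime config (illustrative, not validated)."""
    return {
        "ociVersion": "1.0.0",
        "process": {
            "terminal": False,
            "user": {"uid": 0, "gid": 0},
            "args": list(args),
            "env": ["PATH=/usr/sbin:/usr/bin:/sbin:/bin"],
            "cwd": "/",
        },
        # Path to the unpacked image filesystem, relative to the bundle.
        "root": {"path": rootfs, "readonly": True},
        "hostname": "demo",
        "linux": {
            "namespaces": [
                {"type": t} for t in ("pid", "mount", "ipc", "uts", "network")
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(minimal_oci_config(["sh", "-c", "echo hello"]), indent=2))
```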

jeshwanth · on Dec 7, 2017

What's the difference between unikernels and Kata Containers?

jchw · on Dec 7, 2017

Different approaches to isolation. A Kata container uses Clear Linux to load a feature-complete Linux kernel into tiny VMs (disclaimer: I do not know exactly how it's different from any other VM). A unikernel is a small bare-metal "library" that gives you minimal OS-like functions to put in a hypervisor to run your application. Unikernels are still more minimal, I'd guess.

ams6110 · on Dec 7, 2017

Here's a recent paper about the unikernel approach

http://cnp.neclab.eu/projects/lightvm/lightvm.pdf