ZeroVM: Smaller, Lighter, Faster (rackspace.com)
163 points by bretpiatt on Oct 21, 2013 | 79 comments



I like this alternative approach to titling.

The original headline is preserved, and clarified by the editorial comment in square brackets. It would be great for HN to adopt this as a solution to the modified-headline problem, with the proviso that editorial comments must only be used for the purposes of clarification.


It was nice, until it was removed because fuck context.

ZeroVM: Smaller, Lighter, Faster [RackSpace acquires LiteStack]


Van Lindberg here from Rackspace. If you have any questions, I am around to answer.


* The ZeroVM site makes a big deal about application execution being completely deterministic. How does this interact with applications that require random numbers, such as crypto?

* Is ZeroVM capable of running unmodified Linux binaries? If not, what compiler toolchain is required to get it working? The main advantage of other lightweight virtualization solutions (OpenVZ, LXC) is that it's very easy to take regular binaries (e.g. postgresql) and drop them in a sandbox with minimal fuss.


- It is deterministic based on the inputs. You would need to pass in a seed or read from an external source of randomness to get different values out of a PRNG.

- Binaries need to be recompiled. There are two toolchains, a GCC-based one and an LLVM-based one. We can also compile within the ZeroVM container itself.

We expect that a lot of people will use existing language runtimes (Python, Lua, JS) to avoid compilation.

Over the long term, though, a lot of the power comes from composability. Think Unix pipes, in parallel, across the cloud.
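The seeding point above can be illustrated with a small Python sketch (the seed value here is arbitrary, and this is plain CPython, not ZeroVM itself): given the same seed as input, a PRNG produces the same stream every run, which is exactly what "deterministic based on the inputs" implies.

```python
import random

# Two PRNGs seeded with the same input value produce identical
# streams -- the output is a pure function of the seed.
a = random.Random(42)
b = random.Random(42)

stream_a = [a.random() for _ in range(5)]
stream_b = [b.random() for _ in range(5)]
assert stream_a == stream_b  # same inputs, same outputs
```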


Does the hypervisor use multiple communicating CPUs? If so, how do the races inherent in concurrency not destroy the determinism? Is this a single CPU/thread/fiber hypervisor?


Each container has a single process. Each individual part is deterministic, so the entire system composes deterministically.


That's not true if the processes communicate, or contain communicating threads. In one run, an input queue looks like {A, B}. In another, {B, A}. The source of the non-determinism is just entropy bubbling up from the hardware.

EDIT: using completely synchronous I/O (mentioned below) is a very clever solution, but it requires a process to know its inputs ahead of time. This may also cause cluster scalability issues, as now each "round" of inputs is gated by the slowest of the source processes.


All reads from other sessions are blocking. There is no input queue; zerovm processes read from and write to each other directly. This way determinism can be preserved even for clusters.


Assuming A and B are produced from separate input processes, they have to be submitted to a "gather" process that takes both of them. If the "gather" process uses select() or non-blocking I/O -- basically says "read from A or B, whichever becomes available first" -- you'll get them gathered in a nondeterministic order. OTOH if the "gather" process uses synchronous blocking I/O -- "read one message from A, then read one message from B" -- then you always get (A, B) order.

If the framework you're using requires all I/O to be synchronous, and there's no way for a program to tell time or tell when an action would cause a delay, then there's no way for nondeterminism to develop based on timing.

I don't have any idea if ZeroVM is like this, and a framework that only allows synchronous I/O would have its own problems (basically you'd have to worry a lot about deadlock).

EDIT: To expand on this, you might still be able to do a lot even if cyclic interprocess data flows are forbidden. This is particularly true of database style applications, which are where ZeroVM originated.
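The blocking-order argument above can be sketched in ordinary Python (illustrative only; this is not ZeroVM's actual API): two producers finish at unpredictable times, but a gather step that does blocking reads in a fixed order still yields a deterministic result.

```python
import queue
import threading
import time

# Two producer "processes" with unpredictable timing, and a gather
# step that reads each channel in a fixed, blocking order.
chan_a, chan_b = queue.Queue(), queue.Queue()

def produce(chan, msg, delay):
    time.sleep(delay)        # simulate nondeterministic timing
    chan.put(msg)

threading.Thread(target=produce, args=(chan_a, "A", 0.02)).start()
threading.Thread(target=produce, args=(chan_b, "B", 0.01)).start()

# "Read one message from A, then one from B" -- always gathers
# ['A', 'B'], even though B was produced first on this run.
gathered = [chan_a.get(), chan_b.get()]
print(gathered)  # ['A', 'B']
```

A select()-style gather ("whichever is ready first") would instead reflect the producers' timing, which is where the nondeterminism creeps in.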


* You will need to supply it with a random seed.

* No, you will need to recompile. We use a modified gcc/glibc toolchain.


With the potential of starting a ZeroVM in 5ms, and running it for a very short amount of time, do you see yourselves starting to charge in smaller time increments (ms)?

I work a lot with smallish data, where queries/processing may take 20 minutes or a few hours. Often I can split this work up significantly, but when most places charge by the hour it's rarely worth it. PiCloud are excellent in this area, and Manta looks interesting (but possibly a bit more expensive) but I'm not aware of many others. I love the idea of being able to start, with little overhead, a large number of short lived jobs. Particularly if I can run them locally or my own cluster.

I look forward to seeing more on this.


The platform itself produces accounting data with 1 ms accuracy. How to do accounting in a real-world public service is another question. But we aim for very short-lived (10 seconds) but widely horizontally spread (1000 machines) workloads.


Thanks for the reply, this sounds like just the kind of thing I'm interested in, and I'm happy to see more competition in the field of renting machines for short periods. I look forward to seeing this released and giving it a go.


Hey Van!

How much did you guys pay for them? :p


Are there interpreters already available for ZeroVM? Or at the moment, must everything be compiled?


Lua, Python and C/C++ are available right now. C/C++ is compiled at run time with LLVM. Porting an interpreter requires minimal effort, although porting some of the interpreter's libraries may not be that easy. I compiled PHP just for fun, and it worked "out of the box".


I've read the architecture page and I'm not clear on how multithreading is dealt with if the application demands it. Could you expand a bit on that?


If you have a legacy application that demands threads, we can emulate threads using a coroutine approach: http://en.wikipedia.org/wiki/Coroutine
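The coroutine idea can be sketched with Python generators (a toy model, not ZeroVM's actual mechanism): each "thread" yields at its would-be blocking points, and a trivial round-robin scheduler interleaves them deterministically on a single OS thread.

```python
# "Threads" are generators that yield at cooperative scheduling
# points; a round-robin loop interleaves them deterministically.
def worker(name, steps):
    for i in range(steps):
        yield f"{name}:{i}"   # cooperative yield point

def run(tasks):
    out = []
    while tasks:
        task = tasks.pop(0)
        try:
            out.append(next(task))
            tasks.append(task)     # re-queue: round-robin
        except StopIteration:
            pass                   # task finished
    return out

print(run([worker("t1", 2), worker("t2", 2)]))
# ['t1:0', 't2:0', 't1:1', 't2:1']
```

Because the scheduler never preempts, the interleaving is a pure function of the code, which preserves the determinism discussed above.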


What would be the impact of such a coroutine emulation if the threading is used to leverage multi-core hardware for high performance computing such as done by Atlas [1], OpenBLAS [2] or MKL [3]? These libraries are tuned to maximize CPU cache hits. It seems to me that executing each thread task sequentially using coroutines would probably break such optimizations.

[1] http://math-atlas.sourceforge.net/ [2] http://www.openblas.net/ [3] http://software.intel.com/en-us/intel-mkl


From the sounds of it, you'd want to split up your workload and run each subset in a separate zerovm


That's correct. And that's how most parallel processing frameworks do it anyway.


Probably the most obvious question: will it work only on Rackspace, or can it be used anywhere (and at what cost)?


It's open source: https://github.com/zerovm It can be used anywhere, and can be installed on top of an OpenStack installation.


We see this as bigger than just Rackspace. Anywhere your data is, we want ZeroVM to be there.


... and how much did you buy them for ? :)


How much did you acquire them for?


Perhaps a bit off-topic, but I would like to get to the point where I can make intelligent comparisons between technologies like CoreOS and ZeroVM, and in general better understanding of containerization, virtualization etc. Can someone suggest a list of books that can get me started on that path?


CoreOS and ZeroVM are so young that there's not really much literature about them yet. That said, VMs and containers have been around for decades.

Lots of the recent activity is more about packaging and usability improvements, rather than theoretical improvements.

A quick overview: CoreOS is a super-minimal Linux distribution designed to be used as a base for applications. It's essentially equivalent to the JEOS buzzword from five years ago. It would run inside of Xen or KVM or VMWare.

Xen, KVM, VMWare, and VirtualBox you probably already know about: they provide a virtual machine, and the operating system running inside it (theoretically) can't tell it's not on its own hardware. Xen uses a 'hypervisor', which is essentially a very tiny, custom kernel. KVM uses the Linux kernel as the hypervisor, which makes a lot of sense: you don't have to reimplement all the years of hardware support and scheduling work they've done. VMWare and VirtualBox run as applications on whatever OS you provide. You lose out on some opportunities for clever performance hacks this way, but there are other advantages. VMWare ESX[i] is more like Xen and KVM, but I don't really know that much about it.

Containers (BSD jails, Solaris zones, LXC, and of course, HN's lovechild Docker) let you provide VM-like isolation and resource management between processes or groups of processes, but you only run one kernel. This means much less duplicated effort and memory, and Docker's AUFS lets you deduplicate your storage too. There are slightly more security concerns about this approach than with full VMs; the Linux kernel (and others, but let's be honest about the target audience) has a long and ugly history of local privilege escalations.

ZeroVM is based on Google Chrome's NaCl, and that's about all I know about it. I would expect VM-like security (it validates machine code) and an environment that requires serious porting from POSIX. That said, if you use Python, Ruby, Mono-compatible .NET, or Go, the heavy lifting has already been done for you.


ESX and ESXi are bare-metal hypervisors, unlike KVM, which is kind of a quasi-bare-metal hypervisor that happens to sit in the Linux kernel. ESX has been EOL'd; it relied on something called the "Console OS" to bootstrap itself until the vmkernel would take over and start scheduling tasks. The Console OS was actually a modified Red Hat Advanced Server (later Enterprise Server) instance which, once the system was booted, would act as a kind of "privileged guest". You could log in to it and do sysadmin-y tasks like adding users, installing RPMs, etc.

ESXi, on the other hand, was written to do away with the Console OS entirely, but it still has a fairly rudimentary shell. Many of the utilities are based on busybox, and the idea is that it should be stripped down to only really minimal functionality. It also sported something called the Direct Console UI (DCUI), which is a curses-based interface for doing things like setting up an admin password, reviewing logs and changing security settings.


Thanks for the clarification.


As far as I read it, you are wrong about CoreOS. It's meant to be run as a host OS, not as a guest. It provides a minimal Linux Hypervisor you can use to run containers built for docker.


Yes, CoreOS wants to be a host OS, but it's also a ripping good guest OS, because of its minimalism. I'm expecting most people to basically stick their app and nothing else (like, say, a full ubuntu environment) inside of the container.

Hypervisor is the wrong word: that's what you'd call Xen, VMWare ESX or the host KVM kernel.


Thanks, krakensden, for the clarifications. I think reading a few books on operating systems and on the Linux kernel would be a good start.


That said, if you use [...] or Go, the heavy lifting has already been done for you.

Not really. Go used to have a nacl port, but that was years ago. It was abandoned when the nacl people decided to use a different method for isolating code.

Porting Go to use the new method would require writing another compiler, like {5,6,8,}{c,g}.


Late edit: there is talk of reviving the nacl port (not pnacl); has something changed in nacl land?


For some reason I thought ZeroVM is based on vx32, not NaCL: http://pdos.csail.mit.edu/~baford/vm/ Does anyone know if there is something similar based on vx32?


Based on "Why ZeroVM?" (http://zerovm.org/wiki/Why_ZeroVM), a large part of the motivation for ZeroVM is based on the premise that regular VMs require a full OS and are therefore unacceptably fat. However, there are multiple platforms for running unmodified applications directly on VMs without requiring a traditional OS, e.g. the work I've been involved with: https://github.com/anttikantee/rumpuser-xen/

Determinism, OTOH, sounds interesting at least on paper. Is there any experience from tests with real applications in real world scenarios?


Big congrats to Cam and the whole team! I work for one of the other companies from their TechStars class, they were a blast to have in San Antonio.


ZeroVM is LXC but with NaCl.


almost,

LXC starts as a general-purpose Linux container with everything built in, and adds more isolation as development continues.

ZeroVM starts with no general-purpose Linux, and will add support as development continues.

i.e., LXC will work with what you have now; ZeroVM will eventually work with what you have now, but shims will have to be developed for everything, either in your code or in ZeroVM's.

IMO the future endpoint will have similar functionality in both projects, but LXC will see more testing and use /now/.


Huge congrats to the LiteStack team. I was in their TechStars class and those guys are super smart.


I've read the architecture doc (http://zerovm.org/wiki/Architecture) and I loved it.

But, when you say tantalizing things like 'erlang-on-c', you raise the question: what does the clustering control plane look like?

One of the great things about erlang is that the cluster's got supervisors that receive execution-level messages (e.g. 'EXIT') and can then take whatever action they feel like. Is that control plane level exposed to ordinary containers?

And the other great thing about erlang is that the messaging model is either synchronous if you care (with return receipts) or asynchronous if you don't (fire and forget) -- and that richness turns out to have a bunch of good use cases. What's the ZeroVM story there?

And the other great thing about erlang is being able to trace out messages, especially when your synchronous architecture just took a dump on the sheets and is staring at you belligerently. Does ZeroVM have introspection figured out yet?


The control plane is on top of ZeroMQ, to allow for various arrangements of components as well as the ability to observe the flow of inputs/outputs.


Thanks for the response. It'd be great if the wiki were fleshed out with an overview of how that works and for the other questions to be addressed as well, for those of us examining it from other backgrounds.


How does this solution compare to Google App Engine?


Why the downvote? This solution requires an app recompile, no?


So... how do I use this with my Rails app, and to what end?

It looks like interesting technology, but I need a more concrete example.


I think for stuff like your Rails app, you'd want to wait until there is support lower down in your stack.

But imagine that you're writing a commenting system, and want to sanitise stuff received from a user. Sanitising data is error prone, so you isolate the code in a new zerovm. If someone finds a way to exploit anything in your sanitising code, they might be able to write broken sanitised HTML out, but they won't be able to e.g. send queries to your database, or write to your disk, because the zerovm simply doesn't have permission.

And imagine the web server spawning a new zerovm for every request, that only has permission to talk to the inbound network connection and pass messages between that and a Rails zerovm for that request. If there's an exploit in the HTTP parser, that vm could be exploited, but it'd die at the end of the request, and would have no permissions to talk to the database server or write to disk.

And imagine the Rails zerovm similarly being split into pieces: Request handling might be done in one; authentication might be done in one.

The lower the startup costs, the more you can afford to chop the app into pieces, and the more you can leverage that for security (by reducing the privileges of each individual component) and scalability (by allowing distribution of the VMs across CPUs and across servers).
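The sanitiser-in-a-box pattern described above can be caricatured in plain Python (this is only the shape of the idea: a subprocess talking over stdin/stdout, the same default channel set a ZeroVM session gets; a real deployment would need an actual sandbox, and the inline child code here is invented for illustration).

```python
import subprocess
import sys

# Hypothetical sanitizer run as a separate process that can see
# nothing but stdin and stdout. Even if its logic were exploited,
# it holds no database or filesystem handles to abuse.
SANITIZER = r"""
import html, sys
sys.stdout.write(html.escape(sys.stdin.read()))
"""

def sanitize(untrusted: str) -> str:
    result = subprocess.run(
        [sys.executable, "-c", SANITIZER],
        input=untrusted, capture_output=True, text=True, check=True,
    )
    return result.stdout

print(sanitize("<script>alert(1)</script>"))
# &lt;script&gt;alert(1)&lt;/script&gt;
```

The point is the channel discipline, not the escaping: the risky code's entire world is one input pipe and one output pipe.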


How would ZeroVM instances talk to any server with persistent storage (e.g. a key/value store) in a deterministic way? get('top_stories') will change over time.


why not just use docker.io? what are the differences between zerovm, docker, and warden?


Unlike ZeroVM, Docker is not a security solution; it is only useful for managing administrative domains within a machine. (To preempt a massively pointless, ~20-year-old conversation, Google "chroot security" and "jail security" and suchlike to understand why.) ZeroVM, on the other hand, starts by statically verifying that any code that executes adheres to a fixed protocol, and that protocol only allows invoking a small set of rigorously defined service stubs.

This may sound vaguely similar to how Linux containers and the syscall interface work, but it involves orders of magnitude fewer LOC, written from the outset with a robust security design in mind. Compare that to the thousands of LOC of daily churn in the Linux kernel, often written by people too busy fighting with shitty hardware to care about how their driver ioctl might be accidentally exposed to UID 0 running in a container; and even if they notice, they might not care.


I found the architecture page helpful in understanding what this thing really is: http://zerovm.org/wiki/Architecture


What does an instance look like? These are very lightweight instances of what?


It looks like a hardened *nix process. It has no access to anything it's not permitted to access. And it has no notion of the network, time, or the machine it's running on, although it can communicate with other instances (even on remote machines) via IPC. It can be suspended, resumed, relocated and so on without ever noticing.


Thank you. When you instantiate a zerovm instance you give it the associated code as well? And which IPC method can it use? Is zerovm the library you use, is there such a thing as a separate zerovm instance, or is it just the way we are used to talking about virtualization?


You give it an x86 (or ARM) binary to execute. NaCl is also working on an LLVM version that would get compiled for the specific machine at runtime.

IPC is super limited: https://github.com/zerovm/zerovm/blob/master/doc/api.txt

You get nothing but /dev/stdin, /dev/stdout, and /dev/stderr by default. You can optionally make other resources (network, files) available, through a similar api.


When we instantiate, we give zerovm an executable image (a file) and any other files the executable will need (these can be arranged in a sort of "VM image", which is a regular tar file). Sessions (instances) can communicate via Unix pipes. Yes, we have a notion of an "instance": it is a running zerovm process. Each session runs in a separate process.


I'd compare ZeroVM to Manta ( http://www.joyent.com/products/manta ) instead of Docker & Warden.

AFAIK the idea is to have as light a container as possible so you can afford to throw your app at the data instead of throwing data at the app.


That's correct. Manta is the closest thing.


Am I the only one here who finds this approach somehow similar to Plan 9?


Yes, because after many years of using Plan 9 daily I see no similarity.


I cannot see what this has to do with security. At the end of the day, it is the data that attackers are after and the app needs to be able to access it whether it is virtualised or not.


Each part of your application needs access to some sub-part of your data, but if you isolate your app at the OS level and run a whole app server inside a VM, every part of your application can at least in theory access all your data.

If you sub-divide your app in separate zerovms, whether per-request, or split it up further into functional responsibilities, then you substantially reduce the attack surface by ensuring that an exploit against any one part of your application can only exploit the specific subsets of data it is allowed to work on.

You can do this without zerovm too, but the more you reduce the cost and difficulty of spawning a new vm or container, the more finely grained you can subdivide your application, and hence the fewer privileges each subset of your app will have.


This is wishful thinking at the moment. I understand perfectly well what that means, but data is data. An application typically has access to all data, and the fact that you run it through a VM doesn't change anything.

I can find this technology useful only in areas where you want untrusted 3rd-party code to run without worrying about what it will do.


Typically, yes, but we can change that. An application does not need access to all data; that happens because today any web application serves millions of requests and thousands of users, so it needs access to all the data of all users at any time. When using ZeroVM you can serve one user and one request with one VM instance. Then you may explicitly define what data is accessible to that one VM instance. Yes, you could implement such controls in your application yourself, but we are doing it for you, uniformly, at the infrastructure level.

And about "3rd party code": if we have two developers, each working on a different module of the same application, isn't their code "3rd party" to each other?
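The "explicitly define what data is accessible" idea can be caricatured in a few lines of Python (all names here are invented for illustration; the real enforcement would happen at the infrastructure level, not in application code):

```python
# Toy model: each session handler is handed only the data slice
# it was explicitly granted, so a compromised handler for one
# user cannot even name another user's records.
ALL_DATA = {"alice": ["a1", "a2"], "bob": ["b1"]}

def spawn_session(user, handler):
    visible = {user: ALL_DATA[user]}   # explicit, minimal grant
    return handler(visible)

# The handler's whole world is the granted subset.
assert spawn_session("bob", lambda data: list(data)) == ["bob"]
```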


It is only "wishful thinking" in as much as people are usually lazy, because the effort required to sandbox small pieces of code is prohibitive on most current platforms. But larger systems are often already layered in ways that layer access to data anyway (though often not intentionally for security).


Good luck with this approach. I am not saying you should throw this away (awesome technology, btw), but it is rather impractical in many ways. You will find very few use cases where you can apply this with direct benefit. In most cases this won't change a thing, except perhaps for highly specialised software.


Probably because it reduces the attack surface: instead of having to worry about a 0-day in ssh or similar, you can just worry about your application. Also, if someone else on the machine gets compromised, I assume you are isolated from that too, since the attacker is still contained inside that container.


Not only that. Each request is isolated in its own container. This way one user of your application cannot gain access to data of another user by simply exploiting an application bug.


That's assuming you lock down access at the level of the application instance. If you create an instance with permission to read all data, it doesn't matter that it's been created by a single request… it's still got permission to read everything. Not saying a single request instance isn't a win for security, but you'll have to build your app around this concept to get that win.


To make an application truly multi-tenant you will need to adopt the "share nothing" concept anyway. We just supply you with a "share nothing" infrastructure.


I am not able to wrap my head around this.

If I have to run a database, say postgresql, how do I run it? Inside the ZeroVM or outside? To run the DB I would need to give it file system access?

Now if there is a security hole in postgresql, how is it guaranteed that files other than DB files are never accessed?


If you have to run a single database for a multi-tenant application, you're in for some real pain. For example: how will you shard it? How will you load-balance it? The ZeroVM approach to the cloud is that "the cloud is the database". ZeroVM sessions have transactional qualities: they are deterministic, isolated, can be rolled back, etc. Essentially we integrate distributed storage with "stored procedures" and "triggers"; this is what a ZeroVM cloud looks like.


Typically it takes a single request to dump various data via SQL injection. Just saying.


why not a different libc, such as musl?


I've never understood people's fascination with replacing glibc. There's almost never anything to gain, and glibc has the advantage of being really, really well tested.


"I've never understood people's fascination with SpaceX. There's almost nothing to gain, and the Russian Proton has the advantage of being really, really well tested."

On a serious note: the primary disadvantage of glibc is that it's really, really hard to change (and build times are slow). While it's already here, sometimes you want to port it to a new platform or a new ABI, and the adventure begins.


As someone who tried to port code in the early days of Bionic, I am filled with grumbles.

And glibc hasn't been slow for a year or two, since they defenestrated Ulrich Drepper.



