I was recently looking into an equivalent of V8 isolates. I'd like something like this for Python, but it looks like microVMs are my best bet here.
For anyone working in this field or with hands-on experience: is Weave's Ignite a good choice if I want to execute users' long-running Python scripts? I remember there being a lot of I/O overhead. Or should I just go with raw Firecracker, like Stan does in this article?
The scripts would be long-running, but don't require much computational power. Just a few arithmetic operations on an array every second. The array is being fed in via WebSocket.
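For concreteness, each script would look roughly like this (a sketch; the `websockets` package, URI, and JSON payload format are my assumptions, not a fixed design):

```python
import asyncio
import json

def process(arr):
    # the kind of light arithmetic each script runs once per second
    return sum(arr) / len(arr)

async def consume(uri):
    import websockets  # third-party package, assumed installed
    async with websockets.connect(uri) as ws:
        async for msg in ws:
            arr = json.loads(msg)  # assuming the array arrives as JSON
            print(process(arr))

# entry point would be something like: asyncio.run(consume("ws://host/feed"))
```

So the sandbox mostly needs to keep a socket open and burn a tiny bit of CPU per tick.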
gVisor gives you all the container tooling, which may or may not be useful.
And, because I'm a shill, we actually shipped an API specifically for this kind of use case. So if you'd rather not build it all yourself, we can help: https://fly.io/blog/fly-machines/
I'm a fan of Firecracker, but your use case might be a better fit for plain old containers because the tooling is currently more mature. If it's the isolation that attracts you to Firecracker, gVisor is an option: https://gvisor.dev/.
For the open-source Windmill project, we need to support sandboxing of TypeScript (Deno) and Python. For Deno we could have relied on V8 isolates and Deno's own layer of isolation, but for Python we couldn't, so we had to come up with a common solution. We chose nsjail in the end and it works really well. All the config files are here: https://github.com/windmill-labs/windmill/tree/main/nsjail and this is how it is spawned from within the Rust worker: https://github.com/windmill-labs/windmill/blob/main/backend/...
Happy to expand more on my experience of making this work at scale.
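If it helps, the shape of the invocation is roughly this (a simplified Python sketch, not our actual Rust code; the config filename is a placeholder -- the real protobuf configs are in the linked nsjail/ directory):

```python
import subprocess

def nsjail_cmd(script_path, config="run.python3.config.proto"):
    # nsjail reads a protobuf config via --config; everything after the
    # bare "--" is the jailed command line
    return ["nsjail", "--config", config, "--",
            "/usr/local/bin/python3", "-u", script_path]

def run_sandboxed(script_path):
    # spawn the jailed interpreter and capture its output
    return subprocess.run(nsjail_cmd(script_path),
                          capture_output=True, text=True)
```

The worker just builds that command line per job and reaps the process when it exits or hits the jail's limits.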
If you want to take a look at the Skybison Python runtime, I'd be happy to chat and help you poke around with integrating it: https://github.com/tekknolagi/skybison
You want to run short-lived Python in a trusted environment? Can you go into more detail about your specific use case? How much CPU, memory, and I/O do you need? Is it chatty over the life of the execution, or does it have all of its data up front?
How do websites like http://cpp.sh/ run code? Wouldn't it be enough to forbid/intercept system calls somehow and set limits with systemd-run, or do you really need a VM?
One thing to bear in mind is that these sites use super-paranoid security because it has been proved time and time again that it is necessary. I wouldn't look at any particular solution for running arbitrary code from a user and assume that it's actually 100% correct. I think this can help remove some of the mystery of how they do it, which is that there is very likely some way in which they actually aren't doing it. Once you remove that idea from the possibility space, the ways it is done start making much more sense. (And the idea becomes much more scary.)
Docker + heavily restricted user + firewalls seems to get you much of the way there. I'm aware that some work was done back in the pre-Docker days with Ruby's online sandbox to neuter Ruby's ability to make certain syscalls, but I imagine Docker, eBPF, or even WebAssembly makes it a lot easier now.
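As a taste of the "set limits" half, the rlimit part alone is a few lines of stdlib Python (this is resource capping only, not real sandboxing -- a hostile script still has the whole syscall surface; Linux assumed):

```python
import resource
import subprocess
import sys

def limited_run(code, cpu_seconds=2, mem_bytes=512 * 1024 * 1024):
    def set_limits():
        # runs in the child just before exec: cap CPU time and address space
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
        resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))
    return subprocess.run(
        [sys.executable, "-I", "-c", code],  # -I: isolated mode, no user site dirs
        preexec_fn=set_limits,
        capture_output=True, text=True, timeout=cpu_seconds + 5,
    )
```

A real deployment would layer seccomp, namespaces, and network policy on top; this only stops the accidental infinite loop or memory bomb.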
Make sure to benchmark your workload first -- gVisor's I/O subsystem is a lot slower than the Linux kernel's, so a VM can be materially faster if you're doing a lot of filesystem operations or file I/O.
One of the systems I built at a former employer supported both gVisor and Firecracker for isolation, and the gVisor version was 10-50x slower for a specific class of workload that did ~millions of stat() calls at startup.
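A crude way to get a feel for that gap before committing -- run the same microbenchmark inside gVisor and inside a microVM and compare (a sketch; real workloads should still be benchmarked end to end):

```python
import os
import time

def bench_stat(path=".", n=100_000):
    # time n stat() calls against one path; syscall-heavy loops like this
    # are exactly where gVisor's user-space kernel pays its tax
    t0 = time.perf_counter()
    for _ in range(n):
        os.stat(path)
    return time.perf_counter() - t0
```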
Isn't WebAssembly in the browser suited to these kinds of problems? You could run the code, emit the benchmarks, and save the results (maybe saving the code and the submission time for validation)?
Even for languages that can compile to wasm (Go, Rust), the compilation toolchain doesn't necessarily run in the browser. AFAIK there's no good way to run LLVM in wasm yet. You could do the compilation on the server and then send it back down to the browser to run, though.
Theoretically, it should be possible to do it all in the browser, just a lot more work given the state of things now.
LLVM just does pure computation, really, so it's not hard to port to wasm - much simpler than say Python (which has also been ported several times). The only challenges with LLVM are the build system (which has self-execution), working around some issues like clang wanting to open a subprocess, and adding some ifdefs.
You are much closer to Clang/LLVM than I am: how likely is it that these changes would get upstreamed? I see the clang.wasm ports, but really I'd like to get in-tree builds of each commit. I find clang.wasm extremely useful, but all I have are snapshots, and I don't have the skills (or maybe the patience) to get main continuously building to wasm.
Hmm, I'm not a regular LLVM committer myself, but I think there's a chance.
The history here has been several ports "for fun", so no one has tried to upstream anything. But if you have a real use case that could benefit from this, we should talk with the LLVM people and see. Feel free to file an issue and cc me and we'll find the right people.
I've been meaning to take a look at Bottlerocket[^1] as an alternative to a custom spin of Kubernetes but we haven't really had a chance to dig into it. The folks over at Fly[^2] have built an awesome edge platform out of Firecracker, and ultimately, where I want to take the next generation of our internal compute offering. I am eagerly looking forward to any and all presentations they do on their work.
TIL: Firecracker. Thank you for that! :)
Great article and project! I didn't understand the idea of using RabbitMQ and fetching new events in a loop, though. I'm not an architect, but I was thinking that real-time databases are used for those purposes?
About the containers: if you are not running them in privileged mode, you should be pretty secure, especially if you limit what kind of binaries the containers have.
Containers share kernels between tenants; that's the point of the design. The shared kernel is a huge security problem; it's easy to rattle off kernel LPEs that no realistic syscall filter would have prevented and that would pop a container runtime.
It is not similarly easy to do that with a lightweight hypervisor.
My argument is empirical. I can name recent LPEs that bypass MAC and sandbox policies; it is not easy to name comparable hypervisor escapes. Shared-kernel container escapes are found so often they're not even all that memorable.
> Shared-kernel container escapes are found so often they're not even all that memorable.
Agreed. I realized I inverted your hypervisor comment. Hypervisors have the compact contract that has any reasonable chance at being audited. Container security is basically a screen door.
With a real-time database, each worker could subscribe to be notified when the set of tasks changed, but it would require a separate locking mechanism in the application to ensure that each task is only attempted by one worker. (Imagine a scenario in which a task arrives when there are multiple workers idle.)
With RabbitMQ, you ensure* that each task is only attempted by one worker at a time, and you don't have to do anything special to ensure that at the application level.
*I'm simplifying a bit, there are edge cases where e.g. you lose a worker that has already started a task but not completed it.
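The property in question -- each task handed to exactly one worker, with no app-level locking -- can be illustrated with a stdlib analogue of a work queue (this is not RabbitMQ, just a demonstration of the semantics):

```python
import queue
import threading

def drain(task_q, n_workers=3):
    # every worker competes for tasks, but get_nowait() delivers each
    # task to exactly one of them -- no application-level claim logic
    claimed = []
    claimed_lock = threading.Lock()

    def worker():
        while True:
            try:
                task = task_q.get_nowait()
            except queue.Empty:
                return
            with claimed_lock:
                claimed.append(task)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return claimed
```

RabbitMQ adds the parts this toy omits: acknowledgements, redelivery when a worker dies mid-task, and durability across broker restarts.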
Interestingly enough I picked RabbitMQ for the same reasons for a distributed execution system I designed at work. RabbitMQ provides a lot of nice to have features and flexibility when it comes to delivering messages between endpoints. The fact that you get access control, easy encryption, multiple protocols (mqtt/amqp/stomp etc), excellent reliability and scalability rolled into a mature system is pretty awesome. The concepts are easy to understand and get real code working. The interfaces with RabbitMQ are simple enough that you could replace it with something else without having to untangle a whole lot. The other killer feature for me was async event listening which cuts out a lot of overhead from polling / long polling.
I think they're referring to firebase which sells itself as a "realtime database" or rephrased, a database designed with "realtime" (automatically updating) websites/apps in mind.
The use of the word "realtime" for web tends to trigger a lot of systems developers who use the term to indicate that you can predict the actual real world time that something will take to compute, typically used in automotive and robotic settings. That being said, I didn't invent the word and words have multiple meanings and contexts. In this context it simply means push-delivered data that is pushed when updated. /disclaimer
It's a term that Firebase uses. From [their documentation](https://firebase.google.com/docs/database): "The Firebase Realtime Database is a cloud-hosted database. Data is stored as JSON and synchronized in realtime to every connected client."
I believe "realtime" in this case pertains to the synchronization of data amongst clients through websockets. This is how I've seen the term "realtime database" most commonly used.
Illumos Zones are shared-kernel isolation, which isn't safe for multitenant untrusted workloads. They're a great way to segregate components of microservice ensembles from a single tenant, to reduce blast radius.
Citation needed. I/we ran illumos zones for multitenant untrusted workloads for over a decade -- and at Oxide, we use it as a containment mechanism on top of that provided by the virtual machine. Yes, there have existed vulns (just as in any software), but (in my experience) not at a higher rate than seen in hypervisors or in the CPU itself. (Specifically, the biggest single vuln we had in a decade+ of running zones in a public cloud was -- by far -- Meltdown.)
I didn't give a statistic in this quote, so I won't cite one. Illumos Zones are shared-kernel isolation: all the tenants share a kernel attack surface. That's not sufficient for untrusted cotenants; any kernel LPE compromises the whole scheme. The kernel's attack surface is drastically larger than (say) the KVM attack surface.
Well, it's not just the KVM attack surface -- it's KVM + QEMU, and there have emphatically been escapes. Yes, the shared kernel is an issue -- but so is a shared hypervisor or a shared CPU. It's a risk to be mitigated, and it's a gross exaggeration to say that it "isn't safe for multitenant untrusted workloads."
It's QEMU for you, or was at the time, but I'm not talking about what the right engineering decision is in 2015, I'm talking about what makes sense today, with memory-safe hypervisors.
For untrusted multitenant workloads in 2022, for arbitrary code and without a language-level sandbox, a shared-kernel workload isolation system might be malpractice. Again: you can easily rattle off the LPEs that would have broken a Linux shared-kernel scheme (of any realistic sort) over the last couple years.
Someone else dunked on you for a Joyent bug from a bunch of years ago. I didn't. Security researchers were dunking on shared-kernel isolation even back then, but it was a much harder decision back when the only alternative was expensive, memory-unsafe legacy hypervisors, and I would have had a hard time weighing guest escape vs kernel LPE back then too.
But this time and the last time we bounced off each other on this, we weren't talking about 2015; we're talking about today, when there are multiple memory-safe hypervisor options. I don't think it's an open question anymore.
I don't think we're really comparing hardware virtualization to OS-based virtualization (we at Oxide run HW-based virt inside of OS-based virt so that's a bit of a false dichotomy to begin with), but rather whether or not it's "malpractice" (your term) to run a multitenant workload on illumos zones. And to be clear: we're not talking about Linux here; we're talking about illumos -- for which OS-based virtualization has an entirely different design center. (And indeed, a design center that was -- from the beginning -- designed for securing multitenant workloads.) So your experiences with Linux are of limited relevance here, frankly.
Yes, very familiar with that one! Not only is this one of the very few zone escapes over our years in production (responsibly disclosed, thankfully!), but the bug itself was introduced by yours truly -- and is part of what gave me religion on Rust. To be clear, my assertion was not that any particular body of software is invulnerable, but rather taking issue with the assertion that zones-based infrastructure "isn't safe for multitenant untrusted workloads"; we ran exactly that for a decade. I also very much stand by my assertion that Meltdown was a greater source of vulnerability than zones -- and if one wishes to assert that a shared kernel makes zones unsafe, then one also must say that a shared microprocessor is unsafe. For some folks, that will be a completely reasonable assertion, but most will understand that a shared microprocessor -- past vulnerabilities aside -- can in fact be made safe for multitenant use.
I don't think you should have to apologize for a 2016 Joyent bug.
I also don't understand how you can coherently argue that people should have "religion about Rust", but also put their faith in C-language OS kernels any more than they have to.
Further, I don't understand how Meltdown helps your argument at all here, since both isolation strategies are susceptible. Memory safety also doesn't protect you from control-plane SSRF vulnerabilities, but you can immediately see why "control-plane SSRF vulnerabilities mean memory-safety is overrated" is a bogus argument.
I'm not arguing that people should have religion about Rust -- merely explaining that this particular issue was central to my own internalization of some of the very subtle unsafety in C. In terms of Meltdown: I am saying that -- as a practical matter -- it was much more serious for us than the extraordinarily small number of vulnerabilities we have had in zones over the years.