I never really realized that a Firecracker VM is a full-blown machine and not just some sort of Linux container tech. At first it may sound like an inefficient approach, but if you take a closer look at a real-world usage example such as fly.io, you will be surprised: micro-VMs are very small and capable.
Thanks to KVM, and to the minimal hardware support (no PCI, no ACPI, etc), Firecracker's source is rather simple and even relatively readable for non-experts.
> Firecracker's source is rather simple and even relatively readable for non-experts.
... as long as they're experienced at writing Rust. As a Rust newbie it took me a long time to figure out simple things like "where is foo implemented", due to the twisty maze of crates and `use` directives.
I totally get why this code is written in Rust, but it would have made my life much easier if it were written in C. ;-)
> it took me a long time to figure out simple things like "where is foo implemented"
Out of curiosity, what development setup do you use?
I imagine that with vanilla Emacs or vanilla Vim you’d have to do quite a bit of spelunking to answer that sort of question.
With a full-blown IDE, such as JetBrains CLion with the Rust plug-in installed, it is most of the time a matter of right-click -> go to definition / go to type declaration. (Although heavy use of Rust macros can in some cases confuse the system and make it unable to resolve definitions/declarations.)
And with JetBrains CLion you still have Vim keybindings available as a plug-in.
I switched from Vim to CLion + plug-ins years ago and haven’t looked back since. (Vanilla Vim is still on my servers though so that when I ssh in and want to edit some config files or whatever I can do so in the terminal.)
My development environment for this was mostly "nano in an SSH session". (Among other reasons, I can't even build Firecracker locally.) For FreeBSD work that's just fine since grep can find things for me. It didn't work so well for Firecracker.
Figuring out how to get all the crate source code extracted was the first step, yes. But even after that, the object-oriented nature meant that there were often many different foo functions, so I had to dig through the code to figure out what type of object I was dealing with and which type each of the different foo implementations dealt with.
Whereas in FreeBSD I just grep for ^foo and the one and only line returned is where foo is implemented -- because if there's different versions of foo, they have different names.
Namespaces sound good in principle but they impose a mental load of "developers need to know what namespace they're currently in" -- which is fine for the original developer but much harder for someone jumping into the code for the first time.
If you go to the web site for the crate (or the standard library), and find the doco for the module / function / trait / ..., you find a handy "source" button. It will take you straight to the definition.
Yes! After a bit of playing around, I can follow Rust reasonably well... but only with an IDE, which I've never used for C or even really for the bits of Java I wrote. I understand that many more experienced Rust developers are similarly IDE-reliant.
I do think it's a deliberate tradeoff, having e.g. .push() do something useful for quite a few similar (Vec-like) data structures means you can often refactor Rust code to a similar data structure by changing one line... but it certainly doesn't make things as grep-friendly as C.
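To make that concrete, here's a toy sketch (not Firecracker's code; the `Push` trait and the types here are made up purely for illustration) of why grepping for "fn push" stops being conclusive once a method name is shared across Vec-like types via a trait:

    use std::collections::VecDeque;

    trait Push<T> {
        fn push(&mut self, item: T);
    }

    impl<T> Push<T> for Vec<T> {
        fn push(&mut self, item: T) {
            // Delegate to the inherent Vec::push.
            Vec::push(self, item);
        }
    }

    impl<T> Push<T> for VecDeque<T> {
        fn push(&mut self, item: T) {
            self.push_back(item);
        }
    }

    fn fill(buf: &mut impl Push<u32>) {
        // The call site is identical no matter which container is behind it.
        buf.push(42);
    }

    fn main() {
        let mut v: Vec<u32> = Vec::new();
        let mut d: VecDeque<u32> = VecDeque::new();
        fill(&mut v);
        fill(&mut d);
        println!("{:?} {:?}", v, d);
    }

Swapping the container is a one-line change at the binding, which is the refactoring upside; the flip side is that `grep -rn "fn push"` now returns several definitions, and you need the receiver's type to know which one a given call resolves to.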
In comparison, I remember spending much less time finding "where is foo implemented" in Rust than in C++, and also found Rust std to be much more readable than C's when I wasn't familiar with each language. But I can see how Rust, with all the procedural macros, crates and traits, could become a maze for people most familiar with C, and I probably don't feel that because of my C++ background.
There's no way an "enterprise grade" cloud vendor like AWS would allow co-tenancy of containers (for ECS, Lambda etc) from different customers within a single VM - it's the reason Firecracker exists.
> There's no way an "enterprise grade" cloud vendor like AWS would allow co-tenancy of containers (for ECS, Lambda etc) from different customers within a single VM - it's the reason Firecracker exists.
I won't speak for AWS, but your assumption about what "enterprise grade" cloud vendors do is dead wrong. I know, because I'm working on maintaining one of these systems.
The "process sandbox" wars are over. Everybody lost, hypervisors won. That's it. It feels incredibly wasteful after all. Hypervisors don't share mm, scheduler, etc. It's a lot of wasted resources. Google came in with gvisor at the last minute to try to say "no, sandboxes aren't dead. Look at our approach with gvisor". They lost too and are now moving away from it.
Really? Has gvisor ever been popped? Has there ever even been a single high-profile compromise caused by a container escape? Shared hosting was a thing and considered "safe enough" for decades and that's all process isolation.
Can't help but feel the security concerns are overblown. To support my claim: Google IS using gvisor as part of their GKE sandboxing security.
I don't know what "popped" means here, but so far as I know there's never been a major incident caused by a flaw in gvisor. But gvisor is a much more intricate and carefully controlled system than standard Linux containers. Obviously, there have been tons of container escape compromises.
It doesn’t look like they moved away from gVisor for security reasons.
“We were able to achieve these improvements because the second generation execution environment is based on a micro VM. This means that unlike the first generation execution environment, which uses gVisor, a container running in the second generation execution environment has access to a full Linux kernel.”
The reason you go with process isolation over VM isolation is performance. If you share a kernel, you share memory managers and pages, scheduler, limits, groups, etc. If you get better performance running VMs vs running processes, then what was even your isolation layer for?
But at the end of the day, there is a line in the sand around hypervisors vs proc/kernel isolation models. I challenge you to go to a financial or medical institution and tell their CTO "yeah, we have this super bulletproof shared-kernel-inproc isolation model"
The first question you'd get is "Why is this not just part of upstream linux?" Answer that question and realize why you should just use a hypervisor.
Obviously there might be many reasons for that, but as someone who worked on a similar gvisor tech for another company, it's dead in the water. No security expert or consultant will ever sign off on a process isolation model. Regardless of architecture, audits, reviews, etc. There is just too much surface area for anyone to feel comfortable signing off on hostile multi-tenants with process isolation regardless of the sandboxing tech.
Not saying that there are no bugs in hypervisors, but the surface area is so so much smaller.
The first sentence pretty much sums it up: "Cloud Run’s new execution environment provides increased CPU and network performance and lets you mount network file systems." It's not a secret that performance is slower under gvisor and there are compatibility issues: https://gvisor.dev/docs/architecture_guide/performance/
Disclaimer: I work on this product but wasn't involved in this decision.
gvisor isn't simply a process isolation model. Security experts will certainly sign off on gvisor for some multitenant workloads. The reason Google is moving from it, to the extent they are, is that hypervisors are more performant for more common workloads.
I read "we got tired of reimplementing Linux kernel syscalls and functionality" as the reason. Like network file systems. The Cloud Run client base kept asking for more and more features, and they punted to just running the Linux kernel.
I have seen zero evidence of this; but if it's true I would love to learn more. The real action is in side channel vulnerabilities bypassing all manner of protections.
But this is because the workloads they execute changed, right? HTTP-only before, to more general code today. I didn't see anything there that said gvisor was inferior, only that a new requirement was full kernel API access. For latency sensitive ephemeral and constrained workloads gvisor/seccomp can make a lot of sense and, in the case of Google, handle multi-tenancy.
Now if workloads become less ephemeral and more general purpose, tolerance for startup latency goes up, and the probability of bespoke needs goes up, making VMs more palatable.
gVisor uses KVM or ptrace as its sandbox layer, and there's some indications that Google's internal fork uses an unpublished kernel mechanism, perhaps by extending seccomp (EDIT: It seems this has made its way to the outside world since I last looked. `systrap` is now default: https://gvisor.dev/docs/architecture_guide/platforms/ ). It's fake-kernel-in-userspace then sandboxed by seccomp.
Saying gVisor is "ultimately enforced by a normal kernel" is about as misleading & accurate as "KVM is enforced by a normal kernel" -- it is, but it's a very narrow boundary, not the usual syscall ABI.
I think Bryan Cantrill founded a company (Joyent? or Triton?) to do just that several years ago. It may have been based on Solaris/SmartOS zones, which is that exact use case w/ very secure/isolated containers.
Although it came with Linux binary compat (of unknown quality), I think the Solaris thing was just too off-putting for most customers and the company did not do very well.
Triton is now being developed by MNX Solutions and seems to be doing quite well.
We run Triton and SmartOS in production and the Linux compatibility works via lx-zones just fine. Only some of the Linux-locked software, which usually means Docker, needs to go inside a bhyve VM.
Well, they're not "full-blown" machines, in that they do cut out a lot of things unnecessary for Lambda's (and incidentally, fly.io's) use case. ACPI is one example given in the article.
But yes, they do virtualize hardware not the kernel. I'm willing to bet you could swap out vanilla containerd with firecracker-containerd for most users and they wouldn't notice a difference given they initialize so fast.
The difference is mostly noticeable in that the guest kernel takes up some RAM[1]. If you were really packing things tight, wasting some megabytes per container could start to hurt.
[1]: And for the non-hyperscalers with less tuning, you may be buffering I/O pages both in the guest and the host.
I'm very surprised the standard isn't to build a microkernel that emulates Linux userspace (or *NIX userspace) and is tailored towards the subset of virtual hardware that Firecracker and QEMU provide. I don't get the impression that implementing a new target for a PL is all that difficult, so if you create a pseudo-OS like WASI/WASM and send PRs to the supported languages you could cut out most of the overhead.
The "hardest" part is probably sufficiently emulating Linux userspace accurately: it's a big surface area. That's why I think creating a pseudo-OS target is the best route.
No, I'm not. Gvisor is a security layer around Linux containers that emulates and constrains syscalls. It specifically runs on top of a container platform and kernel. What I'm suggesting is a stripped down Linux-like kernel that is really good at running exactly one process. I'm describing a microkernel.
gvisor emulates a big chunk of the Linux system call interface, and, remarkably, a further large chunk of Linux kernel state. It's architecturally similar to uml (though not as complete; like Firecracker, it's optimized to a specific set of workloads).
gvisor is not like a seccomp-bpf process sandbox that just ACLs system calls.
Ok, I oversimplified a bit. Regardless, I'm suggesting something that still runs in emulated hardware isolation and implements drivers for Firecracker/QEMU's subset of hardware.
gvisor does emulate some hardware. See, for instance, its network stack.
At any rate: why is this better than just using KVM and Firecracker? The big problem with gvisor is that the emulation you're talking about has pretty tough overhead.
Lets you get the best of both gvisor and Firecracker: efficient use of resources (i.e. not running a full Linux kernel + scheduler and, most importantly, network stack for every lambda) while getting the isolation that comes from virtualization. You can achieve this in one of two ways: make a new kernel and add support for targeting it in the supported languages, or strip the Linux kernel down and reimplement the parts that aren't optimized for your short-lived VM lifecycle (scheduler, network stack, etc.).
Stripping the Linux kernel down is what people do with Firecracker. I'm curious what savings you see in the Linux networking stack. You could compile it out and just rely on vsocks, but now you're breaking everyone's code and you're not winning anything on performance.
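To make the "breaking everyone's code" point concrete, here's a toy guest-side sketch (assuming Linux, the libc crate, and a hypothetical host-side agent listening on vsock port 5000) of what replacing `TcpStream::connect(("10.0.0.1", 5000))` with a vsock connection looks like: no IPs, no routing, just a CID and a port.

    use std::io::Error;
    use std::mem;

    fn main() -> Result<(), Error> {
        const HOST_CID: u32 = 2; // VMADDR_CID_HOST: the well-known CID of the host
        const PORT: u32 = 5000;  // hypothetical port the host agent listens on

        unsafe {
            let fd = libc::socket(libc::AF_VSOCK, libc::SOCK_STREAM, 0);
            if fd < 0 {
                return Err(Error::last_os_error());
            }

            // Address the host by context ID + port instead of IP + port.
            let mut addr: libc::sockaddr_vm = mem::zeroed();
            addr.svm_family = libc::AF_VSOCK as libc::sa_family_t;
            addr.svm_cid = HOST_CID;
            addr.svm_port = PORT;

            let rc = libc::connect(
                fd,
                &addr as *const _ as *const libc::sockaddr,
                mem::size_of::<libc::sockaddr_vm>() as libc::socklen_t,
            );
            if rc != 0 {
                return Err(Error::last_os_error());
            }

            let msg = b"hello from the guest\n";
            libc::write(fd, msg.as_ptr().cast(), msg.len());
            libc::close(fd);
        }
        Ok(())
    }

Nothing that expects BSD sockets over AF_INET (which is to say, almost every networked program and library) works unmodified against this, which is the "breaking everyone's code" part.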
Perhaps I'm off base (I'm not an expert in this area), but I recall reading that one of the major challenges with Lambda was the latency that initializing the network stack introduces. Perhaps that's been solved by now, but my naive idea is to have the guest not really run its own network stack (at least the MAC/IP portion of it) and instead delegate the entire stack (IP and all) to the virtual device, which can be implemented by Firecracker/QEMU/whatever. I guess at that point, the amount of mangling you'd need to do to the kernel probably isn't worth it and you should just use Gvisor... ah oh well.
Regardless, I'm still surprised microkernels aren't more popular in this space, but perhaps losing the ecosystem of Linux libs/applications is a non-starter.
Even if the idea wasn't fruitful, the conversation was fun. Thanks for engaging and challenging my bad ideas!
Edit: I've also realized I was thinking of Unikernels, not microkernels and I've been calling it the wrong thing all night. *sigh*
FWIW, Linux itself has plenty of support for TCP offload engines. I don't think Firecracker uses that at the moment, but there's no reason why it has to be that way if that's a true bottleneck in the system.
I think the Firecracker team has stated that PCIe bypass wasn't something they wanted to do, so I don't see how they'd open up to other accelerators' bypass method. But seems like building a vmm from the rust-vmm toolbox is 'not that hard' and there are some PCIe bypass crates already, so... Have fun?
I'm not saying an actual NIC with TCP offload, but instead something like adding TCP offload to the virtio nic if for some reason initialization of the network stack was a latency problem. If the VM is only running at layer 3/4, most init latency problems disappear.
This is exactly what I'm referring to. You have a pool of virtual NICs on the host in user-space created by the VM runtime that get assigned to the guest on provision-time, which just passes through any commands/syscalls (bind/connect/etc.) via a funky driver interface. You'd have to mangle the guest kernel syscalls or libc though, it might be really ugly.
> You'd have to mangle the guest kernel syscalls or libc though, it might be really ugly.
You wouldn't have to. There's patches for hardware TCP offload using the normal socket syscall ABI. The kernel net stack maintainer is pretty ideologically against them so they're not mainlined, but plenty of people have been running them in production for decades and they're quite mature.
> Linux itself has plenty of support for TCP offload engines
Could you link to any specific Linux kernel source that implements support for TCP offload? AFAIK networking subsystem maintainers were always opposed to accommodating TCP offload because it is A Bad Idea.
At its simplest, TCP offload can be just letting the hardware chunk a large packet into smaller ones on the wire. I don't think anything trying to offload more than that has really seen much daylight outside of very proprietary "smart NICs".
That's not what a microkernel is. A microkernel is a kernel that pushes services traditionally included in the kernel, such as networking, out into userspace.
The closest things to what you're describing are unikernels and NetBSD rump kernels.
It's a unikernel really only if you rip out any security boundaries inside the VM and link the kernel into the app. If you still maintain a syscall boundary inside the VM, it's just another kernel.
gVisor is roughly a sandwich of:

1. The Linux syscall ABI, trapped by its platform layer
2. A userspace reimplementation of parts of the Linux kernel
3. A narrow set of calls to the outside using Linux syscalls

This thing could be envisioned as:

1. The Linux syscall ABI as-is, same mechanism[a]
2. A reimplementation of parts of the Linux kernel
3. virtio hardware drivers for the calls to the outside
So the middle part of the sandwich could look the same.
Also, it's worth saying that I think the work of maintaining #2 there is exactly why Google Cloud Run migrated away from gVisor. People just kept asking for more and more kernel features.
[a]: Alternatively, #1 could be replaced with unikernel like linking directly with #2.
Me personally, I think HTTP/3's move away from TCP could be really interesting for this sort of stuff. The responsibilities of the kernel could be hugely simplified if all you had were UDP/IP directly hardcoded to virtio (no need for routing, address configuration, ARP, etc), no paging etc, and the only filesystems were EROFS & tmpfs. Of course, Cloud Run's move away from gVisor shows that Enterprise clients would hate it.
How does this compare to Linux on Firecracker? I can find some numbers with a basic internet search, but I'm not sure if these numbers are comparable for various reasons (they're a few years old, it's unclear if they use the same method to measure boot times, or even have the same definition of "boot time").
FWIW, this is basically the same material -- after my BSDCan talk the FreeBSD Journal said "hey, that was a great talk, can you turn it into an article for us", and after the FreeBSD Journal article was published ;login: asked if they could republish it.
It's funny how many one-second pauses turn out to be less than necessary. How many sysadmins took meaningful action because the system paused when they had an invalid machine UUID?
Probably a significant proportion of the sysadmins who experienced that one-second pause.
The "print a message telling the user that we're rebooting, then wait a second to let them read the console before we go ahead and reboot", on the other hand...
Maybe I'm just different. I've watched a great many openbsd boot sequences, which tend to have a great many pauses, and I've never paid any special attention to the lines that come before pauses vs lines that come after pauses.
I suspect that FreeBSD has fewer pauses than OpenBSD... especially after the work I've done over the past few years to speed it up.
If anyone in the OpenBSD world is interested in speeding up your boot process I'd be happy to share tips and code. It's a bit daunting to start with but with some good tools it becomes a lot of fun.
Thank you! That would be me. I am clueless on amd64 on how to speed up the boot process and maybe change a few fonts during boot.
I understand the reasons for no "how-tos". But sometimes they make sense for people like me. I wouldn't mind delving a bit deeper given some direction.
The first thing you'll want to do is port my TSLOG code to the OpenBSD kernel and start instrumenting things there. Send me an email, there's too much detail to get into for an HN comment.
if the machine boots in 20ms, I think that message is actually useful, because something would reboot your machine and you'd think you got logged out because you blinked
I don't want to sound snooty, but what use-cases do Firecracker instances and the like serve?
I use FreeBSD for everything from my colocated servers to my own PC. By no means am I a developer; seasoned Unix admin at best. Bare-metal forever, but welcome to the future. Especially anything that contributes to the OS.
However I hear buzz words like Lambda and Firecracker and really have no idea where the usage is. I get Docker, containers, barely understand k8s, but why do you need to spin up a VM only to tear it down, compared to just spinning up a VM and using it when you really need it? Always there, always ready.
Is it purely a cloud experience, cost saving exercise?
Instances of an application are created as part of the request/response lifecycle.
Allows you to build a compute plane where any node in the plane can service the traffic for any application.
Any one application can dynamically grow to consume the available free compute of the plane as needed in response to changes in traffic patterns.
Applications use no resources when they aren't handling traffic.
Growing the capacity of the compute plane means bringing more nodes online.
Can't come up with a use case for this beyond managing many large-scale deployments. If you aren't working "at scale" this is something that would sit below a vendor boundary for you.
The main use case is for seldom-used APIs. If I run a service where the API isn't used often, but I need it quick when it is, Lambdas or something like them are perfect.
As it turns out, a lot of APIs for phone apps fit this category. You don't want a machine sitting around idle 99% of the time to answer those API calls.
If you rarely need an API but set something like this up just to use it rarely, it seems one needs to write their own code for this functionality and not jump through hoops to run someone else's. That just sounds so bizarre.
Not someone else's API, your own. You make an app. It needs an API that you create (perhaps to sync up scores or something). You don't want to run a machine full time just to accept one score update a day.
Just about every single company can benefit from scaling as traffic is never consistent 24/7. Most don't bother as the effort outweighs the savings, but the potential is there. Things like lambda and firecracker make it much easier.
It's partly a cost saving exercise, but also: running "chroot /var/empty /some/shitty/code" or putting "chroot /var/empty /some/shitty/code" in inetd.conf is useful. On today's super-fast machines, Firecracker starts fast enough to support such interactive uses, while giving you the extra security of a VM (i.e. it greatly restricts what parts of the kernel and/or localhost the shitty code can talk to).
> why do you need to spin up a VM only to tear it down, compared to just spinning up a VM and using it when you really need it? Always there, always ready
Firecracker has much smaller overhead compared to regular VMs, which makes the (time and compute) costs of spinning up new VMs really low. This can be an advantage depending on how chunky your workloads are: the less chunky they are, the more they can take advantage of finer-grained scaling.
IoT devices can execute short-lived actions by calling remote functions. The provider wants complete isolation, wipes these micro-VMs after every few seconds, and lets the user pay for use. The response from these can be anything: voice, data or API responses.
FaaS, function as a service. Depending on how the software is packaged and what is expected, the richness a VM like Firecracker provides may be useful. Many of these tradeoffs are about velocity: I can run X easily on Y.
> However I hear buzz words like Lambda and Firecracker and really have no idea where the usage is.
Sometimes you just want to slap some lines of code together and run them from time to time, and don’t need a whole server (physical or virtual) for that.
Sometimes you have no idea if you’ll have to run a piece of code 100 times a day or 10’000’000 times a day.
Sometimes you don’t feel like paying a whole month for and maintaining a whole instance for a cronjob that lasts 20 seconds, and maybe it runs once a week.
It's a shame neither AWS nor macOS on ARM support nested virtualization. It would make it far easier to develop and deploy Firecracker based tech.
Toyed around with Firecracker a bit. It does what it promises on boot times, but it's still a pretty gnarly experience. E.g. after doing a victory dance for getting it to boot, I was rather deflated to find out that getting networking working takes another lengthy tutorial.
I think there is definitely room for someone to add a lot of value to this by creating some automation tools. It would be really nice to be able to download a single binary, fire it up, have both a web interface and an API available, be able to configure it quickly, have it download whatever it needs for you etc.
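For a flavour of what such a tool would wrap, here's a rough sketch of driving Firecracker's API socket by hand from Rust. The socket path, kernel image and rootfs paths are made up, and the exact JSON fields follow the API docs as I remember them, so treat it as an outline rather than gospel:

    use std::io::{Read, Write};
    use std::os::unix::net::UnixStream;

    // Send one PUT request over the API's Unix socket and return the raw response.
    fn put(sock: &str, path: &str, body: &str) -> std::io::Result<String> {
        let mut stream = UnixStream::connect(sock)?;
        let req = format!(
            "PUT {path} HTTP/1.1\r\nHost: localhost\r\nContent-Type: application/json\r\nContent-Length: {len}\r\n\r\n{body}",
            len = body.len()
        );
        stream.write_all(req.as_bytes())?;
        let mut buf = [0u8; 1024];
        let n = stream.read(&mut buf)?;
        Ok(String::from_utf8_lossy(&buf[..n]).into_owned())
    }

    fn main() -> std::io::Result<()> {
        // Assumes `firecracker --api-sock ./firecracker.sock` is already running
        // and ./vmlinux plus ./rootfs.ext4 exist.
        let sock = "./firecracker.sock";
        for (path, body) in [
            ("/boot-source",
             r#"{"kernel_image_path": "./vmlinux", "boot_args": "console=ttyS0 reboot=k panic=1"}"#),
            ("/drives/rootfs",
             r#"{"drive_id": "rootfs", "path_on_host": "./rootfs.ext4", "is_root_device": true, "is_read_only": false}"#),
            ("/actions", r#"{"action_type": "InstanceStart"}"#),
        ] {
            let resp = put(sock, path, body)?;
            // Print just the status line of each response.
            println!("{path}: {}", resp.lines().next().unwrap_or(""));
        }
        Ok(())
    }

Three PUTs and the VM is booting, which is also roughly what firecracker-containerd and the various wrappers do for you; the networking part (tap devices, routes) is the bit that's still left as an exercise.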
The loader+initrd+userspace time there sounds unreasonable.
I've had bog-standard libvirt qemu-kvm (not a microvm) creating a new Ubuntu VM from a disk image & booting to a login prompt in under 10 seconds for more than a decade. This is without fiddling with virtio, while still doing hardware scans for PCI, VGA, SATA and such, and booting via Grub (your "loader"). Those should be pretty comparable!
Ah, yes, you can ignore the loader + initrd. I'm multi-boot so the loader has a 2-second timeout, and the initrd is waiting for me to input my LUKS password.