> To add GPU support, the Google team introduced nvproxy which works using the same principles as described above for syscalls: it intercepts ioctls destined to the GPU and proxies a subset to the GPU kernel module.
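As I understand it, the principle is something like this filter-and-forward sketch (illustrative Go, not gVisor's actual code; the request number is made up):

    // Sketch of the filter-and-forward idea: intercept each ioctl
    // aimed at the GPU device node and forward only an allowlisted
    // subset to the real kernel module.
    package nvproxysketch

    import (
        "fmt"

        "golang.org/x/sys/unix"
    )

    // allowedIoctls is a hypothetical allowlist of request numbers.
    var allowedIoctls = map[uint]bool{
        0xc020462a: true, // made-up request number, illustrative only
    }

    // proxyIoctl forwards an intercepted ioctl to the real device fd
    // only if its request number is on the allowlist.
    func proxyIoctl(fd int, req uint, arg uintptr) error {
        if !allowedIoctls[req] {
            return fmt.Errorf("ioctl 0x%x not in allowlist", req)
        }
        _, _, errno := unix.Syscall(unix.SYS_IOCTL, uintptr(fd), uintptr(req), arg)
        if errno != 0 {
            return errno
        }
        return nil
    }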
This does still expose the host's kernel to a potentially malicious workload, right?
If so, could this be mitigated by (continuously) running a QEMU VM with GPUs passed through via VFIO, and running whatever Workers need within that VM?
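To make that concrete, the setup I have in mind looks roughly like this (hypothetical PCI address; assumes the IOMMU is enabled and driver_override has already pointed the device at vfio-pci):

    // Rough sketch: hand a GPU to a long-lived QEMU VM via VFIO, then
    // run untrusted workloads only inside that VM. Hypothetical PCI
    // address; error handling kept minimal.
    package main

    import (
        "os"
        "os/exec"
    )

    func main() {
        const bdf = "0000:01:00.0" // hypothetical GPU PCI address

        // Bind the device to vfio-pci (assumes driver_override is set
        // and the device was unbound from its host driver).
        if err := os.WriteFile("/sys/bus/pci/drivers/vfio-pci/bind", []byte(bdf), 0200); err != nil {
            panic(err)
        }

        // Boot the VM with the GPU passed through; guest workloads
        // then talk to the guest kernel's GPU driver, not the host's.
        cmd := exec.Command("qemu-system-x86_64",
            "-enable-kvm",
            "-m", "16G",
            "-device", "vfio-pci,host="+bdf,
            "-nographic",
        )
        cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
        if err := cmd.Run(); err != nil {
            panic(err)
        }
    }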
The Debian ROCm Team faces a similar challenge: we want to run CI [1] for our stack and all of our dependent packages, but cannot rule out potentially hostile workloads. We spawn a QEMU VM per test (instead of the model described above), but that's because our tests must also run against the relevant distribution's kernel and firmware.
Incidentally, I've been monitoring the Firecracker VFIO GitHub issue linked in the article. Upstream has no use case for this and thus no resources dedicated to implementing it, but there's a community meeting [2] coming up in October to discuss the future of the feature request.

[1]: https://ci.rocm.debian.net
[2]: https://github.com/firecracker-microvm/firecracker/issues/11...
I’ve been looking at distributed CI and for now I’m just going to be running workloads queued by the owner of the agent. That doesn’t eliminate hostile workloads but it does present a similar surface area to simply running the builds locally.
I’ve been thinking about QEMU or Firecracker instead of just containers for a more robust solution. I have some time before anyone would ask me about GPU workloads, but do you think Firecracker is on track to get there, or would I be better off learning QEMU?
Amazon/AWS has no use case for VFIO in Firecracker. They're open to the community adding support and have a community meeting soon, but I wouldn't get my hopes up.
QEMU can work -- I say can, because it doesn't work with all GPUs. And with consumer GPUs, VFIO is generally not an officially supported use case. We got it working, but with lots of trial and error, and there are still some problematic corner cases.
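If anyone wants to attempt it: much of that trial and error starts with IOMMU groups, since everything in the GPU's group has to be passed through together, and consumer boards often group devices awkwardly. A quick Go sketch to list the groups before you commit to a layout:

    // List IOMMU groups and the devices in each, so you can see what
    // would actually have to be passed through alongside the GPU.
    package main

    import (
        "fmt"
        "os"
        "path/filepath"
    )

    func main() {
        groups, err := os.ReadDir("/sys/kernel/iommu_groups")
        if err != nil {
            panic(err) // no groups usually means the IOMMU is disabled
        }
        for _, g := range groups {
            devs, err := os.ReadDir(filepath.Join(
                "/sys/kernel/iommu_groups", g.Name(), "devices"))
            if err != nil {
                continue
            }
            for _, d := range devs {
                fmt.Printf("group %s: %s\n", g.Name(), d.Name())
            }
        }
    }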
What would you say is the sort of time horizon for turnkey operation of one commonly available video card, half a dozen, and OEM cards in high-end laptops (e.g., MacBook Pro)? Years? Decades? Heat death?
I don't think I fully understand your question. If by turnkey operation you mean virtualization, enterprise GPUs already officially support it, and it already works with consumer GPUs, at least the discrete ones.
> If the calls first pass through a memory safe language as what gvisor does, isn’t the attack surface greatly reduced?
The runtime may be memory safe, but I'm thinking of the GPU workloads which nvproxy seems to pass on to the device via the host's kernel. Say I find a security issue in the GPU's driver, and manage to exploit it with some malicious CUDA workload.
Would having a VM in between help in that case? It seems like protecting against malicious GPU workloads requires the GPU itself to offer virtualization to avoid this exploit.
This is helpful in explaining why AWS hasn't been excited to ship this use case in Firecracker.
It would probably not stop all theoretically possible attacks, but it would stop many of them.
Say you find a bug in the GPU driver that lets you execute arbitrary code as root. That still all happens within the VM. To attack the host, you'd still need to break out of the VM, and if the VM is unprivileged (which I assume it is), you'd next need to gain privileges on the host.
There are other channels -- perhaps you can get the GPU to do something funky at the PCI level, perhaps you can get the GPU to crash the host -- but VM isolation does add a solid layer of protection.
I'm not familiar with this case's specifics, but AWS also has an approach of virtualizing actual hardware interfaces (like NVMe/PCIe) to the host through dedicated hardware/firmware. I wouldn't be surprised if their solution was to map (partitions of) physical devices as a "hardware" device on the host and pass them directly through to the Firecracker instances, especially if they can isolate multiple Firecracker/Lambda instances of a customer to a single physical device.