> To add GPU support, the Google team introduced nvproxy which works using the same principles as described above for syscalls: it intercepts ioctls destined to the GPU and proxies a subset to the GPU kernel module.
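As I understand it, the principle is something like this filter-and-forward sketch (illustrative Go, not gVisor's actual code; the request number is made up):

    // Sketch of the filter-and-forward idea: intercept each ioctl
    // aimed at the GPU device node and forward only an allowlisted
    // subset to the real kernel module.
    package nvproxysketch

    import (
        "fmt"

        "golang.org/x/sys/unix"
    )

    // allowedIoctls is a hypothetical allowlist of request numbers.
    var allowedIoctls = map[uint]bool{
        0xc020462a: true, // made-up request number, illustrative only
    }

    // proxyIoctl forwards an intercepted ioctl to the real device fd
    // only if its request number is on the allowlist.
    func proxyIoctl(fd int, req uint, arg uintptr) error {
        if !allowedIoctls[req] {
            return fmt.Errorf("ioctl 0x%x not in allowlist", req)
        }
        _, _, errno := unix.Syscall(unix.SYS_IOCTL, uintptr(fd), uintptr(req), arg)
        if errno != 0 {
            return errno
        }
        return nil
    }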
This does still expose the host's kernel to a potentially malicious workload, right?
If so, could this be mitigated by (continuously) running a QEMU VM with GPUs passed through via VFIO, and running whatever Workers need within that VM?
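To make that concrete, the setup I have in mind looks roughly like this (hypothetical PCI address; assumes the IOMMU is enabled and driver_override has already pointed the device at vfio-pci):

    // Rough sketch: hand a GPU to a long-lived QEMU VM via VFIO, then
    // run untrusted workloads only inside that VM. Hypothetical PCI
    // address; error handling kept minimal.
    package main

    import (
        "os"
        "os/exec"
    )

    func main() {
        const bdf = "0000:01:00.0" // hypothetical GPU PCI address

        // Bind the device to vfio-pci (assumes driver_override is set
        // and the device was unbound from its host driver).
        if err := os.WriteFile("/sys/bus/pci/drivers/vfio-pci/bind", []byte(bdf), 0200); err != nil {
            panic(err)
        }

        // Boot the VM with the GPU passed through; guest workloads
        // then talk to the guest kernel's GPU driver, not the host's.
        cmd := exec.Command("qemu-system-x86_64",
            "-enable-kvm",
            "-m", "16G",
            "-device", "vfio-pci,host="+bdf,
            "-nographic",
        )
        cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
        if err := cmd.Run(); err != nil {
            panic(err)
        }
    }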
The Debian ROCm Team faces a similar challenge: we want to run CI [1] for our stack and all of our dependent packages, but cannot rule out potentially hostile workloads. We spawn a QEMU VM per test (instead of the model described above), but that's because our tests must also run against the relevant distribution's kernel and firmware.
Incidentally, I've been monitoring the Firecracker VFIO GitHub issue linked in the article. Upstream has no use case for this and thus no resources dedicated to implementing it, but there's a community meeting [2] coming up in October to discuss the future of the feature request.

[1]: https://ci.rocm.debian.net
[2]: https://github.com/firecracker-microvm/firecracker/issues/11...
I’ve been looking at distributed CI and for now I’m just going to be running workloads queued by the owner of the agent. That doesn’t eliminate hostile workloads but it does present a similar surface area to simply running the builds locally.
I’ve been thinking about QEMU or Firecracker instead of just containers for a more robust solution. I have some time before anyone would ask me about GPU workloads, but do you think Firecracker is on track to get there, or would I be better off learning QEMU?
Amazon/AWS has no use case for VFIO in Firecracker. They're open to the community adding support and have a community meeting soon, but I wouldn't get my hopes up.
QEMU can work -- I say can, because it doesn't work with all GPUs. And with consumer GPUs, VFIO is generally not an officially supported use case. We got it working, but with lots of trial and error, and there are still some problematic corner cases.
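If anyone wants to attempt it: much of that trial and error starts with IOMMU groups, since everything in the GPU's group has to be passed through together, and consumer boards often group devices awkwardly. A quick Go sketch to list the groups before you commit to a layout:

    // List IOMMU groups and the devices in each, so you can see what
    // would actually have to be passed through alongside the GPU.
    package main

    import (
        "fmt"
        "os"
        "path/filepath"
    )

    func main() {
        groups, err := os.ReadDir("/sys/kernel/iommu_groups")
        if err != nil {
            panic(err) // no groups usually means the IOMMU is disabled
        }
        for _, g := range groups {
            devs, err := os.ReadDir(filepath.Join(
                "/sys/kernel/iommu_groups", g.Name(), "devices"))
            if err != nil {
                continue
            }
            for _, d := range devs {
                fmt.Printf("group %s: %s\n", g.Name(), d.Name())
            }
        }
    }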
What would you say is the sort of time horizon for turnkey operation of one commonly available video card, half a dozen, and OEM cards in high-end laptops (e.g., MacBook Pro)? Years? Decades? Heat death?
I don't think I fully understand your question. If by turnkey operation you mean virtualization, enterprise GPUs already officially support it, and it already works with consumer GPUs, at least the discrete ones.
> If the calls first pass through a memory safe language as what gvisor does, isn’t the attack surface greatly reduced?
The runtime may be memory safe, but I'm thinking of the GPU workloads which nvproxy seems to pass on to the device via the host's kernel. Say I find a security issue in the GPU's driver, and manage to exploit it with some malicious CUDA workload.
Would having a VM in between help in that case? It seems like protecting against malicious GPU workloads requires the GPU itself to offer virtualization to avoid this exploit.
This is helpful in explaining why AWS hasn't been excited to ship this use case in Firecracker.
It would probably not stop all theoretically possible attacks, but it would stop many of them.
Say you find a bug in the GPU driver that lets you execute arbitrary code as root. That still all happens within the VM. To attack the host, you'd still need to break out of the VM, and if the VM is unprivileged (which I assume it is), you'd next need to gain privileges on the host.
There are other channels -- perhaps you can get the GPU to do something funky at the PCI level, perhaps you can get the GPU to crash the host -- but VM isolation does add a solid layer of protection.
I'm not familiar with this case's specifics, but AWS also has an approach of virtualizing actual hardware interfaces (like NVMe/PCIe) to the host through dedicated hardware/firmware. I wouldn't be surprised if their solution was to map (partitions of) physical devices as a "hardware" device on the host and pass them directly through to the Firecracker instances, especially if they can isolate multiple Firecracker/Lambda instances of a customer to a single physical device.