Hacker News

> I don't understand - what do you suggest as an alternative to Gvisor?

The performance docs you linked claim, vs. runc: 20-40x syscall overhead, half of Redis's QPS, and a 20% runtime increase on a sample TensorFlow script. Also google "CloudRun slow" and "Digital Ocean Apps slow"; both run on gVisor.

Literally anything else.




A while back, I was the original author of that performance guide. I tried to lay out the set of performance trade-offs in an objective and realistic way. It is shocking to me that you're spending so much time commenting on a few figures from there, ostensibly without reading it.

System call overhead does matter, but it's not the ultimate measure of anything. If it were, gVisor with the KVM platform would be faster than native containers (see the runsc-kvm data point, which you've ignored for some reason). But it is obviously more complex than that alone. For example, let's drill down and ask: how is it even possible to be faster? The default Docker seccomp profile itself installs a BPF filter that slows system calls by 20x! (And this path does not apply within the guest context.) On that basis, should you start shouting that everyone should stop using Docker because of the system call overhead? I would hope not, because looking at any one figure in isolation is dumb; consider the overall application and architecture. Containers themselves have a cost (higher context-switch time due to cgroup accounting, the cost of devirtualizing namespaces in many system calls, etc.), but it's obviously worth it in most cases.
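For context, raw syscall latency figures like the ones being argued over are typically produced by a microbenchmark along these lines. This is a minimal sketch, not the methodology from the gVisor guide; the iteration count and the choice of `getpid()` as the cheap syscall are my assumptions.

```python
import os
import time

def ns_per_syscall(iterations=200_000):
    """Rough mean wall-clock nanoseconds per os.getpid() call."""
    start = time.perf_counter_ns()
    for _ in range(iterations):
        os.getpid()  # a near-trivial syscall, so the loop measures dispatch cost
    end = time.perf_counter_ns()
    return (end - start) / iterations

if __name__ == "__main__":
    print(f"~{ns_per_syscall():.0f} ns per getpid()")
```

Run under runc, runsc-ptrace, and runsc-kvm, a loop like this isolates exactly one number, which is the point being made above: that number alone says little about whole-application performance.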

The Redis case is called out as a worst case: the application itself does very little beyond dispatching I/O, so almost everything manifests as overhead. But if you're doing something where the overhead is 20%, you need hard security boundaries, and fine-grained multi-tenancy can lower costs by 80%, it might make perfect sense. If something doesn't work for you because your trade-offs are different, just don't use it!
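The arithmetic behind that trade-off is worth spelling out. A sketch using the numbers from the paragraph above (20% runtime overhead, 80% cost reduction from denser multi-tenancy); both figures are illustrative, not measurements:

```python
def net_cost(baseline=1.0, runtime_overhead=0.20, density_savings=0.80):
    """Net cost after paying sandbox overhead but packing tenants more densely."""
    return baseline * (1 + runtime_overhead) * (1 - density_savings)

print(round(net_cost(), 2))  # 0.24: roughly 76% cheaper despite the 20% overhead
```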


> it is shocking to me that you’re spending so much time commenting on a few figures from there

You give me too much credit! They were copy-pastes to the same commenter, who had replied to me in a few places in the thread. I did that precisely to avoid spending too much time responding!

> because looking at any one figure in isolation is dumb

So the self-reported performance figures are bad, there are hundreds of web pages and support threads reporting slow performance and slow startup from first-hand experience, and there are Google-hosted documentation pages on how to improve app performance for CloudRun (probably the largest user of gVisor, and its creator; can I assume they know how to run it?), including gems like "delete temporary files" and a blog post recommending "using global variables" (I'm not joking). And the accusation is "dumb" cherry-picking? Huh?

Also, if I'm not wrong, CloudRun is GCP's main (only, besides managed K8s?) PaaS container runtime. Presenting it as a general container runtime with ultra-fast scaling, when people online are reporting 30-second startup times for basic Python/Node apps, is a joke. These trade-offs should also be highlighted somewhere on those sales pages, but they're not.

This is the last I'm responding to this thread. Also my apologies to the Coder folks for going off topic like this.


I don’t think copy/pasting the same response everywhere is better.

IIRC, CloudRun has multiple modes of operation (fully-managed and in a K8s cluster) and different sandboxes for the fully-managed environment (VM-based and gVisor-based). Like everything, performance depends a lot on the specifics. For example, network throughput depends far more on the network path (e.g. are you using a VPC connector?) than on the specific sandbox or network stack (i.e. if you want to push 40 Gbps, spin up a dedicated GCE instance). Similarly, the lack of a persistent disk is a design choice for multiple reasons; if you need a lot of non-tmpfs disk or persistent state, CloudRun might not be the right place for the service.

It sounds like you personally had a bad experience or hit a sharp edge, which sucks and I empathize with. But I think you can just be concrete about that rather than projecting with system call times. (I'd be happy to tell you why the gen1 sandbox would be slow for a typical heavy Python app doing 20,000 stats on startup; it's real, but it's not really system calls or anything you're pointing at. Either way, you could just turn on gen2 or use other products, e.g. GCE containers, GKE Autopilot, etc.)
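To see why per-operation overhead gets amplified at startup specifically, a back-of-envelope sketch: the 20,000 stats figure is from the comment above, but the extra per-stat latency is a number I picked purely for illustration, not a measured gVisor figure.

```python
def added_startup_ms(n_stats=20_000, extra_us_per_stat=25):
    """Extra startup latency if each stat() costs extra_us_per_stat more.

    extra_us_per_stat is a hypothetical value for illustration only.
    """
    return n_stats * extra_us_per_stat / 1000  # microseconds -> milliseconds

print(added_startup_ms())  # 500.0 ms of added startup under these assumed numbers
```

The same per-call delta that is invisible in steady-state serving becomes user-visible when an interpreter walks the filesystem thousands of times at import time.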

I'm not sure what's wrong with advice about optimizing for a serverless platform (like using global variables). I don't think it would be sensible to recompute/rebuild application state on every request on any serverless platform from any provider.



