Predictive CPU isolation of containers at Netflix (2019)

high_na_euv · 2024-04-09T09:55:02 1712656502

It seems like we are more and more getting away from OSes managing our resources

Runtimes/vms implement memory management, varius threading techniques and things like we see here

Maybe in the future we will entirely skip OS's overhead and run apps directly on HW and they will manager themselves more efficiently (their runtimes/vms like jvm clr)

dist1ll · 2024-04-10T23:47:26 1712792846

That's definitely a trend we're moving towards for extremely high-performance software. General-purpose operating systems often don't sit at the right level of abstraction, and lack the flexibility for certain demanding workloads. Kernel-bypass networking is the gold standard for low-latency, high throughput networking. Serverless platforms often rely on userspace schedulers and userspace page fault handlers.

That's one of the reasons unikernels seem to be a promising way forward. It opens up a bunch of opportunities, including language-based safety, opportunities for compile-time optimizations, and just seems to mirror more closely how we wish to run & deploy modern applications (declarative, immutable and ideally with a bare minimum of dependencies).

cj · 2024-04-09T11:24:09 1712661849

Is it node that still limits all processes to 2gb or something by default? (I think their rationale was “it’s a v8 flag so we don’t touch it”)

piyh · 2024-04-10T04:07:43 1712722063

TIL

https://nodejs.org/api/cli.html#--max-old-space-sizesize-in-...

actionfromafar · 2024-04-09T12:18:43 1712665123

Upside is that it makes sure your stuff can be deployed on 32-bit.

lexicality · 2024-04-09T12:29:59 1712665799

Does that come in handy often?

hinkley · 2024-04-11T16:03:59 1712851439

And yet pointer compression is turned off by default.

yencabulator · 2024-04-10T21:06:55 1712783215

More like "kernel programming is hard, let's put fancier logic and RPC in userspace". Which sounds perfectly sane.

mochomocha · 2024-04-10T21:52:57 1712785977

(I'm the author of the blog post)

Beyond "kernel programming is hard", there are a few other reasons why it made sense for us:

- observability & maintenance: much easier to implement and ship this type of changes in userspace than rolling out a kernel fork. We also built custom AB infra to be able to evaluate these optimizations.

- the kernel is really good at making reasonable decisions at high-frequency based on a limited amount of data and heuristics. But these decisions are far from optimal in all scenarios. In contrast in user-space we can make better decisions based on more data (or ML predictions), but do so less frequently.

tptacek · 2024-04-10T23:15:26 1712790926

Meh, not really? This seems more analogous to memory allocator optimization, where your libc malloc() is "optimized" to give adequate performance to all sorts of different allocation patterns, but you can do much better if you know a priori what your application's actual pattern will be. Just swap out "malloc()" with "the process scheduler" here.

Rohansi · 2024-04-10T23:47:48 1712792868

Worth watching The Birth and Death of JavaScript for more information on this hypothetical future. In theory you could get rid of system call and virtual memory overhead to make something like JavaScript run at "fully native" speed. Because removing the OS-related overhead counters the loss for being JavaScript. This is only really a viable future for managed languages because the runtime would need to ensure memory safety, isolation between "processes", etc. which they mostly do already anyway.

https://www.destroyallsoftware.com/talks/the-birth-and-death...

exe34 · 2024-04-09T10:39:28 1712659168

Could build them on top of unikernels.

gpderetta · 2024-04-09T10:48:44 1712659724

wash, rinse, repeat.

shadowpho · 2024-04-09T03:13:13 1712632393

This is amazing, they use ML to predict utilization on the fly

dang · 2024-04-10T21:55:56 1712786156

Predictive CPU Isolation of Containers at Netflix - https://news.ycombinator.com/item?id=20096699 - June 2019 (1 comment)

yencabulator · 2024-04-10T21:11:08 1712783468

See also ghOSt by Google (2021):

https://storage.googleapis.com/pub-tools-public-publication-...

https://github.com/google/ghost-userspace

https://github.com/google/ghost-kernel

vient · 2024-04-11T01:15:02 1712798102

Also sched-ext which seems close to be mainlined and is already a default scheduler in CachyOS:

https://github.com/sched-ext/scx

Sparkyte · 2024-04-09T10:56:21 1712660181

Kind of an old article. It is pretty straight forward thing to do. If you spend enough time accurately load testing your environments you can dial in the container resources and shave thousands of dollars. Lots of places are too scared of under allocating. Limit and request exist for a reason. Limit is for surge and request is what is always guaranteed. It is okay to exceed your request as long as you balance add a scaling policy to balance out the surge. And be cautious with request and limit on memory not all applications benefit from this.

burutthrow1234 · 2024-04-09T12:43:23 1712666603

They're automatically predicting the limit _and_ figuring out binpacking into hyperthreaded CPUs and NUMA cores. K8s just pushes your supplied values down to the kernel, which is exactly what they're saying is inefficient.

Sparkyte · 2024-04-10T09:41:11 1712742071

It is indeed inefficient so this is more like a process lasso approach to the resource management?

hinkley · 2024-04-11T16:08:19 1712851699

If the number of servers needed for service A is proportional to the number of servers needed for service B-Z, then your whole cluster scales up and down together and you have a situation where the max cluster size is hit regularly instead of almost never. For private servers that’s a big problem. But if you’re a large enough customer for a cloud provider it can still be a problem.

You save money still, but you don’t solve your capacity problems by doing so.