Hacker News

In a world where most workloads are containerized, and where each container can be pinned to a NUMA region, does it really matter?



k8s, by default, is oblivious to NUMA topology. You have to enable unreleased features and configure them correctly, which is the unwanted complexity to which I referred earlier. Simply aligning your containers to NUMA domains does not solve the problem that your arriving network frames or your NVMe completion queues can still be on the wrong domain. Isn't it simpler to just have 1 socket and not need to care? The number of cores available on a single socket system is pretty high these days, and in general the 1S parts are cheaper and faster.
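For concreteness, the features in question are the kubelet's CPU Manager and Topology Manager, which have to be switched on and configured together. A sketch of the relevant kubelet flags (the same settings can live in a KubeletConfiguration file):

```shell
# Sketch: kubelet settings required before k8s will align pods to NUMA domains.
# Note these policies only apply to Guaranteed-QoS pods with integer CPU requests.
kubelet \
  --cpu-manager-policy=static \
  --topology-manager-policy=single-numa-node
```

Even with this in place, the alignment covers CPUs and devices handed to the pod, not where the kernel delivers interrupts or DMA completions, which is the point about network frames and NVMe queues above.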


generally a huge fan of kubernetes, but it's stunning what a did-it-ourselves dirtbag k8s opted to be every step of the way with regard to scheduling.

Facebook has given really, really good talks about managing process scheduling at scale, describing how they leverage cgroups to do the right thing.

kubernetes seems to not care at all. they have their own resource systems they cooked up. everything gets scheduled into one huge cgroup. any ordering or control happens in userland, totally ignorant of the kernel's controls. there are no hierarchies, no priorities; everything is absolute: schedule or die. it's so unbelievably, willfully ignorant of all the good kernel technology that exists. it tries to make sure the kernel never has a role, and that's just a huge mistake, deeply tragic.

one notable side effect is that while the kernel has many ways to make multi-tenant scheduling fairly reasonable, kubernetes has a variety of wild hare-brained schemes, all of which detour around how easy the job would be if different pods could be scheduled in different cgroups. but that's somehow too blindingly obvious for kubernetes, which instead tries to mediate what to run entirely by itself.
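For what it's worth, the kind of kernel-side control being described looks roughly like this under cgroup v2 (a sketch with hypothetical group names; needs root and a cgroup2 mount):

```shell
# Hypothetical cgroup v2 hierarchy giving latency-sensitive pods
# proportionally more CPU than batch pods under contention.
cd /sys/fs/cgroup
mkdir -p pods/latency pods/batch
echo "+cpu" > pods/cgroup.subtree_control   # enable the cpu controller for children
echo 400 > pods/latency/cpu.weight          # 4x the share of batch...
echo 100 > pods/batch/cpu.weight            # ...when both want CPU
echo "200000 100000" > pods/batch/cpu.max   # hard-cap batch at 2 CPUs (200ms/100ms)
```

With weights like these the kernel arbitrates contention itself; nothing has to be killed or rescheduled for latency-sensitive work to win.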


Yeah, it makes a lot of sense to go with single socket servers unless you can't scale horizontally (e.g. a database server). Why deal with the complexity when you can just sidestep it?


Why would you switch from a 100 GB/s NUMA interconnect (800 gigabits per second) to a 10 Gbit/s Ethernet fabric?

If you are scaling horizontally, NUMA is a superior fabric to Ethernet or InfiniBand (100 Gbit/s).

Horizontal scaling seems to favor NUMA: 1000 chips over Ethernet is less efficient than 500 dual socket nodes over Ethernet. Anything you can do over Ethernet seems easier and cheaper to do over NUMA instead.


I'm talking mostly about scaling things like app servers, where the instances might not need any communication.

But in general, if you can't scale horizontally at 10 Gbit/s, you're in for a world of hurt. NUMA gets you to 8x scale at best, on very expensive, very exotic hardware. And then you hit the wall.


I'm mostly talking about 2 socket servers, which are IIRC more common than even single socket servers.

Dual socket is cheap, easy, and common. If only to recycle fans, power supplies, and racks, it seems useful.


And single socket is equally cheap, except it takes twice the rack space - but it also gives you redundancy. One server can fail and you can carry on.

The advantage of memory bandwidth over Ethernet for scaling to 2x really doesn't matter. If it did, you're not horizontally scalable, and at best you've bought a little time before you hit the wall.

If the price difference isn't much, I would heavily prefer single socket.


Your scaling architecture sucks if it depends on that kind of throughput. If you need that, you've only kicked the can down the road to more capacity without a real scaling fix.


Depend on? Heavens no.

Dual socket has numerous advantages in density and rack space. The fact that performance is better is pretty much icing on the cake.

It's easier to manage 500 dual socket servers than 1000 single socket servers: less equipment, higher utilization of parts, etc.

The suggestion that dual socket NUMA is going away seems... just very unlikely to me. I don't see what the benefits would be at all, not just in performance but also in routine maintenance (power, Ethernet, local storage, etc.).


This is correct if your software is NUMA-optimized (or if auto-NUMA works well for you) but if it isn't you can end up with slowdowns.


Surely that can be fixed with a well-placed numactl command to set node and CPU affinity.
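Something like the following, assuming the workload fits on one node (`./myserver` is a placeholder binary):

```shell
# Show the machine's NUMA layout: nodes, their CPUs, and memory sizes.
numactl --hardware

# Run the workload with both its threads and its memory allocations
# pinned to node 0, so accesses stay local to that socket.
numactl --cpunodebind=0 --membind=0 ./myserver
```

`--membind` makes off-node allocation an error rather than a fallback; `--preferred=0` is the softer variant that spills to other nodes when node 0 is full.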

The root article is discussing rewriting code to fit on FPGAs. If NUMA is too complex then... I dunno. The FPGA argument seems dead on arrival.


Does any container runtime/orchestrator perform this optimization yet? Why wait?


Titus (Netflix's container orchestrator that I work on) does this via: https://github.com/Netflix-Skunkworks/titus-isolate


Kernel scheduling is NUMA-aware and will localize workloads. Threads will mostly have their RAM on the DIMMs local to their node, and the core a thread is scheduled on is also more likely to be local to the disk or NIC being used for I/O.

This is at least my experience, though I am no expert.
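One way to sanity-check that locality in practice is the kernel's per-node counters: `numa_hit` counts allocations satisfied on the intended node, while a growing `numa_miss`/`numa_foreign` means memory is landing off-node.

```shell
# Per-NUMA-node allocation counters maintained by the kernel; on a
# well-localized workload, numa_miss stays near zero relative to numa_hit.
grep -H . /sys/devices/system/node/node*/numastat
```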



