Slide 26 is interesting - arguing that cloud providers have an advantage for future CPU design since they can analyze so many real world customer workloads directly.
In previous roles I have worked with CPU vendors who have been very keen on getting access to profiling data from our workloads for design optimization, and lamenting the fact that it was hard to get such data and they were often limited to synthetic benchmark workloads when tuning new designs.
So this argument does sound like a valid one, and does imply AWS etc will have significant advantages in future designs.
Thanks, I think it's the start of a new "cloud CPU" era.
Cloud vendors have already split workloads into different instance types, so directly analyzing their workloads and developing CPUs for each instance type will lead to further performance wins. This may prompt the creation of even more instance types just to further separate workload types for future CPU specialization.
In the future, products like AWS Outposts may become far more desirable than commodity hardware, as customers know they provide access to specialized CPUs and their performance. It's a path for cloud computing vendors to own the datacenters as well.
(Note that my predictions are not based on any internal knowledge: I'm just describing what I personally would be doing if I were a cloud vendor.)
He sort of implies that “just better hardware” will peter out in the 2030s. I think he’s calling it at least 50 years too soon. Here’s why: (1) I think logic designers are still faffing about in terms of optimizing their designs; and (2), I think there’s a lot of smart people thinking “incrementally” through what we’d consider paradigm shifts in HW implementation. That is, our fabs will just naturally segue into 3D, spintronics, etc. I think he even mentions 3D circuits? One thing a lot of people miss is that layout of the design is materially different in 3D vs 2D: in 2D, layout is NP-hard (complete) without efficient polynomial approximations; in 3D, layout is low-order polynomial. The reduction in layout complexity will allow us to design things that are unthinkable right now, due to layout constraints & wire congestion.
> in 2D, layout is NP-hard (complete) without efficient polynomial approximations; in 3D, layout is low-order polynomial
Any chance you could explain to a novice why 3D is easier? To my naive intuition, it would have seemed like the more room to maneuver is offset by having more stuff to route.
Intuitively, we feel that larger solution spaces should make a problem harder because there are more possible solutions to consider. And of course this is true if exhaustive search is the only algorithm.
But CS is full of problems where the smaller space is NP-hard and the larger one isn't. Integer linear programming is a prominent example.
To resolve the intuition, we can think of the larger space as an "unconstrained" problem where the solution space is somehow natural and nice for the problem. From this perspective, the smaller space looks like an added constraint. Adding constraints usually makes problems harder.
> Adding constraints usually makes problems harder.
Adding constraints can also make a problem easier. Which is why you've carefully worded your sentence in this manner :-)
You clearly know about those edge cases, but I felt it necessary to elaborate on this point: additional constraints can make a problem easier OR harder, depending on the problem and the search space. It's very non-intuitive.
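To make the ILP example above concrete (just the textbook formulation, nothing beyond it): the objective and constraints are identical; the only thing that changes is which space the solution is allowed to live in.

```latex
% Same objective, same constraints; only the solution space differs.
\begin{aligned}
\text{LP (larger, continuous space; solvable in polynomial time):}\quad
  & \max\; c^{\top}x \ \ \text{s.t.}\ \ Ax \le b,\ \ x \in \mathbb{R}^{n}_{\ge 0} \\[2pt]
\text{ILP (restricted to the integer lattice; NP-hard in general):}\quad
  & \max\; c^{\top}x \ \ \text{s.t.}\ \ Ax \le b,\ \ x \in \mathbb{Z}^{n}_{\ge 0}
\end{aligned}
```

Shrinking the feasible set from the reals to the integer lattice is exactly the "added constraint" that turns an easy problem into a hard one.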
Try laying out a square with all the vertices connected in a plane: it’s not possible. You must move one of the wires “up” a layer. Which wire? Great layout minimizes layer transitions while also bunching together related HW blocks. The NP-completeness proof is related to work Knuth’s student (Plass?) did on laying out images in TeX.
> Aren't "big" 3D circuits unfeasible due to temperature limitations, though?
Less so than you'd think: as long as you can keep leakage current under control, heat is only generated when circuits are active, i.e. when bits are being flipped. So you can have arbitrarily large amounts of increasingly-rarely-used circuitry for various purposes.
The naive-but-easy-to-understand example would be having separate, optimal circuitry for each machine instruction - the total number of gates is O(N*M), but the number of gates activated on each clock cycle (and thus the amount of waste heat generated) is only O(N), so you can keep adding new, perfectly-hardware-accelerated instructions up to the limits of physical space. In practice it's more complicated and less of a nonissue, but it's not "big circuits are useful in proportion to their surface area, not their volume", it's more "big circuits are less useful than their volume alone would suggest". (You do hit physical limits like the Bekenstein bound[0] eventually, but that's far enough out that we mostly don't care yet.)
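A rough back-of-the-envelope version of that argument (my reading of the N and M above: M dedicated instruction units of O(N) gates each): dynamic power tracks switched capacitance, i.e. the gates that actually toggle, not the gates that merely exist.

```latex
% Sketch only, not a sizing tool. Only the one decoded unit switches each cycle.
P_{\text{dyn}} \approx \alpha\, C\, V^{2} f, \qquad
\text{gates}_{\text{total}} = O(N \cdot M), \qquad
\text{gates}_{\text{switching/cycle}} = O(N)
\;\Longrightarrow\; P_{\text{dyn}} = O(N)\ \text{(roughly independent of } M\text{)}
% Caveat: leakage power does scale with total gates, hence the
% "keep leakage under control" condition above.
```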
The commenter above is correct: just stop toggling HW. We already do this to a great extent; we’re limited in the number of custom implementations because we can’t wire everything together. 3D chips will have a lot more “dark” logic than current chips, but will be orders-of-magnitude more efficient (& thus powerful) due to deep customization.
Also, remember the argument of my timeline is ~50–80 years out from now.
Dark logic isn't doing any useful work so what's the point? Sure you can include specialized circuitry for a bunch of rare cases, but that won't lift overall system performance much and will kill manufacturing yields.
This is totally wishful thinking. Customised logic means customised code.
The GPU has an H.265 and H.264 hardware video encoder that has no support on Mac and is a bitch to get working on Linux. Most software doesn't support it; most software does not support GPU compute either, and we've had that for a decade now.
Fuck, we've had SIMD in every CPU for like 30 years, and out of the 50 most popular programming languages, how many even support it? 5?
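For what it's worth, this is roughly what using SIMD from C still looks like today: you drop down to vendor intrinsics rather than anything the language itself gives you. A minimal AVX sketch (assumes an AVX-capable x86 and something like `-mavx`):

```c
#include <immintrin.h>
#include <stddef.h>

/* y[i] += a * x[i], 8 floats per iteration via AVX.
 * The scalar loop handles lengths that aren't a multiple of 8. */
void saxpy_avx(float a, const float *x, float *y, size_t n)
{
    __m256 va = _mm256_set1_ps(a);          /* broadcast a into all 8 lanes */
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);
        __m256 vy = _mm256_loadu_ps(y + i);
        vy = _mm256_add_ps(vy, _mm256_mul_ps(va, vx));
        _mm256_storeu_ps(y + i, vy);
    }
    for (; i < n; i++)                       /* scalar tail */
        y[i] += a * x[i];
}
```

Compilers will auto-vectorize the scalar version of this loop too, but the point stands: very few languages let you express it directly.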
Sure, and I have regular meetings with processor vendors who want to understand our workload to better serve it. But there's no easy or fast way for them to get low-level data, including processor trace (cycle logs), across many customers. This can be vast amounts of data: Tbytes. If you are working on processors at a cloud vendor and want to know some low-level CPU detail, you could answer it immediately across a million customer workloads.
An interesting application of a now-familiar pattern: get lots of users, spy on them at massive scale, use those data to dominate some other market in a way that, at most, a single digit count of companies in the world could conceivably compete with (because none but they have anything like the data that you do). See also: everything to do with "AI".
Amazon, Microsoft, and Google have massive applications and systems that they run, some of which they sell as a service. They have plenty of workload data without having to poke around user VMs.
FPGAs from Xilinx are very complicated. They are no longer homogeneous 4-LUTs or 6-LUTs with dedicated multipliers here and there.
Today's FPGAs are VLIW minicores capable of SIMD execution with custom routing and some LUTs thrown around. They've stepped towards GPU style architecture while retaining the custom logic portions.
FPGAs remain so difficult to use, I find it unlikely that they'd be mainstream in any capacity. GPUs seem like the easier way to get access to HBM + heavy compute, but either way the HBM future is imminent.
------------
GPUs have big questions about ease of use and practicality as it is, even with widespread acceptance of their compute potential. FPGAs are much less known, it's hard for me to imagine a mainstream future of them.
Since memory bounds remains the biggest issue and not compute performance, I bet that the easiest to use accelerator with mass production and cheap access to the highest speed HBM is going to be the winner. GPUs are the current frontrunner, but the Fujitsu ARM CPU has easy access to HBM and could be a wildcard.
POWER10 will be using high performance GDDR6. Not quite HBM, but it signals that IBM is also concerned with the memory bandwidth problem in the near future.
CPUs could very well switch to HBM in some scenarios.
------------
If I were to guess the future: I think that AMD and NVidia have proven that today's systems need high-speed routers to practically scale.
AMD has their IO die on EPYC. NVidia has NVLink and NVSwitch. That seems to be how to get more dies / sockets without additional NUMA hops.
More efficient networks of chips with explicit switching / routing topologies are the only way to scale. The exact form of this network is still a mystery, but that's my big bet for the future.
HBM is probably the future for high performance. DDR5 for cheaper bulk RAM but HBM on high performance CPUs / GPUs / FPGAs is going to be key.
---------
The insight into RAM bottlenecks is interesting, but seems to be a point in favor of SMT. If your core is 50% waiting on RAM, then SMT into another thread to perform work while waiting on RAM.
> If your core is 50% waiting on RAM, then SMT into another thread to perform work while waiting on RAM.
If your core is 50% waiting on RAM, then SMT into another thread, and that other thread will want some memory to work on, so it will also wait on RAM. On top of that, this second thread now puts extra pressure on the memory subsystem, might cause cache evictions for the other thread, etc etc etc.
The moment that you include the memory subsystem into the SMT picture, SMT goes from a "no brainer; waiting on memory? do other work" to a "uhhh... i don't know if this makes things better or worse".
DDR4 and DDR5 have 50ns (single socket) to 150ns (dual socket) latency.
For a 3GHz processor, that's 150 to 450 cycles.
On any latency-bound problem, SMT helps. However, what you say is true on bandwidth-bound problems. Given the sheer amount of pointer hopping that happens in typical OOP code these days (or Python / JavaScript), I expect SMT to be a big help to typical applications.
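A sketch of what "latency bound" means here, assuming a chain of dependent loads over an array much larger than cache (helper names are made up for illustration):

```c
#include <stdlib.h>
#include <stddef.h>

/* Sattolo's algorithm: permute 0..n-1 into a single cycle, so chasing
 * next[i] visits every slot before repeating. With n large enough that
 * the array doesn't fit in cache, each hop is roughly one DRAM access. */
size_t *make_chain(size_t n)
{
    size_t *next = malloc(n * sizeof *next);
    for (size_t i = 0; i < n; i++)
        next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;              /* j in [0, i-1] */
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }
    return next;
}

/* Every load's address depends on the previous load's value, so the
 * out-of-order window can't overlap the misses: the core mostly stalls,
 * which is exactly the slack an SMT sibling thread can soak up. */
size_t chase(const size_t *next, size_t hops)
{
    size_t i = 0;
    while (hops--)
        i = next[i];
    return i;    /* returning the result keeps the loop from being optimized away */
}
```

Timing something like `chase(make_chain(1 << 26), 1 << 24)` gives roughly the per-hop DRAM latency; run two of these on SMT siblings and combined throughput goes up, which is the latency-bound case above. Two bandwidth-bound streams on the same core mostly just split the memory bandwidth.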
DDR5 will double bandwidth in the near future. But that's not enough: HBM and GDDR6 have a possible future because you can only solve the bandwidth problem with more hardware. No tricks like SMT can help.
Memory bandwidth is just barely trying to keep up with cores, frequencies and IPC amounts. Bandwidth available per core is still going to drop. So newer development workflows that optimize for this bottleneck are going to be very relevant.
One thing that I think this will mean is the end of blas. Blas served us well for about 50 years, but one of the big problems it has is that it leads to code that takes multiple passes over memory. I think the future lies in systems that do code generation to better fuse loops of arbitrary code together. LoopVectorization.jl for example is able to generate blas level code for arbitrary computation, and as such can often be faster than blas since you don't have to use one of the specific hand optimized kernels.
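Not LoopVectorization.jl itself, just a plain C sketch of the fusion idea: the unfused, BLAS-style version streams the data through memory twice and materializes an intermediate; the fused version touches it once.

```c
#include <stddef.h>

/* Unfused: one pass to scale (like an axpy), a second pass to reduce.
 * The intermediate array y is written out and read back through memory. */
double scale_then_sum(const double *x, double *y, double a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i];
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += y[i];
    return s;
}

/* Fused: same result, one pass over x, no intermediate traffic at all.
 * This is the kind of kernel a code generator can emit for arbitrary user
 * expressions, which a fixed library of hand-written kernels cannot. */
double fused_scale_sum(const double *x, double a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a * x[i];
    return s;
}
```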
FPGAs are hard to work with in part because the tools are extremely proprietary. But that is starting to change. Open source FPGA tools are becoming more common and more powerful.
A delightful set of slides and references that tickles many of my pet topics. In particular, one I’d love to hear more about is why so many deployments are still choosing 2-socket servers by default when managing them is such a pain in the neck and the performance when you do it badly is so poor. Live the life of the future, today: choose single sockets!
> In particular, one I’d love to hear more about is why so many deployments are still choosing 2-socket servers by default
bit of a guess, but connectivity has been coming at a stupidly high premium for too long. there are sweet sweet blade chassis with price-optimized, less-than-full-power 1P designs, but 10G is still kind of novel there. bigger form factors are starting to see 25Gbit at not-astronomical prices & switches in some rare cases are reasonable too.
power supplies, storage, networking... a computer has a lot of not-entirely-ancillary needs. having multiple chips sharing the peripherals should make sense, should be cheap. it's not though. the SMP tax is huge huge huge.
thing is we don't need SMP. we just need multi-host peripherals. we need NICs that, like the grouphug OCP board, can support 4 separate nodes via PCIe SR-IOV: a NIC that can present multiple different virtual functions that different hosts can use. NVMe could similarly be multi-port (it once was). power supplies are shared in OCP designs, with big bus rails, some 48V.
I'm no expert, but it sure seems dead obvious to me the future of multi-socket is non-coherency. build a big board with a couple different isolated computers on it, but connected via shared NIC or NICs to the top-of-rack. we get close with the 3-per-width OpenCompute systems, but those each need to be self-contained, and there's an obvious leap in efficiency to be had by merging those three separate computers onto a single motherboard, while sharing some network, maybe storage devices. also like throw in some gratis PCIe NTB maybe for a medium-speed (~32GB/s on PCIe 4.0 x16) direct server-to-server interconnect. ideally add another NTB unit on most chips so we can make a little medium-speed, nearly-free ring, or other topology.
choose single sockets but choose many of them, each sharing some common peripherals.
k8s, by default, is oblivious to NUMA topology. You have to enable unreleased features and configure them correctly, which is the unwanted complexity to which I referred earlier. Simply aligning your containers to NUMA domains does not solve the problem that your arriving network frames or your NVMe completion queues can still be on the wrong domain. Isn't it simpler to just have 1 socket and not need to care? The number of cores available on a single socket system is pretty high these days, and in general the 1S parts are cheaper and faster.
generally a huge fan of kubernetes but it's stunning what a did-it-ourselves dirtbag k8s opted to be every step of the way with regard to scheduling.
Facebook has really really good talks about managing process scheduling at scale, talking about how they leverage cgroups to do the right thing.
kubernetes seems to not give a fuck. they have their own resource systems they cooked up. shit gets scheduled in one huge massive cgroup. any order or control is userland, totally ignorant of the kernel's controls. there's no hierarchies, no priorities, everything is absolute, schedule or die. it's such a ginormous piece of shit, so unbelievably willfully ignorant of all the good kernel technology that exists. it tries to make sure the kernel never has a role & that's just a huge mistake, just deeply tragic.
one notable side effect of this is that while the kernel has many ways to make multi-tenant scheduling fairly reasonable, kubernetes has a variety of wild harebrained schemes, all of which detour around how easy the job would be if different pods could be scheduled in different cgroups. but that's somehow too blindingly obvious for kubernetes, which instead tries to mediate what to run entirely by itself.
Yeah, it makes a lot of sense to go with single socket servers unless you can't scale horizontally (e.g. database server). Why deal with the complexity when you can just side step it.
Why would you switch from a 100GBps NUMA connection (800 gigabits per second) over NUMA fabric into a 10 Gbps Ethernet fabric?
If you are scaling horizontally, NUMA is the superior fabric than Ethernet or Infiniband (100Gbps)
Horizontal scaling seems to favor NUMA. 1000 chips over Ethernet is less efficient than 500 dual socket nodes over Ethernet. Anything you can do over Ethernet seems easier and cheaper over NUMA instead.
I'm talking mostly about scaling things like app servers where they might not need any communication.
But in general if you can't scale horizontally at 10 gbps, you're in for a world of hurt. Numa gets you to 8x scale at best on very expensive very exotic hardware. And then you hit the wall.
And single socket is equally cheap, except it takes twice the rack space - but it also gives you redundancy. One server can fail and you can carry on.
The advantage of memory bandwidth vs Ethernet for scaling to x2 really doesn't matter. If it did, you're not horizontally scalable and at best you buy a little time before you hit the wall.
If the price difference isn't much, I would heavily prefer single socket.
Your scaling architecture sucks if it depends on that kind of throughput. If you need that, you've only kicked the can down the road to more capacity without a real scaling fix.
Dual socket has numerous advantages in density and rack space. The fact that performance is better is pretty much icing on the cake.
It's easier to manage 500 dual socket servers than 1000 single socket servers. Less equipment, higher utilization of parts, etc. Etc.
To suggest dual socket NUMA is going away is... just very unlikely to me. I don't see what the benefits would be at all. Not just performance, but also routine maintenance issues (power, Ethernet, local storage, etc etc)
Kernel scheduling is NUMA aware and will localize workloads.
Threads will mostly have their RAM on the sticks local to their node. The core the thread is delegated to is also more likely to be the core local to the disk or NIC being used for IO.
This is at least my experience, though I am no expert.
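For when the kernel's defaults aren't enough, you can also be explicit about it. A hedged sketch with libnuma (link with `-lnuma`; node 0 is just an example, a real program would query the topology first): pin the thread and its buffer to the same node so the "local RAM" behavior described above is guaranteed rather than probabilistic.

```c
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    int node = 0;                 /* example node only; query topology in real code */
    size_t len = 1UL << 30;       /* 1 GiB */

    /* Run this thread only on CPUs of `node`, and back the buffer with
     * pages from that node's memory, so every access stays local. */
    if (numa_run_on_node(node) != 0)
        perror("numa_run_on_node");
    char *buf = numa_alloc_onnode(len, node);
    if (!buf) {
        perror("numa_alloc_onnode");
        return 1;
    }

    /* ... do the memory-heavy work here ... */

    numa_free(buf, len);
    return 0;
}
```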
Rack space can be quite expensive.
Sometimes you need a lot of computing power in one or two rack units.
Would be interested in what the management pains are.
I agree that 2 socket machines require more thought in a lot of scenarios, especially IO heavy workloads.
In my very limited experience it seems like space is much less an issue than power density.
You can fit far more kW/U than the datacenter can possibly cool.
In the commodity space that I rent, I ran out of power before filling even half the rack. I’m sure higher power/cooling density is possible to obtain, but I would think you’re primarily paying for that versus square footage?
It's not really about needing power density; high density can easily happen accidentally. 40 1S servers in a rack could be 20 kW and 40 2S servers could be 30+ kW.
The OpenCompute "Delta Lake" machine mentioned in the article occupies only one third of 1RU and peaks at 400W. You will certainly be power/cooling limited, rather than volume limited, with that kind of density.
All things considered, managing fewer hosts is nicer than more hosts.
In some hardware generations, dual socket has pretty good cost and complexity tradeoffs. And if you benefit from having a large dataset in memory on a single machine, dual socket often gets you twice the DIMM sockets and therefore twice the ram. Quad socket has been very expensive (and not great performance) for quite some time, so that's usually out.
Single socket Epyc looks pretty impressive though; although I'm retired and probably won't get to work with those anytime soon.
>for storage including new uses for 3D Xpoint as a 3D NAND accelerator;
3D XPoint's future is not entirely certain. Intel, with their new CEO, has remained rather quiet on the subject. Micron is pulling the plug on it and sold the fab to Texas Instruments. The problem is there isn't a clear path forward with the technology: it made some sense when NAND and DRAM prices were high in 2016-2019. Once those dropped to normal levels, and with newer DDR5 and faster, lower-latency SLC NAND or Z-NAND, XPoint's cost benefit becomes unclear. I guess we will know once Intel's Optane P5800X [1] is out with reviews. It is quite a beast.
>Multi-Socket is Doomed
Are there really no use-cases where 128+ cores with NUMA offer some advantage?
>Slower Rotational
Seagate [2] is actually working on dual-actuator HDDs, think of it as something like internal RAID 0. The rationale being that as HDDs get bigger, the time to fill up those drives increases as well.
>ARM on Cloud
Marvell partly confirms that all hyperscalers intend to build their own ARM CPUs. But Google just announced their Tau instances [3], effectively cutting their cost / perf by 50%, where each vCPU is an entire physical CPU core rather than an x86 thread.
When a hypothetical 128-core single socket comes out, will there be no workload that prefers to use a 2x128-core dual socket instead?
AMD CPUs remain largely dual-socket compatible. Today's 64-core EPYCs can be dual-socketed into 2x64-core beasts.
It just seems silly to me that if you're building, say, 200 computers across 10 racks (20 computers per 40U rack), you'd prefer single socket over dual socket. If you're scaling up and out so much, what exactly is the problem with dual socket? It's not costs: dual socket remains more cost-effective on a per-core basis than single socket. Dual socket cuts the number of computers you need to work with in half. Etc. etc.
I don't have a workload I'd prefer to see on 2x128-core: We're already microservices running across a pool of instances, and would prefer a bigger pool of faster instances than a smaller pool of slower ones at the same cost. Once we get a workload running on 100+ cores, I often see a lot of lock contention anyway. Going bigger usually makes that worse (worse ROI).
As for datacenter size/cost, it's a good point, but what if two 1-Socket servers could take up the same space as one 2-Socket server? :-) That may never happen, but some level of space optimization will, so it's not a simple doubling of size. E.g., Facebook's work in the OCP with 1-socket sleds (or blades):
> As for datacenter size/cost, it's a good point, but what if two 1-Socket servers could take up the same space as one 2-Socket server? :-) That may never happen, but some level of space optimization will
Oh it certainly exists. Computers are space-optimized to the point of nonsense. IIRC, most people don't even bother to use widely available 1U servers because you run out of power before you fill up 40U racks.
Hyperscalers, such as Google / Netflix / Amazon, are a bit different of course (IIRC, you work at one right?), since they can specially build their data centers to have far denser power delivery and actually support 40 or even 80 computers per rack. But more typical offices simply do not have the power density to run 1U nodes or smaller (e.g., Supermicro's 2-nodes-in-1U or 4-nodes-in-2U systems).
In effect: modern computer systems usually run out of power before they run out of rack space. Especially when you consider that every Watt-delivered turns into Heat (Watts) generated, which then requires a more powerful air-conditioner to keep the room within operating specs.
So you're right that modern datacenters probably don't care about size. Space is relatively cheap; power lines are expensive! 2 sockets × 2 nodes per 1U == 160 CPUs per 40U rack. 10 such racks would use over a megawatt of power once we factor in air conditioning, so a typical building just won't handle that.
-------------
> I don't have a workload I'd prefer to see on 2x128-core: We're already microservices running across a pool of instances, and would prefer a bigger pool of faster instances than a smaller pool of slower ones at the same cost. Once we get a workload running on 100+ cores, I often see a lot of lock contention anyway. Going bigger usually makes that worse (worse ROI).
But that "lock contention" you're measuring is something like 250ns to 500ns over a channel that's 800Gbit/sec wide.
EDIT: To be more specific: I'm talking about the MESI messages going over the NUMA fabric.
In contrast, a packet over 10 Gbit Ethernet is basically two orders of magnitude less bandwidth and an order of magnitude more latency (maybe 2500 to 5000 nanoseconds of latency?).
If the application were truly limited by the communication paradigm between the NUMA Fabric, switching to Ethernet or InfiniBand would only slow it down further.
EDIT: Case in point: we don't do spinlocks over Ethernet. I mean, we could in theory (RDMA a region of memory over Ethernet and then hold it as a Spinlock), but we all know its a bad idea. We do spinlocks at L3 and/or the NUMA Fabric level (and maybe we'll do it over PCIe 5.0 / CCIX level, as cache-coherent I/O becomes possible).
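For reference, the thing being contended over that fabric is as small as this (a minimal C11 sketch, not a production lock): every failed test-and-set is one cache line bouncing between cores, or between sockets over the NUMA interconnect, at hundreds of nanoseconds rather than network microseconds.

```c
#include <stdatomic.h>

/* Initialize with: spinlock_t l = { ATOMIC_FLAG_INIT }; */
typedef struct {
    atomic_flag f;
} spinlock_t;

/* Acquire: each test-and-set on a contended lock pulls the cache line
 * exclusive into this core, i.e. one MESI round trip through the L3 or,
 * across sockets, over the NUMA fabric. */
static inline void spin_lock(spinlock_t *l)
{
    while (atomic_flag_test_and_set_explicit(&l->f, memory_order_acquire))
        ;   /* spin; a real lock would back off or yield here */
}

static inline void spin_unlock(spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->f, memory_order_release);
}
```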
Ah, thanks for the details, I wasn't referring to NUMA-induced lock contention, but rather lock contention in general. I've seen a workload hit 64 CPUs with 80% of CPU time in lock contention (and others in the 10-30%). Now, while that means the developer has a big problem to fix, it also has me wondering about single sockets getting big enough -- 128 CPUs is already hard to use well. Back when I first saw multi-socket systems with 2-4 total CPUs, getting the extra CPUs online was all goodness. But adding another 128 CPUs to my already 128-CPU system, well...is there a point where we can say, in general, that we already have enough cores? In the talk I referred to the 850,000-core GPU, and how I couldn't see that ever working as general purpose CPUs in the software of today. With 3D stacking, I think we'll reach a practical core limit on a single socket, and just won't need the complexity (including NUMA) of multi socket anymore.
There are plenty of tasks which max out at one block / threadgroup of 1024 CUDA-threads. cudaMemcpy is a silly example (probably maxes out memory bandwidth at just 64 or 32 cores used), but there are plenty of tasks that simply don't scale to the full use of a GPU.
Just because some tasks (many tasks?) fail to use more than 32-GPU cores doesn't mean that GPU-parallelism is useless. It just means that when you program those particular tasks, only use 32-GPU cores!! Then use the GPU-cores on _other_ tasks (possibly in parallel).
IIRC, cudaMalloc, and many other primitives in the CUDA framework, has been shown to have very little parallelism at all. You need to work at keeping this "sequential-code" outside of your inner loops. (Runs on CUDA-stream #0, which for older hardware at least is sequentially scheduled)
----------
1. Some tasks can effectively use infinite cores (SIMD-threads really for GPUs... but same idea since a SIMD-lane can largely emulate a thread as long as you're careful about branch divergence)
2. Some tasks can be parallelized at the application / operator level. Run many applications in parallel ("Makefile parallelism")
3. Some tasks (memcpy) are so memory-bound that parallelism will never help.
4. Some tasks have a better solution that becomes feasible with more compute power.
---------
Let's take a CPU example: H.264 encoding. IIRC, this task barely scales to 8 cores and has diminishing returns beyond that.
But an example of #2 would be Youtube: you have one encoding machine that handles transcoding in parallel. You don't run just 1 instance of the problem (using 8 cores), you run 32 in parallel, and each of those 32 instances can effectively use 8 cores for H.264 encoding.
And #4 can still happen: H.264 is pretty easy for modern computers, with little opportunity for parallelism. Switching to H.265 or even to AV1 will increase the compute power needed, and allow scaling of single tasks up to 16 to 64 cores. Now your hypothetical 2x128 machine can only run 4 transcoding sessions at a time.
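A sketch of that #2, operator-level pattern (the `encode_one` function here is a placeholder, not a real encoder): don't force one encode to scale past where the algorithm stops scaling, just run many independent encodes side by side.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <stdio.h>

/* Placeholder for one transcoding job that scales well to ~8 cores. */
static void encode_one(int job_id)
{
    printf("worker %d: encoding job %d\n", (int)getpid(), job_id);
    /* ... run the 8-thread encoder here ... */
}

int main(void)
{
    const int jobs = 32;                  /* 32 independent encodes in flight */

    for (int i = 0; i < jobs; i++) {
        pid_t pid = fork();
        if (pid == 0) {                   /* child: do one job and exit */
            encode_one(i);
            _exit(0);
        }
    }
    for (int i = 0; i < jobs; i++)        /* parent: wait for all children */
        wait(NULL);
    return 0;
}
```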
-----------
The dual-socket machine for a transcoding cluster is still superior over a single-socket machine. 10Gbit Ethernet is more than sufficient to handle 4x AV1 sessions (especially because AV1 is slower than realtime), so right there we've cut the number of Ethernet cables in half, which means we've cut the number of 10 Gbit Switches in half.
Having the two sockets share one Ethernet port is an efficiency gain even if you don't have any task-communication going on: if only for the I/O sharing capability of the NUMA Fabric.
- The flip side of cuts the computers you need to work with in half is that it doubles the blast radius in case of PSU/fan/mobo/etc failure
- If you're interested in I/O, dual sockets can be problematic because few motherboards are "balanced" with an equal number of PCIe slots local to each socket.
- NUMA makes everything harder. Even after the work that I've done to make NUMA useful for Netflix's Open Connect (CDN) on FreeBSD, I'd very much rather just use flat machines wherever I can. NUMA gives lots of opportunities for comically bad performance if any little thing is placed incorrectly.
I do appreciate the difficulty of getting software configured correctly.
But I'm of the opinion that software configuration is quicker and easier than redeveloping algorithms to become faster on FPGAs or GPUs.
It's really odd to have a talk about how FPGAs are part of a hypothetical mainstream future (when so few people even know how to code in Verilog, let alone know how to synthesize a systolic array or other obscure parallel architecture), and then turn around and say that dual-socket computers are too hard to configure.
Verilog / FPGAs aren't magic. They're just highly configurable logic gates + some preconfigured ALUs that allow for alternative parallel structures. These alternative parallel structures (most commonly a systolic array) are often highly specific to a task. But ultimately: the mode of compute still needs to be super-parallel to beat a CPU.
Remember: CPUs have higher clock-speeds than FPGAs. That's why FPGAs have mini-ALUs inside of them (ex: multipliers), because ASIC beats configurable logic in every spec that matters (GHz, power-efficiency, mm^2 on die).
3D CPU stacking seems interesting where surface area is a limited resource, but otherwise it seems like it would significantly complicate cooling things efficiently. Or is my assumption wrong?
Some random contemporary musings, that touch some of these topics: I really hope we have a rad eBPF based QUIC/HTTP3 front-end/reverse-proxy router in the next 5 years.
QUIC is so exciting and I just want it to be both fast & a supremely flexible way for a connection from a client to talk to a host of backend services. We'll definitely see some classic userland-based approaches emerge, but gee, I'm really hungry for an in-kernel option.
For context, I was at the park two days ago, thinking about replacing a Node timesync[1]-over-websockets thing with an NTP-over-WebTransport (QUIC) implementation. There weren't any H3 front-ends (which I kind of need, because I just have some random colo & VPS boxes), and even if there were, I was worried about adding latency (which a BPF-based solution would significantly reduce, while letting me re-use ports 80/443).
Especially as we see more extreme-throughput/HBM memory systems arrive, it's just so neat that we have a multiplexed transport protocol. Figuring out how to use that connection (a semi-stateless "connection", because QUIC is awesome) to talk to an array of services is an ultra-interesting challenge, and BPF sure seems like the go-to tech for routing & managing packets in the world today. QUIC, with its multiplexing, adds the complexity that it is now sub-packets that we want to route. I hope we can find a way to keep a lot of that processing in the kernel.
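To sketch the kernel side of that (a bare-bones XDP program; real QUIC routing would also need to parse connection IDs out of the UDP payload, which is the genuinely hard part), today you can at least classify UDP/443 before the socket layer ever sees it:

```c
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/udp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

/* Minimal XDP classifier: everything is passed through, but this is the
 * hook point where QUIC (UDP/443) packets could be counted, steered to a
 * specific queue/CPU, or redirected toward a backend. */
SEC("xdp")
int quic_classify(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end || ip->protocol != IPPROTO_UDP)
        return XDP_PASS;

    /* Assumes no IPv4 options, to keep the sketch verifier-friendly. */
    struct udphdr *udp = (void *)(ip + 1);
    if ((void *)(udp + 1) > data_end)
        return XDP_PASS;

    if (udp->dest == bpf_htons(443)) {
        /* likely QUIC: per-connection-ID routing would start here */
    }
    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
```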
Still feels to me like we should be going the other way - kick more and more things off of the motherboard and support them with discrete - potentially customized - processors of their own.
Between io_uring and current or future facilities of eBPF, we have a lot of tools on deck for pipelining IO operations, and once you have a way to pipeline IO operations, latency is no longer the only bottleneck. Then it's a matter of how much bandwidth you can push between two processes, or processors.
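A hedged liburing sketch of that pipelining (error handling trimmed; `data.bin` is just an example file): the point is that a whole batch of reads goes to the kernel in one submit, so per-operation latency stops dominating.

```c
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>

#define BATCH 8

int main(void)
{
    struct io_uring ring;
    char bufs[BATCH][4096];

    int fd = open("data.bin", O_RDONLY);              /* example file */
    if (fd < 0 || io_uring_queue_init(32, &ring, 0) < 0)
        return 1;

    /* Queue a batch of reads, then submit them all with one syscall. */
    for (int i = 0; i < BATCH; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, bufs[i], sizeof bufs[i], (__u64)i * 4096);
        io_uring_sqe_set_data(sqe, (void *)(long)i);  /* tag with the slot index */
    }
    io_uring_submit(&ring);

    /* Reap completions as they arrive; they may finish in any order. */
    for (int i = 0; i < BATCH; i++) {
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        printf("read %ld finished: %d bytes\n",
               (long)io_uring_cqe_get_data(cqe), cqe->res);
        io_uring_cqe_seen(&ring, cqe);
    }

    io_uring_queue_exit(&ring);
    return 0;
}
```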
OK, it must be a compute-heavy load that does little random memory access, but works mostly on compact in-cache structures and maybe does sustained sequential memory accesses. With that, it's not suitable to offload to the GPU.
* Second socket increases the memory channels and RAM available: 16-channel dual-EPYC with 8TB of RAM will be faster than 4TB of RAM on single-EPYC 8-channel.
* SQL optimizers automatically search for sequential scans, because sequential scans are faster.
* While JOIN can be done in GPU space, GPUs have extremely low memory capacity (only 80GB on the latest A100 that costs $10,000+). CPU will be faster because you can keep a much larger dataset hot in RAM. Your 80GB of VRAM on a GPU means nothing if your dataset is in the multi-TB range. (8TB of CPU-RAM on the other hand, serves as a reasonable cache)
More sockets add memory controllers, but we can also think about moving HBM closer to the cores as a L4 cache or scratch memory that’s not expected to be synchronised with other cores/sockets.