Slide 26 is interesting - arguing that cloud providers have an advantage for future CPU design since they can analyze so many real world customer workloads directly.
In previous roles I have worked with CPU vendors who have been very keen on getting access to profiling data from our workloads for design optimization, and lamenting the fact that it was hard to get such data and they were often limited to synthetic benchmark workloads when tuning new designs.
So this argument does sound like a valid one, and does imply AWS etc will have significant advantages in future designs.
Thanks, I think it's the start of a new "cloud CPU" era.
Cloud vendors have already split workloads into different instance types, so directly analyzing their workloads and developing CPUs for each instance type will lead to further performance wins. This may prompt the creation of even more instance types just to further separate workload types for future CPU specialization.
In the future, products like AWS Outposts may become far more desirable than commodity hardware, as customers know they provide access to specialized CPUs and their performance. It's a path for cloud computing vendors to own the datacenters as well.
(Note that my predictions are not based on any internal knowledge: I'm just describing what I personally would be doing if I were a cloud vendor.)
He sort of implies that “just better hardware” will peter out in the 2030s. I think he’s calling it at least 50 years too soon. Here’s why: (1) I think logic designers are still faffing about in terms of optimizing their designs; and (2), I think there’s a lot of smart people thinking “incrementally” through what we’d consider paradigm shifts in HW implementation. That is, our fabs will just naturally segue into 3D, spintronics, etc. I think he even mentions 3D circuits? One thing a lot of people miss is that layout of the design is materially different in 3D vs 2D: in 2D, layout is NP-hard (complete) without efficient polynomial approximations; in 3D, layout is low-order polynomial. The reduction in layout complexity will allow us to design things that are unthinkable right now, due to layout constraints & wire congestion.
> in 2D, layout is NP-hard (complete) without efficient polynomial approximations; in 3D, layout is low-order polynomial
Any chance you could explain to a novice why 3D is easier? To my naive intuition, it would have seemed like the more room to maneuver is offset by having more stuff to route.
Intuitively, we feel that larger solution spaces should make a problem harder because there are more possible solutions to consider. And of course this is true if exhaustive search is the only algorithm.
But CS is full of problems where the smaller space is NP-hard and the larger one isn't. Integer linear programming is a prominent example.
To resolve the intuition, we can think of the larger space as an "unconstrained" problem where the solution space is somehow natural and nice for the problem. From this perspective, the smaller space looks like an added constraint. Adding constraints usually makes problems harder.
> Adding constraints usually makes problems harder.
Adding constraints can also make a problem easier. Which is why you've carefully worded your sentence in this manner :-)
You clearly know about those edge cases, but I felt it necessary to elaborate on this point: additional constraints can make a problem easier OR harder, depending on the problem and the search space. It's very non-intuitive.
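To make the ILP example above concrete (just the textbook formulation, nothing beyond it): the objective and constraints are identical; the only thing that changes is which space the solution is allowed to live in.

```latex
% Same objective, same constraints; only the solution space differs.
\begin{aligned}
\text{LP (larger, continuous space; solvable in polynomial time):}\quad
  & \max\; c^{\top}x \ \ \text{s.t.}\ \ Ax \le b,\ \ x \in \mathbb{R}^{n}_{\ge 0} \\[2pt]
\text{ILP (restricted to the integer lattice; NP-hard in general):}\quad
  & \max\; c^{\top}x \ \ \text{s.t.}\ \ Ax \le b,\ \ x \in \mathbb{Z}^{n}_{\ge 0}
\end{aligned}
```

Shrinking the feasible set from the reals to the integer lattice is exactly the "added constraint" that turns an easy problem into a hard one.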
Try laying out a square with all the vertices connected in a plane: it’s not possible. You must move one of the wires “up” a layer. Which wire? Great layout minimizes layer transitions while also bunching together related HW blocks. The NP-completeness proof is related to work Knuth’s student (Plass?) did on laying out images in TeX.
> Aren't "big" 3D circuits unfeasible due to temperature limitations, though?
Less so than you'd think: as long as you can keep leakage current under control, heat is only generated when circuits are active, i.e. when bits are being flipped. So you can have arbitrarily large amounts of increasingly-rarely-used circuitry for various purposes.
The naive-but-easy-to-understand example would be having separate, optimal circuitry for each machine instruction - the total number of gates is O(N*M), but the number of gates activated on each clock cycle (and thus the amount of waste heat generated) is only O(N), so you can keep adding new, perfectly-hardware-accelerated instructions up to the limits of physical space. In practice it's more complicated and less of a nonissue, but it's not "big circuits are useful in proportion to their surface area, not their volume", it's more "big circuits are less useful than their volume alone would suggest". (You do hit physical limits like the Bekenstein bound[0] eventually, but that's far enough out that we mostly don't care yet.)
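A rough back-of-the-envelope version of that argument (my reading of the N and M above: M dedicated instruction units of O(N) gates each): dynamic power tracks switched capacitance, i.e. the gates that actually toggle, not the gates that merely exist.

```latex
% Sketch only, not a sizing tool. Only the one decoded unit switches each cycle.
P_{\text{dyn}} \approx \alpha\, C\, V^{2} f, \qquad
\text{gates}_{\text{total}} = O(N \cdot M), \qquad
\text{gates}_{\text{switching/cycle}} = O(N)
\;\Longrightarrow\; P_{\text{dyn}} = O(N)\ \text{(roughly independent of } M\text{)}
% Caveat: leakage power does scale with total gates, hence the
% "keep leakage under control" condition above.
```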
The commenter above is correct: just stop toggling HW. We already do this to a great extent; we’re limited in the number of custom implementations because we can’t wire everything together. 3D chips will have a lot more “dark” logic than current chips, but will be orders-of-magnitude more efficient (& thus powerful) due to deep customization.
Also, remember the argument of my timeline is ~50–80 years out from now.
Dark logic isn't doing any useful work so what's the point? Sure you can include specialized circuitry for a bunch of rare cases, but that won't lift overall system performance much and will kill manufacturing yields.
This is totally wishful thinking. Customised logic means customised code.
The GPU has an H.265 and H.264 hardware video encoder that has no support on Mac and is a bitch to get working on Linux. Most software doesn't support it; most software does not support GPU compute either, and we've had that for a decade now.
Fuck, we've had SIMD in every CPU for like 30 years, and out of the 50 most popular programming languages, how many even support it? 5?
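For what it's worth, this is roughly what using SIMD from C still looks like today: you drop down to vendor intrinsics rather than anything the language itself gives you. A minimal AVX sketch (assumes an AVX-capable x86 and something like `-mavx`):

```c
#include <immintrin.h>
#include <stddef.h>

/* y[i] += a * x[i], 8 floats per iteration via AVX.
 * The scalar loop handles lengths that aren't a multiple of 8. */
void saxpy_avx(float a, const float *x, float *y, size_t n)
{
    __m256 va = _mm256_set1_ps(a);          /* broadcast a into all 8 lanes */
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);
        __m256 vy = _mm256_loadu_ps(y + i);
        vy = _mm256_add_ps(vy, _mm256_mul_ps(va, vx));
        _mm256_storeu_ps(y + i, vy);
    }
    for (; i < n; i++)                       /* scalar tail */
        y[i] += a * x[i];
}
```

Compilers will auto-vectorize the scalar version of this loop too, but the point stands: very few languages let you express it directly.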
Sure, and I have regular meetings with processor vendors who want to understand our workload to better serve it. But there's no easy or fast way for them to get low-level data, including processor trace (cycle logs), across many customers. This can be vast amounts of data: Tbytes. If you are working on processors at a cloud vendor and want to know some low-level CPU detail, you could answer it immediately across a million customer workloads.
An interesting application of a now-familiar pattern: get lots of users, spy on them at massive scale, use those data to dominate some other market in a way that, at most, a single digit count of companies in the world could conceivably compete with (because none but they have anything like the data that you do). See also: everything to do with "AI".
Amazon, Microsoft, and Google have massive applications and systems that they run, some of which they sell as a service. They have plenty of workload data without having to poke around user VMs.
FPGAs from Xilinx are very complicated. They are no longer homogeneous 4-LUTs or 6-LUTs with dedicated multipliers here and there.
Today's FPGAs are VLIW minicores capable of SIMD execution with custom routing and some LUTs thrown around. They've stepped towards GPU style architecture while retaining the custom logic portions.
FPGAs remain so difficult to use, I find it unlikely that they'd be mainstream in any capacity. GPUs seem like the easier way to get access to HBM + heavy compute, but either way the HBM future is imminent.
------------
GPUs have big questions about ease of use and practicality as it is, even with widespread acceptance of their compute potential. FPGAs are much less known, it's hard for me to imagine a mainstream future of them.
Since memory bounds remains the biggest issue and not compute performance, I bet that the easiest to use accelerator with mass production and cheap access to the highest speed HBM is going to be the winner. GPUs are the current frontrunner, but the Fujitsu ARM CPU has easy access to HBM and could be a wildcard.
POWER10 will be using high performance GDDR6. Not quite HBM, but it signals that IBM is also concerned with the memory bandwidth problem in the near future.
CPUs could very well switch to HBM in some scenarios.
------------
If I were to guess the future: I think that AMD and NVidia have proven that today's systems need high-speed routers to practically scale.
AMD has their IO die on EPYC. NVidia has NVLink and NVSwitch. That seems to be how to get more dies / sockets without additional NUMA hops.
More efficient networks of chips with explicit switching / routing topologies are the only way to scale. The exact form of this network is still a mystery, but that's my big bet for the future.
HBM is probably the future for high performance. DDR5 for cheaper bulk RAM but HBM on high performance CPUs / GPUs / FPGAs is going to be key.
---------
The insight into RAM bottlenecks is interesting, but seems to be a point in favor of SMT. If your core is 50% waiting on RAM, then SMT into another thread to perform work while waiting on RAM.
> If your core is 50% waiting on RAM, then SMT into another thread to perform work while waiting on RAM.
If your core is 50% waiting on RAM, then SMT into another thread, and that other thread will want some memory to work on, so it will also wait on RAM. On top of that, this second thread now puts extra pressure on the memory subsystem, might cause cache evictions for the other thread, etc etc etc.
The moment that you include the memory subsystem into the SMT picture, SMT goes from a "no brainer; waiting on memory? do other work" to a "uhhh... i don't know if this makes things better or worse".
DDR4 and DDR5 have 50ns (single socket) to 150ns (dual socket) latency.
For a 3GHz processor, that's 150 to 450 cycles.
On any latency-bound problem, SMT helps. However, what you say is true on bandwidth-bound problems. Given the sheer amount of pointer hopping that happens in typical OOP code these days (or Python / JavaScript), I expect SMT to be a big help to typical applications.
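A sketch of what "latency bound" means here, assuming a chain of dependent loads over an array much larger than cache (helper names are made up for illustration):

```c
#include <stdlib.h>
#include <stddef.h>

/* Sattolo's algorithm: permute 0..n-1 into a single cycle, so chasing
 * next[i] visits every slot before repeating. With n large enough that
 * the array doesn't fit in cache, each hop is roughly one DRAM access. */
size_t *make_chain(size_t n)
{
    size_t *next = malloc(n * sizeof *next);
    for (size_t i = 0; i < n; i++)
        next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;              /* j in [0, i-1] */
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }
    return next;
}

/* Every load's address depends on the previous load's value, so the
 * out-of-order window can't overlap the misses: the core mostly stalls,
 * which is exactly the slack an SMT sibling thread can soak up. */
size_t chase(const size_t *next, size_t hops)
{
    size_t i = 0;
    while (hops--)
        i = next[i];
    return i;    /* returning the result keeps the loop from being optimized away */
}
```

Timing something like `chase(make_chain(1 << 26), 1 << 24)` gives roughly the per-hop DRAM latency; run two of these on SMT siblings and combined throughput goes up, which is the latency-bound case above. Two bandwidth-bound streams on the same core mostly just split the memory bandwidth.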
DDR5 will double bandwidth in the near future. But that's not enough: HBM and GDDR6 have a possible future because you can only solve the bandwidth problem with more hardware. No tricks like SMT can help.
Memory bandwidth is just barely trying to keep up with cores, frequencies and IPC amounts. Bandwidth available per core is still going to drop. So newer development workflows that optimize for this bottleneck are going to be very relevant.
One thing that I think this will mean is the end of blas. Blas served us well for about 50 years, but one of the big problems it has is that it leads to code that takes multiple passes over memory. I think the future lies in systems that do code generation to better fuse loops of arbitrary code together. LoopVectorization.jl for example is able to generate blas level code for arbitrary computation, and as such can often be faster than blas since you don't have to use one of the specific hand optimized kernels.
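Not LoopVectorization.jl itself, just a plain C sketch of the fusion idea: the unfused, BLAS-style version streams the data through memory twice and materializes an intermediate; the fused version touches it once.

```c
#include <stddef.h>

/* Unfused: one pass to scale (like an axpy), a second pass to reduce.
 * The intermediate array y is written out and read back through memory. */
double scale_then_sum(const double *x, double *y, double a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i];
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += y[i];
    return s;
}

/* Fused: same result, one pass over x, no intermediate traffic at all.
 * This is the kind of kernel a code generator can emit for arbitrary user
 * expressions, which a fixed library of hand-written kernels cannot. */
double fused_scale_sum(const double *x, double a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a * x[i];
    return s;
}
```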
FPGAs are hard to work with in part because the tools are extremely proprietary. But that is starting to change. Open source FPGA tools are becoming more common and more powerful.
A delightful set of slides and references that tickles many of my pet topics. In particular, one I’d love to hear more about is why so many deployments are still choosing 2-socket servers by default when managing them is such a pain in the neck and the performance when you do it badly is so poor. Live the life of the future, today: choose single sockets!
> In particular, one I’d love to hear more about is why so many deployments are still choosing 2-socket servers by default
bit of a guess, but connectivity has been coming at a stupidly high premium for too long. there are sweet sweet blade chassis with price-optimized, less-than-full-power 1P designs, but 10G is still kind of novel there. bigger form factors are starting to see 25Gbit at not-astronomical prices & switches in some rare cases are reasonable too.
power supplies, storage, networking... a computer has a lot of not-entirely-ancillary needs. having multiple chips sharing the peripherals should make sense, should be cheap. it's not though. the SMP tax is huge huge huge.
thing is we don't need SMP. we just need multi-host peripherals. we need NICs that, like the grouphug OCP board, can support 4 separate nodes via PCIe SR-IOV: a NIC that can present multiple different virtual functions that different hosts can use. NVMe could similarly be multi-port (it once was). power supplies are shared in OCP designs, with big bus rails, some 48V.
I'm no expert, but it sure seems dead obvious to me the future of multi-socket is non-coherency. build a big board with a couple different isolated computers on it, but connected via shared NIC or NICs to the top-of-rack. we get close with the 3-per-width OpenCompute systems, but those each need to be self-contained, and there's an obvious leap in efficiency to be had by merging those three separate computers onto a single motherboard, while sharing some network, maybe storage devices. also like throw in some gratis PCIe NTB maybe for a medium-speed (~32GB/s on PCIe 4.0 x16) direct server-to-server interconnect. ideally add another NTB unit on most chips so we can make a little medium-speed, nearly-free ring, or other topology.
choose single sockets but choose many of them, each sharing some common peripherals.
k8s, by default, is oblivious to NUMA topology. You have to enable unreleased features and configure them correctly, which is the unwanted complexity to which I referred earlier. Simply aligning your containers to NUMA domains does not solve the problem that your arriving network frames or your NVMe completion queues can still be on the wrong domain. Isn't it simpler to just have 1 socket and not need to care? The number of cores available on a single socket system is pretty high these days, and in general the 1S parts are cheaper and faster.
generally a huge fan of kubernetes but it's stunning what a did-it-ourselves dirtbag k8s opted to be every step of the way with regard to scheduling.
Facebook has really really good talks about managing process scheduling at scale, talking about how they leverage cgroups to do the right thing.
kubernetes seems to not give a fuck. they have their own resource systems they cooked up. shit gets scheduled in one huge massive cgroup. any order or control is userland, totally ignorant of the kernel's controls. there's no hierarchies, no priorities, everything is absolute, schedule or die. it's such a ginormous piece of shit, so unbelievably willfully ignorant of all the good kernel technology that exists. it tries to make sure the kernel never has a role & that's just a huge mistake, just deeply tragic.
one notable side effect of this is that while the kernel has many ways to make multi-tenant scheduling fairly reasonable, kubernetes has a variety of wild harebrained schemes, all of which detour around how easy the job would be if different pods could be scheduled in different cgroups. but that's somehow too blindingly obvious for kubernetes, which instead tries to mediate what to run entirely by itself.
Yeah, it makes a lot of sense to go with single socket servers unless you can't scale horizontally (e.g. database server). Why deal with the complexity when you can just side step it.
Why would you switch from a 100GBps NUMA connection (800 gigabits per second) over NUMA fabric into a 10 Gbps Ethernet fabric?
If you are scaling horizontally, NUMA is the superior fabric than Ethernet or Infiniband (100Gbps)
Horizontal scaling seems to favor NUMA. 1000 chips over Ethernet is less efficient than 500 dual socket nodes over Ethernet. Anything you can do over Ethernet seems easier and cheaper over NUMA instead.
I'm talking mostly about scaling things like app servers where they might not need any communication.
But in general if you can't scale horizontally at 10 gbps, you're in for a world of hurt. Numa gets you to 8x scale at best on very expensive very exotic hardware. And then you hit the wall.
And single socket is equally cheap, except it takes twice the rack space - but it also gives you redundancy. One server can fail and you can carry on.
The advantage of memory bandwidth vs Ethernet for scaling to x2 really doesn't matter. If it did, you're not horizontally scalable and at best you buy a little time before you hit the wall.
If the price difference isn't much, I would heavily prefer single socket.
Your scaling architecture sucks if it depends on that kind of throughput. If you need that, you've only kicked the can down the road to more capacity without a real scaling fix.
Dual socket has numerous advantages in density and rack space. The fact that performance is better is pretty much icing on the cake.
It's easier to manage 500 dual socket servers than 1000 single socket servers. Less equipment, higher utilization of parts, etc. Etc.
To suggest dual socket NUMA is going away is... just very unlikely to me. I don't see what the benefits would be at all. Not just performance, but also routine maintenance issues (power, Ethernet, local storage, etc etc)
Kernel scheduling is NUMA aware and will localize workloads.
Threads will mostly have their RAM on the sticks local to their node. The core the thread is delegated to is also more likely to be the core local to the disk or NIC being used for IO.
This is at least my experience, though I am no expert.
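For when the kernel's defaults aren't enough, you can also be explicit about it. A hedged sketch with libnuma (link with `-lnuma`; node 0 is just an example, a real program would query the topology first): pin the thread and its buffer to the same node so the "local RAM" behavior described above is guaranteed rather than probabilistic.

```c
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    int node = 0;                 /* example node only; query topology in real code */
    size_t len = 1UL << 30;       /* 1 GiB */

    /* Run this thread only on CPUs of `node`, and back the buffer with
     * pages from that node's memory, so every access stays local. */
    if (numa_run_on_node(node) != 0)
        perror("numa_run_on_node");
    char *buf = numa_alloc_onnode(len, node);
    if (!buf) {
        perror("numa_alloc_onnode");
        return 1;
    }

    /* ... do the memory-heavy work here ... */

    numa_free(buf, len);
    return 0;
}
```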
Rack space can be quite expensive.
Sometimes you need a lot of computing power in one or two rack units.
Would be interested in what the management pains are.
I agree that 2 socket machines require more thought in a lot of scenarios, especially IO heavy workloads.
In my very limited experience it seems like space is much less an issue than power density.
You can fit far more kW/U than the datacenter can possibly cool.
In the commodity space that I rent, I ran out of power before filling even half the rack. I’m sure higher power/cooling density is possible to obtain, but I would think you’re primarily paying for that versus square footage?
It's not really about needing power density; high density can easily happen accidentally. 40 1S servers in a rack could be 20 kW and 40 2S servers could be 30+ kW.
The OpenCompute "Delta Lake" machine mentioned in the article occupies only one third of 1RU and peaks at 400W. You will certainly be power/cooling limited, rather than volume limited, with that kind of density.
All things considered, managing fewer hosts is nicer than more hosts.
In some hardware generations, dual socket has pretty good cost and complexity tradeoffs. And if you benefit from having a large dataset in memory on a single machine, dual socket often gets you twice the DIMM sockets and therefore twice the ram. Quad socket has been very expensive (and not great performance) for quite some time, so that's usually out.
Single socket Epyc looks pretty impressive though; although I'm retired and probably won't get to work with those anytime soon.
>for storage including new uses for 3D Xpoint as a 3D NAND accelerator;
3D XPoint's future is not entirely certain. Intel, with their new CEO, has remained rather quiet on the subject. Micron is pulling the plug on it and sold the fab to Texas Instruments. The problem is there isn't a clear path forward with the technology: it made some sense when NAND and DRAM prices were high in 2016-2019. Once those dropped to normal levels, and with newer DDR5 and faster, lower-latency SLC NAND or Z-NAND, XPoint's cost benefit becomes unclear. I guess we will know once Intel's Optane P5800X [1] is out with reviews. It is quite a beast.
>Multi-Socket is Doomed
Are there really no use-cases where 128+ cores with NUMA offer some advantage?
>Slower Rotational
Seagate [2] is actually working on dual-actuator HDDs, think of it as something like internal RAID 0. The rationale being that as HDDs get bigger, the time to fill up those drives increases as well.
>ARM on Cloud
Marvell partly confirms that all hyperscalers intend to build their own ARM CPUs. But Google just announced their Tau instances [3], effectively cutting their cost / perf by 50%, where each vCPU is an entire physical CPU core rather than an x86 thread.
When a hypothetical 128-core single socket comes out, will there be no workload that prefers to use a 2x128-core dual socket instead?
AMD CPUs remain largely dual-socket compatible. Today's 64-core EPYCs can be dual-socketed into 2x64-core beasts.
It just seems silly to me that if you're building, say, 200 computers across 10 racks (20 computers per 40U rack), you'd prefer single socket over dual socket. If you're scaling up and out so much, what exactly is the problem with dual socket? It's not costs: dual socket remains more cost-effective on a per-core basis than single socket. Dual socket cuts the number of computers you need to work with in half. Etc. etc.
I don't have a workload I'd prefer to see on 2x128-core: We're already microservices running across a pool of instances, and would prefer a bigger pool of faster instances than a smaller pool of slower ones at the same cost. Once we get a workload running on 100+ cores, I often see a lot of lock contention anyway. Going bigger usually makes that worse (worse ROI).
As for datacenter size/cost, it's a good point, but what if two 1-Socket servers could take up the same space as one 2-Socket server? :-) That may never happen, but some level of space optimization will, so it's not a simple doubling of size. E.g., Facebook's work in the OCP with 1-socket sleds (or blades):
> As for datacenter size/cost, it's a good point, but what if two 1-Socket servers could take up the same space as one 2-Socket server? :-) That may never happen, but some level of space optimization will
Oh it certainly exists. Computers are space-optimized to the point of nonsense. IIRC, most people don't even bother to use widely available 1U servers because you run out of power before you fill up 40U racks.
Hyperscalers, such as Google / Netflix / Amazon, are a bit different of course (IIRC, you work at one right?), since they can specially build their data centers to have far denser power delivery and actually support 40 or even 80 computers per rack. But more typical offices simply do not have the power density to run 1U nodes or smaller (e.g., Supermicro's 2-nodes-in-1U or 4-nodes-in-2U systems).
In effect: modern computer systems usually run out of power before they run out of rack space. Especially when you consider that every Watt-delivered turns into Heat (Watts) generated, which then requires a more powerful air-conditioner to keep the room within operating specs.
So you're right that modern datacenters probably don't care about size. Space is relatively cheap; power lines are expensive! 2 sockets × 2 nodes per 1U == 160 CPUs per 40U rack. 10 such racks would use over a megawatt of power once we factor in air conditioning, so a typical building just won't handle that.
-------------
> I don't have a workload I'd prefer to see on 2x128-core: We're already microservices running across a pool of instances, and would prefer a bigger pool of faster instances than a smaller pool of slower ones at the same cost. Once we get a workload running on 100+ cores, I often see a lot of lock contention anyway. Going bigger usually makes that worse (worse ROI).
But that "lock contention" you're measuring is something like 250ns to 500ns over a channel that's 800Gbit/sec wide.
EDIT: To be more specific: I'm talking about the MESI messages going over the NUMA fabric.
In contrast, a packet over 10 Gbit Ethernet is basically two orders of magnitude less bandwidth and an order of magnitude more latency (maybe 2500 to 5000 nanoseconds of latency?).
If the application were truly limited by the communication paradigm between the NUMA Fabric, switching to Ethernet or InfiniBand would only slow it down further.
EDIT: Case in point: we don't do spinlocks over Ethernet. I mean, we could in theory (RDMA a region of memory over Ethernet and then hold it as a Spinlock), but we all know its a bad idea. We do spinlocks at L3 and/or the NUMA Fabric level (and maybe we'll do it over PCIe 5.0 / CCIX level, as cache-coherent I/O becomes possible).
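For reference, the thing being contended over that fabric is as small as this (a minimal C11 sketch, not a production lock): every failed test-and-set is one cache line bouncing between cores, or between sockets over the NUMA interconnect, at hundreds of nanoseconds rather than network microseconds.

```c
#include <stdatomic.h>

/* Initialize with: spinlock_t l = { ATOMIC_FLAG_INIT }; */
typedef struct {
    atomic_flag f;
} spinlock_t;

/* Acquire: each test-and-set on a contended lock pulls the cache line
 * exclusive into this core, i.e. one MESI round trip through the L3 or,
 * across sockets, over the NUMA fabric. */
static inline void spin_lock(spinlock_t *l)
{
    while (atomic_flag_test_and_set_explicit(&l->f, memory_order_acquire))
        ;   /* spin; a real lock would back off or yield here */
}

static inline void spin_unlock(spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->f, memory_order_release);
}
```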
Ah, thanks for the details, I wasn't referring to NUMA-induced lock contention, but rather lock contention in general. I've seen a workload hit 64 CPUs with 80% of CPU time in lock contention (and others in the 10-30%). Now, while that means the developer has a big problem to fix, it also has me wondering about single sockets getting big enough -- 128 CPUs is already hard to use well. Back when I first saw multi-socket systems with 2-4 total CPUs, getting the extra CPUs online was all goodness. But adding another 128 CPUs to my already 128-CPU system, well...is there a point where we can say, in general, that we already have enough cores? In the talk I referred to the 850,000-core GPU, and how I couldn't see that ever working as general purpose CPUs in the software of today. With 3D stacking, I think we'll reach a practical core limit on a single socket, and just won't need the complexity (including NUMA) of multi socket anymore.
There are plenty of tasks which max out at one block / threadgroup of 1024 CUDA-threads. cudaMemcpy is a silly example (probably maxes out memory bandwidth at just 64 or 32 cores used), but there are plenty of tasks that simply don't scale to the full use of a GPU.
Just because some tasks (many tasks?) fail to use more than 32-GPU cores doesn't mean that GPU-parallelism is useless. It just means that when you program those particular tasks, only use 32-GPU cores!! Then use the GPU-cores on _other_ tasks (possibly in parallel).
IIRC, cudaMalloc, and many other primitives in the CUDA framework, has been shown to have very little parallelism at all. You need to work at keeping this "sequential-code" outside of your inner loops. (Runs on CUDA-stream #0, which for older hardware at least is sequentially scheduled)
----------
1. Some tasks can effectively use infinite cores (SIMD-threads really for GPUs... but same idea since a SIMD-lane can largely emulate a thread as long as you're careful about branch divergence)
2. Some tasks can be parallelized at the application / operator level. Run many applications in parallel ("Makefile parallelism")
3. Some tasks (memcpy) are so memory-bound that parallelism will never help.
4. Some tasks have a better solution that becomes feasible with more compute power.
---------
Let's take a CPU example: H.264 encoding. IIRC, this task barely scales to 8 cores and has diminishing returns beyond that.
But an example of #2 would be Youtube: you have one encoding machine that handles transcoding in parallel. You don't run just 1 instance of the problem (using 8 cores), you run 32 in parallel, and each of those 32 instances can effectively use 8 cores for H.264 encoding.
And #4 can still happen: H.264 is pretty easy for modern computers, with little opportunity for parallelism. Switching to H.265 or even to AV1 will increase the compute power needed, and allow scaling of single tasks up to 16 to 64 cores. Now your hypothetical 2x128 machine can only run 4 transcoding sessions at a time.
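A sketch of that #2, operator-level pattern (the `encode_one` function here is a placeholder, not a real encoder): don't force one encode to scale past where the algorithm stops scaling, just run many independent encodes side by side.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <stdio.h>

/* Placeholder for one transcoding job that scales well to ~8 cores. */
static void encode_one(int job_id)
{
    printf("worker %d: encoding job %d\n", (int)getpid(), job_id);
    /* ... run the 8-thread encoder here ... */
}

int main(void)
{
    const int jobs = 32;                  /* 32 independent encodes in flight */

    for (int i = 0; i < jobs; i++) {
        pid_t pid = fork();
        if (pid == 0) {                   /* child: do one job and exit */
            encode_one(i);
            _exit(0);
        }
    }
    for (int i = 0; i < jobs; i++)        /* parent: wait for all children */
        wait(NULL);
    return 0;
}
```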
-----------
The dual-socket machine for a transcoding cluster is still superior over a single-socket machine. 10Gbit Ethernet is more than sufficient to handle 4x AV1 sessions (especially because AV1 is slower than realtime), so right there we've cut the number of Ethernet cables in half, which means we've cut the number of 10 Gbit Switches in half.
Having the two sockets share one Ethernet port is an efficiency gain even if you don't have any task-communication going on: if only for the I/O sharing capability of the NUMA Fabric.
- The flip side of cuts the computers you need to work with in half is that it doubles the blast radius in case of PSU/fan/mobo/etc failure
- If you're interested in I/O, dual sockets can be problematic because few motherboards are "balanced" with an equal number of PCIe slots local to each socket.
- NUMA makes everything harder. Even after the work that I've done to make NUMA useful for Netflix's Open Connect (CDN) on FreeBSD, I'd very much rather just use flat machines wherever I can. NUMA gives lots of opportunities for comically bad performance if any little thing is placed incorrectly.
I do appreciate the difficulty of getting software configured correctly.
But I'm of the opinion that software configuration is quicker and easier than redeveloping algorithms to become faster on FPGAs or GPUs.
It's really odd to have a talk about how FPGAs are part of a hypothetical mainstream future (when so few people even know how to code in Verilog, let alone know how to synthesize a systolic array or other obscure parallel architecture), and then turn around and say that dual-socket computers are too hard to configure.
Verilog / FPGAs aren't magic. They're just highly configurable logic gates + some preconfigured ALUs that allow for alternative parallel structures. These alternative parallel structures (most commonly a systolic array) are often highly specific to a task. But ultimately: the mode of compute still needs to be super-parallel to beat a CPU.
Remember: CPUs have higher clock-speeds than FPGAs. That's why FPGAs have mini-ALUs inside of them (ex: multipliers), because ASIC beats configurable logic in every spec that matters (GHz, power-efficiency, mm^2 on die).
3D CPU stacking seems interesting where surface area is a limited resource, but otherwise it seems like it would significantly complicate cooling things efficiently. Or is my assumption wrong?
Some random contemporary musings, that touch some of these topics: I really hope we have a rad eBPF based QUIC/HTTP3 front-end/reverse-proxy router in the next 5 years.
QUIC is so exciting and I just want it to be both fast & a supremely flexible way for a connection from a client to talk to a host of backend services. We'll definitely see some classic userland-based approaches emerge, but gee, I'm really hungry for an in-kernel option.
For context, I was at the park two days ago, thinking about replacing a Node timesync[1]-over-websockets thing with an NTP-over-WebTransport (QUIC) implementation. There weren't any H3 front-ends (which I kind of need, because I just have some random colo & VPS boxes), and even if there were, I was worried about adding latency (which a BPF-based solution would significantly reduce, while letting me re-use ports 80/443).
Especially as we see more extreme-throughput/HBM memory systems arrive, it's just so neat that we have a multiplexed transport protocol. Figuring out how to use that connection (a semi-stateless "connection", because QUIC is awesome) to talk to an array of services is an ultra-interesting challenge, and BPF sure seems like the go-to tech for routing & managing packets in the world today. QUIC, with its multiplexing, adds the complexity that it is now sub-packets that we want to route. I hope we can find a way to keep a lot of that processing in the kernel.
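To sketch the kernel side of that (a bare-bones XDP program; real QUIC routing would also need to parse connection IDs out of the UDP payload, which is the genuinely hard part), today you can at least classify UDP/443 before the socket layer ever sees it:

```c
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/udp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

/* Minimal XDP classifier: everything is passed through, but this is the
 * hook point where QUIC (UDP/443) packets could be counted, steered to a
 * specific queue/CPU, or redirected toward a backend. */
SEC("xdp")
int quic_classify(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end || ip->protocol != IPPROTO_UDP)
        return XDP_PASS;

    /* Assumes no IPv4 options, to keep the sketch verifier-friendly. */
    struct udphdr *udp = (void *)(ip + 1);
    if ((void *)(udp + 1) > data_end)
        return XDP_PASS;

    if (udp->dest == bpf_htons(443)) {
        /* likely QUIC: per-connection-ID routing would start here */
    }
    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
```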
Still feels to me like we should be going the other way - kick more and more things off of the motherboard and support them with discrete - potentially customized - processors of their own.
Between io_uring and current or future facilities of eBPF, we have a lot of tools on deck for pipelining IO operations, and once you have a way to pipeline IO operations, latency is no longer the only bottleneck. Then it's a matter of how much bandwidth you can push between two processes, or processors.
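A hedged liburing sketch of that pipelining (error handling trimmed; `data.bin` is just an example file): the point is that a whole batch of reads goes to the kernel in one submit, so per-operation latency stops dominating.

```c
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>

#define BATCH 8

int main(void)
{
    struct io_uring ring;
    char bufs[BATCH][4096];

    int fd = open("data.bin", O_RDONLY);              /* example file */
    if (fd < 0 || io_uring_queue_init(32, &ring, 0) < 0)
        return 1;

    /* Queue a batch of reads, then submit them all with one syscall. */
    for (int i = 0; i < BATCH; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, bufs[i], sizeof bufs[i], (__u64)i * 4096);
        io_uring_sqe_set_data(sqe, (void *)(long)i);  /* tag with the slot index */
    }
    io_uring_submit(&ring);

    /* Reap completions as they arrive; they may finish in any order. */
    for (int i = 0; i < BATCH; i++) {
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        printf("read %ld finished: %d bytes\n",
               (long)io_uring_cqe_get_data(cqe), cqe->res);
        io_uring_cqe_seen(&ring, cqe);
    }

    io_uring_queue_exit(&ring);
    return 0;
}
```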
OK, it must be a compute-heavy load that does little random memory access, but works mostly on compact in-cache structures and maybe does sustained sequential memory accesses. With that, it's not suitable to offload to the GPU.
* Second socket increases the memory channels and RAM available: 16-channel dual-EPYC with 8TB of RAM will be faster than 4TB of RAM on single-EPYC 8-channel.
* SQL optimizers automatically search for sequential scans, because sequential scans are faster.
* While JOIN can be done in GPU space, GPUs have extremely low memory capacity (only 80GB on the latest A100 that costs $10,000+). CPU will be faster because you can keep a much larger dataset hot in RAM. Your 80GB of VRAM on a GPU means nothing if your dataset is in the multi-TB range. (8TB of CPU-RAM on the other hand, serves as a reasonable cache)
More sockets add memory controllers, but we can also think about moving HBM closer to the cores as a L4 cache or scratch memory that’s not expected to be synchronised with other cores/sockets.