We published a paper where we captured the same kind of insights (deep NUMA hierarchies, including cache levels, NUMA nodes, and packages) and used them to tailor spinlocks to the underlying machine: https://dl.acm.org/doi/10.1145/3477132.3483557
It looks kinda like the color scales are normalized to just-this-CPU's latency? It would be neater if the scale represented the same values among CPUs. Or rather, it would be neat if there were an additional view for this data that could make it easier to compare among them.
I think the differences are really interesting to consider. What if the scheduler could consider these designs when weighing how to schedule each task? Either statically or somehow empirically? I think I've seen sysfs info that describes the cache hierarchies, so maybe some of this info is available already. That nest [1] scheduler was recently shared on HN, I suppose it may be taking advantage of some of these properties.
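Some of that topology is indeed already exposed: on Linux, sysfs reports which logical CPUs share each cache level. A minimal sketch of reading it (standard sysfs paths; the parsing and formatting are mine):

```rust
use std::fs;

// Print which logical CPUs share each cache level with CPU 0.
// Assumes a standard Linux sysfs layout; change the cpu index as needed.
fn main() -> std::io::Result<()> {
    let base = "/sys/devices/system/cpu/cpu0/cache";
    for entry in fs::read_dir(base)? {
        let path = entry?.path();
        let is_index = path
            .file_name()
            .map_or(false, |n| n.to_string_lossy().starts_with("index"));
        if !is_index {
            continue;
        }
        let level = fs::read_to_string(path.join("level"))?.trim().to_string();
        let kind = fs::read_to_string(path.join("type"))?.trim().to_string();
        let shared = fs::read_to_string(path.join("shared_cpu_list"))?.trim().to_string();
        println!("L{} {:12} shared with CPUs: {}", level, kind, shared);
    }
    Ok(())
}
```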
You might just be going down a deep rabbit hole with these questions :)
You can definitely do things like that, including pinning an application to a certain set of cores for performance. When you have NUMA it's pretty much mandatory for low-latency applications, since each CPU gets its own memory slots, which significantly lowers access latency. It's physically closer, for one.
It gets really fun when you get into network traffic, since you can bind a NIC to specific cores alongside kernel-bypass stuff and interrupt handling. Sometimes it even makes sense to have one NIC per CPU!
It's a lot of setup work though. Have a look into CPU shielding as well. The Linux scheduler cares more about fairness than it does latency, so you have to be very intentional in your setup.
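For the pinning part, a minimal sketch of what that looks like on Linux using the libc crate (core number hard-coded for illustration; CPU shielding/isolcpus setup not shown):

```rust
// Pin the current thread to a single core on Linux.
// A minimal sketch using the `libc` crate; error handling kept terse.
fn pin_to_core(core_id: usize) {
    unsafe {
        let mut set: libc::cpu_set_t = std::mem::zeroed();
        libc::CPU_ZERO(&mut set);
        libc::CPU_SET(core_id, &mut set);
        // pid 0 == the calling thread
        let rc = libc::sched_setaffinity(0, std::mem::size_of::<libc::cpu_set_t>(), &set);
        assert_eq!(rc, 0, "sched_setaffinity failed");
    }
}

fn main() {
    pin_to_core(2); // e.g. keep this process's main thread on core 2
    // ... latency-critical work here ...
}
```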
The phrase for this is NUMA, and NUMA scheduling is the term for scheduling that accounts for it. Mostly, it avoids moving threads between different NUMA domains.
This works much better on my POWER9 system. I have two CPUs in my machine with the same core count but different cache configurations according to hwloc and this showed that they really are different.
I was wondering what real-life situations this benchmark matters the most in, then I remembered... A few years ago I was working on a uni research project trying to eke out the most performance possible in an x86 software-defined EPC, basically the gateway that sits between the LTE cell tower intranet and the rest of the Internet. The important part for me to optimize was the control plane, which handles handshakes between end users and the gateway (imagine everyone spam-toggling airplane mode when their LTE drops). Cache coherence latency was a bottleneck. The control plane I developed had diminishing returns in throughput up to like 8 cores on a 12-core CPU in our dual-socket test machine. Beyond that, adding cores actually slowed it down significantly*. Not a single-threaded task, but not embarrassingly parallel either. The data plane was more parallel, and it ran on a separate NUMA node. Splitting either across NUMA nodes destroyed the performance.
* which in hindsight sounds like TurboBoost was enabled, but I vaguely remember it being disabled in tests
> So what did you do to get the most performance out of the system? pin the threads to specific CPUs?
No, the threads were already pinned to specific cores. The kernel was also in low-latency mode.
The project they handed to me was just the data plane, written in Rust. They also gave me a partial reference implementation of the control plane, meant for research purposes rather than performance. I had to add a lot of missing features to get it up to par with the supposedly industry-standard benchmark, which didn't exactly match the spec, so I had to reverse-engineer it. Then I had to mate it with the data plane. (They originally suggested I build a control plane from scratch in Rust, but the lack of an ASN.1 codegen lib for it made this infeasible within the time I had, considering also that I had zero systems experience or familiarity with the protocols.) I don't remember all the optimizations, but the ones that still come to mind:
1. Fixing all their memory leaks, obviously. Kinda hard cause they were in C code auto-generated by a Python script. There was even a code comment // TODO free memory after use.
2. Improving the ASN.1 en/decoding to take advantage of memcpy in some places. ASN.1 has different alignment modes, some byte-aligned and some bit-aligned. The control plane used byte-aligned, but the standard asn1c library understandably assumed bit-aligned in either case for simplicity's sake, since that works either way. So I added an optimized path for byte-aligned data that used memcpy instead of inspecting each byte for the end markers (sketched after this list). This was in a tight loop and made the biggest difference; basically every string copy got faster. The relevant function even knew it was in byte-aligned mode, so it was a simple fix once I figured it out. I tried to make a PR to improve this for everyone else, but I forget why I couldn't.
3. Playing with different arrangements of passing messages between threads and different ways of locking. I forget all the ones we tried. Using parking_lot locks instead of the default ones in the Rust portion helped, as did more optimistic locks in other places instead of mutexes; I forget where and why. Since then I've come across the general concept of optimistic vs. pessimistic locking a lot as something that makes or breaks performance, particularly in systems that handle money.
4. As I said, playing with the number of threads for each different setup in #3.
5. Playing with NIC settings. We were using Intel's DPDK library and optimized NIC drivers.
6. Making a custom `malloc` implementation that used a memory pool, was thread-scoped, and was optimized for repeated small allocs/deallocs specifically for a portion of the reused code that had a weird and inefficient pattern of memory access. I got it to be faster than the built-in malloc, BUT it was still break-even with DPDK's custom pooled malloc, so I gave up.
7. Branch hints. Tbh didn't make a big difference, even though this was pre Meltdown/Spectre.
8. Simplifying the telemetry. Idk if this helped performance, more of a rant... It's good enough to have some counters that you printf every 60s or something, then parse the logs with a Python script. That's very non-prone to bugs, anyone can understand it, and you can easily tell there's no significant impact on performance. It's overkill in this case to have a protobuf/HTTP client sending metrics to a custom sidecar process, complicating your builds, possibly impacting performance, and leaving no simple paper trail from each test. I respected the previous guy's engineering skills more than my own, but once I found a bug in that code, I took it out.
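To illustrate item 2, here is a hypothetical sketch of the byte-aligned fast path in Rust (illustrative names and layout, not asn1c's actual code):

```rust
/// Copy `len` octets out of an encoded buffer starting at `bit_offset`.
/// Purely illustrative: the real fix was in the C library's decode routine.
fn read_octets(buf: &[u8], bit_offset: usize, len: usize, out: &mut Vec<u8>) {
    let start = bit_offset / 8;
    if bit_offset % 8 == 0 {
        // Byte-aligned fast path: one bulk copy (what memcpy gave us in C).
        out.extend_from_slice(&buf[start..start + len]);
    } else {
        // Generic bit-aligned path: stitch each output byte together from
        // two neighbouring input bytes.
        let shift = bit_offset % 8;
        for i in 0..len {
            let hi = buf[start + i] << shift;
            let lo = buf[start + i + 1] >> (8 - shift);
            out.push(hi | lo);
        }
    }
}
```

The win comes entirely from replacing a per-byte loop with a single bulk copy whenever the encoder already knows it is byte-aligned.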
If anyone is interested, here are the results on my M1 Pro running Asahi Linux:
Min: 48.3
Max: 175.0
Mean: 133.0
I’ll try to copy the exact results once I have a browser on Asahi, but the general pattern is most pairs have >150ns and a few (0-1; 2-3,4,5; 3-4,5; 4-5; 6-7,8,9; 7-8,9; 8-9) are faster at about 50ns.
Edit: The results from c2clat (a little slower, but the format is nicer) are below.
It would be interesting to have a more detailed understanding of why these are the latencies, e.g. this repo has ‘clusters’ but there is surely some architectural reason for these clusters. Is it just physical distance on the chip or is there some other design constraint?
I find it pretty interesting where the interface that CPU makers present (e.g. a bunch of equal cores) breaks down.
Rings are great for latency in low-core-count situations. The LCC Intel chips all have a massive 512-bit ring bus (2x256-bit, one ring in each direction) internally, which delivers crazy fast core-to-core latency. However, this quickly starts to break down at higher core counts. Intel gets around this to some extent with its P and E cores, with a cluster of 4 E cores occupying the same slot on the ring bus as a P core. However, once you start getting over 14-16 slots on the ring bus it starts to get overwhelmed.
When you switch to mesh buses, the interconnect takes up way more space, so one has to compromise between bus width and the amount of area spent on the interconnect. Typically this means running reduced-width buses around the mesh, which limits core-to-core bandwidth. Not such a big deal if you're running a server; more of a problem if you're trying to run interactively with a user. Unless of course you're Apple and just devote a truckload of die space to dump a fucking mammoth amount of interconnect between your dies.
There are also ancillary concerns like fabrication yield. For instance, AMD runs chiplets probably because they can mix and match yields, and chiplets naturally segment the market. Get a CCX with 3 working cores? Pair it with another and you have a 6C/12T CPU. Get a CCX with 2 working cores? Pair it with another and you get a 4C/8T. Intel either gets a working die or they don't.
The problem here is that the interconnect between the CCXs is relatively slow. Dog slow compared to the ring bus. Even running the Infinity Fabric's fclock at 1.8GHz only nets you 57.6GB/sec between CCXs and five times the latency of the ring bus. When you look at a Ryzen 3 3100 (2x2 CCX) and a Ryzen 3 3300X (1x4 CCX), the difference in performance is non-trivial, and that's the Infinity Fabric dragging performance down. In comparison, an Intel core's L3 cache on a 3GHz ring bus (i.e. non-turbo) pulls down 96GB/sec. Sure, you're still ultimately limited by DRAM, but if stuff is staying in LLC it's a hell of a performance boost. In Zen 3 AMD even went to 8-core CCXs, which gave the whole thing a huge performance boost. Part of that was because the smaller lithography gave them more area to play with, so they could fit everything plus the interconnects onto the chiplet size they needed.
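The arithmetic behind those two bandwidth figures, assuming a 32-byte-per-cycle link width in both cases (my assumption, but it is consistent with the numbers quoted above):

```latex
32\ \mathrm{B/cycle} \times 1.8\ \mathrm{GHz\ (fclk)} = 57.6\ \mathrm{GB/s} \quad \text{(CCX to CCX over Infinity Fabric)}
32\ \mathrm{B/cycle} \times 3.0\ \mathrm{GHz\ (ring)} = 96\ \mathrm{GB/s} \quad \text{(core to L3 on the ring)}
```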
So yeah, I hope that little greatly oversimplified, surface level look was helpful.
Most of this cross-core overhead diversity is gone on Skylake-SP (the server parts) and newer chips, because Intel moved from a ring topology to a mesh design for the L3 cache.
But TL;DR modern big processors are not one big piece of silicon but basically "SMP in a box": a bunch of smaller chiplets interconnected with each other. That helps with yield (a "bad" chiplet costs you just 8 cores, not a whole 16/24/48/64-core chip). Those also usually come with their own memory controllers.
And so you basically have NUMA on a single processor, with all of the optimization challenges that come with it.
Some of it is simple distance. Some of it is architectural choices because of the distance. A sharing domain that spans a large distance performs poorly because of the latency. Therefore domains are kept modest, but the consequence is crossing domains has an extra penalty.
I am currently working on my master's degree in computer science and studying this exact topic.
In order to measure core-to-core latency, we should also understand how cache coherence works on Intel. I am currently experimenting with microbenchmarks on the Skylake microarchitecture. Due to the scalability issues of the ring interconnect on previous models' CPU dies, Intel opted for a 2D mesh interconnect in recent years. In this microarchitecture, the CPU die is split into tiles, each accommodating a core, caches, a CHA, a snoop filter, etc. I want to emphasize the role of the CHA here. Each CHA is responsible for managing coherence for a portion of the address space. If a core tries to fetch a variable that is not in its L1D or L2 cache, the CHA managing coherence for that variable's address will be queried to learn the whereabouts of the variable. If the data is on the die, the core currently owning the variable will be told to forward it to the requesting core. So, even though two communicating cores may be physically adjacent, the location of the CHA that manages coherence for the variable they pass back and forth also matters, because of the cache coherence mechanism.
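To make the measured quantity concrete, here is a minimal two-thread ping-pong in Rust in the spirit of what these tools benchmark (greatly simplified; no core pinning or warm-up, so treat the numbers loosely):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::time::Instant;

// Two threads bounce a counter through one shared (contended) cache line.
// The round-trip time is dominated by the coherence traffic described above.
fn main() {
    const ROUND_TRIPS: u64 = 1_000_000;
    let flag = Arc::new(AtomicU64::new(0));

    let ponger = {
        let flag = Arc::clone(&flag);
        std::thread::spawn(move || {
            for i in 0..ROUND_TRIPS {
                // Wait for the pinger to publish an odd value, then answer.
                while flag.load(Ordering::Acquire) != 2 * i + 1 {}
                flag.store(2 * i + 2, Ordering::Release);
            }
        })
    };

    let start = Instant::now();
    for i in 0..ROUND_TRIPS {
        flag.store(2 * i + 1, Ordering::Release);
        while flag.load(Ordering::Acquire) != 2 * i + 2 {}
    }
    let elapsed = start.elapsed();
    ponger.join().unwrap();

    println!(
        "~{:.1} ns per one-way hop",
        elapsed.as_nanos() as f64 / ROUND_TRIPS as f64 / 2.0
    );
}
```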
Please see my reply/response ref: "ffwd: delegation is (much) faster than you think" in this HN post. It's a scheme to leverage high core-to-core bandwidth to make coordinating shared data structures easier. I'd definitely be interested in your 2 cents.
I suspect what you are seeing is the preferred cores. With TVB/OCTVB (thermal velocity boost/overclocking thermal velocity boost) those can boost higher than the other cores.
From my 12900KS (it's a boosted 12900K):
Cores 6-7 are rated for 5.5 GHz.
Key ratio limits from the tuning-utility dump (the rest is per-core frequency/voltage telemetry):

    Turbo Ratio Limit   55, 55, 52, 52, 52, 52, 52, 52
    Turbo Limit Cores    1,  2,  3,  4,  5,  6,  7,  8
    Atom Ratio Limit    40, 40, 40, 40, 40, 40, 40, 40
    Ring Max Ratio      47
Num cores: 24
Using RDTSC to measure time: false
Num round trips per samples: 5000
Num samples: 300
Showing latency=round-trip-time/2 in nanoseconds:
I mean, why does CPU 8, rather than CPU 7 or CPU 9, have the fastest access?
Anyway, I asked this on [stackoverflow](https://stackoverflow.com/questions/73767563) and got downvoted there too. T_T
Fails to build from source with Rust 1.59, so I tried the C++ `c2clat` from elsewhere in the thread. Quite interesting on Alder Lake, because the quartet of Atom cores has uniform latency (they share an L2 cache and other resources) while the core-to-core latency on the Core side of the CPU varies. Note that the way these are logically numbered is: 0,1 are the SMT threads of the first core, and so forth through 14-15; 16-19 are Atom cores with one thread each.
It's probably the ring bus topology. As you pointed out, the E cores are directly connected to a shared L2 and are only one stop away from each other on the ring bus. P cores can be up to 6 stops away from each other, because there are 4 on each side, an E-core cluster on each side, and a system agent (PCIe/DRAM) or iGPU stop in between each side.
The data seems problematic; there are a few 0s here and there and some strange noise in the rest. Increase the number of samples / iterations with `core-to-core-latency 30000 1000 --csv > output.csv`
I've been doing some latency measurements like this, but between two processes using Unix domain sockets. I'm measuring more on the order of 50 µs on average, when using FIFO RT scheduling. I suspect the kernel is either letting processes linger for a little bit, or perhaps the "idle" threads tend to call into the kernel and let it do some non-preemptible bookkeeping.
If I crank up the amount of traffic going through the sockets, the average latency drops, presumably due to the processes being able to batch together multiple packets rather than having to block on each one.
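For comparison, a bare-bones version of that kind of measurement over a Unix-domain socketpair (a sketch; no FIFO/RT scheduling or pinning here):

```rust
use std::io::{Read, Write};
use std::os::unix::net::UnixStream;
use std::time::Instant;

// Ping-pong one byte over a Unix domain socket pair and report the
// average round-trip time. Scheduler wake-ups dominate, which is why
// this lands in the tens of microseconds rather than nanoseconds.
fn main() -> std::io::Result<()> {
    const ITERS: u32 = 100_000;
    let (mut a, mut b) = UnixStream::pair()?;

    let echo = std::thread::spawn(move || {
        let mut buf = [0u8; 1];
        for _ in 0..ITERS {
            b.read_exact(&mut buf).unwrap();
            b.write_all(&buf).unwrap();
        }
    });

    let mut buf = [0u8; 1];
    let start = Instant::now();
    for _ in 0..ITERS {
        a.write_all(&buf).unwrap();
        a.read_exact(&mut buf).unwrap();
    }
    let avg_us = start.elapsed().as_micros() as f64 / ITERS as f64;
    echo.join().unwrap();
    println!("average round trip: {avg_us:.1} µs");
    Ok(())
}
```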
I only use AF_UNIX sockets when I need to pass open file handles between processes. I generally prefer message queues: https://linux.die.net/man/7/mq_overview
You might want to measure yourself, because the table of results there doesn't make a lot of sense to me. 4-5 of those methods should be dominated by context switch latency and thus clustered together tightly.
I looked at the sources. The pipe and fifo benchmarks also send a signal on every message, so they're not measuring what you'd expect.
This is a fascinating insight into a subsystem which we take for granted and naively assume is homogeneous. Thank you so much for sharing.
A request to the community - I am particularly interested in the Apple M1 Ultra. Apple made a pretty big fuss about the transparency of their die-to-die interconnect in the M1 Ultra. So, it would be very interesting to see what happens with it - both on Mac OS and (say, Asahi) Linux.
This paper describes a mechanism for client threads pinned to distinct cores to delegate a function call to a distinguished server thread pinned to its own core, all on the same socket.
This has a multitude of applications, the most obvious being making a shared data structure MT-safe through delegation rather than saddling it with mutexes or other synchronization points; it is especially beneficial with small critical sections.
The paper's abstract concludes claiming "100% [improvement] over the next best solution tested (RCL), and multiple micro-benchmarks show improvements in the 5–10× range."
The code does delegation without CAS, locks, or atomics.
The efficacy of such a scheme rests on two facets, which the paper explains:
* Modern CPUs can move GBs/second between core L2/LLC caches
* The synchronization between requesting clients and responding servers depends on each side spinning on a shared memory address looking for bit toggles. Briefly, the server only reads client request memory, which the client only writes (clients each have their own slot). On the response side, clients read the server's shared response memory, which only the server writes. This one-side-reads, one-side-writes arrangement is supposed to minimize the number of cache invalidations and MESI syncs (sketched below).
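A much-simplified sketch of that handshake for a single client and a single server (my own illustration, not the paper's code; the names are made up):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

#[repr(align(64))] // keep each field on its own cache line
struct Line(AtomicU64);

// One delegation slot: the client is the only writer of the request fields,
// the server is the only writer of the response fields. Sequence numbers
// double as "new data available" toggles.
struct Slot {
    req_seq: Line,
    req_arg: Line,
    resp_seq: Line,
    resp_val: Line,
}

impl Slot {
    // Client side: publish the argument, then spin for the matching response.
    fn call(&self, seq: u64, arg: u64) -> u64 {
        self.req_arg.0.store(arg, Ordering::Relaxed);
        self.req_seq.0.store(seq, Ordering::Release);
        while self.resp_seq.0.load(Ordering::Acquire) != seq {}
        self.resp_val.0.load(Ordering::Relaxed)
    }

    // Server side: spin for a new request, run the critical section, reply.
    fn serve_one(&self, seq: u64, f: impl FnOnce(u64) -> u64) {
        while self.req_seq.0.load(Ordering::Acquire) != seq {}
        let arg = self.req_arg.0.load(Ordering::Relaxed);
        let result = f(arg);
        self.resp_val.0.store(result, Ordering::Relaxed);
        self.resp_seq.0.store(seq, Ordering::Release);
    }
}

fn main() {
    let slot = Arc::new(Slot {
        req_seq: Line(AtomicU64::new(0)),
        req_arg: Line(AtomicU64::new(0)),
        resp_seq: Line(AtomicU64::new(0)),
        resp_val: Line(AtomicU64::new(0)),
    });

    // "Server" thread: owns the shared state (here just a counter) and
    // executes delegated operations on behalf of the client.
    let server = {
        let slot = Arc::clone(&slot);
        std::thread::spawn(move || {
            let mut shared_counter = 0u64;
            for seq in 1..=1_000u64 {
                slot.serve_one(seq, |arg| {
                    shared_counter += arg; // the "critical section"
                    shared_counter
                });
            }
        })
    };

    // "Client" thread: delegates instead of taking a lock.
    for seq in 1..=1_000u64 {
        let total = slot.call(seq, 1);
        assert_eq!(total, seq);
    }
    server.join().unwrap();
}
```

As I understand the paper, a real implementation has one such slot per client, with the server sweeping all slots in a loop; that batching is where much of the claimed throughput comes from.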
I spent some time testing the author's code and went so far as to write my own version. I was never able to get anywhere near the throughput claimed in the paper. There are also some funny "nop" assembler instructions within the code that I gather are a cheap form of thread yielding.
In fact, this relatively simple SPSC MT ring buffer, which has but a fraction of the code:
In my experiments the CPU spun too quickly, so core-to-core bandwidth was squandered before the server could signal a response or the client could signal a request. I wonder if adding select atomic reads, as with the SPSC ring, might help.
They mention that they measure "cache coherence protocol" throughput. So yes, this happens when two or more cores need to work on the same memory and they need to synchronize/update what is in their L1 and L2 (and probably L3 as well?) caches.
So cache invalidation? If core 1 writes to something core 2 has in its L1/L2/L3? But who is responsible for that "correction", core 1 or core 2, or a combination?
At the most basic level, core 1 will tell core 2 (assuming a non-dirty state) to 'invalidate' it. In the meantime, cores 1 and 2 can read different values for that memory address/cache line before core 2 carries out the invalidation. The protocol is more complex than that; the cache lines have different states. The basic protocol is known as MESIF[0] for Intel.
Core 1 would send an invalidation request to get exclusive (write) access to that cache line. This would be routed to core 2, which removes the cache line from its cache.
If core 2 now accesses that cache line again, it would not find it in its L1 or L2, and at L3 the cache-line directory routes the request to core 1.
In HFT, we typically pin processes to run on a single isolated core (on a multicore machine). That allows the process to avoid a lot of kernel and other interrupts which could cause the process to not operate in a low latency manner.
If we have two of these processes, each on separate cores, and they occasionally need to talk to each other, then knowing the best choice of process/core location can keep the system operating in the lowest latency setup.
So, an app like this could be very helpful for determining where to place pinned processes onto specific cores.
There are also some common rules of thumb, such as: don't put pinned processes that need to communicate on cores that are separated by the QPI, as that just adds latency. Make sure, if you're communicating with a NIC, to find out which socket has the shortest path on the PCI bus to that NIC, and other fun stuff. I never even thought about NUMA until I started to work with folks in HFT. It really makes you dig into the internals of the hardware to squeeze the most out of it.
In general this makes sense, but I think you need to be careful in some cases where the lowest latency between two logical "cores" is likely to be between those which are SMT siblings on the same physical core (assuming you have an SMT-enabled system). These logical "cores" will be sharing much of the same physical core's resources (such as the low-latency L1/L2 and micro-op caches), so depending on the particular workload, pinning two threads to these two logical "cores" could very well result in worse performance overall.
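On Linux you can check which logical CPUs are SMT siblings before deciding where to pin things; a quick sketch reading the standard sysfs topology files:

```rust
use std::fs;

// Print the SMT sibling list for each logical CPU, e.g. "0,16" or "0-1"
// depending on how the kernel enumerates hyperthreads.
fn main() {
    for cpu in 0.. {
        let path = format!(
            "/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list"
        );
        match fs::read_to_string(&path) {
            Ok(s) => println!("cpu{cpu}: siblings {}", s.trim()),
            Err(_) => break, // no more CPUs
        }
    }
}
```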
Doesn't this leave some performance on the table? Each core has more ports than a single thread could reasonably use, exactly because two threads can run on a single core.
In terms of throughput, technically yes, you are leaving performance on the table. However, in HFT the throughput is greatly limited by IO anyways, so you don't get much benefit with it enabled.
What you want is to minimize latency, which means you don't want to be waiting for anything before you start processing whatever information you need. To do this, you need to ensure that the correct things are cached where they need to be, and SMT means that you have multiple threads fighting each other for that precious cache space.
In non-FPGA systems I've worked with, I've seen dozens of microseconds of latency added with SMT enabled vs disabled.
Maybe 10 years ago that was the common thing, but there are so many extra resources (esp. registers) that disabling SMT now gives up almost half the chip. If you can be cache-friendly enough, the extra cycles will make up for it.
No, this is not true at all. "The extra cycles" is the exact thing you want to avoid in HFT. It doesn't matter how much throughput of processing you can put through a single core if you enable SMT, because somewhere in the path (either broker, exchange, or some switch in between) you will eventually be limited in throughput that it becomes irrelevant.
The only thing that matters at that point is latency, and unless you are cache-friendly enough to store your entire program in a single core's cache twice over, you would be better off disabling SMT altogether. And even if you were able to do that, it would not matter as a single thread would be done processing a message by the time the next one comes in. At least at the currently standard 10-25Gbps that the exchanges can handle.
In HFT, we're fine giving up half the registers in a core if it means we get an extra few microseconds of latency back.
What kind of "talking" are we talking about? I thought most IPC works somehow via shared memory under the hood rather than CPU cores actually communicating, how would you even do that?
"Shared memory" is really more of a description of the memory model that is exposed to the programmer, rather than the hardware.
Under the hood, there are caches -- sometimes memory addresses live in a cache above you because you put them there, sometimes they live in a cache above you because a neighboring core that shares your cache put them there, sometimes they live in RAM, and sometimes they live in another cache on your chip and you have to ask for them through the on-chip network. The advice I have been given (as a non-HFT guy) is just to try not to mess around too much with the temporal locality, pin threads to cores, and let the hardware handle the rest.
It's mentioned in the readme - this is measuring the latency of cache coherence. Depending on the architecture, some sets of cores will be organized with shared L2/L3 cache. In order to acquire exclusive access to a cache line (a memory range of 64-128ish bytes), caches belonging to other sets of cores need to be waited on, either to release their own exclusive access or to be informed that they need to invalidate their copies. This is observable as a small number of cycles of additional memory-access latency that is heavily dependent on the hardware cache design, which is what is being measured.
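You can see that cost directly by putting two independently updated counters on the same cache line versus on separate lines; a small sketch (absolute numbers vary wildly by machine):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Instant;

// Two threads hammer two *different* counters. When the counters share a
// cache line, every increment forces the line to bounce between cores
// (the invalidation traffic described above); padding them apart removes it.
#[repr(align(64))]
struct Padded(AtomicU64);

fn bench(a: &'static AtomicU64, b: &'static AtomicU64) -> std::time::Duration {
    const N: u64 = 50_000_000;
    let start = Instant::now();
    let t1 = std::thread::spawn(move || (0..N).for_each(|_| { a.fetch_add(1, Ordering::Relaxed); }));
    let t2 = std::thread::spawn(move || (0..N).for_each(|_| { b.fetch_add(1, Ordering::Relaxed); }));
    t1.join().unwrap();
    t2.join().unwrap();
    start.elapsed()
}

fn main() {
    // Same cache line: two adjacent counters in one array.
    static SHARED: [AtomicU64; 2] = [AtomicU64::new(0), AtomicU64::new(0)];
    // Separate cache lines: each counter padded out to 64 bytes.
    static PADDED: [Padded; 2] = [Padded(AtomicU64::new(0)), Padded(AtomicU64::new(0))];

    println!("same line:      {:?}", bench(&SHARED[0], &SHARED[1]));
    println!("separate lines: {:?}", bench(&PADDED[0].0, &PADDED[1].0));
}
```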
Cross-cache communication may simply happen by reading or writing to memory touched by another thread that most recently ran on another core
Check out https://en.wikipedia.org/wiki/MOESI_protocol for starters, although I think modern CPUs implement protocols more advanced than this (I think MOESI is decades old at this point)
AMD processors also use a hierarchical coherence directory, where the global coherence directory on the IO die enforces coherence across chiplets and a local coherence directory on each chiplet enforces coherence on-die http://www.cs.cmu.edu/afs/cs/academic/class/15740-f03/www/le...
The example code uses an atomic store instruction in order to write values from threads to a memory location, and then an atomic read to read them. The system guarantees that a read of a previously written location is consistent with that earlier write, i.e. "you always read the thing you just wrote" (on x86, this guarantee is called "Total Store Ordering"). Reads and writes to a memory location are translated to messages on a memory bus, and that is connected to a memory controller, which the CPUs use to talk to the memory they have available. The memory controller is responsible for ensuring every CPU sees a consistent view of memory according to the respective platform memory ordering rules, and with respect to the incoming read/write requests from various CPUs. (There are also caches between the DRAM and CPU here, but they are just another layer in the hierarchy and aren't so material to the high-level view, because you can keep adding layers, and indeed some systems even have L1, L2, L3, and L4 caches!)
A CPU will normally translate atomic instructions like "store this 32-bit value to this address" into special messages on the memory bus. Atomic operations, it turns out, are already normally implemented in the message protocol between cores and the memory fabric, so you just translate the atomic instructions into atomic messages "for free" and let the controller sort it out. But the rules for how these messages flow across the memory bus are complicated, because the topology of modern CPUs is complicated. They are divided and partitioned into NUMA domains, have various caches that are shared or not shared between 1-, 2-, or 4-way clusters, et cetera. They must still obey the memory consistency rules defined by the platform, across all the caches and interconnects between them. As a result, there isn't necessarily a uniform amount of time for a write to location X from one core to become visible to another core when it reads X; you have to measure it to see how the system responds, which might include expensive operations like flushing the cache. Two cores that are very far apart will simply take more time to see each other's messages, since the bus path will likely be longer -- the latency until a core-to-core memory write is consistently visible will be higher.
So when you're designing high performance algorithms and systems, you want to keep the CPU topology and memory hierarchy in mind. That's the most important takeaway. From that standpoint, these heatmaps are simply useful ways of characterizing the baseline performance of some basic operations between CPUs, so you might get an idea of how topology affects memory latency.
Hm. Use Ice Lake, with an aggregator process sitting on core 11 and all the others running purely on their own input and then reporting to core 11. (Core 11 in that heatmap appears to be the only CPU with a sweetheart core having low latency to all other cores.) I wonder how hard it would be to write a re-writer to map an executable to match the CPU architecture's characteristics. Something like graph transformations to create clusters (of memory addresses) that are then mapped to a core.
I think that's one of the Intel chips where the cores are connected on a 2D grid rather than the more hierarchical or single ring interconnects you see in the other systems.