
What's the latency comparison between the memory bus and PCIe?



I dug into this for an HN comment months ago; I think it's a 2-3 order of magnitude difference in latency. RAM is measured in nanoseconds, PCIe in microseconds.


The difference is smaller than that if you optimize. The memory bus is about 50 ns of latency, and PCIe can be brought down to sub-500 ns.


I don't think you've checked those numbers. SSD access is on the order of 10-20 microseconds (10,000-20,000 ns), and memory bus access is ~10-15 nanoseconds.

Here's the comment I made a couple months ago when I looked up the numbers:

I keep hearing that, but it's simply not true. SSDs are fast, but they're several orders of magnitude slower than RAM, which in turn is orders of magnitude slower than CPU cache.

Samsung 990 Pro 2TB has a latency of 40 μs

DDR4-2133 with a CAS latency of 15 has a latency of 14 nanoseconds.

DDR4 latency is 0.035% that of one of the fastest SSDs; to put it another way, DDR4 is ~2,857x faster than the SSD.

L1 cache is typically accessible in 4 clock cycles; on a 4.8 GHz CPU like the i7-10700, that puts L1 latency under 1 ns.
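
If you want to sanity-check the RAM figure yourself, here's a rough pointer-chasing microbenchmark in C (my own sketch, not from any of the sources above; the buffer size and iteration count are arbitrary, and results will vary by machine):

    /* Pointer-chasing load-latency sketch. Build: cc -O2 chase.c -o chase */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (64 * 1024 * 1024 / sizeof(size_t)) /* 64 MiB, well past LLC */
    #define ITERS 10000000UL

    int main(void) {
        size_t *buf = malloc(N * sizeof *buf);
        size_t *idx = malloc(N * sizeof *idx);
        if (!buf || !idx) return 1;

        /* Fisher-Yates shuffle, then link the buffer into one random
           cycle so every load depends on the previous one and the
           hardware prefetcher can't hide the miss latency. */
        for (size_t i = 0; i < N; i++) idx[i] = i;
        srand(1);
        for (size_t i = N - 1; i > 0; i--) {
            size_t j = (size_t)rand() % (i + 1);
            size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
        }
        for (size_t i = 0; i < N; i++)
            buf[idx[i]] = idx[(i + 1) % N];

        struct timespec t0, t1;
        size_t p = idx[0];
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (unsigned long i = 0; i < ITERS; i++)
            p = buf[p];                   /* serialized, cache-missing loads */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        /* print p so the chase isn't optimized away */
        printf("%.1f ns/load (p=%zu)\n", ns / ITERS, p);
        return 0;
    }

Shrink N to fit in L1/L2/L3 and you should see the cache latencies from this thread fall out of the same loop.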


I have absolutely checked those numbers. I have written PCIe hardware cores and drivers before, and have microbenchmarked CPUs pretty extensively.

I think you're mixing up a few things: CAS latency and total access latency of DRAM are not the same, and SSDs and generic PCIe devices are not the same. Most of an SSD's latency is in its firmware and in accesses to the backing flash memory, not in the PCIe protocol itself - which is why the Intel Optane SSDs were so fast. Many NICs, for example, advertise sub-microsecond round-trip times, and those are PCIe devices.

Most of DRAM access latency (and a decent chunk of access latency to low-latency PCIe devices) comes from the CPU's cache coherency network, queueing in the DRAM controllers, and opening of new rows. If you're thinking only of CAS latency, you're missing the vast majority of the latency involved in DRAM operations: CAS latency is the best-case scenario, and you only get it when hitting an open row on an idle bus with a bank that is ready to accept an access.
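
For the PCIe side, the raw round trip is easy to see with an uncached MMIO read on Linux. A rough sketch, with heavy caveats: the device address below is a placeholder, you need root, BAR0 must be a memory BAR, and reading an arbitrary device register can have side effects, so only poke a device you know is safe to read:

    /* MMIO read-latency sketch. Build: cc -O2 mmio.c -o mmio (run as root) */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <time.h>

    #define ITERS 100000

    int main(void) {
        /* Placeholder BDF - substitute a real device from lspci. */
        int fd = open("/sys/bus/pci/devices/0000:01:00.0/resource0", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }
        volatile uint32_t *bar =
            mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
        if (bar == MAP_FAILED) { perror("mmap"); return 1; }

        struct timespec t0, t1;
        uint32_t sink = 0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < ITERS; i++)
            sink += bar[0];  /* each uncached read is a full PCIe round trip */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%.0f ns per MMIO read (sink=%u)\n", ns / ITERS, sink);
        return 0;
    }

On most machines this lands somewhere in the hundreds of nanoseconds - the protocol cost that an SSD's microseconds of firmware and flash latency sit on top of.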


I will defer to your experience; it seems you have more in-depth knowledge of this than I do.


CAS meaning Compare and Swap?


CAS latency (https://en.wikipedia.org/wiki/CAS_latency) is how long it takes to read a word out of the active row. Asynchronous DRAM had "row address strobe" and "column address strobe" signals.


Synchronous DRAM (SDR and DDRx) still has a RAS wire (row address strobe) and CAS wire (column address strobe), but these are accessed synchronously - and also used to encode other commands.

DDR DRAM is organized into banks (for DDR4/5 these are grouped into bank groups), rows, and columns. A bank has a dedicated set of read/write circuits for its memory array, but rows and columns within a bank share a lot of circuitry. The read/write circuits are very slow and need to be charged up to perform an access, and each bank can only have one row open at a time. The I/O circuits are narrower than the memory array, which is why each row is divided into columns.

The best-case scenario is that a bank is precharged and the relevant row is open, so you just issue a read or write command (generally a "read/write with auto-precharge" to get ready for the next command) and the time from when that command is issued to when the data bus starts the transaction is the CAS latency. If you add in the burst size (which is 4 cycles for a double-data-rate burst length of 8) plus the signal propagation RTT to the memory, you get your best-case access latency.
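
To put rough numbers on that best case, using the DDR4-2133 CL15 part mentioned upthread (my arithmetic; real parts round tCK to the supported value):

    tCK         = 1 / 1066.67 MHz      ≈ 0.94 ns
    CAS (CL15)  = 15 clocks * 0.94 ns  ≈ 14.1 ns
    burst (BL8) = 4 clocks * 0.94 ns   ≈ 3.8 ns
    best case   ≈ 14.1 + 3.8 ns + RTT  ≈ 18-20 ns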

The worst-case scenario for DDR is then that you have just issued a command to a bank and you need to read/write a different row on that bank. To do that, you need to wait out the bank precharge time and the row activation time, and then issue your read or write command. This adds a lot of waiting where the bank is effectively idle. Because of that wait time, memory controllers are very aggressive about reordering things to minimize the number of row activations, so you may find your access waiting in a queue for several other accesses, too.
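
Rough numbers for the worst case too, assuming a typical 15-15-15 DDR4-2133 part (the tRP/tRCD values are my assumption, not from this thread):

    tRP  (precharge old row) = 15 clocks ≈ 14.1 ns
    tRCD (activate new row)  = 15 clocks ≈ 14.1 ns
    CL   (column access)     = 15 clocks ≈ 14.1 ns
    row miss                 ≈ 42 ns + burst + queueing delays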

Also, processors generally use a hash function to map physical addresses to DRAM channels, banks, rows, and columns, but they set it up to optimize sequential access: address bits roughly map (from least to most significant) to memory channel -> bank -> column -> row. It is more complicated than that in reality, but that's not a bad way to think about it.
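
As a toy illustration of that decode (the field widths here are made up - 2 channels, 16 banks, 1024 columns of 64-byte lines - and real controllers XOR-hash the bank bits rather than slicing them out directly):

    /* Toy physical-address -> DRAM-coordinate decode, following the
       channel -> bank -> column -> row ordering described above. */
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint64_t pa   = 0x123456789Full;  /* example physical address */
        uint64_t line = pa >> 6;          /* drop the 64 B line offset */
        unsigned channel =  line        & 0x1;   /* lowest bit: channel */
        unsigned bank    = (line >> 1)  & 0xF;   /* next 4 bits: bank */
        unsigned column  = (line >> 5)  & 0x3FF; /* next 10 bits: column */
        uint64_t row     =  line >> 15;          /* remaining bits: row */
        printf("ch=%u bank=%u col=%u row=%llu\n",
               channel, bank, column, (unsigned long long)row);
        return 0;
    }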



