
> Unfortunately, the laws of physics driving DRAM cells have not improved much over the last couple of years (or decades, for that matter), so memory chips still must operate with similar absolute latencies, driving up the relative CAS latency. In this case 14ns remains the gold standard, with CAS latencies at the new speeds being set to hold absolute latencies around that mark.

Some gaming memory kits can do 10ns or less latency. Though I guess if memory latency is your bottleneck, you should look at HBM.




HBM is slower than DDR per pin, the speed gain is from a hugely parallel bus.
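Rough numbers to illustrate (ballpark per-pin rates and widths for a single HBM3 stack vs. a single DDR5 DIMM, not any specific part):

  # Per-pin rate vs. bus width (ballpark figures)
  hbm3_gbps_per_pin, hbm3_pins = 6.4, 1024   # one HBM3 stack: 1024 data pins
  ddr5_gbps_per_pin, ddr5_pins = 7.2, 64     # e.g. a DDR5-7200 DIMM: 64 data pins

  print(hbm3_gbps_per_pin * hbm3_pins / 8)   # ~819 GB/s per stack
  print(ddr5_gbps_per_pin * ddr5_pins / 8)   # ~58 GB/s per DIMM

The DIMM is faster per pin, but the stack has 16x the pins.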

Doesn't parallel mean higher latency if your tasks aren't "embarrassingly parallelizable"?


The smallest transfer done from memory is a single cache line, which on most desktop machines is 64 bytes, or 512 bits. You could imagine a memory bus that was 512 bits wide and transferred a cache line per clock, and this would improve latency when compared to a serial bus with higher clock speed. HBM doesn't do that, though, instead every HBM3 module has 16 individual 64-bit channels, with 8n prefetch (that is, when you send a single request to a single channel, it will respond with 512 bits over 8 cycles).
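Quick arithmetic check on that, using just the numbers above:

  # One HBM3 read: an 8-beat burst on one 64-bit channel
  channel_bits, burst_length = 64, 8
  print(channel_bits * burst_length)        # 512 bits
  print(channel_bits * burst_length // 8)   # = 64 bytes, one cache line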


DDR5 has 2 independent 32-bit channels per DIMM. Multiple transfers are required to move 64 bytes.


DDR5 has a 16n prefetch, so a single transfer from a 32-wide channel moves 64 bytes.
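Same arithmetic as the HBM case, just with BL16 on a 32-bit sub-channel:

  channel_bits, burst_length = 32, 16       # DDR5 sub-channel, BL16
  print(channel_bits * burst_length // 8)   # 64 bytes: one cache line per burst

So it's still one request per cache line, just delivered over more beats.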


> Some gaming memory kits can do 10ns or less latency

Source? My overclocked desktop RAM shows 45ns in benchmarks. I call bullshit on 4.5x faster RAM; most people fight for an extra 5% latency reduction.


That's CAS latency. To calculate the latency of a timing you divide the timing itself by the clock frequency of your sticks. For example, DDR4-4000 CL14 is running at 2000MHz = 2GHz, so the CAS latency is 14/2 = 7ns.
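In code, the same conversion (nothing here beyond the formula above):

  def cas_ns(cl_cycles, data_rate_mt_s):
      # DDR does two transfers per clock, so the clock is half the data rate
      clock_mhz = data_rate_mt_s / 2
      return cl_cycles / clock_mhz * 1000   # cycles / MHz -> us, then ns

  print(cas_ns(14, 4000))   # DDR4-4000 CL14 -> 7.0 ns
  print(cas_ns(30, 6000))   # DDR5-6000 CL30 -> 10.0 ns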

But it's just one timing among many, and it doesn't come into play all that often, so it's not that relevant to performance anyway - https://www.youtube.com/watch?v=pgb8N23tsfA


That's probably 10 ns for the DRAM and 35 ns for the caches and memory controller.


Just to make sure I understand: you're saying that checking L1/L2/L3 takes around 35ns, and then the CPU accesses DRAM which takes 10ns? If that's so, how is L3 cache any faster than DRAM? Also, can you explain why the memory controller adds some latency?


An L3 hit only takes ~15 ns, so that means another 15-20 ns is spent traversing the fabric and memory controller. I'm not sure what all is involved there, but on Intel the request has to go around the ring, and on AMD it has to cross chiplets.
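So the measured ~45 ns plausibly decomposes something like this (my guesses for the split, not measured values):

  # Hypothetical breakdown of a ~45 ns measured memory latency
  l3_lookup_ns = 15   # time to discover the access is an L3 miss
  fabric_ns    = 20   # ring/fabric + memory controller, both directions
  dram_ns      = 10   # time inside the DRAM itself
  print(l3_lookup_ns + fabric_ns + dram_ns)   # 45 ns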


Interesting. If an L3 hit takes 15 ns, then based on your argument a hypothetical CPU with only one core (and hence no fabric) would be better off without L3, since a DRAM read can be performed in just 10 ns.


You still need a memory controller, and you still need to get to that controller at the edge of the die. And going to RAM more often would surely consume more power.


No, the 10 ns is just the time inside the DRAM. Reading from DRAM would take 20-30 ns even in a very simple chip.


This is the part I don't understand. You're saying that the interval from when the DRAM first receives a read request to when it sends the data back over the channel is about 10ns, at least in fancy gaming RAM. Ok, fine. Where is the other 10-20 ns of latency coming from? Why can't the CPU begin using the data as soon as it arrives? I guess some time is needed to move the data from the memory controller to the actual CPU core. But it seems to me (far from an expert) that this shouldn't take a full 10-20 ns. Or am I mistaken?


Firstly, to clarify: there's nothing very special about "gaming RAM" other than that the particular chunk of silicon performs better than others, so they stuck a shiny sticker and an oversized heatsink on it.

The problem here is that the latency is state dependent, and who knows which state people are talking about here. The memory itself can have a latency of 1-3x the CAS latency number, and you need to understand how DRAM is accessed to appreciate why. That also clarifies why an L3 cache is such a good idea.

> For a completely unknown memory access (AKA Random access), the relevant latency is the time to close any open row, plus the time to open the desired row, followed by the CAS latency to read data from it.

(It's actually worse than that for DDR5.)

https://en.m.wikipedia.org/wiki/CAS_latency

https://en.m.wikipedia.org/wiki/Memory_timings

https://www.anandtech.com/show/3851/everything-you-always-wa...

Then you've got some small amount of time going to and from the controller, which might also be doing address translation, and maybe some access reordering to avoid switching rows. I think 30ns is very optimistic.
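To put numbers on the row-state dependence (a sketch assuming typical DDR4-3200 CL16-16-16 timings, not any specific kit):

  clock_ghz = 1.6                 # DDR4-3200 -> 1600 MHz I/O clock
  tRP, tRCD, tCL = 16, 16, 16     # precharge, activate, CAS (in cycles)

  page_hit  = tCL / clock_ghz                  # row already open: 10 ns
  page_miss = (tRP + tRCD + tCL) / clock_ghz   # close + open + read: 30 ns
  print(page_hit, page_miss)

That's the 1-3x spread: the sticker quotes the 10 ns case, while a truly random access sees the 30 ns case.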


To read a single cache line from DDR4 (it's basically the same for DDR5, but I'm less familiar with it), the memory controller needs to:

  1. send ACT
  2. wait tRCD(RD)
  3. send READ
  4. wait tCL
  5. read the burst from the DQ
The original 10ns number was only taking step 4 into account. tRCD(RD) is just as long, if not longer, and then the burst takes a couple more ns.
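Putting example numbers on those steps (a sketch, assuming DDR4-3200 with tRCD(RD) = tCL = 16):

  clock_ghz = 1.6               # DDR4-3200 -> 1600 MHz I/O clock
  tRCD, tCL = 16, 16            # cycles
  burst_beats = 8               # DDR4 BL8

  act_to_read  = tRCD / clock_ghz             # step 2: 10 ns
  read_to_data = tCL / clock_ghz              # step 4: 10 ns
  burst = burst_beats / (2 * clock_ghz)       # step 5: 8 beats at 3200 MT/s = 2.5 ns
  print(act_to_read + read_to_data + burst)   # ~22.5 ns inside the DRAM alone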


As others have said, there is nothing low latency about HBM.

Renesas did have a special Low Latency HBM thing at one point, but I don't think it ever saw the light of day.


> Some gaming memory kits can do 10ns or less latency

Without a thorough analysis by real engineers my interpretation of this statement is "DRAM marketers can print anything they want on the sticker".


I don't think they make HBM RAM kits. /s



