JEDEC Extends DDR5 Memory Spec to 8800 MT/S, Adds Anti-Rowhammer Features

transpute · 2024-04-22T21:13:10 1713820390

https://stefan.t8k2.com/rh/PRAC/index.html

  Chapter 16: "DDR5 Per Row Activation Counting (PRAC)". PRAC introduces two key mechanisms for comprehensive Rowhammer defenses: an Activation Counter for every DRAM row and a mechanism that triggers when an Activation Counter reaches a specific threshold. This allows the DRAM to pause the memory controller from issuing new commands, giving it time to refresh potential victim rows. In the words of a DRAM industry veteran who will remain nameless, PRAC is the biggest change to DRAM in decades. Thus, I thought I should write up a brief article summarizing the change and its potential to solve Rowhammer once and for all.

nneonneo · 2024-04-23T03:48:37 1713844117

I'll bet good money that this gets turned into a timing attack soon - it's a prototypical address-dependent delay.

oakwhiz · 2024-04-23T04:54:06 1713848046

I kind of feel like JEDEC isn't taking the problem seriously.

chuckadams · 2024-04-23T17:26:17 1713893177

Aren’t there already timing attacks based on cache latency?

teaearlgraycold · 2024-04-23T03:48:47 1713844127

Oh my god! Rate limiting for RAM. The realities of security sure are harsh.

londons_explore · 2024-04-22T17:01:24 1713805284

I'd like to see the spec tackle latency with a "send then confirm" approach.

Ie. The RAM can reply to a read request with data, then a couple of clock cycles later it can confirm (via a flag) that the data it originally sent was correct.

This is helpful because it means the timing can be tightened to the typical access times, rather than the worst case access time (eg. the slowest preamp on the highest capacitance memory row/column).

Things like CPU's already have provisions for handling not-yet-confirmed information, and can roll back state if delivered info turns out to be wrong.

Yes, it adds complexity to the whole system, but it seems worth it for a -30% change to memory latency.

pshirshov · 2024-04-22T17:55:05 1713808505

And potentially opens a whole new family of side channels.

touisteur · 2024-04-22T18:58:25 1713812305

I wish we could mix 'I don't care about side channels, use them all' with 'I'm paranoid about side channels, plug them all' on the same machine. Disable speculative execution on one core, no frequency adjustment, no prefetching, sr-io/pcie-bypass some devices... E-cores but for the side-channel-paranoid (in a good way).

smallmancontrov · 2024-04-22T19:36:11 1713814571

Bring back EIEIO, like on Old Macs, but perhaps with a slightly expanded definition of what constitutes I/O:

    Enforce In-order Execution of I/O (EIEIO) is an assembly language
    instruction used on the PowerPC central processing unit (CPU) which
    prevents one memory or input/output (I/O) operation from starting until
    the previous memory or I/O operation completed. This instruction is needed ]
    as I/O controllers on the system bus require that accesses follow a
    particular order, while the CPU reorders accesses to optimize memory 
    bandwidth usage.

teaearlgraycold · 2024-04-23T03:50:15 1713844215

> EIEIO, like on Old Macs

This is what we should reserve Nobel prizes in computer science for

antod · 2024-04-23T04:13:55 1713845635

FARM is begging for a good backronym

colejohnson66 · 2024-04-22T20:57:56 1713819476

You mean memory fences? The big architecture (x86, ARM, RISC-V) all contain instructions for them.

touisteur · 2024-04-22T21:26:25 1713821185

I mean permanently disable all speculative execution on a specific core and reduce/disable all side-channels of the kind. If you're saying I can do through injection of fence instructions between every instruction, coupled with isolcpus... I might have a fun weekend coming playing with Intel Pin. But I'm guessing the performance hit might be worse than 'just' disabling speculative execution on a core - if it was possible at all - or that the fence instructions might not be enough there? Haven't thought it through.

But it would be a fun question to ask the likes of Daniel Gruss...

smallmancontrov · 2024-04-23T01:32:32 1713835952

Yep. Memory fences are a sniper rifle, EIEIO+ is a shotgun, and side channel attacks are a knife fight.

derefr · 2024-04-22T23:04:16 1713827056

Host cores and guest cores.

touisteur · 2024-04-23T06:03:48 1713852228

Something like this, yes.

bee_rider · 2024-04-22T19:35:50 1713814550

I mean nobody really believes there aren’t already countless side channels in existing hardware, right? No reason to give up performance for nothing.

thfuran · 2024-04-22T20:11:26 1713816686

Things are bad, so make no attempt to better or even avoid worsening them?

faeriechangling · 2024-04-22T23:31:42 1713828702

Yes! Having two architectures, one meant to securely run in a “zero trust” environment, and one meant to run at max speed while assuming inputs and code can be trusted (or will never have the opportunity to be executed such as in an airgapped environment) is reasonable. You can even combine the two and we do in practice, as seen with hardware security modules. At a grocery store you will see a lower security cash register with many functions and features connected to a higher security card reader that does a very small number of things.

An essential part of security is scoping. The door to the safe is higher security than the door to the bank. Speed & convenience & cost are paramount at the entrance to the bank, and security is paramount when it comes to securing the cash at the bank. We don’t act as though high security is always warranted when it comes to physical security so why would it be always be warranted when it comes to computer security? Sacrificing speed and convenience is willfully inflicting a denial of service on yourself, it’s only worth it if it’s less bad than the probable alternative.

Every personal computer sold has massive security flaws with only the most severe issues getting papered over and yet most people don’t have issues because the world isn’t actually all that hostile.

bee_rider · 2024-04-22T21:00:46 1713819646

Anything can be made to sound wrong or right if you get abstract and vague enough.

We shouldn’t sacrifice something for nothing.

thfuran · 2024-04-22T21:22:34 1713820954

A considerable amount of effort goes into mitigating side channels precisely because it isn't for nothing.

frutiger · 2024-04-22T20:42:52 1713818572

The front door is already open. Let’s open the bedroom window if we want more fresh air there.

gosub100 · 2024-04-22T22:25:01 1713824701

How about "One size doesn't fit all." ?

mungoman2 · 2024-04-22T18:31:45 1713810705

How many cycles could this actually save? I would assume the latency to actually get data from DDR is only a small part of the whole round-trip in a L1 miss. Actual savings much smaller than 30%.

foota · 2024-04-22T19:37:18 1713814638

Most of the cost of an L3 miss comes after the miss itself, for most architectures I've seen.

E.g., on Skylake an L3 hit is 80 cycles (~20ns) whereas a RAM access is 80 cycles plus 50 nanos (~70 nanos). See https://www.7-cpu.com/cpu/Skylake_X.html

luyu_wu · 2024-04-23T02:30:49 1713839449

From my rough knowledge of textbooks RAM access is usually in the hundreds of cycles. The napkin math makes more sense in that order of magnitude too! In any case it seems unlikely L3 has similar latency considering memory heirachy!

Interesting discussion though!

theandrewbailey · 2024-04-22T16:47:31 1713804451

> Unfortunately, the laws of physics driving DRAM cells have not improved much over the last couple of years (or decades, for that matter), so memory chips still must operate with similar absolute latencies, driving up the relative CAS latency. In this case 14ns remains the gold standard, with CAS latencies at the new speeds being set to hold absolute latencies around that mark.

Some gaming memory kits can do 10ns or less latency. Though I guess if memory latency is your bottleneck, you should look at HBM.

spintin · 2024-04-22T17:09:24 1713805764

HBM is slower than DDR per pin, the speed gain is from a hugely parallel bus.

Parallel means latency if you have non "embarrassingly parallelizable" tasks?

Tuna-Fish · 2024-04-22T17:38:48 1713807528

The smallest transfer done from memory is a single cache line, which on most desktop machines is 64 bytes, or 512 bits. You could imagine a memory bus that was 512 bits wide and transferred a cache line per clock, and this would improve latency when compared to a serial bus with higher clock speed. HBM doesn't do that, though, instead every HBM3 module has 16 individual 64-bit channels, with 8n prefetch (that is, when you send a single request to a single channel, it will respond with 512 bits over 8 cycles).

dist-epoch · 2024-04-22T18:11:26 1713809486

DDR5 has 2 independent 32-bit lanes. Multiple transfers are required for 64 bytes.

Tuna-Fish · 2024-04-22T18:28:47 1713810527

DDR5 has a 16n prefetch, so a single transfer from a 32-wide channel moves 64 bytes.

NavinF · 2024-04-22T21:05:19 1713819919

> Some gaming memory kits can do 10ns or less latency

Source? My overclocked desktop RAM shows 45ns in benchmarks. I call bullshit on 4.5x faster RAM. Most people fight for an extra 5% latency reduction

99094 · 2024-04-23T23:36:36 1713915396

That's CAS latency. To calculate the latency of a timing you divide the timing itself by the clock frequency of your sticks. For example, DDR4-4000 CL14 is running at 2000MHz = 2GHz, so the CAS latency is 14/2 = 7ns.

But it's just a singular timing that's not even used all that often, so it's not that relevant to performance anyway - https://www.youtube.com/watch?v=pgb8N23tsfA

wmf · 2024-04-22T21:59:35 1713823175

That's probably 10 ns for the DRAM and 35 ns for the caches and memory controller.

gautamcgoel · 2024-04-22T23:58:44 1713830324

Just to make sure I understand: you're saying that checking L1/L2/L3 takes around 35ns, and then the CPU accesses DRAM which takes 10ns? If that's so, how is L3 cache any faster than DRAM? Also, can you explain why the memory controller adds some latency?

wmf · 2024-04-23T01:34:46 1713836086

An L3 hit only takes ~15 ns so that means another 15-20 ns is spent traversing the fabric and memory controller. I'm not sure what all is involved there but for Intel it has to go around the ring and for AMD it has to cross chiplets.

gautamcgoel · 2024-04-23T17:58:37 1713895117

Interesting. If an L3 hit takes 15 ns, then based on your argument a hypothetical CPU with only one core (and hence no fabric) would be better off without L3, since a DRAM read can be performed in just 10 ns.

nsteel · 2024-04-23T18:40:17 1713897617

You still need a memory controller, you still need to get to that controller on the edge of the die. And going to RAM more often will surely consume more power.

wmf · 2024-04-23T19:30:51 1713900651

No, the 10 ns is just the time inside the DRAM. Reading from DRAM would take 20-30 ns even in a very simple chip.

gautamcgoel · 2024-04-23T20:30:03 1713904203

This is the part I don't understand. You're saying that the interval from when the DRAM first receives a read request to when it sends the data back over the channel is about 10ns, at least in fancy gaming RAM. Ok, fine. Where is the other 10-20 ns of latency coming from? Why can't the CPU begin using the data as soon as it arrives? I guess some time is needed to move the data from the memory controller to the actual CPU core. But it seems to me (far from an expert) that this shouldn't take a full 10-20 ns. Or am I mistaken?

nsteel · 2024-04-23T22:47:51 1713912471

Firstly, to clarify, there's nothing very special about 'gaming ram' other than the particular chunk of silicon performs better than others so they stuck a shiny sticker and an oversized heatsink on.

The problem here is the latency is state dependent and who knows what people are talking about here. The memory itself can have a latency 1-3x the CAS Latency number and you need to understand how DRAM is accessed to appreciate why. Which will also clarify why an L3 cache is such a good idea.

> For a completely unknown memory access (AKA Random access), the relevant latency is the time to close any open row, plus the time to open the desired row, followed by the CAS latency to read data from it.

(It's actually worse than than for DDR5.)

https://en.m.wikipedia.org/wiki/CAS_latency

https://en.m.wikipedia.org/wiki/Memory_timings

https://www.anandtech.com/show/3851/everything-you-always-wa...

Then you've got some small time going to and from the controller, which might also be doing some address translation, maybe some access reordering to avoid switching rows. I think 30ns is very optimistic.

busup · 2024-04-24T12:31:50 1713961910

To read a single cache line from DDR4 (basically the same for DDR5 but I'm less familiar) the memory controller needs to:

  1. send ACT
  2. wait tRCD(RD)
  3. send READ
  4. wait tCL
  5. read the burst from the DQ

The original 10ns number was only taking step 4 into account. tRCDRD is just as long if not longer. Then the burst takes a couple more ns.

nsteel · 2024-04-22T18:16:00 1713809760

As others have said, there is nothing low latency about HBM.

Renesas did have a special Low Latency HBM thing at one point, but I don't think it ever saw the light of day.

jeffbee · 2024-04-22T20:03:46 1713816226

> Some gaming memory kits can do 10ns or less latency

Without a thorough analysis by real engineers my interpretation of this statement is "DRAM marketers can print anything they want on the sticker".

moffkalast · 2024-04-22T17:26:21 1713806781

I don't think they make HBM RAM kits. /s

oneplane · 2024-04-22T20:32:01 1713817921

The article doesn't mention much about chip-to-controller distance or path length, presumably this suffers from the same issues we currently see where low power devices (and in some desktop configurations as well) can't really ever get those speeds unless the DRAM chips are near or on top of the CPU substrate.

It's nearly impossible to do those numbers in modern mobile form factors, even CAMM is having a hard time getting there with modularised memory.

Aurornis · 2024-04-22T21:16:35 1713820595

Generally the highest speeds aren’t intended for low power devices. They’re targeted at applications where performance is the most important goal and the power tradeoffs are not an issue.

Enthusiast motherboards and RAM kits can already exceed these speeds. Having official JEDEC timings just makes these speed a more universal target for long-term high end designs.

oneplane · 2024-04-22T21:29:43 1713821383

While that's true, It's also true that mobile devices tend to be a rather static configuration during their lifetime, and if you're going to have a fleet of those, having the best performance during that lifetime is a nice bonus. So I believe that form factor specific considerations are still a good value to write about.

AzzyHN · 2024-04-22T17:29:54 1713806994

I'm always a fan of bigger numbers, but I wish more time/money/whatever was put into letting DDR5 run at those XMP/EXPO speeds when using 4 DIMMs.

Aurornis · 2024-04-22T21:19:59 1713820799

4 DIMMs on a consumer board means 2 DIMMs per channel. This is inherently a compromise in signal integrity that must come with a speed tradeoff, unfortunately. We’re dealing with laws of physics.

In the past some motherboards tried a T-topology for RAM slots to optimize for 2 DIMMs per channel, but this would cause problems with 1 DIMM per channel usage. Not worth it for the average consumer.

smolder · 2024-04-22T21:20:42 1713820842

The way to do that would be with a chip & socket that has 4 independent memory channels (threadripper I think has eight now, but maybe used to have four?) and a many-layered motherboard that optimizes the routing and placement of each dimm slot, ideally with only 1 dimm slot per channel for maximum speed. The high end stuff with many memory channels generally isn't designed for pushing RAM clocks to gaming desktop speeds, though. You'd probably need to skip on ECC too or overclock and manually time some.

imtringued · 2024-04-22T18:17:41 1713809861

How do you expect that to happen? By sharing memory channels you are no longer using a point to point connection and are now prone to reflections in the PCB traces where you have split the signal. There is no "money" that can be put into this, that won't also improve the performance of the single DIMM per channel setup disproportionately. I don't even understand what your point is. Quad channel support would be a much better idea since it doubles your memory bandwidth, while remaining a point to point connection, but you're going to complain that you can't add eight DIMMs then.

zrm · 2024-04-22T20:14:40 1713816880

What if you give each slot independent command pins but not a complete memory channel?

What if you use a similar technology to registered or load-reduced memory, but put the register on the system board instead of the DIMM so it's in front of multiple DIMMs that then share the channel into the processor but not the traces on the system board? This may also allow higher capacity DIMMs in consumer systems.

atlas_hugged · 2024-04-22T22:59:22 1713826762

You doing ok dude?

Night_Thastus · 2024-04-22T16:48:09 1713804489

I'm a bit confused, DDR5 products are already out - as are CPUs and motherboards that support them.

How can this change happen retroactively? Would motherboard manufacturers just need to update the BIOS to enable new XMP configurations? (For when this new, higher transfer rate RAM becomes available)

imtringued · 2024-04-22T18:11:12 1713809472

It doesn't. If you buy A DDR5-6400 DIMM it doesn't get updated to 8800. It will stay at 6400. This just means that manufacturers will be able to brand their tested DDR5 DIMMs as supporting 8800. You still need a CPU and Mainboard that has been validated at those speeds. You're going to need a 8700G if you actually want to hit those speeds by the way.

tadfisher · 2024-04-22T17:22:42 1713806562

Not even that; this just sets standard speed/latency values for memory modules without XMP. You can already exceed these numbers with XMP.

PRAC would need handling in the memory controller, so that would require a CPU update if I understand correctly.

doikor · 2024-04-22T21:29:50 1713821390

PRAC should happen automatically in the background when possible and when it really needs to stop the controller from accessing something while waiting for the bits to refresh it uses the already existing ALERTn signal.

https://stefan.t8k2.com/rh/PRAC/index.html

> Panopticon retrofits an existing signal in the DDR specification, called ALERTn, to effectively “trick” the memory controller to pause issuing new DDR commands. DRAM uses ALERTn to signal errors to the memory controller. Upon receiving this signal, the memory controller stops issuing new DRAM commands and instead re-issues the old memory access. By making use of ALERTn, Panopticon requires no modifications to any hardware other than DRAM itself.

(As I understand PRAC uses the same design as Panopticon for this part)

dist-epoch · 2024-04-22T18:09:19 1713809359

The spec is just a bunch of numbers which are already configurable and motherboards can already be set at much higher frequencies.

It doesn't mean that any particular combination of CPU/motherboard/RAM will work.

braiamp · 2024-04-22T17:21:18 1713806478

> while leaving the spec open to further expansions with faster memory as technology progressed

They only set the current standard, but allowed to, if technology progresses, that other speeds/timings would also be jedec compatible, rather than being some kind of XMP. Motherboard manufacturers do not need to upgrade their previous models if the hardware doesn't meet the required SN ratios, or whatever. But they _could_ if they believe they had the hardware to support it.

snvzz · 2024-04-22T20:43:26 1713818606

Please make ECC mandatory.

wmf · 2024-04-22T21:54:54 1713822894

It's never going to happen because Dell counts every penny.

snvzz · 2024-04-22T22:51:22 1713826282

JEDEC could only standardize ECC modules.

Microsoft, Intel or AMD can, anytime, require ECC for their certification/logo programs.

Intel and AMD could even make their new chips only boot with ECC.

And FCC could make ECC a requirement for certification.

All these parties (and more) are enabling non-ECC memory, to the detriment of mankind.

_factor · 2024-04-23T06:32:36 1713853956

I prefer my computations have unreported errors at the whim of explosions billions of light years away. Makes me feel connected to the universe.