*What about DIMMs with Error Correction Codes (ECC)? Previous work on DDR3 showe...

wolpoli · 2024-03-25T22:18:32 1711405112

> The attacker will need to cause dozens of machine halts in order to achieve even a single exploitable bitflip. Dozens of machine halts is not something that goes undetected.

Is there a process for the operations team managing the system to figure out that it was an attack and not just flaky hardware?

adrian_b · 2024-03-26T06:28:10 1711434490

Memory bit flips are very rare.

Normally a memory error does not happen more than a few times per year, unless you have a huge amount of memory.

Therefore when 2 memory errors, correctable or uncorrectable, happen on the same day, that should be enough to trigger an immediate report to the user or administrator of the computer: either there is an ongoing RowHammer attack that must be stopped, or one of the memory modules is approaching its end-of-life due to aging and must be replaced before it begins to have very frequent memory errors.

At least on server computers it should be easy to configure their logging system so that a second memory error per day, even if it was correctable, should immediately send an e-mail message and/or an SMS to the administrator.
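A minimal sketch of that alerting rule in Python, assuming the error timestamps have already been collected from whatever log source is in use. The function name, the 24-hour window, and the threshold of 2 are illustrative choices taken from the rule of thumb above, not part of any existing tool:

```python
from datetime import datetime, timedelta

def should_alert(error_times, now, window=timedelta(days=1), threshold=2):
    """Return True when at least `threshold` memory errors fall inside
    the `window` that ends at `now`.

    `error_times` is a list of datetime objects, one per logged memory
    error (correctable or uncorrectable). Hypothetical helper.
    """
    # Keep only errors that are not in the future and not older than the window.
    recent = [t for t in error_times if timedelta(0) <= now - t <= window]
    return len(recent) >= threshold
```

The actual e-mail or SMS delivery is left out on purpose; it depends on the site's own monitoring setup.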

wolpoli · 2024-03-26T08:22:03 1711441323

If that's the case, then I guess they would take the physical server offline. And if other machines started showing similar signs of failure, then they would analyze the logs for a possible RowHammer attack?

crotchfire · 2024-03-26T00:34:43 1711413283

Sure: you replace the hardware with brand new hardware and it keeps happening. Then you know it's not the hardware.

pixl97 · 2024-03-26T00:39:18 1711413558

The same workload starts crashing after migrating to multiple machines?

justinclift · 2024-03-25T23:58:46 1711411126

Sounds like a process thing that would need to be developed by each team. So probably a mix of results there.

p1necone · 2024-03-25T20:55:19 1711400119

> The attacker will need to cause dozens of machine halts in order to achieve even a single exploitable bitflip. Dozens of machine halts is not something that goes undetected.

That holds if you're targeting a specific machine. If you're throwing the exploit at a few thousand machines shotgun style, then you're still going to get your botnet - it'll just be smaller.

crotchfire · 2024-03-25T20:58:04 1711400284

Can you point to any botnets which were built using rowhammer attacks?

Rowhammer and speculative execution attacks are incredibly labor-intensive and target-specific. They are targeted attacks for high-value targets.

vlovich123 · 2024-03-25T20:56:46 1711400206

I think the point is that people with thousands of machines are probably going to notice if a meaningful chunk of them start halting.

SAI_Peregrinus · 2024-03-25T21:51:01 1711403461

Yep, and desktop users will certainly notice. Only AMD has desktop (not workstation) ECC support.

riedel · 2024-03-25T22:04:18 1711404258

If you are running Windows 10, random halts and the CPU getting hot won't seem suspicious.

p1necone · 2024-03-25T23:20:17 1711408817

Why do you need to target one person who has thousands of machines? What if I just want to pwn whatever random machines visit my dodgy website? Dismissing an exploit just because it only works some fraction of the time seems overly optimistic to me.

jquery · 2024-03-25T23:19:30 1711408770

Thanks for this. One reason I bought ECC for my home desktop was specifically for protection against Rowhammer (Zen2 TR platform), and that line made my heart race a bit. Very misleading.

transpute · 2024-03-25T20:34:32 1711398872

Any recommendations for client devices with ECC memory?

wtallis · 2024-03-25T21:22:08 1711401728

If it has ECC memory, it's going to be branded as a workstation or server or industrial device, not marketed as a consumer device.

Among consumer products, some AMD desktop CPUs and motherboards support ECC memory, and that's about it.

justinclift · 2024-03-26T00:02:55 1711411375

For desktops, ASRock motherboards seem to be the common choice for people wanting ECC memory.

It's specifically mentioned on the ASRock motherboard pages under "Specifications". Some random examples:

• https://www.asrock.com/mb/AMD/B650%20Pro%20RS/index.asp#Spec...

• https://pg.asrock.com/mb/AMD/B650%20PG%20Lightning%20WiFi/in...

• https://www.asrock.com/mb/AMD/X670E%20Taichi/index.asp#Speci...

These all have:

    Supports DDR5 ECC/non-ECC, un-buffered memory up to 7200+(OC)

jeffbee · 2024-03-26T00:50:37 1711414237

I think it's worth investigating the level of "support" these boards offer for ECC. The ASRock Taichi for example does not have any ECC DIMMs in its "qualified" list.

justinclift · 2024-03-26T01:23:10 1711416190

Interesting. Might be good for someone (not me!) to investigate then write in-depth info about. :)

As a data point, I'm using a previous generation ASRock AM4 motherboard with ECC and that definitely works.

I'm undervolting my cpu and ram, and very occasionally (every 6 months or so?) one of those seems to be generating a correctable ECC error that gets propagated to warning messages on my terminal. Haven't bothered investigating any further though. ;)

adrian_b · 2024-03-26T16:36:01 1711470961

The laptops with ECC memory are expensive and they are available for now only with Intel CPUs (while it should be possible to use mobile AMD CPUs I have never seen any such product). They are sold as "mobile workstations" by Dell, Lenovo and HP. I have a Dell Precision mobile workstation laptop with ECC memory bought in 2016 and it still works fine. However I had to pay for it EUR 3000 in 2016 and now something similar would be even more expensive (it had an NVIDIA Quadro GPU and 32 GB of ECC memory).

For desktops it is much easier to choose ECC memory, because the additional cost (the cost of the memory modules is 50% higher for DDR5-4800) remains a small fraction of the cost of an entire computer.

What is needed is to buy a motherboard with ECC support.

An example of a good motherboard with ECC support is ASUS PRIME X670E-PRO WIFI (for AMD Ryzen). I have been using a similar ASUS motherboard with ECC memory from the previous X570 generation for the last 5 years and it still works fine.

There are several other such MBs, mainly at ASUS and ASRock.

For Intel Raptor Lake there are fewer and more expensive such motherboards, but they can be found at ASUS (Pro WS W680M-ACE SE) and at Supermicro, as "workstation motherboards".

reliabilityguy · 2024-03-25T20:35:51 1711398951

crotchfire · 2024-03-25T20:39:56 1711399196

It will detect (by crashing) enough to make exploitation impractical. That is the key point.

reliabilityguy · 2024-03-25T20:47:02 1711399622

I would say that 60% success per trial is a good chance.

exmadscientist · 2024-03-25T20:54:38 1711400078

In the process of generating one triple flip, many, many, many, many, many single and double flips will occur and will be caught. That is why ECC is still an effective defense. Attackers don't just get to go straight to their end game.
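A toy illustration of why single and double flips get caught while a triple flip can slip through, written as a sketch in Python. It uses an extended Hamming(8,4) SECDED code, which is an assumption made for brevity: real ECC DIMMs use much wider codes, and this is not the scheme analysed in the ECCploit paper. All function names are hypothetical.

```python
def encode(data4):
    """Encode 4 data bits into an 8-bit SECDED codeword (toy code)."""
    code = [0] * 8
    code[3], code[5], code[6], code[7] = data4
    code[1] = code[3] ^ code[5] ^ code[7]   # parity over positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]   # parity over positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]   # parity over positions 4,5,6,7
    code[0] = sum(code[1:]) % 2             # overall parity bit (even parity)
    return code

def flip(code, positions):
    """Return a copy of the codeword with the given bit positions flipped."""
    out = code[:]
    for p in positions:
        out[p] ^= 1
    return out

def decode(code):
    """Return (data bits, status) where status is ok/corrected/uncorrectable."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                 # XOR of the positions of set bits
    odd_parity = sum(code) % 2 == 1
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        # Odd number of flips: the decoder assumes exactly one and repairs it.
        code = flip(code, [syndrome])       # syndrome 0 means the parity bit itself
        status = "corrected"
    else:
        status = "uncorrectable"            # even parity, nonzero syndrome: 2 flips
    return [code[3], code[5], code[6], code[7]], status

word = encode([1, 0, 1, 1])
print(decode(flip(word, [5])))        # one flip: repaired and reported
print(decode(flip(word, [3, 6])))     # two flips: detected, not repairable
print(decode(flip(word, [1, 2, 4])))  # three flips: "corrected" into wrong data
```

In this toy code the triple flip is reported as an ordinary corrected error while the returned data is wrong, which is the silent case an attacker is after. Every one- and two-bit flip on the way there produces a corrected or uncorrectable event that can be logged.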

YetAnotherNick · 2024-03-25T21:23:12 1711401792

You can cause any number of single and double flips without worry. It's not a defence, as the attacker can retry till ECC labels it as uncorrectable. AFAIK there is no cost in retrying.

exmadscientist · 2024-03-26T00:11:32 1711411892

That's true, but none of it is silent. Corrected errors get reported and it will be obvious that something is going wrong to anyone who's paying attention.

YetAnotherNick · 2024-03-26T00:48:45 1711414125

Reported where? There is no reporting in Ryzen CPUs.

theevilsharpie · 2024-03-26T05:27:58 1711430878

Ryzen CPUs report ECC errors like any other modern CPU -- they raise a Machine Check Exception which the operating system is expected to handle. Linux and Windows will both handle and log any ECC errors that the CPU raises. Presumably the various BSDs do as well.

YetAnotherNick · 2024-03-27T20:26:36 1711571196

[1] says AM4 doesn't support reporting:

> While the X570D4U-2L2T and its predecessors the X470D4U series supports ECC memory, there is a bit of a gotcha. As readers noted in the original X470D4U reviews, while ECC memory is supported and performing error correction, the reporting of that error correction was not functioning. In other words, even if you were experiencing continuous memory errors, no log of those errors was being recorded in the IPMI event log where one might expect them to show up. A user over on the Level One Techs forums had a conversation thread with someone from ASRock Rack, who reported that while the AM4 platform had ECC support, it did not have error reporting support.

[2] says:

> However we got AMD official respond today. AM4 does not support ECC error reporting function

[1]: https://www.servethehome.com/asrock-rack-x570d4u-2l2t-review...

[2]: https://forum.level1techs.com/t/asrock-rack-x470d4u2-2t/1475...

theevilsharpie · 2024-03-29T05:45:12 1711691112

This is referring to ECC as reported from an out-of-band management controller on those specific motherboards. This needs either some platform-level integration into the memory controller (which is unlikely to be present on consumer Ryzen), or a driver/agent running on the host OS that captures ECC events and sends them to the out-of-band management controller.

Standard ECC events are handled by the OS, and don't depend on (or otherwise interact with) an out-of-band management controller or any other external device. This works fine on Ryzen.
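For anyone wanting to check this on Linux, here is a small sketch in Python that reads the per-controller counters the EDAC subsystem exposes in sysfs (`ce_count` for corrected errors, `ue_count` for uncorrected ones). The function name is hypothetical, and the sketch assumes an EDAC driver is loaded for the memory controller; if it is not, the directory is simply absent and the result is empty.

```python
from pathlib import Path

def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from EDAC sysfs.

    Returns an empty dict when the directory does not exist, e.g. when
    no EDAC driver is loaded or the system is not Linux.
    """
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[mc.name] = (ce, ue)
    return counts

if __name__ == "__main__":
    for name, (ce, ue) in read_edac_counts().items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

Counters that stay at zero do not by themselves prove that reporting works; they only show that nothing has been reported so far.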

namibj · 2024-03-26T01:48:15 1711417695

Got a reference? Because my Zen3 desktop has the driver loaded and information shown, just not the bitflips but that may be due to excessively early refresh configuration.

adrian_b · 2024-03-26T16:56:26 1711472186

Normally you should not see any bit flips, because they happen at intervals of several months or even less frequently, depending on location.

Only for some old modules, e.g. 5 years old or older, can the frequency of errors increase a lot, even up to many bit flips per day, which means that the offending module must be replaced.

This feature of identifying the aged modules is one of the main benefits of ECC.

I have not looked again at the AMD EDAC driver, which has been updated during the last year, but previously, a couple of years ago, its feature of injecting errors for testing was broken on Ryzen (because it had not been updated since Bulldozer, at that time), so the only method to verify that error reporting was working was to overclock the memory in the BIOS settings, to ensure that errors would be generated. Obviously, for the test one should boot from read-only media, to avoid corrupting the storage in the case of excessive errors.

namibj · 2024-04-08T08:21:14 1712564474

Overall I've looked at about 100 GB*years of EDAC counters on this machine, and there was never a single error.

If I knew how, I'd dial down the voltage very slowly while running a rowhammer PoC to either catch it hammering or catch edac counts.

reliabilityguy · 2024-03-25T21:01:57 1711400517

exmadscientist · 2024-03-25T21:18:10 1711401490

The ECCploit paper has extensive discussion of all the ways their work is detected, and how they even use detection to probe the correction structure. This is not a silent attack. This is a proof that ECC is a penetrable defense. Which we all know! The question is how difficult it is and how stealthily it can be done.

But regardless, ECC still sounds the alarm when it's being attacked. If no one listens, there's not much ECC can do about that.

rightbyte · 2024-03-25T21:16:49 1711401409

That's true for encryption too.