Coauthor of the original Rowhammer exploit here. ECC remains a highly effective method for turning this from a security issue to a reliability issue, mostly. As an individual owner of a server, if that server has ECC and you expect to notice machine halts due to uncorrectable ECC errors, the security implications for you are modest.
Now, if you are a cloud provider that provides VMs on multitenant hosts, your threat model may be different.
Either way, avoid machines without ECC. TRR was a lame duck even when Rowhammer was still fresh, and bits flipping in DRAM will not go away unless the economics of DRAM manufacturing change (i.e., they won't).
I would use ECC memory if I could. I used to use a TR 2920X with ECC, but now I'm on a Ryzen 7950X with non-ECC. Unbuffered ECC is the only kind Ryzen supports, and it's slower, more expensive, or both compared to equivalent-capacity non-ECC memory. The latest Threadripper lineup supports Registered ECC, but Threadripper is overkill (cost, threads, PCIe lanes) for home users like myself.
Keep in mind that for my 5950X I had to buy DIMMs based on Micron Rev E 16Gbit x8 dies, rated 3200CL22, running 3600CL18.
I.e., they just don't ship with XMP presets.
The advantage of buying RAM with XMP presets is that the reseller who created the preset has tested the sticks with that overclock and binned the original chips accordingly. When you buy RAM that is only rated for the default speed (as ECC server RAM is), you have no guarantee that all sticks will overclock the same amount, so in the worst case one stick will bring all the other sticks down to its level.
I didn't want to have to deal with the non-uniform CCDs. Of course the two on a 7950X aren't uniform either due to the silicon lottery (e.g. on mine the first CCD clocks 100MHz higher than the other under all-core load and 200MHz higher under single-core load), but that is a small difference. It would presumably be more pronounced on the 7950X3D, since only one CCD has access to the extra cache. So I would be using it "sub-optimally" if I didn't `taskset` / cgroup everything to run on one CCD or the other.
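(The pinning itself is easy to script; below is a minimal C sketch equivalent to a `taskset` call. The CPU numbers are an assumption: one common Linux layout exposes CCD0 on a 7950X as logical CPUs 0-7 plus their SMT siblings 16-23; check /sys/devices/system/cpu/cpu*/topology/ on your own machine before relying on this.)

```c
/* Minimal sketch: pin the current process to one CCD instead of
   using `taskset`.  Assumes CCD0 = logical CPUs 0-7 and SMT
   siblings 16-23, which is one common Linux layout on a 7950X. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 0; cpu < 8; cpu++) {
        CPU_SET(cpu, &set);        /* first thread of each CCD0 core */
        CPU_SET(cpu + 16, &set);   /* its SMT sibling */
    }
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    /* exec the real workload here, or keep running pinned */
    return 0;
}
```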
I wonder what workload needs more than eight dual-threaded cores, but has trouble if the additional cores have more cache or the RAM isn't factory-overclocked, and doesn't care about data integrity.
> bits flipping in DRAM will not go away unless the economics in DRAM manufacturing change
This seems like all the argument necessary to require all computers - and I do mean all computers - to use ECC memory. The security risk is simply too great, and everything is too integrated not to make this change. Even a "gamer" on a pure gaming computer will have some crucial information on that machine, so I simply do not see how we've gone this far without making it.
>Even a "gamer" on a pure gaming computer will have some crucial information on that machine
Which belongs to the gamer user, so it can be extracted or encrypted by any malware running on that system. No need for esoteric attacks like Rowhammer at that point. Unless you're thinking of, for example, exploiting the user's machine via Rowhammer in JS running in a browser tab, but as far as I know that was never made practical.
The reality is that Rowhammer remains one of the hardest ways to compromise a machine, largely because the software stack most people are running offers so many easier ways in.
I've heard that DDR2 is immune to rowhammer. Is that actually the case, or is it just because nobody's looked at it? Is SRAM the only thing that's truly immune?
AMD has now taken Intel's market-segmentation approach and is disabling ECC on most Ryzen CPUs. Only Pro and Threadripper parts have it guaranteed, plus some desktop Ryzens on some boards.
There are also a few oddball desktop SKUs that are actually a G-series processor with the GPU disabled (primarily ones below the "600" tier, e.g. the Ryzen 3 4100 or Ryzen 5 5500), which also lack ECC support.
What is weird is that for almost two years AMD published specifications saying that all Ryzen 6000 series and Ryzen 7000 series laptop CPUs (Rembrandt and Phoenix) support ECC, then suddenly and silently removed the statement about ECC support from all of those specifications.
For the current Ryzen 8000 laptop series (Hawk Point), ECC support has been listed as absent from the beginning.
"Supports" might mean you can run unbuffered ECC UDIMMs but without ECC? Even Intel can run ECC UDIMMs in non-ECC mode. Also some manufacturers don't distinguish between "on-die ECC" (DDR5) and real ECC.
That is not true. The last time I looked, ECC DDR5 UDIMM modules were priced at most 50% higher.
Nevertheless, that is still excessively high. In the beginning there were only 80-bit DDR5 modules, which could claim a +25% premium (80/64 bits), but now there are 72-bit modules, as in previous generations, which justify at most a +12.5% price increase (72/64 bits).
I looked right when I posted that and found 2x32GB non-ECC 5600MHz for $165, exactly half the $330 price of the sticks listed in the post. I spent several minutes looking for cheaper ECC at the same specs and couldn't find any.
Trying again, I can find some Kingston sticks that are $120 each, so that's about 50% higher. Amazon's search is really bad, by the way. But that's not the "at most" price. And a month ago they were $140 each.
Edit: Micron data sheets suggest that UDIMMs have 2x13 command/address pins and RDIMMs have 2x6, so that's one piece of the puzzle. Apparently UDIMMs can do x64 and x72, and RDIMMs can do x72 and x80.
When DDR5 was first introduced, there were only x8 chips.
Because DDR5 channels must be 32, 36 or 40 bits wide, with x8 chips one had to use 40-bit channels (five chips) even though only 36 bits are needed for data plus ECC, so indeed 4 of the 40 bits, i.e. 10% of the module's capacity, remained unused.
Meanwhile, about a year ago, at least Micron also introduced x4 chips. The market now has both modules made only with x8 chips, which leave part of the memory unused, and modules combining x8 and x4 chips, which waste none of it.
Is that a change? I think what you described applies equally to every generation of Zen processors: Pro-branded chips have ECC capability officially, laptop chips don't have it, and consumer-branded chips have it unofficially with ECC capability optional for motherboards.
For a few generations now, at least since Zen 3, desktop Ryzen CPUs have had official ECC support, not the unofficial support of the first Ryzens.
This can be verified easily on the AMD site by reading the CPU specifications.
For laptop CPUs, there was a window between the beginning of 2022 and the autumn of 2023 when ECC support was specified for all mobile Rembrandt and Phoenix CPUs, but then it was suddenly removed from the specifications of all non-Pro laptop CPUs.
Last I checked, desktop AMD CPUs have working (not disabled, but not "supported") ECC with DDR5 UDIMMs (5V supply, not the 12V of server RDIMMs). Desktop boards depend on the hardware plus the BIOS; initial firmware revisions didn't do ECC, but on many boards from some brands it does work. I haven't checked recently.
At least since Zen 3 (2020), all AMD desktop Ryzen CPUs have had ECC support clearly included in their specifications, so it is official support, not just non-disabled ECC.
This change came around the same time Intel began to support ECC in some Alder Lake desktop CPUs (and in their successors), so it might have been a response to Intel's decision.
So now ECC support depends strictly on the motherboard manufacturer. The best chances of finding motherboards with ECC support are at ASUS and ASRock (including ASRock Rack, which offers server boards for Ryzen CPUs).
All the Ryzen 7000 series desktop processors support ECC, I think. Check each model's specs to be sure.
Asus motherboards for those CPUs also support it, as stated in the BIOS manuals I looked over. It requires changing the BIOS ECC setting from Auto to Enable.
I have done this on one such system, and the appropriate EDAC messages showed up in the Linux boot log.
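If you want to double-check the same thing on your own box, here is a minimal sketch of reading the Linux EDAC counters (the sysfs paths are standard EDAC, but I'm assuming the driver is loaded and the controller shows up as mc0; the index may differ on your system):

```c
/* Minimal check of the Linux EDAC error counters for mc0. */
#include <stdio.h>

static long read_long(const char *path) {
    FILE *f = fopen(path, "r");
    long v = -1;
    if (f) { if (fscanf(f, "%ld", &v) != 1) v = -1; fclose(f); }
    return v;
}

int main(void) {
    long ce = read_long("/sys/devices/system/edac/mc/mc0/ce_count");
    long ue = read_long("/sys/devices/system/edac/mc/mc0/ue_count");
    if (ce < 0 || ue < 0) {
        fprintf(stderr, "no EDAC counters found; is the driver loaded?\n");
        return 1;
    }
    printf("corrected: %ld, uncorrected: %ld\n", ce, ue);
    return 0;
}
```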
I have a few questions I haven't found satisfactory answers for in the existing papers:
* Are modern patrol-read engines guided by memory access patterns, so they can respond to Rowhammer-style attacks?
* How aggressively would a patrol-read engine have to scan DRAM to safely stay ahead of Rowhammer-induced bit flips?
* Would ECC words larger than the traditional 64+8, with multi-bit error correction, change the game and let us build more reliable systems from DRAM with pattern vulnerabilities? (A sketch of the usual 64+8 scheme follows below, for reference.)
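Not an answer, but to make the third question concrete, here is a minimal sketch of the traditional 64+8 SECDED scheme (extended Hamming) referenced above. Real memory controllers use different codes (e.g. Hsiao), so this only illustrates the idea: any single flip is corrected, any double flip is detected, and three or more flips can alias to a valid-looking single error — exactly the silent-corruption case discussed elsewhere in this thread.

```c
/* Sketch of a (72,64) SECDED code: 64 data bits, 7 extended-Hamming
   parity bits at positions 1,2,4,...,64, and one overall parity bit. */
#include <stdint.h>
#include <stdio.h>

#define NBITS 72

static void secded_encode(uint64_t data, uint8_t bits[NBITS]) {
    int d = 0;
    for (int pos = 0; pos < NBITS; pos++) bits[pos] = 0;
    for (int pos = 1; pos < NBITS; pos++)
        if (pos & (pos - 1))                 /* not a power of two */
            bits[pos] = (data >> d++) & 1;   /* place a data bit */
    for (int p = 1; p < NBITS; p <<= 1) {    /* Hamming parities */
        uint8_t parity = 0;
        for (int pos = 1; pos < NBITS; pos++)
            if (pos & p) parity ^= bits[pos];
        bits[p] = parity;
    }
    for (int pos = 1; pos < NBITS; pos++)    /* overall parity */
        bits[0] ^= bits[pos];
}

/* Returns 0 = clean, 1 = corrected single flip, 2 = uncorrectable. */
static int secded_decode(uint8_t bits[NBITS], uint64_t *data) {
    int syndrome = 0;
    uint8_t overall = bits[0];
    for (int pos = 1; pos < NBITS; pos++) {
        overall ^= bits[pos];
        if (bits[pos]) syndrome ^= pos;
    }
    int status = 0;
    if (syndrome && overall) {               /* looks like one flip */
        if (syndrome >= NBITS) return 2;     /* >=3 flips, aliased */
        bits[syndrome] ^= 1;
        status = 1;
    } else if (syndrome && !overall) {
        return 2;                            /* double flip detected */
    } else if (!syndrome && overall) {
        bits[0] ^= 1;                        /* overall bit itself flipped */
        status = 1;
    }
    uint64_t d = 0;
    int i = 0;
    for (int pos = 1; pos < NBITS; pos++)
        if (pos & (pos - 1))
            d |= (uint64_t)bits[pos] << i++;
    *data = d;
    return status;
}

int main(void) {
    uint8_t w[NBITS];
    uint64_t out;
    secded_encode(0xDEADBEEFCAFEF00DULL, w);

    w[20] ^= 1;                              /* one flip is corrected */
    printf("single: status=%d ok=%d\n",
           secded_decode(w, &out), out == 0xDEADBEEFCAFEF00DULL);

    w[20] ^= 1; w[33] ^= 1;                  /* two flips are detected */
    printf("double: status=%d\n", secded_decode(w, &out));
    return 0;
}
```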
I would expect an increased crash rate on multi-tenant hosts to be something the cloud provider detects and investigates. At the same time, targeting a specific tenant would require a lot of luck.
> What about DIMMs with Error Correction Codes (ECC)?
> Previous work on DDR3 showed that ECC cannot provide protection against Rowhammer.
This is incredibly misleading. The paper they cite states:
> When the ECC detection is used correctly 0.65%-7.42% of all bit flips still cause silent corruptions... On setup AMD-1, uncorrectable errors crash the system.
The attacker will need to cause dozens of machine halts in order to achieve even a single exploitable bitflip. Dozens of machine halts is not something that goes undetected.
Kudos for calling out JEDEC's terrible behavior on the rowhammer question, but we should not be downplaying ECC as a near-term solution.
> The attacker will need to cause dozens of machine halts in order to achieve even a single exploitable bitflip. Dozens of machine halts is not something that goes undetected.
Is there a process for the operations team managing the system to figure out that it was an attack and not just flaky hardware?
Normally a memory error does not happen more than a few times per year, unless you have a huge amount of memory.
Therefore, when two memory errors, correctable or not, happen in the same day, that should be enough to trigger an immediate report to the user or administrator: either there is an ongoing Rowhammer attack that must be stopped, or one of the memory modules is approaching end of life due to aging and must be replaced before it begins to produce very frequent errors.
At least on server computers it should be easy to configure the logging system so that a second memory error in a day, even a correctable one, immediately sends an e-mail message and/or an SMS to the administrator.
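For illustration, a rough sketch of that rule (assuming the Linux EDAC driver is loaded and the controller shows up as mc0; a real setup would watch every mc* node and send mail rather than print):

```c
/* Sketch of a "second error within a day" alert, polling EDAC. */
#include <stdio.h>
#include <unistd.h>

static long read_long(const char *path) {
    FILE *f = fopen(path, "r");
    long v = 0;
    if (f) { if (fscanf(f, "%ld", &v) != 1) v = 0; fclose(f); }
    return v;
}

int main(void) {
    const char *ce = "/sys/devices/system/edac/mc/mc0/ce_count";
    const char *ue = "/sys/devices/system/edac/mc/mc0/ue_count";
    long base = read_long(ce) + read_long(ue);
    int elapsed = 0;
    for (;;) {
        sleep(60);
        elapsed += 60;
        long delta = read_long(ce) + read_long(ue) - base;
        if (delta >= 2) {
            printf("ALERT: %ld memory errors within a day -- "
                   "possible Rowhammer attack or a dying DIMM\n", delta);
            base += delta;       /* re-arm after alerting */
            elapsed = 0;
        } else if (elapsed >= 86400) {
            base += delta;       /* roll the daily window forward */
            elapsed = 0;
        }
    }
}
```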
If that's the case, then I guess they would take the physical server offline. And if other machines started showing similar signs of failure, they would analyze the logs for a possible Rowhammer attack?
> The attacker will need to cause dozens of machine halts in order to achieve even a single exploitable bitflip. Dozens of machine halts is not something that goes undetected.
That's if you're targeting a specific machine; if you're throwing the exploit at a few thousand machines shotgun-style, then you're still going to get your botnet - it'll just be smaller.
Why do you need to target one person who has thousands of machines? What if I just want to pwn whatever random machines visit my dodgy website? Dismissing an exploit just because it only works some fraction of the time seems overly optimistic to me.
Thanks for this. One reason I bought ECC for my home desktop was specifically for protection against Rowhammer (Zen2 TR platform), and that line made my heart race a bit. Very misleading.
I think it's worth investigating the level of "support" these boards offer for ECC. The ASRock Taichi for example does not have any ECC DIMMs in its "qualified" list.
Interesting. Might be good for someone (not me!) to investigate then write in-depth info about. :)
As a data point, I'm using a previous generation ASRock AM4 motherboard with ECC and that definitely works.
I'm undervolting my cpu and ram, and very occasionally (every 6 months or so?) one of those seems to be generating a correctable ECC error that gets propagated to warning messages on my terminal. Haven't bothered investigating any further though. ;)
Laptops with ECC memory are expensive and, for now, available only with Intel CPUs (it should be possible to use mobile AMD CPUs, but I have never seen such a product). They are sold as "mobile workstations" by Dell, Lenovo and HP. I have a Dell Precision mobile workstation with ECC memory bought in 2016 and it still works fine. However, I had to pay EUR 3000 for it in 2016, and something similar would be even more expensive now (it had an NVIDIA Quadro GPU and 32 GB of ECC memory).
For desktops it is much easier to choose ECC memory, because the additional cost (the cost of the memory modules is 50% higher for DDR5-4800) remains a small fraction of the cost of an entire computer.
What is needed is to buy a motherboard with ECC support.
An example of a good motherboard with ECC support is ASUS PRIME X670E-PRO WIFI (for AMD Ryzen). I have been using a similar ASUS motherboard with ECC memory from the previous X570 generation for the last 5 years and it still works fine.
There are several other such MBs, mainly at ASUS and ASRock.
For Intel Raptor Lake there are fewer and more expensive such motherboards, but they can be found at ASUS (Pro WS W680M-ACE SE) and at Supermicro, as "workstation motherboards".
In the process of generating one triple flip, many, many, many, many, many single and double flips will occur and will be caught. That is why ECC is still an effective defense. Attackers don't just get to go straight to their end game.
You can cause any number of single and double flips without worry. It's not a defence, as the attacker can retry until ECC can no longer handle the flips. AFAIK there is no cost to retrying.
That's true, but none of it is silent. Corrected errors get reported and it will be obvious that something is going wrong to anyone who's paying attention.
Ryzen CPUs report ECC errors like any other modern CPU: an error raises a Machine Check Exception, which the operating system is expected to handle. Linux and Windows will both handle and log any ECC errors the CPU raises. Presumably the various BSDs do as well.
> While the X570D4U-2L2T and its predecessors the X470D4U series supports ECC memory, there is a bit of a gotcha. As readers noted in the original X470D4U reviews, while ECC memory is supported and performing error correction, the reporting of that error correction was not functioning. In other words, even if you were experiencing continuous memory errors, no log of those errors was being recorded in the IPMI event log where one might expect them to show up. A user over on the Level One Techs forums had a conversation thread with someone from ASRock Rack, who reported that while the AM4 platform had ECC support, it did not have error reporting support.
[2] says:
> However we got AMD official respond today. AM4 does not support ECC error reporting function
This is referring to ECC as reported from an out-of-band management controller on those specific motherboards. That needs either platform-level integration into the memory controller (which is unlikely to be present on consumer Ryzen) or a driver/agent running on the host OS that captures ECC events and forwards them to the out-of-band management controller.
Standard ECC events are handled by the OS, and don't depend on (or otherwise interact with) an out-of-band management controller or any other external device. This works fine on Ryzen.
Got a reference? Because my Zen3 desktop has the driver loaded and information shown, just not the bitflips but that may be due to excessively early refresh configuration.
Normally you should not see any bit flips, because they happen at intervals of several months or even less frequently, depending on location.
Only for some old modules, e.g. 5 years old or older, can the frequency of errors increase a lot, even up to many bit flips per day, which means the offending module must be replaced.
This feature of identifying the aged modules is one of the main benefits of ECC.
I have not looked again at the AMD EDAC driver, which has been updated during the last year. But previously, a couple of years ago, its error-injection feature for testing was broken on Ryzen (it had not been updated since Bulldozer at that time), so the only way to verify that error reporting worked was to overclock the memory in the BIOS settings, to ensure that errors would be generated. Obviously, for such a test one should boot from read-only media, to avoid corrupting storage if the errors are excessive.
The ECCploit paper has extensive discussion of all the ways their work is detected, and how they even use detection to probe the correction structure. This is not a silent attack. This is a proof that ECC is a penetrable defense. Which we all know! The question is how difficult it is and how stealthily it can be done.
But regardless, ECC still sounds the alarm when it's being attacked. If no one listens, there's not much ECC can do about that.
Serious question: as an average person, are those hardware security issues (rowhammer, spectre, meltdown) an actual risk?
My understanding with Spectre and Meltdown was that they were an issue for escaping VMs and similar attacks - something AWS engineers should care about, but not me.
The solution is to disable JavaScript and not run any untrusted apps. And then move to a shack in the woods and live off the land, because you just cut yourself off from modern society.
NoScript is annoying for like a week, until you get the sites you use frequently and basically trust whitelisted.
Sure, it isn’t perfectly safe. If HN or my employer goes evil, they can rowhammer me, I guess. I’d expect it to cause a big to-do, though, so I’m not that worried about it.
I don’t really understand why people seem to think disabling JS is a big hassle. Is this motivated reasoning by web devs or something?
It is not a big problem, and the sort of “ambient shittiness” of the internet greatly improved by doing it. Most sites work fine, they’ll default to some (better) less dynamic state, maybe some ads won’t load. For those sites that don’t work, you can make an exception or leave. Personally I’m now mostly visiting sites by people who don’t enjoy over complicating things, and who think about fallbacks. It is great!
The year is 2024. The solar panels you installed from Alibaba begin to search for cell towers. The local LLM voice bot you built to keep you company is using a malicious npm package that suddenly communicates with the solar panels and starts sending packets to a Chinese server.
Your solar panels are talking to China, your lightswitch is part of a massive botnet promoting bitcoin on X, your car is selling your data to your insurance company to have an excuse to raise your rates, your browser is protecting your privacy by routing all your sensitive information through their servers for them to inspect.
Your phone is selling your location for antiabortion fanatics to harass you, or help your stalker find you. Your ISP is selling your browsing history to anyone with a dollar.
That data broker everyone was selling to just went bankrupt, and the banks are selling your data to anyone with a penny.
The fact that I must run JavaScript written by just about anyone in order to live in a modern society, or the fact that I keep having to write code in JavaScript in order to run a (completely non-JS-related) business.
No. As a sober hardware security researcher, most exploited vulnerabilities that would affect an average person are far more mundane and mostly software driven.
Everyone should install some kind of script-whitelisting add-on and only run JavaScript from websites they really trust. I like NoScript. I’m not sure what the Chrome pick is.
Other than that… we don’t often run random programs from the internet, right?
They’ve only scratched the surface for these sorts of bugs. Modern hardware is too complex to actually believe they’ll ever get them all.
Definitely not a security expert here, but this is one of the reasons I at least run uBlock Origin on just about everything - and recommend everyone do the same. The ad delivery networks are just such a huge attack surface.
Noscript would be much better of course, I guess I'm just too lazy to go that extra step.
I’m not an expert either, but I think the experts are not really very useful in this context.
At least, I typically see things about the trade-off between usability and security and the need to enable certain use cases. I think most security experts work in industry, where their job is to figure out how to patch things up within the constraint that their job doesn’t exist unless the company can keep doing the stuff it needs to do to stay in business.
I don't really care about any of that, I just want to be able to read text from the internet without my system getting messed up. It is a much easier use-case, because static content is usually pretty safe (although I do think there have been vulnerabilities in font and image rendering libraries). We don’t need an expert to intelligently analyze things and balance against the interests of competing parties because there’s no need to push in the “open things up” direction for the most part.
If you really are an average person, then no: Like most other supposed threats, you lose more to the fixes/mitigations than to the threat itself. They just make for great headlines and sensationalism, which is why you as an average person would hear about them at all.
Note that the average person wouldn't know WTF "DRAM" means, let alone "Rowhammer" or "Zen" or other esoteric industry terms.
No. I've run the Rowhammer test in memtest86 on my PC after building it (as part of the whole memtest suite, to verify my XMP was stable) and got zero errors on 64GB of DDR5 memory over all the passes. If Memtest couldn't do it when trying its hardest to brute-force it, nobody doing drive-by JavaScript has any chance to exploit it.
Could you tell us what DIMMs you're using? I thought Rowhammer-free RAM was a thing of the past, but if some manufacturer has fixed theirs to be immune, they deserve the extra sales and publicity.
Corsair Vengeance 2x32GB 5200MHz. My understanding is that DDR5 in general is mostly immune to known Rowhammer attacks because the on-die ECC is good enough to fix any issues. This attack seems to work only with AMD Zen processors and not with Intel 12th-14th gen, so I suspect DDR5 on Intel is still good.
> This poses a significant risk as DRAM devices in the wild cannot easily be fixed, and previous work showed that Rowhammer attacks are practical, for example, in the browser, on smartphones, across VMs, and even over the network.
That is just one view, namely the authors' view. You may wish to consider recent perfect 10 vulnerabilities for comparison, as these are far more likely to cause problems.
The practical answer is that if 99.9% of people out there have systems that mitigate these issues, no one will bother using these exploits in the wild, and you can turn off the mitigations to get the perf benefit and be reasonably sure you won't get exploited. Unless you're targeted, of course.
But "we", being the average tech expert, also has no way to know when that 1% will hit.
It takes only one creative genius to turn the next security issue into a thing that does affect us all. Some worm that eats all Linuxes, a virus that spreads through all BSDs, or something that installs crypto miners on every second Android. We cannot know.
And so we cannot defend ourselves against that. And so it's useless to worry about it. But it will happen. Our systems are way too monoculture, both soft- and hardware, to be protected against a digital potato famine.
If 99.9% of people can be exposed to the same malicious code and not even be aware that it was running in the background, that's all the more reason for a malicious actor to expose the largest number of people to it with relatively minimal risk.
Some of these exploits can be used in a browser. They leave no trace. So it is hard to tell how much these exploits have been used in the past and how likely a wider attack will happen in the future.
Some of these exploits have been used in targeted attacks towards end users so the risk is not 0.
I have a very vague understanding of all these DDR bit-flip attacks, but I found the original Hammertime paper, and it's actually very easy to read. I haven't gone through all of it, but it breaks things down very well.
I've heard bitflipping a million times and never really got it (not that I made serious effort) until this.
I feel like I just went through a 101 EE course. I had NO idea any of this was related to the actual hardware manufacturing imperfections, etc.
That explains the name Rowhammer. I've probably been under a rock and everyone knows this stuff.
> Due to the extreme density of modern DRAM arrays, small manufacturing imperfections can cause weak electrical coupling between neighboring cells. This, combined with the minuscule capacitance of such cells, means that every time a DRAM row is read from a bank, the memory cells in adjacent rows leak a small amount of charge. If this happens frequently enough between two refresh cycles, the affected cells can leak enough charge that their stored bit value will "flip", a phenomenon known as "disturbance error" or more recently as Rowhammer.
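(For anyone who, like me, hadn't seen the mechanics before: the "hammering" itself boils down to a tiny loop. A minimal sketch for x86-64 below; finding two addresses that map to different rows of the same DRAM bank is the actual hard part and is omitted here, so treat a and b as an assumed such pair.)

```c
/* Minimal sketch of the classic two-address hammer loop on x86-64.
   a and b are assumed to lie in different rows of the same DRAM
   bank; picking such a pair requires knowing the address mapping. */
#include <stdint.h>
#include <emmintrin.h>   /* _mm_clflush, _mm_mfence (SSE2) */

static void hammer(volatile uint64_t *a, volatile uint64_t *b, long n) {
    while (n--) {
        (void)*a;                        /* activate a's row */
        (void)*b;                        /* activate b's row */
        _mm_clflush((const void *)a);    /* evict, so the next read */
        _mm_clflush((const void *)b);    /* goes to DRAM, not cache */
        _mm_mfence();                    /* keep the accesses ordered */
    }
}
```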
> Due to the extreme density of modern DRAM arrays, small manufacturing imperfections can cause weak electrical coupling between neighboring cells.
This makes it sound like it's unavoidable and inherent to making DRAM. It isn't.
DRAM manufacturers have been pushing the limits to an extreme. That's why. Pursuit of profit. This is no different from Ford deciding the cost of settling Pinto lawsuits (from injuries and deaths) was less than the cost of fixing the car's design.
I think the implication was that memory encryption could mean that a rowhammer-induced bitflip would be amplified into scrambling the entire word of memory, which is more likely to have catastrophic effects than a single bit flip. That would be true for any reasonable definition of "stable" that admits any susceptibility to rowhammer.
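To make that amplification concrete, here is a hedged little demo using OpenSSL's low-level AES block API purely for illustration (real memory encryption uses different modes and tweaks, so this shows only the general effect): flip one bit of a ciphertext block and roughly half of the decrypted bits change, not one. Compile with -lcrypto.

```c
/* Demo: one flipped ciphertext bit scrambles the whole AES block. */
#include <openssl/aes.h>
#include <stdio.h>

int main(void) {
    unsigned char key[16]   = "0123456789abcde";   /* toy 128-bit key */
    unsigned char plain[16] = "sixteen bytes!!";
    unsigned char cipher[16], recovered[16];
    AES_KEY enc, dec;

    AES_set_encrypt_key(key, 128, &enc);
    AES_set_decrypt_key(key, 128, &dec);
    AES_encrypt(plain, cipher, &enc);

    cipher[5] ^= 0x04;                   /* the "Rowhammer" bit flip */
    AES_decrypt(cipher, recovered, &dec);

    int flipped = 0;                     /* count changed plaintext bits */
    for (int i = 0; i < 16; i++)
        flipped += __builtin_popcount(plain[i] ^ recovered[i]);
    printf("1 ciphertext bit flipped -> %d plaintext bits changed\n",
           flipped);
    return 0;
}
```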
But that’s a good thing. Sane state would be synced to disk, and any successful bit flip will halt the system, telling you that something bad is going on. It would be “catastrophic” for runtime, but not for the data.
I know far too little about hardware security. Is this one of the many inevitable vulnerabilities that arise from CPU optimization and have little real-world feasibility?
Arguably worse. This arises from the physics of DRAM. It occurs at a much lower level than an edge case of a feature that lets you leak info over a side channel.
Instead this is just: the data is stored as small charges in a grid, and by flipping nearby points on the grid a lot, you can leak some charge into your target cell.
The smaller the charges and the closer together they are, the easier Rowhammer attacks become. Also, the smaller and closer together the charges, the faster, cheaper, denser, and more efficient your RAM gets.
There are mitigations, but they are pushed to the limit.
From what I understand, it arises because DRAM manufacturers, interested in maximizing profits, have been pushing the limits of how small they can make the chips' features, backing off only until the RAM felt reliable "enough" - and Rowhammer et al. demonstrate it's still very easy to cause bit flips?
"maximize profits" and "best product for customer" are dual. you specifically want small chip features - or don't you like speed, power efficiency, and low cost?
They push the size to the limit, and stop when random writing is unlikely to cause any bitflips. Stopping at the point rowhammer would be unlikely would be stopping earlier.
As others said, this isn't just about profits. It's about being able to compete on costs (i.e. being able to survive at all) and to compete on the best performance.
This places the problem less at singular manufacturers and more at the whole industry.
The DRAM interface is pretty well decoupled from the memory array itself. So whether you're looking at DDR5 or LPDDR5(x) or GDDR6(x) or HBM3(e) isn't the right question. What matters are the implementation details up to the manufacturer's discretion, such as on-die ECC.
That only brings DRAM into alignment with flash and magnetic storage, so it's not really a negative. Everything in your computer is converging on semiconductor with bounded probabilistic state + math.
It's always been that way; the question is just how many nines of reliability we're talking about. E.g. at Google scale, bit flips in memory from cosmic rays and general noise happen every day. Everything has checksums on it.
Basically. Pushing the timings and sizes makes it likely that some of your bits will fail. Rather than dropping the speed and density to get reliability, you just throw an extra chip on to give you redundancy.
Well, they said that it needs further testing. If it were mostly fixed, that would mean ECC could help even more.
I mean, the on-die ECC probably already helps.