I'll bite: every time I see a CPU-related thread on HN there are a few people clamoring for ECC support. While I get why I'd want ECC on a high-availability server running critical tasks, I don't really feel a massive need for it on a workstation. I mean of course if it's given to me "for free" I'll gladly take it, but otherwise I'll prefer to trade it for more RAM or simply a cheaper build.
Why is ECC such a big deal for you? Maybe I'm lucky but I manage quite a few computers (at work and at home) and I haven't had a faulty RAM module in at least a year. And even if I do, I run memtest to isolate the issue and then order a new module. An inconvenience of course, but a pretty minor one IMO.
Do you also use redundant power supplies? I think in the past years I've had more issues with broken power supplies than RAM modules.
> I haven't had a faulty RAM module in at least a year
ECC isn't for physically broken RAM, it's for the prevention of data corruption caused by environmental bit-errors (e.g. cosmic-ray bitflips).
Memory density increases with RAM capacity - which means a higher potential for noise (and cosmic-rays...) to make one-off changes here-and-there.
I understand this now happens quite regularly, even on today's desktops ( https://stackoverflow.com/questions/2580933/cosmic-rays-what... ) - I guess we just don't observe it much because probably most RAM is occupied by non-executable data or otherwise-free memory - and if it's a desktop or laptop then you're probably rebooting it regularly so any corruption in system memory would be corrected too.
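To make "correcting a bitflip" a bit more concrete, here is a toy sketch of the idea using a Hamming(7,4) code in Python. This is only an illustration: real ECC DIMMs typically use a wider code over 64-bit words (72 bits on the bus) that can also detect double-bit errors, and the function names here are made up for the example.

```python
def encode(nibble):
    """Encode 4 data bits into a 7-bit Hamming codeword (bit positions 1..7)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    bits = [0] * 8  # index 0 unused so that list index == bit position
    # Data bits live at the positions that are not powers of two.
    bits[3], bits[5], bits[6], bits[7] = d
    # Each parity bit covers the positions whose index has that bit set.
    bits[1] = bits[3] ^ bits[5] ^ bits[7]
    bits[2] = bits[3] ^ bits[6] ^ bits[7]
    bits[4] = bits[5] ^ bits[6] ^ bits[7]
    return sum(bits[pos] << (pos - 1) for pos in range(1, 8))


def decode(word):
    """Return (nibble, flipped_position); position 0 means no error was seen."""
    bits = [0] + [(word >> (pos - 1)) & 1 for pos in range(1, 8)]
    # XOR of the positions of all set bits is 0 for a valid codeword;
    # after a single flip it equals the position of the flipped bit.
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos
    if syndrome:
        bits[syndrome] ^= 1  # correct the single-bit error
    nibble = bits[3] | (bits[5] << 1) | (bits[6] << 2) | (bits[7] << 3)
    return nibble, syndrome


stored = encode(0b1011)
corrupted = stored ^ (1 << 4)      # simulate a flip at bit position 5
value, position = decode(corrupted)
print(bin(value), position)
```

The point is that the memory controller can both fix the value and report where the flip happened, which is the part non-ECC systems have no way of doing.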
This Stack Overflow link is interesting but most of the concern is over very theoretical issues. In practice a significant portion of humanity uses multiple non-ECC RAM devices every day and yet most of us don't seem to experience widespread memory issues. I can't even remember the last time my desktop experienced a hard crash (well actually I can, it was because of a faulty... graphics card).
I wish my phone fared that well, but I'm not sure RAM would be the first suspect for my general Android stability issues...
> most of the concern is over very theoretical issues
I've seen photo and other binary files become corrupted that were sitting on RAID drives. The RAID swears they're fine, the filesystem swears they're fine, both are checksummed so I believe them. The only possibility that I can see is that they were corrupted while being modified or transferred on non-ECC desktops connected to the RAID.
I'm not afraid of my computer crashing. I'm afraid of data I take great pains to preserve being silently, indeed undetectably, corrupted while in flight or in use. So that's why ECC is worth it to me.
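For what it's worth, a partial mitigation on a non-ECC desktop is to hash files before and after moving them. A minimal sketch in Python, with `verified_copy` being a name invented for this example:

```python
import hashlib
import shutil


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verified_copy(src, dst):
    """Copy src to dst, then re-read both files and compare their digests."""
    before = sha256_of(src)
    shutil.copyfile(src, dst)
    after = sha256_of(dst)
    if before != after:
        raise IOError(f"digest mismatch copying {src} -> {dst}")
    return after
```

This is not a substitute for ECC: the re-read may be served from the page cache rather than the disk, and the hashing itself runs in the same untrusted RAM. It narrows the window for silent corruption, it doesn't close it.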
In the past I've had a flimsy RAM module in a MacBook Pro and it was a real pain.
Everything appeared to work just fine, but on stressing the RAM with a lot of virtual machines the host would crash.
That was not the main issue, but took some time to diagnose as I was also running beta virtualisation software and was tempted to blame the change instead of the hardware.
Copying virtual machines from one disk to another did end up corrupting the data.
That was painful to find out.
Almost exactly the same thing happened to me - marginal RAM in an old-style MacBook Pro. Drove me absolutely crazy.
People always talk about cosmic rays as what ECC is guarding against - not at all! It's shitty RAM, especially in laptops, when it's under stress (for example, buffering files in a large copy), and you find out months later, when it's too late to fix it. Not "theoretical" at all.
I'm curious: if storing lots of photos as .dng, .png or .jpg on ZFS without ECC, one presumably gets bit flips eventually. How does this affect the files? Do you just get artifacts in the photo? Or does the file become unreadable? If so, can you recover the file (with artifacts)?
I guess the answer boils down to how much non-recoverable but essential-for-reconstruction metadata there is in these file formats.
I had bit flips on a few JPGs and it renders them useless. Luckily I had a backup of a backup that had them uncorrupted. I'm still trying to find a complete solution to this problem. Presumably the TIFF or BMP file formats are more stable against bit flips.
I'd been reading so much about it over the past year or so I got to wondering just how many times cosmic rays affect our brains and what kind of protections we're running up in our skulls.
Our brains evolved through a chaotic, organic process. We're all the time storing new data and even losing data (selective memory). I'm thinking there's no mitigation process. If anything the random environmental noise might play some role in consciousness.
Depends which part of the file gets corrupted, and the issues between PNG and JPG are dramatically different. There are key bytes in the file (like size of segment, start of segment, etc) that if corrupted would dramatically mess up your image. If it's just in the compressed image data you'll just see some artifacts, and a JPEG already has plenty of compression artifacts anyhow...
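A rough way to see the difference between uncompressed and compressed data is to flip a single bit in each. The sketch below uses zlib as a stand-in for a compressed format; it is not a JPEG or PNG decoder, and the "pixels" are just synthetic bytes made up for the demo.

```python
import zlib


def flip_bit(data, bit_index):
    """Return a copy of data with exactly one bit inverted."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


def count_differing_bytes(a, b):
    """Number of byte positions where a and b disagree (plus any length gap)."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def try_decompress(stream):
    """Return the decompressed bytes, or None if zlib rejects the stream."""
    try:
        return zlib.decompress(stream)
    except zlib.error:
        return None


# Synthetic stand-in for raw pixel data (think BMP-style storage).
pixels = bytes((x * 7 + y * 13) % 256 for y in range(64) for x in range(64))
packed = zlib.compress(pixels)

# Raw data: a flip damages exactly the byte it lands in.
raw_damaged = flip_bit(pixels, 8 * 1000)
print("raw bytes changed:", count_differing_bytes(pixels, raw_damaged))

# Compressed data, flip in the 2-byte header: the stream is rejected outright.
print("header flip rejected:", try_decompress(flip_bit(packed, 0)) is None)

# Compressed data, flip mid-stream: outcome depends on where the bit lands.
middle = try_decompress(flip_bit(packed, 8 * (len(packed) // 2)))
if middle is None:
    print("mid-stream flip: stream rejected")
else:
    print("mid-stream flip: bytes changed:", count_differing_bytes(pixels, middle))
```

zlib carries a checksum, so it tends to fail loudly rather than hand back wrong data. A format without that check would presumably show artifacts from the damaged point onward instead, which is closer to what people describe with JPEGs.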
People don't realize how frequent it could be because most of the time you don't see any of the consequences.
Here's a "funny" consequence of bit flips: bit squatting.
It's about exploiting a bit flip before a DNS query: you register the proper DNS domain and you wait for machines to wrongly contact you because they got the name wrong.
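To get a feel for how many names are one bit flip away from a given one, here's a small Python sketch that enumerates them for a single DNS label. The function name and the filtering rules are my own simplification (lowercase letters, digits and hyphens only), not taken from the talk.

```python
import string

# Characters allowed in a conventional hostname label.
ALLOWED = set(string.ascii_lowercase + string.digits + "-")


def bitsquat_variants(label):
    """Return the labels that differ from `label` by exactly one flipped bit."""
    variants = set()
    for i, ch in enumerate(label):
        for bit in range(8):
            # DNS is case-insensitive, so fold case before comparing.
            flipped = chr(ord(ch) ^ (1 << bit)).lower()
            if flipped == ch or flipped not in ALLOWED:
                continue
            candidate = label[:i] + flipped + label[i + 1:]
            # Labels may not start or end with a hyphen.
            if candidate.startswith("-") or candidate.endswith("-"):
                continue
            variants.add(candidate)
    return sorted(variants)


print(bitsquat_variants("cnn"))
```

For a three-letter label that's already over a dozen registrable variants, e.g. `con` for `cnn`, since `n` and `o` differ in only the lowest bit.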
If you edit images or videos, you may notice small corruptions in the image. If you use databases or do data analysis, there may be one number that is wrong, or some string may have one byte of garbage. Sometimes an application may crash.
All this is very rare. It only matters if you need data integrity and do work where data has value.
The value of ECC for me isn't to provide redundancy, or error correcting, it's to let me know that the RAM (or parity bit) is going bad. Otherwise you are flying blind and injecting bit rot without being aware of it.
You write, "I haven't had a faulty RAM module in at least a year." How do you know this without ECC?
Here's one example: looking at Firefox crash data, a fairly large percentage of crashes are caused by bitflips that corrupt data structures (e.g. making a null pointer which is null-checked into a non-null one that points off into memory that can't be read).
So what I really want is for everyone to default to ECC RAM. It would prevent issues like that to a large degree.
I'm sure all of you who love debating ECC will enjoy this defcon video:
https://www.youtube.com/watch?v=aT7mnSstKGs
"DEFCON 19: Bit-squatting: DNS Hijacking Without Exploitation (w speaker)"
The problem is, without ECC you won't know if you have marginal RAM. Basic memory tests can frequently pass for hours at a time, only to have a couple bits flip a few days later.
Plus, with tens of GB of RAM, you likely won't even notice, as the majority of that RAM is being used as disk cache, or application data. The best case (but likely the lowest probability) is that the bit flip happens in an executable page on an instruction that gets executed (vs the large number which won't get executed) and the application crashes.
If that happens you will never figure out why your application crashes, but if it happens enough, you will start looking for problems.
> If that happens you will never figure out why your application crashes, but if it happens enough, you will start looking for problems.
Ironically, I had this happen. In particular Overwatch would crash constantly on my gaming rig starting around June, I reinstalled Windows, reverted to Windows 7, tried various video driver revisions and NOTHING MADE SENSE.
Eventually started pulling parts out of my system, lo and behold a single 4GB stick of RAM was bad and causing all my grief. If I had ECC memory I would have known what the problem was right away, and replaced it without pulling my hair out first.
If you had had ECC you might not have noticed anything at all because your system worked just fine other than incrementing an error counter somewhere in the kernel.
This depends to some extent on how many parts of that stick were affected. But best case the system would have simply recovered.
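On Linux, that error counter is usually visible through the EDAC sysfs interface, so you can check it instead of guessing. A minimal sketch in Python, assuming an EDAC driver is loaded for your memory controller; `read_edac_counts` is a name made up for this example:

```python
from pathlib import Path

# Where the Linux EDAC subsystem normally exposes memory controller stats.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller: {"ce_count": int|None, "ue_count": int|None}}."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        entry = {}
        # ce_count = corrected errors, ue_count = uncorrected errors.
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None  # file missing or unreadable
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    found = read_edac_counts()
    if not found:
        print("no EDAC memory controllers found (no ECC, or driver not loaded)")
    for name, entry in found.items():
        print(name, entry)
```

A corrected-error count that keeps climbing on one controller is a hint that a stick is going marginal, even if nothing has crashed yet.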
If a bit flips in a lone workstation and no one is around to see it, does it make a bug?
More seriously though, in my experience faulty RAM is generally pretty easy to diagnose and leads to general system instability. I guess the worst case scenario is generating corrupt data before the issue is diagnosed, but again, while I would be very wary of that on a database server or something similar, I've never found it to be a massive issue on a workstation (at least if you have decent backups, that is).
But maybe I've just been lucky so far. But given that the vast majority of consumer-grade computers don't come with ECC and yet RAM issues are still relatively rare I guess I'm not the only lucky one.
> If a bit flips in a lone workstation and no one is around to see it, does it make a bug?
No. Bugs are, these days, considered to be mostly software issues; undetected hardware faults are not bugs in that sense, but they could lead to data corruption or, at a much higher level, wrong output.
If you don't care at all about the output of your computer (game playing, other recreational use) then not having ECC is fine, but if you do care about your results and you have tens or even hundreds of GB of RAM in your machine, then having the option of ECC is useful.
Intel is just using the ECC thing as a way to justify the price difference between their Xeon product line and the consumer stuff.
If you ever have to deal with a filesystem that slowly got corrupted because of an undetected memory issue you'll be overnight transformed into an ECC advocate.
Keep in mind that those 'decent backups' were made by the machine you do not trust.
And god help you if your backups were incremental.
Another approach is to verify your output (preferably on another system). Good validation and test suites should be able to catch messed up "output" along with many other non-ECC related issues.
I guess it makes sense to have ECC RAM on the machine building your releases (I actually don't even have that at the moment but I wouldn't advocate that...) but for your dev machine does it really matter?
I mean, at this point it's really about a rather subjective perception of risk and particular use cases. In my situation I find that memory issues are very low on my list of "things that can go catastrophically wrong". Really the only thing I can think about is building a corrupt release on my non-ECC build server. But from experience I'm not exactly in the minority to do that either and yet I don't observe many such issues in the wild.
Verifying your output on another system requires another system, the cost of which handily outweighs the cost of having ECC if the CPU/chipset support it.
As for: "I guess it makes sense to have ECC RAM on the machine building your releases"
That's a very narrow use case. There are many more use cases than that one, and for a lot of those it makes good sense to have ECC: inputs to long-running processes, computations that have some kind of real-world value (bookkeeping, your thesis, experimental data subject to later verification, legal documents and so on).
> but for your dev machine does it really matter?
Maybe not to you.
> I mean, at this point it's really about a rather subjective perception of risk and particular use cases.
No, it's about something that, if adopted widely, would let us check off one possible source of errors without meaningfully increasing the cost of your average machine. It would still be an option; nobody would be forced to use anything.
> In my situation I find that memory issues are very low on my list of "things that can go catastrophically wrong".
Good for you.
> Really the only thing I can think about is building a corrupt release on my non-ECC build server.
You are still thinking about just your own use-cases.
> But from experience I'm not exactly in the minority to do that either and yet I don't observe many such issues in the wild.
Likely you also have somewhere between 8 and 32 GB of RAM in your machine.
If I look at my servers, which have been operating for years on end, they do tend to accumulate corrected ECC errors. The only reason I know about it is because there is ECC in there to begin with. If those machines were running without ECC I'd likely not even be aware of any issues. But maybe the machines or some application on them would have crashed (best possible option), or maybe some innocent bits of data would have been corrupted (second best). And at the far end of the spectrum, maybe we'd have to re-install a machine from a backup (not so good, downtime, extra work) or maybe it would have led to silent data corruption (worst case).
Now, servers are not workstations, but my workstation has exactly as much RAM as my servers and no ECC, which is highly annoying but single threaded performance of the various Intel CPUs is much better on the consumer systems than it is on the Xeons unless you want to be subject to highway robbery prices.
So for me having the ECC option on consumer hardware would be quite nice, and I suspect anybody else doing real work on their PCs would love that option too.
Yeah I see where you're coming from, I guess I have a different perspective because I never really considered using consumer-grade hardware for "pro" server use. But I suppose it makes sense if you don't want to pay the premium for a Xeon type build. I wouldn't be comfortable hosting a critical production database on non-ECC RAM for instance.
Going on a tangent this discussion made me wonder if ECC memory was common on GPUs (after all, with GPGPU becoming more and more mainstream what good is it having ECC system RAM if your VRAM isn't?)
Unsurprisingly it turns out that consumer-grade GPUs don't have ECC. However I stumbled upon this 2014 paper: "An investigation of the effects of hard and soft errors on graphics processing unit-accelerated molecular dynamics simulations"[0].
Now obviously it's a rather specific use case but I thought their conclusions were interesting:
>The size of the system that may be simulated by GPU-accelerated AMBER is limited by the amount of available GPU memory. As such, enabling ECC reduces the size of systems that may be simulated by approximately 10%. Enabling ECC also reduces simulation speed, resulting in greater opportunity for other sources of error such as disk failures in large filesystems, power glitches, and unexplained node failures to occur during the timeframe of a calculation.
>Finally, ECC events in RAM are exceedingly rare, requiring over 1000 testing hours to observe [7, 8]. The GPU-corrected error rate has not been successfully quantified by any study—previous attempts conducted over 10,000 h of testing without seeing a single ECC error event. Testing of GPUs for any form of soft error found that the error rate was primarily determined by the memory controller in the GPU and that the newer cards based on the GT200 chipset had a mean error rate of zero. However, the baseline value for the rate of ECC events in GPUs is unknown.
Not the OP, but I twice spent weeks debugging random software crashes under high load that turned out to be triggered by faulty memory. Because it happened while I was working on a locking infrastructure, I really had to figure out why it was crashing on my workstation and couldn't just be content that I couldn't reproduce it elsewhere.
I would really hope that AMD enabling ECC on all parts would cause Intel to stop differentiating their product lines on something that should be available everywhere. ECC should be mandated except for the least critical uses (toys, video players, etc).
I can see no mention of it anywhere and it's a "consumer" chip (for Intel, ECC is a segmentation feature between consumer and server hardware), so probably not.