[dupe] Why don’t PCs use error correcting RAM? “Because Intel,” says Linus (arstechnica.com)
48 points by husam212 on Jan 8, 2021 | 27 comments




I absolutely love that Linus speaks his mind. Candor and telling it honestly is getting rarer...


Unfortunately, "candor and telling it honestly" is frequently code for "I want to be hostile toward people I don't like, but I don't want to experience social consequences for it", so there's a lot of noise in the interpretation of this space.


Linus has since apologized for his words and actions against people, worked to correct them, and has been maintaining a pretty good track record so far. Your generalization falls short of reality unfortunately. It is possible to use candor and "tell it like it is" without degrading people, which is exactly what you see here.


I agree it's often a problem. Happily, in this instance, Linus's criticisms are on the mark and not directed at an individual.


Do you have any evidence for this claim?


I'm not sure what 'evidence' would be satisfying to you for something like this, but maybe something about the man himself will help you understand what he means?

https://www.newyorker.com/science/elements/after-years-of-ab...

tl;dr for the article: “Please just kill yourself now. The world will be a better place” - Linus Torvalds, former "Candid" speaker


Interesting. Looking online, it seems AMD's desktop-grade Ryzen line supports ECC, although motherboard ECC support is still somewhat rare. Yet another reason to prefer AMD these days I guess. This reddit comment has more info on Ryzen ECC support: https://www.reddit.com/r/Amd/comments/ggmyyg/an_overview_of_...


Given current RAM and disk sizes, and how much trust we put in computers, every cell phone or computer should use ECC RAM and checksummed filesystems these days.


The density/process size of the RAM would seem to drive that as well. A cosmic ray, for example, has a better chance of flipping a bit in a denser collection of transistors, assuming the die size stays the same while the total amount of RAM increases, as you mention.


The market killed ECC for consumers. It used to be widely available for desktop-class hardware (I used to have some), but it was slower and more expensive than non-ECC RAM. None of the marketing types made a good case for why normal people should pay more money for less-performant hardware, so the market disappeared.

Even the security issues only meaningfully impact servers.

People like Linus can buy server-class hardware if ECC is so important to them.


The 'market' was influenced by two things. 1. Intel using ECC to segment consumer/server offerings. 2. Even when AMD supported ECC (officially or unofficially) on consumer products, consumers voted with their wallets, partly out of ignorance.

ECC RAM should be more expensive as you're paying for more bits, and marginally slower as you're comparing stored with computed. If consumers don't value their data (they say they do but don't act accordingly), bit errors are what we get.

I do my job on a company laptop; if it were my own company I'd use an AMD desktop with ECC.
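
To make the "paying for more bits" and "comparing stored with computed" points above concrete, here is a minimal sketch of a single-error-correcting code. It uses the toy Hamming(7,4) code (3 check bits per 4 data bits), chosen only because it is small enough to read. It is not what any real memory controller implements: ECC DIMMs typically store 8 check bits per 64 data bits. The function names `encode` and `decode` are made up for this illustration.

```python
# Toy Hamming(7,4) single-error-correcting code.
# Illustration only: real ECC memory uses wider codes (typically 72 bits
# stored per 64 data bits), not this exact scheme.

def encode(data: list[int]) -> list[int]:
    """Encode 4 data bits into a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def decode(codeword: list[int]) -> tuple[list[int], int]:
    """Recompute parity, compare with the stored bits, fix a single flip.

    Returns (data_bits, error_position); error_position is 0 when the
    stored and recomputed parity agree, else the 1-based flipped position.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # the syndrome names the bad position
    if position:
        c[position - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], position


if __name__ == "__main__":
    word = encode([1, 0, 1, 1])
    word[4] ^= 1  # simulate a single bit flip in "memory"
    print(decode(word))
```

The extra parity bits are the added cost, and the recomputation in `decode` is the added work on every read. This toy code only handles one flipped bit per codeword; two flips would be miscorrected.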


Old Sun Blade desktops have a jumper that lets you use either ECC or non-ECC RAM.


> so the market disappeared.

So your claim is that it has nothing to do with Intel's product segmentation strategy?


While I don't think he should be flipping the bird, his technical arguments are absolutely spot on.

A side question for readers: How often do RAM errors occur, and are they a significant problem in practice?

I don't have any direct experience with this (that I could tell) using consumer computers for the past 15-odd years of programming.


That photo is from him saying "fuck you" to NVIDIA back in 2012: https://arstechnica.com/information-technology/2012/06/linus...

FTA: "Bit flips can happen for many reasons, beginning with cosmic-ray impact or simple hardware failure. A large-scale study[0] of Google servers found that roughly 32 percent of all servers (and 8 percent of all DIMMs) in Google's fleet experience at least one memory error per year. But the vast majority of these are single-bit errors—and since Google is using server CPUs and ECC RAM, this means the machines in question keep right on trucking."

[0] http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
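
The figures quoted above are yearly probabilities of seeing at least one error, not error counts. As a rough sketch of how such a rate compounds over several DIMMs or several years, the snippet below applies the standard "at least one of n independent events" formula. Independence is an assumption made purely for illustration, and the numbers describe Google's server fleet, so applying them to a desktop is a guess, not a result from the study.

```python
# Back-of-the-envelope compounding of the quoted yearly rates.
# Assumption: errors on different DIMMs and in different years are
# independent. Real hardware may well violate this, so treat the output
# as a rough illustration only.

P_DIMM_PER_YEAR = 0.08    # quoted: share of DIMMs with >= 1 error per year
P_SERVER_PER_YEAR = 0.32  # quoted: share of servers with >= 1 error per year


def p_at_least_one(p_single: float, trials: int) -> float:
    """Probability that at least one of `trials` independent events occurs."""
    return 1.0 - (1.0 - p_single) ** trials


if __name__ == "__main__":
    for dimms in (1, 2, 4):
        p = p_at_least_one(P_DIMM_PER_YEAR, dimms)
        print(f"{dimms} DIMM(s), 1 year: {p:.1%}")
    for years in (1, 3, 5):
        p = p_at_least_one(P_SERVER_PER_YEAR, years)
        print(f"1 server, {years} year(s): {p:.1%}")
```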


Something that doesn't get said enough is that Google had a habit at the time of buying RAM chips that had failed manufacturer QA, sticking them on DIMMs themselves, and revalidating those DIMMs. They were really leaning into the whole "embrace failures if they're going to happen anyway and you can get cheaper servers out of it". So those numbers need to be taken with a grain of salt.


Even so, one-third of a memory error per year is a rate that I'm totally comfortable with.


Also FTA: "ECC RAM ... can generally stop Rowhammer attacks—in which rapidly flipping bits in one area of RAM cause bits in an adjacent area to change."


> While I don't think he should be flipping the bird, his technical arguments are absolutely spot on.

Linus is certainly known for some inappropriate rants against people, which he has since toned down, apologized for, and worked to correct.

That said, this image is taken from an event where Linus was prompted about Nvidia in 2012 during some filming. For those who don't use Linux, these times were extremely chaotic in graphics, especially for Linux with respect to Nvidia, who were largely catering to Windows users. It's what kept Windows gaming in such a strategic position. This affects people who have lower end machines, people who can't afford Windows, and people who seek freedom from predatory data collection. Throwing up his middle finger at a company with those kinds of practices is hardly revolutionary or inappropriate.

Source: https://www.wired.com/2012/06/torvalds-nvidia-linux/amp


Ah, thank you for the explanation on the pic. I definitely remember that being a huge issue in 2012.


More often than you'd think. There's a paper from Black Hat 2011 [0] where bit flips and domains with a small Hamming distance were used. Take google.com and woogle.com, which in their ASCII representation differ by only one bit. The paper showed that these domains will receive traffic meant for their intended counterparts because of bit flips in the RAM of DNS servers.

[0]: https://media.blackhat.com/bh-us-11/Dinaburg/BH_US_11_Dinabu...
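
The one-bit difference between google.com and woogle.com mentioned above can be checked directly. The sketch below is not taken from the paper; the helper names `hamming_distance` and `one_bit_variants` are invented for this example. It flips each bit of each character in a label and keeps the results that are still lowercase letters, digits, or hyphens.

```python
import string

# Characters kept as plausible hostname-label characters in this sketch.
# Uppercase is excluded because DNS names are case-insensitive, so a
# case flip would not produce a different domain.
VALID_CHARS = set(string.ascii_lowercase + string.digits + "-")


def hamming_distance(a: str, b: str) -> int:
    """Count differing bits between two equal-length ASCII strings."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    pairs = zip(a.encode("ascii"), b.encode("ascii"))
    return sum(bin(x ^ y).count("1") for x, y in pairs)


def one_bit_variants(label: str) -> list[str]:
    """Return labels that differ from `label` by exactly one flipped bit."""
    variants = []
    for i, ch in enumerate(label):
        for bit in range(8):
            flipped = chr(ord(ch) ^ (1 << bit))
            if flipped in VALID_CHARS:
                variants.append(label[:i] + flipped + label[i + 1:])
    return variants


if __name__ == "__main__":
    print(hamming_distance("google.com", "woogle.com"))
    print(one_bit_variants("google"))
```

Here 'g' is 0x67 and 'w' is 0x77, so the two names differ only in the 0x10 bit of the first byte.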


I think a measured dose of profanity has its place as a way to emphasise a point beyond what stern wording can convey.

That said, I also would like to know more about how often RAM errors actually happen. My gut feeling is that on mass produced consumer hardware, software errors FAR outweigh hardware errors in terms of how much they inconvenience me personally. Maybe the very occasional "that's weird" moments where a restart fixed it were hardware errors but they're few and far between. That's just a feeling, though, and I like numbers.


If I understand it right, all of the AMD Ryzen processors support ECC, provided the motherboard/chipset do as well. Which somewhat supports Linus' opinion.


They don’t officially support it but they don’t disable it on consumer chips like Intel does.

It’s supposed to work, but if you have any problems with it you can’t go back to AMD and complain. IMHO that’s not really good enough; if ECC is going to protect you, they need to officially support it.

It sounds like not officially supporting ECC is a CYA move for when it doesn’t work.


Makes me wonder how many of the blue screens/crashes I've experienced over the years on my desktop could have been avoided by ECC. On a global level, think of all the time and productivity lost due to an arbitrary policy/penny pinching.


Capitalism drives tech forward, and then very quickly stifles it once it's proven you can sell it.

It's sad thinking of where we could be today if mega-corps had less incentive to do this.

Edit: to the downvoter, please explain why you think my opinion is wrong.



