Microsoft analyzes over a million PC failures, results shatter enthusiast myths (extremetech.com)
129 points by evo_9 on June 27, 2012 | 42 comments



I wonder how many of those overclocked systems belong to people who are 100% convinced their system is stable and the programmers of Microsoft and <whatever other applications they're using> are idiots for having stupid crashing bugs.

My system can prime95 for 5 minutes, obviously it's rock solid.


Alternatively, how many of those failures were the same machines crashing repeatedly until their owners found stable settings?

[Anecdote] If my brief experience is anything to go by, overclocking involves lots of crashes until you get it right. Once it is working I would expect no more failures than a shop-bought PC, but by that point the statistics are already skewed by all those crashes that I don't care about.


This is accounted for in the article.


There are OSes that are designed to survive a fair number of HW errors, but I don't think most people who OC would want to pay the speed penalties for that.


> Are Intel chips just as good as AMD chips? At stock speeds, the answer is yes. Once you start overclocking, however, the two separate. CPU Vendor A’s chips are more than 20x more likely to crash at OC speeds than at stock, compared to CPU Vendor B’s processors, which are still 8x more likely to crash.

I've got a hunch which one is inferior, but I kinda wish they disclosed this information. Any idea why they wouldn't?


What's your hunch? I'm very curious as I haven't paid attention to any hardware in quite a while.

As for non-disclosure, I'd guess it's just some industry politics. No need to burn bridges. That, and I can see irresponsible bloggers sensationalizing the data. Would there be any possibility of legal action from a CPU Vendor (sensationalist headlines or not) if it was divulged?


It's pretty tough to make a defamation or libel claim if the statement is factual. But IANAL, and I would guess the potential for pissing off your partners outweighs the benefits of an unredacted report.


It's AMD. I OC, and if you look around OC communities and benchmarks you will find that Intel chips consistently outperform AMD chips in this area.


The theory being that Intel bins for price control, and AMD bins for defects.


Can you explain what this means? I'm not understanding.


The concept of "binning" has to do with the manufacturing process for chips.

Chips are such complicated devices that even in a batch where every one works correctly, some will only run, or only stay stable, at a slightly slower clock speed than others.

The chip manufacturer takes this into account and sorts the chips by performance on a low-to-high scale, which makes up the "family" of processors that they offer.

However, there is a certain amount of business strategy involved in the sorting as well. Marking a higher-performing chip for the "high end" bin brings in more dollars per chip, but it also increases the supply of that "high end" part. The market may not absorb all of that supply, and the resulting surplus will eventually drive the chip's price down. In addition, the middle-range (commodity) chips are in higher demand anyway, so you want an abundant supply of those in general.

I think the parent comment was suggesting that Intel is more interested in keeping the supply of higher-end processors low (thus scarce, thus valuable on the open market) and is willing to "underbin" a chip to help control those prices, whereas AMD apparently bins strictly on pass/fail performance.

The end result is that you're more likely to be able to overclock a commodity Intel chip and have it stay stable, because in reality it's probably "overqualified" for the label it's sold under within its family.


I think they mean this:

When you build a CPU factory, it can cost a lot of money to set up the manufacture process for a chip. You want to sell a number of different chips at different speeds/prices (if you have a cheaper chip lower in the range you can charge top dollar for the fastest one).

You don't necessarily want to set up that many separate production lines though, so here's what you do instead.

You make (let's say) a 3GHz chip. You then test each one at 3GHz; the ones that pass are sold as 3GHz CPUs. The ones that fail are tested again at 2.8GHz (or whatever), and so on until they find a speed at which they are stable; they are then sold as that speed.

The implication is that AMD does this, but Intel will just take a chip that works (or might work) at 3GHz and sell it as 2.8GHz because they don't have enough faulty chips (as in chips that only run stable at slower speeds).

What this means is that overclocking potential will be better on Intel chips because they have a higher chance of being stable at higher than the advertised clock speed (basically you're not overclocking so much as un-underclocking).
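
To make the two strategies concrete, here's a toy sketch; the speed grades, yield distribution, and quota are all made-up numbers, purely for illustration:

    import random

    BINS_GHZ = [3.0, 2.8, 2.6]  # speed grades sold, fastest first

    def strict_bin(max_stable_ghz):
        # "Defect" binning: sell each chip at the fastest grade it passes.
        for grade in BINS_GHZ:
            if max_stable_ghz >= grade:
                return grade
        return None  # scrap

    def price_controlled_bin(max_stable_ghz, top_bin_quota_left):
        # "Price control" binning: cap how many chips ship at the top grade
        # and deliberately down-label the surplus into a slower bin.
        grade = strict_bin(max_stable_ghz)
        if grade == BINS_GHZ[0] and top_bin_quota_left <= 0:
            return BINS_GHZ[1]  # underbinned: headroom above its label
        return grade

    # Toy wafer: most dies happen to be stable a bit above 3.0 GHz.
    chips = [random.gauss(3.05, 0.15) for _ in range(1000)]

    quota = 200  # only 200 units allowed into the top bin
    labels = []
    for c in chips:
        label = price_controlled_bin(c, quota)
        if label == BINS_GHZ[0]:
            quota -= 1
        labels.append((c, label))

    # Headroom = true stable speed minus the grade it is sold as.
    headroom = [c - label for c, label in labels if label is not None]
    print("average headroom: %.2f GHz" % (sum(headroom) / len(headroom)))

With strict binning the label tracks the measured speed, so the average headroom stays small; with the quota in place a big chunk of chips carry a label below what they actually passed, which is exactly the "un-underclocking" headroom described above.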


The only thing surprising to me in the results was that laptop CPUs and memory were more reliable than desktops'. Their reasoning is that laptops are built more conservatively. Maybe so, but desktops come from many more sources and may simply have much more variable build quality. I want to see that data for desktops made only by the companies who also make laptops.


I don't think it is surprising at all.

A modern laptop (produced in the last 10 years or so) has zero unused space, meaning that even if it isn't treated well, components won't come out of their sockets because they literally have nowhere to go. Most laptops also have tighter thermal regulation than desktops, because they face a harder cooling problem and wouldn't work well if they didn't.

Laptops also lag desktop technology by at least half a generation -- the average laptop CPU at a given time is produced with older-at-the-time (higher yield, better understood, more reliable) silicon processes than desktop chips.

Every floor-sitting desktop I've seen gets kicked occasionally, and there's actually room for memory, fans and cards to move around.

I would guess a Lenovo or Dell desktop is still less reliable (as measured by real life crashes such as in this study) than a laptop produced by the same company.


The Nvidia GPU fiasco shows that you can still have problems even if the component doesn't have anywhere to go.


They already excluded overclocked and white box systems from the laptop comparison. Edit: From laptop comparison section of original paper: "To avoid conflation with other factors, we remove overclocked and white box machines from our analysis. Because desktops have 35% more TACT than laptops, we only count failures within the first 30 days of TACT, for machines with at least 30 days of TACT."


Laptops are more likely to fail in situations where they can't report crash data.


Windows will write out a minidump to disk when it kernel panics and report it to MS later when it's back online.
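
For the curious, the pending dumps are easy to spot on disk. A minimal sketch, assuming the default Windows dump locations (both are configurable under Startup and Recovery):

    import os

    MINIDUMP_DIR = r"C:\Windows\Minidump"   # small per-crash dumps
    FULL_DUMP = r"C:\Windows\MEMORY.DMP"    # full/kernel memory dump

    if os.path.isdir(MINIDUMP_DIR):
        dumps = sorted(os.listdir(MINIDUMP_DIR))
        print("%d minidump(s) on disk:" % len(dumps))
        for name in dumps:
            print("  " + name)
    else:
        print("no minidump directory found")
    print("full dump present:", os.path.isfile(FULL_DUMP))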


Well, there goes that theory :-)


What, because they're more prone to catastrophic, accidental physical failure, being less sedentary devices?

I don't think that has anything to do with the results in this paper.


Less likely to be online at the time?


Laptops get much better integration testing, cleaner power (fewer cheap crappy power supplies), and fewer user-installed parts. I'm not surprised.

I would agree with you that desktops built by a major OEM should exhibit less of this behavior.


I suspect laptops also don't get as many hardware upgrades. While many people probably never open their desktop enclosure, some do, and I suspect a far smaller percentage open up their laptops.

(Edited to remove mention of overclocking, after I noticed that the researchers left overclocked systems out of the comparison).

Another possibility is that hardware is designed to handle some level of outlying abuse (maybe it's engineered to handle 95% of the things people do to it), and laptops' outliers are more severe, relative to their average level of abuse, than desktops' are.


Haven't read the linked study yet, but huge red-flag for the extremetech summary:

The period tested was CPU-days, and when you change clock-speeds the amount of work done (including e.g. # of memory accesses) changes. If you assume that under-clocked CPUs are on average slower than non-under-clocked CPUs, then this could skew the data.

[edit] Read the paper; the rates are not against CPU-days but against 8 months of calendar time. All is good.

[edit2] Also, TACT in the paper is not defined as what normal people call CPU time, but as wall-clock time during which the CPU is on (i.e. idle counts, but sleep or off does not).
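
In other words, as I read that definition, TACT accumulates something like this (my own sketch, not code from the paper):

    def total_accumulated_cpu_time(power_state_intervals):
        # Sum the wall-clock hours during which the machine is powered on.
        # Idle time counts toward TACT; sleep, hibernate, and off do not.
        return sum(hours for state, hours in power_state_intervals
                   if state in ("active", "idle"))

    # e.g. 6h active + 2h idle + 8h asleep in a day -> 8h of TACT
    print(total_accumulated_cpu_time([("active", 6), ("idle", 2), ("sleep", 8)]))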



I read the original study and came away thinking that it was skewed strongly to errors that an OS could detect. For example, CPU errors were far worse than DRAM errors, and both were much worse than hard drive errors.

Regardless, in my life, "white box" computers have lived twice as long and undergone twice as many upgrades as any of the others. Eventually they got retired when they started crashing. Laptops might get extra ram or drive space but they got retired when they were too slow or too heavy. It doesn't mean the laptops were more reliable. It's like comparing a truck that drives 350,000 miles and a car that covers 120,000 miles. I'd say the truck was more reliable since it fulfilled its job three times as long.

I also kept old "white box" machines running much longer because the old 3-4GHz Pentium 4s were fast compared to the Core 2 Duo laptops that replaced them. Those machines kept running from the 500MB-of-RAM days to the 4GB-RAM/SSD days.


I find the note on laptop reliability vs desktop reliability rather surprising...

I've always assumed that laptop components run in a hotter environment - because cooling is a tough problem in a small device. Apparently that doesn't affect reliability at all?

It would of course also have been interesting to know how particular brands/models compared in terms of reliability. I guess there are legal problems with releasing that kind of info?


> I guess there are legal problems with releasing that kind of info?

Perhaps legality was a concern, but more important is the quality of the information. If I buy a Dell Model X and upgrade the components, is it still a Dell Model X?

Tracking the individual configurations of each manufacturer's models and identifying upgrades would be a tall order if not bordering on impossible.


They say there's no way to tell whether environmental factors contributed to a crash, so there's no data on whether heat damages the components.


What's really interesting to me is their brand-name vs. white box comparison:

DRAM one-bit flip: Brand: 1 in 2700, White box: 1 in 950
Disk subsystem: Brand: 1 in 180, White box: 1 in 180

Assuming that brand-name desktops have brand-name memory and disks it means:

1. The brand of your non-SSD desktop-level HDD absolutely doesn't matter.
2. Brand-name memory is roughly 3 times less likely to fail.
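
A quick back-of-the-envelope check of those ratios (my arithmetic, not the paper's):

    # Quoted failure odds from the brand vs. white box comparison above.
    dram = {"brand": 1 / 2700, "white_box": 1 / 950}
    disk = {"brand": 1 / 180,  "white_box": 1 / 180}

    print("DRAM: white box is %.1fx more likely to see a one-bit flip"
          % (dram["white_box"] / dram["brand"]))   # ~2.8x
    print("Disk: ratio is %.1fx, i.e. no difference at all"
          % (disk["white_box"] / disk["brand"]))   # 1.0x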


How many hardware crashes were not recorded? If the CPU, memory or hard disk fails, the OS can't always record the event, can it? Is the data backing this paper statistically valid? We only see the subset of failures where the failed system remained partially alive afterward.


At first I thought- wow. I really shouldn't overclock.

But then I realized the findings were biased. The major players (HP, Dell, Acer, Asus, Lenovo, etc.) obviously have more lobbying power. Two things stood out specifically:

1. The mention of OEM vs. white box. That serves only to reduce sales at Newegg, Tiger/CompUSA, the myriad of mom-and-pop stores, etc. No good comes of that. If you really were trying to do good, you'd name the most reliable and least reliable component types/manufacturers, although I know they would never do that, as it would piss off everyone but the guys who came in #1.

2. There should be enough data on SSDs now to see some trends in reliability, even if the numbers are small and the variability is much higher. I assume this was left out because the major players want to switch to SSDs, which provide a much snappier experience, even though the context of the study was reliability and there are a few scary things about SSDs in that regard: when they die, they die hard, right away, with little warning, and their lifespans are shorter. The big players hope this issue will resolve itself over time, and in the meantime they'll cater to the upscale market that buys computers more often and might not notice the reliability as much. Despite this, I still love SSDs and will continue to buy them, but it should be reported.


I'm not sure how much this tells us. I overclocked my system to the point where it crashed, then dialed it back from there to the point where it was stable while running IntelCpuBurn for ten minutes. I'd be surprised to find any overclocker who hasn't crashed his computer at least once during the overclocking process.
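
The loop most of us follow looks roughly like this; run_stress_test here is a hypothetical stand-in for Prime95/whatever burn-in tool you prefer, not a real API:

    def run_stress_test(freq_mhz, minutes):
        # Stand-in for your burn-in tool of choice.
        # Returns True if the machine survived the run at freq_mhz.
        raise NotImplementedError

    def find_stable_overclock(stock_mhz, step_mhz=100, test_minutes=10):
        # Push the clock up until a test fails, then back off one step.
        # Every failed iteration is a crash that lands in statistics like these.
        freq = stock_mhz
        while run_stress_test(freq + step_mhz, test_minutes):
            freq += step_mhz
        return freq  # highest frequency that passed a (short) stress test

The catch is that "survived a ten-minute stress test" is a much weaker claim than "stable over eight months of real workloads", which is what the paper actually measures.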

What myth is supposedly being shattered here, by the way?


> What myth is supposedly being shattered here, by the way?

Besides the obvious statement that ExtremeTech is a god-awful blog that loves to write incorrect statements and editorialize, I would say there are three main "myths" being "shattered" just from reading Microsoft's paper:

1. Myth: Desktops are more reliable.

2. Myth: Custom systems are more reliable than OEM systems

3. Myth: A stable overclock is stable.

On 1, the article speaks for itself: laptops clearly show fewer failures in their data. Same with 2. On 3, you have to read between the lines, where they state, "Even absent overclocking, faster CPUs become faulty more rapidly than slower CPUs." With that in mind, it's easy to see that even once you find a stable overclock speed, your hardware will crash more often, without even counting the crashes incurred while finding that stable speed and without necessarily inferring that overclocking itself causes hardware issues. However, their conclusions state that "even small degrees of overclocking significantly degrade machine reliability, and small degrees of underclocking improve reliability over running at rated speed."

Microsoft is just putting this data out there in an academic paper; they make no recommendations or assumptions. They certainly don't claim that enthusiast myths are being shattered. That analysis is all on ExtremeTech's side.



I find it interesting that the site hangs waiting for StumbleUpon. Here's to putting a reference to someone else's JavaScript in the middle of the page!


Not very surprising that laptops are more reliable than desktops. Desktops use more power, run at higher frequencies, tend to generate more heat than laptops, and are not built for durability. The point of a desktop is that you can replace the parts easily when one fails or when you need an upgrade.


Apple propaganda around the corner, again! The media should at least try a little harder to avoid the Apple-coined sense of "PC" (the meaning used in that context), or choose a more specific name: Windows machines.


I'm curious, does anyone know if this study counted all the computers that didn't have any of these failures at all during the time period? Or is it just those that reported a failure?


> In addition to dictating our choice of failure types, our use of crash logs has two additional consequences. First, we observe only failures that cause system crashes, and not a broader class of failures that includes application crashes or non-crash errors. Second, we cannot observe failures due to permanent faults, since a machine with a permanent, crash-inducing fault would be unable to recover to produce, record, and transmit a crash log.

They only include systems that have had a hard crash or a stop error.


"White box" failures may be explained by bad technique during installation e.g. not wearing a grounding strap may damage the equipment leading to early failure.


Of course, the conclusions are only as good as the data.

Take the issue of over-clocking as an example. Do they have data on cooling? Component quality? Component selection? Component handling during the build of the OC system?

An over-clocked system with inadequate cooling is definitely more likely to fail. We have several over-clocked systems for FEA/CFD that have been rock-solid since they were built (about three years ago). They generally run nearly 18 to 20 hours per day for weeks and weeks when we have such projects on the table.

All of the machines were built by us (I guess they call this "white box" now?). Every single one of them was built in a static-controlled environment. Every single one of them underwent full-load testing when built, before being put into service. In most cases this led to identifying memory that failed prematurely. In other cases we rejected motherboards and CPUs.

After a successful two week burn-in period the machines were officially deemed qualified for service. Oh, yes, all of them had fluid-based cooling systems installed and oversized external radiators. I forget what we aimed for in terms of CPU/Memory temperature, but it was definitely nice and cool compared to a normal heatsink setup.

Component selection, handling, build quality and burn-in testing are of paramount importance when trying to push the limits. My guess is that most hobbyists don't do any of this and simply go for the shiny new object on the shelf and expect it to work. That being the case, failure rates are sure to suffer.

The same applies to non-OC self-built systems. I don't think I have ever bought a factory-built (Dell, HP, Compaq, etc.) system, save laptops. I can't remember having any hardware failures in, say, the last twenty years, save maybe one hard drive that refused to spin after a few months of service (backups are golden!). Again, component selection, handling, testing, etc. are of paramount importance when building your own system.

The other part of this study that I think needs more data is the "quality", for lack of a better word, of the user. How many of these users are Mom, Dad, Grandma, Grandpa and Uncle Fester? How many of them are hackers and computer enthusiasts? I, for one, have never enabled the "share crash data" functionality on any of our machines in, again, about twenty years. That's data from a group of users who are smarter than the average bear, and MS simply does not have it.

The data from factory-built systems is probably far more reliable because, well, they are a relatively known quantity. There are still issues to consider, such as the operating environment (air-conditioning, temperature, humidity), air filter condition (full of dust or cleaned every few months), user cabinet incursions (did they install more memory and zap the MB with static?) and more.

I did not read the entire report. I don't know if any of these points were covered or not.

The conclusions might provide a good approximation of what the average user might experience. Without more data I would be careful about placing excessive weight on these findings.



