Why things fail

nnq · on Oct 21, 2012

This just reminded me again how different software is from any physical-world artifact... It's mind-boggling how for lots of software the cost of failure can be reduced to almost ZERO by having the faulty component restart/reload/reboot or just pushing an automatic upgrade to millions of customers that didn't even know they had a huge security hole...

And the ugly part is realizing how this "almost zero" cost of failure makes people behave:

1. we release software with security hole upon security hole to customers (think Oracle, MS, Apple... and probably everybody else too...), knowing the fix will only be one upgrade away anyway, or

2. we just code ugly hacks that "just restart the goddam service" when it fails to respond or fails a sanity check, but leave the app otherwise running and just dump everything to a log that nobody will ever care about until performance really starts to degrade, or even then

3. just throw more hardware at the problem and keep restarting stuff that leaks memory or starts to fail some empirical sanity checks

...on the other hand I wish the plane I flew in could just restart/reload/reboot after a failure before it smashed into the ground :) ...so maybe reducing the costs of failure and caring less about it is the way to go.

-- EDIT: realized that the plane/software analogy leaks badly: if the plane I was in were the software, then I would be the data that get corrupted when it fails :|
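The "just restart the service" pattern from point 2 can be sketched roughly as below. This is only an illustration: `FlakyService`, its `healthy_requests` threshold, and `run_with_restarts` are hypothetical names invented for the example, and the "sanity check" is a stand-in for whatever empirical check a real deployment would use.

```python
import logging
from typing import Callable

logger = logging.getLogger("watchdog")


class FlakyService:
    """Hypothetical component that degrades after a few requests (e.g. a leak)."""

    def __init__(self, healthy_requests: int = 3) -> None:
        self.healthy_requests = healthy_requests
        self.served = 0

    def handle(self) -> None:
        # Serving a request moves the service one step closer to failing.
        self.served += 1

    def is_healthy(self) -> bool:
        # Empirical sanity check: unhealthy once too many requests were served.
        return self.served < self.healthy_requests


def run_with_restarts(make_service: Callable[[], FlakyService], requests: int) -> int:
    """Serve `requests` requests, replacing the service whenever the check fails.

    Returns the number of restarts. The root cause is never investigated;
    the failure is only written to a log.
    """
    service = make_service()
    restarts = 0
    for _ in range(requests):
        if not service.is_healthy():
            logger.warning("sanity check failed after %d requests; restarting",
                           service.served)
            service = make_service()
            restarts += 1
        service.handle()
    return restarts
```

Each restart costs almost nothing here, which is exactly why nothing in the loop creates pressure to fix the underlying leak.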

mrich · on Oct 21, 2012

In comparison to relatively simple mechanical devices it is also mindboggling in how many more ways software can fail.

- failures from the underlying hardware
- out of memory may occur in many places
- handle each corner case that may be reachable correctly
- concurrency
- malicious user input
- running the program on future hardware

nnq · on Oct 21, 2012

I don't think "physical" devices have fewer points of failure than software... it's just that our intuition is used to lumping things together (like zillions of types of cracks are all "just cracks" unless you're a specialist in the field...), maybe we have less well-formed intuitions about software failure.

...but do you really think an analog device doesn't have MANY MORE ways in which it can fail than a piece of software? ...I mean, just think that software has a "limited" number of states it can be in ...whereas for a "lump of atoms" you need an overly simplified model just to start counting some sort of states for the smallest components

andrewcooke · on Oct 20, 2012

great article; thanks for posting.

vextec's site is here - http://www.vextec.com/ - and the patents for what i guess are the core technology are here - http://www.vextec.com/our-technology/patents

from reading the first article it seems that they extend finite element analysis down to the level of crystals (they say "microstructure based") (which is what you'd guess from the wired article). there's more detail related to particular models (crack nucleation and growth) that i don't completely get - i suppose those are different approximations used in the models (and where the hard science is). and then they seem to do monte-carlo on top of that.
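to make the "crack growth model plus monte-carlo" idea concrete, here is a toy sketch. it is NOT vextec's method: it assumes the textbook Paris law for crack growth, a log-normally distributed initial flaw size as a crude stand-in for microstructure variation, and made-up material constants (`C`, `m`, `Y`, the stress range and the flaw sizes are all illustrative assumptions).

```python
import math
import random


def cycles_to_failure(a0: float, a_crit: float, stress_range: float,
                      C: float = 1e-11, m: float = 3.0, Y: float = 1.0) -> float:
    """Cycles for a crack to grow from a0 to a_crit under Paris' law.

    da/dN = C * dK**m, with dK = Y * stress_range * sqrt(pi * a).
    Assumed units: metres and MPa. C, m and Y are illustrative, not measured.
    """
    if m == 2.0:
        raise ValueError("closed form below is only valid for m != 2")
    if a0 >= a_crit:
        return 0.0  # the flaw is already at the critical size
    k = C * (Y * stress_range * math.sqrt(math.pi)) ** m
    p = 1.0 - m / 2.0
    # Integrate dN = da / (k * a**(m/2)) from a0 to a_crit.
    return (a_crit ** p - a0 ** p) / (k * p)


def simulate_lives(n: int, seed: int = 0, median_flaw: float = 5e-5,
                   sigma: float = 0.5, a_crit: float = 5e-3,
                   stress_range: float = 200.0) -> list[float]:
    """Monte Carlo: draw n random initial flaw sizes, return sorted lives."""
    rng = random.Random(seed)
    lives = []
    for _ in range(n):
        a0 = rng.lognormvariate(math.log(median_flaw), sigma)
        lives.append(cycles_to_failure(a0, a_crit, stress_range))
    return sorted(lives)
```

calling `simulate_lives(1000)` gives a spread of lifetimes rather than a single number, and that spread is the kind of output a probabilistic fatigue model is after.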

lylemckeany · on Oct 21, 2012

As I read this, the conspiracy theorist in me keeps thinking that Ford built Building 4 to optimize failure rates so they occur immediately after the warranty is up so they can cash in on service visits afterwards.

nathan_long · on Oct 21, 2012

I don't think this can practically be done. The article says that for a given part, the time to fail will vary widely.

>> The chart below shows the logarithmic failure curve of steel bars placed in a fatigue machine. Most fail after 1 million cycles, but if you were to test only a few bars, those failures might occur after 10 million cycles.

If this is the case, the only way to ensure a cluster of failures at say, 10 years, would be to 1) engineer the part so that most will last much more than 10 years (to guarantee that nearly all will last at least that long), and 2) add a mechanism for breaking it. Like, a random number generator and a tiny saw that cuts the axle if it returns 0.5. That seems both unfeasible (you have to engineer the tiny saw, incorporate it into the cost and weight budgets, ensure that IT won't fail, etc) and easily discoverable. The resulting scandal would destroy the company.
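Here is a rough way to put numbers on step 1. It assumes failure times are log-normally distributed (my reading of the article's "logarithmic failure curve", not something it states), and the spread parameter `log_sigma` is a made-up value. The function names are just for this sketch.

```python
import math
from statistics import NormalDist


def required_median_life(first_failure_years: float, early_fraction: float,
                         log_sigma: float) -> float:
    """Median life needed so that only `early_fraction` of parts fail
    before `first_failure_years`, assuming log-normal failure times."""
    z = NormalDist().inv_cdf(early_fraction)  # negative for fractions below 0.5
    return first_failure_years * math.exp(-z * log_sigma)


def fraction_failed_by(years: float, median_life: float, log_sigma: float) -> float:
    """Share of parts that have failed by `years` under the same assumption."""
    return NormalDist(math.log(median_life), log_sigma).cdf(math.log(years))
```

With `log_sigma = 1.0`, `required_median_life(10, 0.01, 1.0)` comes out at roughly 102 years: to keep 99% of parts alive at the 10-year mark, the typical part has to last about ten times longer. A wide spread makes it impossible to cluster failures just after the warranty without over-engineering the part first.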

Besides all this, a reputation of "barely lasts through the warranty" is bad for business. I (like many people) buy Toyotas because experience shows they last a long time. Ford has impressed me lately, but I have to be convinced they're just as good before I'd buy one. "Lasts 10 years" is not good enough to compete.

You don't have to suspect a conspiracy if you can show that the evil action discussed is against the interests of the supposed perpetrators. This is basically why capitalism works.

shabble · on Oct 21, 2012

There is of course the apocryphal story of Henry Ford and the junkyard analysis[1]:

"Ford sent a team of agents to tour the scrap-yards of America in search of discarded Model T Fords. He told them to find out which components never failed. When they returned they reported failures of just about everything, except the kingpins. They always had years of service left in them when some other part failed irretrievably. His agents wanted to hear how the boss would improve the quality of all those components that failed. Soon afterwards, Henry Ford announced that in future the kingpins on the Model T would be engineered to a lower specification."

[1] http://www.snopes.com/business/genius/fordpart.asp

thaumaturgy · on Oct 21, 2012

Heh. Not to take the fun out of the story (and it might be true!), those kingpins -- along with a lot of other Model T parts -- are still in use and on the road today, almost a hundred years later in some cases.

Good Model T enthusiasts and restorers try to use as many original parts as possible, not just because it adds to the vehicle's antiquity, but also because the hundred-year-old parts actually seem to be more reliable than new ones. For example, during Model T production (I don't remember what year), Ford engineered a rear axle shaft that would twist under extreme stress instead of breaking, helping mud-stuck T's get out with less risk of an expensive repair. (Hill climbs are still a popular hobbyist thing with these cars.)

Typically it's the sheet metal on a T that is in the worst shape -- lots of blister rust, pinholes, dents and warps and so on. Even then, an accomplished welder or backyard blacksmith can have a fender fixed up in about an afternoon and ready for its first finishing step.

Paint, too, was different. Certainly it was a bigger environmental problem, and we've switched paint formulations and techniques for good reason; still, you never see paint coming off of old Ts in flakes and chips. Instead you'll see a spiderweb series of cracks and a sort of a lackluster dull finish that can sometimes be spiffed up with a few hours of buffing and polishing.

Intellectually I get why the automotive industry has moved the way it has -- modern cars are safer, more comfortable, and much more efficient -- but it's also easy to see why old car guys are so annoyed with modern vehicles. I seriously doubt there will be even one slightly "original" Honda on the road in a hundred years.

noonespecial · on Oct 21, 2012

Almost. Building 4 is there so that they can optimize build quality (vs. cost) so that the things they build can last just long enough to make it through their warranty period. It sounds harsh but not many people would pay an extra $1k retail purchase price to have an alternator in their Fiesta that lasts for 300 years.

nathan_long · on Oct 21, 2012

That's not what the article says.

>> On Ford parts, the very first fails aren’t supposed to happen until just after the 10-year mark (with most of them occurring much later).

noonespecial · on Oct 21, 2012

Yes, I thought we were talking conspiracy theories here. At any rate, and purely anecdotally, 10 years? Heh. I've had too many Fords.

forensic · on Oct 21, 2012

All modern corporations do this. It's called planned obsolescence and it is NOT a conspiracy theory. It's a well established standard operating practice for the majority of companies with manufactured, mass marketed goods.

Planned obsolescence occurs when barriers to entry are so high that the remaining players in the market, with a wink and a nod, form a subtle cartel. These cloak and dagger cartels of course raise prices, but they also increase failure rates to extract rent.

This is the dark side of capitalism and what happens when capital concentrates in the hands of a small plutocracy. The health of capitalism is completely dependent on decentralized wealth. As soon as wealth starts to concentrate, oligopolies and cartels form and they put a brake on innovation in favor of rent seeking.

There's a documentary on this called "The Lightbulb Conspiracy" which is worth watching.

mseebach · on Oct 21, 2012

There's a distinction between carefully engineering something to be as cheap as possible while still lasting for the full length of the warranty, and actively adding a feature that makes something that would otherwise last a lot longer break after the end of the warranty.

The former is just good engineering, the latter is bad.

kiba · on Oct 21, 2012

The flip side of reliability is less innovation and less of other qualities such as fuel efficiency and so forth.

So, it's not a dark side, because corporations are economizing in accordance with a variety of factors.

forensic · on Oct 25, 2012

> because corporations are economizing in accordance to a variety of factors.

Except, they're not. Planned obsolescence is not about economizing for other factors, it is about rent seeking. You seem to be dismissing the findings of this research without even understanding it. The entire point is that this research has demonstrated that the planned obsolescence in question has to do with cartel formation and rent seeking rather than striving for a competitive edge in a competitive market. This rent seeking only happens in non-competitive, oligopolistic or monopolistic markets.

Cartel formation and operation is a well understood economic phenomenon. You seem to be doubting it based on some kind of faith in the free market, some kind of faith in the infallibility of markets. Adam Smith would have strong words against this kind of naive market-worship.

jseliger · on Oct 21, 2012

>It's a well established standard operating practice for the majority of companies with manufactured, mass marketed goods.

As Wikipedia and XKCD (http://xkcd.com/285/) say, [[Citation needed]].

A documentary of unknown provenance doesn't cut it: show me a book or website with an actual bibliography.

forensic · on Oct 25, 2012

You realize that XKCD is mocking people who do that, right? It's completely ridiculous to ask for citations in the context of a speech and only an idiot would do that.

In the context of writing, it is certainly appropriate to cite sources. Which is precisely why I cited my source. In turn, the source I cited cites other sources in addition to doing investigative journalism and speaking to credentialed experts, who in their turn cite their sources.

As a writer, it is not my job to prove the veracity of my sources. That is the job of the reader. My job is to cite them; it is the reader's job to determine the "provenance" of my sources if they think provenance is the most important thing here (it's emphatically not).

If you're hung up on provenance, you could have googled and discovered that the provenance of this is Arte France, a French-German TV producer, which in turn is owned by Arte Group, a French-German media corporation. They are relatively small players--Wikipedia reports they have about a 1% audience share.

Having said all that, in any good research the least important question is provenance. The most important question is research methods and findings. If you were to do your due diligence as a reader, and actually check the citation in question which is your duty as the reader, you would find out that their research methods are disgustingly rigorous and beyond any reasonable doubt. Their research is of the extremely boring variety that unearths extremely obvious facts. They also speak to credentialed, published experts to put these facts in context.

In conclusion, your objection is completely without merit, you have failed to engage in good faith, you have failed to do your due diligence that is your duty as a critic. In academia, it falls to the reader to follow up citations. An academic who would criticize the sources of another academic without actually checking those sources would be laughed out of the room. In this way, I too laugh at you.

jacques_chester · on Oct 21, 2012

For those of you who are members of the IEEE, you might be interested in joining the Reliability Society. They publish journals in this area.

They also cover software reliability; particularly security and privacy.

http://rs.ieee.org/