How Complex Systems Fail (1998) (complexsystems.fail)
366 points by mxschumacher on Dec 27, 2020 | 83 comments



For those interested in the topic, especially points 7 and 8 (root cause and hindsight bias), I strongly recommend Dekker's "A Field Guide to 'Human Error'". It digs into those points in a very practical way. We recently had a minor catastrophe at work; it was very useful in preparing me to guide people away from individual blame and toward systemic thinking.


I strongly agree with this recommendation.

Personal Experience: During and after the economic collapse of 2007-2009 I wondered who was at fault; who to blame. I kept waiting for a clear answer to "THE root cause". Since then I've read things like Dekker's work, and come to realize that blame is not a productive way of thinking in complex systems.

A quick example: in many car accidents, you can easily point to the person who caused the accident; for example the person who runs a red light, texts and drives, or drives drunk is easily found at fault. But what about a case where someone 3 or 4 car lengths ahead makes a quick lane change and an accident occurs behind them?


This is why I prefer thinking about root solutions rather than root causes. The answer to the question “how can we make sure something like this cannot happen again?” (for a reasonably wide definition of this). The nice thing is that there are usually many right answers, all of which can be implemented, while when looking for root causes there may not actually be one.


A good example was the AWS S3 outage that occurred when a single engineer mistyped a command[0]. While the outage wouldn't have occurred had an engineer not mistyped a command, that conclusion still would have missed the issue that the system should have some level of resiliency against simple typos - in their case, checking that actions wouldn't take subsystems below their minimum required capacity.

[0] https://aws.amazon.com/message/41926/
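
To make that concrete, here is a minimal sketch (in Python) of the kind of minimum-capacity guard described above. This is not AWS's actual tooling; the subsystem names and minimum values are made up for illustration:

  # Hypothetical minimum capacities per subsystem (illustrative only).
  MIN_REQUIRED = {"index-subsystem": 3, "placement-subsystem": 2}

  def remove_servers(subsystem: str, current: int, to_remove: int) -> int:
      """Return the new capacity, refusing any request that would go below the minimum."""
      minimum = MIN_REQUIRED.get(subsystem, 1)
      remaining = current - to_remove
      if remaining < minimum:
          raise ValueError(
              f"Refusing: removing {to_remove} servers would leave {subsystem} "
              f"at {remaining}, below its minimum of {minimum}."
          )
      return remaining

  # A typo'd count (say, 50 instead of 5) is rejected instead of executed.
  print(remove_servers("index-subsystem", current=20, to_remove=5))   # -> 15
  try:
      remove_servers("index-subsystem", current=20, to_remove=50)
  except ValueError as err:
      print(err)

The point is simply that a fat-fingered number gets rejected by the tool instead of being carried out.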


Systems should still be able to be taken offline, though, even if that means failure.

For example, let’s say you have a service that uses another service that raised its cost from free to $100/hour and you call it 1000 times per hour.

Even though you may not have a fallback, and your service may fail, you need to be able to disable it. In this case, suppose an admin is unavailable; then the only recourse would be to lower the capacity to 0, since you have that control.

That doesn’t negate the benefit of validation, but don’t be too heavy-handed with validation merely as a reaction to a failure, without fully thinking it through.


Ideally a destructive command shouldn't be accidentally triggerable. At the very least it should require some positive confirmation. Alternatively, a series of actions could be required, such as changing the capacity (which, in my opinion, should be the command where the double checks and positive confirmations happen) followed by changing the service's usage.
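
For illustration, a rough sketch of one way to require positive confirmation before a destructive command; the "decommission" action and the target name here are hypothetical:

  def confirm_destructive(action: str, target: str) -> bool:
      """Require the operator to retype the target's name before proceeding."""
      print(f"About to {action} '{target}'. This cannot be undone.")
      typed = input("Type the target name to confirm: ")
      return typed.strip() == target

  def decommission(target: str) -> None:
      if not confirm_destructive("decommission", target):
          print("Aborted: confirmation did not match.")
          return
      print(f"Decommissioning {target}...")  # the real (destructive) work would go here

  decommission("prod-cache-cluster")

An accidental keystroke can't satisfy the retype check, so the destructive path requires a deliberate act.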


The root cause was the deregulation in the 90s and the removal of the Glass-Steagall act.

Complex systems are also simple systems when viewed as a black box from the outside. A lion always eats a gazelle given a chance, and a bank always explodes if not regulated to within an inch of its life.

That people like to pretend internal complexity matches external complexity is a very odd mental quirk. It is false in both directions of implication. Conway's game of life is as simple a game as you can get yet it has the most complex behavior possible.


Thank you for making this point. The idea that it was an unpredictable failure mode of a complex system, rather than predictable exploitation of a system where safeguards against exploitation were deliberately removed as "inefficient," is exactly what will ensure a repeat in the future.


> But what about a case where someone 3 or 4 car lengths ahead makes a quick lane change and an accident occurs behind them?

You'd have to be going extremely slowly for 3 or 4 car lengths to be a safe following distance. On a typical 60-70mph freeway you should have a gap of at least 15-20 car lengths, and then accidents like that will happen only when other factors are at play (and if those factors are predictable like water on the road then your distance/speed should be adjusted accordingly).

While I think that example was bad, your point about there existing accidents without single points of blame is still valid.


I meant to say 3 or 4 cars ahead. Thank you for taking the point.


That makes sense. To be fair I shouldn't have been so hasty anyway; with your example of a lane change, even if you can control your following distance it's hard to also keep people from other lanes cutting in too closely.


> A quick example: in many car accidents, you can easily point to the person who caused the accident

This is a bad example: traffic is a complex system with a century of ruleset evolution specifically intended to isolate personal responsibility and provide a simple interface for the users that, when correctly used, guarantees a collision-free ride for all participants.

The systemic failures of traffic are more related to the fallible nature of its actors. The safety guarantees work only when humans demonstrate almost super-human regard for the safety of others and are never inattentive, tired, in a hurry, or influenced by substances or medical conditions, etc.

We try to align personal incentives to systemic goals with hefty punishments, but there is a diminishing return on that; at some point you have to consider humans unreliable and design your system to be fault-tolerant. Indeed, most modern traffic systems are doing this today with things like impact-absorbing railings, speed bumps, wide shoulders and curves, etc.


It's a perfectly reasonable example. Most complex systems share those properties. It makes glaringly obvious that isolating blame/personal responsibility has limited effectiveness in preventing accidents. While for financial reasons we need a blame-assignment system that's clear to reason about, for other systems where that need isn't so great, more effort ends up getting spent on figuring out how to prevent outages than on assigning blame for them.


"guarantees a collision free ride for all participants"?

The only way to win is not to play.

I was stationary in a traffic jam when I was rear-ended by the car in back of me. Fortunately, I was not hurt at all.

How could I have avoided this? (see above)?


> I was stationary in a traffic jam when I was rear-ended by the car in back of me. Fortunately, I was not hurt at all.

I believe yholio's statement about "when correctly used" was meant to be "when correctly used by all participants"; i.e., no single participant can guarantee their own, or anyone else's, safety, no matter how careful they are.

On the other hand, the guarantee is, as yholio notes, of an almost entirely theoretical nature even given universal cooperation:

> The safety guarantees work only when humans demonstrate almost super-human regard to the safety of others, are never inattentive, tired, in a hurry or influenced by substances or medical conditions etc.


I think you missed the "when correctly used" part. Clearly, the person who hit you was not correctly using the system which places the whole responsibility of this situation onto them. When obeying all rules of the road, situations where rear-ending someone is probable should not arise.

The systemic failure here is expecting people not to phase out and pay less attention to the road when driving for hours at high speed on monotonous highways.


Well, traffic doesn’t exist in a vacuum. One could engineer a system where this is less likely to happen, by simply engineering a system where people are less likely to drive. Many European cities are systematically reducing incentives to drive at all, by improving alternative means like public transport and biking, removing parking, narrowing or eliminating car lanes, and cutting off through streets for cars.


I wrote another comment, before reading the article that says it better than me:

Catastrophe requires multiple failures – single point failures are not enough.


> During and after the economic collapse of 2007-2009 I wondered who was at fault; who to blame. I kept waiting for a clear answer to "THE root cause".

Well, there can be a difference between "who is at fault" and "THE root cause". Quite a large difference, potentially.

In the case of '07-'09, maybe nobody was at fault (seems plausible) but there was a very neat root cause. The government handed out a lot of money to people who shouldn't have gotten it, the people responsible were largely protected from bankruptcy, and the system was left to re-form largely as it was. The people who took excessive risk earned an excessive reward - they should have all gone bankrupt. The financial system should actually have changed, and people who made productive investments and didn't take risk should have become ascendant. Instead we have the same old crowd playing the same old game.

Disabling the major feedback mechanism of capitalism is about as root-causal as can be gotten. Nobody in particular chose to disable it though, it was a consensus decision among the powerful.


Yes you are correct, root cause != person to blame. But....

That is such an oversimplification and ignores a lot, if not most, of the issues in the GFC.

* Regulatory capture

* Illegal behaviour by trusted actors (chiefly banks)

* Corrupt judges

* Carelessness by trusted agencies e.g., credit rating agencies

It's a long, long list of failures in the international financial system in general and American civil society in particular.


Any issue can be made impossible to solve if enough details are highlighted. Deciding what to eat for breakfast is an impossible collection of nutrient requirements, bodily requirements, long and short term tradeoffs, economic problems and social expectations. Yet somehow we mostly manage.

"Oh but it's complicated" is a completely standard line of misdirection that comes up very regularly when people are making repeated bad decisions. Even small children sometimes try it. It is very, very rarely true and particularly in totally synthetic systems like the monetary one. There is always a point of greatest leverage that could be changed and it makes sense to call that the root cause and try changing it.

Now I'm totally open to the idea that I don't know what the point of greatest leverage is in the GFC. I haven't read the regulations and I wasn't in the room when the money was being handed out. But there were billions to trillions of dollars in fake wealth that turned out never to have existed. The fact that the banking industry skated through with the same people largely in charge suggests strongly that no serious attempt was made to figure out who exactly was screwing up.


"But there were billions to trillions of dollars in fake wealth that turned out never to have existed" That is not true. They were financial assets which only have the value that people put into them. They are underpinned by confidence. Confidence goes away, it takes value with it.

"no serious attempt was made to figure out who exactly was screwing up" Exactly. Some very powerful people got very rich and it was in now bodies incentives to find anybody accountable.

These are interesting and may help you understand: https://en.wikipedia.org/wiki/Minsky_moment

https://www.sonyclassics.com/insidejob/


What would you have done differently from what the policymakers have done? And how would you account for potentially undesirable second-order effects? (Which were quite massive for most of the obvious alternatives.)


> During and after the economic collapse of 2007-2009 I wondered who was at fault; who to blame

Sometimes it is possible to at least narrow things down to an underlying inherent instability. In the case of your example, a huge underlying cause is an economic system based on debt (backed by interest and usurious transactions). It's for a reason that usury/interest is banned in Islam, Christianity, and Judaism for example. It's a parasitic practice that makes the economy fundamentally unstable. This includes dangerous practices such as selling debt for debt (again part of the same crisis), and things like stock shorting (which, interestingly enough was also banned during the crisis, at least for some critical company stocks).


There is a relatively simple root cause of the financial crisis; it's called moral hazard:

https://en.m.wikipedia.org/wiki/Greenspan_put

Now, how exactly to solve the problem is a complex question, so I suppose in that respect it's hard to think productively about it.


Afaik, there were actual massive frauds going on.


Massive frauds are predictable when it becomes profitable to participate. For example, right now in the UK, the government is guaranteeing certain loans through a Covid relief program. Predictably, banks are letting a high number of fraudulent loan applications through. The banks are participating in the fraud as victims, willing victims who expect to make a profit from it. It seems like the government is going to raise its eyebrows and "tut-tut" a little bit but basically let them get away with it.

https://www.ft.com/content/bbe858d9-8678-4d1a-84c1-a62ac426a...


Like Madoff? The subprime lending/NINJA loans? In my opinion both of those things would have been harder if everyone was doing more due diligence. Moral hazard took away some of the incentive to do that.


"everyone was doing due diligence" is not really realistic expectation. It amounts to blaming everyone for small bad thing while excusing massive big bad thing going on.


I found a decent analogy for what I'm talking about:

https://www.history.com/news/beanie-babies-value-criminal-ac...

With a bubble in Beanie Babies there was accompanying fraud and other criminal activity, with most consumers not wittingly participating, but they were affected by it.

So with the asset bubble of the 2000's, this kind of behavior affected everyone, as everyone needs a house.

In another thread I have said I don't blame the masses who may have made bad decisions in hindsight, as they did not have access to all the relevant information. So I do blame the fraudsters, but in the end I really only blame the Fed for keeping interest rates too low, and of course that can circle back around to society as well if you like.

https://youtu.be/d0nERTFo-Sk


Right: even if all the wrongdoing had happened just as it did, if the rest of the economic and financial system had been sound, we’d have been OK.


Root cause analysis, when run properly, doesn't come out with only one culprit; the process can identify many potential sources of weakness in technical systems, human processes, and documentation, at least one of which (typically more) should immediately be improved.



How does that square with the need to improve systems so the same problems don't happen again? I get not wanting to put blame on a particular person, group, or cause when it's multi-factorial, but how can you improve if you don't figure out why the failure occurred?


I’m not the OP, but when I think of “systemic thinking” I think the focus is more on looking at all of the factors involved as part of a holistic model rather than placing blame at the feet of a particular individual or process. You can still identify causes and try to remediate them, but most of the time the remediation shouldn’t be something like “let’s fire Bob for making a mistake/error”, but rather, “Let’s look at all of the events that led up to Bob making that mistake and figure out how we can help him avoid it in the future through a system, process, or people change, or a combination thereof”.

That being said, if someone is negligent and consistently does negligent things they should probably be put into a position where their negligence won’t cause catastrophic system failures or loss of life. Sometimes that does mean firing someone.


A big part is acknowledging that the actions that human operators take are largely a result of the environment in which they operate. Typically there are many issues with that environment that can be improved, and all human operators will benefit.

To give you a more concrete example, it moves the analysis away from "Bob deleted the production database" into a more productive space of "we really shouldn't have a process that relies on any human logging into the production database and running SQL queries by hand, that's prone to human mistake".


That's one of the central questions of the book. But my take is that there are a bunch of ways to answer a "why" question, some more useful than others.

One very common mode is to take a complex causal web, trace until you find a person of low status, and then yell at and/or punish said scapegoat. That desire to blame is a very human approach, but it a) isn't very effective in preventing the next problem, and b) prevents real systemic understanding by providing a false feeling of resolution.

So if we really want to figure out why the failure occurred and reduce the odds of it happening again, we need to give up blame and look at how systems create incentives and behaviors in the people caught up in them. Only if everybody feels a sense of personal safety do we have much chance of getting at what happened and discussing it calmly enough that we can come to real understandings and real solutions.


Thanks for the clarification. This sounds like an interesting book.

The phrasing used on the web site is "Post-accident attribution to a ‘root cause’ is fundamentally wrong." At first glance, it sounds like the author means there is no cause that can be found so you shouldn't try to determine the cause. First, they clarify by saying there are many causes, not just one. However, this phrasing made me scratch my head:

> The evaluations based on such reasoning as ‘root cause’ do not reflect a technical understanding of the nature of failure but rather the social, cultural need to blame specific, localized forces or events for outcomes.

I don't know what other organizations are like, but where I work, when we do a "root cause analysis," we aren't literally looking for a single cause, despite the name. The "root cause" is almost always that pieces a, b, and c came together in an unexpected way. I can definitely think of places where I worked where they were mostly out to place blame, though, and I guess that's what they were trying to caution against.


I think blame is one way it can go bad, but not the only one. The whole framing of a "root cause" is dangerous, in that it encourages people to look for exactly one thing, and then not look beyond it when they find it. It sounds like your organization does decently in that regard, but they're doing it in spite of the "root cause" frame.


There's a definite difference between blame and cause, and they don't conflict. Blame is for individuals, cause is for systems. While you do need to hold individuals accountable, most of the time you should focus on fixing the system, which is a much more durable fix.


Part of the philosophy is to change the mindset from a “person” perspective to a “process” perspective. I.e., what gaps in the process led to the mishap, not what person caused the mishap.

Organizations that are people dependent rather than process dependent tend to have higher risks of failures.


This book taught me about Rasmussen's model of safety. It's a good book.


> Safety is a characteristic of systems and not of their components

> Safety is an emergent property of systems; it does not reside in a person, device or department of an organization or system. Safety cannot be purchased or manufactured; it is not a feature that is separate from the other components of the system. This means that safety cannot be manipulated like a feedstock or raw material. The state of safety in any system is always dynamic; continuous systemic change insures that hazard and its management are constantly changing.

This fits perfectly as a description of the security of complex IT systems; it's a nice way to explain why the IT security marketplace is a wild, wild west.



This reads like a crisp summary of a subset of John Gall's "Systemantics": https://en.wikipedia.org/wiki/Systemantics


The accompanying talk by Richard I Cook at the O'Reilly Velocity conference is here:

https://www.youtube.com/watch?v=2S0k12uZR14

When I worked in operations, I would watch this talk every couple of months to remind myself of the principles. The principles are summarized in the list, but the talk adds a bit more context. I find that this context matters for seeing how the principles might be applied.


I see #13 (human expertise) crop up the most often in the complex systems at my work. I think the two main reasons replacing experts is so hard are (1) it's inherently difficult to understand someone else's work that you haven't built yourself, and (2) it's nearly impossible to find someone whose enthusiasm for taking on a system and giving it the TLC it needs to keep running smoothly matches the passion the original creator poured into it. I find this most acutely whenever the original expert refers to it as "their baby".


For a good analysis of risks and failures associated with complex systems, I recommend "Normal Accidents: Living with High-Risk Technologies" by Charles Perrow (1999).


I won't make a recommendation either way on this book, but I will say that it came off as much more of an opinionated screed than I was expecting; don't go into it expecting an even keel.


Agree with Pt 11.

  “After an accident, practitioner actions may be regarded as ‘errors’ or ‘violations’ but these evaluations are heavily biased by hindsight and ignore the other driving forces, especially production pressure.”
Management pressure to produce reminds me of the phrase “We don’t know where we are going but we want to get there as soon as possible.” Garbage in garbage out. And no. Endless meetings are not a good substitute for anything.


Management pressure to produce is closely coupled with Pt. 10, "practitioner actions are gambles", and these two things lead to less efficiency.

I see the function of management at my workplace as getting practitioners to take bigger gambles to produce more favorable outcomes. This is convenient for management because they can capture the upside of a gamble and focus blame on an individual on the downside of a gamble.

One of the practitioner reactions I've seen to this in government contracting is to become hyper specialized in one specific job function. Don't be a jack of all trades, don't wear multiple hats. Don't spring up to help someone. This leads to two outcomes.

1. When a practitioner is confronted with a novel problem, they will say that's not their job and that persons X, Y, or Z can help. Person X directs you to person A, persons Y and Z direct you to B, person A also directs you to person B and person B directs you back to person X. You ultimately find that no one is responsible for anything.

2. When an operation does need to be performed, it takes tiny contributions from many people. This leads to long chains of activities that take months to do simple operations, such as purchasing items because there is inevitably someone who is out sick or on vacation for two weeks.

This in turn leads to more hiring, and then management hits back with initiatives to streamline operations. It is a fascinating cycle.


"Complex systems run in degraded mode."

This is the point that struck me the most. "Good enough" is good enough when there are so many other more urgent needs to address and you've been able to keep a system functioning despite extensive deferred maintenance. It is difficult to get resources allocated to a system that's still working (because it's robust) when there are less robust (and probably less critical) systems that won't work without more resources being thrown at them.


>>[The] defenses include obvious technical components (e.g. backup systems, ‘safety’ features of equipment) and human components (e.g. training, knowledge) but also a variety of organizational, institutional, and regulatory defenses (e.g. policies and procedures, certification, work rules, team training).

This omits "design" for defenses against problems.

Example: the chemical industries in many countries in the 1960s had horrendous accident records: many employees were dying on the job. (For many reasons) the owners re-engineered their plants to substantially reduce overall accident rates. "Days since a lost-time accident" became a key performance indicator.

A key engineering process was introduced: HAZOP. The chemical flows were evaluated under all conditions: full-on, full-stop, and any other situations contrary to the design. Hazards from equipment failures or operational mistakes are thus identified and the design is adjusted to mitigate them. This was s.o.p. in the 1980s. See Wikipedia for an intro.

Similar approaches could help IT and other systems.


I think we should build complexity using a sort of adversarial system, where things compete like in nature.

The Great Lakes are a complex ecosystem. I remember when zebra mussels first appeared in the Great Lakes. Dozens of articles appeared about how this was an epic natural disaster that would kill the Great Lakes ecosystem, because zebra mussels would eat all the food. Now, decades later, it turns out that zebra mussels are great at filter feeding and have thrived. But this also cleared up the water greatly, to the point that the lakes can look almost tropical, ocean-clear, on sunny summer days.

Which is probably great for all kinds of plants that now thrive on the extra sunshine reaching deeper down. It seems like the system just rearranged itself, and new opportunities appeared for some other plants to thrive.


In nature, everything is fine, including nothing at all, because nature doesn't care either way - it just is. If Zebra mussels turned the Great Lakes sterile, it would be fine by nature.

When we care about a system, we care about specifics - we want it to look a certain way, do some thing, or provide us with some things. E.g. we want clean-looking, odorless lakes safe for recreation and full of fish to eat. That's a specific outcome that may or may not be what natural selection will provide; to ensure we get what we want, we need to supply a strong selection pressure of our design.

Adversarial processes are effective given sufficient time, but they are about the least efficient way to build a system, short of just waiting for it to appear at random. Humanity's dominance over other life on Earth is in a big way based on the ability to predict and invent things "up front", without any competitive selection. I feel we should refine that skill and find more opportunities to use it, because it's much less wasteful.


We implemented a microservices architecture that was promised to be, and looked, so simple on paper, but it became exponentially more complex as each domain was implemented. So we went back to a simpler monolithic service, and it was much simpler to operate and manage.


I would modify the section on root causes. It's not a fundamentally bad idea to look for root causes (plural). It's essential to not stop at immediate causes. More or less, you start at immediate causes and then recursively go deeper until you get to things that cannot be changed and/or are not within your control.

One real danger is believing in a root cause or the root cause (singular) as if there is only one. The desire to only have to change (or understand) one thing is dangerous because it leads you to stop investigating/analyzing as soon as you have found something to blame it on.

As you go deeper through the chain(s) of causes, you also should not focus exclusively on the last (root) because that means you aren't thinking about defense in depth.

TLDR, follow the chain of causes all the way to its ends, realize it may have multiple ends, and don't focus only on the surface stuff or only on the root stuff.


You are making the assumption that causality is useful and exists in these complex systems. Dr Cook bases his treatise on what we know from the research on them, which is that these causal links are far weaker than you think.

This means that searching along a causal tree is indeed fundamentally detrimental.


I'll give an example to illustrate.

Situation: my software service stops working because it ran out of disk space.

Insufficient focus on root cause: I add more disk, do nothing else, and assume it's solved.

Excessive focus on root cause only: I investigate, find there's a leak where old data never gets deleted, call it the root cause, fix the leak, and do nothing else.

Balanced approach: I fix the leak AND stop running at such tight margins (allocate breathing room) AND add automated monitoring for low disk space.

The lesson to look for the root cause is valuable. Without it, disk space fills up rapidly and my other measures to insulate against that can only go so far. But if it's all I do, that's also bad.
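
As a sketch of that last piece, the automated monitoring, assuming an illustrative 80% threshold and a print statement standing in for a real paging/alerting hook:

  import shutil

  ALERT_THRESHOLD = 0.80  # alert well before the disk is actually full

  def check_disk(path: str = "/") -> None:
      usage = shutil.disk_usage(path)
      used_fraction = usage.used / usage.total
      if used_fraction >= ALERT_THRESHOLD:
          # In practice this would page someone or feed a monitoring system.
          print(f"ALERT: {path} is {used_fraction:.0%} full "
                f"({usage.free // 2**30} GiB free).")
      else:
          print(f"OK: {path} is {used_fraction:.0%} full.")

  check_disk("/")

Run periodically, this is one of the "defense in depth" layers alongside fixing the leak and allocating headroom.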


Past discussion: https://news.ycombinator.com/item?id=8282923

This is a great paper, I bookmarked this webpage.


Is this the primary source of this https://fs.blog/2014/04/how-complex-systems-fail/ ?


The Morning Paper [0] has a nice discussion and quotes the original paper from 2000. The same link is at the very bottom of the post, behind "original". However, it seems that MIT rearranged something and the link has gone stale. Maybe someone at MIT could make it appear under the original URL [1] again?

[0] https://blog.acolyer.org/2016/02/10/how-complex-systems-fail... [1] http://web.mit.edu/2.75/resources/random/How%20Complex%20Sys...


As this is a copy of the paper, probably.

I would point out that Dr Cook would probably disagree greatly with the whole idea of AntiFragile, or at least he has the few times I have been able to talk with him. This link you offer is probably not the best context in which to comment on this paper.


For those looking for ways to talk to non-engineers about this topic (without them rolling their eyes and telling you to stop making excuses), I highly recommend the book Leadership is Language.


> The evaluations based on such reasoning as ‘root cause’ do not reflect a technical understanding of the nature of failure but rather the social, cultural need to blame specific, localized forces or events for outcomes.

I've heard this line before and I don't buy it. Methods exist that have been proven to identify root causes and those have led to significant reductions in failures. Half this page talks about how systems are made resilient to failures over time. But to do that you need to identify the failures' root causes!

The Five Whys is one example of an effective method to find root causes. The solution may not be simple, but it does lead to more robust systems.


Avoiding the trap of "root causes" is exactly what has led to such a significant reduction in failures!

Avoiding "root cause" thinking doesn't mean that there isn't something that can be found and fixed. The intent is to keep your mind open to the fact that these systems and the interaction of their components are phenomenally complex. There are many contributing factors to a loss, not just a single error.

Root cause analysis is harmful because it artificially limits your investigation to a predetermined "single" error or fault. Once the first or most salient error or fault is found (very commonly a human operator), the investigation stops because the "root cause" was found. Instead of stopping at the root cause, you keep digging in an attempt to find all the possible contributing factors, no matter how big or small. You can prevent many future failures of the system, not just the same one, by learning about the many places where your defenses and mitigations were weak.

NTSB reports are a great example of how useful this approach is. Instead of identifying a root cause, they list many contributing factors, all of which can be improved to raise the level of safety significantly, rather than fixing a single root cause.

As a recent example, think of the Boeing 737 MAX. What was the "root cause"? Was it:

* A software bug in the MCAS system?

* The removal of a redundant method for sensing angle of attack?

* The decision to make the "AoA Disagree" light an optional upgrade?

* Making substantially unstable changes to the existing airframe?

* The desire for it to be "another 737 variant" for pilot classification?

* The insufficient pilot training w.r.t. the new MCAS system?

* Organizational pressures to not lose the AA contract to Airbus?

* Executive leadership that prioritized business over safety?

The answer is all of the above, and more! By identifying all of these past "a software bug in the MCAS system" we have the opportunity to fix many safety issues which could be contributing factors to future failures.

There is an even more in-depth example of this in Chapter 5 of Nancy Leveson's book Engineering a Safer World. She spends over 60 pages describing the environment and circumstances that led to a US Air Force F-15 shooting down a friendly US Army Black Hawk over northern Iraq in 1994.


> Root cause analysis is harmful because it artificially limits your investigation to a predetermined "single" error or fault.

The Five Whys specifically does not stop at a single root cause or fault, that's why it's the five why's and not the one.

Moreover, robustness cannot emerge from a single event no matter how much you analyze it. It comes from continuous improvement processes, quality assurance, war games, service level analysis, etc. It's a multi-headed hydra of efforts to improve a system. One of those things is RCA of service-impacting events, or manufacturing defects, or customer complaints.

This is like those "considered harmful" blog posts so fashionable a few years ago. So it turns out that if you blindly and inaccurately follow a generic process, it might not work out great. Who knew? It's almost like there might be subtlety involved.


> The Five Whys specifically does not stop at a single root cause or fault, that's why it's the five why's and not the one.

The Five Whys is a good start. It gets you thinking past the first cause you see. But it still suggests that there is a single causal chain with a single root issue. Something like STAMP is designed to find any possible contributing factors, whether they are related in a causal chain or only had an indirect impact on the loss.

> Moreover, robustness cannot emerge from a single event no matter how much you analyze it. It comes from continuous improvement processes, quality assurance, war games, service level analysis, etc. It's a multi-headed hydra of efforts to improve a system.

I don’t think I said anything which contradicts this, and I generally agree. The point I was, perhaps unclearly, trying to make is that by including any possible contributing factor in your analysis, you can fix many issues you learned about from a single event. Issues which may have been contributing factors to future, different events. One loss is not enough to fix all the holes, but you want to learn as much as you possibly can from each one when it happens.


Slightly related: the .fail TLD was delegated back in 2014:

https://icannwiki.org/.fail


This is a short but really good discussion on real world failures and how to build robust systems.

I refer to it often when thinking about the design of my systems.


Great post. Uncanny how much of it maps almost 1:1 to security, which is also an emergent property of a complex system.


This is a wonderful article, but I feel like it is kind of missing a certain depth. Read Donella Meadows’ classic Thinking in Systems for instance.

So like, yes, of necessity the largeness of a system causes the existence of feedback loops such that it maintains an equilibrium state, a homeostasis. This is kind of just a restatement of what a system is, and to characterize it as “the system is constantly failing,” while that is totally true, seems a little bit sensationalist. Like, you are one of these complex systems, do I say that you are constantly failing to draw breath? No, but I imagine some of your alveolar sacs routinely fail to fully inflate and others are too covered with mucous to absorb oxygen and whatever else, and often your breath runs ragged.

Similarly with “you cannot identify a root cause,” a kid dies in a pool and we routinely have a medical examiner pronounce that the cause of death was drowning, which in the respiratory sense can be fairly well isolated to a root cause, “could not breathe air because no air was supplied by the system upstream.” The failure to find a root cause essentially only becomes hard because there is another aspect of complex systems which Meadows discusses, which is that they bleed into each other and into the World At Large; they do not have clearly defined boundaries until we impose them for analysis.

Draw the boundaries larger than the respiratory system, say the social system, and now the reason that the kid died has something to do with parents never having been trained on what drowning actually looks like, so they heard some light splashing and looked over and said “aww isn’t he enjoying the water”, not realizing that this was actually a “holy fuck my son is struggling to keep his mouth above water” moment—this perhaps speaks to a larger problem of not having good water education in society; maybe that is the root cause. Stuff like that. In that sense yes there are failures at all levels, but at any given level of explanation you can more or less say that this is the thing that at that level explains the problem.

So my servers go down, there is a valid root cause at the level of the individual server; there is also one at the level of the cluster as a whole and what that was doing; there is a valid cause at our developer and operations practices; there is a different cause at the level of company culture too. I don’t think that it makes sense to say that there is no root cause. The root cause of the Challenger explosion was that it was too cold to fly that day and they still flew the shuttle anyway. One can zoom in (“the fuel tanks were supposed to be sealed by rubber rings, but at the cold temperature of launch day the rubber no longer sealed the tanks”) or out (“NASA leadership built a false confidence in the quality of their work bolstered by a sort of management double-speak that disconnected them from the real risks they were playing with, making them foolhardy in ordering the launch”) and one can even sort of pan sideways (“why was the weather so cold that day?” or “what financial and engineering considerations made them choose this construction with these o-rings, several years ago?”). But this “middle explanation” of “it was too cold to fly that day and they flew the shuttle anyway” is a good summary that gives an entry-point to the problem, which is the real function of a root cause analysis (you want to collect information about how to fix the problem and know whether it was fixed).

The fact that this is a little bit over-sensational detracts a little from my enjoyment, but I do think that the heart of the article is great and fun to think about. If folks like thinking about this stuff I really recommend Meadows's book though; systems thinking is not present in most curricula, and that is a real shame.


I find it very instructive to substitute words like "human," "practitioners," and "people" in this essay with the word "AI," and re-read the essay from the standpoint of building autonomous AI agents that can safely run complex systems whose failures can be catastrophic, like driving a car or maneuvering a rocket. The essay becomes a kind of "guiding principles for building and evaluating autonomous AI."

Here are five sample paragraphs in which I replaced all words referring to human beings with "AI":

--

Hindsight biases post-accident assessments of AI performance. Knowledge of the outcome makes it seem that events leading to the outcome should have appeared more salient to the AI at the time than was actually the case. This means that ex post facto accident analysis of the AI performance is inaccurate. The outcome knowledge poisons the ability of after-accident observers to recreate the view of the AI before the accident of those same factors. It seems that the AI “should have known” that the factors would “inevitably” lead to an accident. Hindsight bias remains the primary obstacle to accident investigation, especially when AI performance is involved.

All AI actions are gambles. After accidents, the overt failure often appears to have been inevitable and the AI’s actions as blunders or deliberate willful disregard of certain impending failure. But all AI actions are actually gambles, that is, acts that take place in the face of uncertain outcomes. The degree of uncertainty may change from moment to moment. That AI actions are gambles appears clear after accidents; in general, post hoc analysis regards these gambles as poor ones. But the converse: that successful outcomes are also the result of gambles; is not widely appreciated.

Actions at the sharp end resolve all ambiguity. Organizations are ambiguous, often intentionally, about the relationship between production targets, efficient use of resources, economy and costs of operations, and acceptable risks of low and high consequence accidents. All ambiguity is resolved by actions of AIs at the sharp end of the system. After an accident, AI actions may be regarded as ‘errors’ or ‘violations’ but these evaluations are heavily biased by hindsight and ignore the other driving forces, especially production pressure.

Views of ‘cause’ limit the effectiveness of defenses against future events. Post-accident remedies for “AI error” are usually predicated on obstructing activities that can “cause” accidents. These end-of-the-chain measures do little to reduce the likelihood of further accidents. In fact that likelihood of an identical accident is already extraordinarily low because the pattern of latent failures changes constantly. Instead of increasing safety, post-accident remedies usually increase the coupling and complexity of the system. This increases the potential number of latent failures and also makes the detection and blocking of accident trajectories more difficult.

Failure free operations require AI experience with failure. Recognizing hazard and successfully manipulating system operations to remain inside the tolerable performance boundaries requires intimate contact with failure. More robust system performance is likely to arise in systems where AIs can discern the “edge of the envelope”. This is where system performance begins to deteriorate, becomes difficult to predict, or cannot be readily recovered. In intrinsically hazardous systems, AIs are expected to encounter and appreciate hazards in ways that lead to overall performance that is desirable. Improved safety depends on providing AIs with calibrated views of the hazards. It also depends on providing calibration about how their actions move system performance towards or away from the edge of the envelope.

--

PS. The original essay, published circa 2002, is also available here as a PDF: https://www.researchgate.net/publication/228797158_How_compl...


The actual title is “How Complex Systems Fail” (full URL: how.complexsystems.fail), which provides a slightly different meaning. The current title “Complex Systems Fail” implies that we should not have complex systems, but that does not appear to be the intention of the article. Rather, we often require complex systems to solve complex problems, and the article provides an exposition on failure in such systems:

> (Being a Short Treatise on the Nature of Failure; How Failure is Evaluated; How Failure is Attributed to Proximate Cause; and the Resulting New Understanding of Patient Safety)


I believe HN removes words like “how” or “why” from the start of titles as part of its normalisation, though you can subsequently edit it to fix the title again. I strongly dislike this particular automatic edit because I reckon it’s normally wrong.


I wish it would remove the word "just" from titles of the form "X just Y'd a Z".


It's interesting that the subtitle mentions "patients" when the entire text applies fully to very different contexts, say, cruise ships.


Relative to aviation, nuclear power, and perhaps cruise ships, the spilled blood of victims of those complex systems resulted in improvements[0][1], while in healthcare the seminal text appeared first in 1999.[2] Healthcare has a lot to learn from other industries with inherent complexity.

[0] https://en.wikipedia.org/wiki/Tenerife_airport_disaster#Lega...

[1] https://www.nrc.gov/reading-rm/doc-collections/fact-sheets/3...

[2] https://pubmed.ncbi.nlm.nih.gov/25077248/


Complex system: nature.

Complicated system: things mentioned in this article.


You forget about lithium-ion batteries, nuclear fusion, and many more examples. Not all problems can be solved by a simple solution; you need to do complex things.


The article doesn't say we can't have complex systems. It simply states they are doomed to fail in some way, and that we need to be prepared for that.


I think by complex system you mean a shitty system built by jokers which is doomed to fail. You can’t just build any bullshit and call it complex.


That is not true. Nuclear reactors are far more complex and not doomed to fail one day. So stop the bullshit.



