Resilience engineering: Where do I start? (github.com/lorin)
167 points by azhenley on May 15, 2019 | 11 comments



Readers beware -- this particular taxonomy of robustness vs resilience is not a pervasive or even common one. Often these terms are used completely synonymously. And often they are used with different subtleties that distinguish them.

For example, some distinguish between the two terms in that robustness refers more to staying functional in the face of failures, whereas resilience refers more to the capability to work around failures (neither having anything to do in particular with whether the unknowns were unknown).

The blog post author says that this taxonomy comes straight from David Woods, so there's no problem. Just keep in mind that most people don't use these terms in this particular way.


Outside of software, resilience engineering is an established field using this definition and disambiguating the others. See http://erikhollnagel.com/ideas/resilience-engineering.html for some info on the origins, going back to the 70s. It’s only in the last 5-10 years that people in software have been getting involved.


> Readers beware -- this particular taxonomy of robustness vs resilience is not a pervasive or even common one. Often these terms are used completely synonymously. And often they are used with different subtleties that distinguish them.

> For example, some distinguish between the two terms in that robustness refers more to staying functional in the face of failures, whereas resilience refers more to the capability to work around failures (neither having anything to do in particular with whether the unknowns were unknown).

> The blog post author says that this taxonomy comes straight from David Woods, so there's no problem. Just keep in mind that most people don't use these terms in this particular way.

Can you go into more detail with specific examples between the two that highlight the differences? "Working around failures" and "staying functional in the face of failures" sound borderline synonymous to me, so I'm curious how that plays out in practice.


> "Working around failures" and "staying functional in the face of failures" sound borderline synonymous to me

One is going around the iceberg, the other is ensuring the Titanic can sail on with a couple of huge holes in its hull.


If the Titanic does not hit the iceberg, it does not enter a failure mode. That doesn't sound like "working around failures" but "avoiding failure" which seems very different.


This is a great overview. I would also recommend Dekker's book The Field Guide to Understanding Human Error [1]. It's a bit easier to read than Drift Into Failure, which I found to be very dense.

1: https://www.amazon.com/Field-Guide-Understanding-Human-Error...


The team I work on at AWS wrote a paper on this: https://d1.awsstatic.com/whitepapers/architecture/AWS-Reliab... It covers concepts such as Recovery Oriented Computing (ROC).


This is a good one as well. It shows how humans and our societal systems act around disasters (we are happy to live on volcanoes even after we see them blow up): The Big Ones: How Natural Disasters Have Shaped Us (and What We Can Do About Them) https://www.amazon.com/Big-Ones-Natural-Disasters-Shaped/dp/...


I think the work by NN Taleb on Fat Tails, Black Swans and Antifragility at least deserves a mention on this list.

Edit: also, this USCSB YouTube channel has some cool info on disaster engineering: https://www.youtube.com/channel/UCXIkr0SRTnZO4_QpZozvCCA (see also https://www.csb.gov/videos/)


I love small, concise mini-syllabi like this. Just give me the big papers and set some context.


I've been trying to implement and apply these principles at my $job. It's so helpful to have an intro guide published with all the supplemental reading. I'm going to send this around to all my teams.



