
The problem is it is easy to look at an item and say "we no longer need this, remove it". It is much harder to be sure you have fixed all possible ways the problem could have happened. Maybe the system auto-signs the release - but does it in all cases? Is there a rare special case where the automatic process doesn't happen? Do you really know what tricks they have to pull in that branch office in India?



That's why the same consideration that goes into adding an item needs to go into removing an item, and into reviewing whether the checklist is still relevant at all.

Broken down a different way, you need two systems: an automatic system so you're not reconsidering decisions that already have good outcomes (a checklist), and a considerate system that decides what goes into or out of the automatic system.

If the considerate system is broken or just frozen, then you have a problem. How long it will take to manifest is a different question.


I think there's an asymmetry that you're not quite capturing here: when you add an item to the checklist, you have a specific concrete failure at hand (the process just failed, you had everyone on hand to do your root cause analysis, and you know exactly why). Later when you want to remove it, you (a) lose context because time has passed and memories are fallible and/or a different person is dealing with it, and (b) you have to have some reasonable assurance that you've covered other ways the same failure can occur. No amount of process can completely eliminate this asymmetry so at some point you're forced to make a tradeoff based on how risk averse you want to be.


The problem is not lost context. The problem is that adding an item to a checklist protects you from a category of outcomes. If you have a checklist item to make sure a release is signed, you will never push an unsigned release.

Removing an item from a checklist is done in response to a change in inputs. Sure, you may have automated release signing - but unless you are 100% confident that you are aware of, and have mitigated, 100% of the ways in which this can fail, you cannot, and should not, remove the 'check that the release is signed' step.

Lost context has nothing to do with this. Unless you are an omniscient god, you probably cannot reason, with 100% certainty, that you have mitigated every possible input that could produce a bad output.

So, check your outputs.


So it could just as well be an automated test that checks whether the release is signed, instead of a checklist item.
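
For instance, here is a minimal sketch of such a test (the artifact paths and the gpg-based detached signature are assumptions of mine, not anything from the thread):

    # Minimal sketch: fail the release step if the artifact is not
    # accompanied by a valid detached GPG signature.
    import subprocess
    import sys

    ARTIFACT = "dist/release.tar.gz"        # hypothetical artifact path
    SIGNATURE = "dist/release.tar.gz.asc"   # hypothetical detached signature

    def release_is_signed(artifact: str, signature: str) -> bool:
        """Return True if gpg accepts the detached signature for the artifact."""
        result = subprocess.run(
            ["gpg", "--verify", signature, artifact],
            capture_output=True,
        )
        return result.returncode == 0

    if __name__ == "__main__":
        if not release_is_signed(ARTIFACT, SIGNATURE):
            sys.exit("Release artifact is not signed (or signature is invalid) - aborting.")

Wired into CI as a blocking step, it plays the same role as the checklist item: it checks the output, not the process that produced it.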

Having long checklists for comparatively simple tasks really hurts productivity, plus they're often used as an excuse not to put automation in place, because 'the process is already defined' and 'people are already used to it'.

When designing a process, it is of utmost importance to keep checklist lengths minimal.


I can definitely concede there can be some asymmetry, but I think it's system dependent.

As the Boeing 737 Max 8 shows, adding new items (in this case a new design element) is also fraught with risk. You have to get the root cause analysis correct when adding or removing elements. Adding carries the risk of unknowns that were never accounted for, just as removing carries the risk of unknowns whose accounting has been forgotten.

In the end, I guess I still believe the real strength lies in good analysis at the consideration stage.


Adding a design element is completely different from adding an item to a checklist.

The asymmetry with checklists is that it's completely risk-free to add a new check, but it's risky to remove one. For example, someone might say "we're not totally sure that our system won't fail when we do X, so let's check that in QA, or at runtime, or at takeoff, or whatever." Now that the check is there, it protects you from failures when you do X. And now you're in a situation where you can't safely remove that from the checklist unless you can prove that your system won't fail when you do X. Adding requires only a suspicion, removing requires rigorous proof.

The case you describe with the 737 Max isn't the same at all. There's an actual risk when adding a new component to the system, but no risk when adding additional verification. That's not to say that there aren't other costs, but it can't directly make your system less reliable.


Just one example: suppose I add a step of "If the patient presents with heart attack symptom X, was the patient injected with 100cc of <drug>?"

It's not a risk-free check at all. It will likely increase the rate at which the drug is administered, with all the pluses and minuses associated with adding a component to a system.


I think the case of the 737 Max is that they gamed the checklist approach - they made the guy who should be checked the guy doing the checking. Which is why the argument to remove an item is so dangerous. (But then again, it may also be bad blood. Life is more complex than a checklist.)



But the point of Chesterton's fence is not that you should never knock down a fence :P


But what it does do is tell us that the difficulty of removing a fence is irrelevant to its utility. :)


I feel like this is the reason disruption works so well: after an organisation has existed for a while, hundreds of fences are clogging up the streets. Some of them are useful, some used to be useful, and some were never well thought out to begin with.

So disruption is taking a bulldozer and driving right over them. The clearly useful fences get re-erected quickly. The forgotten useful ones are rediscovered after a while, maybe causing some minor damage in the process. But the great majority turn out to be useless, and are now gone for good.

Of course, if there is a pack of nuclear, civilization-killing wolves out there, they had better stay fenced off or else. The trick is identifying them. There might be a role for a regulator in there.



