I've seen this called "pointing and calling" [1], Japan's train drivers use the technique to force themselves to perform actions and take notice of the current environment.
I personally took it to heart, it's a good system for forcing a cache miss in the brain - make sure you're on "database production" or "database localhost" etc.
War story time. Long ago, I worked for an interesting company that insisted on running its entire business on Linux desktops, all the way back in 1999-2002. Imagine running StarOffice/OpenOffice, Thunderbird, Netscape Navigator, etc., for your entire business back in 2000, including your executive team, marketing teams, everyone, most of whom had never even heard of Linux before.
Anyway, this being Linux, everyone's home directory was mounted on NFS. All our builds were standardized with a tool called SystemImager, which we could use to push out updates to everyone's desktop whenever we wanted. If there was a new version of KDE, we could pretty easily push that change out.
Sometimes it was convenient for me to work on updates to these images by chrooting into a directory containing the "image," which was really just an rsync tree. And sometimes, when updating these images, it was convenient to mount our NFS home directories in this chroot environment, so I could access things like an archive I had just downloaded on my own desktop.
And eventually we had lots of different images, and the old ones were using up a lot of disk space, so I decided to clean up some space by removing the old images. And these were fairly large images, with lots of small files, and this was before SSDs were a thing, so it made sense that deleting them was taking a while, and I stepped out to grab something to eat.
As I was eating lunch, I started getting the tech support escalations. But this wasn't that unusual, our users routinely had problems with the environment we had provided. They hated it, because it was in many ways terrible, and they made sure we knew it. So I wasn't terribly alarmed. I didn't think any major changes had been made, so I didn't hurry back.
By the time I leisurely returned from lunch, half the NFS home directories for our users were gone, along with all their documents, emails, bookmarks, or whatever else. Suddenly it hit me what had happened: at some point, perhaps months earlier, I had left our NFS home directories mounted within one of these image chroots. And now I had sudo rm -rf'd it.
We had backups, but they were on tape, and it took several days to restore, with about a day of data loss.
My favorite version is when that UPDATE or DELETE SQL query that you expected to finish instantly takes a few seconds before giving you your cursor back.
If someone just gave me a tool to show me the expected wall time of a query before actually running it, I would be quite happy. I would not even need that much accuracy; anything within one order of magnitude would be useful, and even within two orders of magnitude I would still use it occasionally.
You probably knew this already, and there are probably better solutions if you're not in the manual sysadmin world, but after I did that on a personal machine (a few decades ago, I think), I got in the habit of using `--one-file-system` when doing major recursive rm operations that weren't meant to cross filesystems. Or `find -xdev … -delete` for anything more selective.
It seems better to alias rm to "rm --one-file-system", assuming major cross-filesystem deletes aren't something you do so often that they need to be as ergonomic as possible.
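For the curious, a rough sketch of what those look like in practice (the paths here are made up):

    # recursive delete that refuses to descend into other filesystems
    # (NFS mounts, bind mounts, etc.) - GNU rm only
    rm -rf --one-file-system /srv/images/old-build

    # more selective cleanup that also never crosses a mount point
    find /srv/images/old-build -xdev -name '*.tmp' -delete

    # the alias suggested above
    alias rm='rm --one-file-system'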
Similar story, except we were using an NFS appliance that took hourly snapshots. As soon as we figured out what was happening, we had the storage team save off the latest snapshot. It was 1TB of data (a lot for the time) and took a week for us to restore.
A lot of companies still work in a similar fashion to what you described, maybe with root squashed, but still, it's very possible for something like that to happen nowadays!
I remember someone hit a bug with docker exec --rm years ago where it started deleting some NFS files that it shouldn't...
This reminds me of a time when a colleague and I were investigating some persistent D-State processes that were occurring when container processes were being exec-ed.
Once on the box, we wanted to create a container with utilities in the fs but didn't want to download an image tarball or look through the rootfs layer directories for one to use, so we just bind mounted host root onto another directory, beside the config file we were using.
This worked like a charm. Until we rm -rf'd the config directory and deleted host root in the process.
In our case, fortunately the consequences were minimal as all workloads were stateless. The container scheduler moved all the workloads to other hosts and the host scheduler noticed this VM wasn't responding any more and rolled a new one. The whole thing resolved itself in about 5 minutes with no interaction from us - so that was pretty neat.
I once cloned a directory for standing up an environment via Terraform. I modified all of the environment variables and config and ran it. It worked perfectly. Except I'd forgotten to wipe out the Terraform state, which meant that in the process of creating a new environment, it completely deleted the environment I had cloned. That was my initiation into being very experienced :)
Some time ago, it was common in Unix sites to have an NFS filesystem mounted on all machines that contained locally-built binaries to augment those provided by the operating system. At this site, we used a bunch of different platforms: OSF/1, Solaris, Linux, HP/UX, etc. So we had a large filesystem containing the source code, and built binaries for all the different platforms, and this included heaps of things, from Bash upwards.
A colleague of mine accidentally ran rm -rf on this filesystem.
It was taking a loooong time, so he realised and killed it, but not before it had removed a heap of stuff. Because this was something that could be rebuilt, it wasn't backed up, so we had to go through the process of downloading the tarballs, and recompiling everything for all the different platforms. It took a few days to recover most of it, and weeks to completely restore things.
The day after the incident, when he arrived at work, he found his keyboard was missing a few keycaps. It took him a while to realise that there were four gone: 'R', 'M', '-', and 'F' ...
Reminds me of when I accidentally deleted a virtual hard disk I had a few years ago, because I'd copied it earlier and I thought I still had the other copy left. Only afterward did I remember I'd done the exact same thing to the other copy earlier... thankfully the information on it wasn't critical, but it was kind of terrifying to realize it very well could have been.
I deleted our production CRM database meaning to delete the test database. While my boss was running queries on the database for setting my quarterly bonus.
Good news is that I was deleting the test database to ensure that the recovery from backups was properly automated, so it wasn't down too long.
I have been that boss. Is that you, Wendel? In any case: the deletion even had a "type your app name to confirm" prompt, but I knew I wanted to act on production; the issue was deleting the wrong one of multiple production databases. The takeaway was to grab a second pair of eyes to review any dangerous operations.
Yup. Senior dev here, my own devops config screw up wiped out all production sales order data earlier this year. Had to restore from multiple backups, took a while. Stressful experience.
Consider network partitioning so dev/test/accept just has 0 contact with prod.
Most of the worst production issues I've been involved with have come from trying to fix a minor issue and then somebody making a mistake. The way our brains are wired to handle stress isn't really useful for debugging complicated problems.
Ever since hearing about point-and-call, I've started using it in the kitchen when turning on the stove. I used to destroy one or two pans a year by turning on the wrong burner, but it's now been about a year and a half and I haven't screwed it up yet.
The knobs are labeled with a terrible little glyph meant to indicate which is which, and I've supplemented this with plain-english Brady labels "front left", "front right", etc. Now I speak the words above the knob, and point to the burner. It felt goofy at first, but now it feels normal, and like I'm tempting fate if I skip it.
I'm curious how exactly you managed to destroy pans. I've never destroyed a pan in my life, and take no particular precautions - is this a common thing? Is this more common with non-stick stuff or something?
The non-stick ones especially, but even plain metal pans will warp if they get hot enough. And then they don't sit flat on the burner, which might not matter on a gas stove, but contact with an electric burner is pretty important.
Not sure how it is in other countries, but don't the knobs when going left-to-right always correspond clockwise to the burners, starting at the lower left? And the oven knob is to the right?
My four knobs go front to back. I don't know what order they're in - the glyphs are fairly readable to me. I've seen this arrangement plenty, it's not unique.
Worth mentioning that, assuming the single study on the matter can be believed, the pointing and calling method is extremely effective in reducing the incidence of silly mistakes (that is, mistakes made in simple routine tasks, by competent individuals).
Unfortunately, it strikes many as looking rather silly, so it hasn't been widely adopted.
I learned a technique from a gray beard[0] when I worked as a student sys admin for the CS dept over two decades ago. Whenever typing a destructive command, he'd take his hands off the keyboard and drop them to his side, re-read the command, then put his hands back to press enter.
I do this whenever I'm on a production server (which is rare anyway). I use different colored prompts for local and remote shells.
[0] Technically he had no beard and if he had, it wouldn't have been gray.
Could be related to me doing the electrician's equivalent of deleting production DBs. I've drilled through the comms cable to payment terminals during opening hours. I've run over a copper gas line with a scissor lift. And yes, I've cut live 230V cables with hand tools.
That sinking feeling in your stomach you get immediately after doing something bad - it's universal across professions.
Thankfully, I've never fucked anything major up, and I've had my hands in hospitals, power plants, ISP fiber backbones, police stations and whatnot.
> I've drilled through the comms cable to payment terminals during opening hours.
A friend of mine who does fire alarm systems was tasked to install one at a bank branch. He found out the hard way that one of the cables for the safe's safety system wasn't in the place where it should have been according to the plans. Safe's safety system hosed, bank branch closed for repair.
Solid tip. For GUI-enabled servers, use distinctively coloured wallpapers. I recommend bright red for production machines. The image itself can be just about anything, provided the colour is clear.
Doesn't hurt to use an image that's related to the server's purpose, and to put the name of the server right there in the wallpaper somewhere.
Using iterm2, you can set a "badge" (large text overlay) on a terminal tab. I have a short shell function (`ib foo`) that sets the badge to arbitrary text. It's NOT as good as setting the terminal theme, but it's still very helpful to use it like this:
ib production && ssh production-machine
ib demo && ssh demo-machine
It's definitely helped me when testing a fix on a demo or staging instance, and has helped me avoid doing it on production accidentally.
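In case it's useful, a minimal sketch of such an `ib` function, assuming iTerm2's proprietary SetBadgeFormat escape sequence (the badge text has to be base64-encoded):

    ib() {
      # overlay the given text as a large iTerm2 badge on the current tab
      printf '\e]1337;SetBadgeFormat=%s\a' "$(printf '%s' "$*" | base64)"
    }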
A similar tip I picked up long ago: If you're typing a dangerous command, first type a `#` (or `--` if it's SQL, etc.), then the command. Then read it. Then go back to the start of the line and remove the comment and run it.
I always do destructive SQL commands in two steps: first run a select using the WHERE clause you intend to use and verify which records will be affected, then hit the up arrow and edit the beginning of the query leaving the WHERE intact.
I also like adding redundant conditions to the WHERE so a typo in any single one of them won't sink me.
For the rare but critical manual SQL mod, our common safety measure is to wrap every DELETE or UPDATE in BEGIN TRAN...ROLLBACK TRAN first. Run it on test systems or snapshots multiple times, checking the result inside the transaction.
Finally, change ROLLBACK to COMMIT only when you are positive all is well.
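For what it's worth, a rough Postgres-flavored sketch of that wrap, combined with the select-first check mentioned elsewhere in the thread (the orders table and its columns are made up):

    BEGIN;

    -- look before you leap: which rows will this touch?
    SELECT id, status FROM orders
     WHERE status = 'stale' AND created_at < now() - interval '90 days';

    DELETE FROM orders
     WHERE status = 'stale' AND created_at < now() - interval '90 days';

    -- verify the result while it is visible only to this transaction
    SELECT count(*) FROM orders WHERE status = 'stale';

    ROLLBACK;  -- change to COMMIT only once everything above looks right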
IIRC (without checking the manuals) data-definition commands might not be covered by such transactions: altering or dropping tables, and possibly truncates.
PostgreSQL is quite good about DDL being transactional. So I was surprised (tbf, I shouldn't have been) when Redshift autocommitted after a TRUNCATE. But DROP TABLE is transactional, go figure.
I use an alternate version for SQL: when running any modification on any kind of sensitive database (which is a bad practice in itself, obviously, but sometimes you don't have a choice), always type in the WHERE clause before the table name (added bonus: do a SELECT first with that clause to see what you are modifying).
That way, if you accidentally send it, the command fails and nothing happens.
I've done this for several years (also after seeing a video about Japanese railway operations). It doesn't seem to catch on.
It's also not perfect; it does not catch mistakes concerning "non-local" state, e.g. configuration files in /etc merging with one in . merging with some command line options. (Personally I try to avoid writing tools with defaults of this sort, but Java developers especially seem to have different opinions.)
Unfortunately if you do P&C and still make the mistake due to the aforementioned tooling, you look even stupider.
Around industrial machines, I've long held and promoted the view that the machine is _trying_ to kill you, _trying_ to damage itself, _trying_ to ruin the workpiece. Only by outsmarting it at every turn, and having safeguards against every mishap, can you go home at the end of the day.
When something happens despite all that, just step back and realize how much worse it could've been, and how successful your safeguards have been up 'til that point.
Then look carefully at the procedure. Is there something about the naming or structure that could be more clear? Can you think of near-misses that resemble the failure you just experienced? Are you using boobytraps in production? Symlinks and overlay filesystems seem clever in the moment but they're bound to subvert our intuition someday. Perhaps you should get in the habit of always using full absolute paths, for instance.
There's always another gotcha, but if your workflow doesn't look as over-the-top safety-silly as aerospace, you're not doing as much as you could be. (Hint: It's not silly.)
I searched Youtube for examples of this. This is a little bit staged, but it seems to be a real checklist they're going through: https://www.youtube.com/watch?v=JG7SkOQDDt0
Though they're not perfect. They said that one pilot is supposed to read the item, the other pilot say the answer, and the first pilot visually confirm it; but at 1:42, I noticed the first pilot say "emergency exit lights", hear the confirmation, and move to the next item without her eyes moving away from the list.
I'm not sure which of several possible conclusions to draw from that. ("Humans suck", "it is indeed staged", "the procedure has enough redundancy that the chance they're both careless on a given step is small", "the pilots feel that the emergency exit lights aren't particularly important", ...)
Routine is the killer. Have a look at the fatal maglev train accident in Germany. The service car was on the track. The presence of the service car in the service bay (and not on the track) could have been seen visually by the operator (driver) in the control centre just by turning their head. (If I remember correctly.)
Rock climbing is remarkably similar. When a climber begins up a route the standard exchange with their belayer (the person managing the rope and keeping them alive in a fall) goes something like
A: "Belay on?"
B: "Belay on"
A: "Climbing"
B: "Climb on"
Then the climber begins.
It's interesting to me that highly regulated and totally unregulated activities have evolved extremely similar processes. I suppose having your life on the line is a good motivator to follow best practices.
Prior Navy Nuke here. We called it PRO (Point, Read, Operate)- we’d point at the thing we were going to manipulate, state what we were manipulating, and announce the completed action.
For certain procedures we had a second party (“reader”) observing and acknowledging each part of each step.
Operator (Gesturing anti-clockwise while pointing at valve XYZ)
Operator: Opening valve XYZ.
Reader: Opening valve XYZ, aye.
Operator: Valve XYZ is open.
Reader: Valve XYZ is open, aye.
Operator: Indications of flow
Reader: Indications of flow, aye.
People can still get complacent, and things can still get missed but the deliberate mentality goes a long way. Now when GitHub makes me type out the repository name before I can delete it, I sometimes copy/paste... YOLO.
I've noticed from pair programming that the person navigating with a mouse is far less able to read and interpret their surroundings or pick up typos while typing, than an observer that simply has to watch what the other person is doing.
Like when you've just entered a directory and are looking for a file to click on, the observer can literally locate and point to the file 5-10x faster than the mouse operator can.
The observer seems to interpret the information that results from the directory listing faster than the person who just did the double-click to enter the directory because they don't have the muscle coordination context switch and can immediately move to interpreting the results.
It's probably because mouse manipulation uses brain infrastructure that is more recently evolved, but observe-react is a lot earlier in the brain processing pipeline evolutionarily, and a lot more refined/involved.
Since I have a vision impairment, I'm sure the effect is amplified very much for me, but using the mouse is such a massive break in flow:
- First you have to lift one hand up off the keyboard and put it down on the mouse. This may or may not mean taking your eyes off the screen.
- Then you need to find the mouse pointer on the screen
- Then you need to aim for what is usually a relatively small target and move the pointer there.
- If you're right-clicking, the right-click menu usually presents more small targets you need to aim for.
- If you need to use the keyboard, again you have to move your hand over to the keyboard from the mouse.
For finding the pointer, I developed this unconscious habit of slamming the mouse pointer to the very top-left of the screen. It's difficult though when on someone else's machine, where your brain isn't used to the pointer velocity or where multi-monitor means that slamming the mouse to the top-left actually puts the pointer on another monitor.
People look at me in awe when I'm using a two-pane file manager but honestly not having to take your hands off the keyboard and not having to move your eyes off the screen gives so much better flow. It's also why I like the UI of Blender - one hand on the keyboard and one hand on the mouse at most times.
I think this is because writing software is so much more than operating switches and controls. I really hate pair programming for this reason, but I love industrial-style controls and protocols involving multiple people.
Back in the '80s I worked on a financial system (SWIFT interface) for an Italian bank. It went operational and we observed 2 operators effectively doing "pair operating". We just thought it was weird Italian style socialising - one had the keyboard and the other was chattering away with a commentary.
But they were surprisingly effective!
I accidentally learned, when teaching a course at a site with too many people for the available machines, that pair exercises were very effective - I got lots more questions, and overall learning went way up. If the pair discussed it and couldn't find an answer, they would have the confidence to ask. On their own, neither would probably bother; they'd just wait for me to go through things.
And it should be kept in mind that almost none of those procedures were intuitively obvious things to do. As the saying goes, safety standards are written in blood.
Back when I shelled into servers more, I really liked having my deployment put the environment in the prompt and set a red background on production for similar reasons. It only takes a small change to jar you out of habit.
>> I personally took it to heart, it's a good system for forcing a cache miss in the brain - make sure you're on "database production" or "database localhost" etc.
Yeah, ouch. More ouch if it's the other way around - you delete the test database and it's not the test database.
> you delete the test database and it's not the test database.
> (long story)
I think you can skip the long story, as most of us can tell a story similar in theme if not specifics (and sometimes, probably some similar specifics too). ;)
With great power comes great responsibility (to not completely screw stuff up because you were on autopilot for a second...)
I worked at a company where someone deleted the production database by accident and the snapshot mechanism hadn't been working AND the alerting for the snapshot mechanism was also broken. Fortunately someone had taken a snapshot manually some weeks prior and they were able to restore from that and lose relatively little data (it was a startup, so one database was a big deal, but weeks worth of data was not such a big deal).
Firing the person who happened to be at the wheel when a mistake like this occurs never seems like the right choice to me, especially if their performance to-date had otherwise been good.
Everybody has off days, or just instances where circumstances misalign in just the wrong way. To pretend otherwise is silly; instead, it's the leader's/team's responsibility to ensure that those sort of off days don't lead to massive losses via redundancy & the sort of measures we're talking about here & in the OP. Firing somebody in these circumstances just acts to severely reduce morale, since we all secretly know in our hearts that it very easily could have been us.
Firing in this case just seems retributive. It's not going to bring the lost data back, and you've just eliminated the very person who could have told you most about the chain of events leading to the incident in question to help you guard against it in the future. These incidents usually sound simple at the surface level ("I clicked the button in the wrong window") but often hint at deeper, perhaps even organizational, issues. A lack of team focus on reliability/quality, a lack of communication or trust about decisions made (or not made) by higher ups, or so on.
And they are probably the single least likely person to cause a similar incident again -- that person will now likely be double and triple checking their commands for eternity.
Agree. There is never a single cause to this kind of error. It takes a village. Someone didn't name things properly, someone else didn't store backups properly, someone else gave everyone root access to production, etc. It was inevitable the database would be deleted - doesn't matter who actually did it.
If your CTO scattered those landmines all over then "not stepping right" is not an error. It just sucks.
Sometimes. And sometimes they make the same mistake over and over.
We had an admin in charge of our storage. He had worked with our old vendor's SAN for years, then we got a new SAN. Trained him/certified him etc. He "accidentally" shut down the entire SAN. That brought down the entire company for over 9 hours.
Fast forward two years later, he screwed up again and caused a storage outage affecting about 1100 VMs. Luckily not much data loss, but a painful outage.
Then a month ago, he offlines part of the SAN.
Some people never learn, and recognizing this early is usually better than letting someone continue to risk things.
3 mistakes in... >2 years? I feel like it's really hard to tell if the problem is really the person at that point. Have you had others perform the same job for a similar duration to see if they avoid the same mistakes?
If you made a list of every mistake each person makes in 2-3 years, and omitted all other detail, pretty much everybody would look like a terrible person. Context, frequency, etc. all matter.
If particular systems or people are seeing a high frequency of mistakes, maybe the system design is at fault, not just the person. Obviously it's hard to do in practice, but the ideal is to design systems that are mistake proof.
He was trained and certified on the new SAN, and surely some of his prior experience on the legacy SAN would translate. Just as moving from AIX to RHEL/CentOS wouldn't invalidate all your skills and experience.
It was a real accident when he shut down the SAN the first time. I don't know why I put it in scare quotes.
> These incidents usually sound simple at the surface level ("I clicked the button in the wrong window") but often hint at deeper, perhaps even organizational, issues.
These words reminded me of a story about similar/different "flaps" and "landing gear" controls on a plane - where crashed airplanes were also blamed on pilots first, before a trivial engineering/UI solution was implemented:
https://www.endsight.net/blog/what-the-wwii-b17-bomber-can-t...
Nickolas Means has an absolutely wonderful set of talks on themes like this. Particularly relevant here I think, is his talk: "Who Destroyed Three Mile Island?" - which goes through the events that occurred at the nuclear power plant, the systemic problems, and how to find the "second stories" of why failures occurred.
There's a really good book describing this phenomenon called Behind Human Error. It speaks of "first stories" and "second stories" and how in analysis of incidents, it is all too common to stop at the first story and chalk it up to human error, when the system itself allowed it to take place.
"Both cloudformation stacks were identical (instance names, etc)."
This is why it's a good practice to include the environment name in the resource names when it makes sense. Even better, don't append the env name, but use it as a prefix, like ProdCustomerDb instead of CustomerDbProd. I also like to change the theme to dark mode in the production environments as most management UIs support this. One other neat trick is to color code PS1 in your Linux instances, like red for prod, green for dev.
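A rough sketch of the PS1 trick (the ENVIRONMENT variable here is hypothetical; set it however your provisioning does):

    # e.g. in /etc/profile.d/prompt.sh or ~/.bashrc
    case "${ENVIRONMENT:-dev}" in
      prod*) PS1='\[\e[1;97;41m\] PROD \[\e[0m\] \u@\h:\w\$ ' ;;  # bold white on red
      *)     PS1='\[\e[1;30;42m\] dev \[\e[0m\] \u@\h:\w\$ ' ;;   # black on green
    esac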
> One other neat trick is to color code PS1 in your Linux instances, like red for prod, green for dev.
This is definitely a nice one to add. Though I did work with someone once who believed that all servers should be 100% vanilla and reverted my environment colors.
In container-only shops with no ssh, this is less of an issue, and instead you rely on having different permissions and automations for different environments.
That's very similar to what happened to me - except I didn't delete any backups, thank the Great Old Ones. And I didn't get fired.
Basically, I had a habit of starting a new SQL Server Management Studio instance in its own window for each database I was working on. At some point this struck me as wasteful, for some reason, so I closed all my windows and opened all the databases in one window. Then sometime after that I went to delete the test database as a routine maintenance task, but of course I was used to clicking the database at the top of the left pane in SSMS, which was the test database when it was the only database in a window... but now happened to be the production database. Then five minutes later I got a call from the client company that used our system, to ask me if there was any maintenance going on, because everyone's client had just crashed.
The horror when I realised.
It was educational, though. I don't think I'll make that particular mistake ever again. And my bosses were ace to be fair, probably because I worked my ass off to correct the mess that ensued.
When I worked in production environments, I used to set up little Firefox userscripts that would add a banner or anything visual to the production site. It's entirely client side and easy to customize.
> I've seen this called "pointing and calling" [1], Japan's train drivers use the technique to force themselves to perform actions and take notice of the current environment.
The concept makes sense, though I don't quite fully get how to translate it to other contexts besides train driving where unexpected and unpredictable events come up all the time. Let's say you're driving a car and the traffic light turns red. Do you point at the traffic light, say "red", point at your brake pedal, say "brakes", and then hit the brakes?
In high school, I drove a 1993 Toyota Tercel. It was a functional, reliable car, but it had no keyfob to lock the doors remotely.
Getting out of your car, pressing the lock button on the inside of the driver's side door, and shutting the door are all routine, boring actions that make it easy to forget your keys inside the car. The keys can go in all kinds of places as you climb out of the car - jacket pocket, pants pocket, center console. It is very easy to lock your keys in your car.
I quickly learned to hold my keys in one hand, say out loud, "Keys in hand," and then lock the door with the other hand.
This technique is perfect for any repetitive action that could go wrong with non-trivial consequences, and there's lots of that in everyday life.
I wake up in the mornings with "Shit Shower Shave" and leave the house with "Wallet Watch Testicles Spectacles". Simple mnemonics work, doubly so if you actually say them out loud and check them each off.
I do that exact same thing, and I haven't smoked in 3 years. The downside is that if I'm supposed to remember to bring something in addition to those 3 things, I'm extremely likely to forget it. If it's super duper important, I tie it to the door handle.
To remember to bring a physical object, I leave my keys on it. Downside, sometimes people will bring my keys to me when they find them in strange places, like the fridge.
Definitely a good idea. In the subject of the analogy (software incidents) I think both should be done -- a regular and habitual focus on important/high risk commands via procedure, and preparations for the time when the inevitable still happens because people are people and it's impossible to fully predict all potential sources of unintended consequences. A lack of habitual focus when important consequences are at stake could lead to an over-reliance on the safety nets, and you really don't want your safety nets becoming routine. Otherwise you'll need safety nets for the safety nets.
Repetitive tasks are exactly what pointing and calling helps with. The intent is to prevent the brain from going on autopilot for a task that happens exactly the same way 99.9% of the time, in order to prevent disasters that last 0.1% of the time.
Traffic lights are a lot more random (and therefore mentally engaging) than the types of things train conductors are pointing and calling.
Whenever I have something in my hand that I'm about to put down for a second in the exact absent minded kind of way that would leave me searching all over the house for it 5 minutes later, I say it out loud. "Headphones on the table by front door."
Embarrassingly I once lost a hamburger while still holding it... I had my arm propped up on the back of the chair and it was just out of my peripheral vision. Not my smartest moment.
I lost my sunglasses when I was wearing them! We were going to a state park for a hike. It was a 2 hr ride, during which I was wearing my sunglasses but forgot about them. As we got out of the car to start the hike, I spent 5 minutes searching for my sunglasses in my backpack until my friend asked what I was searching for.... Maybe I should be saying "sunglasses on" from now on.
Funny, there is a Polish rhyme [1] for children based on the same concept: a person searching the whole house looking for glasses which they were wearing all the time :)
I believe the trick is to anticipate failure, and call out the normal thing instead. So you’d always slow down at every light, and only speed back up after calling out green. This is what all drivers are actually supposed to do, although I fully realise nobody practically does that, which is why we get so many automobile accidents all the time.
Only speed back up after calling out green and intersection clear.
I don't necessarily always do that, and don't make audible calls, but when driving at night or in inclement weather, I try to make extra effort to check for unexpected cross traffic.
The pointing and calling performed by Japanese train drivers is very much about expected events. "Green signal" would be one of the most common call-outs. For example:
Your example is a reactive event. Something happened in your environment.
This idea is more useful for situations that you are initiating, and where feedback is not immediately obvious.
An example could be turning your car’s lights on at night. Before starting the car, you force yourself to point to the switch, say “lights on”, and do it.
I use this with keys. When leaving my office, house, or car, I hold up the key in my hand and establish sight (I don’t say anything out loud). Then I lock the door.
I'm a photographer, and I used to get annoyed that I'd have little distractions on the edges and corners of the frame, because I was focussed on the subject and overall composition. I trained myself to sort of bounce my eyes around the sides of the viewfinder when pressing the shutter (think like the DVD player menu). Now I almost never forget to check.
I don't think it really applies to stuff like driving, which almost has to be muscle memory to work at all. even with something routine and non-urgent like switching gears in a manual, the steps have to happen faster than you can say what you're doing.
a good example from normal life is (physical) key management. I used to always forget my keys when walking out the front door, which was a big problem since it locks automatically. to solve the problem, I made my back right pocket be the designated "key pocket". I now slap my right butt cheek whenever I leave a building. it might look weird to observers, but I have not once forgotten my keys since I implemented this system.
After losing my wallet several times and not having a clue when the last time I had it on me was, I implemented a similar system. I now habitually triple tap my three designated pockets for phone, wallet, keys, every time I walk through a doorway.
That way, if any of them are missing, I know they must be in the room I just left.
Invert it and I think it works. Always prepare to stop at an intersection. Then point out it is green and call out you do not need to engage in stopping.
It may seem silly, but if we asked people who drive 30+ minutes every day if they have ever accidentally run a stop sign or red light, I suspect the numbers would be quite high (though these likely happen at times/places where the chance of an accident is smallest, such as empty roads late at night).
I teach my children to point in the direction cars can come from before crossing the road. He used to just swing his head around before; now he has to search each direction and point there to direct his attention, and it works excellently.
As others have pointed out, this is for repetitive tasks that your brain wants to automate away, but you really want to keep in attention.
It can be used for exactly the same purpose: checking the environment before doing the action.
E.g. force yourself to read the “production” part of your prompt before running the command. Point at the user name before deleting its record. Read aloud the version name before sending it to deploy.
It really makes a difference between just glancing at the info and having to parse it as part of an action.
Let's say you get a request to delete users #s 1, 17, 152, and 43.
Now you can have the request and database administration tool open and point and call at the numbers and any queries and make sure you are deleting the right users.
OpenShift does this by forcing you to write the name of the project you are about to delete. It was something that used to annoy me, but after reading this I understand it is a good call from their side.
1. Avoid silly terms our industry should have ditched years ago, like 'drop'
2. Make sure that nobody will ever change HARD_CODE_TEST_DATABASE_FOR_SAFETY because they thought it should 'always be the active database' or whatever.
I have had many disasters in my software career because I just wantonly hit "Y" without thinking about it.
I have noticed, since learning to cook at a professional level in the kitchen, that I point and call out a lot more in my other activities too. From "hot behind" and "knife" and "oven is over temp", to "saw blade is live" and "circuit is live" in the workshop, to "production server" and "erasing records" in database maintenance. Some days I feel like Sigourney "I have one job, damnit" Weaver in Galaxy Quest. It's a useful stop-think-go sanity check.
The video doesn't really explain why conductors point at the signs - it just says "to prove they're paying attention". Paying attention to what? The answer is that they are verifying that the train is correctly positioned in the station so that all of the doors will open on the platform.
This comes up every few weeks on HN, but nobody has ever offered any statistics that would suggest this is as good as, let alone better than, just having the trains handle alignment automatically. It's a task humans are bad at and machines are good at, so just giving it to machines makes more sense, modulo unions.
London Underground hasn't had guards for decades at this point, and the Docklands Light Railway hasn't even had drivers (there is a member of staff who is trained to be able to drive it on every train, but they are usually doing other things) since its creation. If they're misaligning often enough for it to be possible for New York to be statistically better, I haven't seen anything about it, despite repeatedly asking.
Actually what exactly is the member of staff doing on the DLR that is necessary, other than answering tourists' questions and putting a triangular key into a receptacle at every stop and then turning it? I have not been able to figure this out.
In the Netherlands, the NS has two types of trains that go between towns. Intercity and Sprinter. Sprinters have someone who will walk onto the platform at every stop, or failing that, lean out of the carriage, verify that no one is getting in, and then step into the train again to put the key into the receptacle and then turn it. Following that, the doors close. In contrast, there is no such person on Intercity trains; they do fine without. There may be a conductor who checks tickets. In comparison to the DLR, both Sprinter and Intercity trains have drivers.
Is there some requirement or function that I am missing that requires a dedicated member of staff to perform this key-turning ritual at every stop on the DLR and Sprinter, or is this simply to appease the unions?
It could be that Sprinters are meant to be more lenient towards people running to get on than Intercities, which might have a stricter schedule.
It's a GoA 3 system, so it isn't designed to be safe without a human staff member on every train. There are GoA 4 systems which do not need a human but the DLR isn't one, so while it would seem to operate normally if you just let passengers operate the doors - when anything goes wrong those passengers are in trouble because the system design assumes a trained member of staff is there to fix it and now there isn't.
That triangular key opens a panel by the front left seats of the train, which reveals a complete set of controls for manually driving the train which that member of staff is trained to use. If the GoA 3 system has given up when the train is just out somewhere random then "just get out" while technically possible since there's a walking route along the side at all times - is clearly not ideal even for able-bodied passengers, so in fact the member of staff will drive the train manually to a station unless obviously that's impossible somehow (e.g. terrorists blew up sections of track either side like a Hollywood movie).
Because humans are bad at driving trains, they aren't allowed to move at full speed: they can either let the GoA 3 automation oversee everything (e.g. it won't let them go anywhere it wouldn't be willing to go) at a reduced speed, or, when that's not useful, switch off all automation and move at a crawl with no oversight.
Every morning the first train of the day on each route is driven in the first of those two modes, because overnight human maintenance teams sometimes manage to leave tools and equipment on the line and the automation doesn't know not to drive the train into a welding kit left on the track by some idiot who just discovered his wife is leaving him or whatever. So the human staff member's job is to drive the train (with the AI preventing them smashing it into other trains) while looking out the front window for problems.
I try to do that during incidents. I'm not 100% there, since it's not a company rule, but it helps me at the time and later when writing up details: "I see <behaviour X>", "<Y> should fix it because <Z>", "I'm starting to do <Z> now and seeing ...", etc.
It also helps when Z results in a total meltdown and you need to pull in more people to help out, so they have context of what happened.
Killed just under 1k access points when they all upgraded in one go. They had no problem erasing the firmware, but when they all tried to download the new one at once it killed the service, and we ended up with a lot of blank APs. The confirmation message for 1 or 1000 APs is unhelpfully "This will overwrite all existing system images. Are you sure Y/N".
I think a router analogy might be more precise - more like fast path / slow path - where most incoming packets hit the fast path in hardware, and exception packets take the slow path through the CPU.
I do this with my kids, gesturing (not pointing) as it helps my mind remain focused on truly listening to them amid everything else going on. I probably look ridiculous, but I'm a better father for it so ¯\_(ツ)_/¯
I wish it were possible for similar prompts to appear before all sorts of policy-makers and bureaucrats. "It appears you are about to institute a policy which will require 400 million patients to sign an additional waiver every time they visit a clinic, this will waste a total of 354,921 human hours within the next year alone. Please type 354,921 to proceed."
The motivations are different: the cost to the rule maker of the effort by all those people is nil. While the cost of not adding the paper is the risk of something happening in the future which could cost them their job. This is why the shoe removal theatre was added to flying: the risk of something happening is essentially nil, but if it did, heads would roll.
This is not a criticism of bureaucracy or regulation BTW (I'm a fan of both, in general). It's simply a recognition that there's a misalignment of objectives.
Not sure how to analyze the calculus in the case of rachelbythebay's observation. Certainly there is one misalignment, which is that if the tool has sharp unprotected edges (e.g. can take the company's whole site down), the person who ran the program will be blamed, not the person who wrote it. Unless they are the same person, it's hard to get a proper feedback loop in place. The only tools we have are coding standards and code reviews: bureaucracy!
Yeah, it's quite surreal. "Hey, privacy is important, so let's make it so that to handle people's private data, you need permission from them." All right, now whenever you try to e.g. send a (paper) mail, you have to sign a waiver that yes, you do allow the post office to see and handle your name and your mail address. Not only that, all such waivers seem to be written as "I hereby allow <insert the legal entity> to handle my private data in whatever way they want to", so we're back to square one, just with more perfunctory paperwork required.
It requires the Office of Management and Budget to calculate the impact of record-keeping requirements on time and privacy, among other things.
I do not believe it has resulted in a reduced record-keeping burden. For the most part I simply see an estimate of how long it will take to complete my tax forms and permits, on the form itself. Perhaps others have different views.
Hard to say; knowing the cost of a new process could have informed a new design or requirements. We don't know what the other path held. But I believe in general that having more information allows us to make better decisions, so this is a good act.
I have a habit of making CLI tools that potentially do dangerous things default to dry-run mode. For example, instead of the typical `--dry-run` or `-n` option, my scripts have a cheesy `--do-it` flag to run for real. It is annoying as hell to my colleagues, but it has saved the day many times.
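A bare-bones sketch of that pattern (the `--do-it` flag name is from the comment above; the cleanup target is made up):

    #!/usr/bin/env bash
    # Dry-run by default; nothing destructive happens unless --do-it is passed.
    DO_IT=0
    [[ "${1:-}" == "--do-it" ]] && DO_IT=1

    run() {
      if (( DO_IT )); then
        "$@"
      else
        echo "[dry-run] would run: $*"
      fi
    }

    run rm -rf /srv/images/obsolete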
A coworker of mine would write all his bash scripts to echo out the commands it would run, and then to actually run it he would pipe it to bash. This way he could inspect the commands to make sure they were correct before running them.
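Presumably something along these lines (the paths are placeholders):

    # cleanup.sh only *prints* the commands it would run
    echo "rm -rf /srv/images/2019-q3"
    echo "rm -rf /srv/images/2019-q4"

    # usage: read the output first, then pipe it to bash to actually execute
    #   ./cleanup.sh
    #   ./cleanup.sh | bash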
I would love a shell that allows you to "run" a script in manual mode - where at the end of every command, every statement, it prints what the next command will be, with all variables expanded or otherwise called out, and then requires you to hit "enter" to make it proceed. I write a decent amount of something between a README and a shell script. I've already got an awk one-liner that parses the shell out of Markdown. I typically copy+paste, line by line, from my README and add a bunch of echo statements to verify what I'm doing.
The nice thing is that in PowerShell, unlike bash, this flows through to the vast majority of other commands. If the script has the snippet above, then you don't have to litter it with "if ( $userSaidYes ) { ... }" blocks all over the place.
Similarly, PowerShell automatically wires up logic to produce all of the useful modes you might want:
[Y] Yes [A] Yes to All [N] No [L] No to All [S] Suspend
This is very fiddly to implement manually, and "Suspend" is likely impossible for most shells.
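Going back to the "manual mode" wish a few comments up: bash can approximate it with a DEBUG trap, though $BASH_COMMAND shows each command as written rather than with variables expanded, so it's only a partial answer:

    #!/usr/bin/env bash
    # Pause before every command, show what is about to run, and wait for Enter.
    trap 'read -rp ">>> about to run: $BASH_COMMAND  [Enter to continue, Ctrl-C to abort] " </dev/tty' DEBUG

    IMAGE_DIR=/srv/images/old   # hypothetical target
    echo "Cleaning $IMAGE_DIR"
    rm -rf "$IMAGE_DIR"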
I do something similar with my scripts, but with a `--go` option, even on a script that requires no other options, just so that if it's run without any options, the person running it gets a message saying what the script WOULD do if `--go` were passed in.
I do the same thing. All of my scripts have a -defang parameter which walks through the entire process, including placeholder log messages, but not actually performing the operation. My run books always say to run your exact command with this switch first, to proofread it. For some dangerous scripts, defang is enabled and has to be manually turned off. Defang is also nice because it will tell you e.g. here’s the size of the backup you’ll be restoring, or the filepath you’ve composed based on your parameters, or confirming that you’ll be replacing an existing thing instead of creating a new one. It has saved me many, many times.
I generally throw up a status-report type of thing - "you are applying $this_operation to $this_many_machines on $this_farm. Continue (yes/no)?" - and enforce full yes/no typing. Anything other than yes is a no.
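i.e. roughly (variable names borrowed from the comment above):

    read -rp "You are applying $this_operation to $this_many_machines on $this_farm. Continue (yes/no)? " answer
    if [[ "$answer" != "yes" ]]; then
      echo "Aborting."
      exit 1
    fi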
Even having a dry run mode is exciting. It doesn't even have to give complete results; just "I was planning to delete 3 files and create 7 files" gives a hint as to whether the command will blow up the system or not.
For interactive queries / surgery, you do have an option with a transaction (begin/commit/abort).
If it is Postgres (I don't know about other dbs), you can go a long way using "savepoints" and "rollbacks" to truly have trial-and-error-safe surgery on the db. Still dangerous, but quite helpful. I hate working on any other db without those features. Postgres also allows schema changes to be within a txn envelope.
Transactions and rollback is the dry run. The problem is that if you keep the transaction open for too long, you will block other updates to the same data.
Yep, I always write any update queries as a rollback transaction with some selects inside it to verify what the data looks like after it's done now, before I switch it to commit. I primarily use Microsoft SQL Server right now, so I also use WITH (NOLOCK) to prevent issues running my query will have with other updates.
Enough folks have replied that transactions are the way to go, but I just wanted to add that whatever interface tool you use for your database may have an option to force you to commit your transactions manually. For example PostgreSQL's default 'psql' shell has the "autocommit" option which, when disabled, requires you to manually 'commit;' before any changes take effect.
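For psql, a small sketch of what that looks like, plus the savepoint trick from the sibling comment (the users table is made up):

    -- with autocommit off (\set AUTOCOMMIT off, e.g. in ~/.psqlrc),
    -- nothing below is permanent until you say so:
    DELETE FROM users WHERE last_login < now() - interval '2 years';
    SELECT count(*) FROM users;          -- sanity check, still uncommitted
    SAVEPOINT after_first_cleanup;       -- checkpoint mid-surgery
    ROLLBACK TO after_first_cleanup;     -- undo only what came after the savepoint
    COMMIT;                              -- or ROLLBACK; to walk away clean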
I think an improvement to SQL would be for insert/update/delete statements to require a where clause, allowing something like 1=1 if you really intend to hit all rows. A safer but even more invasive option would be requiring an explicit end to the where clause as well (to prevent selecting a few but not all constraints).
I like this format in general, since it communicates that the command is severe/irreversible. Heroku implements a similar confirmation when performing destructive actions. Commands require you to pass a `--confirm ${APP NAME}` flag, so the original command itself does nothing. Of course, this doesn't prevent you including those flags in makefiles, etc. I once dropped a table in a side project by accident because I took the wrong tab autocomplete suggestion in a makefile.
I suspect someone who'd do that isn't going to take that or other precautions seriously regardless of it being aliased. It's still a problem that they're circumventing it, but I think you have a larger problem if someone with that mindset has access to production.
Reminds me of the proposal to keep the nuclear launch codes inside the body of an innocent volunteer, so the President would have to kill the person to get the codes.
If you believe we should never use nuclear weapons, then don't have them at all.
If you believe there is a case where it may be moral and rational to use nuclear weapons, why would you want to put a potential barrier in the way of their use? You could have a situation where everyone was agreed to use them but the president was physically unable to harm the aide to use them.
You can know that something is the right thing to do but not have the courage to physically harm someone to do it.
An interlock that you may not be able to unlock for reasons unrelated to the task at hand is a bad interlock.
>You can know that something is the right thing to do but not have the courage to physically harm someone to do it.
In this specific case the "thing to do" is literally to harm hundreds of thousands of people.
The reasoning behind this proposed interlock is that any logic which concludes that it is moral and rational to harm hundreds of thousands of people must also conclude that it is moral and rational to harm the "interlock" individual. Otherwise, it is likely that dropping the bomb would be a mistake.
> The reasoning behind this proposed interlock is that any logic which concludes that it is moral and rational to harm hundreds of thousands of people must also conclude that it is moral and rational to harm the "interlock" individual.
Yes, but you can know it's the right thing to do, but not be able to physically do it.
The president's ability to physically cut someone open is not relevant to whether it's a good idea to use nuclear weapons or not. Him being unable to do it tells you nothing about whether they should be launching the weapons.
If the president fails the test that tells you nothing about whether the launch is the right thing to do. Doesn't that fundamentally make the test bad?
> It is about forcing the president to look somebody in the eye before they kill them.
Right, but can you understand that 'the President being able to look somebody in the eye before they kill them' is not a requisite for 'the employment of nuclear weapons being justified'?
We require the president to be able to do B before they can do A. But what if A is the right thing to do but the President is not able to do B? Being not able to do B does not mean A is wrong.
Doing A cannot be the right thing to do if you think doing B is still impossible.
If you cannot kill your friend in order to kill a few hundred thousand more, how can it possibly be justified? I just struggle to come up with a scenario where that is the case.
Of course I’m of the school that thinks firing nuclear weapons is never a good idea.
But that is the exact point. Having a human interlock explicitly shifts the dependency. Knowing that you should launch nukes is no longer enough and being able to bring yourself to physically kill someone is the additional requirement that we are _deliberately_ adding to this process despite there not being an obvious logical link between the two actions before.
I believe it is a requirement. I believe that the natural bias would be towards using nuclear weapons when we shouldn't. I believe there is no possible world where the use of nuclear weapons is justified and the president couldn't also kill one additional person. I do believe there are cases where a president may use nuclear weapons when it isn't truly justified, and that having additional checks will help prevent that.
> The president's ability to physically cut someone open is not relevant to whether it's a good idea to use nuclear weapons or not. Him being unable to do it tells you nothing about whether they should be launching the weapons.
Our emotional systems are the product of millions of years of evolution and often (not always, but often) show better judgement than our "higher" faculties. Bringing that part of our capabilities into the decision-making loop is a very good idea.
I think it would work equally well if the president had two aides and had to order one to butcher the other, in front of her eyes, in order to launch a nuclear strike.
Regardless of the exact details, I think the point of this thought experiment is that for a head of state, the decision to launch a massive attack that will cause hundreds of thousands of casualties can feel a little abstract. "Bombing a city" can seem abstract, even if the president understands this means killing children. Understanding is quite different from feeling. However, if the act of ordering a bombing raid on a city involved physically murdering a child, it would definitely feel more immediate and less abstract.
Your point stands, of course. But the part about removing the abstractness of the act seems relevant when ordering people killed.
Everybody agrees that this is a nuke-them-all situation, but the president, having been given part of the task of ripping apart a human body himself, thinks more about the subject and decides another diplomatic round is a better option.
I think that's the point. I'm personally not an advocate of this because it seems to be a little too "beat you over the head" with its moral metaphor, but the whole point is that the President should have to personally kill someone to understand the gravity of what they are about to do.
From the perspective of an advocate I'd say: If they can't come to terms with killing one, who are they to execute hundreds of thousands?
> "If you believe there is a case where it may be moral and rational to use nuclear weapons, why would you want to put a potential barrier in the way of their use?"
Because you think the point where they become moral and rational to use is way way way further than commonly discussed, and you want to put many barriers of many kinds (physical, emotional, logistical) to delay their point of use without completely blocking them.
You could also say that if a person is incapable of doing the hard parts of the job, don't vote them into the position. (The downside of that is that you'll end up voting in someone who doesn't mind killing someone in cold blood, while expecting that to be a filter that brings more empathy to the position.)
It's an attempt to make an abstraction concrete. Think of it as the trolley problem in real life.
Stalin is famously supposed to have said, "one death is a tragedy, 100,000 is a statistic". Cynical or not, it is how humans think.
> If you believe we should never use nuclear weapons, then don't have them at all.
Strategic game theory and Mutual Assured Destruction depend on the possibility that the other guy will use them if you do, and may be the only way to prevent their use. Interestingly this is one reason why you want the other guy to know your procedures, capabilities, deployments etc. Secret weapons have no deterrent value.
> Think of it as the trolley problem in real life.
Well exactly... doesn't that show you that it's a bad idea? People don't know if they could bring themselves to throw the switch even if everyone thinks it makes rational sense.
You're taking a rational, well-considered, strategic decision... and making the interlock a messy personal emotional one unrelated to the actual issue at hand. That sounds like the wrong way around to be doing things?
> Well exactly... doesn't that show you that it's a bad idea?
I don't think so, no. Sometimes we think too abstractly and make what turn out to be poor decisions. Emotions are really valuable heuristics and should be harnessed at a time like this.
Absolutely not, mutually assured destruction only works if both sides know that the other is committed to carrying out a retaliatory strike in the minutes before their death. It’s essential that the person in the position to order a retaliatory strike be someone ready to kill hundreds of millions of people for no reason other than the fact that they said they would. Putting emotional barriers between that person and the codes they need to carry out that enormous responsibility just makes it less likely that they will be able to follow through. If there’s sufficient uncertainty about whether there will be a follow-through then the nuclear arsenal loses its deterrence factor and we’re back to having to live with the fear that our rational enemies may carry out a first strike on us.
> Absolutely not, mutually assured destruction only works if both sides know that the other is committed to carrying out a retaliatory strike in the minutes before their death.
Not really. You would need to be absolutely certain that the other party won’t carry out a retaliatory strike before they’re destroyed.
The only thing that matters is that the other party is capable of indiscriminate destruction, not the certainty they'll actually do it.
It’s like punching someone holding a gun in the face.
Trolley Problems are themselves a bad idea... the Kobayashi Maru is a similar exercise. I, like Kirk, don't believe that there are situations that can't be worked around if there is time to think, and resources to act.
Isn't the Trolley problem a situation that is, by definition, time sensitive? If you had more time to think and resources to act, it wouldn't be a Trolley Problem.
If the answer to launch-nukes-by-cutting-a-human-aide is "well, I need more time to think" then maybe that's a good outcome?
It's the 1980s, and the United States implements this policy. What happens on the Soviet side? After the United States' announcement the Soviet press and Soviet sympathizers worldwide gasp loudly in horror. "How cruel are Americans, really? Is the barbaric act of murdering and butchering an innocent young man the only thing still able to keep their president from destroying our Earth?"
The Soviet General Secretary soon receives a report about what the new policy means tactically. Americans will take several extra minutes, possibly more, to authorize retaliation. (The exact delay is subject to disagreement. Secret experiments are conducted to get the timing down. They are inconclusive.) Amid the decade's mounting tensions, a preemptive nuclear strike looks more tempting than before.
Too bad sociopaths and narcissists are more common in positions of power. All it would do is uselessly kill a volunteer.
Time is also of the essence for MAD; known delay only makes MAD less effective if e.g. sub-launched cruise missiles are faster than dissection. And do all the fallback commanders need their own willing victim to mount a response?
Similar idea as GitHub's "type the exact name of this repository if you want to delete it" confirmation dialog. Maybe that's really what you want to do, but in case that's not actually what you meant to do, having a few extra hoops to jump through seems like a good idea.
> having a few extra hoops to jump through seems like a good idea.
I think there is more to it than that. You need to consciously type the name of the repo that you want to remove. Windows used to add a lot of hoops to get something done, and the result was mindlessly clicking the "Yes" button and realizing one second later that you had deleted important information.
Yes, and it needs to be infrequent; the main issue with Windows (Vista mainly) was that the prompt appeared far too often. Even with 7, when you're setting it up for the first time, for example, I think it shows up too often.
Same with Terms & Conditions. If you want your customers to truly have read and understood them, you have to show them a short quiz at the end of it. You're required to do a quiz in Europe nowadays if you want to engage in stock trading.
One of the largest AWS outages to date was caused by a scenario like this. [1] A mistyped command removed too many servers from an S3 subsystem, overloading the remaining servers and crashing the subsystem. The failure snowballed until the entire S3 region was down, which then caused issues with dependent services like EBS, ALB, and Lambda. They couldn't even update the status page because that also depended on S3.
I remember that. The AWS dashboard was all green checkmarks... because the red status icons the dashboard was supposed to display were stored on the crashed servers.
Raskin talks about the futility of this in his book The Humane Interface.
Basically, what happens is the brain switches operating context from "I want to do something" to "resolve this interruption (confirmation box)" and you don't relate the one to the other - you're so focused on getting rid of the interruption that the original task is forgotten until after the interruption is gone.
Then you switch back to the original task that had been interrupted by the confirmation box and then you realize you made a mistake.
It's much better to engineer "undo" ability into systems - like delaying commands (GMail's "Undo Send" does this), or caching previous state, etc.
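To make the "delaying commands" flavor concrete, here is a minimal sketch in Python (the names and the grace period are made up; this is the general pattern, not Gmail's actual implementation):

    import threading

    class UndoableAction:
        """Defer a destructive action for a grace period; undo() cancels it."""

        def __init__(self, action, grace_seconds=10.0):
            self._fired = threading.Event()
            self._timer = threading.Timer(grace_seconds, self._run, args=(action,))
            self._timer.start()

        def _run(self, action):
            self._fired.set()
            action()

        def undo(self):
            # True if we cancelled before the action ran, False if it already fired.
            self._timer.cancel()
            return not self._fired.is_set()

    # pending = UndoableAction(lambda: send_message(draft))   # hypothetical caller
    # ... user clicks "Undo" within the grace period ...
    # pending.undo()

The point is that "send" (or "delete", or "push the change") only schedules the action; the irreversible part happens after the window in which a human can still change their mind.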
That's exactly why it's not a "confirmation box", but requires you to slow down and think for half a second. She even talked about mitigating copy-paste, which is the next obvious way people could habituate.
Also, while undo is great, it's not always technically feasible. The tools in question are basically for modifying the layer that implements undo for your end users, and are often themselves fundamentally irreversible. Undo for raw hard disks involves forensic analysis at best.
The problem (I probably didn't paraphrase Raskin well) is that when you slow down and think for half a second, you context-switch from "I need to do this operation" to "I need to make this dialog box go away".
No matter what tasks are required to make the dialog box go away - doing math, retyping a message, clicking a randomly ordered box - that becomes the top task in your head and you "forget" about the original task until you finish this task.
Once you resolve the interruption, you switch context back to the original task and then you still have that "oh crap" moment.
Yes, sometimes undo is very difficult, and can require a system designed to support that ability as a first-class feature from the start. In many systems you can perform rollbacks, but there are definitely destructive actions - in which case you should have test stacks to validate your actions in advance, and peer review (e.g. dual keys to launch the missiles).
It amazes me that something like this can be done by a single person.
In aviation any time input is given to the machine, it's entered by one human (typically pilot flying) and then verified by the other human (typically pilot monitoring) before being committed to or executed. For example... when a new altitude is assigned by ATC, say FL300, the pilot flying will spin it in the selector window and keep his hand or finger there until the second pilot agrees with and confirms the selection by reading FL300 out of the selector window.
I know there are meat bags in those giant tubes, so that changes attitudes towards safety, etc. However, it seems to me that when organizations start putting the power to halt nearly the entire business in the hands of one person, there should be some slightly different attitudes. A breaking change on a million servers could easily cost hundreds of thousands or maybe even millions in lost revenue or employee productivity.
I'm just an outsider though. Perhaps this level of attention is practiced at some shops. It's just interesting to me how in some fields we settle on pretty uniform standard practices whereas others are seen as non-human-life threatening so it's just shoot first, ask questions later.
Best practice for using the "weaponized" version of the tool when you had powers to actually hit all of them at once was to paste the command into IRC and get some of your fellow peeps to eyeball it and make sure it was sane.
<me> team: hey, sanity check this please: hsh -A "dumb_thing && other_thing --foo --bar"
<teammate> shipit
[ I type the command ]
<me> ok, running as job 1234
The last part was a courtesy done so that they could watch the progress of it too without having to dig to find my request. It also meant they could kill it easily if something went wrong and they couldn't raise me for some reason.
Tools like this are best used outside the solo realm.
I think an automated tool would be preferable, since there is no foolproof guarantee that what you type in IRC is the same as what you type in the terminal.
> It amazes me that something like this can be done by a single person.
In many dysfunctional orgs, having someone to blame is desirable. They will use all kinds of words for it like "accountability".
But at the end of the day, heros who take stupid risks that succeed get rewarded, cautious people that ask questions and try to understand before acting are smugly dismissed, and would-be heroes that burn the house down because of recklessness get blamed and make everyone else look good. It's all too common.
In shops where stakes are high, it’s not uncommon to do just like you said—have mechanisms that force someone else to verify what you’re about to do, before you do it. If someone else can’t verify, the tool will block you. It’s similar in spirit to requiring code reviews on all shipped code.
This is a great idea, and I'd like to point out that having such a system in place would have prevented one of the largest Internet outages in recent memory - the Amazon S3 outage in 2017: https://aws.amazon.com/message/41926/
> At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.
It's kind of funny, since various operations performed in the AWS web console use this model (e.g. type the name of the resource you're trying to delete). As an organization, they're aware of this approach and think it's useful, but (presumably) didn't use it in their own internal tooling.
Terraform prints out the number of resources changed and at least requires a "yes" to proceed. Not quite as onerous as described, but it at least prevents some types of fat-fingering. Basically all changes with Terraform are risky, as they usually involve bringing infrastructure up and down.
Terraform will perform the following actions:

  # google_compute_instance.vm_instance will be created
  + resource "google_compute_instance" "vm_instance" {
      + ... <more>

Plan: 2 to add, 0 to change, 0 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes
This is exactly the problem the author is referring to. With Terraform, you always type "yes" to proceed, so it turns into muscle memory. You stop reading the output, and you're already typing "yes" before you even see the prompt. Terraform's output is also verbose, and many changes show up as "1 to add, 0 to change, 1 to destroy" because it doesn't separately list a "replace" category. It's pretty bad: you've got cognitive overload, a confusing output summary, and a predetermined continue answer. And this is often an action you're performing under duress. I've been bitten by it plenty of times.
A similar system is molly-guard [1], which replaces the reboot/halt/poweroff/... commands with scripts that make you type in the name of the machine before proceeding. Avoids shutting down the wrong machine because you forgot where you SSH'd.
Many years ago, I made that mistake two or three times, rebooting the wrong machine. Since then, I use molly-guard on all my remote machines. Never happened again.
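For anyone curious what that looks like, here is a rough sketch of the same hostname-echo idea in Python (an illustration of the pattern, not molly-guard's actual implementation):

    #!/usr/bin/env python3
    # Wrap a dangerous command: refuse to run it unless the operator types
    # this machine's hostname back correctly.
    import os
    import socket
    import sys

    def main():
        hostname = socket.gethostname()
        typed = input("About to reboot. Type this machine's hostname to continue: ")
        if typed.strip() != hostname:
            print("Refusing: you typed %r, but this is %r." % (typed, hostname),
                  file=sys.stderr)
            sys.exit(1)
        # Hand off to the real command, passing through any extra arguments.
        os.execvp("reboot", ["reboot"] + sys.argv[1:])

    if __name__ == "__main__":
        main()

The value is exactly the "cache miss": you cannot answer the prompt correctly without noticing which machine your terminal is actually attached to.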
The first use of a new security product my manager insisted we roll out (as a duplicate to an existing tool from another group) was to quarantine a change in a system file that seemed to be spreading through all of the PCs.
Except the change it quarantined was to explorer.exe, which was being modified by a patch that had just been pushed out. The net result was about 6 hours of the desktop group wondering "why the hell are all of the PCs not logging in right after this patch", followed by about a month of trickling tickets from seldom-used computers that had been powered off at the time.
His excuse was that it only showed a file hash on the main screen and you had to view details to see the file name, plus he had a 3-day change window open to roll out the system. I never understood how he got away with that one, but such things did catch up to him about 2 years later.
1. Git's force-with-lease. Git push's "force" is too powerful; you will likely regret having this much power, but it's tempting. So force-with-lease is the same power, but conditional on you telling git exactly what the state was that you're overriding.
This has two benefits. One is like Rachel's: it is an opportunity for a human to stop for a moment and consider - wait, why are we overriding this state? To find out what it is, we might as well read it... oh, the state says it's an "emergency fix. Call Jerry". Maybe, just maybe, I ought to call Jerry before I force-overwrite it?
But the other benefit is about race conditions, which Rachel doesn't specifically address. Even if you are very careful to check that the state you want to overwrite with force is indeed a state that should be overridden, nothing prevents it from changing in the meantime, and then you've overwritten state you didn't even know existed. force-with-lease fixes that, because your lease won't match.
I believe force-with-lease is a pattern that ought to be far more widespread. I've used several configuration management tools that let somebody say "Temporarily don't mess with config on these machines", and some of them let you write a reason like "James is rebuilding the RAID arrays", but none of them have the force-with-lease pattern that would let me say "I know James is rebuilding the RAID arrays, this change must happen anyway, but if anything else is blocking the change then reject it and let me know". (A sketch of what that pattern could look like follows point 2 below.)
2. Prefer undo to confirmation. If the computer can undo the action, even if that's a bunch of work and you'd rather not bother, put in that work and enable undo. Humans always "know" they really wanted to do the thing you're asking them to confirm, so it's somewhat futile to ask; but they often realise afterwards that they didn't want to, and will undo it if you make that possible.
Not everything can be undone; undo for a factory reset isn't a thing. But for lots of the things you can't undo, the only reason is laziness, so try to do better in your own software. Your users (which might include you) will be grateful.
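Here is the sketch promised above: a hypothetical compare-and-set style API for a config store (every name here is invented for illustration), which has the same shape as git's force-with-lease:

    class StaleLease(Exception):
        pass

    class ConfigStore:
        def __init__(self):
            self._value = None
            self._version = 0          # bumped on every successful write

        def read(self):
            # Return the current value plus the "lease" (the version you saw).
            return self._value, self._version

        def force_with_lease(self, new_value, expected_version):
            # Overwrite only if the state is still exactly what the caller saw;
            # otherwise reject so a human can look at what changed in the meantime.
            if self._version != expected_version:
                raise StaleLease("expected v%d, store is at v%d"
                                 % (expected_version, self._version))
            self._value = new_value
            self._version += 1
            return self._version

The caller reads (and presumably actually looks at) the current state, then passes back the version it saw; anyone else's intervening change makes the write fail loudly instead of being silently clobbered.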
Related but semi-random: it slightly annoys me that force-with-lease goes through the entire effort of force pushing if it thinks the remote is identical to the local. It’s not going to change anything either way, and it could save me the second or two of waiting on it to do nothing. If local is already identical to the last known state of the remote, and I’m trying to force push, the actual error is that I didn’t edit the local branch in the way I thought I had when I decided it was time to force push.
(I realize there is a possible error message case if the remote has changed... but I don’t feel like this command is the best one to use to discover whether the remote has changed, if you have no changes you actually intend to force push.)
That may have helped when Emory University's IT dept. accidentally sent a wipe and reformat command using Microsoft's SCCM to all of the Windows computers and servers on campus back in 2014.
https://it.slashdot.org/story/14/05/17/051214/emory-universi...
This is a topic near and dear to my heart, as I'm often that person arguing to make something slightly less automated, because the small trade-off in time is insurance against some of the worst mistakes you can have. Automation to the point of removing humans leads to stupid problems that a human wouldn't make if they looked at what was going on. So we automate to the point where we minimize human contact, presenting a summary of actions to which we can apply our wonderful human brains and prevent those problems. Except some percentage of the time we don't actually pay attention, and depending on how the human interaction was introduced in place of complete automation, some percentage (or multiple!) of errors still sneak through.
Automation to the point of minimal human contact, where you assume the human will read the presented information and make an informed decision, doesn't work. The point is that we want a human to understand what is being asked, so taking some step to ensure they do understand is warranted. It will never be perfect, but adding steps like she proposes is definitely a step in the right direction, IMO.
This resonates with me. Years ago I took down a service in a cell accidentally (Googlers might empathize: never 'borg' when you meant to 'borgcfg'). If I had been asked to enter the exact number of tasks I was about to nuke, I might have thought twice ;)
I've certainly deliberately downed an enormous number of tasks, though, as part of a cluster turn-down. I love the technique of requiring the operator to echo a key fact, but in the case you're describing I think the key fact is not how many tasks, but that they're serving live traffic. So:
* You could ask the operator to echo the qps figure...but really any number other than zero is likely to be an error, so it can just error out in that case without needing the confirmation.
* Even if it is serving zero qps now, if it's not explicitly drained at the load balancer, downing it is likely to be a mistake. So even better to check that.
Only once in my career have I taken down jobs serving live traffic. (They were serving 100% errors.) It was deliberate, but even so I wouldn't have minded having to supply a --yes-i-know-im-downing-live-jobs.
edit: and if for some reason my assumption is wrong and downing undrained things becomes routine... well, you'd want to fix that, but as a short-term measure, going back to confirming a number rather than a force option would be appropriate. It's certainly not good to have an override that's routinely used.
The way we approached this on my SRE team was semi-manual with improved ergonomics. We embedded the live traffic graph in the turndown tool, so it would be right in your face before you took the destructive action. Of course it was always possible to go one level down on the tooling and do everything manually, but it wasn't the usual way.
Seems reasonable, but as you might have seen, rossjudson did accidentally-ish go to a lower layer: he wrote "never 'borg' when you meant to 'borgcfg'". And you're still relying on someone actually looking at the graph in their face, which isn't as sure a thing as it would be if they had to echo something back, as Rachel is advocating.
(For the benefit of non-Googlers/Xooglers: borg is a lower-level tool mostly used when everything else has gone wrong and borgcfg is a higher-level, more routine tool. These days people often layer things on top of that as well, because we love piling up abstraction layers. This approach is completely successful because abstraction layers never leak and solve every problem without making anything hard to debug at all. /s)
In my ideal world, even the lowest layer a human ever uses would do safety checks by default. Eg, imagine if the job specification included "query this safety check service on change" and the borg tool (as part of querying the existing job on a cancel/rm command) discovered that and honored it. Most people/jobs would use a safety check that fails taking down a job unless the load balancer reports all relevant services have that job drained. The safety check service could also specify a confirmation prompt (similar to what Rachel is advocating) that could be customizable (like qps or percent of global capacity rather than just number of tasks). The safety check would be effective no matter what layer you use, and there'd be no good reason to use one that would cause prompt fatigue. The outage rossjudson described (and I know he's not the only one who has done exactly this!) would have been avoided.
I really agree with your philosophy here but I've never been able to perfect it in practice. The imperfection comes from the way there is inevitably some mapping of things to other things by name. I can ask a load balancer whether clients of a service are being sent to a named capacity or not (i.e. is the thing I want to remove "drained") but that doesn't rule out the possibility that another service maps a different name to the same backend and I forgot to integrate that name with my automation. Also impossible to rule out that a client exists which bypasses or ignores the advice of the load balancer. Having visibility into caller identity helps a lot with this kind of problem but outside of Google there is a scary word called "cardinality" which prevents people from monitoring the whole caller×server space.
I agree you can never reach perfection. I expect there'd still be postmortems with "Our safety check was missing/bad" in the "what went wrong" section for various project-specific technical reasons. But I'd expect there to be (a) fewer such postmortems, and (b) an action item to fix the job's safety check service specification and audit the team's other ones, rather than the rather inexcusable IMHO "this tool doesn't support those, /shruggie, maybe schedule more training about which tool to use".
I do like this idea; this is, I assume, why GitHub makes you type the repo name out in full. I wish AWS followed suit: when deleting any RDS (database) instance on AWS, all you have to type is "delete me"... very easy to copy and paste, as well as to just know what you need to type and be on autopilot. I have even poked support about it and their response was underwhelming.
At least Facebook (where OP worked), Amazon, Google, and Microsoft. Probably Netflix, maybe Apple. There might be a couple more, but no more than that because we've already accounted for a pretty high percentage of worldwide shipments for servers, disks, etc. Fun fact: when you're that big, your demand creates its own inflation and you have to consider that in projections.
If by "machine" we also mean things outside of a 19" rack, I would wager that large telecoms probably have way more devices running Linux than FAANG. Imagine the network of cable modems that Comcast alone must operate. What percentage of their 28+ million broadband customers rent Comcast owned/managed modems? Almost all of them except the tech-savvy crowd? And that's just one device type.
Thanks - so a handful at most, and the "usual" ones. I always thought those companies kept their machines connected in (redundant) "sets", and that a command affecting all of them was more a case of "never" rather than "once in a while".
Google, at least, has a thing that is supposed to prevent widespread disruption at the machine level, called the "Safe Removal Service"[1]. This is a good idea that in practice isn't perfect. If you write a tool that does not consult SRS, or your service doesn't declare a SRS policy, there can be surprises.
A particular outage that I will never forget took out Gmail delivery worldwide in an instant, because the change was not expected to be disruptive and therefore did not integrate with SRS. As it turned out the change disabled the machines where it was applied, and the process of selecting a subset of machines to canary the change was not independent of the way in which Gmail assigns services to machines, so in the space of a few seconds they created a global outage.
How do you define one location? If it's like, a contiguous plat of land with a bunch of buildings, each containing suites, and each of those containing clusters... then these days, yeah, that's probably not too much of a stretch.
And yeah, physical machines, not VMs. Sometimes they're blades, sometimes they're sleds, but I mean real hardware made out of metal that you can pick up and use to defend the datacenter if you have to.
(Although, honestly, I was talking about global counts in the million+ range when I wrote it since it was referencing the past, but by now, a region with a million+ is not far-fetched.)
Disabling the "run" button for a few seconds was actually done to mitigate another risk -- sites cueing the user to click in a particular location, then triggering the confirmation dialog with the "run" button right where the user was about to click.
Oh god this would have saved me so much stress once. It was early in my career, and part of my duties was to run a merge/purge process on dupe records.
I'd select the dupes for merge using a checkbox, but the vendor's interface for this just had a "confirm" button. So, I confirmed. However I'd selected the "select all" box and.... confirmed. Merging every. single. record. into one (1) record.
I was fortunate, the vendor was able to roll back the changes, and nothing was lost. I also had a very good mentor-like boss who avoided reaming me out before we knew if there was a solution or not, and when there was he simply told me "I'm sure you've learned your lesson, but don't do that again."
> "This might be as simple as printing the number with your locale's version of numerical separators, like "123,456" or "123.456" or "123 456" or whatever else you might use where you are. The trick is then to NOT accept that as input, but instead demand that they remove the separator and jam it in as just digits. "
It's easier to just strip non-digit characters than to parse the input for them and respond accordingly. This is a confirmation step with basically a checksum, so you're not going to get many false positives.
Stripping the non-digit characters would allow "123,456" to validate instead of only accepting "123456" -- which defeats the whole purpose of printing the number with numerical separators (to prevent copy/paste).
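In code, the scheme being discussed can be as small as this (a sketch; the comma grouping stands in for whatever the locale would produce, and the prompt wording is invented):

    def confirm_count(n: int) -> bool:
        shown = "{:,}".format(n)          # e.g. 123456 -> "123,456"
        prompt = ("This will hit %s machines. "
                  "Type the number back, digits only, to continue: " % shown)
        typed = input(prompt)
        # Pasting the displayed "123,456" fails; only the bare "123456" passes.
        return typed.strip() == str(n)

    # if not confirm_count(len(targets)):
    #     raise SystemExit("aborted")

Rejecting the separator is the whole trick: it forces the number through the operator's head instead of through the clipboard.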
Notably, Discord does something like this when you @everyone in a large channel: "You're about to push a notification to 12,000 people, are you sure you want to do that...?"
In this case, usually the very fact that a popup unexpectedly appeared is enough. I use Konsole as my main terminal, and like several other terminals now it has a "You're about to paste 100KB, yes/no?" prompt, and I don't mindlessly click "yes" because it is already a "cache miss" to see that dialog at all.
I've typically used pdsh https://github.com/chaos/pdsh for these types of commands, and I don't think they have any such safety options. The only protection is to be wracked with fear whenever you type pdsh. Obviously this fear wanes with use, and eventually you don't think about a command for long enough before you do it and hit enter on a regrettable one.
Even better than you confirming your own action, is someone else confirming it. If the stakes are high, require two people to turn the keys, instead of just one.
This reminded me that a few years back I worked at a place where (notoriously) Puppet would occasionally go over some random box and remove access to people, just because.
Or to all the machines, on one occasion.
(It was actually some sort of race condition when we massively updated per-project access permissions and asked for SSH keys to be redeployed, but it was annoying as heck, and sure to happen whenever you really needed to access that particular machine.)
But maybe this is enough?
I do this too, but this gives me time to actually read the repo name twice. It's way better than a confirm button for me.
I'm sure it would also wake me up from autopilot. But I don't do this often so I can't really know. It seems like this is good enough for many people, who don't perform this action too often.
I disagree with the many/most. Many/most are probably using uBlock Origin, which doesn’t try to prevent things like blocking pasting (to my knowledge). I’m sure some are using NoScript-like features... but that’s not the same as specifically preventing websites from preventing paste. It’s just a sledgehammer. I can’t name an extension to do that one task (and/or similar tasks) off the top of my head, and I’m reasonably familiar with discussions in these parts. uBlock Origin is known to be very popular, unlike an obscure “allow paste” extension. But, that’s just like, my opinion... as they say.
The point I was making is that copying and pasting seems like more effort than just typing the repo name. Do you commonly encounter long, inscrutable repo names? Do you delete repos frequently enough to have built up the habit of copying and pasting the repo name into the delete box?
If it is common enough, disabling paste would actually benefit the user based on the premise of the article.
This is similar to a UI solution a colleague and I came up with. The action the user could kick off was unstoppable and irreversible (a large batch job), and it seemed like even a confirmation prompt was too easy to simply click through. So we had the UI present a modal dialog asking the user to type in a specific word in all caps to confirm the action. Worked like a charm.
I did a similar thing with a Star Trek program many years ago. One of the commands (22? 23?) was to detonate the warp engines in the hope of taking the enemy with you.
After hitting the wrong number once, I added a confirmation that presented a random six-digit number that you had to enter before it accepted the command.
Reminds me of a study where a test was given with questions that weren't difficult, but that invited silly errors. Around 85% of participants got at least one question wrong, but when the same test was presented in a difficult-to-read font, that number dropped to ~25% or so. That's another way to make your brain work: use a terrible font.
I am so adding this to a query API I have, where it's all too easy to leave off constraints and end up asking for massive data sets by mistake.
Thinking I can probably enhance it by forcing the user to type in the number as text rather than numeric, so they can't cut-n-paste. Kind of force them to type in "I am sure I want all data ever" or something.
I don't think this is useful for an API. This is only useful when humans are the direct users of the component. Automated users, like those of an API, will dutifully provide the required safety value.
AWS sometimes does something similar to this like “enter the name of the thing you’re trying to delete to confirm”. I think it makes sense because you can have such a huge difference between how much you care about certain s3 buckets or CloudFormation deploys etc. In true AWS fashion it’s inconsistent between services though.
To their credit, even if it's unintentional, every time one of those screens pops up I have to stop and think about what I'm doing, because every screen wants something different from me!
Back in the Spiderman 2 days, I worked for a content management company that was supporting a really, really big website. I believe they were playing hosts-file games for Stage/Prod. I was in the room when they demoed something, did a restart of the system - and every pager in the room went off. Yah...
I for one can't fathom any organization managing a million devices / servers / VMs / whatnot. I'm having enough trouble with one, and my biggest employers had maybe a few dozen at best, and they already had a dedicated ops team that worked mainly with infrastructure-as-code.
Once I had to deal with some software-RAID in Linux (mdadm it is), around 2007. There was some -force option that would just print information explaining what it would do and, to perform the real action, you needed to type another flag (that should never be revealed).
I've done this before by displaying the Unix epoch and asking the user to copy/paste that value within a 3-second window as an env var. I.e., if you up-arrow and run the same TIMESTAMP=1603827448 ./foo, it won't work because 1603827448 is now way too old.
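Something like this, presumably (a sketch: the TIMESTAMP variable and the 3-second window come from the description above, everything else is invented):

    import os
    import sys
    import time

    def require_fresh_timestamp(window_seconds=3):
        now = int(time.time())
        try:
            stamp = int(os.environ["TIMESTAMP"])
        except (KeyError, ValueError):
            sys.exit("Set TIMESTAMP to the current epoch (%d) and re-run within %d seconds."
                     % (now, window_seconds))
        if abs(now - stamp) > window_seconds:
            sys.exit("TIMESTAMP=%d is stale (now %d); re-run with a fresh value."
                     % (stamp, now))

    require_fresh_timestamp()
    # ... the destructive work goes here ...

Because the value goes stale almost immediately, an up-arrowed command line from a minute ago fails by construction.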
I've seen this implemented as "Please type: My username is $USERNAME and I will not cry over spilt milk" but that was more to guard against support tickets.
I'm thinking this could also be useful for cases where colleges mistakenly email all applicants saying they'd been accepted, when they in fact had not been.
Promise Pegasus (thunderbolt storage) comes with a GUI that does the same thing - to shut it down you have to type “CONFIRM” before clicking the button
Debian already does this, it asks you to type something like "yes do as I asked" if you want to remove a package that is considered to be part of the core.
It would be neat to print out an esoteric error that gets a single result in Google, where the "forum" in the result has a rando answer about using a certain esoteric flag.
Then you search the logs to see who is trying the command with the esoteric flag and "fix the glitch with payroll" for those employees.
Makes it harder to nest that command inside a script - you have to parse out the number and paste it back? Or do I misunderstand - should it still prompt the user in the middle of the process when that step arrives? That would be problematical if it were included in a web page or whatever.
Cattle, people. Not pets. Just make sure you don't hit all machines simultaneously and are rolling, instead.
Since the post is talking about automation anyway, assume that any machine that can go down will go down. Ensure that any such disruption will be minimal. Oops, you just killed the production database? Whatever, who cares, it has just failed over anyway (or, for a distributed one, a new node was elected, data started replicating, etc).
If one considers having to SSH to a machine to be an anti-pattern, it's amazing how much crap goes away.
In the more generalized case, where it's not about machines, then it makes more sense. Maybe you are running a query that's going to perform updates across multiple clusters. It still should not be done by hand with direct production access - unless you are in the middle of a declared (and urgent!) incident and everything is on fire. In which case there's a bunch of people watching over your shoulder (or more likely, screen sharing in a conference call).
The same job you have (hopefully) run in QA you should be able to re-target to production. Make the question just be a way to "unlock" your automation - for instance, by not copying credentials or environment information until the proper confirmation has been received. One should still have an escape hatch for when (not IF) things go wrong.
[1] https://en.wikipedia.org/wiki/Pointing_and_calling