I've screwed up plenty of things too (rachelbythebay.com)
376 points by Lammy on Feb 11, 2020 | 149 comments



One of the first things my boss told me at my first "real" (good career-type money) job was:

"Everyone screws up, you will too, just be honest about it and tell me and it will be good."

It was a job working with this wonky memory-mapped hardware tied to mainframes. There was no "undo"; as soon as you wrote to memory, there it was. It was inevitable that you would typo something, sometime, in an important system.

Finally, 3 years later, a buddy is talking to me over the cube wall: "Hey, was that #3 you were working on yesterday?" I of course am typing away while talking and say out loud "3? Um..."

So I type something like -RESET SYSTEM 3-

I meant to type 9, a non-critical system. 3 was tied to a data replication system that absolutely had to be running, otherwise all transactions would stop (well, for a bit, while backups took over).

So if you couldn't use an ATM for a huge bank for a little while decades ago (fortunately it was the middle of the night), that was me ;)

I went to my boss the next morning and told him what I did, and he says "This is like your first in 3 years, that's a record or something, it's usually a few within like 6 months. Nice job!"

It was a great place to work: no finger pointing, no big deal if you screwed up. Everyone stuck around working with that team for decades.

When there were conference calls it was rarely stated (if ever) who actually did the thing. It was just accepted that it happened and we could discuss how to prevent it and such. "The engineer" or "the support team" and such were common phrases.

Inevitably folks would ask "who was it" and the answer usually was something like "it doesn't matter".


I’m at a place with a similar policy: you can screw up, just own it as soon as you realize you screwed up, do your best to mitigate the damage ASAP, and see that the fix gets prioritized appropriately.

I still screw up, of course, but this attitude has made me feel less paranoid about performance and less like an idiot when I do mess up. The amount of trust a team needs to have to make this work is really the key, IMO.


This approach has been adopted by the SRE movement under the name "blameless postmortem culture": https://landing.google.com/sre/sre-book/chapters/postmortem-...


How do they convince people not to use failures as political capital against other teams? It's all well and good to say "We promise we won't do it!"; it's quite another thing to actually not call out the other team's failures when they're competing with you for budget/headcount, or against other individuals when stack-ranking.


You do it by management not making things zero-sum, and by correcting/penalizing people for blaming behavior (e.g. saying a root cause is that a team/person made a mistake, vs. a system not preventing or mitigating the mistake).


Ideally by not creating systems where people within the same organization view each other as (more than friendly) competitors, and by weeding out the sociopaths who can’t help but view life through that lens.


That's a good read.

Blame is such a complicated thing, and the results of any mess are plenty complicated even without the human factor.

I'm not convinced that, in any complex system (outside of a guy running through a datacenter with a hammer), anyone can actually "properly" assign blame or fault.

Humans are going to dork up, but there's no point in coming down on someone (or someones) for all the complexity that leads to a problem. Helping everyone avoid it, on the other hand, has a lot of benefit.


One of the first things I tell new graduates/analysts/data scientists is that you're not a real one until you've broken something important & valuable.

It's the breaking of things and seeing it go wrong in the real world that often boosts one above the level of academic coding and model building, because "holy shit, all that engineering and principles stuff actually matters, and the real world is hard, and there are consequences".

But I expect everyone to do it also, because without curiosity and an urge to push the boundaries, you'll be a very mediocre data scientist :p


how did you like working on mainframes?


It was actually equipment connected TO mainframes. It was a weird world, very old, with old school people / processes.

It wasn't where I wanted to be, but it was also my first real job for a guy who dropped out of college (I was a poor student / shouldn't have gone when I was that age)... so I didn't mind at all.

But it was weird, lots of old East Coast customers with old ways. One of our super smart engineers was a woman. She would be on a call and tell them their mainframe was configured wrong, and straight up some guys would ask for her manager every time... every time.

So she would come get me, a lowly n00b with a male voice, and I'd pretend to be her manager looking things over and tell them the same thing, word for word, and they'd believe me. Weird world.


For whatever little it's worth, I have fond memories of my mainframe days.

One of the wonderful things about them is that they never change. You can sit down and, in a period of weeks or months, read all the black binders containing all of the (usually excellent) documentation, and then you'd know everything about it, forever. Close enough to everything, anyway.

Some people might find this horribly boring, but it meant I could easily partition my time between "the familiar thing that always works the way I expect", and "the experimental stuff on the side". Today, those lines are super blurred; I rarely get the opportunity to become a proper expert on anything at all in the web world, so 100% of my time is "omg something's broken and quick, find the quirky part in the experimental thing that's running in production."

And, speaking of documentation, we just don't have anything like mainframe documentation in modern software. O'Reilly books often come sort of close, or used to, but Unisys for instance had something like a 30-volume set of binders, about 500 or so pages each, containing extremely deep, carefully edited documentation on every single system call. Imagine if your favorite web framework had an entire wikipedia, with a big team dedicated just to testing and reviewing every page for accuracy.

And finally, with modern queues, databases, file systems, and -- much as I'm loath to admit it -- systemd, we're finally just now catching up to the disaster recovery procedures that were standard for mainframes in 1995. You could literally walk into the data center and yank the power cord for the mainframe while it was in the middle of running payroll, plug it back in, and it would pick up where it left off without skipping or double-counting a single record.

They had gigantic limitations too. TCO was astronomical by modern standards, so the hardware never got upgraded, so you'd never be able to do any of today's big data stuff on a mainframe. Software development, such as it was, was done in COBOL or JCL or WFL or, maybe, Algol or FORTRAN; git wasn't a thing, and a lot of that software had decades of history behind every single line of code.

But it wasn't all bad.


I find the premise[1] of this post amusing. The sort of comment "You always criticize others, but what about you?" seems to nearly always signal emotional damage on the part of the commentator who felt blamed for something. And that is nearly always a sign of bad management practice.

People don't always get things right, they screw up, they do stupid things for good reasons, and sometimes good things for stupid reasons. As a manager I always want folks to be observant and thoughtful, and I try to keep such things not about "who" screwed up but about how that screw-up came to be (the good or stupid reasons) and how one might have thought about the action ahead of time in a way that would have alerted you to the problem that would result from it.

And the key of all that is making the discussion about how to think so you don't have the problem in the first place, rather than making it a blame-fest on some hapless engineer who chose poorly.

I was fortunate to have a manager early in my career who was very proactive at solving problems and moving forward, not affixing blame. He would say "Ignorance is the natural state before learning, only if it persists in the presence of learning opportunities does it become a problem."

I've always tried to learn by what I observe and what I do, which is why I enjoy Rachel's stories of finding root causes. They teach the principles that needed to be understood prior to the action. All without experiencing the feeling of dread that you've just taken production off line :-).

[1] That being that commenters feel bad that the author doesn't seem to show her own flaws in the stories.


That's something I gotta ask on my way to my next job. My company is extremely conservative in every meaning of the word, and is not unlike patio11's description of Japanese megacorps in mannerisms.

"What happened the last time production went down?" should produce a quite illuminating answer. Do they go through a detailed root-cause analysis? Do they answer with marketing-speak meant for a legally minimal disclosure? Do they blame "that moron" whom I'm meant to replace?

As it stands here, the official corporate policy is everything happens perfectly until someone who shouldn't be there messes things up, and the problem is best solved with a public and angry firing letter. Quoth our business partner: "We're allowed to change our minds, but you're not allowed to be in error. Even if we give you bad data, you're expected to infer proper data and give us proper output. BTW we're not paying for testing"


Here is a screw-up/near-miss share. I was migrating a pensions service to a cloud vendor. Part of this involved a very large ETL. Practice makes perfect, and we ran the custom process regularly to ensure it'd go smoothly on the big day. The last time we ran the practice, I somehow managed to get my prod creds mixed up and started to restore a week-old backup over the top of production. Thankfully the first part of the process is a disk check, and I realised my mistake and cancelled the job before any destructive actions happened. I was minutes away from destroying the pensions records for two FTSE 100 businesses.

Everyone makes mistakes! It's how we learn :-)


Did you make any changes to the setup after this near miss to avoid doing this again?


Yes. I reconfigured the creds and grants so it wasn't possible to repeat the mistake. The lesson learned was about isolation and diligence.


I was once (partially) responsible for the deaths of dozens of virtual machines at a distance of about three and a half years.

Fun fact: none of these VMs had rebooted in that time, or they wouldn't have crashed.

Anyway, back in 2014 or so I dropped a bunch of transmit packet completions. In most cases I also double-completed packets, which was immediately fatal. Kernels get mad about that sort of thing.

Turns out, not all of the affected VMs died. Some of them lived on with head indices forever unequal to tail indices (until they rebooted).

In 2018 a developer realized there was a potential bug in waiting for VMs entering a quiescent state -- a truly idle networking stack had retired all Tx packets that it had admitted. Having unequal indices was impossible under correct operating conditions. They fixed the glitch.

This change rolled out gradually.

Gradually, the kernel panics appeared.

The change rolled back, halting the impact, but then the analysis began. What had we broken?

Another fun fact: Linux often includes an uptime in dmesg logs.

Slowly a pattern appeared. The dmesg logs included unusually large numbers for uptimes. Plotting these, there was a clear cliff in terms of a minimum uptime. Historical deployment logs showed a noteworthy release at that date, years past. Noteworthy in that it was rolled back for my bug, years prior.

On the plus side, I realized this was almost certainly my years prior fuckup slightly sooner than anyone else, so at least I got to call myself out :)


Screwing up things is normal. One thing I started doing is that when a junior member of the team "screws up", I'd laugh it off and tell them about a major screw up of mine.

The thing is that often my screw-up as a junior was worse (short version: broke a key part of the 'boot' system, was detected friday evening, and we had a major scheduled release on monday morning) and it just puts people at ease. I'll tell it in a humorous way as well. It's important (I think) that they don't feel bad about it.

I've often had colleagues join in on the conversation as well. We're human, we'll make mistakes, no need to stress out over it.

EDIT: added a bit more explanation of _why_ I do so.


Great, sounds like you are fostering a culture where mistakes are shared rather than hidden.


"To err is human. To really foul things up requires a computer."


I accidentally recursively chown'ed / to myself on a server in Antarctica that was a critical gateway for our geophysical network, the night before we were planned to leave the ice. Luckily, my flight was delayed... I spent the next day wiping the server and setting it up from scratch, after giving up on trying to recover from all the bizarre problems that stem from owning all the system files (including broken ssh). You should try it sometime!


> You should try it sometime!

An overnight mission at McMurdo or a server reset without internet access on another device?

I'll take the overnight.


> An overnight mission at McMurdo

Depending on the time of year, perhaps. What's the longest possible time from sunset to sunrise at those latitudes?


179 'days' (4296 hours total). Missions vary in length though.


I did something similar on a central box used for mail aliases, cron jobs, and kerberos called "cartman" by copying and pasting a cronjob line at a root shell. I only noticed when things started failing, and spent the next day rebuilding the permissions all over the disk.

chmod -R is powerful. :-)


I did this on a VM once - it's fun for about 15 minutes... then burn it all and start from scratch.


Is there a way to save and revert such metadata? I can’t think of anything that stores just FS metadata.


FreeBSD does! You can rescue yourself from this with mtree. Something like:

    cd /
    mtree -U -f /etc/mtree/BSD.root.dist
    mtree -U -f /etc/mtree/BSD.var.dist
    mtree -U -f /etc/mtree/BSD.include.dist
    mtree -U -f /etc/mtree/BSD.sendmail.dist
    mtree -U -f /etc/mtree/BSD.usr.dist


I think rsync has an option to copy everything but file contents. Couple that with 'find' running 'touch' and it should be fine. (create blank files, copy attributes to them)
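
To make the idea concrete, here's a rough sketch of the save-and-restore half using plain GNU findutils instead of rsync (untested; the backup file path is just an example, and filenames containing newlines would break it):

    # save: record mode, owner, and group for every file on the root filesystem
    find / -xdev -printf '%m %u:%g %p\n' > /root/fs-metadata.txt

    # restore: reapply the recorded mode and ownership to each recorded path
    while read -r mode owner path; do
        chown "$owner" "$path"
        chmod "$mode" "$path"
    done < /root/fs-metadata.txt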


Sysadminning in Antarctica. This really speaks to the imagination! Can't imagine how different it would be from my normal sysadmin work.


You should meet Holly, who was the main McMurdo sysadmin most of the seasons I was there. I can't remember his last name but that guy was really wonderful. Super competent, friendly, and seemed to love his job.


Wish I could! But my Google-fu turns up empty. You'd think working in Antarctica would make it easier to find someone on LinkedIn, for example.


When I worked at Twitter we wanted to find out if we were finally going to survive New Year's. Our head of SRE wanted me to test in prod, which horrified me, but he convinced me in the end. In order to simulate the population of Japan I had to make a bunch of fake users. I spent a fair amount of time making sure they wouldn't get caught up in any analytics, throwing off our active user numbers, and I managed to peg the 'follows' service getting them all to follow each other in a reasonable distribution. I also needed to bypass the rate limiter, but since I was in prod I could just reset my own counter and effectively be totally limitless.

Two things broke in a visible way during all of this. During testing everything was wired up to my personal account. I managed to spam all my followers with thousands of happy new year tweets in a couple seconds since I wasn’t subject to the rate limiter. I deleted all but one of those, which I left to remind myself that with great power comes great stories of things going wrong.

The other thing was a bit more dramatic, albeit short-lived. The first big test had everyone ready to go. I hit enter on the job, and at the time (maybe still) I had no way to get metrics out of production at a granularity less than one minute. A very worried minute goes by, and then we realize I’ve DDOSed the authentication service. All my fake accounts needed to auth to actually tweet, and naturally they did that first. Since the whole point of the test was for load to all hit in roughly the same second, the auth load also all arrived in the same second. Oops.

We decided that was an unfair test, I spent a few hours getting auth tokens for my fake users, and we tried again. That time everything worked, and we also survived New Year’s... But it was fun getting there.


The auth DDOS sounds familiar. Was there a blog post or previous HN about this?


Not from me, and I don't remember one. Honestly while it did cause a brief problem, it didn't seem particularly noteworthy to the public.


I once accidentally deleted a large part of a production private nntp spool (used as a tech support forum for a commercial product) while trying to get replication working to a backup. The same day, I did a recursive chown on a different production server from /. Worse yet, I was so distraught that I left the office without telling anyone in a position to fix it. Since then (over 20 years ago now) I always double check where and who I am before doing something destructive and dry run if possible, but more importantly, clearly and quickly communicate when I screw up.

I ask junior folks what their biggest (technical) screw up has been, in interviews. I think it's a bad sign if they won't admit to it or claim they've never screwed up big-time.


Hopefully you ask this to senior folks also. Folks without much experience or responsibility may have not screwed up too badly yet, but I agree it would be a bad sign for anyone with significant experience to have never broken something.


I ran a SQL script that migrated a database. It all looked as if it had worked perfectly, but the category IDs had changed. The main website handled this fine, but I found out a separate system that sent out daily offers by email proudly advertised Domestos bleach as the drink of the day.


Well as an idiot who screws up a lot, I can tell you it is a blessing in disguise.

Screw-ups lead to uncertainty, and research suggests we learn best in uncertainty - https://www.aau.edu/research-scholarship/featured-research-t...


Oh god. I did the binary tree thing, too, and also in C. We needed a symbol table for a thing, and I assumed symbols would come in in random order. Oops. Someone suggested AVL trees. The reference I used (which may have been Knuth) left delete "as an exercise for the reader." That led to my next big oops: pondering how to delete from an AVL tree while slicing onions for dinner. Lots of blood. I still have the scar.


"ifconfig eth0 down" on the production bastion host, instead of on my localhost terminal -- and no hands on in the datacenter which was 160km away. Of course the bastion host was the only one not hooked up to remote power reset services.. and only 2 hours left in the service window.. sinking feeling


For tmux users: put something like this in your .tmux.conf on production servers:

    set -g window-style 'fg=red,bg=black'
It will color the text red, hopefully reminding you to be extra careful. Adjust according to your preferences.

One other "defensive scripting" trick I frequently use is starting any `rm` command with `ls`, double checking its output (or triple checking if it's a recursive one), and then replacing `ls` with `rm`. It barely takes any extra time if you're proficient with emacs-style readline hotkeys:

    C-a M-d rm C-m


In this vein, I set my PS1 to bold red capital letters on bastion hosts and alias sudo="echo 'You're on a jump box moron :p'"

I do the tmux color trick too -- color-coded by environment for each bastion.
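
Roughly, in ~/.bashrc terms it looks something like this (the colors and wording are just placeholders, and the quoting needs a little care so an apostrophe doesn't end the alias early):

    # loud prompt on the jump box: bold red, all caps
    PS1='\[\e[1;31m\][BASTION \u@\h \W]\$\[\e[0m\] '
    # tripwire instead of real sudo
    alias sudo='echo "You are on a jump box, think twice :p"'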


Something like this to obviously distinguish environments is good practice. At one company we implemented scripts for terminal color-coding like this after some downtime caused by a destructive backup restore accidentally being run in the production environment instead of an acceptance testing system, which was in all aspects identical to the production system.

And I've seen a solution for more secure environments where physical separation was used: the operator had separate monitors/keyboards, and the "important" system had a different color keyboard and monitor frame.


That's nice — I usually start my commands with `# ` but then there's no tab completion. I'll try `ls`


We do this for our database connections in Azure Data Studio / the Windows version of it. Really great idea.


I usually use echo for the same purpose.


Oh, yes, that feeling when you mix up “init 5” and “init 6” on a machine 300 miles away. “Huh, it’s taking longer than usual to reboot...”. 20 years later and I still remember the name of the guy I had to call at two in the morning to go power it back on.


I've done that a couple of times, although in better circumstances; also fun variants like "the new kernel didn't have the right ethernet driver in it".

I think the first one in my career was discovering that "killall" does something very different on Solaris from Linux.


Oh nooooooo


Back in the 1990s, something like that led to a major local data center having to admit that the "24 hour support" meant 12 hours a day with someone who could answer the phone but didn't have access to the server room.


Or rebooting a container. But you’re not in the container anymore but on the host...


I once ran a script on production to re-push some old data for a customer based on log entries. This script used the log timestamps to decide which data to re-push. I didn't realize that the timestamps in the log files were UTC, and I just ran it with the default timezone provided by the library (which is the one the host system uses). Lucky for me, the system's default timezone was also UTC, but nonetheless, the moment I realized it and the 10 minutes it took me to read the documentation and to check the host system's timezone felt like hours.

You live and you learn, I'd say :)


Early versions of the Phoenix framework's ORM would select every record if you didn't pass it an ID. I didn't know that and wrote a deletion endpoint, forgetting to put said ID in. Tests passed (I mean, it did delete...) and off to prod it went. Long story short: I deleted data for all our users. Thank God for backups.


I've seen this in 2019! One of our API partners has a deletion endpoint. It's

   .../delete/:id
If you don't pass an id... it deletes all records. Because that is a thing you would want, rather than a bug where you somehow got a null id.


Yikes. So much for failing fast.


I dunno. I bet a lot of things failed very fast.


At one point, we had a "do this thing on every machine" tool that interpreted --regions="" as --regions="*". Guess how we discovered this? Oops.


I managed to send the entire Google datacenter backbone through one 20Gb link in Finland. This did not spark joy.

Search SRE had a 5lb bag of shredded money as a "gag gift" that was given to whoever caused the most recent outage that impacted search ads.


I used to work in Market Research R&D, in a non-technical role as a project manager. I was deployed to a project that tried to use affective computing (i.e. emotion recognition) to understand consumer responses to advertisements. It was a total disaster.

We'd hook respondents up to a webcam and record their facial expressions as they watched a series of videos. The vendor's emotion recognition machine learning software would then basically assign scores saying that at this second, the viewer expressed xyz emotion.

The project failed for two reasons. One was that the theoretical link between the expressions people were presenting and their actual emotional response to a particular piece of media was not fully proven, which meant the model output was not particularly helpful from the beginning.

Secondly, and this is really important: the model was trained on images of Western faces (i.e. white people), and because our target audience, Southeast Asians, emote very differently, a substantial chunk of the output data needed to be trashed (it couldn't process darker faces well, it interpreted a grimace as a smile, etc.).

So there you have it. This was something I should at least have anticipated, and I got in a lot of trouble.



"I'm an expert because I've made all the mistakes you can in a narrow field."


Are you Niels Bohr?


I've accidentally taken out a television station before. I was remoted in and clicked on the wrong thing in a new platform I was being walked through and exploring. All of a sudden MeTV in the Dallas/Fort Worth area went down in the middle of the afternoon and a LOT of very angry people began calling in, but we had it running again moments later. If you are doing a demo of a live enterprise solution -- probably shouldn't click around to see what happens ;)


The way that I learned to write the WHERE clause of a SQL update statement first was by updating an entire column of a very important SQL table in a TV station's automation software database.

I also took CNBC off air briefly, although that was their man's fault as he told me to unplug the wrong video server.


Internet killed the video (star)?


One of my favorite job interview questions for sysadmins is asking about a time that they screwed up and broke production. If they don't have one, then it makes me nervous. Either they are lying, or they don't have enough experience, or they will be too conservative and will block all progress.


I once forgot that a vendor added their own adjustment to a bidding algorithm we had in place. It was significant for certain regions. I created a bidding model without taking the adjustment into account, pushed it to production, and spent ~30kUSD extra in a few hours before anyone noticed the unusually high bids coming from our vendor. We put controls in place to prevent this afterwards ;)


One of my screw-ups was when I was SYSAD (head admin) for the Prime 550 at the UK office of a large firm of consulting engineers.

We had our field engineer in doing a PM, and he needed a scratch disk, so I said "oh, you can use xxxx" and pointed at the sticky label which had all the disk IDs on it.

Turns out that someone had been using this for a big GIS project in Amman, and we ended up wiping 6 months' work.


Oh no. Did anyone manage to salvage any of that?


Ah well, we had some maps printed out, so we could redo it from those without having to fully redo all the work.


> I now put a mollyguard over those things any time there's any chance of them being exposed and having unscheduled activations.

That's what differentiates a good engineer from a not-so-good one - they learn from their own mistakes!


I recall that at my first job one of our computer rooms had the emergency stop button on the wall - just at head height.

One time I or my boss (I can't recall who) stepped backwards and hit the off button with his head - we had our electrician fit a molly guard after that.


Several jobs ago, we had a datacenter where the PDUs for the racks were mounted at the very top. Normally, this wasn't an issue as this was well above head height...

One day we hired an engineer who was a Sikh. Turns out the PDUs were almost exactly at turban (dastar) height. Cue the outage alerts (and the installation of mollyguards).


I once removed a single unused DNS record that resulted in a 10gbit/s DDOS consisting of lookups for “a”, “aab”, “aabaaaa”, “aabaaac”, etc. (Hint: Perl.)


My perl is super rusty, but you've got me curious...what happened?


One of my favorite questions to ask technical candidates is:

Tell me about a time you made a mistake that you thought was going to get you fired.

1. Everyone has one. If you don't, you haven't been doing this long enough and I want you to make a couple of those mistakes elsewhere first.

2. If you didn't learn anything from it, you're going to make that and bigger mistakes in your hubris. I'd rather you do that elsewhere.


The younger you are when you commit the mistake, generally the more fearful you are of the result of that mistake. That same mistake might get a "well that sucks" from someone who's been around the block.

Tech for 12 years, and I've never made a mistake disastrous enough to be fearful for my job. The worst cost ~$20k in hardware (a couple of server CPUs). I told my manager right after without hesitation (this was also at a startup).

I would not stress that much anyway now if it were to occur. Having been through mass layoffs from startups twice before, you change and become hardier. I will be careful but will never be fearful of employment. Short of doing a Desk Pop[0], I'm falling asleep every night with both eyes closed. Life is too short as is. Let me go and I will spend my next morning on a nearby beach with a good book.

[0] https://www.urbandictionary.com/define.php?term=Desk%20Pop


Just curious, how did you destroy the CPUs? Was it physical maintenance or did you set the wrong voltage/etc (usually this happens during overclocking in consumer-grade machines which is why I’m curious how this happens on server machines).


I've made mistakes, and even big ones, but I've never worked for a (technical) employer abusive enough that I thought they were going to fire me for a single mistake.

It's not that they don't fire for mistakes, it's that they work with you to correct your behavior first and they give plenty of warning to someone who is in danger of that.

And I've never been given that warning.

edit: Added the word "technical". I've worked for companies that would fire at the drop of a hat, but they were all retail minimum-wage jobs.


Hmm. I'm not sure I've ever had a screw up I thought would get me fired and I've been in tech for 13 years. I've definitely had some pretty big screw ups (turned the lights off for over 200 people, see my reply to parent). I guess I've always had managers who had my back.


I've made mistakes, though thankfully no huge outage-causing ones. The only "I'm going to get fired" mistake was a political one, where I accidentally emailed a report to someone, and was informed after the fact that it was a very bad thing. It's also a fallacy to believe that making mistakes in the past and learning from them precludes making many novel mistakes in the future. Kind of like the disclaimer that's always given about investing. Both the individual and the organization must simply bake in as many failsafes in their processes as possible. And even then, sometimes Murphy's Law manages to slip through all the holes at once.


Every couple years a "failure resume" gets trending on LinkedIn or reddit, and I always love reading the comments.

It's also a refreshing reminder that "just because someone is successful and has a great resume doesn't mean they're flawless".

Resumes and LinkedIn profiles are like Instagram posts - enhanced to bring out the best aspects and with enough photoshop/makeup to hide the worst.


A little over a month into my first engineering job, I decided to go for a weekend stroll. I threw my work laptop into my messenger bag on the off chance I'd wind up in a cafe and feel like checking email or poking at some code. Started the day right with a hearty breakfast burrito, popped into the health food store down the street to pick up a couple bottles of local kombucha for later (gotta have those probiotics), and off I went.

After two or three hours of exploring, I noticed something weird: it was a sunny LA afternoon, but I felt something like a drop of liquid hit the back of my leg. I kept walking, but felt another drop, so I stopped and checked. Yep, definitely real and definitely liquid. Also, it smelled like vinegar. Where was it coming from? Who would do such a thing, and how?

Perplexed, I walked on, until my bag started emitting a drawn-out Mac startup tone, and I realized just what I'd done. I opened it up, and sure enough: the seal on one of my kombucha bottles had failed, and its entire contents had emptied into my new work laptop.


I once wrote a new version of a config generator and pusher for a small part of a major service. I knew data pushes were the largest global outage vector at my company, so I wrote carefully conservative validation logic and unit tested it. But I never tested what the caller did when the validation failed, and I had a dumb mistake there. It pushed an empty file, which was worse than pushing the allegedly-invalid config. Oops. That was a ~30 minute outage of the aspect of the service controlled by this config.

Of course an outage is never caused by one mistake. That mistake was mine, so I felt badly about it. There were also mistakes in code reviews, validation in the part receiving the config, and operational procedures. And then the big one: the company as a whole was in this awkward phase where everyone knew quick global pushes were bad but there wasn't good common tooling to support doing staged config files easily. That was the worst mistake behind dozens if not hundreds of major outages.


I was once working on a live MRP server after hours. It was needed to do everything from customer service to shipping to tracking work in progress. So if it goes offline, they would basically have to shut down until it was back.

I needed to reboot at one point and when I did, it started giving me "boot disk not found". I couldn't get it to boot, at all. It seemed the boot disk was corrupted.

I was literally in a cold sweat for 2 hours, late into the night, until I finally noticed that I had left a diskette in the drive which was causing the bios to try to boot from there first.

I have had plenty of other cases where I actually messed something up. But that feeling you get when you think you have irreparably broken something is so terrible.


Some of mine:

Was testing code and pushed a file to FTP 2 days early... the vendor picked it up and processed the file... the people who signed up in the next 2 days were in the file pushed later, but the vendor had already processed the earlier file, so they didn't get their metro cards that month.

Somehow managed to rebalance the underlying components for a Trendpilot ETF monthly instead of quarterly... the daily audit that compares the values on NYSE vs. in our DB caught it... lucky for me there was no money in it yet.

Dropped a table once at lunchtime, right before taking a bite of my sandwich... did the restore within 10 minutes, didn't eat lunch that day... lost my appetite...

In an ETL tool, hardcoded something to test... left it there when running for real.


Luckily I learned from a young age. When I was 7 or 8 I was using the computer my dad used to run his company. It had dual 3.5" floppy drives, and a new (to me) hard drive. Needing to format a floppy, I opened the format utility and for some reason I thought I should choose "hard disk" (because it wasn't floppy?!? hmm) when prompted.

So I formatted the "hard disk", and for some reason my 3.5" floppy wasn't formatted. So I tried again and again to no avail and gave up.

The production manager came in to work Monday morning to a fresh hard drive. Some things were backed up and some things had to be recreated.

The outcome of this necessitated learning a new skill: bypassing passwords.


> I was home alone as a kid, watching some movie on TV. I saw some guy grab a beer can and do that thing where you jam a pen in the side to make a hole, and then crack open the top.

Given what I think is her age (judging from using C64s and whatnot) I'm going to go out on a limb and guess this was "The Sure Thing" [0] with John Cusack and Daphne Zuniga. It's a great movie if you haven't seen it.

[0] https://www.imdb.com/title/tt0090103/?ref_=nv_sr_srsg_0


Okay now I have to go check. Thanks for the tip!


One of my favorite recurring team conversations is the one where everyone shares stories of the outages they've caused or the systems they've broken. This conversation has happened eventually on every SRE (sysadmin/PE/devops/whatever) team I've joined, usually when a junior team member causes their first outage and is having an emotional meltdown. I remember my own meltdown of that form, and I remember it helped hearing about the terrible problems my friends and mentors had caused in their turn.

The first outage where I thought I was going to get fired: I was working on a system that had a single-point-of-failure server, and through a mishap with rsync I accidentally destroyed the contents of /etc. That SPOF also had no backups. (I'm not claiming it was well-designed...) Thankfully the job that depended on that server would not kick off until morning, so my team slowly reconstructed its functions on a separate machine and swapped it in behind the scenes. I helped as much as I could while vibrating with anxiety, and my team was incredibly kind throughout. I was not in fact fired. :-)

The most recent outage I caused? Yesterday! I accidentally rebooted most of the machines in a development cluster. It's a dev system, there's no SLA, on the whole I don't feel horrid, but it definitely ruined a few people's work for an hour. This morning I spent a few minutes putting in a guard rail to prevent that particular mistake again...

If you're in this job long enough, everyone breaks things -- it just happens.


Adam Savage and Matt Parker recently had a conversation that spent a lot of time covering the topic of "screwing up" and how we should respond when we do (Matt's new book is about math screw-ups that have had real-world consequences). It's a great interview in general, in my opinion.

https://youtu.be/ig-2xlXfex4


In a major production launch, we moved traffic between two versions of a backend with a blue/green deploy. The new version was hosted on Kubernetes, and I was pretty new to using it in production. The changeover went well, pretty great actually. The problem came up the first time we deployed to the new infrastructure: we saw a huge spike of connection disconnects. We did not get a good answer why at the time, except the vague sense that the deployment had gone a lot faster than we intended.

The second time we deployed, I happened to glance at the deployment size immediately after deploying. For about five seconds, our deployment size went from 100 down to 2. The reason for this was simple: The "Replicas" count was specified in the deployment spec, and it was set to the size we used in our staging infra. That had been fine in prod, and was quickly overridden by our autoscaling configuration, but it did cause the Kubernetes infrastructure to take down every existing pod (minus two), then bring up a bunch of new pods very quickly.


The true measure of experience is the depth and variety of our screw-ups, and the quality of one's character illustrated by what we take away.


One thing I really pride myself on is that because I screw up so often I have a really good intuition for how things get screwed up.


So then there was the one time I was engaged in sysadminery and had my cow-orker sitting next to me while we were trying to debug some issue. She says, "Hey, is there anything useful in the README file?"

I immediately typed "rm README" and hit enter.

Then I crawled under my desk and wouldn't come out until we'd gotten the file restored from backups. Naturally, it had no useful information in it.

Then there was the time, for no readily apparent reason, where I typed "DELETE * FROM table" (in the dev database). Fine, I thought, it's time to go home, and submit a request to get the DB restored.

It turns out that they kept one (1) day's worth of backups, which they took at 6:30pm or so. I submitted the request at about 6:00pm and the DB guy had already gone home; he did the restore about 7:00am the next morning. Yes, he restored an empty table.


In another comment, I pointed out a mistake of mine that was a major factor in an outage.

I also screw up all the time in ways that would cause outages, except we have automated tests, tsan/asan, code reviews, a staging environment, various safety checks, experiment gates, pre-mortems, slow rollout procedures, an alert on-duty SWE and on-call SRE, etc.

Today one of my mistakes was caught early in the prod phase of our push. That's much later than I would like but still before it did any real damage. I submitted the bad code last Wednesday and have been out sick with the flu (and caring for my preschool-aged kids) since then, so my awesome team handled my problem for me.


Well, there was this one: https://www.theregister.co.uk/2018/04/16/who_me/

Then there was the time I broke e-mail for Global Network Navigator, which was a partnership between O'Reilly and AOL. Lost all e-mail for over a million users on what was then the first nationwide ISP. I also submitted that one to The Register as well, but they haven't published it, at least not yet.


Another small screw-up anecdote: I once tried symlinking a file into my home directory, only to realize I had actually symlinked it into my current directory as a file called '~'. I did the only sensible thing I could think of, which was to run `rm -rf ~` to get rid of it... After about half a second I realized what I had done, but by then enough of my home directory had been wiped clean that I needed to restore from backup.

Always a fun one to share. :)


My password update script for one site used this SQL:

    UPDATE Users SET Password=?

We had backups. Selective restore. 7 accounts that were new, not in the backup, got a special flag that required a reset.


Very similar situation here, although the query had a syntax error, which meant it didn't go through; but initially I didn't realise that and had to post the dreadful question on Slack: "do we have backups of the production DB?"

Years later I still consider it my biggest screw-up. Everything else can be explained by bad processes, documentation, etc., but this one is just me being stupid.


I love the depiction of 'The One'. Thank you. Seems like the person you're often asked to try and prove you are in interviews if you want to be employed. Your publicly available code, depended upon by serious people across the world, is overridden by your performance in this short, high-stakes moment we've ginned up. I've tried to do better. It's hard, and I've made some bad hiring choices.


Non-tech, but a leadership-related screw-up for me. I started a surprisingly popular motorcycle group ride. Around 20 people show up to the first ride and I'm a bit nervous. I forget the route pretty quickly and we all get lost and separated. One person crashed. One person got a big ticket for speeding/improper parts. The ride back was nice though.

Overall, one of the craziest days of my life.


Remember when GitLab had their famous DB incident? From that we had some sort of an inside joke at my then-workplace: if you're gonna do something big and potentially prod-breaking, just "don't be _that_ guy" (said in the same spirit as "break a leg").

I became _that_ guy.

My then-workplace didn't always have enough funds, though as an employer they were generally generous, especially considering their actual finances. This is relevant to the story because this employer:

1. was very lenient when it came to office attendance. So we frequently worked remotely at odd hours; that was normal. But as a matter of professionalism, I always tried to be conscientious when it came to the hours I put in. Most weeks I probably did more than usual, the merits of which are another discussion entirely.

2. periodically organized events to promote the business. But being short on funds, they didn't have money to hire an actual photographer. So they'd ask me to shoot because I was interested enough in photography to, at the very least, have the gear for it.

The day I became _that_ guy, they had this event I was supposed to shoot, but they communicated the time really badly to me. I expected to be able to do at least three, maybe four, hours of work before I was needed with my camera. This is what I communicated to my TL.

Turns out they needed me _earlier_, such that I only had an hour of work done so far. Again, office culture was lenient about such things so my TL didn't really mind if I left then. The event was some kind of a big deal besides.

I'd generally start my "hours" in the afternoon, way after lunch. So by the time this event was done, it was already pretty late in the evening. I had my dinner and received a message from my TL. Nonverbatim:

"Hey can you update PostgreSQL (9->10) tonight? It shouldn't take too long and here's the steps..."

It was still within my "usual" working hours, but a couple of things that night made this request end in disaster:

1. I was tired from the event. Honest to goodness tired. I should've called it off when I couldn't even entertain myself enough to stay awake waiting for one of the given steps to finish. But I didn't because...

2. I didn't have the heart to beg off on this task when I'd only done one hour of technical/engineering work for the day. To be fair, my TL always abided by the rule "Don't touch prod when tired; you will make things worse". Pretty sure he would've understood if I'd explained the state I was in. We could've done it the next night. But when you're tired and embarrassed at having only done one hour of work for the day so far, your decision making is exceptionally unsound, for lack of a stronger adjective.

Unfortunately the technical bits of this story get fuzzy; it's been two years. But two years ago we had just migrated to Kubernetes, and a couple of months in, the team was still adjusting their mental models from servers to containers/deployments/statefulsets/pods. From just thinking about HDD vs. SSD tradeoffs to Persistent Volume architecture issues. This is also why upgrading Postgres was such an ad hoc process for us then. We simply didn't know better (if something not "ad hoc" even exists).

Part of the instructions was to "delete the old data directory of Postgres" (cue: I have read this in a postmortem before...). Because I was tired and lazy, I wrote a script so the update could go on without my (much needed!) supervision. The instructions were sound, and the deletion would've been safe, assuming all the steps prior to the deletion finished successfully. They did not, and I did not use `set -e`. Which meant I just deleted all the prod data on the master. I was efficient. The realization woke me up harder than sugar ever did.
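
For anyone unfamiliar, a defensive preamble like this is roughly what was missing; the path handling here is just illustrative, but the point is that the destructive last step can never run after an earlier failure:

    #!/usr/bin/env bash
    # 'set -e' aborts on the first failing command, '-u' on any unset
    # variable, and 'pipefail' makes a failure inside a pipeline count too.
    set -euo pipefail

    old_data_dir=${1:?usage: upgrade.sh /path/to/old/data/dir}

    # ... the actual upgrade steps would go here; any non-zero exit stops us ...

    echo "upgrade finished, removing old data directory"
    rm -rf -- "$old_data_dir"   # only reached if everything above succeeded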

To cut this already long story short, I at least had the sense to concede at that point and wake up my TL with the bad news. Much like the rest of this story, what saved me that night came in twos:

1. I at least had the sense to put the site into maintenance mode.

2. I used `rm -rf`, as opposed to issuing DROP statements to psql. Which meant that my fuck-up did not replicate. So we just promoted the replica to master and downgraded the master to replica and monitored replication.

These two together ensured no data loss. Apocalypse canceled. Everyone in the company went to work in the morning none the wiser.

This story actually had a less fortunate sequel but that story is not for me to tell. And besides, I've written long enough.


What does SEV stand for?


Depending on who you ask, severity, site event, serious event, or something else. Rachel probably picked up the nomenclature at FB, and even there the origin of the term is kind of lost in the mists of time.


SEVs are severe on-call issues at FB. They look like SEV3, SEV2, SEV1, etc.

Other places may use similar terminology but OP is at FB.


> Other places may use similar terminology but OP is at FB.

She's not worked there for nearly 2 years now: https://rachelbythebay.com/w/2018/03/10/free/


yeah, she's at Lyft, I believe


Some places use, e.g., P1/P2/P3 instead of SEV1/SEV2/SEV3.

The classification can be in terms of impact to the business. E.g. "P1" could be reserved for issues severe enough to prevent the business from functioning as a business (e.g. the bank that cannot process customer transactions, the CDN that cannot distribute content).

P3 might mean some features of a service are broken and 1% of your customers are pissed, but there is a workaround available or the features aren't really critical.


'Severity'. It normally means there's a severe issue ongoing. They even have levels: SEV3 -> SEV2 -> SEV1 goes from lowest to highest severity.


If memory serves, it's Facebook's incident response process. I don't know that it stands for anything specific.


"SEVerity" level I think...


I've only made one technical screwup in my career, and a minor and fixable one at that, but it left a deep impression.

These days, when running as root, I concentrate hard on every command, asking myself whether this is really exactly the right thing in every respect.

When my hands start shaking, I know my mind is in the right place.


Either my memory is much worse, or my level of screwing up must be much more impressive. I feel like many of those wouldn't be notable enough to stick in my memory years later. There's no way I'd remember some random electric shock from decades ago...


On blowing the C64 fuse, there was a commonly known soft reset technique that involved crossing two connectors on the game port. Cross the wrong ones and you would blow the fuse.


i love this blog


I used to do plant floor support in an automotive assembly plant. Think desktop support, but super extra. Think multiple serial single points of failure, on a one minute metronome. Here are some things I've screwed up.

---

A simple, early one: We used VNC for remote desktop support of line-side production computers. One of my team leads was walking me through what was on the screen and what was going on. I was used to right clicking on these screens to see more of what was going on, but this one happened to be running a script that was interrupted by this. When I right clicked, my team lead freaked out, and the operator on the floor freaked out, and started moving the mouse themselves and clicking everywhere. After a while they started just doing their job again, but shortly after we got a call from a supervisor.

---

I got a call saying tracking was off on a production conveyor. This means that operators were getting incorrect instructions and work was being recorded on incorrect units. I adjusted tracking to match what I was being told. All good.

I shortly got a call from the same conveyor saying tracking was off. I told them "Yes, I just fixed it". "Well it is wrong now, it was fine a minute ago". So I adjusted it to match what I was being told now. Who knows what that other guy was smoking.

Right as I finished re-adjusting tracking I got another frantic, high energy, expletive filled call saying the tracking was off.

Dear reader you may have guessed what was wrong.

Since I got multiple sets of contradictory information, I decide to go out to the floor. This is what I see (simplified):

Footprints 1, 2, 3, 4, etc.:

    FP1 FP2 FP3 FP4 FP5 FP6 FP7 FP8
    008 007 006 ___ 005 004 003 002

You see, the first person was on the first half of the conveyor, and the second person was on the other half. They were both correct, but neither had the full story. There was an empty carrier in the middle of the conveyor.

---

Last one. One of our weekly ops tasks was to verify that the 3 (three!!) scheduling services agreed with one another and also with the production schedule. Unfortunately sometimes we got late-breaking schedule changes, like running extra time or extending or moving lunch. On a Friday night/Saturday morning, I got one such change. We were going to run an hour extra to make up for earlier lost units.

I made the requested change and went back to "compiling code" etc. (Perks of night shift)

Some time later... I get a call on one of my radios (Nextel) saying the lights were off in the back of the shop. I say, "hmm, that's odd" and go to the screen to turn the lights on in that area. I get a call on my other radio, saying the lights were off in the middle of the shop. Oh sh*t. For context, the lights were now off for over 200 pissed off people who just wanted to finish their overnight shift and go home. I continue to press buttons to turn lights on, hindered by the fact that the lighting controllers were on a very very slow daisy chained serial bus. My radios continue to go off with people urgently and excitedly informing me that the [expletive deleted] lights were off. I also got a visit from the Plant Shift Lead (2 or 3 steps down from the plant manager). I was pretty surprised to see her, as I was kind of wedged in a corner with a bookcase blocking half of the entryway to my cubicle.

Anyway, I eventually got the lights turned back on. Looking at the schedule changelog, I had successfully extended the shift, but for the wrong day. I had done it for Saturday, as the clock was past midnight when I edited the schedule. Oops.

---

These were all relatively early in my career, but I think they're pretty colorful.


> (literal and electrical) ground

:)


Maaan her posts always fly on HN


Real life stories tend to be more interesting than startup navel-gazing :)


True, although startup navel gazing is not as common as one would expect. Current top 10:

  Swift Playgrounds for macOS (apps.apple.com)
  Judge Orders Navy to Release USS Thresher Disaster Documents (usni.org)
  Where are all the animated SVGs? (getmotion.io)
  Stage is a minimalistic 2D, cross-platform HTML5 game engine (piqnt.com)
  How the CIA used Crypto AG encryption devices to spy on countries for decades (washingtonpost.com)
  N26 will be leaving the UK (n26.com)
  The coming IP war over facts derived from books (abe-winter.github.io)
  Growing Neural Cellular Automata: A Differentiable Model of Morphogenesis (distill.pub)
  A popular self-driving car dataset is missing labels for hundreds of pedestrians (roboflow.ai)
  Investigating the Performance Overhead of C++ Exceptions (pspdfkit.com)


She's smart, a great writer with lots of experience to draw from, and a nice person to boot. They should!


From 6 to -2 karmic points! Controversial?!


There must be a lot of SysAdmins here.


We might not be making the news with flashy new tech, but we're here and we're many... quietly making sure systems work. :)


You need to call yourself SREs and start getting paid properly :-)


Tangent: I hate this trend of following titles like that.

To my mind SRE != Sysadmin; SRE is a principle of tackling "Sysadmin" as if there were no sysadmins: engaging software solutions and engineering to track recurrent problems with a top-down approach, often with little understanding of high availability in hardware or OS design.

Sysadmin is historically a role of automation and reliability, but working from the bottom up. I (and others) make sure operating systems are not exhausted and that the hardware can support various reliability metrics.

Personally, I think these roles are complementary because an auto-healing system that has a stable platform is going to be more reliable than something that is very over-engineered to deal with hardware faults as a common occurrence.

I don't think title inflation is necessary.

Don't get me started on "DevOps" engineers. It's either rebranded sysadmins doing the same thing but maybe with some CI/CD, or developers who have been thrown to the wolves. Hardly anyone is actually using the "there are no full-stack people, only full-stack teams" mantra.


I don't disagree with your analysis, but my (semi-serious) point was rather that most good SAs could do both, and that there might be a pay differential between SA and SRE as SRE is more popular currently.


Aha, fair enough, I didn't mean to seem overly critical. Even though it's tongue in cheek, I think you're right.

I just lament the truth of your statement. :(


I think good SAs more than anything else suffer from the "Our systems never go wrong! What are we paying those people for?"


> Don't get me started on "DevOps" engineers. It's either rebranded sysadmins doing the same thing but maybe with some CI/CD. Or Developers who have been thrown to the wolves.

This made me laugh! This is so true.


Something about her writing style is off-putting to me. I prefer more formal blogs in general. There is a sense of adventure, though, which obviously people find appealing.


Name a better blog.


> Prepare for maximum navel-gazing!

There is some truth in advertising!

I feel like so many of the posts I read and enjoy could lead with that statement.


"Everybody is a genius. But if you judge a fish by its ability to climb a tree, it will live its whole life believing that it is stupid" --Einstein



I admire that the author actually responded to some of the criticism, accepted it, and took it in stride. It's something I find myself having difficulties with more often than I'd like.

However, "even in turds you can sometimes find a peanut". I mean, come on...


> However, "even in turds you can sometimes find a peanut".

Hardly an unreasonable description of hackernews comments.


I thought that line was amusing. I wouldn’t read it too literally.


It reminded me of Dennis Ritchie’s “Anti-Foreword” to The UNIX-HATERS Handbook :-)


It's definitely off-putting. I'm not sure what you mean by not reading it "too literally". Obviously no one thinks the author is speaking literally... It's still reasonable to think the phrasing is gross.


Sometimes the comments are gross.


Sure? I'm not sure how that's relevant...

I'm confused about why folks seem so upset by people expressing this opinion.


Personally, I'm confused about why people insist on dissecting every single sentence in this article and some others like it.


Ok... Do you have any evidence of me (or the OP commenter) doing that? Rachel is one of my favorite tech bloggers, and has been in my RSS feed for years.

At least it's clear that this is just you White-Knighting....


Is it bad form to criticise an article?


What is your criticism exactly? You quoted the article and said simply, "come on".

IMO this is only validating the criticism the article levels at comment sections like HN's. You have picked out some random sentence and expressed no more than idle disagreement. Maybe I personally wouldn't compare your comment to a turd, but there's not a whole lot of nutritional value in it either.

Perhaps the reason the article expresses this concern in this specific way is because it is warranted. Because people insist on disassembling articles coming from this domain sentence by sentence and posting comments that really don't say anything helpful or sometimes anything at all.


Do I really need to spell out why that phrasing is

1. unpalatable, and 2. indiscriminately rude towards an entire community?

> Because people insist on disassembling articles coming from this domain sentence by sentence

I feel like I may have walked into something where I don't have much context. I'm not sure what you mean by that.

Also, I find it strange people are so fixated on my criticism, and nobody has commented anything about the praise I made in the very same post.



