Ask HN: Best “I brought down production” story?
273 points by Ozzie_osman on June 26, 2021 | 293 comments
What is your best "and then I brought down production" story?



Not me, but a colleague - he wanted to look around the system as the `uwsgi` user, so he ran `sudo -u wsgi -s /bin/bash`.

Except that he typoed, and instead ran `sudo -c wsgi -s /bin/bash`. What that does is instead of launching the (-s)hell as the uwsgi (-u)ser, it interprets the rest as a (-c)ommand. Now, `wsgi` is also a binary, and unfortunately, it does support a `-s` switch. It tries to open a socket at that address - or a filesystem path, as the case may be. Meaning that the command (under root) overwrote /bin/bash with 0 bytes.

Within minutes, jobs started failing and the machine couldn't be SSH'd into; funnily enough, as /bin/bash was the login shell for all users, not even logging in via a tty through KVM worked.

Perhaps not the best story, but certainly a fun way to blow your foot off on a Monday morning :)


That's beautiful. I'm not sure I'd have had a clue what just happened even if it was me making the typo.


`ssh $host /bin/sh` (or another shell) should work?


That won't work because sshd runs the command using the user's login shell. From https://man.openbsd.org/sshd#LOGIN_PROCESS:

> When a user successfully logs in, sshd does the following:

> ...

> 9. Runs user's shell or command. All commands are run under the user's login shell as specified in the system password database.


Logging in with another shell would only have been attempted if someone had known at the time why the logins were failing.

But thanks, I've just added another technique to my toolbox.


Depending on the distribution, /bin/sh might be a symlink to bash


On a Linux box isn’t that just a link to bash?


It depends on the distro; on RHEL, yes, it's bash, but Debian uses dash, and Alpine uses BusyBox sh.


Now I'm curious how you managed to recover. I only know enough of my way around a shell to be dangerous and I'd be SoL if I ended up in this situation.


Recovery disk, then either copy the disk's copy of bash (if it doesn't depend on a later glibc version), copy another shell to /bin/bash (as the system probably doesn't depend on bash-specific commands to boot), chroot and use the package manager, or use the package manager with an explicit sysroot (e.g. pacman --sysroot). The first two options are very easy compared to the latter two, but should be followed by a reinstallation of the package that provides bash.
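For anyone who hasn't had to do this, a rough sketch of what the live-disk route looks like (the device name and package manager are assumptions; adjust for your distro):

  # Boot a live image and mount the broken root filesystem
  # (assuming it lives on /dev/sda2 - adjust for your layout)
  mount /dev/sda2 /mnt

  # Quick fix so logins work again: put any working shell at /bin/bash
  cp /mnt/bin/dash /mnt/bin/bash                  # e.g. on a Debian-style system

  # Proper fix: chroot in and reinstall the package that provides bash
  mount --bind /dev /mnt/dev
  mount --bind /proc /mnt/proc
  mount --bind /sys /mnt/sys
  chroot /mnt /bin/sh -c 'apt-get install --reinstall bash'   # or the dnf/pacman equivalent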


You should also usually be able to just reboot and as long as another shell is installed, choose that shell as your startup command in the bootloader.
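Concretely, a sketch of that route (assuming GRUB, and that /bin/sh or another shell survived):

  # At the GRUB menu, press 'e' on the boot entry and append to the 'linux' line:
  #   init=/bin/sh
  # Boot with Ctrl-X, then from the emergency shell:
  mount -o remount,rw /
  cp /bin/sh /bin/bash      # stopgap until the real bash package is reinstalled
  sync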


I wonder, wouldn't copying over sftp work? I don't think it depends on the shell.


Does the proc entry for a running process still link to the now-deleted file in that situation? If so, you might be able to save yourself from a running bash shell by doing a “cat /proc/$$/exe > /bin/bash”


Probably not if it was overwritten (": >/bin/bash") rather than removed and recreated ("rm -f /bin/bash; : >/bin/bash"). The former will cause all processes to see the empty file, the latter would leave processes with access to the old contents.

In this case, if you noticed and still had a shell, you could just copy another shell over ("cp /bin/sh /bin/bash") to at least get back to a state where you could probably log in, until you could pull a copy from another machine or backups.
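A minimal sketch of that salvage, assuming you still have a root shell open somewhere:

  # Put a working shell back at /bin/bash
  cp /bin/sh /bin/bash           # only helps if /bin/sh isn't a symlink to the broken bash

  # If /bin/bash was unlinked (not truncated in place), a running bash can restore itself
  cat /proc/$$/exe > /bin/bash   # $$ = PID of the current shell
  chmod 755 /bin/bash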


On Linux you can't overwrite a binary that is currently being executed (you'll get -ETXTBSY when you try to open it for writing, and truncate will also fail).


In the wsgi case of replacing a regular file with an AF_UNIX socket, the destination would have needed to be unlinked first - the bind() call will otherwise fail.


I was unaware of the first part of this, and decided to find a reference.

This Stack Overflow answer[0] provides some detail on ETXTBSY and the fact that writing directly to a running executable file will fail. (I had thought it would unlink the file like in the second case, simply because that's what I always see when recompiling something that's already running; recompiling obviously uses the second method though.)

[0] https://unix.stackexchange.com/questions/187931/modifying-bi...


Back in the days of MyISAM and before Google had their own ad network, I worked for the world's largest advertising network. It had a global reach of 75%, meaning three quarters of people saw at least one of our ads daily.

I was trying to learn MySQL and the CTO made the mistake of giving me access to the prod database. This huge network that served most of the ads in the world ran off of only two huge servers running in an office outside Los Angeles.

MyISAM uses a read lock on every SELECT query. I did not know this at the time. I was running a number of queries that were trying to pull historical performance data for all our ads across all time. They were taking a long time so I let them run in the background while working on a spreadsheet somewhere else.

A little while later I hear some murmuring. Apparently the whole network was down. The engineering team was frantically trying to find the cause of the problem. Eventually, the CTO approaches my desk. "Were you running some queries on the database?" "Yes." "The query you ran was trying to generate billions of rows of results and locked up the entire database. Roughly three quarters of the ads in the world have been gone for almost two hours."

After the second time I did this, he showed me the MySQL EXPLAIN command and I finally twigged that some kinds of JOINs can go exponential.
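For anyone who hasn't met it, a sketch of that sanity check (the table and column names here are invented):

  # Ask MySQL for the query plan before letting a big JOIN loose on prod
  mysql -e "EXPLAIN SELECT a.ad_id, SUM(s.impressions)
            FROM ads a JOIN ad_stats s ON s.ad_id = a.ad_id
            GROUP BY a.ad_id" ads_db
  # Watch the 'rows' estimates and the join 'type': ALL with no usable key on a
  # large table means a scan per joined row, which is what melts a MyISAM box.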

Kudos to him for never revoking my access and letting me learn things the hard way. Also, if he worked for me I would have fired him.


I’m confused by the last part of your post.

Sounds like you appreciated that your boss gave you space to learn, and understood that you made an honest mistake, but you’d fire someone who made this mistake if they were working for you?

How do you square those two things internally?


It's not good to punish people for making mistakes in the course of their work (especially if that work is meant to be educational)

It is good to punish people who give access to production databases to people who shouldn't have it. And the guy learning MySQL should not be given that access.

Taking down prod is always a symptom of a systemic failure. The person responsible for the systemic failure should see the consequences, not the person responsible for the symptom.


> The person responsible for the systemic failure should see the consequences

You don't see the contradiction of terms there? A systemic failure is by definition not the responsibility of one person. You're saying people should be able to make mistakes. But not those people.


Having worked in a very large, bureaucratic company, I can say that I strongly suspect that just writing off systemic failures as learning opportunities is also not sustainable. Too many times I’ve had to yield on something and say “I guess they’ll learn when this fails” only to see them easily move on or get promoted before the failure occurs. They don’t learn their lessons.

I suspect the solution is to find a way to make sure the consequences of the decision are fed back properly into the system directed at the right person. How to do that, I have no idea.


It's not that you ignore them. But your first step if someone makes a mistake should not be to fire them. Maybe they need some coaching, maybe they have too much access or authority, but everyone makes mistakes. The key is whether or not they learn from them, or keep making them.


A piece of the system (a junior developer) is allowed to make mistakes. The person responsible for architecting and protecting the system (the CTO)... less so.


That depends; this might have been this CTO’s first time as CTO. Without knowing the story, they could well have been pretty far out of their element and just lucky to be a founder or something.

Even C-level people always have to have their first day as C-level and of course they will make mistakes.

The important thing is learning from them of course.


> The important thing is learning from them of course.

Which is why I find it unacceptable that he repeated the mistake.


Well, one could argue that "giving access to production databases to people who shouldn't have it" was just "making mistakes in the course of their work" for this dude.


Wait, but if you count giving the access as a mistake made in the course of his work, then what's the new plan?


I’ve never understood the logic of firing someone over a mistake like that. They’re now the person least likely to make a similar mistake and they will maintain the institutional knowledge to help ensure it doesn’t happen again.


It's a poetic way of saying the boss was/is a better person than the OP.


I think the hint is in "after the second time I did this". I would also wonder whether to keep them on at that point.


> How do you square those two things internally?

Easy: (1) He wasn't my boss. (2) He allowed a person not associated with his team or even the tech department to conduct potentially harmful operations on the production database without supervision. (3) Those actions resulted in millions of dollars of lost revenue and make-goods. (4) He did not coach the person who brought the database down. (5) He repeated the mistake.


Wait, this happened twice? Weren't you at great pains to avoid it reoccurring after the first time?


Yeah, it's about firing him for making the same mistake twice, a mistake that took down production both times.

Sounds like the boss was cool.


> some kinds of JOINs can go exponential.

How? AFAIK a single join can be at most quadratic, and multiple joins should at most be polynomial, where the exponent is the number of joins. To go exponential, you'd need some kind of recursion or self reference, but I know no way to express such a thing as an ordinary join statement.

(of course quadratic performance is already prohibitively slow on large tables, so there is no need to go exponential in order to take "forever")


Unfortunately exponential sometimes means “grows rapidly” and not exponential in a mathematical sense.


In our software, a minority codepath sometimes reported database deadlocks. Nothing critical, but it littered the ops error logs and probably displayed error messages to a few customers. So I added a pessimistic exclusive lock to a query, which basically solved the deadlock problem (not a great solution but it worked). However, what I missed was that the query, even though in a minority code path, was touching another table used in basically all hot-path queries. So the code seemed to work fine until it was deployed to all servers, when all operations of the whole cluster got serialized through this single lock. So, yeah, database locks can bite you hard!


Not as bad as yours, but MySQL and also blocking a prod table: on my first job after graduating, I once ran a delete command on about 20 rows of a quite large table (maybe 500M+ rows), but it caused a deadlock because of gap locks. It has been 6 years so I don't really remember the details.

I was no expert but knew a bit about MySQL optimization at the time, but it looks like sometimes you just do things and don't think them through.

15 minutes later, the sysadmin team PM'd me and asked WTF I was doing, and I realized what had happened.


Someone hogging the database with an analytics query is an honest error because of an insidious footgun inherent in the technology stack. On the other hand, the CTO permitted access to the production database ... why? To learn MySQL, it would have been sufficient to set up a local instance. Or connect to testing/staging environments to get at some data.


Can it be a story I was involved in but didn't cause?

I used to work for a major university as a student systems admin. The only thing that was "student" about it was the pay-- I had a whole lab of Sun and SGI servers/desktops, including an INCREDIBLE 1TB of storage-- we had 7x Sun A1000's (an array of arrays) if memory serves.

Our user directories were about 100GB at the time. I had sourced this special tape drive that could do that, but it was fidgety (which is not something you want in a backup drive admittedly). The backups worked, I'd say, 3/4ths of the time. I think the hardware was buggy, but the vendor could never figure it out. Also, before you lecture me, we were very constrained with finances, I couldn't just order something else.

So I graduated, and as such had to find a new admin. We interviewed two people: one was very sharp and wore black jeans and a black shirt-- it was obvious he couldn't afford a suit, which would have been the correct thing to wear. The other candidate had a suit, and he was punching below his weight. Over my objections, suit guy gets hired.

Friday night, my last day of employment I throw tapes into the machine and start a full L0 backup which would take all weekend to complete.

Monday morning I get a panicked phone call from my former colleagues. "The new guy deleted the home directories!"

The suit guy had, in his first few hours, literally destroyed the entire lab's research. All of it. Anyways, I said something to the effect of, "Is the light on the AIT array green or amber?"

"Green."

"You're some lucky sons of bitches. I'll be down in an hour and we'll straighten it out."


Hilarious story, thanks for sharing.

> "Is the light on the AIT array green or amber?"

Can you explain this? What is an AIT array?


As the others have correctly intuited-- it's a tape backup system. We couldn't afford a proper tape library with a robot. I can't remember why, but there was a limitation of the software or SunOS that meant we needed to be able to get an L0 onto one tape. This thing was two Sony AIT tape machines with a special SCSI board that made them look like one single drive to the host, thus doubling the capacity. It was just enough to skate by. I didn't have much faith in differentials and didn't like to let them run more than a few days either.

I always assumed the fault was in that SCSI board. The hand-off between tape-1 and tape-2 was what usually failed. The problem might occur 24 hours into a backup so it was difficult to get good backups. Also, I was not a full time employee (being a student), so I couldn't babysit this thing 5 days a week like a full time employee. Also, it kind of killed the performance of the system, so I had to do them at odd hours.

I am proud to say I never lost a single bit at that job.

In essence, if this backup had failed, months of research would have been lost (maybe a two- to four-week-old backup x 12 researchers).

Anyways, thank you for reading my silly story!



I assume the 3/4-reliable tape drive?


I take it deleting everything was an accident, right? Not that the guy applied for a job only to destroy that information... lol


I do not actually know what happened. But something basically resulted in an rm -rf /home and was accidental.

By all reports, he eventually became a well liked and good admin. I just don't think he knew that much Unix when he started.


I was a system engineer at Amazon from 2001-2006. Sometime around 2004/2005 or so there was a development team working on the "a9 search engine" (meant to compete with Google) down in SF. They were sort of an official "shadow IT" offshoot and asked for special treatment, and they got me assigned specifically to them to build out the first of their two webservers.

They made the usual mistake of wanting to jettison all the developer tooling and start from scratch. So there was a special request to just install a base O/S, put accounts on the box, and set up a plain old apache webserver with a simple /var/www/index.html (this was well outside of how Amazon normally deployed webservers, which was all customized apache builds and software deployment pipelines and a completely different file system layout).

They didn't specify what was to go into the index.html files on the servers.

So I just put "FOO" in the index.html and validated that hitting them on port 80 produced "FOO".

Then I handed off the allocated IPs to the networking team to setup a single VIP in a loadbalancer that had these two behind it.

The network engineer brought up the VIP on a free public IP address, as asked.

What nobody knew was that the IP had been a decommissioned IP for www.amazon.com from a year or two earlier, when there was some great network renumbering project, and it had pointed at a cluster of webservers on the old internal fabric.

The DNS loadbalancers were still configured for that IP address and they were still in the rotation for www.amazon.com. And all they did as a health check was pull GET / and look for 200s and then based on the speed of the returns they'd adjust their weighting.

They found that this VIP was incredibly well optimized for traffic and threw most all the new incoming requests over to these two webservers.

I learned of this when my officemate said "what is this sev1... users reporting 'foo' on the website..."

This is why you always keep it professional, kids...


Awesome.

Sadly, it seems that the Web Archive didn't happen to grab any pages from Amazon during the (presumably-small) window this was live.

Specifically, I ran the CDX query hxxp://web.archive-dot-org/cdx/search/cdx?url=amazon.com&matchType=domain, and then grepped through the 174MB of results (1,274,038 lines) for response lengths 9999 bytes and less (ie, [0-9]{1,4}), on the assumption this should find every tiny response. The only such responses are 30x redirects and a couple of 503s. :(

(That's a normal URL above - s/xx/tt/ and s/-dot-/./ - but since it spits out 174MB of text I figured I'd save IA the bandwidth from crawlers fetching everything they see on HN etc.)


> This is why you always keep it professional, kids...

In the late 90s someone at a company I worked for put a placeholder HTML file on a client's production web server that instead of the usual lorem ipsum stuff had something like "this shit is beneath me", except much worse. The placeholder was never removed, and while it wasn't the index page and, if I remember correctly, wasn't even directly linked from anywhere, someone still found it. The incident ended up in the national newspapers.


Hahaha


Back when I was working on proof of correctness, when that was a very new thing, I was using the Boyer-Moore theorem prover remotely on a large time-shared mainframe at SRI International. At the time, you needed a mainframe to run LISP. I was working on proofs of basic numeric functions for bounded arithmetic. So I was writing theorems with numbers such as 65536.

This caused the mainframe to run out of memory, page out to disk, and thrash, bringing other users to a crawl. It took a while to figure out why relatively simple theorems were doing this.

Boyer and Moore explained to me that the internal representation of numbers was exactly that of their constructive mathematics theory. 2 was (ADD1 (ADD1 (ZERO))). 65536 was a very long string of CONS cells. I was told that most of their theorems involved numbers like 1.

They went on to improve the number representation in their prover, after which it could prove useful theorems about bounded arithmetic.

(I still keep a copy of their prover around. It's on Github, at [1]. It's several thousand times faster on current hardware than it was on the DECSYSTEM 2060.)

[1] https://github.com/John-Nagle/nqthm


> Boyer and Moore explained to me that the internal representation of numbers was exactly that of their constructive mathematics theory. 2 was (ADD1 (ADD1 (ZERO))). 65536 was a very long string of CONS cells. I was told that most of their theorems involved numbers like 1.

Can we have a moment for the folks who managed to turn numerical purity into integers being O(n)? It's unbearably beautiful... And I do mean unbearably...


> Can we have a moment for the folks who managed to turn numerical purity into integers being O(n)?

It’s not an unreasonable approach given the application was automatic theorem proving: https://en.wikipedia.org/wiki/Peano_axioms

Both Boyer and Moore are quite sharp so I wouldn’t jump to the conclusion that they didn’t know what they were doing.


could have been much worse: https://en.wikipedia.org/wiki/Dedekind_cut


Long ago this one (close to 2000), but we were hosting some of our clients on machines in our office because it was (a lot) cheaper for our startup to do so. We had 3 rather large (for our company) clients running at the time in the server room in the office. The servers were hooked up to 2nd hand APCs so power failures went unnoticed if they were short. One Friday afternoon, we had some drinks (tgif) and a bunch of us (I was the cto....) were fooling around in one of the rooms throwing tennis balls; I threw one straight into the fire alarm. This was just a glass with a button behind it: if you pressed the glass it would trigger for the entire (20 story) building. This cut the power and switched on the emergency lights. There was no fire obviously, but, per bureaucracy rules, the fire brigade had to pull up, inspect the building, we had to sign docs etc, and then they switched off the alarm and on the power. Too late for our APCs: everything was down and (like I said: a long time ago for Linux etc) we had to run fsck and basically spent a large portion of the evening getting it all back. We moved to xs4all colocation after that incident...


A friend of mine ran a large and relatively popular (as in at least 30 users online at any given time ...) PvP MUD on a server of mine back in the (late?) 90s.

I didn't play muds and my experience was mostly limited to helping him fix C programming bugs from time to time and fielding an occasional irate phone call from users who got my number off the whois data. But because of the programming help I had some kind of god-access to the mud.

One afternoon I had a ladyfriend over that I was presumably trying to impress and she'd asked about the mud. We hopped on and I summoned into existence a harmless raggedy-ann doll. That was kind of boring so I thought it would be fun to attach an NPC script to it -- I went through the list and saw something called a "Zombie Lord" which sounded promising. I applied it, and suddenly the doll started nattering on about the apocalypse and ran off. Turned out that it killed everyone it encountered, turned them all into zombie lords, creating an exponential wave of destruction that rapidly took over the whole game.

I found the mental image of some little doll running around bringing on the apocalypse to be just too funny-- until my phone started ringing. Ultimately the game state had to be restored to a backup from a day or two prior.

[I've posted a couple examples, -- I dunno which one is best, but people can vote. :)]



That was an annoying time in the game. Not horrible, but you pretty much had to avoid town because people were purposefully trying to spread the plague. Certainly memorable though.


Little did we know how realistic that would end up being.


I was there for it, but I literally walked into Ironforge right after the first wave hit, and it was just corpses everywhere. I had no idea what was going on at the time!


Oh man. MUDs in the late 90s. Just randomly reminded me of all the time I spent in Medivia (I'm fairly sure that was the one). Those were good times. Surprised the name even came to me since I haven't thought about those in... 20ish years.


Medivia? Judging from screenshots that looks a lot like Tibia (https://www.tibia.com)


Never heard of Medivia before, but looking at screenshots and their site, that is absolutely some sort of fork/copy of Tibia. Even the world map is basically same shape, with locations shifted slightly and renamed, and the game interface is the same. I think they may even be using the old sprites from before a UI revamp a bit over a decade ago, from when I played Tibia.

Edit: Found this [0] which states Medivia originally started as a private Tibia server that eventually evolved into its own game. People were only just starting to make private Tibia servers when I left (or maybe they were only just starting to become popular, not sure anymore), which explains why I hadn't heard of this.

Tibia was created in 1997, I played roughly 2002 - 2006. Looks like Medivia was first created in 2009 [1], I would guess probably still just a private server at the time.

[0] https://www.reddit.com/r/MMORPG/comments/i480jd/medivia_4_ol...

[1] https://www.reddit.com/r/MediviaMMO/ (right sidebar)


Since they mentioned the late 90s, I think it's more likely they meant Medievia[1], which has been online since 1992.

[1]: http://www.medievia.com/


This was 20 years ago now - it was my first day in a new job working for a startup.

Our startup was based in the garden office of a large house and the production server was situated in a cupboard in the same room.

The day I started was a cold January day and I’d had to cycle through flooded pathways to get to work that morning - so by the time I arrived my feet were soaked.

Once I’d settled down to a desk I asked if I could plug a heater in to dry my shoes. As we were in a garden office every socket was an extension cable so I plugged the heater in to the one under my desk.

A few minutes later I noticed that I couldn’t access the live site I’d been looking through - and others were noticing the same.

It turned out the heater I was using had popped the fuse on the socket. The extension I was using was plugged into the UPS used by the servers. So the battery had warmed my feet for a few minutes before shutting down and taking the servers down too.

And that’s how I brought production down within 3 hours of starting my first job in the web industry…


Quite close: the first few mins of this Apple WWDC from a few years ago. https://youtu.be/oaqHdULqet0?t=32


2 days before I got married, I dropped the production database by accident from a GUI tool where “right-clicking” can be destructive if you click too fast. The application scheduled radio and television commercials within 48 US states for a large international advertising group. The bigger problem was that the DBA had only been doing incremental backups and didn’t have a full backup against which to run the incremental backups. He had never created a full backup.

Fortunately, given the nature of media buys in that time, all placements were printed and faxed. My team sent me to my wedding rehearsal dinner and spent the next two days collecting printed orders and re-keying them into the system.

I am forever grateful to that team.


> where “right-clicking” can be destructive if you click too fast

That’s pretty amazing in the most terrible way possible


> a GUI tool where “right-clicking” can be destructive if you click too fast

Fun fact, there’s a current bug in VS Code where right clicking on a file in the sidebar immediately triggers the menu entry that appears under the cursor, which happens to be delete. I’ve had it happen randomly for several months now.


Exactly the kind of thing that I am still afraid of to this day. :)


Backups aren't important, restores are.


First time I got into the computer room back in the days when one unix mini-computer ran hundreds of terminals. I was asked to put a reel tape into the drive to load a tar archive into the Oracle database.

I couldn’t get the tape drive door open so I looked around and saw a key next to the door.

That didn’t open the door either. I was stood there scratching my head when the double doors burst open and half a dozen sysadmins came running in like a SWAT team.

I was a bit surprised until I glanced down and noticed all the lights were off.

Yes the key that turned the power off had been left in the machine.


Nothing crazy, but something I always laugh at.

I was so excited to meet a legit/professional dev team the first day of my career.

I was paired with a Sr dev and sat in his cubicle so he could show me some backend update to a prod app with 20K internal users... "normally I'd run this on the dev server, but it's quick & easy so I'll just do it on prod"

...watched him crash the whole thing & struggle the rest of the day to try and bring it up. I just sat there in awe, especially as everyone came running over and the emails poured in, while the Sr Dev casually brushed it all aside. He was more interested in explaining how the mobile game Ingress worked.


I had to reread this to check if you're really not talking about Ingress servers, because Niantic really seems to be doing changes directly in production all the time

also I completely understand how invested one can get in explaining Ingress


The amount of "yep, I know" I have to do any time anything goes down... At least Sales usually only has one person report it to me; with other teams, 3 different people will tell me at the same time...


You could send them an email the moment you know so that they wouldn’t feel like they need to report that.


They wouldn't read my email until after they reported it... so then I'd get two emails from each person: one saying there's an issue, and another saying "oh sorry just read your email that you already know about it"!


Near miss: my first job I was working on a CRUD app for a huge bank. I was dumb and it was early in the enterprise era of software and I had built my own simple O/R tool based on codegen. Not a terrible tool all things considered and I was pretty pleased with myself.

One night in bed I realized that if someone hit submit on the delete screen without filling in any criteria it would just delete the whole database.

Not a fun drive in.

Yes, we drove in in those days.


And this is why we use things like database users that don't have delete permission, and row-level security so users can't delete things that don't belong to them.

I have learned this from a very similar experience.


And why it’s often best to mark a record deleted and then have a reaper remove the records at a later point. But you must make sure all normal queries don’t see deleted items.
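A minimal sketch of that pattern (MySQL via the CLI; the table and columns are invented):

  mysql app_db -e "
    -- 'Delete' by flagging, so mistakes stay recoverable
    UPDATE orders SET deleted_at = NOW() WHERE id = 42;

    -- Every normal query has to exclude flagged rows
    SELECT * FROM orders WHERE deleted_at IS NULL;

    -- A periodic reaper job actually removes old soft-deleted rows
    DELETE FROM orders WHERE deleted_at < NOW() - INTERVAL 90 DAY;
  "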


That's over-engineering. At this point, just rely on PITR. FWIW, Postgres does have a "reaper" via vacuum, but not for the purpose of safety, rather to allow for MVCC.


Hi, if I may, what does PITR stand for?



Thanks!


There are other advantages for soft-deletes, like not having to worry about FKs.


>One night in bed I realized that if someone hit submit on the delete screen without filling in any criteria it would just delete the whole database.

how


Lack of validation?

I.e. if no criteria, it could be sending a DELETE with no WHERE clause in SQL land.


I remember sitting next to someone who screamed after realising he had just done a 'DELETE FROM company' without a WHERE clause. Our database was way too big to back up, so we only had production, but luckily I had rolled out database logging in the interface that recorded all UPDATEs and DELETEs a few weeks before the event.


I had a DBA call me on my way back from lunch absolutely sobbing after fat fingering a semicolon before the where clause while logged in to Prod as root.

We got things restored and back online in a couple of hours. I let her go home afterwards; heh, she had suffered enough.


I was taught to also use a transaction and then check how many rows were affected before committing.
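Roughly what that habit looks like as a dry run (a sketch only; the table name is borrowed from the story above, the rest is invented):

  mysql app_db -e "
    START TRANSACTION;
    DELETE FROM company WHERE id = 1234;
    -- ROW_COUNT() reports rows touched by the previous statement; if it prints
    -- 2000000 instead of 1, the WHERE clause was wrong (or missing)
    SELECT ROW_COUNT();
    ROLLBACK;  -- switch to COMMIT once the count looks right
  "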



I wonder if archive.org was around enough back then to capture any of it.


Many mistakes were made prior in order to get her in the situation to begin with. The semicolon was just the icing on the cake.


When your database is too big to backup that's a lot like a bank being too big to fail. Unwise long term strategy.


Terabytes across 25+ databases not co-located and all being masters replicating to each other, in 2009. Not sure how you would have done it.


I got myself into the habit of opening transactions before doing anything that could alter data by hand. Huge panic-saver.

Even earlier, I got into the habit of typing the where clause before typing the from clause. Not much of a change, but it feels more fail-safe.


In one web shop I worked with, developers were fanatically using constructions like "WHERE user LIKE '%$user%'".

We had some really heated debates, but nobody would listen to me. And then one day a DELETE happened where $user was empty. They spent half a day restoring the database, but nobody drew any conclusions and nothing changed...


Easy: no form input no condition on the delete

aaand it’s gone


How about breaking all e-mail across the entire world back in 1996?

This one got a write up at The Register: https://www.theregister.com/2018/04/16/who_me/

The story as I told it to them is slightly different. They made some changes that I think do make it a better story, but not quite as close to the truth of the situation as I remember it.

I think that’s my biggest mistake so far.


Here's a good reminder of the dangers of scale.

When Need for Speed (2015) came out I was one of the software engineers in the war room, monitoring crashes and usage statistics.

At one point we saw a big drop in active users and it turned out it was because servers kept crashing. That was a big deal since a server crash was usually rare and naturally meant disconnecting all players on it.

After a bit of searching I found the responsible code. It was a client side bug which crashed the server. The line even had a comment mentioning that it wasn't entirely correct, but that the probability of a crash was about one in a million.

Well, we had millions of players.


A client side bug was crashing the server? I always thought the client depended on the server, not vice versa.


It's a slight simplification. Technically a client side bug led to bad state which was then reported to the server. The server lacked validation in this case and crashed when processing the invalid data.


> The server lacked validation in this case and crashed when processing the invalid data.

As a backend dev, I'd point to this as the root cause, not the client sending invalid data.


I see where you're coming from. I would say it's the cause of the crash but not the root cause of the issue as a whole. Both obviously need to be fixed.

Fixing only the client side issue removes this crash entirely, but leaves one open for similar issues in the future.

Fixing the server side issue fixes the crash and related ones, but doesn't address the root cause of the faulty data being generated by the client.


I have no idea, but I'd guess it was the client generating a random number to use as a unique ID, with insufficient length.

AFAIK, anything UUID-sized or bigger doesn't have this problem on Earth.


This seems like a very specific guess and it's not related to the actual issue. I wonder how you arrived at this idea?


When I read "wasn't entirely correct", and "probability of a crash was about one in a million", I wondered how you'd work that value out.

Predictable statistics screams RNG, to me. Especially if the code author knew it was an imperfect solution, and didn't know of one that wouldn't cause this crash. (I've done stuff like this before.) Also, for the client to crash the server, it'd need to give it some form of bad/confusing data, but for some reason this doesn't happen with fewer users.

I think that was my logic at the time of writing that comment, though I've thought of another few guesses while typing this out/reading your follow-up.


I went to a residential high school in a very rural state. Our internet went through a local college, then the state university.

This is 1998. I was setting up my new main desktop with linux that year. When it got to network services, the installer asked me if I wanted dhcp and dns. "Of course I want dhcp and dns, I don't have a static IP here."

It was asking if I wanted to install DHCP and DNS servers on this machine. I can only guess as to what kind of configuration allowed this to spread as far as it did, but for about 2 days the state university shut down the entire lower college's network because my machine was apparently responsible for DHCP for everyone for just a little while.


And thus DHCP snooping was born.


On Linux, killall lets you kill all processes matching a name.

On Solaris killall kills all processes.

To make matters worse, I used the command on a server with a hung console -- so it didn't apply immediately, but later, in the middle of the day, the console got unhung and the main database server went down.

Explaining that this was an earnest error and not something malicious to the PHBs was somewhat ... delicate. "So why did you kill all the processes?" "Because I didn't expect it to do that." "But the command name is kill all?" ...
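For anyone who hasn't been bitten yet, the two behaviours side by side (a sketch; obviously don't run the Solaris one anywhere you care about):

  # Linux (psmisc): signals only the processes matching the name
  killall httpd

  # Solaris: killall signals every process it can reach - it exists for the
  # shutdown scripts, not for day-to-day use
  # killall             <- this is the one that empties a production box

  # Safer on both: match by name explicitly
  pgrep -l httpd        # list the matches first
  pkill httpd           # then signal them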


Genuine question: what's the usefulness of Solaris' default behavior? Why kill all processes?


It's used in the Solaris init scripts for shutdown, shortly before the system tries to unmount all mounted disk volumes.

I made the same mistake nullc made once, as I was more accustomed to linux than solaris. That was after hours and the effect was immediate, but it was still a pretty jarring and memorable moment.


I oversaw a Solaris machine for a very short period, and never got any reason to use it. But if you can change the signal, sending SIGHUP to everything looks like a reasonable thing to do.

Still, it's not something common enough to deserve its own program.


Yep, and as weird as killall on solaris is-- the naming of killall on linux is kinda weird. It's "killbyname".


I don’t know? ‘Killall node’ sounds like kill all node processes to me.


HPUX also did the same if I recall. Had to be careful swapping back between it and Linux.


I assume it’s part of the shutdown process. Makes sense in a way: kill -> killall.


Sometimes I use kill -15 -1 on Linux to log me out, but obviously not as root.


PHB?



Pointy Haired Bastards


It was 1985, I was in a VAX computer lab with about 40 other people typing on the VT100 terminals... I ran a program to compute something, and forgot that I had bumped my priority wayyy up... everything in the room stopped, even the line printer, everyone went ohhhhhh.

10 seconds later, my program finished... and everything snapped back to life.

Another time, I walked into a different, bigger lab, with 100 terminals... snooped around the system, saw that the compiler queue had about 40 minutes of entries... bumped the first one up a bit (the queue was set to lower priority than any of the users, which was a mistake)... it finished in 2 seconds, instead of 2 minutes...

15 minutes later, the queue was empty, 30 minutes after that the room was empty, because everyone had gotten their work done.


Wait, what? So the system was purposefully hindering everyone's productivity?

Sounds more like a "I brought up production" story...


My guesstimated translation: some large number of those users were running compile jobs, which were running as background tasks instead of foreground tasks.


Yes, because the system administrator, in his great wisdom, made the compiler a batch job but left the default priority way too low, below all interactive tasks.

Which meant the users had to wait around, so the labs would always end up full of logged in users, frustrated that the system was so darn slow.


resource starvation with a human element!


No, it was just a bad configuration choice by the system administrator.


Sent in a SQL command deleting several million records (intended) wrapped in a single transaction (not intended). The replication queues could not keep up and failed, bringing down most of the replicas. The master server kept trying to recover and maxed out all connections - no DBA could log in to perform manual recovery. We had to hard reboot, not knowing what state the system was in and how long it would take to fully recover. Did I mention that this was a few hours before trading was going to begin?

TBH, my team was very gracious about it and the RCA focused purely on the events that occurred and how to never let it happen again. No blame game at all.


> TBH, my team was very gracious about it and the RCA focused purely on the events that occurred and how to never let it happen again. No blame game at all.

Which is how a PIR, PER or PCR should be. If you don't understand why someone makes a mistake, you can't avoid future mistakes.


I understand SQL, DBA and TBH, but what do RCA, PIR, PER, and PCR stand for?


RCA is "Root Cause Analysis" and I assume PIR is "Post Incident Review". I don't know PER or PCR.


"Post Event Review" and "Post Change Review".


Hmmm. I can't help but wonder if maybe the database engine should have caught the queue overflow (even if the event occurred over the network) and failed the transaction.


I worked in online advertising and pushed infinite loops that froze browsers to millions of unsuspecting victims.

On another occasion I had a division operation happen on integers instead of floats, and the code was running on some hardware that steered antennas for radios on airplanes. Much time was spent by pilots flying in circles over LA while I gathered data and found the "oops". It was fixed by adding a period to an int literal.

On another occasion my machine learning demo API failed due to heavy load, but only when India's prime minister was looking at it.


> On another occasion my machine learning demo API failed due to heavy load, but only when India's prime minister was looking at it.

Okay, I'll bite.

You can't just drop a war story headline like that and walk off. Please expand.


We deployed some machine learning demos via API and they got used in some exhibits for some powers-that-be. But we ran it in the cloud on shared infrastructure like fools, and the code was pretty bad, so it failed due to heavy load while Modi was looking at it. There was an outpouring of anger afterward. Here is Modi staring at one loading spinner. https://imgur.com/a/kLklVZV


> On another occasion my machine learning demo API failed due to heavy load, but only when India's prime minister was looking at it.

This really sounds like something that can only happen when unpredictable machine learning models are involved^^


This was long ago and the details are hazy.

Back in 2003 or so, I was in tech support for a company that used desktop computers running java applets to connect to a mainframe via Telnet (IBM Host-on-Demand IIRC). Most of the core business processes were handled by mainframe apps, which the company largely developed. I used to hang out in the data center with the mainframe guys who coded in COBOL all day.

On a Friday afternoon, I was working on testing deployment of an update to the java terminal client applet. Everything seemed to work fine in testing, and it was a minor update, so (idiot me) I went ahead and pushed it to the server.

Shortly after I pushed it out, the mainframe guys' phones started ringing with complaints that the mainframe was down. Then my phone started ringing. Then all of the phones started ringing.

Turns out, something I did in the update (I honestly can't remember the specifics now) reset every local user's mainframe connection information for the applet. Across the whole company. So as soon as they exited the applet, they couldn't get back in.

That was a fun weekend.


Yayyy.

If you remember, how did this end up getting fixed? Did users have to re-input their connection info (meep) or were you able to re-fill everything in through some heuristic?


Many years ago (decades, in fact) as a fresh new excuse for a unix admin, I needed to hide the passwd binary so that users couldn't find it and change their local password on the terminal box (this was early ISP days). SunOS 4.1.4, as I recall.

Anyway, I hid that binary. In /etc, where they'd never think to look.

Gosh we do some dumb things, eh? LOL. That took a while to find a solution for, and no small amount of luck. The owner of the ISP walked back in the office a couple hours later and said "I heard you had some excitement?" I said, "Oh yes, it was pretty ugly for a bit." "Is it fixed now?" "Yup." "Carry on."

For sure thought my ass was fired and I'd only been on the job a month or so.


I don't get it - how would moving that binary break a running system? Is that binary somehow involved in something else beyond password changes?


/etc/passwd contains the user database on most Un*x systems. GP replaced it with the executable file, thus wiping out the system's users. Ouch.


> Ouch.

Ouch, indeed. We ended up getting lucky and found a workstation where someone had left themselves at a root prompt on another machine that had a shared NFS mount. This was before protection from this kind of attack, so we were able to create a setuid root script and run it on the main server to get root access to fix the broken passwd file.

Our next step was going to be rebooting the server. We were pretty sure that faced with a corrupt passwd file, SunOS would drop to single user mode. Never tested that theory. Glad we didn't have to, the server in question was a hack job as it was. Copied over (literally, as files) from a previous server, it wasn't even 100% in agreement with itself on its own hostname, so I always kinda wondered how it would react to any big changes.


My bad. I somehow expected him to put it into /etc and rename it to something else. Indeed if he overwrote /etc/passwd then all hell would break loose.


Why did you write it Un*x? Is there a Unex or Unox?

I've seen it written *nix to grab Linux and Unix.


That has precedent going way back, at least 34 years:

https://unix.stackexchange.com/questions/2342/why-is-there-a...

Doesn't explain why exactly the asterisk was put in that particular position. Maybe someone felt like it was odd to lead the word with an asterisk. :shrug:


The asterisk is there to avoid writing the word out in full, like "G*d" or "f*ck". There was a time when Unix[1][2] was a trademarked name and if you used it you had to attribute it. If you wanted to refer to the general family of Unix/POSIX/SysV/BSD/etc systems, you might be tempted to write "Unix"[1][2], but to avoid the presumably-Sauron[3]-like eye of the trademark holders, you'd bowdlerize it a little.

[1] Unix is a registered trademark of AT&T Bell Labs.

[2] Unix is a registered trademark of The Open Group.

[3] Sauron is a registered trademark of Tolkien Enterprises


Linix?


EVE Online.

I designed and wrote most of the code for the hacking minigame in EVE. Sadly and unknown to us there was a small memory leak that happened once per game. This was basically unnoticeable until it hit production. Our little game was part of a bigger rework of Exploration in EVE and that plus the game being an entertaining way to make money meant there were upwards of 150k game instances being played a day. EVE does away with the Python GC as well so the memory leak caused us to have to restart nodes every three hours IIRC. My tech lead and I had to comb through and find it which he eventually did and sanity was restored very quickly.

I don’t think it even cracks the top ten of production fuck ups at CCP though.


(Not me, but someone I worked with)

My first job out of school, working away from home and learning the ropes of embedded software.

The office was using on-premise databases, email servers and the like, as was somewhat common at the time, but nothing much more than a few robustified PCs and some networking infra. We were having internet problems, being too far away from the exchange, and so the telephone company was coming in to replace the exchange over the weekend, so everything was shut down on the Friday night.

Monday morning comes by and we boot things up again, but no connectivity… Office is dissolving into chaos as phones were also down. British Telecom is demanded to return this very minute and figure it out!

An hour later a very flustered gentleman turns up and begins to debug a few sockets but finds them all dead. 1 minute later he is at the new exchange (that was inside our office), only to emerge from the room after 30 seconds looking extremely confused.

It turns out Dave, an extremely helpful chap who was in charge of some product final assembly, had turned up at the office as normal at 7am and thought he would helpfully uninstall the old exchange and throw it in the skip we had rented for just that purpose. A quick wander around to said skip found the exchange in there with a bunch of wiring - the helpful chap had really gone to town on this. Sadly, I was quick to identify that this was the new exchange, not the old, simply by observing how fresh it looked, and the BT chap came over to confirm. Because of the damage that had been done to the wiring, it was not trivial to simply wire back the old exchange, and so that was the end of office operations for a week.

A small company meeting was held where it was announced that “an error of judgement” had occurred and that we were to have some vacation - much of the in office equipment was taken offsite to get temporary connectivity so that sales could continue whilst we vacationed. Internet remained terrible until I left that gig, now blamed on all the wire patches needed to get the office back on line.


I wrote some software to handle charging customers' credit cards. It worked fine in the dev environment, so a week later we deployed it since we hadn't been charging customers at all until that point. We would run bills once a day until all the unprocessed billing was caught up.

Well, in dev, the database was refreshed with prod data every night at midnight, so we never saw the bug in my code. I had a sign error in updating the customer's balance, so instead of lowering their balance by the payment amount, my code increased their balance. Geometric growth is an amazing thing. A few days later we had calls from angry customers because we had maxed out their credit cards. Miraculously, I was not fired. In retrospect, I think that it might have been because the manager would have then had to explain why he had not made sure there was adequate testing on something so central to the business.


I'm in the fortunate position of having been able to tell our story in detail on our blog after a major outage involving Cassandra and bootstrap behaviour that we didn't fully understand. This is a story of how I brought down the bank for two hours.

https://monzo.com/blog/2019/09/08/why-monzo-wasnt-working-on...

In summary, we were scaling up our production Cassandra data store and we didn't migrate/backfill the data properly which led to data being 'missing' for an hour.

In a typical Cassandra cluster when scaled up, data moves around the ring a single node at a time. When you want to add multiple nodes, this can be an extremely time and bandwidth consuming process. There's a flag called auto_bootstrap which controls this behaviour. Our understanding was that the node would not join the cluster until operators explicitly signalled for it to do so (and this is a valid scenario because, as an operator, you can potentially backfill data from backups for example). Unfortunately it was completely misunderstood when we originally changed the defaults many months prior to the scale up.
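Not our exact runbook, but a sketch of the kind of check that would have caught it - confirming a new node is actually streaming data before treating it as a full member (the config path is an assumption):

  # Is the new node still joining (status UJ) or already Up/Normal (UN)?
  nodetool status

  # Is data actually streaming in from the existing nodes?
  nodetool netstats

  # And what does the flag actually say? If it's absent it defaults to true.
  grep auto_bootstrap /etc/cassandra/cassandra.yaml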

Fortunately, we were able to detect data inconsistency within minutes of the original scale up and we were able to fully revert the status of the ring to its original state within 2 hours (it took that long because we did not want to lose any new writes, so we had to carefully remove nodes in the reverse order that they came in and joined the ring).

Through a mammoth effort by the engineering team over two days, we were able to reconcile the vast majority of the inconsistent data through the use of audit events.

This was a mega stressful day for everyone involved. On the plus side though, I've had a few emails telling me that the blog post has saved others from making a similar mistake.


I have a Cassandra story as well. At a previous employer our org used a database that was a custom wrapper around Cassandra. This was a fairly large organization and this particular database was the keystone to the vast majority of operations of this particular organization. Well, one day I was giving a demo to some junior devs on how to use the REST API for the database which just so happened to take in raw Solr queries. I always liked to point that out to the newer devs as a way they could do some nice things that were otherwise fairly limited by the REST API.

Well, one of the junior devs just so happened to be playing around with various different Solr queries to see what he could get back and somehow issued a query that caused the entire staging database to fall over. That was a fun phone call to get. It wasn’t the junior dev’s fault, of course, but it really did wonders to expose the fragility of poorly optimized/unindexed queries against the database.

My experience in general with Cassandra is that outside of a few experts working with it, it was pretty poorly understood throughout the org and no one except those select people could really do anything when it all fell over.


After having spent many years working with it and interacting with it deeply, I would strongly recommend folks stay far away from Cassandra if you remotely care about your data. It provides way too many footguns to lose or corrupt or outright ruin your data.

Unless you work at Apple or Netflix or Spotify, finding Cassandra experts is going to be nigh on impossible and the community just isn't there unfortunately.


That’s a great writeup, thanks for all the detail!

I was always worried about something like this happening so only ever provisioned (via ansible) one server at a time. When the logs showed it was fully synced, we provisioned the next node. It could take two days to add 10 nodes but I always felt much safer


On the cloud, it is likely simpler and faster to just spin up a new cassandra datacenter, and then do a rebuild from the old datacenter to the new datacenter, either all nodes at once in parallel or in smaller batches. This procedure works fine regardless of using static tokens allocation or vnodes, and adds very little load to the old datacenter which is still serving traffic.
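A rough sketch of that procedure (the datacenter names are placeholders):

  # After standing up the new, empty datacenter "dc2" and altering the keyspaces
  # so they replicate to it, stream the data over on each new node:
  nodetool rebuild -- dc1      # dc1 = the existing datacenter still serving traffic

  # Watch the streaming progress
  nodetool netstats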


This is the standard approach and the one we have detailed runbooks for. We've scaled the cluster fine one at a time after this experience. It also prompted us to get a much better understanding of all the other flags that have been changed beyond the defaults.


Around 1999/2000 I worked at Universal Studios in Burbank. Shortly after I started working there in the networking group, a fellow employee asked me to reset a port on the core token ring switch. Being a helpful type, I clicked "disable" on the network management tool for the port and promptly everyone lost connectivity. To everything. A quick walk over to the NOC confirmed that yes, the entire campus was down, as well as a few satellite campuses.

I explained what I had done (which was met with incredulity, I mean, surely I had fucked something up and didn't just disable the described port)..

After the longest five minutes of my life, the network came back alive. And sure enough, only the port that I had disabled was disabled. After some troubleshooting with the vendor, it was determined that there was a software bug in the switch firmware. If the device was under heavy load, disabling a port caused the switch to reboot.

So that's how I took down the entire Universal Studios corporate network.


Nice. Reminds me of the night when I switched off 30% of the networking of an entire country. MySQL was involved. It didn't make the front page of the news though, which was our cutoff.


I was sitting in an escalation call, called by one of our most important customers and India’s largest ISP. There were 13 of them, 3 of us, and only 1 technical person (that’s me). Things were super intense, extremely heated debates were going on, I was fixing a few issues while answering the questions coming my way, and in parallel I was also chatting with my teammates, who were back in the office and on support calls with the same client’s ops team. And then it happened - rm -rf /var/lib/mysql (I was already sudo su).

Escalation got further escalated - “Dashboard is not opening”

For a moment, I was shell-shocked, couldn’t hear a thing, just sitting there frozen.

Then I remembered the backup I had, against my usual style, taken, and the MySQL replica I hosted; just as an insincere effort to calm the people in the room, I got to work. Restored the database, re-ran a few ETL jobs, and we were back up online.

I was relieved, and actually quite happy. I started interacting with the people in the room again and showed them the system's robustness even under a disaster. Two of the 13 caught my bluff, but were smiling; they winked, turned back to face the others, and the heated debate continued.

The escalation went up to the CIO, but the guys who caught my bluff never gave my deed away (in return I had to code a few more features and reports, just for them).

Don't multitask, especially with sudo access.
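
For the curious, the restore itself is the boring part; roughly this shape on a modern MySQL, with every service name, path and file name here purely illustrative:

  # Re-create the wiped data directory, then load the last logical backup.
  sudo systemctl stop mysql                 # service name varies by distro
  sudo mysqld --initialize-insecure --user=mysql --datadir=/var/lib/mysql
  sudo systemctl start mysql
  mysql -u root < /backups/latest_dump.sql  # hypothetical dump file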


BTW, the software is still being used by their ops team; it's used to monitor a third of India's internet backbone infrastructure.

The guys who caught me were the actual users of the software, their team actually.

Saves them millions of dollars per annum, because it lets them meet their SLAs.

It has now been running in their NOC for 7 years.

Users revolted against their CIO (the second or third replacement CIO), who proposed replacing this software (and others in their NOC) with an HP unified monitoring suite (part of the CIO's digital transformation initiative). Apart from this software, the rest of the monitoring systems have been replaced.

:)


"shell-shocked, couldn’t hear a thing, just sitting there frozen" Been there.


I was using -1 as a dummy timestamp in an ETL job, for some reason I can’t remember. Well, the MySQL version in prod had a bug handling queries that do that. My job failed because database shard number one had instantly crashed. For about ten minutes, nobody at Harvard could use Facebook.

This being the height of “move fast and break things,” they asked me to do it again to confirm our diagnosis.


Back in the Windows Server 2003 days, when you hit "Shutdown" a screen popped up to ask what you wanted to do: Shutdown, Restart, etc.

I hit Shutdown instead of Restart. The server was in a colo a thousand miles away. Unfortunately, at that time, it was a colo that didn't have overnight staff on site (on call, yes, but not on site). Also we had no IPMI.

I had to page some poor dude at 1am to drive 30 minutes (each way) into the colo to push the "on" button. I felt terrible. (Small company, it was our one critical production server.)


Six hours before a trans-Pacific flight on a Friday afternoon, my key production database started experiencing high latency (PostgreSQL on Heroku). We had recently installed an add-on to sync data from the DB to Fivetran. The three most experienced engineers, including myself, paired to remove the add-on to ensure the weekend was drama-free, and instead deleted the entire database - as a result of a UX issue within the Heroku console.

Recovery took 30 minutes (Through Heroku support as Heroku did not allow backups-via-replication outside their own backup system), but that was a very long 30 minutes.

Second worst was a cleanup system that removed old CloudFormation stacks automatically by only retaining the latest version of a specific stack. Deployed a canary version of an edge (Nginx+Varnish) stack for internal testing. Cleanup script helpfully removed the production stack entirely.


One Tuesday morning, my code brought down 67 restaurants piloting the new point-of-sale (POS) system.

I'd written the code to reformat the mainframe database of menu items, prices, etc, to the format used by the store systems. I hadn't accounted for the notion that the mainframe would run out of disk space. When the communications jobs ran, a flock of 0-byte files were downloaded to the stores. When the POS systems booted with their 0-byte files, they were... confused. As were the restaurant managers. As were the level 1, level 2, vendor, and executive teams back at headquarters. Once we figured it out, we re-spun the files, sent them out, and the stores were back in business. I added a disk space check, and have done much better with checking my return codes ever since.
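
The check itself is tiny; something like this, with the staging path made up, would have caught it before the comms jobs picked the files up:

  # Refuse to ship empty export files.
  for f in /staging/pos/*.dat; do
    [ -s "$f" ] || { echo "empty export: $f" >&2; exit 1; }
  done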


Huh, one doesn't hear about mainframes in the context of restaurants much. But I'm honestly not sure how far back to guesstimate - 20 years is probably reasonable, but it could even be 15...


I had been messing around with some new tool to generate bespoke weird packets all day when in the afternoon I was introduced to my new intern.

I was talking about what garbage our ethernet switches were (this was one of the earliest L3 switches-- full of blue wire bodges and buggy firmware), and how I'd already encountered a dozen different ways of crashing them.

While typing I started saying "and I bet if I send it an ICMP redirect from itself (typing) to itself (typing) it won't like that at all! (enter)" --- and over the cube partitions I hear the support desk phones start ringing. Fortunately it was a bit after the end of the day and it didn't take me long to reboot the switch.

I didn't actually expect it to kill it. I probably should have.


During my training, ~2004, I managed to kill the TCP/IP stack on an IBM mainframe running z/OS by accidentally creating a fork bomb with a perl script meant to test the performance of the newly installed BIND9.

Fortunately, it was on a testing system, SNA continued to work, and the system was due to be rebooted on the weekend anyway, so it was not that bad.

Not myself, but a few years ago, a new coworker managed to accidentally delete all user accounts from our Windows domain while trying to "clean up" the Group Policies. Our backup solution, while working, was rather crappy, so we had to restore the entire domain controller (there only was the one), which took all day, even though it was not that big. Fortunately, most users took it rather well and decided to either take the day off (it was a Friday) or tidy up their desks and sort through the papers they had lying around. A few actually thanked us for giving them the opportunity to "actually get some work done".


3 years ago. Took on a large-traffic media site running on Wordpress. The usual good stuff - outdated plugins, running on a cheap VPS box, and a hosting vendor that washed its hands of it, saying they couldn't handle the tech support and traffic.

I was working at a different company back then, and I was contacted by this consultancy that specialized in Google Cloud (which was always my personal favorite anyway). I was offered very handsome pay for what seemed to be just 4 days' worth of work. To me, it sounded really simple: get in, migrate, get out.

After I signed the contract and everything, I got to know that the client had been promised a $200/mo budget, based on a very wrong technical solution proposal by an engineer in the consultancy - down from what they were paying then, which was definitely a multiple of what was quoted. And to make matters more interesting, this guy just quit after realizing his mistake. And that's how I even got this project.

So, I went in and tried many cost-effective combinations, including various levels of caching, and BAM! - the server kept going down. They had too much traffic for even something like Google's PaaS to hold (it has autoscaling and all the good stuff, but it would die even before it could autoscale!). Their WP theme wasn't even the best and made tons of queries for a single page load. Their MySQL alone cost them in the thousands. So I put them on a custom regular compute box, slapped some partial caching on bits they didn't need, and managed to bring the cost to slightly higher than what they were paying with their previous cheap hosting company. All this led to 4 hours of downtime.

I apologized to them profusely, built them a CMS from scratch that held their traffic and more, and dropped their cost to 1/4th of what their competitors are paying. Today, this client is one of my best friends. They went from "Fuck this guy" to "Can we offer you a CTO role?" :)

I make it sound so easy, but it was almost a year-long fight, bundled with lots of humiliation for something I didn't do, just to earn their trust and respect. To this day, they don't know about the ex-consultant's screw-up.

In retrospect, this downtime was the best thing that happened to me; it helped me understand how to handle such scenarios and what you should and shouldn't do. In such situations it is tempting to blame other people around you, but in the long term it pays off if you don't, and solve it yourself.


I used to work in an automotive assembly plant doing IT. A co-worker and I were out on the plant floor when he mentioned a ping tool he'd found in the office and wanted to try out. He plugged it into a network drop and fired off a ping. The whole line stopped shortly afterwards, but we hadn't connected the dots.

A lot of communication started happening on the radio; apparently all the stations were failing their quality checks. After a couple of minutes, the plant manager, the head of IT, and other people pulled up to where we were. My co-worker unplugged the device and the plant slowly came back up.

Turns out the ping tool was last used years ago to test connectivity to the new quality server and was configured with the static IP of said server. After the ping tool announced itself on the network, the real quality server stopped receiving the quality events so the line stopped.


I cofounded a startup which was a call center using software I wrote to manage maintenance at small pad retail sites such as fast food restaurants. We had investors. One showed up - a couple - extremely angry about the repayment of their investment. It was a small building, but it handled hundreds of calls per day, with 12 call-center reps. The investors were loud. At the same time, our database seemed to have locked up; it was on a Pyramid Unix server running Sybase, my choice. So I needed to address the stall while at the same time being attacked with irate, high-volume screaming aimed at me. While typing at my terminal, I accidentally "drop table"'d. I bricked the database, and after 15 minutes on the phone with Sybase realized we had no choice but to restore from backup. That's when I found out from the tech in charge of backups that he had missed the last day. We lost 2 days of data and had to run with paper and pencil until we had the database up and working.

So we restored what we had, and over the next 5 days, as people called in for status etc., we reconstructed most of what we had lost. The reps worked with the callers, and we worked things out over a week of 24-hour days.

Embarrassing, but we lost zero customers. Unfortunately there were hundreds of small mishaps, and many unhappy people, but ultimately the business held on, went forward, and a few months later there was no lingering effect.

Lesson: Don't let investors visit where the business is done, and verify your backups by restoring them. I recommend avoiding situations like this, but if one happens you can work it out with great effort, and it is rarely the end of the world.


Best one I've seen was a former manager of mine who was testing the performance of an API which was a decision tree underneath, but insanely poorly implemented. He ran wrk against it and... yeah, the rest you can imagine.

Mine was an unfortunate chain of events. Connecting to databases was done through a specific server with port forwarding. A sysadmin changed the ports for whatever reason without notifying us, so the port that was supposed to point to an alpha environment suddenly pointed to the production environment. We had made changes to an authentication system and the old one was being deprecated, so I was going to get rid of it in the alpha environment. Soooo... drop table sessions on production. Luckily that was an internal system, so all that happened was that ~800 people got logged out.


I was testing backups for a CMS; to test, I had to destroy the test database. I destroyed the production database. While my manager was querying the CMS to do my annual review.

Good news is I was already planning on restoring the test database from the production backup, so I had the database up in under 45 minutes (slower than it should have been because Oracle's docs were flat out wrong).

A more senior engineer told me he was impressed by how quickly I got things running again; apparently in the Bad Old Days (i.e. a year before I started) the database went down and, while everybody was pretty sure there were backups somewhere, nobody was sure where; customer interactions were tracked by pen and paper for almost 3 business days while this was figured out.


That honestly hurt to read. It's my worst nightmare to do something complicated and to accidentally mess up production data. Good thing you managed to restore it so quickly...


I worked for a company back around 1999 that used MS Access to do the entire company's payroll, including the CEO's. Once I was woken at 1am because payroll wasn't running. In my sleepiness I accidentally manually committed a few database values (that I had never touched before) that gave the executive team $0.00 paychecks. Not a single exec, including the CEO, noticed for about 6 weeks.


At ITA Software, we had servers that the customer airlines would connect to, and they would sometimes get TCP connection resets for no apparent reason. (This was back in the 'aughts, when companies still ran their own servers.) ITA had bought a super-expensive router with back-up power supplies and lots of administration features.

So, we configured logging. Mysteriously, resets increased the next day. So we added more logging. On the third day, the whole system collapsed. Turns out the resets were coming from the super-expensive router itself because it was getting overloaded. More logging meant more load. Ooops!

After that, ops needed CEO approval to turn on any logging. Good times!


Learned the hard way to properly label two identical bits of hardware before working on them.

One ISP router in production with 20k active connections... one "backup" router fresh from the box.

My job was to back up the production firmware and flash the config to the spare box.

The opposite happened and the customer support telephones lit up like a Christmas tree.


The year is 2002, OS is Solaris, trying to compile some httpd add-on straight on the production server (because why not) kept giving some weird error about /etc/ld.so not being right. So junior me does:

$ rm /etc/ld.so*


That one hurts


This was many years ago, as a junior dev. Management was stressing out over how to make our app faster - long query runtimes. Naively, I pitched that we should run the queries in advance so they would be cached. Simple enough. We did some dry runs. It looked good to go. We pushed to prod. It's Sunday night and I'm asleep, the query runner activates... and our app proceeds to DDoS our data layer, not only taking down our prod, but the prod of every app subscribed to the data store.

I wasn't on call, and the on-call didn't have access to the query runner script -- it was on my laptop. So the on-call was desperately trying to fight a fire they couldn't put out while I slept like a baby... that was a fun Monday morning meeting.


About 5 years ago when I was just starting out I found myself designing a responsive course builder. My solution to a responsive interface at this time involved sending a very large stringified HTML file over websockets.

This wasn't a huge problem, but the configuration of Action Cable (the Rails wrapper around websockets) logged the entire contents of the message to STDOUT. At a moderate scale, this combined with a memory leak bug in Docker to crash our application every time one of our staff members tried to perform a routine action on our web app. This action resulted in a single log line of > 64kb, which Docker was unable to handle.

All of this would have been more manageable if it hadn't first surfaced while I was taxiing on a flight from Detroit to San Francisco (I was the only full-time engineer). I managed to restart the application via our hosting provider's mobile web interface, and frantically instructed everyone to NOT TOUCH ANYTHING until I landed.


The long-running one at Uber for a while was that someone accidentally pasted a Spotify link into some JSON directly after a bracket or something, causing it to be invalid, and brought down the entire API layer after the deployment failed to detect any errors.

Personally, on my first day of my last job, I was brought on to improve the backend (turns out, only in whatever way the non-programmer CEO decided was okay, but that's beside the point). I slowloris'd production just on the off chance it was vulnerable, because no way would a site hosted on AWS by a tech lead who claimed he was competent have a slowloris-vulnerable setup.

It was indeed vulnerable, and not only did I bring down production, I also (somehow) brought down our own wifi router in the office. Caused a few hours of downtime on a Monday.

Learned my lesson that day - even if everyone on the team says so-and-so is super elite, wait until they show you they are, and don't trust that they did anything right.


Years ago we were using NetApp Filers as storage for our database servers in a colo facility. During a planned maintenance window I installed a NetApp OS upgrade and brought everything back on line. At first it seemed fine but as soon as the database servers got some load they started dropping their connections to the NetApps and everything crashed.

Of course I blamed NetApp and called their tech support screaming for help with their OS "bug". After hours of troubleshooting we finally figured out that the NetApp OS upgrade had included a network performance optimization and it was now sending out packets fast enough to overflow the buffer on our gigabit Ethernet switch. The packet loss rate was huge. Fortunately we had a newer switch back in the office so after swapping that out and repairing some corrupt databases I was able to get production back on line. Didn't get any sleep that night though.


Go-live of a migration of three web apps and their DBs from AWS to Azure resulted in one of the hot customer-facing APIs grinding to a halt during peak hours. Compute and DB SKU specs were exactly the same, but as it turns out, Azure DB PaaS performance is nowhere near that of AWS RDS. pgbench reported about 4x more transactions completed in AWS, even though in Azure I had upgraded to Postgres 11, enabling parallel query plans.

Another fun time was when a reboot of an ubuntu 14.04 VM previously migrated up to Azure from an on-prem VMware esxi decided it wanted to sit waiting for user input in grub. We had no extension installed to use the serial console for user input. Since it was hosting a production postgres db, I received a call at 12am and was up til 4am transplanting /var/lib/postgresql from the old OS disk onto a fresh ubuntu VM's disk.


So, I was working for a direct competitor of Expedia in EU.

I was in the team handling flight research and I was adding some special rules for the markup engine.

While adding this new feature I noticed there was an instance of a class getting initialized directly in a method - no dependency injection - and it was really hard to test this way since you could not fake this object.

The purpose of this object was to generate a hash for the request.

So I naturally let the IDE extract the dependency like I did many many times.

It was a really small feature and this part of the code was really old, so nobody really knew all the secrets. Code review went quite OK, so I just deployed it on our 65 servers.

And that's when I actually learned what a lock does on your CPU.

The flight search engine was usually hit with 350 searches/s, and on our dashboard I could see our CPU at a constant 100%.

And that's how I learned that the class included a library that, on creation, creates one write lock, and whenever you call it to generate the hash you use that lock.

So 350 req/s on a server were sharing only one lock.

Luckily we liked observability as a concept so production was down for less than 5 minutes and that's how I discovered why that class wasn't injected.

Eventually I had to remove that class, because conceptually, without sharing the lock you would not get unique hashes - but it turned out there was already a workaround in place: the name of the server was added to the hash...

Anyway that's when I caused my company to get silenced by flight metasearch engines across the world because we didn't respond for 5 minutes.

For the next two hours, our flight business people needed to call all our partners to unlock us.

It would have naturally happened in 6 hours but that was a lot of lost money.


Impacted iOS Kindle users ... We had an oopsie.

https://www.google.com/amp/s/techcrunch.com/2013/02/27/bug-i...

Customer fears were far worse than reality, but my management team (up to and including JeffB) were not amused.


Had a script that managed disk space on Windows servers. The config file was XML, and PowerShell did not treat the commented-out child element as nothing; it imported a blank child element into the PSObject.

So when it ran, it looked in a blank directory (defaulting to c:\windows\system32) for blank conditions: zip filetype matching blank, delete files over blank age.

These servers were rebuilt over a weekend, and the script was scheduled again, and broke these servers again, requiring another rebuild.

When I came in on Monday, I was told that the script had caused this carnage. I didn't believe them until I read the debug logs in horror.

Luckily the config was specific to a subset of servers, but they happened to be servers that police GPS radios depended on to function.

Suffice to say it now has a lot of defensive programming in it to test the config file and resulting config object before doing anything.


We pushed a CDN config that triggered a CDN provider bug. Took down the CDN's entire presence on one continent. Broke a whole bunch of recognizable sites for a bit.


Are you referring to the Fastly outage that happened a few weeks ago, or is this more common than I realize?


No, not that particular outage.


Although, the recent Fastly outage spanned more than one continent


I have some of these.

Does self DDOS count?

We worked for Flanders radio and television (the site was one of Flanders' biggest radio stations). The site was an AngularJS frontend with a CMS backend.

The 40x and 50x error pages fetched content from the backend to show the relevant message (so editors could tweak it). The morning they started selling tickets for Tomorrowland, I deployed the frontend, breaking the JS so it fetched a non-existent 5xx page, looping and doing this constantly. In a matter of seconds the servers were on fire and I was sitting sweating next to the operations people. Luckily they were very capable and were able to restore the peace quite quickly.

And also (at another radio station) deleting the DB in production. And also (on a bank's DB2) my coworker changing the AMOUNT in all rows of the cash plans instead of in 1 row (an OR and brackets - I trust you know the kind).


I was part of a team that maintained end-of-day commodity trading software for a bank. Out of nowhere they got an $800 million imbalance one night. It was very odd and of such a magnitude that it was seen at board level (not good). We spent 12 nights watching end-of-day runs and debugging it in production before we found the issue.

It turned out that a new type of trade had been introduced which had a 0-day length. No trade we handled had ever had a zero-day length; it was unheard of, and the software divided by the trade length without checking for zero - hence the bang. The divide-by-zero exception was caught, but only after it had skipped a yen-to-dollar FX conversion, which is why the imbalance was so large.


My first IT job was in a large call center. I was sweeping up in a data center and there was a keyboard cable stretched across a walkway. The keyboard wasn't movable (I don't remember why) so I unplugged it from the PC, swept around it, and plugged it back in. About 2 minutes later half a dozen people run into the room. Apparently the SUN workstation I had unplugged the keyboard from was a critical component of the call manager and there was a bug that forced a reboot when the keyboard was plugged in while the system was running. I had hung up on ~36,000 people. Lesson learned and I only had to keep the data center clean for another year.


"there was a bug that forced a reboot when the keyboard was plugged in while the system was running."

That's not a bug, it's a feature! -- Sun

God, we fought with them about that. Even if it was a reasonable feature on a Sun 3 workstation, why put it on a high end server?


Wrote a stored procedure that joined with our `sessions` table. Worked fine in local dev and staging of course. And it was fine in prod too for 6 weeks or so because the other end of the join didn't have any rows yet.

Eventually some changes landed so the other table starts receiving data and the cost of the join begins to increase. Some time around midnight on a Friday night, performance degrades sufficiently for queries to start timing out. No-one can read a session, the site is effectively down for everyone (30m DAU). Ops team lost their whole weekend tracing back to what I'd broken, because the delay since deployment confused the issue so much.


In early 00's I got my first gig as a web developer for an oil company. They had no other developers.

There was some kind of security incident, and the PHB (literally) wanted to beef up security. For some reason, infosec was a kind of hobby for me, and I suggested (and volunteered) to install Snort as a company-wide wire tap.

So, during a maintenance period I installed the tap and server where the cable came in.

And it worked great!

A month or so later, all network activity stopped working after a new server was plugged in. It took days to figure out that the cause was a duplicate MAC on the NICs of my Snort box and this new box - a real one-in-a-million thing that I've never heard of since.


Yeah that’s really not supposed to happen - was one of the MACs redefined or was it a manufacturer error?


Manufacturer error!

IIRC, the NICs were the same make & model.


Wow very surprising! That really shouldn’t happen lol


I had web server access, but they wouldn't give me a DB login.

So I crafted an .asp to do my maintenance.

Only I was calling CreateObject() in the for loop to get a new AdoDb.Connection for each of the array entries of the data.

That creaky IIS server crashed like the economy.


Drilling through a wall, routing a new network line; hit the power line to the server rack. There was a UPS, but it didn't like that kind of short apparently, and folded up into a sulk immediately.

Best part is that I did the wiring in that building when it was built 5 years before that; I really should have realized it was there.


About 15 years ago, when ssh-ing into servers was quite normal.

In eterm on my gentoo linux laptop with enlightenment desktop I typed: su - shutdown -h now

Because I was tired and I wanted to go to bed. Came back after brushing my teeth. F### laptops and Linux! Screen still on. The thing didn't shut down!

Strange thing was: in the terminal something said it got a shutdown signal.

Then I realized I had shut down a remote server for a forum with 200k members.

It was on the server of an ISP employee, who happened to be a member of that site. All for free, so no remote support and no KVM switches. Went to bed and took a train early the next morning to fix it.


Is SSHing no longer normal? What do the cool cats do these days to manage their servers?

I use K8s and Docker to run software on my server, but initiate these via SSH. I suppose CI is perhaps the modern approach - or what else is everyone using?


Managed stuff like AWS fargate and ECS is what I want to use at work. ATM I've got an ec2 server instance with SSM taking care of it, I don't have to shell in too often.


By sshing I meant: ssh to the server, install software and configure the server by hand. No Docker, no Ansible, nothing of that sort.


The deployment process for a site I used to work on would entail changing the owner of a folder. You would change the owner from the web server to your user, upload files, and then change it back.

sudo chown -R www-data:www-data [folder]

I'd made some changes and was ready to update the owner, only I was inside the folder that needed updating. In the moment, I decided the correct way to refer to that folder was /

I noticed the command was taking far longer than usual to execute. I realised the mistake but by then the server was down with no way to bring it back up.
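
One cheap guard for this is to resolve the argument and refuse obviously wrong targets before recursing; a sketch, where the wrapper script itself is hypothetical:

  # reown.sh - hypothetical wrapper around the deploy chown step
  target=$(readlink -f "${1:?usage: reown.sh <folder>}")
  if [ "$target" = "/" ]; then
    echo "refusing to chown -R /" >&2
    exit 1
  fi
  sudo chown -R www-data:www-data "$target"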


chown and chmod on root by accident is such a common fuckup. I've done it on my personal Linux desktop before.


As an intern at a major automobile parts manufacturer, I took a hub home for a LAN party. I brought the hub back after the LAN party, promptly plugged it into a network port, and connected the wrong power cable. This was in the server room next to an AS/400 running production.

Took a long vacation weekend as my error proceeded to shut down all production due to network issues causing the AS/400 to freak out.

Can't run a conveyor belt, or robot, or sensor, or production line, or anything else if your mainframe isn't working.


FWIW, I work for an automotive parts manufacturer today and if our AS/400 is down we still can't report production.


I did the reverse. Using DOS batch scripts I wrote, I imported orders twice for a 100-person generic drug company. It was a heavy set of orders as it was. Everyone in the warehouse had to work late that day. The sales guys loved it because it was near the end of the month. No one at the company seemed to mind the mistake. I was mystified that I did not get in trouble. They had tight relationships with their customers and just shipped less over the next month.


A server running BIND, with multiple people jumping in and making edits to the config. Somebody goes in and makes an edit but never reloads the service. After that change I went in with my own change - my change was very minor and I knew it was correct, so like a fool I didn't run a syntax check - and then I reloaded the service. I didn't even check afterwards to make sure it was still running.

Narrator: Bind was not running.

Down goes a media organizations web site.
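
The belt-and-braces version is only a couple of commands; a sketch, assuming a stock BIND setup (config paths and the zone name are illustrative, and vary by distro):

  # Validate config and zones before touching the running daemon.
  named-checkconf /etc/named.conf || exit 1
  named-checkzone example.com /var/named/example.com.zone || exit 1
  rndc reload
  rndc status   # confirm it actually came back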


Here is what I did to GCE: https://status.cloud.google.com/incident/compute/15046

Can’t say more than the provided link, but that was definitely a messy weekend which is also my boss’ birthday… it is the only time so far that I chatted and discussed tech issues directly with so many SVP/VPs in Google. Hmm, miss it :P


I spelled "tariff" incorrectly with "tarriff" in the config file that's parsed on every page load.

My code reviewers didn't notice and we didn't have linting or warnings on that file, so I brought down production :)


I was moving a customer's two servers to virtual machines. They told me "do not touch server A, it has a network share with pdf documents that are critical for our business. Take server B and make the VMs there".

So I format the drive, install Windows Server on it, and the owner runs in with a red face: "omg did you delete it? it was on server B!"

And of course there were no backups. And the pdf files were indeed crazy important. The customer is in panic and despair, crying that we must remove the hard drive and take it to some data restoration company. We shut down the server, and they ask me to check the accountant's PC before I leave; she had been complaining that it was very slow.

I turn the PC on and get a pop-up notification: "the network share is offline, but your offline files are available". It turns out she had ticked that checkbox on the network share by accident, but now we were sitting in front of her desktop looking at an absolutely intact folder with all the pdf files. We successfully finished moving both servers to VMs after celebrating.

That was the only time in my life when I saw that checkbox in real action.


I worked with someone who pushed out a config change to production and went home.

Soon, he got a call asking what he did because all of our credit card processing went down.

Shortly after that we got a call from our credit card processor (one of the largest in the country) asking what we did, because it was causing a cascading failure in their systems and had taken down a big chunk of online credit card processing, including Apple's and Walmart's websites.

We tracked it down to the commit, which had accidentally nulled the username for authenticating with our credit card processor. They had a previously unknown bug that caused a memory fault when the username was null. The processing was queue-based, so when one machine failed, another would pick it up, try to process it, and fail. This happened until the entire data center fell over.

A few of the lessons learned: 1. Always check the length of user inputs. 2. Build in circuit breakers to prevent cascading failures. 3. Don't push unreviewed code to production minutes before leaving.

Everyone involved kept their jobs and we learned a lot of programming lessons that day.


I had to code a feature in one of the services that essentially made a simple SQL call every x minutes to check up on something. The interval was supposed to be configurable. I put 20 minutes as the default config and everything seemed fine. The next week I found out that I had taken down prod because the service made calls every 20 milliseconds instead of every 20 minutes. I had remembered to convert "20" to an actual time representation, but I accidentally used the unconverted value anyway; it turned out there was another constructor for the method I used to execute the SQL query, which accepts integer values as well and treats them as milliseconds. What's weird is that this version of the service had been running for a week in our test environment and absolutely no one noticed the degraded performance. A temporary workaround was to change the value in the config to 1200000. It taught me the importance of having configurable values in the application, as we'd have had to build and redeploy it again if the value had been hardcoded.


Most of these are way better, and I didn't quite bring down production. I am a neophyte DB user (working in Finance, not IT). I once wrote a query to try to abstract a simple table of about 8 years of dates by calendar month, fiscal month, holiday, etc. (fiscal was different from calendar). I ran a fairly open query against our Oracle 12 DB. I waited about 20 minutes and had no results, so I figured I'd let it run overnight after I left for the day.

When I got in the next morning, it was still running and people were having issues with getting their sales data and much of the sales reports didn’t run overnight.

A few hours later I got an email from a DB admin I worked with closely on custom pulls, and she was like, wtf. Apparently the way I wrote the query really killed the performance of the database and it broke lots of processes.

My badge of honor from that situation is the email from the DB admin that was the forward of a long email trail of the IT folks trying to figure out what happened with the subject “Nasty Query”. I saved it and occasionally share it with people.


I had already resigned. First day of the last two weeks. I was updating firewall rules in CloudPassage. Their UI sucks, and doesn't make it clear what changes were made when you click "save". Apparently, I had accidentally changed port 443 from allow to block.

Well, when the site goes down the CEO charges into the DevOps eng room and starts screaming "we're under DDOS attack!" This was his go-to cause of any problem ever, and of course is never true.

Well anyway, with all the screaming and ruckus I had forgotten that I had been in CloudPassage just a few minutes prior, changing something unrelated. So we were investigating the problem for almost an hour.

As services in AWS autoscaled, they would work for about 15 seconds and then become unreachable. That's when I realized... once CloudPassage updates the firewall rules, the instance becomes blocked. Doh. I switched it back and everything went back to normal.

Two years without incident and then boom, ruined in the last two weeks. Felt awful, but fortunately they were cool about it.


When I was in college and working on campus as a web developer I did a find-replace. I was in a hurry to get to my next class so I committed it in SVN and ran to class and muted my phone. When I went back to work a couple hours later I learned that my quick find-replace had taken down the entire campus email system. Apparently, I had broken something which caused automated error emails to be sent. However, I also learned that I had broken the method that sends the emails so it triggered a new error after sending the email and then tried to resend the original email. After a short period it had overloaded the campus email server. Thankfully, my boss was understanding but we had a nice long talk about testing code before committing it and not relying on automated tools like find-replace. There was also a new policy created about doing code reviews shortly after.


I brought down one of Australasia's biggest banks for 8 hours when I requested a backup of the production database for a test cycle we were about to start.

Unfortunately the database structure had changed and the backup script hadn't been updated and it locked all the tables it tried to access, while also waiting for incomplete transactions to finish... which never happened.

$8 million of deposits and withdrawals vanished into thin air.

Luckily for me, there were perimeter logs for all transactions, so a team of people replayed the lost transactions and fixed everything. I had to read the incident reports (which were pretty cringeworthy - people had put money into their personal accounts and there was no record of it ever happening) and write a response saying what went wrong and what I would do differently next time.

No repercussions for me though... I made another request for the database and it worked the next day.


I was 18 and had just taken over the website of a car and horse trailer dealership. I typed “;rm -fR ~” into an e-mail form to show my coworkers what would happen. On purpose. I quickly restored it (I was ready for this.) They were pretty amused. We had no concept of “on prod”. “Dev” was my local PC. Damn teenager.


I brought down production for a project by running the deployment CI pipeline for an already deployed commit. A couple of minutes later, production was thoroughly dead.

Turned out that my coworker had set up the CI process to use a PHP-based zero downtime deployment scheme where each release was deployed into a folder with the commit hash as name and then a symlink was updated to point the web root to this new release folder.

But, critically, he also configured CI to delete old releases at the end of the deployment pipeline - by removing all release folders older than three days. And by re-deploying a commit older than three days, after uploading the code and updating the symlink, the release‘s folder was considered old and deleted at the end of the pipeline, leaving the webserver with an empty directory as web root.
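
For anyone who hasn't seen the pattern, it looks roughly like this; a sketch with made-up paths and retention, where the fix is simply to exempt whatever the symlink currently points at from the cleanup:

  # Simplified symlink deploy; $COMMIT_SHA, the paths and the 3-day retention
  # are all illustrative.
  release="/var/www/releases/$COMMIT_SHA"
  ln -sfn "$release" /var/www/current

  current=$(readlink -f /var/www/current)
  # Delete old releases, but never the one currently being served.
  find /var/www/releases -mindepth 1 -maxdepth 1 -type d -mtime +3 \
    ! -path "$current" -exec rm -rf {} +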


Once, I was going to make a quick backup copy of /usr on a busy multi-user system. I did:

$ (cd /usr; tar cf - *) | (cd /mnt; tar xvpf -)

Only I typoed 'cd /mnt', which made the change of directory fail, and my current working dir was /usr, so I had one tar process archiving /usr and streaming the data to another tar process that used it to overwrite everything under /usr. After a few seconds of nothing appearing under /mnt I got really nervous I'd done something stupid. Then I realised what was happening and had a couple more seconds of hoping it'd be fine (given that I was overwriting /usr with its own data). Then the system log lit up like a Christmas tree with error messages and a few seconds later the machine froze. Can't recommend doing backups this way.
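
The usual defence is to chain the cd with && so a failed cd aborts that side of the pipeline instead of silently running tar in whatever directory you happen to be in:

  (cd /usr && tar cf - .) | (cd /mnt && tar xpf -)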


So during my last job, we were getting a new warehouse up and running, which involved a lot of mass inventory updates to get things moved around as quickly as possible; my boss and I were tasked with doing these updates in the database directly in order to bypass some really annoying checks in our warehouse management system that'd slow the process down to a crawl. I wrote up a T-SQL script that'd swap the inventories of two of our "pallet drop" locations, and each of us would accompany someone from the warehouse management team running the pallet jack to do the physical swaps (during which we'd then run the script with the relevant location IDs). During this, the new warehouse was already operational, so as we were swapping things around, warehouse pickers would be picking things from the very locations we're finagling (and part of the script I wrote handled updating the pickers' pick paths to ensure they didn't get sent on crazy goose chases).

So my boss and I are both running this script, and it goes well enough for a few minutes, until all of a sudden we start hearing people shouting "system's down!" throughout the warehouse. Everything's at a standstill, and at this point I start immediately combing through my script, wondering what the hell I could've fucked up. We also get on the phone with the WMS vendor; this being very much a "warranty void" situation, my boss and I didn't expect them to be able to help us, and the whole warehouse being down meant the clock was ticking to get this fixed.

At some point, the WMS vendor's resident SQL expert noticed that there were a bunch of uncommitted transactions that were deadlocking the DB. Turns out that when my boss copy-pasted the script I wrote out of my Slack message to him, he forgot to include the last line: "COMMIT TRANSACTION". We repeatedly COMMIT'd in his session for a few minutes until there was nothing left to commit, and then everything was working again.

So lesson learned: if you're gonna copy and paste some script, make sure you copy and paste the whole thing, lol


Early on in my career, I worked for a secret unit of a secret government law enforcement group that handled surveillance. Being young, full of verve, and not nearly as smart as I thought I was, I was always trying to improve things and tinker. Knowing nothing about networking, I plugged a switch into itself. Due to the configuration it knocked the entire surveillance network offline and everyone was freaking out. I was cool as a cucumber, because it couldn’t have been me. Must have been a coincidence right?

Right?

The sense of dread that dawned on me as the former Navy SEAL turned network engineer (and later doctor) started sniffing around the switch I had just touched was palpable. Luckily for me, he kept my mistake quiet and fixed it quickly.


As a new junior, remote, part-time hire, I got frustrated that my test suite wasn't passing on my laptop. So I checked out the codebase on a production server and ran my tests there.

The test suite initialized by truncating all tables and loading fixtures instead.

Using production database credentials.

Oops.


Well… did the tests pass in production, at least?


No, because the calls to format the Hadoop filesystem failed... (luckily).


> rewriting a repository to use an ORM, jOOQ in this particular case, with Java 8

> spend two weeks on this, write extremely meticulous tests, I am a junior/mid level engineer at this point in my career and this is a game changing ticket

> day of the switch approaches, sweating bullets

> switch happens, so far so good

> This was for a food delivery company so the volume of orders changes throughout the day (lunch, dinner, evening etc)

> at lunch time orders suddenly start disappearing, but eventually over time things go back to normal

> goes on like this for hours, at least a few

> tech team is confused ... why are these orders disappearing into nowhere

> senior engineer is suspicious and reviews my PR from the night before

> I forgot to remove `LIMIT 100` on the GET /order query when I was testing

Still makes me chuckle to this day


I was taking over a server that had software that was not in version control.

I created a git repository where I wanted to start versioning and started picking out the important non-cache files with individual git add X, but I accidentally added a folder I didn't want.

I quickly typed git reset --hard (as I would do working on the main production app, in some scenarios) and deleted most of the production system...

Thankfully I had a full backup on S3 (a hand-rolled backup tool), but I didn't have software to restore the files.

I turned to my right, where the CEO was sitting, and I said "I just accidentally deleted X, but it'll be back up in 30 minutes."

45 minutes later and after the fastest I've ever thrown something together I had the site restored.


When I was working for one of the biggest regional websites:

We had blue/green deployments using AWS Elastic Beanstalk - our actual deployment process was still manual and we had an informal checklist to follow (i.e. not written down, but everybody knew it) - one of the steps involved checking that capacities matched.

Well, everything looked good and I switched environments. As I was checking that the website was still up, I was getting 503s. I realised immediately that I had forgotten to actually scale up the environment that was now receiving traffic - 1 server was handling what we generally had a minimum of 30 to handle.

Immediately swapped back, scaled up, and swapped again. ~5 minutes of downtime in the morning.
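
These days the capacity check could live in the script itself; a hedged sketch with the AWS CLI, where the environment names are made up:

  # Compare instance counts before swapping CNAMEs.
  live=$(aws elasticbeanstalk describe-environment-resources \
    --environment-name prod-live \
    --query 'length(EnvironmentResources.Instances)' --output text)
  staged=$(aws elasticbeanstalk describe-environment-resources \
    --environment-name prod-staged \
    --query 'length(EnvironmentResources.Instances)' --output text)

  if [ "$staged" -lt "$live" ]; then
    echo "staged has $staged instances vs $live live - scale up first" >&2
    exit 1
  fi
  aws elasticbeanstalk swap-environment-cnames \
    --source-environment-name prod-live \
    --destination-environment-name prod-staged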


Not too bad. Unless you have a contract or regulations that says otherwise, 5 mins downtime once in a while is really nothing to lose your sleep about. Did you even have to report it?


I honestly don't remember, but i definitely remember the checklist became more formal as we made a lot of errors like this.

I left soon afterwards but the stack was also a transition-to-AWS stack so was later moved to something more suited.

5 mins downtime was definitely small for that place - we regularly had larger issues. I learned a phenomenal amount in my time there though.


Years ago, my employer was light on funds, so we cobbled together plugs to use as loopbacks when testing and identifying network jacks. Insert the plugs in cubicles, then test the open ports in the wiring closet. It worked great many times, until one day we plugged one in and went to lunch. When we came back, we were told the network had slowed to a crawl and captures showed floods. This was during the days of primitive DoS via broadcast floods. Well, this flood was self-induced. The loopback plug had been inserted into a jack that had a connection back to the network hub. It dutifully retransmitted everything it saw back onto the network. Whoops.


Exact same scenario, but it was connected to the on-premises data center. I was imaging devices and figured, why not use a switch and provision 4 at a time. I started seeing everything go down and figured the network team was doing things. I checked my images and they had stopped; I unplugged a cord and everything was working. I didn't think anything of it, plugged it back in, and went back to waiting for the images to deploy - but the storm started. After about 5 hours, customers started receiving their electricity again, after the network team found my device.


My colleague asked me to run crontab -l, which I misheard as crontab -r (a switch I was unaware of, which instantly removes all cron jobs).

I still wonder why crontab has that switch? Who needs that?

We had to piece together all scheduled commands from run logs.
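
Cheap insurance against that one is to snapshot the crontab regularly (or keep it in version control) so -r is only an annoyance; the file names here are just examples:

  # Back up the current user's crontab before touching it...
  crontab -l > ~/crontab.backup.$(date +%F)
  # ...so recovering from an accidental `crontab -r` is just:
  crontab ~/crontab.backup.2021-06-26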


Done this one fat-fingered on production. Wanted crontab -e, did crontab -r. That was a fun night.


I started working at an e-commerce hosting company less than a year ago with very little experience. I mean, other than a C programming course in university I had absolutely no experience, so I literally had to learn everything from the beginning, starting with using Linux, then databases and some programming languages and more... but I was really happy that it was working out.

A few months in, while working on my second SQL export, which I developed in a test environment that had the schema of the company's database, the test environment froze and then the whole operating system of the laptop froze too. I rebooted, not thinking anything of it, given that my work laptop had done that a few times already. I then proceeded to test my export against the live system, thinking I just needed more computing power - after all, what's the worst that can happen!? Sure enough, the web interface of the database froze too, at which point I went back to developing the export in the test environment.

Some 40 minutes later I got a question from the senior programmer at the time asking if I had done anything on the live database, because the whole thing had frozen up and with it no customer service was possible for that time. Thankfully he had already fixed it by rebooting the system, and he proceeded to tell me that the export I had tried had more possibilities than the number of atoms on Earth... so yeah, I did need more computing power.


One of a prior boss's war stories was that our CTO once logged into the main production server and did an 'rm -rf '. This was back when it'd actually delete everything. This was also back before most companies had backup prod servers.

From what I understand, it was a long weekend bringing that server up from tape backups.

The funniest thing to me wasn't that it happened. It was who it happened to. The CTO has created (and was granted patents for) multiple utilities still used very frequently in nix environments today.


Most of these stories seem to be from 20+ years ago. Do newer sysadmins just not make mistakes anymore? Or is 20 years the timespan that is needed to get over the embarassment? :-)


This is actually a great question.

I suspect there are multiple reasons, but the increased reliance on cloud DevOps is likely one of them. Truth is that far fewer companies roll their own critical infrastructure these days.


The mistakes made in these comments led to process improvements. Also, the price of computing has come down substantially such that you can actually build highly available and fault tolerant systems.

New sysadmins make mistakes, sure. But they're typically not as critical because recovery processes and correct system architecture insulate the admin from making a mistake with such catastrophic results.


I'm pretty sure it's the latter.


Very first job, as an intern, I was tasked with building a "free text search engine" for the product, using their API. Maybe my first week or so there, I left a script running over lunch. It turns out the internal IP addresses weren't subject to the rate limiting, and my script's queries were growing exponentially (I was sending the response back to the same endpoint, which queried with the response and gave me back a larger response, etc.). Within 20-30 minutes or so, every production machine was stuck running one of my queries. And it happened on the day that the engineering team was taking the new intern out for a team lunch...

At the time I was mortified, but in hindsight the fact that I was able to do that in the first place was really the issue, not my script.


At a $job, I took down prod once and restored it twice after the database got emptied. Well, we didn't quite empty it - I imported an old dump into prod instead of a test database while trying to replicate an issue. My coworker, on the other hand, was running some script he had written to truncate a few tables... and ended up running TRUNCATE individually on every single table in the database.

They've been good tests of our backup systems, actually. In fact, one of the incidents revealed that our on-site backups had been broken for ~a couple of weeks. (Our off-sites use a different backup system and were fine, but we restored that gap from binlog instead, as they were still present locally and it was faster than the 100Mbps upload from our off-site...)

Each one was 15-20 minutes of downtime.
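
The binlog trick is worth knowing; a minimal sketch, assuming the binlogs covering the gap are still on disk (the file names and timestamps are made up):

  # Replay everything between the last good backup and the incident.
  mysqlbinlog --start-datetime="2021-06-01 00:00:00" \
              --stop-datetime="2021-06-14 09:30:00" \
              /var/lib/mysql/binlog.000123 /var/lib/mysql/binlog.000124 \
    | mysql -u root -p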


I had a product that had mostly stopped growing in usage. It was running on say fifty machines. I had put considerable effort into some memory optimizations, which was the scaling point for new hardware, so I talked Ops into bunching the active traffic load onto fewer machines. All of the active traffic load. Started hitting the memory limits (32 bits linux) and our server framework exited on malloc failure, so lots of exiting of long lived processes and loss of expensive state, delayed alerts etc.

I still think it was over-provisioned, but they told Ops to stop listening to me unless someone else agreed. Probably ran on the 50 machines till it was discontinued 10 or 15 years later, but I left so who knows.


I'm one of the two people in a two-person startup; I mostly handle the technical stuff. I woke up to see some error on our main work machine, which also hosts some of the services, after a sudo apt update. It wasn't the first time, and I was usually able to fix it by just googling the error, finding it on SO, and running the suggested fix. Did it again, it told me to uninstall the nvidia drivers, I proceeded to do so, and bricked my hard disk. It was completely gut-wrenching. Although most of the important stuff was backed up in repos, I still had to rebuild the damn thing. Still not sure what happened exactly to this day.


Could have been coincidental- the drive was about to die and that triggered it.


Very curious what you mean by "bricked the hard disk"?


After the BIOS screen, the screen would go black as it tried to load what I assume is the OS.

Then after a really long time an error message shows up, something like Acpi error, namespace lookup failure, Ae_not_found

Then it goes back to bios and tries again.

What I suspect happened is that removing the nvidia driver led to some sort of circular dependency or lock on the system. This was when Ubuntu 20 first came out, and official Ubuntu 20 CUDA and nvidia drivers weren't out yet, so I was using the ones for Ubuntu 18. Never figured it out...


Ahhhh, I see what you mean now.

First of all, that tiny little detail about the ACPI error is actually incredibly helpful: it's one of the few messages that tend to still leak onto the screen when the system is configured to boot in quiet mode. Thus, Linux was actually partly booting 100% fine.

If the system was then just automatically resetting after a bit, that definitely sounds like a driver fault, and if you were still on the Ubuntu 18 drivers it sounds completely reasonable (for proprietary values of "reasonable" ._.) that you'd encounter a kernel panic or hardware lockup/reset or something like that.

--

I was curious why that ACPI error message leaked onto the screen, and presumed/guesstimated it was because it was being printed with a high log level/priority. I decided to go digging to see if my theory was correct.

Thanks for the verbatim quote, "namespace lookup" found the source of the message immediately: https://github.com/torvalds/linux/blob/5bfc75d92efd494db37f5.... So this uses acpi_os_printf() (defined at https://github.com/torvalds/linux/blob/5bfc75d92efd494db37f5...), a va_args thunk to acpi_os_vprintf() (defined immediately after), which... does a few things. It's honestly going to be shorter to just

  #ifdef ENABLE_DEBUGGER
    if (acpi_in_debugger) {
      kdb_printf("%s", buffer);
    } else {
      if (printk_get_level(buffer))
        printk("%s", buffer);
      else
        printk(KERN_CONT "%s", buffer);
    }
  #else
    if (acpi_debugger_write_log(buffer) < 0) {
      if (printk_get_level(buffer))
        printk("%s", buffer);
      else
        printk(KERN_CONT "%s", buffer);
    }
  #endif
there we go.

This is weird: it uses different paths if kdb support is compiled in. If it is, it'll only ever use printk() functions, but if it's not, it tries calling acpi_debugger_write_log() first and only does printk() things if that returns < 0.

The printk_get_level() thing, added in 2016 (https://github.com/torvalds/linux/commit/abc4b9a53ea8153e0e0...), checks to see if the last line of text was a continuation line, and only starts a continuation line if the last line wasn't one. (Orthogonally relevant: https://lwn.net/Articles/732420/ coincidentally happened a year later)

I think that acpi_debugger_write_log() (https://github.com/torvalds/linux/blob/master/drivers/acpi/o...) is just a circular buffer sink. It dispatches via a function pointer to acpi_debugger.ops->write_log ("oh no, where does that go"); LXR to the rescue, which cross-references (https://lxr.missinglinkelectronics.com/linux/drivers/acpi/os...) (via the tiny usage link) to https://lxr.missinglinkelectronics.com/linux/drivers/acpi/ac..., which is... just a circular buffer writer (https://github.com/torvalds/linux/blob/master/drivers/acpi/a..., https://github.com/torvalds/linux/blob/master/drivers/acpi/a...). Huh.

If there's something spinning in the background continuously flushing the contents of the ACPI buffer to the screen, I have no idea how I'd surface that. But in terms of this particular call graph, I think the only potentially-interesting area is actually the KERN_CONT mechanism itself. I was fascinated to learn that the message prefix system actually works by writing { 0x01, <character> } into the buffer (https://lxr.missinglinkelectronics.com/linux/include/linux/k...), where continuation lines are marked using "c". Interesting.

Now I'm wondering, if the last message to be printed to the console had a proper level and all, and the next line was a continuation line... what level does it get? I now see that the chances are this is not the reason why it leaks onto the screen, but that was actually my first thought.
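Tangentially: you can poke at the level/console-threshold machinery from userspace without going anywhere near ACPI. A quick sketch (the "<N>" prefix accepted by /dev/kmsg plays the same role as the in-kernel KERN_* SOH prefixes; nothing here is specific to this bug):

  cat /proc/sys/kernel/printk                          # the four printk loglevel knobs (console loglevel first)
  echo "<3>test: pretend error" | sudo tee /dev/kmsg   # level 3 (err): shows on the console if 3 is below the console loglevel
  echo "<7>test: pretend debug" | sudo tee /dev/kmsg   # level 7 (debug): normally filtered off the console
  sudo dmesg --level=err | tail -n 3                   # confirm the first message landed at the expected level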

I'm still learning/limping/stumbling through understanding all this, so this was just poking around for fun/practice :)

Practically speaking I do generally prefer an Absolutely Blank Screen™ while the system is booting, save for what I put on it :). To that end, the nuclear option is to add "fbcon=map:1", which basically reroutes the console to /dev/fb1 (assuming you don't have an fb1, aka a 2nd screen :) ). But given that this literally gives you no console at all, it's not great for everyday usage (and unfortunately not great for many embedded scenarios where RS232 or network access would be trickier than just switching to a console on a ~VGA display). Hence my interest in seeing whether it is in fact possible to squirrel away all the text, but still have a functioning CTRL+ALT+F1 et al.

--

Also - when you're in the GRUB menu (which you can usually show by spamming ESC nonstop immediately after POST, if it doesn't automatically sit at the menu for a couple of seconds), hit 'e' to edit the selected item, find in the wall of text the part of the 'linux' line (or 'kernel' line, on older GRUB) that says "quiet" and/or "loglevel=...", replace it with "loglevel=9 verbose debug", then hit CTRL+X to boot the modified entry. This can be made permanent by editing /etc/default/grub - then DON'T FORGET :) to run `sudo update-grub` afterwards.
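Concretely, the permanent version looks something like this (values illustrative; distro defaults vary):

  # /etc/default/grub
  GRUB_CMDLINE_LINUX_DEFAULT="loglevel=9 verbose debug"   # in place of the usual "quiet splash"
  GRUB_TIMEOUT=5                                          # keep the menu around for a few seconds

  # then regenerate the config:
  sudo update-grub                              # Debian/Ubuntu wrapper
  sudo grub-mkconfig -o /boot/grub/grub.cfg     # equivalent where update-grub doesn't exist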

Generally you can effectively learn how to play with this in a VM, since once you're in GRUB most things are identical to real hardware. The majority of default configurations have like a 5-30 second timeout as well, so it's conveniently less necessary than it used to be to identify the exact nanosecond to start mashing the keyboard...


It's pretty amazing how you got all that from my simple hint. I also had a more detailed question about this on Stack Overflow, but it seems to have been deleted for lack of responses. I have screenshots too if you happen to be interested, but by all means don't let me take up your time.


Feel free to email them over, sure. (Full disclosure, I'm sometimes terrible with reply latency.)

And FWIW, you actually tripped over a small longer-term cluster of internal grumbling - I like my screen to be completely blank at startup, and I've stared unimpressed at AE_NOT_FOUND-like messages (including that exact string) on my own system before I get to a login prompt.

So what kinda started out as "hmm, that's probably a misclassified priority or something..." ended up as a small wall of GitHub torvalds/linux links. Woops. (And then I didn't figure it out anyway... hmph)


Changed the outbound firewall rules for our production server, and suddenly it couldn't speak to MongoDB Atlas. We definitely had rules to allow Mongo traffic, but somehow it still wasn't working. We noticed that every time a connection to Mongo was made, some connections also went out to a CDN on port 80. That looked pretty suspicious until we discovered it was just OCSP, which verifies certificate revocation status. So the traffic was ultimately going to the cert issuer - Let's Encrypt, who had issued Mongo's cert.
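If anyone else hits the same head-scratcher: you can see exactly where a server's cert will send OCSP checks by reading it out of the cert itself (hostname below is illustrative, not our actual cluster; the responder is typically plain HTTP on port 80):

  openssl s_client -connect cluster0.example.mongodb.net:27017 \
      -servername cluster0.example.mongodb.net </dev/null 2>/dev/null \
    | openssl x509 -noout -ocsp_uri      # prints the cert's OCSP responder URL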


Heard a few years back that someone accidentally deleted everything across all AWS accounts at our company; we had to reach out to AWS and they helped recover everything. Took 6-12 hours to recover.


A similar thing happened on a previous team: one of the engineers was experimenting with Ansible, and apparently set up some command that terminated all instances in a VPC... he ran it for production. Suddenly I start getting alerts about unavailable instances and whatnot. Fortunately, most of the important servers had "termination protection".

Another time, one of the engineers was doing some analysis on the production MongoDB database, and found it easier to set up a tunnel to port 27017 on his computer to connect to the DB directly. While multitasking, he decided to run the test suite for some changes he was making to one of the programs... but that test suite's first step was to delete all collections from your local Mongo database. Next morning when I come in, the Ops people tell me that "the system is empty". My first thought was "What do they mean the system is empty? Maybe one of the dashboards failed to update or something". Imagine my surprise when I logged in and saw what had really happened... oh boy. We spent like 5 or 6 hours getting everything back. Fortunately a combination of backups and the MongoDB replication logs allowed us to recover everything.
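(For the curious, that style of recovery looks roughly like the sketch below; host, paths and the timestamp are purely illustrative, and the exact steps depend on how the oplog was captured:)

  # 1. Restore the last nightly dump
  mongorestore --host prod-mongo:27017 /backups/nightly/latest

  # 2. Replay the dumped oplog up to just before the accidental wipe
  mongorestore --host prod-mongo:27017 --oplogReplay \
      --oplogLimit "1624700000:1" /backups/oplog_dump   # directory containing oplog.bson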


$ rm -rf /var/mysql/data

*Thinks for 5 seconds...is this prod? ....@$1£!? ... ctrl + c....should I tell anyone that just happened....walks over to line manager...*

I did not taste anything that day.


Hopefully there was a backup...?


There was a live read-only replica, which we were able to switch to within 20-30 mins of the issue. Then we snapshotted that and restored it to the master later that night. The CTO was on holiday at the time, so writing that email to him was brutal.


Slightly off topic, but I think in most cases responsibility for accidentally bringing down production lies with management.

However I’ve only ever heard stories where management lays the blame.


Where I have worked, it's devs and ops who bork the systems and then have to set about fixing them, while management run off to face the customer. I wouldn't change places with management for love nor money.

You see a lot of mea culpa outage stories on HN, many written by management. If you have never read one, you should. The outcome of a mistake is a lesson learned.

Although it is also fun to share fail stories just for the sheer scale of the fsck-up. ;)


Those old enough to have used the original Visual SourceSafe will remember that "Get" and "Delete" were right next to each other (edit: or at least very close) in the context menu of the code tree.

My first week at a game studio, I right-clicked the root of the tree and... yep... chose delete. And it "just worked".

Obviously not my fault alone, but yeah, that was a pretty rough first week.


Not mainnet, but a production testnet used by hundreds of people on a blockchain I worked for. I noticed that a certain type of transaction returned funds to me, so I set up a script to see how far it could go; next stop was mainnet. Eventually my script was blocked by the testnet relayer, so I asked if we had a blacklist set up.

The response was, "No dude, but you just blacklisted everyone."


The partition with the mSQL database on it filled up, so I moved the database to /tmp. On a Solaris box. Which rebooted some weeks later. (On Solaris, /tmp is swap-backed tmpfs: fast, and empty after every reboot.)


Can't tell you how many WebLogic "won't start"s I fixed over the years, caused by people staging EARs and WARs in /tmp on Solaris.


But it ran much faster until the reboot... LOL


Incremental backups during the week, full backups on weekends. The operators doing the backups didn't know the weekday runs were incremental, and over time only two tapes were dedicated to them. This was just an ad newspaper and they lost 2 days of ads. As it was only a local failure, they just ran at a loss for two days and the business didn't suffer.


I gave my boyfriend at the time root access to my personal server

In a sleepy haze he accidentally ran a chown -R him:him / && chmod -R 777 /

Everything basically imploded on itself. I had to boot from a LiveCD and slowly correct the chmod/chown of every file and folder, using my local machine as a permission reference structure
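A minimal sketch of that kind of reference-based repair, assuming a healthy machine with the same layout and GNU find (paths illustrative; symlinks and special files need extra care):

  # On the reference machine: record mode, owner and group for every file
  find / -xdev -printf '%m %u %g %p\0' > perms.ref

  # On the broken machine (chrooted into its root, or with $path prefixed by the mountpoint):
  while IFS=' ' read -r -d '' mode user group path; do
    chown "$user:$group" "$path" && chmod "$mode" "$path"
  done < perms.ref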


I ran a bunch of scripts on some aux boxes, and Tokyo stopped working. Can't talk details... sadly


Reading through these stories, every one of which is fascinating, I’ve never been more glad that my job does not involve doing anything where the consequences could be summed up as “Tokyo stopped working.”


You mean the city stops functioning?


No, but that's how I think about it. The internet had a loss of availability in that region.


When I worked in adtech, we would load ads from a sqlite file. I rolled out a schema change but forgot to update the code which retrieved query data by column index rather than name.

We served a few million ads that contained nothing but the text "255". It was quite expensive.


Doom on a single PC was fun. But the same game on our network, with all those broadcasts flooding the net that hosted a university's PCs, was bad. And I started it several times before I noticed that "I" was the cause of the network slowdown.


The best part was that a sysop entered the room and within a second knew that I was the problem and that I was playing Doom. He just told me, "If you wanna play games instead of studying, then play Doom 2. That one doesn't broadcast."


He knows, he plays at night ;)


I accidentally dropped the production database thinking it was a development environment. I actually remember thinking "Stupid MySQL, stop asking me to confirm. I know what I'm doing!" We restored from nightly backup but lost the day's data.


Not me, but someone pinged our Slack chat asking us to "please revert".

We didn't realize what the issue was until he admitted that he had run an UPDATE on the full user table (forgot a WHERE clause), and every single user's email was now being funneled into his account.


This is a fairly common theme in discussions about outages and production incidents. I'm surprised that, at this point, the various SQL servers have no simple ACL to the effect of "can run UPDATE, but must include a WHERE clause" or "can run UPDATE constrained to x rows, and error out if more match". Obviously there are times when updating everything is intended, but I think "UPDATE ALL ROWS" is a pretty rare requirement.
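MySQL at least ships a rough client-side version of this; it's opt-in rather than an ACL, but it rejects UPDATE/DELETE statements that have neither a key-based WHERE clause nor a LIMIT (connection details below are illustrative):

  # Per connection, from the client (also spelled --i-am-a-dummy):
  mysql --safe-updates -u app -p app_db

  # Or inside an existing session:
  #   SET SESSION sql_safe_updates = 1;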


I didn't do this personally, but one enterprise I worked at ran their production DNS off an old Sun Ray desktop sitting on a random desk in the computer room. Eventually somebody who didn't think it was being used for anything just yanked the plug.


When I started as an intern, I couldn't access a file under /etc (or some other privileged top-level directory), so I needed to sudo cat or something each time I wanted to read it. Then I decided to chmod 777 /. The machine couldn't boot up anymore...


Not me, but I had to wake up at 3:00am and fix it. Someone ran "User.destroy_all" in a production Rails console. This was a social network for a 'well known sports agency'. Production consoles are a sharp tool. Be careful who you hand them to.


I was working on Nationbuilder, a horrible all-in-one master-of-none thing, back in 2016 or so, and ran so many concurrent tasks in our BE that it started affecting all of their other clients. Waking up an engineer in CA was fun


That thing is a pile of absolute dog shit. New up and coming UK political party are using it - I volunteered to help with setting up / maintaining their tech stack, realized within 5 minutes all their eggs were in that godforsaken basket and changed my mind.


I locked GitHub’s user table in the middle of the workday by adding a column. Oops.


While working at one of the top 3 global airlines (around 2015), I deployed an experimental feature that streamed real-time airport indoor location data (activated upon entering a geo-fence) from the airline's iOS mobile app, which was used by hundreds of thousands of customers daily.

The setup was: mobile app -> detect beacon & ping web endpoint with customer-id + beacon-uuid -> WAF -> web application -> internal firewall -> Kafka cluster -> downstream applications/use cases.

It was an experiment; I didn't have high expectations for the number of customers who'd opt in to sharing their location. The 3-node Kafka cluster was running in a non-production environment. The location feed was primarily used for determining flow rates through the airport, which could then predict TSA wait times, provide turn-by-turn indoor navigation, and provide walk times to gates and other POIs.

About a week in, the number of customers who enabled location sharing ballooned, and pretty soon we were getting very chatty, high-volume traffic. This wasn't an issue, as resource utilization on the application servers, and especially on the Kafka cluster, stayed very low. As we learned more about user behavior, movement patterns and the application, the mobile team worked on a patch to reduce the number of location pings and only transmit deltas.

One afternoon, I upgraded one of the Kafka nodes and, before I could complete the process, had to run to a meeting. When I came back about an hour later and started checking email, Sev-2/P-2 notifications were being sent out due to a global slowdown of communications to airports and flight operations. For context, on a typical day the airline scheduled 5,000 flights. As time went on it became apparent that this was really a Sev-1/P-1 that had caused a near ground stop of the airline, but the operations teams were unable to communicate or correctly classify the extent of the outage because their internal communications had also slowed to a crawl. I don't usually look into network issues, but I joined the incident call to see what was happening. From the call I gathered that a critical firewall was failing because its connections were maxed out, and restarting the firewall didn't seem to help. I had a weird feeling, so I logged into the Kafka node I had been working on and started the services on it. Not even 10 seconds in, someone on the call announced that the connection count on the firewall was coming down, and another 60 seconds later the firewall went back to humming as if nothing had happened.

I couldn't fathom what had happened. It was still too early to determine whether there was a relationship between the downed Kafka node and the firewall failure. The incident call ended without identifying a root cause, but teams were going to start on that soon. I spent the next 2 hours investigating, and here is what I discovered. The ES/Kibana dashboard showed that there had been no location events in the hour before I started the node. Then I checked the other 2 nodes in the Kafka cluster and discovered that, this being a non-prod env, they had been patched by the IT-infra team during the previous couple of days, and the ZooKeeper and Kafka services hadn't started correctly. Which meant the cluster had been running on a single node, and when I took that node offline, the entire cluster was offline. I talked to the web application team who owned the location service endpoint and learned that their servers communicated with the Kafka cluster via the firewall that had the issue. Furthermore, we discovered that the Kafka producer library was set up to retry 3x on any connection issue to Kafka. It became evident that the Kafka cluster being offline caused the web application cluster to generate a flood of retry connections that DDoS'd the firewall.
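(The nasty part is how quiet that failure mode is. The kind of post-patch sanity check that would have caught it looks roughly like this; paths and hosts are illustrative, and the exact flags depend on the Kafka version:)

  systemctl status zookeeper kafka        # did the services actually come back after patching?
  # Which brokers answer at all?
  /opt/kafka/bin/kafka-broker-api-versions.sh --bootstrap-server kafka1:9092
  # Any partitions running without their full replica set?
  /opt/kafka/bin/kafka-topics.sh --bootstrap-server kafka1:9092 \
      --describe --under-replicated-partitions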

Looking back, there were many lessons learned from this incident beyond the obvious ones like better isolation between non-prod and production envs. The affected firewall was replaced immediately and some of the connections were re-routed. Infra teams started doing better risk/dependency modeling of the critical infrastructure. On a side note, I was quite impressed by how well a single Kafka node performed and the amount of traffic it was able to handle. I owned up to my error and promptly moved the IoT infrastructure to the cloud. In many projects that followed, these lessons were invaluable. Traffic modeling, dependency analysis, failure-scenario simulation and blast-radius isolation are etched into my DNA as a result of this incident.



... was accidentally running critical routing for a billion-dollar website from my workstation once, for about 15 minutes, before I could fix it all.


How about reading all the anecdotes posted not even two weeks ago about the HBO Max email incident?

c'mon

https://news.ycombinator.com/item?id=27546017


Good friend of mine once fat fingered something like: sudo chown -R 777 /


Mine is boring: I stopped the wrong EC2 instance.



