My $500M Mars rover mistake (chrislewicki.com)
1028 points by bryanrasmussen 10 months ago | 343 comments



Really well written story.

As a software engineer, I have a couple stories like this from earlier in my career that still haunt me to this very day.

Here's a short version of one of them: About 10 years ago, I was doing consulting work for a client. We had worked together for months to build a new version of their web service. On launch day, I was asked to do the deployment. The development and deployment process they had in place was awful and nothing like what we have today; just about every aspect of it was manual. Anyway, everything was going well. I wrote a few scripts and SQL queries to automate the parts I could, and they gave me the production credentials for when I was ready to deploy. I decided to run what you could call my migration script one last time, just to be sure I was ready. The moment I hit the Enter key, I realized I had made a mistake: I had updated the script with the production credentials right before deciding to do another test run. The errors started piling up and their service became unresponsive. I was 100% sure I had just wiped their database, and I was losing it internally. What saved me was that one of their guys had completed a backup of their database only a couple of hours earlier in anticipation of the launch; in the end they lost a tiny bit of data, but most of it was recovered from the backup. Ever since then, "careful" is an extreme understatement for how I interact with database systems, and production systems in general. Never again.


Your excellent story compelled me to share another:

We rarely interact directly with production databases as we have an event sourced architecture. When we do, we run a shell script which tunnels through a bastion host to give us direct access to the database in our production environment, and exposes the standard environment variables to configure a Postgres client.

Our test suites drop and recreate our tables, or truncate them, as part of the test run.

One day, a lead developer ran “make test” after he’d been doing some exploratory work in the prod database as part of a bug fix. The test code respected the environment variables and connected to prod instead of docker. Immediately, our tests dropped and recreated the production tables for that database a few dozen times.


Verbatim from my current code:

    if strings.Contains(dbname, "prod") {
        panic("Refusing to wipe production database!")
    }
    Truncate(db)


Ours are not named with a common identifier, this approach needs constant effort to maintain through refactoring, and there's still scope for a mistake.

*Ideally* devs should not have prod access at all, or their credentials should have only limited access, without permissions for destructive actions like DROP/TRUNCATE etc.

But in reality, there's always that one helpful dba/dev who shares admin credentials for a quick prod fix with someone and then those credentials end up in a wiki somewhere as part of an SOP.


That's why you do credentialing via SSH keys: each key has an explanation and maps to a user, and non-DBA keys should expire.

If you need access for a quick prod fix, your key gets added to the machine with that explanation and a one-week (or shorter) lifetime.


I also have a table with one row in it indicating whether the database is prod.
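
A minimal sketch of that idea (table, column, and helper names here are hypothetical), assuming a standard Python DB-API connection and failing closed if the marker row is missing:

    def assert_not_production(conn):
        # Read the single-row marker table before doing anything destructive.
        cur = conn.cursor()
        cur.execute("SELECT is_production FROM environment_info")
        row = cur.fetchone()
        # Fail closed: a missing marker row is treated the same as production.
        if row is None or row[0]:
            raise RuntimeError("Refusing to run destructive operations: this looks like production")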


I've added a similar safety to every project. It's not perfect, but this last line of defense has saved team members from themselves more than once.

For Django projects, add the below to manage.py:

    # e.g. TEST_PROTECTED_ENVIRONMENTS = {"production", "staging"}, defined earlier in manage.py
    env_name = os.environ.get("ENVIRONMENT", "ENVIRONMENT_NOT_SET")
    if env_name in TEST_PROTECTED_ENVIRONMENTS and "test" in sys.argv:
        raise Exception(f"You cannot run tests with ENVIRONMENT={env_name}")


I think runtime checks like this using environment variables are great. However, what has burned me in the past when debugging problems is not knowing what the environment actually was at the time the logs were produced. So when the test-protected-environments variable needed to be updated, I might have a hard time tracking that back.
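
One way to mitigate that, sketched here with hypothetical names: emit the resolved environment once at startup so the logs themselves record what the guard saw at the time.

    import logging
    import os

    logger = logging.getLogger(__name__)

    def log_resolved_environment():
        # Record the environment the process actually resolved, so later debugging
        # doesn't depend on reconstructing what the variable was set to back then.
        env_name = os.environ.get("ENVIRONMENT", "ENVIRONMENT_NOT_SET")
        logger.info("Resolved ENVIRONMENT=%s at startup", env_name)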


Everybody replying to you that this is fragile is missing the point. This kind of code isn't the first line of defense—it's the last.


Exactly: it's layers of prevention rather than being just one screwup away.


And when your last line of defense fires... you don't just breathe a sigh of relief that the system is robust. You also must dig into how to catch the problem sooner, in your earlier lines of defense.

For instance, test code shouldn't have access to production DB passwords. Maybe that means a slightly less convenient login for the dev to get to production, but it's worth it.


Yup, I have 3 prompts if you want to wipe anything.

That's one of the reasons I put interactions with databases behind a CLI.


This is bad because if someone forgot to include "prod" in the name, or for whatever reason the code executed beyond the panic, you'll wipe out the db.

There is no code that will protect your db/data. Only replication to a read-only storage will help in such situations.


If code is executing past a panic, I think it is unlikely that you can trust the integrity of your database anyways.


But what if you have a cron job that auto-replicates and then deletes everything after you forward it?


And then it turns out that the order of the parameters was mixed up... just kidding.


Just yesterday, I did a C# Regex.Match with a super simple regex, ^\d+, and it seemed not to work. I asked ChatGPT and it noted that I had a subtle mistake: the parameters were the other way around... :facepalm:


That's indeed a drawback of function call syntax compared to method call syntax where the object comes before the name of the method.


#metoo


We had this - 10 years ago. In our case there was a QA environment which was supposed to be used by pushing code up with production configs, then an automated process copied the code to where it actually ran _doing substitutions on the configs to prevent it connecting to the production databases_. However this process was annoyingly slow, and developers had ssh access. So someone (not me) ssh'd in, and sped up their test by connecting the deploy location for their app to git and doing a git pull.

Of course this bypassed the rewrite process, and there was inadequate separation between QA and prod, so now they were connected to the live DB; and then they ran `rake test`...(cue millions of voices suddenly crying out in terror and then being suddenly silenced). The DB was big enough that this process actually took 30 minutes or so and some data was saved by pulling the plug about half-way through.

And _of course_ for maximum blast radius this was one of the apps that was still talking to the old 'monolith' db instead of a split-out microservice, and _of course_ this happened when we'd been complaining to ops that their backups hadn't run for over a week and _of course_ the binlogs we could use to replay the db on top of a backup only went back a week.

I think it was 4 days before the company came back online; we were big enough that this made the news. It was a _herculean_ effort to recover this; some data was restored by going through audit logs, some by restoring wiped blocks on HDs, and so on.


Our test suite expects that the database name has a `_test` suffix, so you can't run the tests even locally without the suffix.
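
A guard like that can be a few lines at the top of the test bootstrap; here is a minimal sketch with a hypothetical helper name:

    def assert_test_database(dbname):
        # Fail fast if the configured database doesn't follow the _test naming convention.
        if not dbname.endswith("_test"):
            raise RuntimeError(f"Refusing to run tests against {dbname!r}: name lacks the _test suffix")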


Our test harness takes an optional template as input and immediately copies it.

It’s useful to distribute the test anyway, especially for non-transactional tests.

If the database initialisation is costly, that's useful even if tests run on an empty schema, as copying a database from a template is much faster than creating one DDL statement by DDL statement, for Postgres at least.
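
For what it's worth, a minimal sketch of the template-copy approach, assuming psycopg2 and hypothetical database names; Postgres copies the template's files directly instead of replaying DDL:

    import psycopg2

    admin = psycopg2.connect(dbname="postgres", host="localhost", user="postgres")
    admin.autocommit = True  # CREATE DATABASE cannot run inside a transaction block
    with admin.cursor() as cur:
        # Each test run gets a throwaway copy of the pre-built template database.
        cur.execute("CREATE DATABASE test_run_1 TEMPLATE app_test_template")
    admin.close()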


(distribute as in parallelise, possibly across multiple machines)


Our test suite uses a DB user that exists in the Docker DB but not in prod, so dropping the prod database cannot happen.


This is why I always delete by ID when cleaning up after tests.


At a place I was consulting at about 10 years ago, one of the internal guys on another product dropped the prod database: he was logged into his dev db and the prod db at the same time in different windows, and he dropped the wrong one. Then, when they went to restore, it turned out the backups hadn't succeeded in months (they had hired consultants to help them with the new product for good reason).

Luckily the customer sites each had a local db that synced to the central db (so the product could run with patchy connectivity), but the guy spent 3 or 4 days working looooong days rebuilding the master db from a combination of old backups and the client-site data.


> logged into his dev db and the prod db at the same time in different windows

I am very worried about doing the wrong thing in the wrong terminal, so for some machines I colour-code my ssh windows: red for prod, yellow for staging, and green for dev. E.g. in my ~/.bashrc I have:

    echo -ne '\e]11;#907800\a'  # yellow background


This is a good idea, especially when tired, end-of-day, crunch-time work is happening!


About 10 years ago I literally saw the blood drain from a colleague's face as he realised he had dropped a production database because he thought he was in a dev environment.

A DBA colleague sitting nearby laughed and had things restored back within a few minutes....


Isn't that almost exactly what happened at github too?


This happened to Gitlab.


Yup.


Anecdote: I ran a migration on a production database from inside Visual Studio. In retrospect, it was recoverable, but I nearly had a heart attack when all the tables started disappearing from the tree view in VS…

…only to reappear a second later. It was just the view refreshing! Talk about awful UI!


Around 15 years ago, I was packing up, getting ready to leave for a long weekend, when one of our marketing people I was friends with came over with a quick change to a customer's site.

I had access to the production database, something I absolutely should not have had but we were a tiny ~15 person company with way more clients than we reasonably should have. Corners were cut.

I wrote a quick little UPDATE query to change some marketing text on a product, and when the query took more than an instant I knew I had screwed up. Reading my query, I quickly realized I had run the UPDATE entirely unbounded and changed the description of thousands and thousands of products.

Our database admin with access to the database backups had gone home hours earlier as he worked in a different timezone. It took me many phone calls and well over an hour to get ahold of him and get the descriptions restored.

The quick change on my way out the door ended up taking me multiple hours to resolve. My friend in marketing apologized profusely but it was my mistake, not theirs.

As far as I remember we never heard anything from the client about it, I put that entirely down to it being 5pm on Friday of a holiday weekend.


That's why I always write a BEGIN statement before executing updates and deletes. If they are not instant or don't return the expected number of modified rows, I can just roll back the transaction.
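
For concreteness, a sketch of that habit in script form, assuming psycopg2 (whose connections do not autocommit by default, so the UPDATE stays in an open transaction until an explicit commit); the table and connection details are hypothetical:

    import psycopg2

    conn = psycopg2.connect(dbname="app", host="db.example.internal")
    with conn.cursor() as cur:
        cur.execute(
            "UPDATE products SET description = %s WHERE id = %s",
            ("New marketing text", 12345),
        )
        # Nothing is committed yet; check the blast radius before making it permanent.
        if cur.rowcount == 1:
            conn.commit()
        else:
            conn.rollback()
            raise RuntimeError(f"Expected 1 row, UPDATE touched {cur.rowcount}; rolled back")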


That, and I start the line with /*, write the where clause first, and immediately before I execute the query I check the db host.

Oh, and I absolutely refuse to do anything but the most critical stuff against prod on Fridays.


The lesson is to never attempt anything on a Friday afternoon whose recovery could take far more time than the change itself.


Or is the lesson to _always_ attempt such critical changes on a Friday? After all, in this instance the client didn't notice any problems, apparently because they were already off to their weekend.

For me personally, the much bigger issue would be harming the client, their business, or our relationship. Doing a few hours of overtime to fix my mistakes would probably only feel like well-deserved punishment...


One place I worked (some 20 years ago) had a policy that any time you run a sudo command, another person has to check the command before you hit enter. Could apply the same kind of policy/convention for anything in production.


I'm not sure this doesn't just lead to blind rubber-stamping unless this is done very very rarely


The trick is to have good access controls so confirmations happen often enough to be useful, but not so often to be rubber-stamped


I guess most of the common tasks are scripted / automated. Running a "raw" sudo command should be very, very rare.


That's not good advice IMO, as most sudo commands will mess up just one host, and that's something you should generally be prepared for. You're more likely to develop a culture where engineers think of hosts as critical resources, whereas they should generally be considered instances that can be thrown away. It's better to identify hosts that are SPOFs and be cautious on those only.

I can think of a larger blast radius when deleting files on a shared mount point for example but it's not representative to the regular use of sudo.


I have a rule when working on production databases: Always `start transaction` before doing any kind of update - and pay close attention to the # of rows affected.


If you use Postgres you can put

  \set AUTOCOMMIT off

in your .psqlrc, and then you can never forget the BEGIN TRANSACTION; every statement is already in a transaction. It's just the default behaviour to automatically commit each statement, for some ungodly reason.

Years ago I hired an experienced Oracle developer and put him to work right away on a SQL Server project. Oracle doesn't autocommit by default, and SQL Server does. You don't want to learn this when you type "rollback;". I took responsibility and we had all the data in an audit table and recovered quickly. I wonder if there are still people who call him "Rollback" though.


> it's just the default behaviour

That's good from the DBA perspective, but relying on that default as a user is risky in itself, when you deal with multiple hosts and not all are set up this way.


What strikes me as remarkable in all such stories is how, almost always, the person committing the mistake is a junior who never deserves the blame, and how cavalier the handoff/onboarding by the 'seniors' working on the projects is.

Having worked in enough of these, though, I am aware that even they (the "seniors") are seldom entirely responsible for all the issues. It's mostly business constraints that force the cutting of corners, and that ends up jeopardizing the business in the long run.


As I said on Slack the other day in response to a similar story, "If, on your first day, you can destroy the prod database, it's not your fault."

(One of my standard end-of-interview questions is "how easy is it for me to trash the production database?" Having done this previously[1] and had a few near misses, it's not something I want to do again.)

[1] In my defence, I was young and didn't know that /tmp on Solaris was special. Not until someone rebooted the box, anyway.


> /tmp on Solaris was special.

I’ve had a search but can’t work out why it’s special.


It gets wiped on reboot. I remember around 2007 on Gentoo Linux this behavior changed. I was using /tmp as pretty much a "my documents" type folder; I updated, and one day all my stuff was gone! I was flabbergasted. But yeah, it was reckless to store things in a folder that pretty much has "temp" in the name!


This is rude, but I'd like to reply to a comment you deleted in a separate thread.

"why didn't they have a hot-spare" They do! Flight spares are complete, flight-rated copies of spacecraft built for exactly this contingency: https://en.wikipedia.org/wiki/Flight_spare After launch the flight spares are used for terrain testing and troubleshooting. (The "mars yard" has flight spares for Curiosity and Perseverance https://www-robotics.jpl.nasa.gov/how-we-do-it/facilities/ma... which were used to test some wheels to destruction after Curiosity started showing some wear https://www.planetary.org/articles/08190630-curiosity-wheel-... )

The blog post lays it on a bit thick with the $500 million number and the "launch only two weeks away" given that the article itself is illustrated with a photo of the Sojourner flight spare. Spirit had the SSTB1 test rover. If he had actually blown out the entire electrical system, they could have launched it instead. Swapping out the entire vehicle right before launch would have been an awful job, but it's not flat out impossible.


Not rude at all! I appreciate the reply. Only reason I deleted my message was because right after posting, I scrolled down and saw someone asking the exact same question at the top level, so I felt like it was best to conserve effort and not repeat them.

I liked that other people pointed out that the risk could have been eliminated by using polarized connectors (I hope they started doing this after the incident), but it also made me wonder about "back-EMF" caused by solar flares. In other words, maybe all thick wires and ground/power planes should be hardened against current surges simply due to a solar event hitting Mars (which may incidentally cover the case of back-powering the driver circuits).


Thank you.

I have been burned by this in some version of Ubuntu and have assumed it was normal behaviour ever since.


A friend once had to remotely do an OS update of a banking system. Being cautious, he thought he'd back up some central files, just in case and went "mv libc.so old_libc.so". Had to call some guy in that town to throw in the Solaris CD on prem at 2:30 in the morning...


It's never this simple, and calling someone is probably still the right thing to do, but fixing stuff like this is what /sbin is for.


One way I mistake-proof things in SQL Management Studio is to have different colors for production vs test databases.

To do that, on the "connect to server" dialog, click "options". On the tab "connection properties" in the "connection" option group, check "use custom color". And I pick the reddest red there is. The bottom of the results window will have that color.

edit: my horrible foul-up was restoring a database to production. The "there is trouble" pagers were all Iridium pagers since they loved climbing mountains (where there was no cell service back then). But then that place didn't use source control, so it was disasters all the way down.


>The very moment after I hit the Enter key, I realized I had made a mistake

This brief moment in time has a name: an ohnosecond.

https://en.wiktionary.org/wiki/ohnosecond


Seems this is very typical; first-time launches usually lose some data.

We never hear about the first-time launch deploys that wipe ALL the data, because whoever is that unlucky probably never gets to browse Hacker News.


As a young consultant, I was once one Enter away from causing a disaster, but something stopped me. I still shudder even though it didn't actually happen. Nothing of the sort in many years since, so a great lesson in retrospect I guess.


We used to have another engineer watch over your shoulder when you do prod stuff; it can be very helpful.



I'd love to know the long-term physiological effects on the body of these events. I've had a few. Still feel shaky :)


Hope you bought that guy a beer.

Great story, thanks for sharing.


to be fair, it's a rite of passage to do something like this.

But you should definitely have bought that man a beer :)


Hats off to the backup guys.


Backup saving the day.


Nightmare fuel


I'm reminded of the phrase: if your intern deleted the production database, you don't have a bad intern; you have a bad process.

Whether this was a process problem or a human one we don't really get to judge, since we do expect more from an FTE.

I'll just say putting myself into his shoes made me tear up as I read the dread and pangs of pain upon realizing what happened - then to have life again after the failure of the ray of hope. That weight, I've never had a project that so many people depended on.

All heroes in my book.


At a major brokerage firm I accidentally hit prod with a testing script that did several millions of dollars of fake FX test trades.

The first thing mentioned in the post-mortem call was: "No one is going to blame the guy who did those trades. It was an honest mistake. What we are going to do is discuss why a developer can hit the production trading API without any authentication at all."


Were the trades any good though?


No, they were caught by our trading ops guys. A few minutes after I hit enter I got a rather chilling phone call from them. So that part of the system worked.


Plot twist: It made so much money that that’s now their strategy.


Back in school, my roommate's mom worked for a hedge fund and he did part-time work for them. He factored out a common trading engine from individual strategies, and one day the head of the fund asked him to run a strategy that had made a bunch of money in the past, but had been retired after failing to make money for a while. So, he put the strategy back in production without any testing, forgetting that he had recently done some minor refactoring of the trading engine. He typo'd one variable name for a similar variable name, so in the loop where it broke down large orders into small orders, it actually had an infinite loop. Luckily the engine had an internal throttle, so it wasn't trading as fast as it could send messages over the network.

I was chatting with him when he noticed the stock the strategy was trading (KLAC) was gradually declining linearly. He looked at the L2 quotes and saw that someone using his brokerage was repeatedly putting out small orders, and then he realized they were his orders.

The fund got a margin call and had to shift some funds between accounts to make margin, they had to contact regulators and inform them of the bug, and they had to manually trade their way out of the massive short position it had accumulated. However, they ended up making $60,000 that day off of his mistake.


This is such a cool story.


That's an excellent postmortem culture.


You should never blame the individual for organizational failures like this. I see two process issues:

1. The plug was allowed to be connected backwards. Either this should be impossible, or this hazard should be identified and more than one human should verify orientation.

2. In-use tools like multimeters should never be disconnected. At worst you get problems like this; at best you annoy whoever was using it.

Blaming individuals only gets them fired and weakens the entire organization. You just fired the one person who learned an expensive lesson.

The only time an individual should be blamed is when they intended harm, in which case the law could kick in.


You can't apply process thinking here, where the scenario is custom testing a unique probe, and you don't know what other constraints are in play (for example, the reason for the plug design). If NASA were sending these things to Mars by the dozen, then you can start to formalize things like test procedures and look for places mistakes can happen. But in this scenario, you're just disempowering your staff by not letting them choose the most effective and low-risk way to do one-off, highly specialized testing work.


I can't speak for NASA, but I can speak from my experience at ESA (European Space Agency), where I worked on Mars lander hardware. You have very, very formal procedures and detailed checks as soon as you approach any part which is going to fly.

The simplest task you can imagine takes incredible proportions (for good reasons).

Disconnect and reconnect that plug? Please inform persons X and Y, person Z must be present, only person W can touch that plug, and do perform a functional test according to the procedure in this document before and after and file these reports etc ...

Cleaning a part? Oh glob. Get ready for 3 months of adventure talking to planetary protection experts and book the cleanest room in the continent.


The Hacker News mic drop strikes again. I have nothing super substantive to add except to agree with your point, and to add that yes, it feels like work to put in the formal policies and procedures, but when the stakes are high enough (a rocket to Mars? it's high enough), even the work that doesn't intuitively feel 'worth it' to someone is DEFINITELY worth it.

"It's a waste of time" is very often a fallacy, especially when the risk cannot be easily undone.

I (mostly mentally) complete the phrase "It's a waste of time" with "what's the worst that could happen?", and when I'm actually saying the phrase out loud, stare at whoever said that for 5 full seconds.


Exactly :). The funny part is, the thing actually crashed! [1]

Why? Bad error handling in the software (primarily). What is the worst that could happen? An instrument saturates, a variable gets stuck at a value but keeps being integrated, and the spacecraft computes a negative altitude and thinks it's below ground level, while in fact it's in full descent, 3+ km above the surface. Oopsie!

[1] https://exploration.esa.int/web/mars/-/59176-exomars-2016-sc...


And we all know how reliable ESA landers are. The laughing stock of the industry.


Aerospace-grade connectors are specifically designed to support multiple keyings that prevent this kind of thing. It's definitely a problem preventable by careful design if the interface supports making this kind of mistake.


Can confirm. Source: I used to work for NASA, and I'm a private pilot. There are literally millions of electrical connections that get made on aircraft and spacecraft on a regular basis and I can't think of ever hearing of an incident caused by one of them being made backwards. (Now, mechanical connections getting made backwards is not unusual. That's why you check to make sure that the flight control surfaces move in the right direction as one of the last checklist items before you take off. Every. Single. Time.)


So how do you prevent them from grabbing the wrong break out box?

Like, say they have one that is set up to test the motor driver circuitry and another that is set up to test the motor?

Or say the breakout box intentionally has both sides of the connection on it, so that you can get in-between the driver and motor?


I can think of very few kinds of connectors for which this type of error is even possible. You would need two cable terminations which can connect to each other, for which either side can plug into the same jack.

So either the ends are literally the same (e.g. Anderson Powerpole), or there is some kind of weird symmetry or inadequate keying. Or maybe the two cables don't connect directly and instead go through some kind of interface? The latter is fairly common in networking, e.g. "feed-through" patch panels and keystone jacks and quite a few kinds of fiber optic connectors.

All of these seem like utterly terrible ideas in an application where you would take the thing apart after final assembly and where the person doing the disassembly or reassembly could possibly access the wrong side of the panel.


One guy in our workshop had to provide DC to a display with a round 4-pin connector. He soldered two neighboring pins to Gnd and the other two to Vcc. There were two chances to short the power supply, one to brick the display, and one to get it right. Guess what we had to replace until we found out.


A break out box could very sensibly have both sides of the connector on it and then have the various pins broken out into individual connections for flexibility.

In that case keying or whatever isn't going to prevent you from connecting to the wrong side, because both sides are present.


Looks like the author tried to double-power the motor with both the spacecraft motor driver and the breakout box that MITMs the driver and the motor. In such an event, the free-wheeling diode in the driver will allow reverse current to be fed back to the driver's power supply, up to a certain amount. This will absorb back-EMF, or energy from "regen" from the motor.

I'm suspecting the breakout wasn't literally sitting between the driver and the motor, but rather all internal connections are broken out to the box for testing; and likely the author's mistake was to not mess with the spacecraft to temporarily disconnect the driver.

But I'm not sure I would have "just" made the right call, and done so nonchalantly, on a Mars rover due to launch in a few weeks.


> flight control surfaces move in the right direction

How … how often does that go wrong?!


There have been several cases of the landing gear up/down lever getting wired backwards during maintenance. Not to worry, the gear has a 'squat switch' sensor that prevents the gear from being raised when the plane is on the ground. Unless you taxi over a bump and the switch decides it's now airborne. Crunch.


It depends on what you mean by "that". Getting control surfaces actually reversed is not very common, but it does happen, typically after maintenance when a mechanic inadvertently re-connects a control cable backwards.

Control cables also can and do break, but that too is fairly rare.

What is not rare is control mechanisms jamming. Here is an example:

https://www.ntsb.gov/news/press-releases/Pages/NR20230928A.a...


How not to check your flight surfaces: Air Astana 1388.


From the Wikipedia article of that flight:

"The incident was featured in season 23, episode 5 of the Canadian documentary series Mayday . . ." [1]

Season 23 - I'm glad I don't fly!

1. https://en.wikipedia.org/wiki/Air_Astana_Flight_1388


This is also the case for medical gas connectors in operating rooms, at least in Europe.


The one process element that can be controlled, and which jumped out from the first paragraphs, is not letting people touch billion-dollar equipment at the tail end of a 12-hour shift.

If you are putting people in a situation with absolutely no safeguards, you can’t have them go into it fatigued.

I’m guessing the people working on that team also weren’t getting great sleep by the discussion of high stress and long hours. Recipe for disaster.


> only gets them fired

Agree on the blame point, but not on the firing point. As a manager, sometimes you need to fire people; that's a necessary part of your job. And no, changing the hiring process cannot prevent that.


Firing people for incidental mistakes instead of overall bad performance is pretty shitty management.


For one incidental mistake, of course not. For repeated inattention (like plugging the Mars rover's cables in wrongly several times) at an attention-demanding job -- yes.


So you'd not blame them for their simple mistake, but still fire them for it?


Firing somebody for a simple mistake with grave consequences doesn't make your organization stronger. There will be plenty of better examples to make.


Anecdote:

At my first real job as a web dev after school, I crashed the production website on my very first day. Tens of thousands of visitors were affected, and all our sales leads stopped.

Thankfully, we were able to bring it back up within a few minutes, but it was still a harrowing ordeal. The entire team (and the CEO in the next room) was watching. It ended up fine and we laughed about it after some minor hazing :)

But by the time I left that job a couple years later, we had turned that fragile, unstable website into something with automatic testing, multiple levels of backups and failover systems across multiple data centers, along with detailed training and on-boarding for new devs. (This was in the early days of AWS, and production websites weren't just a one click deploy yet.)

That one experience led to me learning proper version control, dev environments, redis, sharding and clustering, VMs, Postgres and MySQL replication, wiki, monit, DNS, load balancers, reverse proxies, etc. All because I was so scared of ever crashing the website again.

That small company took a chance on me, a high school dropout with some WordPress experience, and paid me $15/hour to run their production website, lol. But they didn't fire me after I screwed up, and gave me the freedom and trust to learn on the job and improve their systems. I'm forever grateful to them!


Not in this case. It's a one-off very custom-built rover, the first of its name. There's already all kinds of processes established, but no one can foresee everything. Yes, they probably fixed the process after that, but remember that it was their first time.

PS: Also, more rules and better processes are not necessarily a good thing. Sometimes there is just too much red tape and bureaucracy, which makes the already super-slow NASA even slower. In those first-of-their-kind missions you sometimes need to take risks and depend on people, not processes.


I can't help but think about what would have happened if the rover had indeed been destroyed, though. It seems the only thing that stopped that was sheer luck: they could (I guess) just as easily have connected to another wrong lead, one without the protection required to survive the charge. That is, it was outside the author's actual ability to have stopped it, and he could just as easily have been the destroyer of the rover and forever remembered for that fact, as he feared he would be.


The fact that they were still being made to work after completing a 12-hour shift (which is already too long to be safe) means this was a process error.


This really resonates with my experience. Working at a major airline, I was the one who would pick the most difficult and risky projects. One was a quick implementation of a new payment provider for their website. That website sold millions of euros worth of tickets every day.

Seconds after deployment, it turned out that I had failed to recognize the differences between the test and live environments, as one of the crucial variables was blank in production. I could have anticipated this if I had spent more time preparing and reading documentation. Sales died completely, and my heart sank.

After a lengthy rollback procedure that resulted in a few hours without sales, a massive surge of angry customers, and a loss of several million euros, I approached the CEO of the company. I still remember catching him in an elevator. I explained that the incident was all my fault and that I had failed to properly analyse the environment. I assured him that I was ready to bear the full consequences, including being fired. He burst into laughter and said something like this: "Why would I want to get rid of you? You made a mistake that you'll never make again. You are better at your job than you were yesterday!"

This experience was formative for me on many levels, including what true leadership looks like. I have successfully completed many high-risk projects since then.


The language is just so anodyne, and there's just that bit of implausible detail in the story (approaching the CEO yourself when you're the one who fucked up, and the parent claiming to be a "top performer" and "I made my company lose millions" at the same time), that it makes me think this comment was written by an LLM, or is at least a fabrication.


The suspicious part for me would be the CEO laughing like it was nothing. Also, yes, one would expect it to go the other way around: when you mess up big, someone comes to you. But the world is big, and maybe it happened like this.


Surely this is a variation of this anecdote attributed to IBM's Watson

https://news.ycombinator.com/item?id=13419313

"> A young executive had made some bad decisions that cost the company several million dollars. He was summoned to Watson’s office, fully expecting to be dismissed. As he entered the office, the young executive said, “I suppose after that set of mistakes you will want to fire me.” Watson was said to have replied,

> “Not at all, young man, we have just spent a couple of million dollars educating you.” [1]"


A variant is in From the Earth to the Moon, where a junior engineer at Grumman confesses messing up vital calculations for the Lunar Lander to his boss, and finishes with "So… I guess I'll go clean out my desk." "What for?" "I figure you're gonna fire me now."

The boss's response makes a lot more sense than the usual fluff, though: "If I fire you now, the next guy to make a mistake won't admit it and we won't find out about it until it's too late."


I wonder how many of those stories are wishful thinking about how it should work, rather than how it does work when a major screw-up has happened and some heads need to roll for the sake of it.


I wonder how the boss would explain it to his bosses/shareholders. Was that a fully known possible outcome that merely surfaced by chance and was subsequently handled without issues under his leadership, or...?


Apollo was very much pushing the envelope of bleeding-edge technology. While the bosses were probably not too happy, it was far from the only occurrence, and it didn't threaten the contract.


Thanks, I knew I had heard that story somewhere before. (Though I would not rule out that recent CEOs have heard it and learned from that episode as well.)


In the context of “suspected AI” I at first thought you meant a different Watson!


Whether the story is real or not aside, why would you not laugh it off? At that stage nothing can be changed; the money was lost and the bug was fixed. You can only look forward and plan for the future, and the guy is going to be paranoid in future deployments to make sure not to fuck up again.


In general yes, and people with enough Zen can do this. But if the CEO is also looking ahead to explaining the incident to the board and the investors, he might not be in the mood to laugh.


Has it been edited? I don't see "top performer".


Also, airlines don't sell millions of dollars in tickets every day.


The quote was:

> That website sold millions of euros worth of tickets every day.

The claim wasn't that a single airline sold a million dollars per day, but that a third-party on-seller sold a million euros worth of tickets a day.

Is that plausible?

Consider The City in the Sky:

    Every day 100,000 flights criss-cross the globe with more than 1 million people in the air at any one time. Dallas Campbell and Dr Hannah Fry explore the world of aviation.
https://www.imdb.com/title/tt5820022/

At any instant there are one million people aloft.

At any instant there's at least 50 million dollars worth of ticket sales in play; how much over a 24-hour day would you estimate?

Is it possible for a single third party seller to capture a million euro per day?


A CEO at the office and not on the golf course... that gives it away.


That story has been repeated in some form for as long as I was alive.


You worried about that?

I'm a frequent flyer and I got a feeling that most airline ticket booking pages are broken in some way more than half the time. Maybe not often broken to the point that they're blank, but definitely broken to the point that booking a ticket isn't possible (I prefer blank, so that I don't waste like 30 minutes on not being able to book a ticket).

Also most of the internet seems often broken. Oh hello Nike webshop errors upon payment (on Black Friday) for which helpdesk's solution is: just use the App.


Hell, I used to worry about down time for my tiny blog. Didn't want to let down my readers.

Everything can be a guilt trip if you try hard enough.

Then I met a guy, now a good friend, that made me do my first "pull the plug migration" on his most important website. He lived on this.

I looked at the site going down, horrified. He mocked me, then proceeded with the update. It didn't work. The site stayed offline for hours.

Then it worked again. And nobody cared. It had zero consequences on traffic.

Users were pissed off for a few hours, and life goes on.


What always got me is that, for at least the first several years, Google couldn't get their store page to handle the load when they were releasing a new phone and it'd be crapping out for days.


Steam still craps out during large sales. I really wonder how Valve calculates that it's fine to keep losing out on (cumulatively) hours of sales each time.


Whoa, and I always wonder why it's only me who seems to have to use the developer tools to enable that stupid submit button when I've filled out every field on the page correctly, shaking my head and wondering how normal people use the internet. I keep thinking it's got to be something about using Firefox instead of a big-tech browser, or my mouse-gestures extension, I don't know, but normal webshops are broken so often it's insane. Thanks for sharing that it's not just me!


I think the loss may not have been as much as you think; sure, nobody could buy tickets for a few hours, so theoretically the company lost millions of revenue during that time. But that assumes people wouldn't just try again later. Downtime does not, in practice, translate to losses I think.

I mean, look at Twitter, which was famously down all the time back when it first launched due to its popularity and architecture. Did it mean people just stopped using Twitter? Some might have, but the vast majority and then some didn't.

Downtime isn't catastrophic or company-ending for online services. It may be for things in space or high-frequency trading software bankrupting the company, but that's why they have stricter checks and balances - in theory, in practice they're worse than most people's shitty CRUD webservices that were built with best practices learned from the space/HFT industries.


Even with HFT you’d have to have more than 50% of your trades go against you to lose any money, and you’ll probably have hedges, and losing some % of money will be within normal operation parameters. Shit happens! Links go down, hardware fails, bugs slip through no matter how diligent you are. (No I’m not looking to be hired by any HFT shops)


Coming from an airline boss, I would really have hoped the response would be more in line with the ethos of a plane-crash postmortem, i.e. find the systemic causes and fix those. Maybe you need a copilot when doing live deployments, and that copilot has the authority to stop the rollout. Along with the usual devops guards.


This reminds me of that old joke that ends "Why would I fire you? We just spent millions training you!".

People who take on high-risk projects are underappreciated. But many managers prefer employees who can reliably deliver zero value over those with positive expected value but non-zero variance.


That story sounds so much like that joke that I'm wondering if there is some urban legend thing going on here.


Wow chatgpt is actually getting worse.


There must be more people like you at those major airlines, as those sites go down all the damn time. 6 hours..??? The Lufthansa desktop site didn't allow anyone to book anything for like 3 weeks straight; you had to use the app instead.


Your company should definitely have had a production-identical staging environment if an hour of downtime means millions lost :D

Such an environment would be an obvious investment that pays for itself. I'm in banking, and terrified of making even a slightly complex deployment without validating it in production first. (Complex here meaning that it might depend not just on code changes, but also on the environment.)


And then everybody clapped.


i've seen this comment almost verbatim somewhere lol (when the gitlab dev erased their DB a few years ago)


I work in TV. During my first job at a small market station 30 years ago, I was training to be a tape operator for a newscast. All the tapes for the show were in a giant stack. There were four playback VTRs. My job was to load each tape and cue it to a 1-second preroll. When a tape played and it was time to eject that tape, it was _very_ easy to lose your place and hit the eject button on the VTR that was currently being played on the air instead of the one that they just finished with.

The fella who was training me did something very annoying, but it was effective: every time I went to hit the eject button, he would make a loud cautionary sound by sucking air through his closed teeth as if to tell me I was about to make a terrible mistake. I would hesitate, double check and triple check to make sure it was the right VTR, and then I would eject the tape. He made that sound every single time my finger went for the eject button. It really got on my nerves, but it was a very good way to condition me to be cautious.

Our station had a policy: the first time you eject a tape on the air got you a day off without pay; the second time put you on probation; the third time was dismissal. I had several co-workers lose their jobs and wreck the newscast due to their chronic carelessness. Thanks to my annoying trainer, I learned to check, check again, and check again. I never ejected a tape on the air. It certainly would not have been a half-billion dollar mistake if I had, but at that point in my career it would have felt like it to me.


That explains the old blooper reels that were popular on TV in the early 80's, where the reporter would be talking about something, and get video of something completely bonkers in the background instead.


Rolling the wrong tape still happens frequently enough on modern live broadcasts.


I agree that the person who made such a mistake will be the person who never makes that mistake again. That's why firing someone who has slipped up (in a technical way) and is clearly mortified is typically a bad move.

However, I don't agree that this is the "real" lesson.

Given the costs at play and the risk presented, the lesson is that if you have components that are tested with a big surge of power, give them custom test connectors that are incompatible with components that are liable to go up in smoke. That's the lesson. This isn't a little breadboard project they're dealing with, it's a vast project built by countless people in a government agency that has a reputation for formal procedures that are the source of great time, expense, and in some cases ridicule.

The "trust the 28 year old with the $500m robot that can go boom if they slip up" logic seems very peculiar.


Well, it's true that it should be designed such that they cannot be plugged incorrectly. I would imagine it is indeed mostly designed in that way, but there can still be erroneous configurations that were not accounted for at the design stage.

Especially during testing, you're often dealing with custom cables, connectors, and circuits that are different from the "normal configuration".

I would say that the lesson is to do as many critical operations as possible under the 4-eye principle: someone is doing the thing, someone else is checking each step before continuing. Very effective for catching "stupid mistakes" like the one in the article. But again, it is not always possible to have two people looking at one test, especially with timeline pressure etc. So mistakes like these do happen in the real world. You have to make the whole system robust.


> Well, it's true that it should be designed such that they cannot be plugged incorrectly

I agree with you, but on Earth this is easy. For spacecraft I imagine you can't just use any connector from Digikey

> especially with timeline pressure etc.

If timeline pressure, lost sleep, or rushing jobs not meant to be rushed causes a catastrophic technical error to be made, it is 100% the fault of the person who imposed the timeline, whether that be some middle manager, vice president, board, investor, or whoever. Emphatically NOT the engineer who did the work, if they do good work when not under time pressure.

HOLD PEOPLE LIABLE for rushing engineers and technicians to do jobs that require patience and time to do right.


I agree that individuals shouldn't be held responsible for mistakes like this.

However, you can't always eliminate timeline pressure. Even if the project is planned and executed perfectly, there will almost always be unknown unknowns encountered along the way that can push your timeline back. As is the case with sending things to Mars there is a window every two years. That's a very real, non-fictitious deadline that can't be worked around.


> As is the case with sending things to Mars there is a window every two years.

This is very simple to deal with.

(a) If it's unmanned, rush and launch on-time but don't fault the engineer for mistakes made by rushing. If it doesn't work everyone accept that as a consequence of rushing.

(b) If it's manned, wait until the next launch window and prioritize safety. Period.


That was my first thought as well.

On the other hand, it's hard to make these kinds of judgment calls when you're talking about a one-off piece of equipment that's only going to go through this particular testing cycle a single time.

In computing, there are a lot of similar "one-off" operations: something you do to the prod database or router config a single time as part of an upgrade or migration.

Sometimes building a safeguard is more effort than just paying attention in the first place. And while we don't always perfectly pay attention, we also don't always perfectly build safeguards, and wind up making similar mistakes because we're trusting the faulty safeguard.

In circumstances like the one in the story, the best approach might almost be the hardware equivalent of pair programming -- the author should have had a partner solely responsible for verifying everything he did was correct. (Not just an assistant like Mary who's helping, where they're splitting responsibilities -- no, somebody whose sole job is to follow along and verify.)


"One-off" is never just a one-off; it's always part of a class of activity, such as server migrations. Just paying attention guarantees eventual failure when repeated enough times.

This may be acceptable, but it comes down to managing risks. If failure means the company dies, then taking a 1-in-10,000 risk to save 3 hours of work probably isn't worth it. If failure means an extra 100 hours of work and 10k in lost revenue, then sure, take that 1-in-10,000 risk; it's a reasonable trade-off.


On careful reading, the power was sent into the power out-leads of an H-bridge, which is a tough piece of electronics, and in the end nothing was damaged; the shutdown was unrelated. If it had been sent to the data line of the motor controller, it probably would have poofed something. We can't rule out that there were different connector types, but the two mistaken connectors were correctly assigned the same type.


When men first landed on the Moon, the average age of the engineers at Mission Control was 28 years old.


I work in this industry and let me explain how this happens. Despite being such a costly project, you can’t really hard-require unique connectors everywhere because of all of the competing requirements. Actually, connectors in particular tend to have a lot of conservative requirements such as being previously qualified, certain deratings, pin spacing, grounded back shells, etc. At the end of the day there’s only a handful of connector series used and stocked and it’s not feasible (at any cost really) to have no matching connectors whatsoever. Of course, you would normally try and make connectors either standardized with the same signals, or unique with no overlap in between.

I don’t know the details in this case but it could be like this: socket-type connectors are required on external connectors on the spacecraft (to prevent shorts when handling), with a harness in between which will never be removed. The harness would be symmetrical with pin-type connectors.

At some point it is decided a breakout box is required for testing and now you have created an opportunity to plug the breakout box in backwards.

Or the breakout box has a 100-pin connector on one side and needs to connect to 25 pieces of test equipment on the other side. You probably don't have 25 different connectors to choose from, nor can you possibly demand custom requirements for every piece of test equipment.

Spacecraft are moving more towards local microcontrollers with local diagnostics so this kind of test equipment for every possible analogue signal is decreasing. In the case of motors, they would more likely be brushless now and you would rely on telemetry from motor drivers during both testing and flight instead of having this type of breakout box.

Connectors in aerospace are also following other industries and becoming more configurable at order time, including adding keys so you can have 10x “the same” connector but keyed so they only plug in one place. But it’s still not practical to demand all test equipment is configured like this.


| The "trust the 28 year old with the $500m robot that can go boom if they slip up" logic seems very peculiar.

Not just that, but to create a situation whereby said person is working unofficial double shifts to get it done, so probably aren't going to be bringing their best selves into the office. If it were my $500 million I wouldn't even care about the name of this guy but would want to have some very robust discussions with the head of their department. Also, "some mistakes feel worse than death" - I get it, but c'mon, it's not like someone actually did die, which is a sadly unfortunate reality of other much less spectacular and blog-worthy mistakes.


> The "trust the 28 year old

(Same for an 82 year old or any other number..)


I'll add my story here for posterity:

My first job out of university, I was working for a content marketing startup whose tech stack involved PHP and PerconaDB (MySQL). I was relatively inexperienced with PHP but had the false confidence of a new grad (I didn't get a job for 6 months after graduating, so I was desperate to impress).

I was tasked with updating some feature flags that would turn on a new feature for all clients, except for those that explicitly wanted it off. These flags were stored in the database as integers (specifically values 4 and 5) in an array (as a string).

I decided to use the PHP function [array_reverse](https://www.php.net/manual/en/function.array-reverse.php) to achieve the necessary goal. However, what I didn't know (and didn't read up on in the documentation) is that, without the 2nd argument, it only reverses the values, not the keys. This corrupted the database with the exact opposite of what was needed (somehow this went through QA just fine).

I found out about this hours later (used to commute ~3 hrs each way) and by that time, the senior leadership was involved (small startup). It was an easy fix - just a reverse script - but it highlighted many issues (QA, DB Backups not working etc.)

I distinctly remember (and appreciate) that the lead architect came up to me the next day and told me that it was a rite of passage of working with PHP, a mistake that he too had made early in his career.

I ended up being fired (I grew as an engineer and was better off for it), but in that moment, and for weeks after, it definitely demoralized me.


They fired you for that array_reverse mistake?


Yep, I was put on a PIP with impossible success criteria (no issues raised in PRs by senior engineers and no issues in code deployed to production, even if it was reviewed by senior engineers & QA) and fired (for failing those criteria) in 2 weeks.

I worked there for ~8 months in total.


> no issues raised in PRs by senior engineers

Wat? Like serious issues, or minor things that can be improved? Because it's very rare in my place of work that there are no comments on a 'PR'. Something can always be improved.


+1 on this, every place or project I have touched has a backlog of tens if not hundreds of nice to haves but never enough time to touch them, and some of them are really not complicated.


The impossible PIP trick for dismissal is something I’d love to see get eventually legally obliterated.


Constructive dismissal is illegal in many countries. It's the choice of the people which system they want to work in.


It’s illegal where I am, but employers are extremely capable of abusing “performance improvement plans” as a way to constructively dismiss people - knowing most people won’t have the wherewithal to fight it in court.


Sometimes the only winning move is not to play.


The story is compellingly written, but I thought it was also confusing.

It sounds as if this team made several mistakes, not just one mistake. It's also not clear if the result of these mistakes was that there might be real damage to the spacecraft, or if the result was just wasted time and hours of confusion about why the spacecraft wouldn't start up.

The first mistake is they didn't realize that the multimeter was not only measuring, but it was also completing the circuit.

That sounds like a real bad idea. But if it was totally necessary to arrange it like that, then that multimeter should never have been touched.

That's not just one guy's error. It's at least two guys at fault, along with whoever is managing them, and whoever is in charge of the system that allows it.

The second mistake is with the break-out box. They think he misdirected power into the spacecraft. Then they jump to the conclusion that this generated a power surge which damaged the spacecraft, because it won't start up.

But they're not sure where the power surge went and what might be damaged. Anyhow they're wrong.

The reason the spacecraft won't start up is just because he took the multimeter out of the circuit before the accident.

I'm still sort of confused about what happened or if they ever really figured out what happened.

He said "Weeks of analyses followed on the RAT-Revolve motor H-bridge channel leading to detailed discussions of possible thin-film demetallization".

Does this mean that they decided that the misdirected power surge might have flowed into the RAT-Revolve motor H-bridge channel and damaged that?


You forgot: the telemetry guy (Leo) didn't mention they had lost telemetry before the storyteller told him he had made a mistake. I mean, shouldn't they have cancelled all testing until they got it back?


It started up fine. The multimeter was connecting up the telemetry, so they weren't getting any information from it until they restored that circuit.

The power absolutely did feed into that circuit; they were trying to decide if it would have damaged it (but a motor driver is going to be able to handle power coming from the motor, so they decided that it probably didn't).


That is my reading too. But why was the multimeter connecting up telemetry? That seems very strange to me.


According to the article, it was monitoring bus voltage, but I could imagine it was used to measure the current being used by the telemetry system. So if it was used as an ammeter, it would've been placed in series.


I assume it was wired up in series to measure current.


> It started up fine.

Thanks. I understand that better now. The spacecraft did start up, but it seemed as if it could be badly damaged because they were not receiving any telemetry data


I am a Mechanical/Aerospace engineer.... I wish my scariest stories 'only' involved a potential bricking of a main computer on an unmanned $500M rover.

No... I was the senior safety-crit signoff on things carrying human lives. I had to look over pictures of parts broken from a crash and have the potential feeling of 'what-if that's my calculation gone wrong'. My joint that slipped. My inappropriate test procedure involving accelerated fatigue life prediction, or stress corrosion cracking. My rushing of putting parts into production processes that didn't catch something before it went out the door.

It's interesting to read people's failure stories from similar fields but, to me, the ones that people so openly write about and get shared here on HN always come across as... well, workplace induced PTSD is not a competition. It's just therapy bills for some of us more than for others.


That reminds me of a fiction-quote by one of my favorite authors, where a welding-instructor has just finished sharing a (somewhat literal) post-mortem anecdote of falsified safety inspections.

> He gathered his breath. “This is the most important thing I will ever say to you. The human mind is the ultimate testing device. You can take all the notes you want on the technical data, anything you forget you can look up again, but this must be engraved on your hearts in letters of fire.

> “There is nothing, nothing, nothing more important to me in the men and women I train than their absolute personal integrity. Whether you function as welders or inspectors, the laws of physics are implacable lie-detectors. You may fool men. You will never fool the metal. That’s all.”

> He let his breath out, and regained his good humor, looking around. The quaddie students were taking it with proper seriousness, good, no class cut-ups making sick jokes in the back row. In fact, they were looking rather shocked, staring at him with terrified awe.

-- Falling Free by Lois McMaster Bujold


Aside: I recognised 'quaddies' from your quote ... there was some very distinctive cover art on the Analog magazine for that story: https://www.abebooks.co.uk/Analog-Science-Fact-Fiction-Febru...


Regarding your last paragraph, I thought the same... When the author wrote:

> I'm instantly transported back to that moment — the room, the lighting, the chair I was in, the table, the pit in my stomach, ...

I couldn't help but think "that sounds like a trauma reaction". Good on them to be able to use that energy to do better! But also not everyone reacts the same way to trauma, nor is it easy to compare such reactions to trauma (for example, as a hiring question). I feel there are too many social variables at play.


This was the triggering phrase in the original article for me, yep.

I also want to clarify that I've worked in both aerospace and automotive, and the mention of the word 'crash' in my above comment was referring to work I did in automotive, lest someone tries to start wondering 'which one' with regards to an airframe.

For me, it was the reaction to the stress of having to make sure I was delivering... and the idea that those things are out there. I mean... put it this way: I've worked on enough vehicles that a majority of HN readers will have ridden in something utilizing math that I did, or parts that I specified, drew, and released, on a road, at least once in the last 15 years.

I once had potential employers ask that 'how would you respond to this kind of stressful situation' question before, and I've actually had difficulty getting my answers across because the really stressful shit I can't even talk about without potentially triggering just a horrible social reaction. Or panic attacks. Or potential legal issues.


> I had to look over pictures of parts broken from a crash and have the potential feeling of 'what-if that's my calculation gone wrong'.

Does it inevitably come down to that for someone? I mean, even if it's a detail that a procedure couldn't have caught, someone is responsible for forming good procedures. I suppose there could be several factors. But it seems like ultimately someone is going to be pretty directly responsible.

Just interesting to think about in the context of software engineering and kinda even society at large where an individual’s mistakes tend to get attributed to the group.


"But it seems like ultimately someone is going to be pretty directly responsible."

Or many people, or no one directly. Space missions come with calculated risk. So someone calculates that the risk of this critical part breaking is 0.5%, then someone higher up says that is acceptable and all move on - and then this part indeed breaks and people die.

Who is to blame, when the calculation was indeed correct, but a 0.5% chance can still come up (and 0.5% would be a lot)? And economic pressures are real, just like the limits of physics.

See Murphy's Law: "Anything that can go wrong will go wrong." (Eventually, if done again and again.)

https://en.m.wikipedia.org/wiki/Murphy's_law

Astronauts know there is a risk with every mission; so do the engineers, and so does management. Still, I cannot imagine why anyone thought it was an acceptable risk to use a 100% oxygen atmosphere with Apollo 1, where 3 astronauts died in a fire. But that incident indeed changed a lot regarding safety procedures and thinking about safety. Still, some risks remain and you have to live with that.

I am quite happy though, that in my line of work, the worst that can happen is a browser crash.


They had reasons. See: https://en.wikipedia.org/wiki/Apollo_1#Choice_of_pure_oxygen...

Even after the fire, the Apollo spacecraft still used 100% oxygen when in space. The cabin was 60% oxygen / 40% nitrogen at 14.7 psi at launch, reducing to 5 psi on ascent by venting, with the nitrogen then being purged and replaced with 100% oxygen.

> See Murpheys Law...

Indeed. I hope that was a joke.


> Still, I cannot imagine why anyone thought it was an acceptable risk, to use a 100% oxygen atmosphere with Apollo 1

Especially when, in prior experience there, asbestos had caught fire in the same situation (O2, low pressure).


Wow, I did not know this detail yet. It was just a reckless rush to the moon at the time, no matter the cost. Without the deaths, nothing would probably have changed.


I read this in a book some years after the Apollo 1 (i.e. Apollo 204) fire, so I have no reason to doubt its provenance.


For what it's worth, I strongly disagree -- the group as a whole (and especially its leadership) is responsible for the policies they decide to institute, and the incentives they allow to exist. For example, in this article's story the author is apparently working >80 hour weeks directly manipulating the $500M spacecraft two weeks before it launches. Do we really think they are "directly responsible" for the described mistake? I think a root cause analysis that placed responsibility on any individual's actions would simply be incorrect -- and worse, would be entirely unconstructive at actually preventing reoccurrence of similar accidents.

I think this is furthermore almost always true of RCAs, which is why blameless post-mortems exist. It's not just to avoid hurting someone's feelings.


I’d be interested in talking to the team at Boeing.


I sincerely hope there are more people like you in the aerospace industry and less like those who conceived, implemented and signed off the 737 MAX's MCAS at Boeing...


I'm surprised that group therapy for engineers doesn't exist, or maybe I just can't find it. I don't work in anything high-stakes myself, though I have often been an ear for those who do in aviation or rail.

I believe that it can be quite hard sometimes if you have empathy and don't take the "Once the rockets are up, who cares where they come down? That's not my department" approach.

Perhaps some peer support group (possibly facilitated) for people that build safety critical systems or deal with the fallout. Not all companies will provide good counseling etc.

Perhaps the engineering boards / chartered engineer organizations should provide this and fund it from their membership fee, though that would probably scare people off going to the service as they could be afraid of losing their stamp / chartered engineer status / license.

Perhaps this would be dealt with in the past by getting drunk with colleagues in the pub, though alcoholism (or being impaired at work the next morning) is bad and pubs etc. are less popular now.


Profession induced trauma is finally getting taken seriously in the medical setting, another high stakes field.

Group therapy for doctors and nurses is finally becoming a thing, but unfortunately it is completely dependent on being employed by an organization that cares about it.


And the engineers designing safety mechanisms for nuclear weapons probably think you have it easy.


Yep, that's fair!

But my understanding is the default behaviour of most nuclear weapons (other than Hiroshima-style ones) is "blows itself to pieces without detonating the nuclear part", rather than "vapourises everyone within a mile".

Everything needs to go right for a nuclear weapon to actually blow up with a significant yield.


Yeah, luckily they seem to have a pretty well sorted failure mode.


Reminds me of the NOAA-N Prime satellite that fell over, because there weren't enough bolts holding it to the test stand.

The root cause, and someone correct me if this is not accurate, was that the X-ray-tested bolts to hold it down were so expensive that they had been "borrowed" for use on another project and not returned, so that when the time came to flip the satellite into a horizontal position, it fell to the floor. Repairs cost $135M.

https://en.m.wikipedia.org/wiki/NOAA-19


And so when "Check for bolts" is added to the flip procedure errata/addendum, is light sarcasm called for?


Interesting to see that the worry could have been avoided if they had lined up their timelines better in the first place. If they'd compared the timestamp on the test readout to the last timestamp from the telemetry system, they'd have seen that the telemetry failed BEFORE the test was executed. Partially caused by using imprecise language "we seem to have lost all spacecraft telemetry just a bit ago" rather than an accurate timestamp.

A cautionary lesson in properly checking how exactly events are connected during an incident. Easy to look at two separate signals and assume they must be causal in a particular direction, when in reality it is the other way around.
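To make that concrete, here is a toy sketch of the ordering check in Go (the event names and timestamps are made up purely for illustration; the point is only that you compare timestamps before assuming a causal direction):

    package main

    import (
        "fmt"
        "time"
    )

    // event is a hypothetical log entry: what happened, and when.
    type event struct {
        name string
        at   time.Time
    }

    func main() {
        // Invented timestamps; only the ordering check matters.
        telemetryLost := event{"last telemetry frame received", time.Date(2003, 5, 28, 21, 3, 12, 0, time.UTC)}
        testStarted := event{"RAT motor test commanded", time.Date(2003, 5, 28, 21, 7, 45, 0, time.UTC)}

        // Before assuming "the test broke telemetry", check which came first.
        if telemetryLost.at.Before(testStarted.at) {
            fmt.Printf("%q precedes %q by %v: the test cannot have caused the loss\n",
                telemetryLost.name, testStarted.name, testStarted.at.Sub(telemetryLost.at))
        } else {
            fmt.Println("telemetry was lost after the test started; causality is at least plausible")
        }
    }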


It's an interesting story, but the author may be overselling it. It's not a _failure_ story, nor was it a $500M mistake. I get that it was really stressful and the mistake could have cost him the job, but it didn't; it also didn't cost NASA anything other than a few hours of work (which, during testing, I would guess is expected).

When I'm asked to share failures, I'm usually not thinking about "that one time when I almost screwed up but everything was fine", instead, I'm thinking of when I actually did damage to the business and had to fix it somehow.


“My quasi-$500M Mars rover mistake” just doesn’t have quite the same ring to it.

But, to your point, it is still a failure story that _could_ have led to a much worse outcome than it did. The fact that it didn't was mostly due to luck.


One thing about long aerospace missions like this with huge lead times that always gets me - you can spend years of your life working on a mission, only for it all to fail with potentially years until you can try again.

This is a refreshingly humanizing article, but is also one written from the perspective of a survivor. Imagine if the rover were actually lost. I asked the question "what would you do if the mission failed after all of this work? How could you cope?" to the folks at (now bankrupt) Masten Aerospace during a job interview, and maybe it was a bad time to ask such a question, but I didn't get the sense they knew either. "The best thing we can do is learn from failure," one of them told me. An excellent thing to do, but not exactly what I asked. This to me stands out as the defining personal risk of caring about your job and working in aerospace. Get too invested, and you may literally see your life's work go up in flames.


> you may literally see your life's work go up in flames.

Incidentally, this happened to Lewicki a few years later when Planetary Resources' first satellite blew up on an Antares rocket: https://www.geekwire.com/2014/rocket-carrying-planetary-reso...


Did they have a narrow launch window they couldn't afford to miss? I'm not talking about missions where you eat a big monetary loss on the launchpad and try again, I mean missions which rely on planetary alignments that may not happen again for years, or even the rest of your life, such as Voyager. Or even just missions where you launch successfully, but then after months (or years) of flight time the spacecraft is lost.


> "The best thing we can do is learn from failure," one of them told me.

I would argue that if we don't change the process to prevent this kind of catastrophic failure mode then we really haven't learned from the failure.


> And I still remember the shock when Project Manager Pete delivered the decision and the follow-on news: ‘These tests will continue. And Chris will continue to lead them as we have paid for his education. He’s the last person on Earth who would make this mistake again.’

I wonder whether Pete had followed this 1989 general aviation/accident analysis story:

> When he returned to the airfield Bob Hoover walked over to the man who had nearly caused his death and, according to the California Fullerton News-Tribune, said: "There isn’t a man alive who hasn’t made a mistake. But I’m positive you’ll never make this mistake again. That’s why I want to make sure that you’re the only one to refuel my plane tomorrow. I won’t let anyone else on the field touch it."

-- https://www.squawkpoint.com/2014/01/criticism/

(The incident above led to the creation and eventual mandated use of a new safety nozzle for refueling, which seems like a better long-term solution than having the people who've nearly killed you nearby to fuel your plane indefinitely: https://en.wikipedia.org/wiki/Bob_Hoover#Hoover_nozzle_and_H...)


If there is the possibility of making a mistake, somebody will certainly make it. You expect all the humans involved to be competent. But relying on that competence is a mistake. The emotional stress of dealing with such enormous responsibilities, the often long work hours and the long list of procedures will make any competent professional inadvertently slip up at some point.

In case of electrical connectors, the connectors are often grouped together in such a way as to avoid making wrong connections. Connectors with different sizes, keying, gender, etc are chosen to make this happen. This precaution is taken at design time. JPL is extremely experienced in these matters. There is probably something else left unsaid, that led to this mistake being possible.

Meanwhile, motor controllers built around H-bridges are never boring. I once saw a motor control fail so spectacularly that we were scratching our heads for days afterwards. As always, a failure is never due to a single cause (thanks to careful design and redundancies). It's a chain of seemingly innocuous events with a disastrous final outcome. But the chain was so mind-bending that we had to write it down just to remember how it happened. Recently, I was watching a show about the Chernobyl nuclear disaster and was reminded of this failure. Our failure was nowhere near as disastrous - but the initial mistakes, the control system instability, the human intervention and the ultimate failure propagation were very similar in nature. Needless to say, it sent us back to the drawing board for a complete redesign. The robustness of the final design taught me the same lesson - failures are something you take advantage of.


Are the electronics in these rovers really so bespoke that they don't have multiple copies of each electronic component warehoused on-site?

I'd expect that the rover body itself would be bespoke this late in the process (although a parallel test vehicle would be useful - do they have that?).

But in case someone fried the rover's electronics I'd think tearing it apart and replacing them while maintaining the chassis should be doable in 2 weeks, but what do I know?


They almost certainly had flight spares, but with two weeks until your launch window there is zero chance you are de-integrating multiple systems, swapping in the spare, re-integrating, and re-running your acceptance test campaigns. And that is assuming that they only damaged a subsystem. Back-powering the entire spacecraft could have wrecked your power system and anything connected to it. You'd have to disposition every part of the system that was touched. It's much more involved than just swapping in the spare and sending it.


The implied context here is that you'd forgo the usual tests, because the alternative is to send nothing to Mars.

According to Wikipedia they could have stretched those 2 weeks to around 3 weeks, but after that they'd have missed the launch window.

The usual processes are there to have a near-certainty of a working rover, but under these circumstances I'd think they'd just YOLO it and hope for the best.

But that assumes they've got spare electrical components, or alternatively a better use for the booster sitting on the pad than such an improvised mission.


Spirit/Opportunity had the SSTB1 test rover, which supposedly had a complete set of scientific instruments. If it was fully qualified and tested, swapping it out could have been as easy as dropping it in the lander and writing a different serial number in the paperwork.

(I really doubt it was fully tested. But why else have a flight spare vehicle?)


I've worked on satellites, and yeah - everything is super bespoke, very low quantity, very expensive. There is probably a qualification unit, or a flight spare that may be available for many subsystems, but maybe not. Integration is a long and complicated process. Pulling apart this bot, with however many fasteners, joints, etc, and then reassembling it correctly would be a decidedly non-trivial project that could easily take a month or two, given that testing has to be performed at each step along the way to ensure that every sensor and actuator is fully functional throughout the integration process. This style of traditional aerospace assembly / integration is not particularly efficient. The only reason it is done this way is that for these kinds of missions you only get one shot, and total cost is ridiculously high so everything must be done correctly.


Any idea why they would use brushed motors? When every gram counts I would think ditching the mechanical commutator would be a no-brainer, but maybe adding another leg to the H-bridge is a bigger penalty?


In 2003, small brushless DC motors were far less mature and available than they are now, particularly for low-speed/high-torque applications*. Brushless controllers are much more complicated than brushed controllers, particularly on the control software front, so sticking with simpler and more reliable brushed controllers for a space application makes sense (remember, it probably needed to be radiation hardened - doing that for an H-bridge is much easier than a BLDC controller).

*A notable example of this is in the world of RC cars, where rock-crawlers only very recently have started switching to brushless motors using field-oriented control to deliver acceptable very-low-speed behavior. Until FOC controllers became available, brushed motors offered much better low-speed handling.


Without disagreeing with your point, would availability be an issue in this case? They need one or two, have an enormous budget, and if the technology exists can make their own.


Availability is often strongly correlated with technical maturity. Small brushless motors with FOC didn't become widely available and mature until really the late 2010s. Arguably the foundation of nearly all of DJI's product lines is due to their early mastery of small brushless motor control (drones, gimbals, lens controls, robots, etc), and that's a company founded in 2006, well after the events of the article.

You can get good control of brushed motors with just a couple of transistors. Good brushless control means FOC, which really requires a fairly capable microcontroller in addition to all the power electronics for variable-frequency drive. While brushed motors certainly have limitations, those were quite well understood by the early 2000s (to the point here that assessing whether or not damage had occurred was "just" a question of "have these few transistors suffered from voltage applied in an unintended manner"). Brushless motors involve way more components with way more integration required to make them small. Far more complexity and potential failure modes need to be understood.


> Availability is often strongly correlated with technical maturity.

I see what you mean. Yes, agreed.


Closed-loop control of brushless motors is just more complex: in addition to needing 3-phase AC output, you also need either hall sensors or an encoder of some kind to be able to start the motor smoothly, and you need a dedicated IC or MCU for each motor to manage commutation and read the sensors.

I don't think FOC-type controllers were anywhere near common back then either, and FOC is what's needed to run a brushless motor smoothly.

There is just so much more that can go wrong with a brushless setup, vs brushed where you just apply power and that's it.
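To give a feel for the gap: even the simplest sensored six-step (trapezoidal) scheme needs commutation logic running on that MCU, whereas a brushed motor only needs a duty cycle on the H-bridge. Below is a rough sketch in Go of that logic only; the hall-state-to-phase table is just one common convention (not from any datasheet), and the real mapping depends on the motor's hall alignment.

    package motor

    // phaseDrive names which two of the three phases to energize.
    type phaseDrive struct {
        high byte // phase driven high: 'A', 'B' or 'C'
        low  byte // phase pulled low
    }

    // commutation is indexed by the 3-bit hall code (1..6); 0 and 7 are invalid.
    // This particular table is only an illustrative convention.
    var commutation = [8]phaseDrive{
        1: {'A', 'B'},
        2: {'C', 'A'},
        3: {'C', 'B'},
        4: {'B', 'C'},
        5: {'A', 'C'},
        6: {'B', 'A'},
    }

    // step picks the phases to energize for a hall reading, or reports a fault.
    // A brushed motor needs none of this: the commutator does it mechanically.
    func step(hall uint8) (phaseDrive, bool) {
        if hall == 0 || hall == 7 {
            return phaseDrive{}, false // invalid reading: stop driving
        }
        return commutation[hall], true
    }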


I'd guess it's mostly because it was 2003 and decent BLDCs were not super common-place yet. There were some older forms (steppers, PMSMs, etc) but they generally didn't have very good torque/weight performance. Brushed motors would probably have been the answer at the time.


Related: NASA's monster rocket costs several times more than SpaceX's monster rocket.


Your average installed Mars probe widget is not a fungible component; it's been integration tested six ways to Sunday and is certified in its current configuration, as is the assembly of the whole thing, center of gravity, cleanroom status, the torque on every fastener and so on. Even if it's an off the shelf component, it may not be possible to replace it without a chain of sign-offs that would require months of work.


Of course they are bespoke. There are no COTS Mars rovers for sale.


To be fair; there may be COTS Mars rovers for sale (possibly even cheap), they're just not certified or engineered specifically for Mars. Even an RC car might last long enough for a mission if it was free to get there.


Sometimes I wonder whether something like the SpaceX approach would work for these sorts of mission: develop a way to cheaply and reproducibly build the mission hardware, then iterate on it until it works.


Typically your best bet is going with a provider that already produces whatever item, isn't bespoke to that particular program, and has an assembly line of sorts already set up. However the other problem is they are often built in low numbers to begin with so getting hold of a flight qualified unit will most likely be an issue as well. Also everything is tightly packed together which usually makes replacing something involve messing with a bunch of other items.


Yes, and since they're such low-volume, they are super expensive. I've had to fight tooth and nail just to get more than one set of test hardware.


My understanding is that these electronics need to be radiation hardened (e.g. RADHARD) to prevent them from misbehaving in ways far worse than the Belgian bit-flipping incident (https://radiolab.org/podcast/bit-flip). If you want to use commercial off-the-shelf (COTS) parts, you need to install them in triplicate and have them "vote". (https://apps.dtic.mil/sti/citations/ADA391766)
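For the curious, the "vote" part is conceptually tiny: take three redundant copies and return the bitwise two-out-of-three majority, so a single upset bit in any one copy gets outvoted. A minimal sketch in Go (in reality this is done in rad-tolerant hardware or FPGA fabric, not in application code):

    package main

    import "fmt"

    // voteUint32 returns the bitwise 2-of-3 majority of three redundant copies.
    // Any bit flipped in a single copy is outvoted by the other two.
    func voteUint32(a, b, c uint32) uint32 {
        return (a & b) | (a & c) | (b & c)
    }

    func main() {
        a := uint32(0xCAFEBABE)
        b := a ^ (1 << 7) // simulate a single-event upset in one copy
        c := a
        v := voteUint32(a, b, c)
        fmt.Printf("voted value: %#x (matches original: %v)\n", v, v == a)
    }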


Any test that could cause a fatal destructive error should be risk-assessed, with a suitable protocol approved and four-eyes approval on a final checklist before going hot on the electrics. The issue here is poor project governance, not human error.


Yeah, the borrowed multimeter really hammers home how silly this all was. You don't touch other people's lab equipment unless the other ends of the wires are hanging free. A finger pointing to a meter needs to be followed up with a clear confirmation that the wires can be disconnected and that no special care is needed in the process. If I need something that's connected, I always ask the person to disconnect it for me. Definitely a process/culture problem.


$500M? Pocket change...

THE LITTLE VAX THAT COULD https://userpages.umbc.edu/~rostamia/misc/vax.html


I am as old as the hills. How have I never heard this story before? Thank you.


I cost my company about 5 times my yearly salary once, long ago. I sampled an enormous amount of seismic data at 2ms instead of the proper 4ms. This was back when we rented our mainframes from IBM for a pretty penny. The job ran for the entire weekend and Monday morning I was summoned by management, informed of my error, and asked, "You won't ever do that again will you?" and sent back to work.

Knowing that you are allowed to fail, but are very much expected/required to learn from your failure, makes for rather a good employee, in my experience.


> I was into my unofficial second shift having already logged 12 hours that Wednesday. Long workdays are a nominal scenario for the assembly and test phase.

Although the time pressure that comes with the upcoming deadline is understandable, perhaps the bigger lesson here is that when you are possibly sleep-deprived, and have already pulled too long a shift, you are bound to make avoidable mistakes. And that is the last thing you want on a $500M mission with a limited flight window.


Yep. Beyond the technical issue this story shows a people management issue.


Reminds me of a quote attributed to Thomas J Watson:

Recently, I was asked if I was going to fire an employee who made a mistake that cost the company $600,000. No, I replied, I just spent $600,000 training him. Why would I want somebody else to hire his experience?


Well, be prepared to spend $600k more next month for the next "training" session.


The point is that they won't make the same mistake again


I guess the idea is that if the problem was incompetence, they will find some other way to mess up, as opposed to a genuine mistake made by a competent person.


> I guess the idea is that if the problem was incompetence, they will find some other way to mess up, as opposed to a genuine mistake made by a competent person.

Are you suggesting that a competent person never messes up/makes a mistake?

The most fundamental part of life is learning from mistakes, and today even AI is starting to do this. Mistakes and evolution are what _make us_ human and living.


Competent people make fewer mistakes because they’re careful and… competent.


Right, but we can all agree that they still make them, and sometimes big ones. Unless they are so careful that they severely limit both their potential and their contribution to the group. Or they are the rare person who happens to be both "perfect" and lucky.


Okay. But don't start an insurance company with that attitude.


I used to work in a rural manufacturing plant. I once totaled my car on the way to town. Our shop foreman was also a volunteer firefighter and was the first one on the scene. I survived with a bloody nose and shattered confidence. The next day at work we needed some parts from town. I didn't have a car anymore, but that same shop foreman lent me his souped-up pickup truck. I was totally confused. He wisely said:

"Today is the safest day to let you drive my truck, cause I know you'll be extra careful"


You'll be extra careful but I don't know about extra safe. I had a few close calls while driving and it made me a less confident driver which I don't think did anyone any favors.


Ohhhh I hate things that you “just have to not screw up.” A fiddly manual process with so many possible ways to screw something up is almost guaranteed to see a catastrophic failure.

If this really manual fiddly process was really the only way they could test the motors, I’d say that’s a big failure on the design engineer’s part.


Or it just needs a budget that can dramatically increase and/or a deadline that can keep being pushed back to allow for corrections. As exhibit A, I'd like to present the James Webb Space Telescope.


Test requirements often arise after system design is complete. Engineers need time to examine the system, theorize failure modes, and design the tests. Also time to test the system, find some unexpected failure mode, and then design yet more tests.


Would it be so hard to add a diagnostic interface? That's all we're talking about here. It seems like you're making the problem more difficult than it is. And they'd have had to know ahead of time if they wanted to test voltage curves of motors.

Your answer is good in the general case, but for the anecdote, the design was clearly bad.


Off-topic but I never realized how much kapton tape is used to put these things together, until I saw these internal 'guts' photos.


As a spacecraft integration engineer I can confirm that most spacecraft are about 80% kapton tape.


From reading the developer anecdotes in here I think it’s worth mentioning that if just one person can bring down the whole enterprise, a hacker only needs a point of entry to do the same.

For our databases we have separate credentials, compartmentalized access and disallowed "dangerous" commands. This now seems obvious, but we only got there years in. Thankfully, no (major) incidents have occurred to this date.


It's a beautiful story. As a space fan, especially of the interplanetary type, this story was riveting to me as it unveiled details of the spacecraft testing I've never imagined. I do far less important testing in my day to day but I was able to draw some similarities with the author.

Many have posted of their failures here so I suppose I could share a couple of mine.

- Pushing gigabytes of records into a Prod table only to realize the primary key was off by a digit, rendering the data useless for a go-live. It had to be deleted by the database admins and reloaded, which took precious hours. I forget why, but an update wasn't feasible.

- A perfect storm of systems issues that led to all servers in the pool becoming unavailable, causing an entire critical system to go dark. We got it back up within minutes, but it was harrowing nonetheless.

- Realizing hours before a go-live that a key data element was missing, prompting a client who was now in a code freeze to make a change (they were quite upset). Pretty sure I got an unfavorable review from that, but haven't made the same mistake since.


I’m glad they didn’t fire him.

I’m a firm believer that despite all the short comings of US, what makes it great is there are millions of engineers and scientists working to push the frontier of what is possible, and trillions of dollars in economy to fund that into reality.

NASA is truly an inspiration.

And also the private aerospace industry - SpaceX, ULA, Boeing, Lockheed Martin, Blue Origin, Planet Labs etc.

No other country has that.


The fact this can so easily happen shows a lack of safety mechanisms. All the software-related stories in this comment section go in a similar direction: They could have been prevented by simple safety nets. If you accidentally wipe a production database, then it was likely too easy to do so.

Don't blame humans for occasional mistakes; it won't stop them from happening.
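One generic flavour of safety net, sketched in Go (the environment variable name and the dry-run-by-default behaviour are my own invention, not from any of the stories above): destructive actions do nothing unless they are explicitly armed.

    package main

    import (
        "fmt"
        "log"
        "os"
    )

    // destructive wraps anything that deletes or overwrites data. It is a dry
    // run by default and only executes when CONFIRM_DESTRUCTIVE=yes is set, so
    // a fat-fingered run against the wrong environment prints instead of acting.
    func destructive(desc string, run func() error) error {
        if os.Getenv("CONFIRM_DESTRUCTIVE") != "yes" {
            fmt.Printf("[dry run] would have executed: %s\n", desc)
            return nil
        }
        log.Printf("executing destructive action: %s", desc)
        return run()
    }

    func main() {
        err := destructive("truncate table feature_flags", func() error {
            // ... the real TRUNCATE would go here ...
            return nil
        })
        if err != nil {
            log.Fatal(err)
        }
    }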


The idea that one can learn to not make errors is toxic. To err is human. Sure, you can get more reliable at something, but everyone - even the most experienced - will fry the rover at some point.


Hey neat - I once destroyed a $5000 prototype disk drive (I was a Firmware engineer at Quantum in the late 1990s at my first post-college job), also with back-EMF (it flowed from the braking spindle motor into a power chip and melted the circuit board). I felt terrible and that was multiple orders of magnitude cheaper. :)


With only two weeks left to go, why was a multimeter completing the telemetry circuit?


Maybe it was measuring the current rather than the voltage as he thought, which would require putting it in series with the circuit.


That stood out to me as well when I read the article.


The multimeter in the photo has the probes plugged into the volts/ohms jack, though. Maybe it was a different meter.

I figured this was going to be a story about trying to measure voltage with the meter set up on the 10A current range.


Yeah, I don't think he took a photo of the multimeter to use in the article before unplugging it. He didn't yet know it was important.


So that it could be quickly and easily unplugged


Why would you not use a multimeter two weeks before launch? The two things are orthogonal to each other.


This sure seems like a symptom of a much larger organizational failure that could mistake "it's not plugged in" for "it's broken".


This clickbait title is misleading.


"My almost $500M mistake" doesn't have the same ring.


"My mistake that seemed to potentially cost up to $500M, but then it turned out fine" would be more accurate. While a more appropriate title would be something akin to "learning from mistakes" or "a Mars rover testing scare".


Thanks for sharing your experience. It reminded me of another one I'd read a while back by Andrew Latham: https://www.linkedin.com/pulse/what-i-learned-from-my-bigges...


Unrelated to the content of the article, but is the cover image AI generated? Looks kind of like DALL-E style output to me but it's hard to be sure.

(Also, I know I'm breaking the rule "Please don't pick the most provocative thing in an article or post to complain about in the thread." My defense is this is less of a complaint and more just plain curiosity!)


Definitely is.


In my company people aren’t allowed to change org level GitHub settings without a second person watching over them, but NASA let a 28 year old kid run electricity into spacecraft without oversight?

It really seems like for anything which, when done improperly, could cause millions of dollars in damage, there should be a second person reviewing your setup first.


The average age of engineers running Apollo 11 was supposedly 28

https://www.popularmechanics.com/space/a4288/4318625/?utm_so...


Great write-up. My two favorite quotes:

> I had learned from countless experiences in this and other projects that bad news doesn’t get better with age

That's so true! We tend to sit on bad news and hope that somehow time will blunt it; but if anything the opposite happens.

And

> I still remember the shock when Project Manager Pete delivered the decision and the follow-on news: ‘These tests will continue. And Chris [the author] will continue to lead them as we have paid for his education. He’s the last person on Earth who would make this mistake again.’

We sometimes think people who made one mistake will make another one, and it's better to go with the person who doesn't make mistakes. But that's not the correct approach. People who don't make mistakes are often people who don't do anything.


So if I understand it correctly, the electrical connector was genderless? If so, that's relatively rare (the only ones I can think of are Anderson Powerpoles, which I don't think are rated for interplanetary vehicles) and extremely stupid.

Edit: I suppose he could've been using alligator leads.


You can configure an Anderson connector pair in a way that it won't reverse. I use them quite frequently, and have a local (though admittedly undocumented...) standard for what orientations mean what voltage. It's not bulletproof, but it does make me think when things don't line up.


A breakout box usually has both male and female sides, along with the banana or whatever breakout in the middle, so it can be plugged into either side of the circuit, or both at once and watch signals while the system operates. It's not genderless, it's a pass-through.

Making an only-faces-the-motor breakout, and a separate only-faces-the-driver breakout, might've been prudent, presuming that they used unique and consistent connectors, for instance a single gender always on the motors. But that's quite an assumption and I can imagine a ton of reasons it might not apply.


This was a test fixture so probably banana leads.


I color code _everything_.


All the major suppliers make hermaphroditic connectors, but like you I’ve only ever seen the powerpoles in person.


My friend reports his phone started dropping contacts methodically, until only one remained: a person he didn't know.

Googling him turned up a junior dev at a FAANG. Oops.

An hour later the contacts repopulated, and all was well. But had to be a white-knuckle time for that poor shmuck.


There is a principle here that I haven't seen mentioned, and that is how easy it is to discard something as a cause of a problem because it seems so minor, so routine (i.e. removing the "spare" multimeter), and therefore to get blinkered about what is going on.

There are more than a few times where I am scratching my head as to "how could my change have possibly broken this" only to remember a couple of hours later that I had made another change somewhere, or rebooted, or changed a config file temporarily.

I guess it just says that we all need to log everything we do, including removing spare multi-meters, so that by looking over the list we can remember these things.


> failure is not an option — it comes pre-installed

Love this. Whenever people say “failure is not an option”, I get the sense that they don’t really understand how the universe works. It’s like saying “entropy is not an option”. Uh...


Another failure is from the person who used the multimeter to complete a circuit - and then didn't even leave a note on the multimeter. That person could reasonably have anticipated this error mode, and taken steps to prevent it.


When I was 20 and in college, I used to work at a Barnes and Noble. When I was working the registers one day, I apparently forgot to put a $300 gift card in a lady's bag at checkout.

The store manager got a complaint about it like a month later, and she tracked down the gift card number from the receipt. It wound up getting loaded with even more money a few days later, and given to someone else.

Anyway, the whole incident got me fired. And to this day, I always check that I put gift cards into bags (on the rare occasion it comes up in my job as a software engineer for NASA missions).


I worked at a video startup thing, and watching the app fail after people had sunk like 30 mins into using it... Oh man, that was cringe.

The videos were recorded because they were legal/financial related and also auto-transcribed for text search. I would watch them to trace how the problem occurred (I know there are many ways to log).

I was the only developer so it was my fault ha. The app had so many parts and I just used an E2E test to make sure everything generally worked. It's cool Chrome has a fake video feed (spinning green circle/beeping).


Fantastic story! One of the fascinating things about this is the parallel to Apollo 12 - a lightning strike / power surge led to a total loss of telemetry, which raised the specter of losing the entire mission - and there was a similar resolution when telemetry was restored!

I've written a bit about it myself - https://flyingbarron.medium.com/lightning-strikes-92482387ca...


I worked at JPL for two years in college and helped with flight hardware testing a few times (probably in the same clean room this story took place in, albeit several years later). I can definitely see how a mistake like this could get made. A few stories I remember hearing from those days:

1. Bending pins from trying to insert a connector incorrectly
2. Running a full day of testing but forgetting to set up data recording
3. Accidentally leaving a screwdriver next to the hardware inside a thermal vac chamber in an overnight test

Fun times!


Ernie is the hero in this story.

As Fred Rogers said, "look for the helpers".


What a great write up. The tension preceding launch must have been tremendous, and lack of sleep adding to the possibility of such a small yet critical error. Good lesson for all of us.


It's a nice story with a nice message. And it's pretty normal that mistakes happen, especially under pressure and during long shifts. The "mistake" itself is understandable; what is shocking to me is the multimeter thing: learning that a mythical "NASA guy" in charge of really serious stuff didn't realize that the multimeter was measuring current, and was thus part of the circuit, so removing it would switch something off.


So let me share mine. It went like: "Hey, can you correct this issue directly on the database holding 5 years of corporate financial data?" In the midst of testing and looking for the issue I wrote a DELETE with an incomplete WHERE clause that would have deleted lots of data. Fortunately it was stopped by a constraint violation. I still remember the adrenaline rush as I processed what had just happened.
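One habit that can turn that kind of near-miss into a non-event: run the DELETE inside a transaction, check how many rows it touched, and only commit if the count looks sane. A rough sketch in Go with database/sql (the table name, placeholder style and threshold are invented for illustration):

    package dbguard

    import (
        "database/sql"
        "fmt"
    )

    // deleteWithSanityCheck deletes rows inside a transaction and refuses to
    // commit if far more rows were affected than expected, which is a cheap
    // guard against an incomplete WHERE clause.
    func deleteWithSanityCheck(db *sql.DB, maxExpected int64) error {
        tx, err := db.Begin()
        if err != nil {
            return err
        }
        defer tx.Rollback() // harmless no-op if Commit already succeeded

        res, err := tx.Exec(`DELETE FROM ledger_entries WHERE batch_id = $1`, 42)
        if err != nil {
            return err
        }
        n, err := res.RowsAffected()
        if err != nil {
            return err
        }
        if n > maxExpected {
            return fmt.Errorf("refusing to commit: %d rows affected, expected at most %d", n, maxExpected)
        }
        return tx.Commit()
    }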


I thought this was about this story. Probably more interesting.

"The First Bug on Mars: OS Scheduling, Priority Inversion, and the Mars Pathfinder" - https://kwahome.medium.com/the-first-bug-on-mars-os-scheduli...


When I saw the headline I thought it may have been a write up from this:

https://www.latimes.com/archives/la-xpm-1999-oct-01-mn-17288...

> NASA lost its $125-million Mars Climate Orbiter because spacecraft engineers failed to convert from English to metric measurements


An even worse engineering horror story: https://faroutmagazine.co.uk/the-steely-dan-song-that-was-al... (more sad trombone and far less "relieved engineer, happy ending.")


That's an interesting read!

Not really sure whether this counts as "engineering" though (accidentally taping over a track), nor would I consider it "worse" than potentially destroying $500M worth of advanced equipment.


I'm really at a loss for words. There is only one lesson from this incident: nobody is supposed to touch a $500M piece of equipment after 12 hours of work. Period. The author is not getting it, and neither is anybody in the comments. The world is run by idiots and it shows.


It seems like there was a lot of pressure to meet the deadline and everybody was overworked in this environment. A perfect example of normalization of deviance. It's a shame that there still exist many workplaces like that. It's our duty to be aware of what's happening around us and tell the guy - Hey, what the fuck are you doing here? Go home, this is no critical emergency.


I like this - “Let your scars serve you; they are an invaluable learning experience and investment in your capability and resilience.”

I have had a couple of scars of my own. I feel like sometimes you become risk-averse, and when you are launching new things you will face the fear of failure.


That one time I deleted the files on the CEO's laptop and did the backup afterwards - wrong order. Oops. Never heard anything about it though, so he didn't fully entrust all his data to the intern. Wise decision.


I felt like the real lesson is the missing flight-recorder system: if each action and behaviour had been properly logged, it would have been pretty clear that the telemetry was lost before Chris even began testing the RAT motor.


Ctrl-F, "The Martian"

"Phrase not found"

But... how? This is exactly the same story that happened in the book. Since Spirit is older than the book, perhaps this was the real-life inspiration.


Bravo to the author. I think if I had made the same mistake, and if the damage had been permanent, I would probably be scarred for life and never recover from it.


The author is scarred for life, reliving it in certain circumstances. They use the scar as a tool to sharpen their performance.


I once legit blew a $2K FPGA in a postgraduate lab by fat-fingering. Horrible memory.


And I thought I had a big mistake when I crashed an expensive drone.. nice read!


It's an interesting story, but not a catastrophic failure.


Likely my favorite article ever; amazing writing.


$500M is a lot to waste in private money, but unfortunately in taxpayer money it's a pittance and usually swept under the carpet.


So, some short-sighted bean counters combined the EE & CE degrees and you are the result? Ask for a tuition refund!


At least it was just a Fluke.


Decades ago I worked as a leader of a small team of sysadmins. We worked around the clock maintaining the server room and critical applications for an average-size bank.

One of our responsibilities was to execute nightly checklists to run various processes, do backups at correct times, etc. These processes would be things like running calculations on loans, verifying results, etc.

We had a huge ream of checklists to accomplish this and we were supposed to follow them religiously.

We had two very similar applications: one was our core, and the other was the core from another bank we had bought - the same application but an older version with a slightly different config. Consequently, we had two tracks of checklists with very similar steps.

One of those steps was to change the accounting date in the system. The application was a terminal app. We would telnet to the server, log in, then we would execute commands in the menu driven app. To change the date we would have to go to a special menu for super dangerous applications. It required the user to log in again.

Our core system required logging in, selecting that we want to advance the accounting date by one day, entering the admin password again, pressing enter, then waiting for about 4 hours while the process ran. Then the process would exit back to the menu, where the highlighted option would be to advance the day.

Our legacy system required logging in, selecting the option to advance the accounting date, entering the admin password, pressing enter, then waiting for two hours, after which a popup showed asking a stupid question where we would always just press enter, then waiting for another hour until it exited to the menu.

We quickly figured out that we could just press the enter key twice on our legacy system. The second enter press would just sit there in the keyboard buffer and dismiss the popup. This was very useful for us, as this was the only operation that interrupted what would otherwise be the only time during the night when we could go have a kebab...

One night I made a mistake and pressed enter twice... on the wrong system. When I figured out what I had done, I realised the process would exit to the menu and should then ask for the admin password.

But, unfortunately, the application had a bug (or a feature). Once it exited to the menu, it came back in, but for some reason it remembered that the admin password had already been entered and started advancing the accounting date again without asking for the password.

Unfortunately, the date was December 24. For the whole of December 24, the entire bank was unable to process any operations while we restored from the last good backup (taken before day close) and then redid the EOD operations. Then on December 25, as a penalty, I had to sit for the entire day with the accounting department, observing how they manually entered all of the operations that would normally have happened automatically on Dec 24th.

One extra key pressed.


He’s the last person on Earth who would make this mistake again. That hit HAAARD


Trust but verify!!


It's crazy how much better boomers had it. If this had happened in 2010 then he would have been fired after that.


Why would you ever allow people to work 12-hour days on something so important? Grad student labour is cheap, surely trying to have one person do the work of two is a false economy.


I have bad news for you about health care professionals.


Health care professionals is a weird one, because while long shifts are dangerous, patient handover is also dangerous and there may be an argument that longer shifts means fewer handovers which could result in better patient outcomes.


Would we accept this if they were dealing with nukes, rather than people? Yeah we let people who haven't had sleep in 36 hours handle the nukes because having the people involved talk to each other between shifts is hard.


They do know roughly how long it takes to take care of a patient & should be set up with overlapping shifts and to be winding down towards a normal shift (i.e. no new patients) so that there's no handoff of a single patient but no one is working long hours. Some patients might take longer than a single shift, but handoff is inevitable at some point. You can improve your handoff processes but you can't improve the decision making of someone working a 12 hour shift.


Is this maybe one of those "if something is hard, do it more often" things?


Maybe, but I won't claim to know how to quantify things well enough to evaluate proposals. I do know that even in tech, with low stakes, handoff is a problem. I recall hearing of a team trying to run on-call across 3 different timezones; they requested to scale back to 2 with longer hours because of the handoff problem (and these hand-offs were occurring daily).


I would prefer fewer patients per doctor then. It seems that the problem is due to the limited supply of doctors. In both countries where I lived, supply of doctors was artificially limited by regulation.


That just means that doctors need to handle fewer patients.


Fewer patients doesn't necessarily get the patients they are handling out the door faster.


I have worse news for you about grad students.


So the work is too important to work 12 hour shifts on it, yet your solution is throwing “cheap” grad students at it?


> work is too important to work 12 hour shifts on it

Yes. Because it is known that exhausted people make mistakes. The work is too important to let exhausted people screw it up, so you should make sure everyone working on it is well rested.

> your solution is throwing “cheap” grad students at it

Yes? It is testing an electric motor. They can do it. The solution is that you employ enough people so nobody needs to work heroic 12-hour shifts.


That “solution” is nothing more than typical HN backseat driving.

In the real world there are budget, personnel and hiring constraints. You don’t get to hire all the people you want. You make do with what you have, and try to push the mission forward, even in suboptimal conditions.


So we should make every person work 12 hours, then. This 8 hour workday is for the birds, the company has a mission!


Wait until I tell you who's doing work in hospitals and how much they're being paid for it.


Another relevant bit of info from hospital accidents: hand-offs between shifts are known to increase the risk of a mistake in care and are part of the reason nurses and doctors work such long hours.


I avoid, if possible of course, going to the hospital right before a shift change for this very reason.


I'm not questioning your logic here, but how do you keep intimate knowledge of the seasonal vagaries of shift changes at every department of your local hospital?


If you’ve been to that hospital you can easily take note. And most places have a shift change between 5-7am. This one is almost universal, as far as I have observed, even in different countries.


You can probably make a good guess, but at this point I wouldn't be surprised to find websites or Facebook groups dedicated to tracking this information for hospitals in any given area.


Because they’re more or less universal.


On your next hospital visit for life-saving care, I'm sure you will be comforted to know that nurses (in the US) typically work 12 hour shifts and they're on their feet the whole time.


Nurses work 3x12 shifts to reduce the number of times patient care is handed off. If everyone worked 8 hour shifts you’d have 3 handoffs per day and a minimum of 3 different people caring for the patient. With 12 hour shifts you have only 2 handoffs per day and can have 2 people trade off on patients if their schedules line up.

I’m sure it varies by location, but my nurse friends only work 3x12, giving them 4 days off per week. Working 12 hour shifts is much more acceptable when you have more days off than days spent working. They’re virtually unavailable on days they work, but then they’re off traveling or having fun for 4 days, some times more if they combine their days off back to back. My close nurse friend routinely takes week long vacations without actually taking any time off at all.


This is true, but IMO doesn't refute the point, it just makes me concerned about care quality. Were there any studies that showed that the long, grueling shifts are actually better or is it simply this way because it's always been that way and change would be hard and expensive, and because "my grandma walked up hill to school both ways, so young people can too"?


Who said it's gruelling? In the case of the rover, it's not year round, it's when they are preparing for a launch. Also, you're on HN, so surely you have heard of flow.


And doctors doing 24h shifts. At least in the NICU my kids were in.


One of the many reasons I'd never live there. I'll stick to places with better labour laws thanks.


One of the hardest problems in a workplace is coordinating the workers. There's also a substantial overhead cost for every employee. I'm sure workers are less efficient at the end of a 12 hour shift, but shift changes also cost a lot and introduce lots of opportunities for errors.


> Grad student labour is cheap

Sir, this is NASA not Kerbal space program ;)


I am guessing from my personal experience, but for the most part people tend to assign the more repetitive technical jobs (like testing motors) to people lower on the ladder, so (relatively) less experienced people are going to touch the parts more.

Then such people are doing the "actual work" while overloaded with tasks, so working overtime is the rule rather than the exception. The justification for this is that he was "getting experience" while trying to move up in his career. So all good.

Most people are going to remember that he messed up, rather than that he was working overtime to meet expectations - except maybe for the guy that patted him on the shoulder; he saw enough to understand it.


They definitely should have used a grad student to complete the circuit, freeing up that multimeter.


I am a bit disappointed that the name of the author is not Howard Wolowitz.


How is this a $500 million mistake? It seems the issues didn’t cost $500 million.


It's not "mistake" singular, it's actually mistakes.

The first mistake is the $500 million rover is fried.

The second mistake is believing the first mistake.

(or put another way, the first mistake cost $500 million; the second mistake, which they didn't realise at the time, saved $500 million)

But you can't explain the second mistake without first explaining the first mistake, hence the title.


I guess the idea is that it potentially could have cost $500 million therefore it was a $500 million mistake. It's not exactly accurate but it does help contextualize the gravity of their mistake.


Reminds me of my favorite George Carlin bit[1]:

Here's one they just made up: "near miss". When two planes almost collide, they call it a near miss. It's a near hit. A collision is a near miss.

[1]: https://www.quotes.net/mquote/35854


It was feared to be a $500M mistake.


Congratulations. 102 circuits would have taught this. But some short-sighted bean counters merged EE & CE so you didn't get the opportunity. I suggest you ask for a tuition refund, as they failed to educate you as they promised.



