My $500M Mars rover mistake (chrislewicki.com)
1028 points by bryanrasmussen 10 months ago | 343 comments



Really well written story.

As a software engineer, I have a couple stories like this from earlier in my career that still haunt me to this very day.

Here's a short version of one of them: About 10 years ago, I was doing consulting work for a client. We had worked together for months to build a new version of their web service. On launch day, I was asked to do the deployment. The development and deployment process they had in place was awful and nothing like what we have today; just about every aspect of it was manual. Anyway, everything was going well. I wrote a few scripts and SQL queries to automate the parts I could, and they gave me the production credentials for when I was ready to deploy. I decided to run what you could call my migration script one last time, just to be sure I was ready. The moment I hit the Enter key, I realized I had made a mistake: I had updated the script with the production credentials right before deciding to do another test run. The errors started piling up and their service became unresponsive. I was 100% sure I had just wiped their database, and I was losing it internally. What saved me was that one of their guys had completed a backup of their database only a couple of hours earlier in anticipation of the launch; in the end they lost a tiny bit of data, but most of it was recovered from the backup. Ever since then, "careful" is an extreme understatement for how I interact with database systems, and production systems in general. Never again.


Your excellent story compelled me to share another:

We rarely interact directly with production databases as we have an event sourced architecture. When we do, we run a shell script which tunnels through a bastion host to give us direct access to the database in our production environment, and exposes the standard environment variables to configure a Postgres client.

Our test suites drop and recreate our tables, or truncate them, as part of the test run.

One day, a lead developer ran “make test” after he’d been doing some exploratory work in the prod database as part of a bug fix. The test code respected the environment variables and connected to prod instead of docker. Immediately, our tests dropped and recreated the production tables for that database a few dozen times.


Verbatim from my current code:

    if strings.Contains(dbname, "prod") {
        panic("Refusing to wipe production database!")
    }
    Truncate(db)


Ours are not named with a common identifier, this approach needs constant effort to maintain through refactoring, and there's still scope for a mistake.

*Ideally* devs should not have prod access at all, or their credentials should have only limited access, without permissions for destructive actions like DROP/TRUNCATE etc.

But in reality, there's always that one helpful dba/dev who shares admin credentials for a quick prod fix with someone and then those credentials end up in a wiki somewhere as part of an SOP.


That's why you do credentialing via SSH keys: each key has an explanation and maps to a user, and non-DBA keys should expire.

If you need access for a quick prod fix, your key gets added to the machine with that explanation and a one-week (or shorter) lifetime.


I also have a table with one row in it indicating whether the database is prod.
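
A minimal sketch of that idea (table, column, and helper names here are hypothetical), assuming a standard Python DB-API connection and failing closed if the marker row is missing:

    def assert_not_production(conn):
        # Read the single-row marker table before doing anything destructive.
        cur = conn.cursor()
        cur.execute("SELECT is_production FROM environment_info")
        row = cur.fetchone()
        # Fail closed: a missing marker row is treated the same as production.
        if row is None or row[0]:
            raise RuntimeError("Refusing to run destructive operations: this looks like production")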


I've added a similar safety to every project. It's not perfect, but this last line of defense has saved team members from themselves more than once.

For Django projects, add the below to manage.py:

    # e.g. TEST_PROTECTED_ENVIRONMENTS = {"production", "staging"}, defined earlier in manage.py
    env_name = os.environ.get("ENVIRONMENT", "ENVIRONMENT_NOT_SET")
    if env_name in TEST_PROTECTED_ENVIRONMENTS and "test" in sys.argv:
        raise Exception(f"You cannot run tests with ENVIRONMENT={env_name}")


I think runtime checks like this using environment variables are great. However, what has burned me in the past when debugging problems is not knowing what the environment actually was at the time the logs were produced. So when the test-protected-environments variable needed to be updated, I might have a hard time tracking that back.
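
One way to mitigate that, sketched here with hypothetical names: emit the resolved environment once at startup so the logs themselves record what the guard saw at the time.

    import logging
    import os

    logger = logging.getLogger(__name__)

    def log_resolved_environment():
        # Record the environment the process actually resolved, so later debugging
        # doesn't depend on reconstructing what the variable was set to back then.
        env_name = os.environ.get("ENVIRONMENT", "ENVIRONMENT_NOT_SET")
        logger.info("Resolved ENVIRONMENT=%s at startup", env_name)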


Everybody replying to you that this is fragile is missing the point. This kind of code isn't the first line of defense—it's the last.


Exactly: it's layers of prevention rather than being just one screwup away.


And when your last line of defense fires... you don't just breathe a sigh of relief that the system is robust. You also must dig into how to catch the problem sooner, in your earlier lines of defense.

For instance, test code shouldn't have access to production DB passwords. Maybe that means a slightly less convenient login for the dev to get to production, but it's worth it.


Yup, I have 3 prompts if you want to wipe anything.

That's one of the reasons I put interactions with databases behind a CLI.


This is bad because if someone forgot to include "prod" in the name, or for whatever reason the code executed beyond the panic, you'll wipe out the db.

There is no code that will protect your db/data. Only replication to a read-only storage will help in such situations.


If code is executing past a panic, I think it is unlikely that you can trust the integrity of your database anyways.


But what if you have a cron job that auto-replicates and then deletes everything after you forward it?


And then it turns out that the order of the parameters was mixed up... just kidding.


Just yesterday, I did a C# Regex.Match with a super simple regex, ^\d+, and it seemed not to work. I asked ChatGPT and it noted that I had a subtle mistake: the parameters were the other way around... :facepalm:


That's indeed a drawback of function call syntax compared to method call syntax where the object comes before the name of the method.


#metoo


We had this - 10 years ago. In our case there was a QA environment which was supposed to be used by pushing code up with production configs, then an automated process copied the code to where it actually ran _doing substitutions on the configs to prevent it connecting to the production databases_. However this process was annoyingly slow, and developers had ssh access. So someone (not me) ssh'd in, and sped up their test by connecting the deploy location for their app to git and doing a git pull.

Of course this bypassed the rewrite process, and there was inadequate separation between QA and prod, so now they were connected to the live DB; and then they ran `rake test`...(cue millions of voices suddenly crying out in terror and then being suddenly silenced). The DB was big enough that this process actually took 30 minutes or so and some data was saved by pulling the plug about half-way through.

And _of course_ for maximum blast radius this was one of the apps that was still talking to the old 'monolith' db instead of a split-out microservice, and _of course_ this happened when we'd been complaining to ops that their backups hadn't run for over a week and _of course_ the binlogs we could use to replay the db on top of a backup only went back a week.

I think it was 4 days before the company came back online; we were big enough that this made the news. It was a _herculean_ effort to recover this; some data was restored by going through audit logs, some by restoring wiped blocks on HDs, and so on.


Our test suite expects that the database name has a `_test` suffix, so you can't run the tests even locally without the suffix.
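
A guard like that can be a few lines at the top of the test bootstrap; here is a minimal sketch with a hypothetical helper name:

    def assert_test_database(dbname):
        # Fail fast if the configured database doesn't follow the _test naming convention.
        if not dbname.endswith("_test"):
            raise RuntimeError(f"Refusing to run tests against {dbname!r}: name lacks the _test suffix")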


Our test harness takes an optional template as input and immediately copies it.

It’s useful to distribute the test anyway, especially for non-transactional tests.

If the database initialisation is costly, that's useful even if tests run on an empty schema, as copying a database from a template is much faster than creating one DDL statement by DDL statement, for Postgres at least.
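
For what it's worth, a minimal sketch of the template-copy approach, assuming psycopg2 and hypothetical database names; Postgres copies the template's files directly instead of replaying DDL:

    import psycopg2

    admin = psycopg2.connect(dbname="postgres", host="localhost", user="postgres")
    admin.autocommit = True  # CREATE DATABASE cannot run inside a transaction block
    with admin.cursor() as cur:
        # Each test run gets a throwaway copy of the pre-built template database.
        cur.execute("CREATE DATABASE test_run_1 TEMPLATE app_test_template")
    admin.close()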


(distribute as in parallelise, possibly across multiple machines)


Our test suite uses a DB user that exists in the Docker DB but not in prod, so dropping the prod database cannot happen.


This is why I always delete by ID when cleaning up after tests.


At a place I was consulting at about 10 years ago, one of the internal guys on another product dropped the prod database: he was logged into his dev db and the prod db at the same time in different windows, and he dropped the wrong one. Then, when they went to restore, it turned out the backups hadn't succeeded in months (they had hired consultants to help them with the new product for good reason).

Luckily the customer sites each had a local db that synced to the central db (so the product could run with patchy connectivity), but the guy spent 3 or 4 days working looooong days rebuilding the master db from a combination of old backups and the client-site data.


> logged into his dev db and the prod db at the same time in different windows

I am very worried about doing the wrong thing in the wrong terminal, so for some machines I colour-code my ssh windows: red for prod, yellow for staging, and green for dev. E.g. in my ~/.bashrc I have:

    echo -ne '\e]11;#907800\a'  # yellow background


This is a good idea, especially when tired, end-of-day, crunch-time work is happening!


About 10 years ago I literally saw the blood drain from a colleague's face as he realised he had dropped a production database because he thought he was in a dev environment.

A DBA colleague sitting nearby laughed and had things restored back within a few minutes....


Isn't that almost exactly what happened at github too?


This happened to Gitlab.


Yup.


Anecdote: I ran a migration on a production database from inside Visual Studio. In retrospect, it was recoverable, but I nearly had a heart attack when all the tables started disappearing from the tree view in VS…

…only to reappear a second later. It was just the view refreshing! Talk about awful UI!


Around 15 years ago, I was packing up, getting ready to leave for a long weekend, when one of our marketing people I was friends with came over with a quick change to a customer's site.

I had access to the production database, something I absolutely should not have had but we were a tiny ~15 person company with way more clients than we reasonably should have. Corners were cut.

I wrote a quick little UPDATE query to change some marketing text on a product, and when the query took more than an instant I knew I had screwed up. Reading my query, I quickly realized I had run the UPDATE entirely unbounded and changed the description of thousands and thousands of products.

Our database admin with access to the database backups had gone home hours earlier as he worked in a different timezone. It took me many phone calls and well over an hour to get ahold of him and get the descriptions restored.

The quick change on my way out the door ended up taking me multiple hours to resolve. My friend in marketing apologized profusely but it was my mistake, not theirs.

As far as I remember we never heard anything from the client about it, I put that entirely down to it being 5pm on Friday of a holiday weekend.


That's why I always write a BEGIN statement before executing updates and deletes. If they are not instant or don't return the expected number of modified rows, I can just roll back the transaction.
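
For concreteness, a sketch of that habit in script form, assuming psycopg2 (whose connections do not autocommit by default, so the UPDATE stays in an open transaction until an explicit commit); the table and connection details are hypothetical:

    import psycopg2

    conn = psycopg2.connect(dbname="app", host="db.example.internal")
    with conn.cursor() as cur:
        cur.execute(
            "UPDATE products SET description = %s WHERE id = %s",
            ("New marketing text", 12345),
        )
        # Nothing is committed yet; check the blast radius before making it permanent.
        if cur.rowcount == 1:
            conn.commit()
        else:
            conn.rollback()
            raise RuntimeError(f"Expected 1 row, UPDATE touched {cur.rowcount}; rolled back")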


That, and I start the line with /*, write the where clause first, and immediately before I execute the query I check the db host.

Oh, and I absolutely refuse to do anything but the most critical stuff against prod on Fridays.


The lesson is to never attempt anything on a Friday afternoon whose recovery could take far more time than the change itself.


Or is the lesson to _always_ attempt such critical changes on a Friday? After all, in this instance the client didn't notice any problems, apparently because they were already off to their weekend.

For me personally, the much bigger issue would be harming the client, their business, or our relationship. Doing a few hours of overtime to fix my mistakes would probably only feel like well-deserved punishment...


One place I worked (some 20 years ago) had a policy that any time you run a sudo command, another person has to check the command before you hit enter. Could apply the same kind of policy/convention for anything in production.


I'm not sure this doesn't just lead to blind rubber-stamping unless this is done very very rarely


The trick is to have good access controls so confirmations happen often enough to be useful, but not so often to be rubber-stamped


I guess most of the common tasks are scripted / automated. Running a "raw" sudo command should be very, very rare.


That's not good advice IMO, as most sudo commands will mess up just one host, and that's something you should generally be prepared for. You're more likely to develop a culture where engineers think of hosts as critical resources, whereas they should generally be considered instances that can be thrown away. It's better to identify hosts that are SPOFs and be cautious on those only.

I can think of a larger blast radius when deleting files on a shared mount point for example but it's not representative to the regular use of sudo.


I have a rule when working on production databases: Always `start transaction` before doing any kind of update - and pay close attention to the # of rows affected.


If you use Postgres you can put

  \set AUTOCOMMIT off

in your .psqlrc, and then you can never forget the BEGIN TRANSACTION; every statement is already in a transaction. It's just the default behaviour to automatically commit each statement, for some ungodly reason.

Years ago I hired an experienced Oracle developer and put him to work right away on a SQL Server project. Oracle doesn't autocommit by default, and SQL Server does. You don't want to learn this when you type "rollback;". I took responsibility and we had all the data in an audit table and recovered quickly. I wonder if there are still people who call him "Rollback" though.


> it's just the default behaviour

That's good from the DBA perspective, but relying on that default as a user is risky in itself, when you deal with multiple hosts and not all are set up this way.


What strikes me as remarkable in all such stories is how, almost always, the person committing the mistake is a junior who never deserves the blame, and how cavalier the handoff/onboarding by the 'seniors' working on the projects is.

Having worked in enough of these, though, I am aware that even they (the "seniors") are seldom entirely responsible for all the issues. It's mostly business constraints that force the cutting of corners, and that ends up jeopardizing the business in the long run.


As I said on Slack the other day in response to a similar story, "If, on your first day, you can destroy the prod database, it's not your fault."

(One of my standard end-of-interview questions is "how easy is it for me to trash the production database?" Having done this previously[1] and had a few near misses, it's not something I want to do again.)

[1] In my defence, I was young and didn't know that /tmp on Solaris was special. Not until someone rebooted the box, anyway.


> /tmp on Solaris was special.

I’ve had a search but can’t work out why it’s special.


It gets wiped on reboot. I remember around 2007 on Gentoo Linux this behavior changed. I was using /tmp as pretty much a "my documents" type folder; I updated, and one day all my stuff was gone! I was flabbergasted. But yeah, it was reckless to store things in a folder that pretty much has "temp" in the name!


This is rude, but I'd like to reply to a comment you deleted in a separate thread.

"why didn't they have a hot-spare" They do! Flight spares are complete, flight-rated copies of spacecraft built for exactly this contingency: https://en.wikipedia.org/wiki/Flight_spare After launch the flight spares are used for terrain testing and troubleshooting. (The "mars yard" has flight spares for Curiosity and Perseverance https://www-robotics.jpl.nasa.gov/how-we-do-it/facilities/ma... which were used to test some wheels to destruction after Curiosity started showing some wear https://www.planetary.org/articles/08190630-curiosity-wheel-... )

The blog post lays it on a bit thick with the $500 million number and the "launch only two weeks away" given that the article itself is illustrated with a photo of the Sojourner flight spare. Spirit had the SSTB1 test rover. If he had actually blown out the entire electrical system, they could have launched it instead. Swapping out the entire vehicle right before launch would have been an awful job, but it's not flat out impossible.


Not rude at all! I appreciate the reply. Only reason I deleted my message was because right after posting, I scrolled down and saw someone asking the exact same question at the top level, so I felt like it was best to conserve effort and not repeat them.

I liked that other people pointed out that the risk could have been eliminated by using polarized connectors (I hope they started doing this after the incident), but it also made me wonder about "back-EMF" caused by solar flares. In other words, maybe all thick wires and ground/power planes should be hardened against current surges simply due to a solar event hitting Mars (which may incidentally cover the case of back-powering the driver circuits).


Thank you.

I have been burned by this in some version of Ubuntu and have assumed it was normal behaviour ever since.


A friend once had to remotely do an OS update of a banking system. Being cautious, he thought he'd back up some central files, just in case and went "mv libc.so old_libc.so". Had to call some guy in that town to throw in the Solaris CD on prem at 2:30 in the morning...


It's never this simple, and calling someone is probably still the right thing to do, but fixing stuff like this is what /sbin is for.


One way I mistake-proof things in SQL Management Studio is to have different colors for production vs test databases.

To do that, on the "connect to server" dialog, click "options". On the tab "connection properties" in the "connection" option group, check "use custom color". And I pick the reddest red there is. The bottom of the results window will have that color.

edit: my horrible foul-up was restoring a database to production. The "there is trouble" pagers were all Iridium pagers since they loved climbing mountains (where there was no cell service back then). But then that place didn't use source control, so it was disasters all the way down.


>The very moment after I hit the Enter key, I realized I had made a mistake

This brief moment in time has a name: an ohnosecond.

https://en.wiktionary.org/wiki/ohnosecond


Seems this is very typical; first-time launches usually lose some data.

We never hear about the first-time launch deploys that wipe ALL the data, because whoever is that unlucky probably never gets to browse Hacker News.


As a young consultant, I was once one Enter away from causing a disaster, but something stopped me. I still shudder even though it didn't actually happen. Nothing of the sort in many years since, so a great lesson in retrospect I guess.


We used to have another engineer watch over your shoulder when you do prod stuff; it can be very helpful.



I'd love to know the long-term physiological effects on the body of these events. I've had a few. Still feel shaky :)


Hope you bought that guy a beer.

Great story, thanks for sharing.


to be fair, it's a rite of passage to do something like this.

But you should definitely have bought that man a beer :)


Hats off to the backup guys.


Backup saving the day.


Nightmare fuel


I'm reminded of the phrase: if your intern deleted the production database, you don't have a bad intern; you have a bad process.

Whether this was a process problem or a human one we don't really get to judge, since we do expect more from an FTE.

I'll just say putting myself into his shoes made me tear up as I read the dread and pangs of pain upon realizing what happened - then to have life again after the failure of the ray of hope. That weight, I've never had a project that so many people depended on.

All heroes in my book.


At a major brokerage firm I accidentally hit prod with a testing script that did several millions of dollars of fake FX test trades.

The first thing mentioned in the post-mortem call was: "No one is going to blame the guy who did those trades. It was an honest mistake. What we are going to do is discuss why a developer can hit the production trading API without any authentication at all."


Were the trades any good though?


No, they were caught by our trading ops guys. A few minutes after I hit enter I got a rather chilling phone call from them. So that part of the system worked.


Plot twist: It made so much money that that’s now their strategy.


Back in school, my roommate's mom worked for a hedge fund and he did part-time work for them. He factored out a common trading engine from individual strategies, and one day the head of the fund asked him to run a strategy that had made a bunch of money in the past, but had been retired after failing to make money for a while. So, he put the strategy back in production without any testing, forgetting that he had recently done some minor refactoring of the trading engine. He typo'd one variable name for a similar variable name, so in the loop where it broke down large orders into small orders, it actually had an infinite loop. Luckily the engine had an internal throttle, so it wasn't trading as fast as it could send messages over the network.

I was chatting with him when he noticed the stock the strategy was trading (KLAC) was gradually declining linearly. He looked at the L2 quotes and saw that someone using his brokerage was repeatedly putting out small orders, and then he realized they were his orders.

The fund got a margin call and had to shift some funds between accounts to make margin, they had to contact regulators and inform them of the bug, and they had to manually trade their way out of the massive short position it had accumulated. However, they ended up making $60,000 that day off of his mistake.


This is such a cool story.


That's an excellent postmortem culture.


You should never blame the individual for organizational failures like this. I see two process issues:

1. The plug was allowed to be connected backwards. Either this should be impossible, or this hazard should be identified and more than one human should verify orientation.

2. In-use tools like multimeters should never be disconnected. At worst you get problems like this; at best you annoy whoever was using it.

Blaming individuals only gets them fired and weakens the entire organization. You just fired the one person who learned an expensive lesson.

The only time an individual should be blamed is when they intended harm, in which case the law could kick in.


You can't apply process thinking here, where the scenario is custom testing a unique probe, and you don't know what other constraints are in play (for example, the reason for the plug design). If NASA were sending these things to Mars by the dozen, then you can start to formalize things like test procedures and look for places mistakes can happen. But in this scenario, you're just disempowering your staff by not letting them choose the most effective and low-risk way to do one-off, highly specialized testing work.


I can't speak for NASA, but I can speak from my experience at ESA (European Space Agency), where I worked on Mars lander hardware. You have very, very formal procedures and detailed checks as soon as you approach any part which is going to fly.

The simplest task you can imagine takes incredible proportions (for good reasons).

Disconnect and reconnect that plug? Please inform persons X and Y, person Z must be present, only person W can touch that plug, and do perform a functional test according to the procedure in this document before and after and file these reports etc ...

Cleaning a part? Oh glob. Get ready for 3 months of adventure talking to planetary protection experts and book the cleanest room in the continent.


The Hacker News mic drop strikes again. I have nothing super substantive to add except to agree with your point, and to add that yes, it feels like work to put in the formal policies and procedures, but when the stakes are high enough (a rocket to Mars? it's high enough), even the work that doesn't intuitively feel 'worth it' to someone is DEFINITELY worth it.

"It's a waste of time" is very often a fallacy, especially when the risk cannot be easily undone.

I (mostly mentally) complete the phrase "It's a waste of time" with "what's the worst that could happen?", and when I'm actually saying the phrase out loud, stare at whoever said that for 5 full seconds.


Exactly :). The funny part is, the thing actually crashed! [1]

Why? Bad error handling in the software (primarily). What is the worst that could happen? An instrument saturates, a variable gets stuck at a value but keeps being integrated, and the spacecraft computes a negative altitude and thinks it's below ground level, while in fact it's in full descent, 3+ km above the surface. Oopsie!

[1] https://exploration.esa.int/web/mars/-/59176-exomars-2016-sc...


And we all know how reliable ESA landers are. The laughing stock of the industry.


Aerospace-grade connectors are specifically designed to support multiple keyings that prevent this kind of thing. It's definitely a problem preventable by careful design if the interface supports making this kind of mistake.


Can confirm. Source: I used to work for NASA, and I'm a private pilot. There are literally millions of electrical connections that get made on aircraft and spacecraft on a regular basis and I can't think of ever hearing of an incident caused by one of them being made backwards. (Now, mechanical connections getting made backwards is not unusual. That's why you check to make sure that the flight control surfaces move in the right direction as one of the last checklist items before you take off. Every. Single. Time.)


So how do you prevent them from grabbing the wrong break out box?

Like, say they have one that is set up to test the motor driver circuitry and another that is set up to test the motor?

Or say the breakout box intentionally has both sides of the connection on it, so that you can get in-between the driver and motor?


I can think of very few kinds of connectors for which this type of error is even possible. You would need two cable terminations which can connect to each other, for which either side can plug into the same jack.

So either the ends are literally the same (e.g. Anderson Powerpole), or there is some kind of weird symmetry or inadequate keying. Or maybe the two cables don't connect directly and instead go through some kind of interface? The latter is fairly common in networking, e.g. "feed-through" patch panels and keystone jacks and quite a few kinds of fiber optic connectors.

All of these seem like utterly terrible ideas in an application where you would take the thing apart after final assembly and where the person doing the disassembly or reassembly could possibly access the wrong side of the panel.


One guy in our workshop had to provide DC to a display with a round 4-pin connector. He soldered two neighboring pins to Gnd and the other two to Vcc. There were two chances to short the power supply, one to brick the display, and one to get it right. Guess what we had to replace until we found out.


A break out box could very sensibly have both sides of the connector on it and then have the various pins broken out into individual connections for flexibility.

In that case keying or whatever isn't going to prevent you from connecting to the wrong side, because both sides are present.


Looks like the author tried to double-power the motor with both the spacecraft motor driver and the breakout box that MITMs the driver and the motor. In such an event, the free-wheeling diode in the driver will allow reverse current to be fed back to the driver's power supply, up to a certain amount. This will absorb back-EMF, or energy from "regen" from the motor.

I'm suspecting the breakout wasn't literally sitting between the driver and the motor, but rather all internal connections are broken out to the box for testing; and likely the author's mistake was to not mess with the spacecraft to temporarily disconnect the driver.

But I'm not sure I would have "just" made the right call, and done so nonchalantly, on a Mars rover due to launch in a few weeks.


> flight control surfaces move in the right direction

How … how often does that go wrong?!


There have been several cases of the landing gear up/down lever getting wired backwards during maintenance. Not to worry, the gear has a 'squat switch' sensor that prevents the gear from being raised when the plane is on the ground. Unless you taxi over a bump and the switch decides it's now airborne. Crunch.


It depends on what you mean by "that". Getting control surfaces actually reversed is not very common, but it does happen, typically after maintenance when a mechanic inadvertently re-connects a control cable backwards.

Control cables also can and do break, but that too is fairly rare.

What is not rare is control mechanisms jamming. Here is an example:

https://www.ntsb.gov/news/press-releases/Pages/NR20230928A.a...


How not to check your flight surfaces: Air Astana 1388.


From the Wikipedia article of that flight:

"The incident was featured in season 23, episode 5 of the Canadian documentary series Mayday . . ." [1]

Season 23 - I'm glad I don't fly!

1. https://en.wikipedia.org/wiki/Air_Astana_Flight_1388


This is also the case for medical gas connectors in operating rooms, at least in Europe.


The one process element that can be controlled, and which jumped out from the first paragraphs, is not letting people touch billion-dollar equipment at the tail end of a 12-hour shift.

If you are putting people in a situation with absolutely no safeguards, you can’t have them go into it fatigued.

I’m guessing the people working on that team also weren’t getting great sleep by the discussion of high stress and long hours. Recipe for disaster.


> only gets them fired

Agree on the blame point, but not on the firing point. As a manager, sometimes you need to fire people; that's a necessary part of your job. And no, changing the hiring process cannot prevent that.


Firing people for incidental mistakes instead of overall bad performance is pretty shitty management.


For one incidental mistake, of course not. For repeated inattention (like plugging the Mars rover's cables in wrongly several times) at an attention-demanding job -- yes.


So you'd not blame them for their simple mistake, but still fire them for it?


Firing somebody for a simple mistake with grave consequences doesn't make your organization stronger. There will be plenty of better examples to make.


Anecdote:

At my first real job as a web dev after school, I crashed the production website on my very first day. Tens of thousands of visitors were affected, and all our sales leads stopped.

Thankfully, we were able to bring it back up within a few minutes, but it was still a harrowing ordeal. The entire team (and the CEO in the next room) was watching. It ended up fine and we laughed about it after some minor hazing :)

But by the time I left that job a couple years later, we had turned that fragile, unstable website into something with automatic testing, multiple levels of backups and failover systems across multiple data centers, along with detailed training and on-boarding for new devs. (This was in the early days of AWS, and production websites weren't just a one click deploy yet.)

That one experience led to me learning proper version control, dev environments, redis, sharding and clustering, VMs, Postgres and MySQL replication, wiki, monit, DNS, load balancers, reverse proxies, etc. All because I was so scared of ever crashing the website again.

That small company took a chance on me, a high school dropout with some WordPress experience, and paid me $15/hour to run their production website, lol. But they didn't fire me after I screwed up, and gave me the freedom and trust to learn on the job and improve their systems. I'm forever grateful to them!


Not in this case. It's a one-off very custom-built rover, the first of its name. There's already all kinds of processes established, but no one can foresee everything. Yes, they probably fixed the process after that, but remember that it was their first time.

PS: Also, more rules and better processes are not necessarily a good thing. Sometimes there is just too much red tape and bureaucracy, which makes the already super-slow NASA even slower. In those first-of-their-kind missions you sometimes need to take risks and depend on people, not processes.


I can't help but think about what would have happened if the rover had indeed been destroyed, though. It seems the only thing that stopped that was sheer luck: they could (I guess) just as easily have connected to another wrong lead, one without the protection required to survive the charge. That is, it was outside the author's actual ability to have stopped it, and he could just as easily have been the destroyer of the rover and forever remembered for that fact, as he feared he would be.


The fact that they were still being made to work after completing a 12-hour shift (which is already too long to be safe) means this was a process error.


This really resonates with my experience. Working at a major airline, I was the one who would pick the most difficult and risky projects. One was a quick implementation of a new payment provider for their website. That website sold millions of euros worth of tickets every day.

Seconds after deployment, it turned out that I had failed to recognize the differences between the test and live environments, as one of the crucial variables was blank in production. I could have anticipated this if I had spent more time preparing and reading documentation. Sales died completely, and my heart sank.

After a lengthy rollback procedure that resulted in a few hours without sales, a massive surge of angry customers, and a loss of several million euros, I approached the CEO of the company. I still remember catching him in an elevator. I explained that the incident was all my fault and that I had failed to properly analyse the environment. I assured him that I was ready to bear the full consequences, including being fired. He burst into laughter and said something like this: "Why would I want to get rid of you? You made a mistake that you'll never make again. You are better at your job than you were yesterday!"

This experience was formative for me on many levels, including what true leadership looks like. I have successfully completed many high-risk projects since then.


The language is just so anodyne, and there's just that bit of implausible detail in the story (approaching the CEO yourself when you're the one who fucked up, and the parent claiming to be a "top performer" and "I made my company lose millions" at the same time), that it makes me think this comment was written by an LLM, or is at least a fabrication.


The suspicious part for me would be the CEO laughing like it was nothing. Also, yes, one would expect it to go the other way around: when you mess up big, someone comes to you. But the world is big, and maybe it happened like this.


Surely this is a variation of this anecdote attributed to IBM's Watson

https://news.ycombinator.com/item?id=13419313

"> A young executive had made some bad decisions that cost the company several million dollars. He was summoned to Watson’s office, fully expecting to be dismissed. As he entered the office, the young executive said, “I suppose after that set of mistakes you will want to fire me.” Watson was said to have replied,

> “Not at all, young man, we have just spent a couple of million dollars educating you.” [1]"


A variant is in From the Earth to the Moon, where a junior engineer at Grumman confesses messing up vital calculations for the Lunar Lander to his boss, and finishes with "So… I guess I'll go clean out my desk." "What for?" "I figure you're gonna fire me now."

The boss's response makes a lot more sense than the usual fluff, though: "If I fire you now, the next guy to make a mistake won't admit it and we won't find out about it until it's too late."


I wonder how many of those stories are wishful thinking about how it should work, rather than how it does work when a major screw-up has happened and some heads need to roll for the sake of it.


I wonder how the boss would explain it to his bosses/shareholders. Was that a fully known possible outcome that merely surfaced by chance and was subsequently handled without issues under his leadership, or...?


Apollo was very much pushing the envelope of bleeding-edge technology. While the bosses were probably not too happy, it was far from the only occurrence, and it didn't threaten the contract.


Thanks, I knew I had heard that story somewhere before. (Though I would not rule out that recent CEOs have heard it and learned from that episode as well.)


In the context of “suspected AI” I at first thought you meant a different Watson!


Whether the story is real or not aside, why would you not laugh it off? At that stage nothing can be changed; the money was lost and the bug was fixed. You can only look forward and plan for the future, and the guy is going to be paranoid in future deployments to make sure not to fuck up again.


In general yes, and people with enough Zen can do this. But if the CEO is also looking ahead to explaining the incident to the board and the investors, he might not be in the mood to laugh.


Has it been edited? I don't see "top performer".


Also, airlines don't sell millions of dollars in tickets every day.


The quote was:

> That website sold millions of euros worth of tickets every day.

The claim wasn't that a single airline sold a million dollars per day, but that a third-party on-seller sold a million euros worth of tickets a day.

Is that plausible?

Consider The City in the Sky:

    Every day 100,000 flights criss-cross the globe with more than 1 million people in the air at any one time. Dallas Campbell and Dr Hannah Fry explore the world of aviation.
https://www.imdb.com/title/tt5820022/

At any instant there are one million people aloft.

At any instant there's at least 50 million dollars worth of ticket sales in play; how much over a 24-hour day would you estimate?

Is it possible for a single third party seller to capture a million euro per day?


A CEO at the office and not on the golf course... that gives it away.


That story has been repeated in some form for as long as I was alive.


You worried about that?

I'm a frequent flyer and I got a feeling that most airline ticket booking pages are broken in some way more than half the time. Maybe not often broken to the point that they're blank, but definitely broken to the point that booking a ticket isn't possible (I prefer blank, so that I don't waste like 30 minutes on not being able to book a ticket).

Also most of the internet seems often broken. Oh hello Nike webshop errors upon payment (on Black Friday) for which helpdesk's solution is: just use the App.


Hell, I used to worry about down time for my tiny blog. Didn't want to let down my readers.

Everything can be a guilt trip if you try hard enough.

Then I met a guy, now a good friend, that made me do my first "pull the plug migration" on his most important website. He lived on this.

I looked at the site going down, horrified. He mocked me, then proceeded with the update. It didn't work. The site stayed offline for hours.

Then it worked again. And nobody cared. It had zero consequences on traffic.

Users were pissed off for a few hours, and life goes on.


What always got me is that, for at least the first several years, Google couldn't get their store page to handle the load when they were releasing a new phone and it'd be crapping out for days.


Steam still craps out during large sales. I really wonder how Valve calculates that it's fine to keep losing out on (cumulatively) hours of sales each time.


Whoa, and I always wonder why it's only me who seems to have to use the developer tools to enable that stupid submit button when I've filled out every field on the page correctly, shaking my head and wondering how normal people use the internet. I keep thinking it's got to be something about using Firefox instead of a big-tech browser, or my mouse-gestures extension, I don't know, but normal webshops are broken so often it's insane. Thanks for sharing that it's not just me!


I think the loss may not have been as much as you think; sure, nobody could buy tickets for a few hours, so theoretically the company lost millions of revenue during that time. But that assumes people wouldn't just try again later. Downtime does not, in practice, translate to losses I think.

I mean, look at Twitter, which was famously down all the time back when it first launched due to its popularity and architecture. Did it mean people just stopped using Twitter? Some might have, but the vast majority and then some didn't.

Downtime isn't catastrophic or company-ending for online services. It may be for things in space or high-frequency trading software bankrupting the company, but that's why they have stricter checks and balances - in theory, in practice they're worse than most people's shitty CRUD webservices that were built with best practices learned from the space/HFT industries.


Even with HFT you’d have to have more than 50% of your trades go against you to lose any money, and you’ll probably have hedges, and losing some % of money will be within normal operation parameters. Shit happens! Links go down, hardware fails, bugs slip through no matter how diligent you are. (No I’m not looking to be hired by any HFT shops)


Coming from an airline boss, I would really have hoped the response would be more in line with the ethos of a plane-crash postmortem, i.e. find the systemic causes and fix those. Maybe you need a copilot when doing live deployments, and that copilot has the authority to stop the rollout. Along with the usual devops guards.


This reminds me of that old joke that ends "Why would I fire you? We just spent millions training you!".

People who take on high-risk projects are underappreciated. But many managers prefer employees who can reliably deliver zero value over those with positive expected value but non-zero variance.


That story sounds so much like that joke that I'm wondering if there is some urban legend thing going on here.


Wow chatgpt is actually getting worse.


There must be more people like you at those major airlines, as those sites go down all the damn time. 6 hours..??? The Lufthansa desktop site didn't allow anyone to book anything for like 3 weeks straight; you had to use the app instead.


Your company should definitely have had a production-identical staging environment if an hour of downtime means millions lost :D

Such an environment would be an obvious investment that pays for itself. I'm in banking, and terrified of making even a slightly complex deployment without validating it in production first. (Complex here meaning that it might depend not just on code changes, but also on the environment.)


And then everybody clapped.


i've seen this comment almost verbatim somewhere lol (when the gitlab dev erased their DB a few years ago)


I work in TV. During my first job at a small market station 30 years ago, I was training to be a tape operator for a newscast. All the tapes for the show were in a giant stack. There were four playback VTRs. My job was to load each tape and cue it to a 1-second preroll. When a tape played and it was time to eject that tape, it was _very_ easy to lose your place and hit the eject button on the VTR that was currently being played on the air instead of the one that they just finished with.

The fella who was training me did something very annoying, but it was effective: every time I went to hit the eject button, he would make a loud cautionary sound by sucking air through his closed teeth as if to tell me I was about to make a terrible mistake. I would hesitate, double check and triple check to make sure it was the right VTR, and then I would eject the tape. He made that sound every single time my finger went for the eject button. It really got on my nerves, but it was a very good way to condition me to be cautious.

Our station had a policy: the first time you eject a tape on the air got you a day off without pay; the second time put you on probation; the third time was dismissal. I had several co-workers lose their jobs and wreck the newscast due to their chronic carelessness. Thanks to my annoying trainer, I learned to check, check again, and check again. I never ejected a tape on the air. It certainly would not have been a half-billion dollar mistake if I had, but at that point in my career it would have felt like it to me.


That explains the old blooper reels that were popular on TV in the early 80's, where the reporter would be talking about something, and get video of something completely bonkers in the background instead.


Rolling the wrong tape still happens frequently enough on modern live broadcasts.


I agree that the person who made such a mistake will be the person who never makes that mistake again. That's why firing someone who has slipped up (in a technical way) and is clearly mortified is typically a bad move.

However, I don't agree that this is the "real" lesson.

Given the costs at play and the risk presented, the lesson is that if you have components that are tested with a big surge of power, give them custom test connectors that are incompatible with components that are liable to go up in smoke. That's the lesson. This isn't a little breadboard project they're dealing with, it's a vast project built by countless people in a government agency that has a reputation for formal procedures that are the source of great time, expense, and in some cases ridicule.

The "trust the 28 year old with the $500m robot that can go boom if they slip up" logic seems very peculiar.


Well, it's true that it should be designed such that they cannot be plugged incorrectly. I would imagine it is indeed mostly designed in that way, but there can still be erroneous configurations that were not accounted for at the design stage.

Especially during testing, you're often dealing with custom cables, connectors, and circuits that are different from the "normal configuration".

I would say that the lesson is to do as many critical operations as possible under the 4-eye principle: someone is doing the thing, someone else is checking each step before continuing. Very effective for catching "stupid mistakes" like the one in the article. But again, it is not always possible to have two people looking at one test, especially with timeline pressure etc. So mistakes like these do happen in the real world. You have to make the whole system robust.


> Well, it's true that it should be designed such that they cannot be plugged incorrectly

I agree with you, but on Earth this is easy. For spacecraft I imagine you can't just use any connector from Digikey

> especially with timeline pressure etc.

If timeline pressure, lost sleep, or rushing jobs not meant to be rushed causes a catastrophic technical error to be made, it is 100% the fault of the person who imposed the timeline, whether that be some middle manager, vice president, board, investor, or whoever. Emphatically NOT the engineer who did the work, if they do good work when not under time pressure.

HOLD PEOPLE LIABLE for rushing engineers and technicians to do jobs that require patience and time to do right.


I agree that individuals shouldn't be held responsible for mistakes like this.

However, you can't always eliminate timeline pressure. Even if the project is planned and executed perfectly, there will almost always be unknown unknowns encountered along the way that can push your timeline back. As is the case with sending things to Mars there is a window every two years. That's a very real, non-fictitious deadline that can't be worked around.


> As is the case with sending things to Mars there is a window every two years.

This is very simple to deal with.

(a) If it's unmanned, rush and launch on-time but don't fault the engineer for mistakes made by rushing. If it doesn't work everyone accept that as a consequence of rushing.

(b) If it's manned, wait until the next launch window and prioritize safety. Period.


That was my first thought as well.

On the other hand, it's hard to make these kinds of judgment calls when you're talking about a one-off piece of equipment that's only going to go through this particular testing cycle a single time.

In computing, there are a lot of similar "one-off" operations: something you do to the prod database or router config a single time as part of an upgrade or migration.

Sometimes building a safeguard is more effort than just paying attention in the first place. And while we don't always perfectly pay attention, we also don't always perfectly build safeguards, and wind up making similar mistakes because we're trusting the faulty safeguard.

In circumstances like the one in the story, the best approach might almost be the hardware equivalent of pair programming -- the author should have had a partner solely responsible for verifying everything he did was correct. (Not just an assistant like Mary who's helping, where they're splitting responsibilities -- no, somebody whose sole job is to follow along and verify.)


"One-off" is never just a one-off; it's always part of a class of activity, such as server migrations. Just paying attention guarantees eventual failure when repeated enough times.

This may be acceptable, but it comes down to managing risks. If failure means the company dies, then taking a 1-in-10,000 risk to save 3 hours of work probably isn't worth it. If failure means an extra 100 hours of work and 10k in lost revenue, then sure, take that 1-in-10,000 risk; it's a reasonable trade-off.


On careful reading, the power was sent into the power out-leads of an H-bridge, which is a tough piece of electronics, and in the end nothing was damaged; the shutdown was unrelated. If it had been sent to the data line of the motor controller, it probably would have poofed something. We can't rule out that there were different connector types, but the two mistaken connectors were correctly assigned the same type.


When men first landed on the Moon, the average age of the engineers at Mission Control was 28 years old.


I work in this industry and let me explain how this happens. Despite being such a costly project, you can’t really hard-require unique connectors everywhere because of all of the competing requirements. Actually, connectors in particular tend to have a lot of conservative requirements such as being previously qualified, certain deratings, pin spacing, grounded back shells, etc. At the end of the day there’s only a handful of connector series used and stocked and it’s not feasible (at any cost really) to have no matching connectors whatsoever. Of course, you would normally try and make connectors either standardized with the same signals, or unique with no overlap in between.

I don’t know the details in this case but it could be like this: socket-type connectors are required on external connectors on the spacecraft (to prevent shorts when handling), with a harness in between which will never be removed. The harness would be symmetrical with pin-type connectors.

At some point it is decided a breakout box is required for testing and now you have created an opportunity to plug the breakout box in backwards.

Or the breakout box has a 100-pin connector on one side and needs to connect to 25 pieces of test equipment on the other side. You probably don't have 25 different connectors to choose from, nor can you possibly demand custom requirements for every piece of test equipment.

Spacecraft are moving more towards local microcontrollers with local diagnostics so this kind of test equipment for every possible analogue signal is decreasing. In the case of motors, they would more likely be brushless now and you would rely on telemetry from motor drivers during both testing and flight instead of having this type of breakout box.

Connectors in aerospace are also following other industries and becoming more configurable at order time, including adding keys so you can have 10x “the same” connector but keyed so they only plug in one place. But it’s still not practical to demand all test equipment is configured like this.


| The "trust the 28 year old with the $500m robot that can go boom if they slip up" logic seems very peculiar.

Not just that, but to create a situation whereby said person is working unofficial double shifts to get it done, so probably aren't going to be bringing their best selves into the office. If it were my $500 million I wouldn't even care about the name of this guy but would want to have some very robust discussions with the head of their department. Also, "some mistakes feel worse than death" - I get it, but c'mon, it's not like someone actually did die, which is a sadly unfortunate reality of other much less spectacular and blog-worthy mistakes.


> The "trust the 28 year old

(Same for an 82 year old or any other number..)


I'll add my story here for posterity:

My first job out of university, I was working for a content marketing startup whose tech stack involved PHP and PerconaDB (MySQL). I was relatively inexperienced with PHP but had the false confidence of a new grad (I didn't get a job for 6 months after graduating, so I was desperate to impress).

I was tasked with updating some feature flags that would turn on a new feature for all clients, except for those that explicitly wanted it off. These flags were stored in the database as integers (specifically values 4 and 5) in an array (as a string).

I decided to use the PHP function [array_reverse](https://www.php.net/manual/en/function.array-reverse.php) to achieve the necessary goal. However, what I didn't know (and didn't read up on in the documentation) is that, without the 2nd argument, it only reverses the values, not the keys. This corrupted the database with the exact opposite of what was needed (somehow this went through QA just fine).

I found out about this hours later (used to commute ~3 hrs each way) and by that time, the senior leadership was involved (small startup). It was an easy fix - just a reverse script - but it highlighted many issues (QA, DB Backups not working etc.)

I distinctly remember (and appreciate) that the lead architect came up to me the next day and told me that it was a rite of passage of working with PHP, a mistake that he too had made early in his career.

I ended up being fired (I grew as an engineer and was better off for it), but in that moment, and for weeks after, it definitely demoralized me.


They fired you for that array_reverse mistake?


Yep, I was put on a PIP with impossible success criteria (no issues raised in PRs by senior engineers and no issues in code deployed to production, even if it was reviewed by senior engineers & QA) and fired (for failing those criteria) in 2 weeks.

I worked there for ~8 months in total.


> no issues raised in PRs by senior engineers

Wat? Like serious issues, or minor things that can be improved? Because it's very rare in my place of work that there are no comments on a 'PR'. Something can always be improved.


+1 on this, every place or project I have touched has a backlog of tens if not hundreds of nice to haves but never enough time to touch them, and some of them are really not complicated.


The impossible PIP trick for dismissal is something I’d love to see get eventually legally obliterated.


Constructive dismissal is illegal in many countries. It's the choice of the people which system they want to work in.


It’s illegal where I am, but employers are extremely capable of abusing “performance improvement plans” as a way to constructively dismiss people - knowing most people won’t have the wherewithal to fight it in court.


Sometimes the only winning move is not to play.


The story is compellingly written, but I thought it was also confusing.

It sounds as if this team made several mistakes, not just one mistake. It's also not clear if the result of these mistakes was that there might be real damage to the spacecraft, or if the result was just wasted time and hours of confusion about why the spacecraft wouldn't start up.

The first mistake is they didn't realize that the multimeter was not only measuring, but it was also completing the circuit.

That sounds like a real bad idea. But if it was totally necessary to arrange it like that, then that multimeter should never have been touched.

That's not just one guy's error. It's at least two guys at fault, along with whoever is managing them, and whoever is in charge of the system that allows it.

The second mistake is with the break-out box. They think he misdirected power into the spacecraft. Then they jump to the conclusion that this generated a power surge which damaged the spacecraft, because it won't start up.

But they're not sure where the power surge went and what might be damaged. Anyhow they're wrong.

The reason the spacecraft won't start up is just because he took the multimeter out of the circuit before the accident.

I'm still sort of confused about what happened or if they ever really figured out what happened.

He said "Weeks of analyses followed on the RAT-Revolve motor H-bridge channel leading to detailed discussions of possible thin-film demetallization".

Does this mean that they decided that the misdirected power surge might have flowed into the RAT-Revolve motor H-bridge channel and damaged that?


You forgot: the telemetry guy (Leo) didn't mention they had lost telemetry before the storyteller told him he had made a mistake. I mean, shouldn't they have cancelled all testing until they got it back?


It started up fine. The multimeter was connecting up the telemetry, so they weren't getting any information from it until they restored that circuit.

The power absolutely did feed into that circuit; they were trying to decide if it would have damaged it (but a motor driver is going to be able to handle power coming from the motor, so they decided that it probably didn't).


That is my reading too. But why was the multimeter connecting up telemetry? That seems very strange to me.


According to the article, it was monitoring bus voltage, but I could imagine it was used to measure the current being used by the telemetry system. So if it was used as an ammeter, it would've been placed in series.


I assume it was wired up in series to measure current.


> It started up fine.

Thanks. I understand that better now. The spacecraft did start up, but it seemed as if it could be badly damaged because they were not receiving any telemetry data


I am a Mechanical/Aerospace engineer.... I wish my scariest stories 'only' involved a potential bricking of a main computer on an unmanned $500M rover.

No... I was the senior safety-crit signoff on things carrying human lives. I had to look over pictures of parts broken from a crash and have the potential feeling of 'what-if that's my calculation gone wrong'. My joint that slipped. My inappropriate test procedure involving accelerated fatigue life prediction, or stress corrosion cracking. My rushing of putting parts into production processes that didn't catch something before it went out the door.

It's interesting to read people's failure stories from similar fields but, to me, the ones that people so openly write about and get shared here on HN always come across as... well, workplace induced PTSD is not a competition. It's just therapy bills for some of us more than for others.


That reminds me of a fiction-quote by one of my favorite authors, where a welding-instructor has just finished sharing a (somewhat literal) post-mortem anecdote of falsified safety inspections.

> He gathered his breath. “This is the most important thing I will ever say to you. The human mind is the ultimate testing device. You can take all the notes you want on the technical data, anything you forget you can look up again, but this must be engraved on your hearts in letters of fire.

> “There is nothing, nothing, nothing more important to me in the men and women I train than their absolute personal integrity. Whether you function as welders or inspectors, the laws of physics are implacable lie-detectors. You may fool men. You will never fool the metal. That’s all.”

> He let his breath out, and regained his good humor, looking around. The quaddie students were taking it with proper seriousness, good, no class cut-ups making sick jokes in the back row. In fact, they were looking rather shocked, staring at him with terrified awe.

-- Falling Free by Lois McMaster Bujold


Aside: I recognised 'quaddies' from your quote ... there was some very distinctive cover art on the Analog magazine for that story: https://www.abebooks.co.uk/Analog-Science-Fact-Fiction-Febru...


Regarding your last paragraph, I thought the same... When the author wrote:

> I'm instantly transported back to that moment — the room, the lighting, the chair I was in, the table, the pit in my stomach, ...

I couldn't help but think "that sounds like a trauma reaction". Good on them to be able to use that energy to do better! But also not everyone reacts the same way to trauma, nor is it easy to compare such reactions to trauma (for example, as a hiring question). I feel there are too many social variables at play.


This was the triggering phrase in the original article for me, yep.

I also want to clarify that I've worked in both aerospace and automotive, and the mention of the word 'crash' in my above comment was referring to work I did in automotive, lest someone tries to start wondering 'which one' with regards to an airframe.

For me, it was the reaction to the stress of having to make sure I was delivering... and the idea that those things are out there. I mean... put it this way: I've worked on enough vehicles that a majority of HN readers will have ridden in something utilizing math that I did, or parts that I specified, drew, and released, on a road, at least once in the last 15 years.

I once had potential employers ask that 'how would you respond to this kind of stressful situation' question before, and I've actually had difficulty getting my answers across because the really stressful shit I can't even talk about without potentially triggering just a horrible social reaction. Or panic attacks. Or potential legal issues.


> I had to look over pictures of parts broken from a crash and have the potential feeling of 'what-if that's my calculation gone wrong'.

Does it inevitably come down to that for someone? I mean, even if it's a detail that a procedure couldn't have caught, someone is responsible for forming good procedures. I suppose there could be several factors. But it seems like ultimately someone is going to be pretty directly responsible.

Just interesting to think about in the context of software engineering and kinda even society at large where an individual’s mistakes tend to get attributed to the group.


"But it seems like ultimately someone is going to be pretty directly responsible."

Or many people, or no one directly. Space missions come with calculated risk. So someone calculates that the risk of this critical part breaking is 0.5%, then someone higher up says that is acceptable and all move on - and then this part indeed breaks and people die.

Who is to blame, when the calculation was indeed correct, but a 0.5% chance can still come up (and 0.5% would be a lot)? And economic pressures are real, just like the limits of physics.

See Murphy's Law: "Anything that can go wrong will go wrong." (Eventually, if done again and again.)

https://en.m.wikipedia.org/wiki/Murphy's_law

Astronauts know there is a risk with every mission; so do the engineers, and so does management. Still, I cannot imagine why anyone thought it was an acceptable risk to use a 100% oxygen atmosphere with Apollo 1, where 3 astronauts died in a fire. But that incident indeed changed a lot regarding safety procedures and thinking about safety. Still, some risks remain and you have to live with that.

I am quite happy though, that in my line of work, the worst that can happen is a browser crash.


They had reasons. See: https://en.wikipedia.org/wiki/Apollo_1#Choice_of_pure_oxygen...

Even after the fire, the Apollo spacecraft still used 100% oxygen when in space. The cabin was 60% oxygen / 40% nitrogen at 14.7 psi at launch, reducing to 5 psi on ascent by venting, with the nitrogen then being purged and replaced with 100% oxygen.

> See Murpheys Law...

Indeed. I hope that was a joke.


> Still, I cannot imagine why anyone thought it was an acceptable risk, to use a 100% oxygen atmosphere with Apollo 1

Especially when, in prior experience there, asbestos had caught fire in the same situation (O2, low pressure).


Wow, I did not know this detail yet. It was just a reckless rush to the moon at the time, no matter the cost. Without the deaths, nothing would probably have changed.


I read this in a book some years after the Apollo 1 (i.e. Apollo 204) fire, so I have no reason to doubt its provenance.


For what it's worth, I strongly disagree -- the group as a whole (and especially its leadership) is responsible for the policies they decide to institute, and the incentives they allow to exist. For example, in this article's story the author is apparently working >80 hour weeks directly manipulating the $500M spacecraft two weeks before it launches. Do we really think they are "directly responsible" for the described mistake? I think a root cause analysis that placed responsibility on any individual's actions would simply be incorrect -- and worse, would be entirely unconstructive at actually preventing reoccurrence of similar accidents.

I think this is furthermore almost always true of RCAs, which is why blameless post-mortems exist. It's not just to avoid hurting someone's feelings.


I’d be interested in talking to the team at Boeing.


I sincerely hope there are more people like you in the aerospace industry and less like those who conceived, implemented and signed off the 737 MAX's MCAS at Boeing...


I'm surprised that group therapy for engineers doesn't exist, or maybe I just can't find it. I don't work in anything high-stakes myself, though I have often been an ear for those who do in aviation or rail.

I believe that it can be quite hard sometimes if you have empathy and don't take the "Once the rockets are up, who cares where they come down? That's not my department" approach.

Perhaps some peer support group (possibly facilitated) for people that build safety critical systems or deal with the fallout. Not all companies will provide good counseling etc.

Perhaps the engineering boards / chartered engineer organizations should provide this and fund it from their membership fee, though that would probably scare people off going to the service as they could be afraid of losing their stamp / chartered engineer status / license.

Perhaps this would be dealt with in the past by getting drunk with colleagues in the pub, though alcoholism (or being impaired at work the next morning) is bad and pubs etc. are less popular now.


Profession induced trauma is finally getting taken seriously in the medical setting, another high stakes field.

Group therapy for doctors and nurses is finally becoming a thing, but unfortunately it is completely dependent on being employed by an organization that cares about it.


And the engineers designing safety mechanisms for nuclear weapons probably think you have it easy.


Yep, that's fair!

But my understanding is the default behaviour of most nuclear weapons (other than Hiroshima-style ones) is "blows itself to pieces without detonating the nuclear part", rather than "vapourises everyone within a mile".

Everything needs to go right for a nuclear weapon to actually blow up with a significant yield.


Yeah, luckily they seem to have a pretty well sorted failure mode.


Reminds me of the NOAA-N Prime satellite that fell over, because there weren't enough bolts holding it to the test stand.

The root cause, and someone correct me if this is not accurate, was that the X-ray-tested bolts to hold it down were so expensive that they had been "borrowed" for use on another project and not returned, so that when the time came to flip the satellite into a horizontal position, it fell to the floor. Repairs cost $135M.

https://en.m.wikipedia.org/wiki/NOAA-19


And so when "Check for bolts" is added to the flip procedure errata/addendum, is light sarcasm called for?


Interesting to see that the worry could have been avoided if they had lined up their timelines better in the first place. If they'd compared the timestamp on the test readout to the last timestamp from the telemetry system, they'd have seen that the telemetry failed BEFORE the test was executed. Partially caused by using imprecise language "we seem to have lost all spacecraft telemetry just a bit ago" rather than an accurate timestamp.

A cautionary lesson in properly checking how exactly events are connected during an incident. Easy to look at two separate signals and assume they must be causal in a particular direction, when in reality it is the other way around.
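To make that concrete, here is a toy sketch of the ordering check in Go (the event names and timestamps are made up purely for illustration; the point is only that you compare timestamps before assuming a causal direction):

    package main

    import (
        "fmt"
        "time"
    )

    // event is a hypothetical log entry: what happened, and when.
    type event struct {
        name string
        at   time.Time
    }

    func main() {
        // Invented timestamps; only the ordering check matters.
        telemetryLost := event{"last telemetry frame received", time.Date(2003, 5, 28, 21, 3, 12, 0, time.UTC)}
        testStarted := event{"RAT motor test commanded", time.Date(2003, 5, 28, 21, 7, 45, 0, time.UTC)}

        // Before assuming "the test broke telemetry", check which came first.
        if telemetryLost.at.Before(testStarted.at) {
            fmt.Printf("%q precedes %q by %v: the test cannot have caused the loss\n",
                telemetryLost.name, testStarted.name, testStarted.at.Sub(telemetryLost.at))
        } else {
            fmt.Println("telemetry was lost after the test started; causality is at least plausible")
        }
    }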


It's an interesting story, but the author may be overselling it. It's not a _failure_ story, nor was it a $500M mistake. I get that it was really stressful and the mistake could have cost him the job, but it didn't; it also didn't cost NASA anything other than a few hours of work (which, during testing, I would guess is expected).

When I'm asked to share failures, I'm usually not thinking about "that one time when I almost screwed up but everything was fine", instead, I'm thinking of when I actually did damage to the business and had to fix it somehow.


“My quasi-$500M Mars rover mistake” just doesn’t have quite the same ring to it.

But, to your point, it is still a failure story that _could_ have led to a much worse outcome than it did. The fact that it didn't was mostly due to luck.


One thing about long aerospace missions like this with huge lead times that always gets me - you can spend years of your life working on a mission, only for it all to fail with potentially years until you can try again.

This is a refreshingly humanizing article, but is also one written from the perspective of a survivor. Imagine if the rover were actually lost. I asked the question "what would you do if the mission failed after all of this work? How could you cope?" to the folks at (now bankrupt) Masten Aerospace during a job interview, and maybe it was a bad time to ask such a question, but I didn't get the sense they knew either. "The best thing we can do is learn from failure," one of them told me. An excellent thing to do, but not exactly what I asked. This to me stands out as the defining personal risk of caring about your job and working in aerospace. Get too invested, and you may literally see your life's work go up in flames.


> you may literally see your life's work go up in flames.

Incidentally, this happened to Lewicki a few years later when Planetary Resources' first satellite blew up on an Antares rocket: https://www.geekwire.com/2014/rocket-carrying-planetary-reso...


Did they have a narrow launch window they couldn't afford to miss? I'm not talking about missions where you eat a big monetary loss on the launchpad and try again, I mean missions which rely on planetary alignments that may not happen again for years, or even the rest of your life, such as Voyager. Or even just missions where you launch successfully, but then after months (or years) of flight time the spacecraft is lost.


> "The best thing we can do is learn from failure," one of them told me.

I would argue that if we don't change the process to prevent this kind of catastrophic failure mode then we really haven't learned from the failure.


> And I still remember the shock when Project Manager Pete delivered the decision and the follow-on news: ‘These tests will continue. And Chris will continue to lead them as we have paid for his education. He’s the last person on Earth who would make this mistake again.’

I wonder whether Pete had followed this 1989 general aviation/accident analysis story:

> When he returned to the airfield Bob Hoover walked over to the man who had nearly caused his death and, according to the California Fullerton News-Tribune, said: "There isn’t a man alive who hasn’t made a mistake. But I’m positive you’ll never make this mistake again. That’s why I want to make sure that you’re the only one to refuel my plane tomorrow. I won’t let anyone else on the field touch it."

-- https://www.squawkpoint.com/2014/01/criticism/

(The incident above led to the creation and eventual mandated use of a new safety nozzle for refueling, which seems like a better long-term solution than having the people who've nearly killed you nearby to fuel your plane indefinitely: https://en.wikipedia.org/wiki/Bob_Hoover#Hoover_nozzle_and_H...)


If there is the possibility of making a mistake, somebody will certainly make it. You expect all the humans involved to be competent. But relying on that competence is a mistake. The emotional stress of dealing with such enormous responsibilities, the often long work hours and the long list of procedures will make any competent professional inadvertently slip up at some point.

In case of electrical connectors, the connectors are often grouped together in such a way as to avoid making wrong connections. Connectors with different sizes, keying, gender, etc are chosen to make this happen. This precaution is taken at design time. JPL is extremely experienced in these matters. There is probably something else left unsaid, that led to this mistake being possible.

Meanwhile, motor controllers built around H-bridges are never boring. I once saw a motor control fail so spectacularly that we were scratching our heads for days afterwards. As always, a failure is never due to a single cause (thanks to careful design and redundancies). It's a chain of seemingly innocuous events with a disastrous final outcome. But the chain was so mind-bending that we had to write it down just to remember how it happened. Recently, I was watching a show about the Chernobyl nuclear disaster and was reminded of this failure. Our failure was nowhere near as disastrous - but the initial mistakes, the control system instability, the human intervention and the ultimate failure propagation were very similar in nature. Needless to say, it sent us back to the drawing board for a complete redesign. The robustness of the final design taught me the same lesson - failures are something you take advantage of.


Are the electronics in these rovers really so bespoke that they don't have multiple copies of each electronic component warehoused on-site?

I'd expect that the rover body itself would be bespoke this late in the process (although a parallel test vehicle would be useful - do they have that?).

But in case someone fried the rover's electronics I'd think tearing it apart and replacing them while maintaining the chassis should be doable in 2 weeks, but what do I know?


They almost certainly had flight spares, but with two weeks until your launch window there is zero chance you are de-integrating multiple systems, swapping in the spare, re-integrating, and re-running your acceptance test campaigns. And that is assuming that they only damaged a subsystem. Back-powering the entire spacecraft could have wrecked your power system and anything connected to it. You'd have to disposition every part of the system that was touched. It's much more involved than just swapping in the spare and sending it.


The implied context here is that you'd forgo the usual tests, because the alternative is to send nothing to Mars.

According to Wikipedia they could have stretched those 2 weeks to around 3 weeks, but after that they'd have missed the launch window.

The usual processes are there to have a near-certainty of a working rover, but under these circumstances I'd think they'd just YOLO it and hope for the best.

But that assumes they've got spare electrical components, or alternatively a better use for the booster sitting on the pad than such an improvised mission.


Spirit/Opportunity had the SSTB1 test rover, which supposedly had a complete set of scientific instruments. If it was fully qualified and tested, swapping it out could have been as easy as dropping it in the lander and writing a different serial number in the paperwork.

(I really doubt it was fully tested. But why else have a flight spare vehicle?)


I've worked on satellites, and yeah - everything is super bespoke, very low quantity, very expensive. There is probably a qualification unit, or a flight spare that may be available for many subsystems, but maybe not. Integration is a long and complicated process. Pulling apart this bot, with however many fasteners, joints, etc, and then reassembling it correctly would be a decidedly non-trivial project that could easily take a month or two, given that testing has to be performed at each step along the way to ensure that every sensor and actuator is fully functional throughout the integration process. This style of traditional aerospace assembly / integration is not particularly efficient. The only reason it is done this way is that for these kinds of missions you only get one shot, and total cost is ridiculously high so everything must be done correctly.


Any idea why they would use brushed motors? When every gram counts I would think ditching the mechanical commutator would be a no-brainer, but maybe adding another leg to the H-bridge is a bigger penalty?


In 2003, small brushless DC motors were far less mature and available than they are now, particularly for low-speed/high-torque applications*. Brushless controllers are much more complicated than brushed controllers, particularly on the control software front, so sticking with simpler and more reliable brushed controllers for a space application makes sense (remember, it probably needed to be radiation hardened - doing that for an H-bridge is much easier than a BLDC controller).

*A notable example of this is in the world of RC cars, where rock-crawlers only very recently have started switching to brushless motors using field-oriented control to deliver acceptable very-low-speed behavior. Until FOC controllers became available, brushed motors offered much better low-speed handling.


Without disagreeing with your point, would availability be an issue in this case? They need one or two, have an enormous budget, and if the technology exists can make their own.


Availability is often strongly correlated with technical maturity. Small brushless motors with FOC didn't become widely available and mature until really the late 2010s. Arguably the foundation of nearly all of DJI's product lines is due to their early mastery of small brushless motor control (drones, gimbals, lens controls, robots, etc), and that's a company founded in 2006, well after the events of the article.

You can get good control of brushed motors with just a couple of transistors. Good brushless control means FOC, which really requires a fairly capable microcontroller in addition to all the power electronics for variable-frequency drive. While brushed motors certainly have limitations, those were quite well understood by the early 2000s (to the point here that assessing whether or not damage had occurred was "just" a question of "have these few transistors suffered from voltage applied in an unintended manner"). Brushless motors involve way more components with way more integration required to make them small. Far more complexity and potential failure modes need to be understood.


> Availability is often strongly correlated with technical maturity.

I see what you mean. Yes, agreed.


Closed-loop control of brushless motors is just more complex: in addition to needing 3-phase AC output, you also need either hall sensors or an encoder of some kind to be able to start the motor smoothly, and you need a dedicated IC or MCU for each motor to manage commutation and read the sensors.

I don't think FOC-type controllers were anywhere near common back then either, and FOC is what's needed to run a brushless motor smoothly.

There is just so much more that can go wrong with a brushless setup, vs brushed where you just apply power and that's it.
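To give a feel for the gap: even the simplest sensored six-step (trapezoidal) scheme needs commutation logic running on that MCU, whereas a brushed motor only needs a duty cycle on the H-bridge. Below is a rough sketch in Go of that logic only; the hall-state-to-phase table is just one common convention (not from any datasheet), and the real mapping depends on the motor's hall alignment.

    package motor

    // phaseDrive names which two of the three phases to energize.
    type phaseDrive struct {
        high byte // phase driven high: 'A', 'B' or 'C'
        low  byte // phase pulled low
    }

    // commutation is indexed by the 3-bit hall code (1..6); 0 and 7 are invalid.
    // This particular table is only an illustrative convention.
    var commutation = [8]phaseDrive{
        1: {'A', 'B'},
        2: {'C', 'A'},
        3: {'C', 'B'},
        4: {'B', 'C'},
        5: {'A', 'C'},
        6: {'B', 'A'},
    }

    // step picks the phases to energize for a hall reading, or reports a fault.
    // A brushed motor needs none of this: the commutator does it mechanically.
    func step(hall uint8) (phaseDrive, bool) {
        if hall == 0 || hall == 7 {
            return phaseDrive{}, false // invalid reading: stop driving
        }
        return commutation[hall], true
    }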


I'd guess it's mostly because it was 2003 and decent BLDCs were not super common-place yet. There were some older forms (steppers, PMSMs, etc) but they generally didn't have very good torque/weight performance. Brushed motors would probably have been the answer at the time.


Related: NASA's monster rocket costs several times more than SpaceX's monster rocket.


Your average installed Mars probe widget is not a fungible component; it's been integration tested six ways to Sunday and is certified in its current configuration, as is the assembly of the whole thing, center of gravity, cleanroom status, the torque on every fastener and so on. Even if it's an off the shelf component, it may not be possible to replace it without a chain of sign-offs that would require months of work.


Of course they are bespoke. There are no COTS Mars rovers for sale.


To be fair; there may be COTS Mars rovers for sale (possibly even cheap), they're just not certified or engineered specifically for Mars. Even an RC car might last long enough for a mission if it was free to get there.


Sometimes I wonder whether something like the SpaceX approach would work for these sorts of mission: develop a way to cheaply and reproducibly build the mission hardware, then iterate on it until it works.


Typically your best bet is going with a provider that already produces whatever item, isn't bespoke to that particular program, and has an assembly line of sorts already set up. However the other problem is they are often built in low numbers to begin with so getting hold of a flight qualified unit will most likely be an issue as well. Also everything is tightly packed together which usually makes replacing something involve messing with a bunch of other items.


Yes, and since they're such low-volume, they are super expensive. I've had to fight tooth and nail just to get more than one set of test hardware.


My understanding is that these electronics need to be radiation hardened (e.g. RADHARD) to prevent them from misbehaving in ways far worse than the Belgian bit-flipping incident (https://radiolab.org/podcast/bit-flip). If you want to use commercial off-the-shelf (COTS) parts, you need to install them in triplicate and have them "vote". (https://apps.dtic.mil/sti/citations/ADA391766)
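For the curious, the "vote" part is conceptually tiny: take three redundant copies and return the bitwise two-out-of-three majority, so a single upset bit in any one copy gets outvoted. A minimal sketch in Go (in reality this is done in rad-tolerant hardware or FPGA fabric, not in application code):

    package main

    import "fmt"

    // voteUint32 returns the bitwise 2-of-3 majority of three redundant copies.
    // Any bit flipped in a single copy is outvoted by the other two.
    func voteUint32(a, b, c uint32) uint32 {
        return (a & b) | (a & c) | (b & c)
    }

    func main() {
        a := uint32(0xCAFEBABE)
        b := a ^ (1 << 7) // simulate a single-event upset in one copy
        c := a
        v := voteUint32(a, b, c)
        fmt.Printf("voted value: %#x (matches original: %v)\n", v, v == a)
    }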


Any test that could cause a fatal destructive error should be risk-assessed, with a suitable protocol approved and four-eyes approval on a final checklist before going hot on the electrics. The issue here is poor project governance, not human error.


Yeah, the borrowed multimeter really hammers home how silly this all was. You don't touch other people's lab equipment unless the other ends of the wires are hanging free. A finger pointing to a meter needs to be followed up with a clear confirmation that the wires can be disconnected and that no special care is needed in the process. If I need something that's connected, I always ask the person to disconnect it for me. Definitely a process/culture problem.


$500M? Pocket change...

THE LITTLE VAX THAT COULD https://userpages.umbc.edu/~rostamia/misc/vax.html


I am as old as the hills. How have I never heard this story before? Thank you.


I cost my company about 5 times my yearly salary once, long ago. I sampled an enormous amount of seismic data at 2ms instead of the proper 4ms. This was back when we rented our mainframes from IBM for a pretty penny. The job ran for the entire weekend and Monday morning I was summoned by management, informed of my error, and asked, "You won't ever do that again will you?" and sent back to work.

Knowing that you are allowed to fail, but are very much expected/required to learn from your failure, makes for rather a good employee, in my experience.


> I was into my unofficial second shift having already logged 12 hours that Wednesday. Long workdays are a nominal scenario for the assembly and test phase.

Although the time pressure that comes with the upcoming deadline is understandable, perhaps the bigger lesson here is that when you are possibly sleep-deprived, and have already pulled too long a shift, you are bound to make avoidable mistakes. And that is the last thing you want on a $500M mission with a limited flight window.


Yep. Beyond the technical issue this story shows a people management issue.


Reminds me of a quote attributed to Thomas J Watson:

Recently, I was asked if I was going to fire an employee who made a mistake that cost the company $600,000. No, I replied, I just spent $600,000 training him. Why would I want somebody else to hire his experience?


Well, be prepared to spend $600k more next month for the next "training" session.


The point is that they won't make the same mistake again


I guess the idea is that if the problem was incompetence, they will find some other way to mess up, as opposed to a genuine mistake made by a competent person.


> I guess the idea is that if the problem was incompetence, they will find some other way to mess up, as opposed to a genuine mistake made by a competent person.

Are you suggesting that a competent person never messes up/makes a mistake?

The most fundamental part of life is learning from mistakes, and today even AI is starting to do this. Mistakes and evolution are what _make us_ human and living.


Competent people make fewer mistakes because they’re careful and… competent.


Right, but we can all agree that they still make them, and sometimes big ones. Unless they are so careful that they severely limit both their potential and their contribution to the group. Or they are the rare person who happens to be both "perfect" and lucky.


Okay. But don't start an insurance company with that attitude.


I used to work in a rural manufacturing plant. I once totaled my car on the way to town. Our shop foreman was also a volunteer firefighter and was the first one on the scene. I survived with a bloody nose and shattered confidence. The next day at work we needed some parts from town. I didn't have a car anymore, but that same shop foreman lent me his souped-up pickup truck. I was totally confused. He wisely said:

"Today is the safest day to let you drive my truck, cause I know you'll be extra careful"


You'll be extra careful but I don't know about extra safe. I had a few close calls while driving and it made me a less confident driver which I don't think did anyone any favors.


Ohhhh I hate things that you “just have to not screw up.” A fiddly manual process with so many possible ways to screw something up is almost guaranteed to see a catastrophic failure.

If this really manual fiddly process was really the only way they could test the motors, I’d say that’s a big failure on the design engineer’s part.


Or it just needs a budget that can dramatically increase and/or a deadline that can keep being pushed back to allow for corrections. As exhibit A, I'd like to present the James Webb Space Telescope.


Test requirements often arise after system design is complete. Engineers need time to examine the system, theorize failure modes, and design the tests. Also time to test the system, find some unexpected failure mode, and then design yet more tests.


Would it be so hard to add a diagnostic interface? That's all we're talking about here. It seems like you're making the problem more difficult than it is. And they'd have had to know ahead of time if they wanted to test voltage curves of motors.

Your answer is good in the general case, but for the anecdote, the design was clearly bad.


Off-topic but I never realized how much kapton tape is used to put these things together, until I saw these internal 'guts' photos.


As a spacecraft integration engineer I can confirm that most spacecraft are about 80% kapton tape.


From reading the developer anecdotes in here I think it’s worth mentioning that if just one person can bring down the whole enterprise, a hacker only needs a point of entry to do the same.

For our databases we have separate credentials, compartmentalized access and disallowed "dangerous" commands. This now seems obvious, but we only got there years in. Thankfully, no (major) incidents have occurred to this date.


It's a beautiful story. As a space fan, especially of the interplanetary type, this story was riveting to me as it unveiled details of the spacecraft testing I've never imagined. I do far less important testing in my day to day but I was able to draw some similarities with the author.

Many have posted of their failures here so I suppose I could share a couple of mine.

- Pushing gigabytes of records into a Prod table only to realize the primary key was off by a digit, rendering the data useless for a go-live. It had to be deleted by the database admins and reloaded, which took precious hours. I forget why, but an update wasn't feasible.

- A perfect storm of systems issues that led to all servers in the pool becoming unavailable, causing an entire critical system to go dark. We got it back up within minutes, but it was harrowing nonetheless.

- Realizing hours before a go-live that a key data element was missing, prompting a client who was now in a code freeze to make a change (they were quite upset). Pretty sure I got an unfavorable review from that, but haven't made the same mistake since.


I’m glad they didn’t fire him.

I’m a firm believer that despite all the short comings of US, what makes it great is there are millions of engineers and scientists working to push the frontier of what is possible, and trillions of dollars in economy to fund that into reality.

NASA is truly an inspiration.

And also the private aerospace industry - SpaceX, ULA, Boeing, Lockheed Martin, Blue Origin, Planet Labs etc.

No other country has that.


The fact this can so easily happen shows a lack of safety mechanisms. All the software-related stories in this comment section go in a similar direction: They could have been prevented by simple safety nets. If you accidentally wipe a production database, then it was likely too easy to do so.

Don't blame humans for occasional mistakes; it won't stop them from happening.
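One generic flavour of safety net, sketched in Go (the environment variable name and the dry-run-by-default behaviour are my own invention, not from any of the stories above): destructive actions do nothing unless they are explicitly armed.

    package main

    import (
        "fmt"
        "log"
        "os"
    )

    // destructive wraps anything that deletes or overwrites data. It is a dry
    // run by default and only executes when CONFIRM_DESTRUCTIVE=yes is set, so
    // a fat-fingered run against the wrong environment prints instead of acting.
    func destructive(desc string, run func() error) error {
        if os.Getenv("CONFIRM_DESTRUCTIVE") != "yes" {
            fmt.Printf("[dry run] would have executed: %s\n", desc)
            return nil
        }
        log.Printf("executing destructive action: %s", desc)
        return run()
    }

    func main() {
        err := destructive("truncate table feature_flags", func() error {
            // ... the real TRUNCATE would go here ...
            return nil
        })
        if err != nil {
            log.Fatal(err)
        }
    }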


The idea that one can learn to not make errors is toxic. To err is human. Sure, you can get more reliable at something, but everyone - even the most experienced - will fry the rover at some point.


Hey neat - I once destroyed a $5000 prototype disk drive (I was a Firmware engineer at Quantum in the late 1990s at my first post-college job), also with back-EMF (it flowed from the braking spindle motor into a power chip and melted the circuit board). I felt terrible and that was multiple orders of magnitude cheaper. :)


With only two weeks left to go, why was a multimeter completing the telemetry circuit?


Maybe it was measuring the current rather than the voltage as he thought, which would require putting it in series with the circuit.


That stood out to me as well when I read the article.


The multimeter in the photo has the probes plugged into the volts/ohms jack, though. Maybe it was a different meter.

I figured this was going to be a story about trying to measure voltage with the meter set up on the 10A current range.


Yeah, I don't think he took a photo of the multimeter to use in the article before unplugging it. He didn't yet know it was important.


So that it could be quickly and easily unplugged


Why would you not use a multimeter two weeks before launch? The two things are orthogonal to each other.


This sure seems like a symptom of a much larger organizational failure that could mistake "it's not plugged in" for "it's broken".


This clickbait title is misleading.


"My almost $500M mistake" doesn't have the same ring.


"My mistake that seemed to potentially cost up to $500M, but then it turned out fine" would be more accurate. While a more appropriate title would be something akin to "learning from mistakes" or "a Mars rover testing scare".


Thanks for sharing your experience. It reminded me of another one I'd read a while back by Andrew Latham: https://www.linkedin.com/pulse/what-i-learned-from-my-bigges...


Unrelated to the content of the article, but is the cover image AI generated? Looks kind of like DALL-E style output to me but it's hard to be sure.

(Also, I know I'm breaking the rule "Please don't pick the most provocative thing in an article or post to complain about in the thread." My defense is this is less of a complaint and more just plain curiosity!)


Definitely is.


In my company people aren’t allowed to change org level GitHub settings without a second person watching over them, but NASA let a 28 year old kid run electricity into spacecraft without oversight?

It really seems like for anything which, when done improperly, could cause millions of dollars in damage, there should be a second person reviewing your setup first.


The average age of engineers running Apollo 11 was supposedly 28

https://www.popularmechanics.com/space/a4288/4318625/?utm_so...


Great write-up. My two favorite quotes:

> I had learned from countless experiences in this and other projects that bad news doesn’t get better with age

That's so true! We tend to sit on bad news and hope that somehow time will blunt it; but if anything the opposite happens.

And

> I still remember the shock when Project Manager Pete delivered the decision and the follow-on news: ‘These tests will continue. And Chris [the author] will continue to lead them as we have paid for his education. He’s the last person on Earth who would make this mistake again.’

We sometimes think people who made one mistake will make another one, and it's better to go with the person who doesn't make mistakes. But that's not the correct approach. People who don't make mistakes are often people who don't do anything.


So if I understand it correctly, the electrical connector was genderless? If so, that's relatively rare (the only ones I can think of are Anderson Powerpoles, which I don't think are rated for interplanetary vehicles) and extremely stupid.

Edit: I suppose he could've been using alligator leads.


You can configure an Anderson connector pair in a way that it won't reverse. I use them quite frequently, and have a local (though admittedly undocumented...) standard for what orientations mean what voltage. It's not bulletproof, but it does make me think when things don't line up.


A breakout box usually has both male and female sides, along with the banana or whatever breakout in the middle, so it can be plugged into either side of the circuit, or both at once and watch signals while the system operates. It's not genderless, it's a pass-through.

Making an only-faces-the-motor breakout, and a separate only-faces-the-driver breakout, might've been prudent, presuming that they used unique and consistent connectors, for instance a single gender always on the motors. But that's quite an assumption and I can imagine a ton of reasons it might not apply.


This was a test fixture so probably banana leads.


I color code _everything_.


All the major suppliers make hermaphroditic connectors, but like you I’ve only ever seen the powerpoles in person.


My friend reports his phone started dropping contacts methodically, until only one remained: a person he didn't know.

Googling him turned up a junior dev at a FAANG. Oops.

An hour later the contacts repopulated, and all was well. But had to be a white-knuckle time for that poor shmuck.


There is a principle here that I haven't seen mentioned, and that is how easy it is to discard something as a cause of a problem because it seems so minor, so routine (i.e. removing the "spare" multimeter), and therefore to get blinkered about what is going on.

There are more than a few times where I am scratching my head as to "how could my change have possibly broken this" only to remember a couple of hours later that I had made another change somewhere, or rebooted, or changed a config file temporarily.

I guess it just says that we all need to log everything we do, including removing spare multi-meters, so that by looking over the list we can remember these things.


> failure is not an option — it comes pre-installed

Love this. Whenever people say “failure is not an option”, I get the sense that they don’t really understand how the universe works. It’s like saying “entropy is not an option”. Uh...


Another failure is from the person who used the multimeter to complete a circuit - and then didn't even leave a note on the multimeter. That person could reasonably have anticipated this error mode, and taken steps to prevent it.


When I was 20 and in college, I used to work at a Barnes and Noble. When I was working the registers one day, I apparently forgot to put a $300 gift card in a lady's bag at checkout.

The store manager got a complaint about it like a month later, and she tracked down the gift card number from the receipt. It wound up getting loaded with even more money a few days later, and given to someone else.

Anyway, the whole incident got me fired. And to this day, I always check that I put gift cards into bags (on the rare occasion it comes up in my job as a software engineer for NASA missions).


I worked at a video startup thing, and watching the app fail after people had sunk like 30 mins into using it... Oh man, that was cringe.

The videos were recorded because they were legal/financial related and also auto-transcribed for text search. I would watch them to trace how the problem occurred (I know there are many ways to log).

I was the only developer so it was my fault ha. The app had so many parts and I just used an E2E test to make sure everything generally worked. It's cool Chrome has a fake video feed (spinning green circle/beeping).


Fantastic story! One of the fascinating things about this is the parallel to Apollo 12 - a lightning strike / power surge led to a total loss of telemetry, which raised the specter of losing the entire mission - and there was a similar resolution when telemetry was restored!

I've written a bit about it myself - https://flyingbarron.medium.com/lightning-strikes-92482387ca...


I worked at JPL for two years in college and helped with flight hardware testing a few times (probably in the same clean room this story took place in, albeit several years later). I can definitely see how a mistake like this could get made. A few stories I remember hearing from those days:

1. Bending pins from trying to insert a connector incorrectly
2. Running a full day of testing but forgetting to set up data recording
3. Accidentally leaving a screwdriver next to the hardware inside a thermal vac chamber in an overnight test

Fun times!


Ernie is the hero in this story.

As Fred Rogers said, "look for the helpers".


What a great write up. The tension preceding launch must have been tremendous, and lack of sleep adding to the possibility of such a small yet critical error. Good lesson for all of us.


It's a nice story with a nice message. And it's pretty normal that mistakes happen, especially under pressure and during long shifts. The "mistake" itself is understandable; what is shocking to me is the multimeter thing: learning that a mythical "NASA guy" in charge of really serious stuff didn't realize that the multimeter was measuring current, and was thus part of the circuit, so removing it would switch something off.


So let me share mine. It went like: "Hey, can you correct this issue directly on the database holding 5 years of corporate financial data?" In the midst of testing and looking for the issue I wrote a DELETE with an incomplete WHERE clause that would have deleted lots of data. Fortunately it was stopped by a constraint violation. I still remember the adrenaline rush as I processed what had just happened.
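One habit that can turn that kind of near-miss into a non-event: run the DELETE inside a transaction, check how many rows it touched, and only commit if the count looks sane. A rough sketch in Go with database/sql (the table name, placeholder style and threshold are invented for illustration):

    package dbguard

    import (
        "database/sql"
        "fmt"
    )

    // deleteWithSanityCheck deletes rows inside a transaction and refuses to
    // commit if far more rows were affected than expected, which is a cheap
    // guard against an incomplete WHERE clause.
    func deleteWithSanityCheck(db *sql.DB, maxExpected int64) error {
        tx, err := db.Begin()
        if err != nil {
            return err
        }
        defer tx.Rollback() // harmless no-op if Commit already succeeded

        res, err := tx.Exec(`DELETE FROM ledger_entries WHERE batch_id = $1`, 42)
        if err != nil {
            return err
        }
        n, err := res.RowsAffected()
        if err != nil {
            return err
        }
        if n > maxExpected {
            return fmt.Errorf("refusing to commit: %d rows affected, expected at most %d", n, maxExpected)
        }
        return tx.Commit()
    }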


I thought this was about this story. Probably more interesting.

"The First Bug on Mars: OS Scheduling, Priority Inversion, and the Mars Pathfinder" - https://kwahome.medium.com/the-first-bug-on-mars-os-scheduli...


When I saw the headline I thought it may have been a write up from this:

https://www.latimes.com/archives/la-xpm-1999-oct-01-mn-17288...

> NASA lost its $125-million Mars Climate Orbiter because spacecraft engineers failed to convert from English to metric measurements


An even worse engineering horror story: https://faroutmagazine.co.uk/the-steely-dan-song-that-was-al... (more sad trombone and far less "relieved engineer, happy ending.")


That's an interesting read!

Not really sure whether this counts as "engineering" though (accidentally taping over a track), nor would I consider it "worse" than potentially destroying $500M worth of advanced equipment.


I'm really at a loss for words. There is only one lesson from this incident: nobody is supposed to touch a $500M piece of equipment after 12 hours of work. Period. The author is not getting it, and neither is anybody in the comments. The world is run by idiots and it shows.


It seems like there was a lot of pressure to meet the deadline and everybody was overworked in this environment. A perfect example of normalization of deviance. It's a shame that there still exist many workplaces like that. It's our duty to be aware of what's happening around us and tell the guy - Hey, what the fuck are you doing here? Go home, this is no critical emergency.


I like this - “Let your scars serve you; they are an invaluable learning experience and investment in your capability and resilience.”

I have had a couple of scars of my own. I feel like sometimes you become risk-averse, and when you are launching new things you will face the fear of failure.


That one time I deleted the files on the CEO's laptop and did the backup afterwards - wrong order. Oops. Never heard anything about it though, so he didn't fully entrust all his data to the intern. Wise decision.


I felt like the real lesson is the missing flight-recorder system: if each action and behaviour had been properly logged, it would have been pretty clear that the telemetry was lost before Chris even began testing the RAT motor.


Ctrl-F, "The Martian"

"Phrase not found"

But... how? This is exactly the same story that happened in the book. Since Spirit is older than the book, perhaps this was the real-life inspiration.


Bravo to the author. I think if I had made the same mistake, and if the damage had been permanent, I would probably be scarred for life and never recover from it.


The author is scarred for life, reliving it in certain circumstances. They use the scar as a tool to sharpen their performance.


I once legit blew a $2K FPGA in a postgraduate lab by fat-fingering. Horrible memory.


And I thought I had a big mistake when I crashed an expensive drone.. nice read!


It's an interesting story, but not a catastrophic failure.


Likely my favorite article ever; amazing writing.


$500M is a lot to waste in private money, but unfortunately in taxpayer money it's a pittance and usually swept under the carpet.


So, some short-sighted bean counters combined the EE & CE degrees and you are the result? Ask for a tuition refund!


At least it was just a Fluke.


Decades ago I worked as a leader of a small team of sysadmins. We worked around the clock maintaining the server room and critical applications for an average-size bank.

One of our responsibilities was to execute nightly checklists to run various processes, do backups at correct times, etc. These processes would be things like running calculations on loans, verifying results, etc.

We had a huge ream of checklists to accomplish this and we were supposed to follow them religiously.

We had two very similar applications: one was our core, and the other was the core from another bank we had bought - the same application but an older version with a slightly different config. Consequently, we had two tracks of checklists with very similar steps.

One of those steps was to change the accounting date in the system. The application was a terminal app. We would telnet to the server, log in, then we would execute commands in the menu driven app. To change the date we would have to go to a special menu for super dangerous applications. It required the user to log in again.

Our core system required logging in, selecting that we want to advance the accounting date by one day, entering the admin password again, pressing enter, then waiting for about 4 hours while the process ran. Then the process would exit back to the menu, where the highlighted option would be to advance the day.

Our legacy system required logging in, selecting the option to advance the accounting date, entering the admin password, pressing enter, then waiting for two hours, after which a popup showed asking a stupid question where we would always just press enter, then waiting for another hour until it exited to the menu.

We quickly figured out that we could just press the enter key twice on our legacy system. The second enter press would just sit there in the keyboard buffer and dismiss the popup. This was very useful for us, as this was the only operation that interrupted what would otherwise be the only time during the night when we could go have a kebab...

One night I made a mistake and pressed enter twice... on the wrong system. When I figured out what I had done, I realised the process would exit to the menu and should then ask for the admin password.

But, unfortunately, the application had a bug (or a feature). Once it exited to the menu, it came back in, but for some reason it remembered that the admin password had already been entered and started advancing the accounting date again without asking for the password.

Unfortunately, the date was December 24. For the whole of December 24, the entire bank was unable to process any operations while we restored from the last good backup (taken before day close) and then redid the EOD operations. Then on December 25, as a penalty, I had to sit for the entire day with the accounting department, observing how they manually entered all of the operations that would normally have happened automatically on Dec 24th.

One extra key pressed.


He’s the last person on Earth who would make this mistake again. That hit HAAARD


Trust but verify!!


It's crazy how much better boomers had it. If this had happened in 2010 then he would have been fired after that.


Why would you ever allow people to work 12-hour days on something so important? Grad student labour is cheap, surely trying to have one person do the work of two is a false economy.


I have bad news for you about health care professionals.


Health care professionals is a weird one, because while long shifts are dangerous, patient handover is also dangerous and there may be an argument that longer shifts means fewer handovers which could result in better patient outcomes.


Would we accept this if they were dealing with nukes, rather than people? Yeah we let people who haven't had sleep in 36 hours handle the nukes because having the people involved talk to each other between shifts is hard.


They do know roughly how long it takes to take care of a patient & should be set up with overlapping shifts and to be winding down towards a normal shift (i.e. no new patients) so that there's no handoff of a single patient but no one is working long hours. Some patients might take longer than a single shift, but handoff is inevitable at some point. You can improve your handoff processes but you can't improve the decision making of someone working a 12 hour shift.


Is this maybe one of those "if something is hard, do it more often" things?


Maybe, but I won't claim to know how to quantify things well enough to evaluate proposals. I do know that even in tech, with low stakes, handoff is a problem. I recall hearing of a team trying to run on-call across 3 different timezones; they requested to scale back to 2 with longer hours because of the handoff problem (and these hand-offs were occurring daily).


I would prefer fewer patients per doctor then. It seems that the problem is due to the limited supply of doctors. In both countries where I lived, supply of doctors was artificially limited by regulation.


That just means that doctors need to handle fewer patients.


Fewer patients doesn't necessarily get the patients they are handling out the door faster.


I have worse news for you about grad students.


So the work is too important to work 12 hour shifts on it, yet your solution is throwing “cheap” grad students at it?


> work is too important to work 12 hour shifts on it

Yes. Because it is known that exhausted people make mistakes. The work is too important to let exhausted people screw it up, so you should make sure everyone working on it is well rested.

> your solution is throwing “cheap” grad students at it

Yes? It is testing an electric motor. They can do it. The solution is that you employ enough people so nobody needs to work heroic 12-hour shifts.


That “solution” is nothing more than typical HN backseat driving.

In the real world there are budget, personnel and hiring constraints. You don’t get to hire all the people you want. You make do with what you have, and try to push the mission forward, even in suboptimal conditions.


So we should make every person work 12 hours, then. This 8 hour workday is for the birds, the company has a mission!


Wait until I tell you who's doing work in hospitals and how much they're being paid for it.


Another relevant bit of info from hospital accidents: hand-offs between shifts are known to increase the risk of a mistake in care and are part of the reason nurses and doctors work such long hours.


I avoid, if possible of course, going to the hospital right before a shift change for this very reason.


I'm not questioning your logic here, but how do you keep intimate knowledge of the seasonal vagaries of shift changes at every department of your local hospital?


If you’ve been to that hospital you can easily take note. And most places have a shift change between 5-7am. This one is almost universal, as far as I have observed, even in different countries.


You can probably make a good guess, but at this point I wouldn't be surprised to find websites or Facebook groups dedicated to tracking this information for hospitals in any given area.


Because they’re more or less universal.


On your next hospital visit for life-saving care, I'm sure you will be comforted to know that nurses (in the US) typically work 12 hour shifts and they're on their feet the whole time.


Nurses work 3x12 shifts to reduce the number of times patient care is handed off. If everyone worked 8 hour shifts you’d have 3 handoffs per day and a minimum of 3 different people caring for the patient. With 12 hour shifts you have only 2 handoffs per day and can have 2 people trade off on patients if their schedules line up.

I’m sure it varies by location, but my nurse friends only work 3x12, giving them 4 days off per week. Working 12 hour shifts is much more acceptable when you have more days off than days spent working. They’re virtually unavailable on days they work, but then they’re off traveling or having fun for 4 days, some times more if they combine their days off back to back. My close nurse friend routinely takes week long vacations without actually taking any time off at all.


This is true, but IMO doesn't refute the point, it just makes me concerned about care quality. Were there any studies that showed that the long, grueling shifts are actually better or is it simply this way because it's always been that way and change would be hard and expensive, and because "my grandma walked up hill to school both ways, so young people can too"?


Who said it's gruelling? In the case of the rover, it's not year round, it's when they are preparing for a launch. Also, you're on HN, so surely you have heard of flow.


And doctors doing 24h shifts. At least in the NICU my kids were in.


One of the many reasons I'd never live there. I'll stick to places with better labour laws thanks.


One of the hardest problems in a workplace is coordinating the workers. There's also a substantial overhead cost for every employee. I'm sure workers are less efficient at the end of a 12 hour shift, but shift changes also cost a lot and introduce lots of opportunities for errors.


> Grad student labour is cheap

Sir, this is NASA not Kerbal space program ;)


I am guessing from my personal experience, but for the most part people tend to assign the more repetitive technical jobs (like testing motors) to people lower on the ladder, so (relatively) less experienced people are going to touch the parts more.

Then such people are doing the "actual work" while overloaded with tasks, so working overtime is the rule rather than the exception. The justification for this is that he was "getting experience" while trying to move up in his career. So all good.

Most people are going to remember that he messed up, rather than that he was working overtime to meet expectations - except maybe for the guy that patted him on the shoulder; he saw enough to understand it.


They definitely should have used a grad student to complete the circuit, freeing up that multimeter.


I am a bit disappointed that the name of the author is not Howard Wolowitz.


How is this a $500 million mistake? It seems the issues didn’t cost $500 million.


It's not "mistake" singular, it's actually mistakes.

The first mistake is the $500 million rover is fried.

The second mistake is believing the first mistake.

(or put another way, the first mistake cost $500 million; the second mistake, which they didn't realise at the time, saved $500 million)

But you can't explain the second mistake without first explaining the first mistake, hence the title.


I guess the idea is that it potentially could have cost $500 million therefore it was a $500 million mistake. It's not exactly accurate but it does help contextualize the gravity of their mistake.


Reminds me of my favorite George Carlin bit[1]:

Here's one they just made up: "near miss". When two planes almost collide, they call it a near miss. It's a near hit. A collision is a near miss.

[1]: https://www.quotes.net/mquote/35854


It was feared to be a $500M mistake.


Congratulations. 102 circuits would have taught this. But some short-sighted bean counters merged EE & CE so you didn't get the opportunity. I suggest you ask for a tuition refund, as they failed to educate you as they promised.



