I agree that the person who made such a mistake will be the person who never makes that mistake again. That's why firing someone who has slipped up (in a technical way) and is clearly mortified is typically a bad move.
However, I don't agree that this is the "real" lesson.
Given the costs at play and the risk presented, the lesson is that if you have components that are tested with a big surge of power, give them custom test connectors that are incompatible with components that are liable to go up in smoke. That's the lesson. This isn't a little breadboard project they're dealing with, it's a vast project built by countless people in a government agency that has a reputation for formal procedures that are the source of great time, expense, and in some cases ridicule.
The "trust the 28 year old with the $500m robot that can go boom if they slip up" logic seems very peculiar.
Well, it's true that it should be designed such that they cannot be plugged incorrectly. I would imagine it is indeed mostly designed in that way, but there can still be erroneous configurations that were not accounted for at the design stage.
Especially during testing, you're often dealing with custom cables, connectors, and circuits that are different from the "normal configuration".
I would say that the lesson is to perform as many critical operations as possible under the four-eyes principle: one person does the thing, and someone else checks each step before continuing. Very effective for catching "stupid mistakes" like the one in the article.
But again, it is not always possible to have two people looking at one test, especially with timeline pressure etc.
So mistakes like these do happen in the real world.
You have to make the whole system robust.
> Well, it's true that it should be designed such that they cannot be plugged incorrectly
I agree with you, but on Earth this is easy. For spacecraft, I imagine you can't just use any connector from Digikey.
> especially with timeline pressure etc.
If timeline pressure, lost sleep, or rushing jobs not meant to be rushed causes a catastrophic technical error, it is 100% the fault of the person who imposed the timeline, whether that's a middle manager, vice president, board, or investor. It is emphatically NOT the fault of the engineer who did the work, if they do good work when not under time pressure.
HOLD PEOPLE LIABLE for rushing engineers and technicians to do jobs that require patience and time to do right.
I agree that individuals shouldn't be held responsible for mistakes like this.
However, you can't always eliminate timeline pressure.
Even if the project is planned and executed perfectly, there will almost always be unknown unknowns encountered along the way that can push your timeline back.
As is the case with sending things to Mars there is a window every two years. That's a very real, non-fictitious deadline that can't be worked around.
> As is the case with sending things to Mars there is a window every two years.
This is very simple to deal with.
(a) If it's unmanned, rush and launch on time, but don't fault the engineer for mistakes made by rushing. If it doesn't work, everyone accepts that as a consequence of rushing.
(b) If it's manned, wait until the next launch window and prioritize safety. Period.
On the other hand, it's hard to make these kinds of judgment calls when you're talking about a one-off piece of equipment that's only going to go through this particular testing cycle a single time.
In computing, there are a lot of similar "one-off" operations -- something you do to the prod database or router config a single time as part of an upgrade or migration.
Sometimes building a safeguard is more effort than just paying attention in the first place. And while we don't always perfectly pay attention, we also don't always perfectly build safeguards, and wind up making similar mistakes because we're trusting the faulty safeguard.
In circumstances like the one in the story, the best approach might almost be the hardware equivalent of pair programming -- the author should have had a partner solely responsible for verifying everything he did was correct. (Not just an assistant like Mary who's helping, where they're splitting responsibilities -- no, somebody whose sole job is to follow along and verify.)
“One off” is never just a one-off; it’s always part of a class of activity, such as server migrations. Just paying attention guarantees eventual failure when repeated enough times.
This may be acceptable, but it comes down to managing risks. If failure means the company dies, then taking a 1-in-10,000 risk to save 3 hours of work probably isn’t worth it. If failure means an extra 100 hours of work and $10k in lost revenue, then sure, take that 1-in-10,000 risk; it’s a reasonable trade-off.
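The trade-off in the comment above is plain expected-value arithmetic. A minimal sketch, using the hypothetical figures from the comment plus an assumed dollar value for the failure costs (all numbers are illustrative, not from the thread):

```python
def expected_cost_of_risk(p_failure: float, cost_of_failure: float) -> float:
    """Expected cost of accepting a risk with the given odds."""
    return p_failure * cost_of_failure

def p_eventual_failure(p_per_run: float, n_runs: int) -> float:
    """Probability that at least one of n independent runs fails."""
    return 1 - (1 - p_per_run) ** n_runs

# Case 1: failure costs 100 hours of work + $10k revenue (assume ~$25k total).
# Expected loss on a 1-in-10,000 risk is $2.50 -- far less than 3 hours of
# engineer time, so accepting the risk is reasonable.
small = expected_cost_of_risk(1 / 10_000, 25_000)

# Case 2: failure kills the company (assume a $500M loss). The same odds now
# carry a $50k expected loss, which dwarfs the 3 hours saved: build the safeguard.
fatal = expected_cost_of_risk(1 / 10_000, 500_000_000)

# And "one-off" operations repeat as a class: at 1-in-10,000 per run,
# 1,000 runs already give roughly 10% odds of at least one failure.
repeated = p_eventual_failure(1 / 10_000, 1_000)

print(small, fatal, repeated)
```

The last function is also why "just paying attention guarantees eventual failure": a per-run failure probability that looks negligible compounds quickly across the whole class of operations.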
On careful reading, the power was sent into the power outleads of an H-bridge, which is a tough piece of electronics, and in the end nothing was damaged; the shutdown was unrelated. If it had been sent to the data line of the motor controller, it probably would have poofed something. We can't rule out that there were different connector types, but the two mistaken connectors were correctly assigned the same type.
I work in this industry and let me explain how this happens. Despite being such a costly project, you can’t really hard-require unique connectors everywhere because of all of the competing requirements. Actually, connectors in particular tend to have a lot of conservative requirements such as being previously qualified, certain deratings, pin spacing, grounded back shells, etc. At the end of the day there’s only a handful of connector series used and stocked and it’s not feasible (at any cost really) to have no matching connectors whatsoever. Of course, you would normally try and make connectors either standardized with the same signals, or unique with no overlap in between.
I don’t know the details in this case but it could be like this: socket-type connectors are required on external connectors on the spacecraft (to prevent shorts when handling), with a harness in between which will never be removed. The harness would be symmetrical with pin-type connectors.
At some point it is decided a breakout box is required for testing and now you have created an opportunity to plug the breakout box in backwards.
Or the breakout box has a 100-pin connector on one side and needs to connect to 25 pieces of test equipment on the other side. You probably don't have 25 different connectors to choose from, nor can you possibly demand custom requirements for every piece of test equipment.
Spacecraft are moving more towards local microcontrollers with local diagnostics so this kind of test equipment for every possible analogue signal is decreasing. In the case of motors, they would more likely be brushless now and you would rely on telemetry from motor drivers during both testing and flight instead of having this type of breakout box.
Connectors in aerospace are also following other industries and becoming more configurable at order time, including adding keys so you can have 10x “the same” connector but keyed so they only plug in one place. But it’s still not practical to demand all test equipment is configured like this.
> The "trust the 28 year old with the $500m robot that can go boom if they slip up" logic seems very peculiar.
Not just that, but to create a situation whereby said person is working unofficial double shifts to get it done, so they probably aren't bringing their best self into the office. If it were my $500 million, I wouldn't even care about the name of this guy, but I would want to have some very robust discussions with the head of their department. Also, "some mistakes feel worse than death": I get it, but c'mon, it's not like someone actually did die, which is the sadly unfortunate reality of other much less spectacular and blog-worthy mistakes.