Having worked at an enterprise software company with several big clients (including banks), I find it surprising (and to some extent shocking) that JPMC didn't have a more efficient disaster recovery process in place.
I am not saying they didn't have one, just that disaster recovery planning should account for outages like this. Hypothetical fire drills and the like are needed at critical businesses like banks.
My guess is that a bunch of people at JPMC will be losing their jobs over this.
I read it differently. These things happen: data corruption can get replicated to the hot spare, i.e. a failure more catastrophic than that setup is designed to handle.
They were able to identify the problem, recover from backup, and replay the missing transactions in a reasonable amount of time for a setup this large. In my book that's a success.
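For anyone unfamiliar with what that recovery path looks like, here's a minimal sketch of "restore the last good backup, then replay the transaction log past the backup's checkpoint". It's purely illustrative: the file names, JSON layout, and sequence-number scheme are all made up, and a real bank does this with a DBMS's redo/archive logs, not flat files.

```python
# Toy illustration of point-in-time restore plus transaction-log replay.
# All names and formats here are hypothetical.
import json


def restore_from_backup(backup_path):
    """Load the last known-good snapshot and the sequence number it covers."""
    with open(backup_path) as f:
        snapshot = json.load(f)
    return snapshot["accounts"], snapshot["last_applied_seq"]


def replay_transactions(accounts, last_applied_seq, log_path):
    """Re-apply only the transactions the snapshot has not yet seen."""
    with open(log_path) as f:
        for line in f:
            txn = json.loads(line)
            if txn["seq"] <= last_applied_seq:
                continue  # already reflected in the backup
            accounts[txn["account"]] = accounts.get(txn["account"], 0) + txn["amount"]
            last_applied_seq = txn["seq"]
    return accounts, last_applied_seq


if __name__ == "__main__":
    accounts, seq = restore_from_backup("backup_snapshot.json")
    accounts, seq = replay_transactions(accounts, seq, "txn_log.jsonl")
    print(f"Recovered to seq {seq} with {len(accounts)} accounts")
```

The hard part in practice isn't the replay loop, it's knowing that the log itself wasn't corrupted and that the replay converges on a consistent state, which is exactly why doing it at this scale in a reasonable window counts as a win.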
With similar experience, I am not shocked at all. People design and implement "processes" to "prevent" production issues from happening, but they are mostly feel-good measures layered on top of "let's cross our fingers and hope nothing bad happens".
This usually works, which is why people think it's an acceptable policy. But real planning involves things like software correctness, proper test procedures, and ways of making a test environment that's exactly identical to production. This is hard (and slows down development... "tests, what a waste of time!"), so instead people say, "let's try really hard not to fuck something up".
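To make the "identical to production" point concrete: even a dumb config-drift check catches the kind of divergence that quietly makes test results meaningless. This is just a sketch under invented assumptions (flat JSON config files with made-up names), not anyone's actual tooling.

```python
# Minimal sketch of a config-drift check between production and a test
# environment. File names and the flat key/value layout are assumptions.
import json


def load_config(path):
    with open(path) as f:
        return json.load(f)


def report_drift(prod, test):
    """Print every key where the test environment differs from production."""
    for key in sorted(set(prod) | set(test)):
        if prod.get(key) != test.get(key):
            print(f"{key}: prod={prod.get(key)!r} test={test.get(key)!r}")


if __name__ == "__main__":
    report_drift(load_config("prod_config.json"), load_config("test_config.json"))
```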