While I don't agree with OP about replacing an entire system overnight, I do remember a friend who worked on a trading floor, and when they had bugs, their manager would say "the traders are taking a 30-minute lunch, you have that long to fix it," with the clear implication that they'd be fired if they didn't.
So I'm not sure the state of the art for trading platforms is as rock solid as everyone is implying, and Robinhood seems to be a long way from whatever gold standard there is (see the infinite leverage bug). So I don't think it's crazy that they'd move quickly to fix it.
I'd never be able to stomach the pressure, and I wouldn't wish it on others, but it doesn't seem crazy.
I understand the risks need to be weighed against restoring service for their users who might be losing money and avoiding regulatory fines.
Changing a single line can introduce a different bug. Use proper QA and testing to catch as many as possible, as with any development.
My emphasis is on getting things fixed quickly. They need to do whatever it takes to get systems online asap. Not sure what's so controversial about that.
Well, to be honest, we don't know how badly things are broken; only they do. But moving fast while doing things perfectly is the holy grail of software development, and it's not as easily achieved as you make it appear.
I only said it needed to be done fast, regardless of how much work. Of course it's not easy.
I'm surprised by all the misinterpretation in this thread. Seems like it reflects the laid-back West Coast/SV attitude that isn't a good fit for high-pressure, time-sensitive work in other industries.
You seem to think (or at least imply) that hard things can be done fast if only you work hard enough at it, that it's just a matter of trying. This is just not true. Not sure how else your posts should be interpreted. I've spent days with a team (yes, a competent team) just tracking down a bug, let alone fixing it (although usually once it's tracked down it's relatively quick to fix). If this issue involves multiple systems in a highly complex environment, then it very well could take a while to address fully, no matter how hard they work at it.
Because that's usually the case. Crunch time, disaster recovery, and emergency fixes are common in every sector from video games to aerospace. If you can't fix it, then switch to a secondary, or rebuild from backup, or throttle users, or process manually, or do anything other than be completely down.
RH wasn't prepared with any contingency. They should have a resolution for their users - even if they can't find or fix the original cause. That's the failure I'm talking about.
See the 2 other users in this thread that describe similar high-pressure situations.
Depends on what the issue is. It could be something that can be fixed quickly, or not. Going by the lack of a resolution, though, I’m assuming it’s not.
Reality is, without knowing more about what’s causing this it’s impossible for either of us to say. If there is indeed some fundamental bottleneck that was previously not known, then I certainly won’t be surprised if it takes a while to sort out.
Now you can say they should’ve load tested, capacity planned etc etc. But we are where we are. Still can’t go back in time to turn this into a quickly fixable problem if it’s currently not.
Edit: also pretty disappointed that we don’t know more about the root cause. As a user I’d want to know what the issue was and what they are planning to do about it, to evaluate whether I should trust them going forward.
> needed to be done fast, regardless of how much work
This is the part you don't understand. There is a difference between digging ten one-foot-deep holes and one ten-foot-deep hole. People need time to plan how to coordinate and then get on the same page so that everyone can work at their own pace. That is the part that is not parallelizable and is the rate-determining step.
The context is all lost here. Plenty of other companies and industries have emergency action and disaster recovery. People don't work at their own pace; they work to the deadline with solid procedures. They can fix and replace entire components to restore service ASAP because that's the priority.
If this sounds unfamiliar or onerous then it's because you and others might have never experienced teams that do this. Robinhood is clearly lacking this experience and disaster planning.
This is a financial trading platform. Do you understand the risks of potentially introducing a different bug?