The engineers still seem to have a physical server mindset rather than a cloud mindset. Deleting data is always extremely dangerous and there was no need for it in this situation.
They should have spun up a new server to act as secondary the moment replication failed. This new server is the one you run all of these commands on, and if you make a mistake you spin up a new one.
Only when the replication is back in good order do you go through and kill the servers you no longer need.
The procedure for setting up these new servers should be based on the same scripts that spin up new UAT servers for each release. You spin up a server that is a near copy of production and then do the upgrade to the new software on that. Only when you've got a successful deployment do you kill the old UAT server. This way all of these processes are tested time and time again, so you know exactly how long they'll take and you've already ironed out any problems in the automation.
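As a rough sketch of what that could look like in practice (the hostnames, the provision_server() helper, and the PostgreSQL-specific commands below are assumptions for illustration, not anything described in the incident):

```python
# Sketch only: build a disposable new secondary and verify replication
# before anything old gets touched. provision_server(), the hostnames,
# and the use of PostgreSQL streaming replication are all assumptions.
import subprocess

PRIMARY = "db1.example.internal"            # assumed existing primary
NEW_SECONDARY = "db2-new.example.internal"  # freshly provisioned machine


def run(host: str, cmd: str) -> None:
    """Run a command on a remote host via ssh; raise on any failure."""
    subprocess.run(["ssh", host, cmd], check=True)


def provision_server(name: str) -> None:
    """Placeholder for the same scripts that build UAT servers
    (Terraform, Ansible, a cloud CLI -- whatever is already tested)."""
    subprocess.run(["./provision-uat-like-server.sh", name], check=True)


# 1. Bring up a new, disposable box from the tested image/scripts.
provision_server(NEW_SECONDARY)

# 2. Seed it from the primary; -R writes the standby configuration so it
#    comes back up as a replica.
run(NEW_SECONDARY,
    f"pg_basebackup -h {PRIMARY} -U replicator "
    "-D /var/lib/postgresql/data -R -X stream -P")

# 3. Start it and confirm it is streaming before decommissioning anything.
run(NEW_SECONDARY, "systemctl start postgresql")
run(PRIMARY, "psql -c 'SELECT client_addr, state FROM pg_stat_replication;'")
```

If any step goes wrong, you throw the new machine away and start over; the existing servers are never the target of a destructive command.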
This type of thing always sounds good in theory, but the reality is that people get desperate and emotional when their website is down and everyone wants it back up ASAP.
I certainly don't disagree with that, but if you have this automated, it's also the fastest way to get back up and running. Besides, the site wouldn't have been down in the first place if they'd had this.
You don't think it's worth it most of the time because of the hassle of setting up and managing a cluster, or because clusters in and of themselves aren't necessary for most?
The latter. I think it's a 1% business case, basically. I mean, if we can get 80% benefit without excessive cost, then it's obviously a good idea. (And I do use Ansible/Docker and the like, but it's not entirely without friction... which is where the cost/benefit analysis comes in.)
EDIT: Obviously, if you really need clustering, then you need it, but IME people tend to overestimate their needs drastically. Everybody wants to be Big Data, but almost nobody actually is.
It cut your cloud services bill by 93%, but how much did it increase your engineering bill by?
If your engineering time is free, then this calculation is complete. Otherwise it is not.
Does that 93% saving pay for a DB engineer, or enough of your developers' time to build the same quality of redundancy as you'd get with a DBaaS?
This calculus is going to be different for every DB and every company, but the OpEx impact of switching to dedicated servers is a bit more complex than you suggest above.
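To make the trade-off concrete with purely made-up numbers (none of these figures come from the thread or from GitLab):

```python
# Illustrative only: whether a 93% cut in the cloud bill covers the extra
# engineering cost depends entirely on the size of the original bill.
cloud_bill = 20_000                  # hypothetical monthly cloud/DBaaS spend
dedicated_bill = cloud_bill * 0.07   # after a 93% reduction
monthly_saving = cloud_bill - dedicated_bill

engineer_cost = 15_000               # hypothetical fully loaded monthly cost

print(f"Monthly saving: ${monthly_saving:,.0f}")
print(f"Engineer cost:  ${engineer_cost:,.0f}")
print(f"Net per month:  ${monthly_saving - engineer_cost:,.0f}")
```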
(a) I’m talking about projects I host in my free time.
(b) My server budget is fixed.
So, for me the choice was between "use cloud tools, and get performance worse than a Raspberry Pi" and "run dedicated, and get more performance, storage, and traffic than I need, and actually the ability to run my stuff".
For less than the price of a Netflix subscription I’m able to run services that can handle tens of thousands of concurrent users, and have terabytes of storage (and enough included traffic that I never have to worry about it).
And setting it all up only cost me a few days.
For me it was a decision between being able to run services, or not being able to run them at all.
Sure, hobby/spare-time projects are one of the cases where it's perfectly reasonable to self-host; often it's fun to learn about the underlying tools by rolling your own db, and doing so can save you some cash (at the expense of your own time).
However, that paradigm is not really applicable to GitLab's OpEx calculation; they have to pay their engineers ;)
Yes, it might be more affordable. They seem to think it is, as they have chosen to go with self-hosted.
My point is simply that your posts above didn't address the complexity of their calculation, as they didn't factor in the costs of switching to self-hosted.