RDS also comes with its own set of tradeoffs. There is no free lunch, and the cloud is just another word for someone else's server. There are reasons GitLab opposes it.
In the meantime, solution architects and salespeople from AWS are going to run around to enterprises with annotated copies of this public post-mortem and say, "Look, RDS would have solved x, y, z, and we can do that for you if you pay us."
>> the cloud is just another word for someone else's server.
No. The cloud (AWS, GCE, Azure, etc.) is not "just" like your own server.
Just consider some basic details - you pay someone else to worry about things like power outages, disk failures, network issues, other hardware failures, and so on.
I think that's a little pedantic. The point he was making is that, conceptually speaking, the cloud is comprised of servers not unlike the servers you run yourself. The difference, obviously, is who runs them, the manner in which they're run, the exact manner in which they're utilized by you, etc., but they are still just servers at the bottom of the stack.
"The point he was making is that, conceptually speaking, the cloud is comprised of servers"
But... that "point" is trivial.
Did anyone ever claim that cloud servers are made of magic pixie dust? No.
The real "point" is that cloud = hardware + service, with service > 0.
As the OP describes, GitLab tries to do their own service (because service is expensive... it is), and they find out, the hard way, that the "service" part is not easy at all.
Amazon & Microsoft & Google run millions of servers each, so they can afford to hire really good people, and establish really good procedures, and so on.
You are completely right. There are reasons to oppose the cloud, but maybe they should focus on improving their systems before moving out of the cloud. At this point in time it is clear that GitLab lacks the talent to run everything themselves. I mean, five backups worthless or lost? You can't let interns write your backup system. After all, backup is a large portion of their product.
The worst part of the whole episode, even worse than 'deleted the active database by accident', was '(backups) were no one's responsibility'. This is not an oversight by an individual engineer, but an aspect of the management and company culture. It shows they lack processes derived from requirements. Lots of introspection required from GitLab at this point.
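For what it's worth, the sort of check an owner of the backup process might run on a schedule can be sketched in a few lines. This is only an illustration, not anything GitLab actually runs: the function name `check_latest_backup`, the directory layout (one dump file per run), and the age and size thresholds are all assumptions made up for the example.

```python
import time
from pathlib import Path


def check_latest_backup(backup_dir, max_age_hours=24.0, min_size_bytes=1024, now=None):
    """Return a list of problems with the newest file in backup_dir.

    An empty list means the newest backup looks plausible. The thresholds
    are hypothetical defaults, not recommendations.
    """
    now = time.time() if now is None else now
    files = [p for p in Path(backup_dir).iterdir() if p.is_file()]
    if not files:
        return ["no backup files found"]

    # Pick the most recently modified file as "the latest backup".
    newest = max(files, key=lambda p: p.stat().st_mtime)
    stat = newest.stat()

    problems = []
    age_hours = (now - stat.st_mtime) / 3600
    if age_hours > max_age_hours:
        problems.append(
            f"{newest.name} is {age_hours:.1f}h old (limit {max_age_hours}h)"
        )
    if stat.st_size < min_size_bytes:
        problems.append(
            f"{newest.name} is only {stat.st_size} bytes "
            f"(suspiciously small; minimum {min_size_bytes})"
        )
    return problems
```

A check like this only proves that a recent, non-trivial file exists. It says nothing about whether the backup can actually be restored, so it would complement a periodic restore test rather than replace one.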
Yes. This should be treated as a serious management failure. Blame does not lie with the individual who made a simple mistake; it lies with the supervisory structure that allowed simple mistakes such as this to result in major data loss (and, as discussed in yesterday's thread [0], has made a series of other serious strategic mistakes that have likely caused them to end up with such inadequate internal hierarchies).
Something like this is not a mere oversight on the part of technical leadership; it's either negligence or incompetence. Whoever is responsible for GitLab's server infrastructure should be having very serious thoughts right now.
Smaller companies that, for whatever reason, cannot afford enough good senior technical people benefit greatly from the cloud. One master, no read replicas, weird backup policies, and the saviour of the day is some engineer's lucky manual snapshot. That sucks. It's better for people to start with the cloud and manage things themselves once they are really confident.
It's worth noting that, compared to a good number of recent-ish startups, GitLab now has (I believe) more than 160 employees. Someone could've owned a recurring task to work on backup processes (and I imagine that now someone, or likely multiple people, will).