> 1. notifications go through regular email. Email should be only one
> channel used to dispatch notifications of infrastructure events. Tools
> like VictorOps or PagerDuty should be employed as notification
> brokers/coordinators and notifications should go to email, team chat, and
> phone/SMS if severity warrants, and have an attached escalation policy so
> that it doesn't all hinge on one guy's phone not being dead.
> 2. there was a single database, whose performance problems had impacted
> production multiple times before (the post lists 4 incidents). One such
> performance problem was contributing to breakage at this very moment. I
> understand that was the thing that was trying to be fixed here, but what
> process allowed this to cause 4 outages over the preceding year without
> moving to the top of the list of things to address?
High availability is something we've wanted to do for a while, but for whatever
reason we just never got around to it (until recently). I'm not sure exactly why.
> Wouldn't it be wise to tweak the PgSQL configuration and/or upgrade the
> server before trying to integrate the hot standby to serve some read-only
> queries?
The server itself is already quite powerful, and the settings should be fairly
decent (e.g. we used pgtune and spent quite a bit of time tweaking things). The
servers currently have 32 cores, 440-something GB of RAM, and the disk
containing the DB data uses Azure premium storage with around 700 GB of capacity
(we currently use about 340 GB).
> And since a hot standby can only service reads (and afaik this is not a
> well-supported option in PgSQL), wouldn't most of the performance issues,
> which appear write-related, remain? The process seriously needs to be
> reviewed here.
Based on our monitoring data we have vastly more reads than writes, which means
load balancing gets very interesting. Hot standby is also supported just fine
out of the box; you just need something third-party for the actual load
balancing.
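
To illustrate why the routing is the tricky part, here's a toy sketch only (not our actual setup; the hostnames and the psycopg2 usage are placeholders): reads can go to a hot standby, but writes have to hit the primary, and a naive statement-based split breaks down quickly.

```python
# Toy read/write split, assuming psycopg2 and two placeholder hosts.
# Real setups push this into a proxy/pooler (e.g. pgpool-II) rather than
# the application, which is the "third party" part mentioned above.
import psycopg2

PRIMARY_DSN = "host=db-primary.example dbname=app"
STANDBY_DSN = "host=db-standby.example dbname=app"

def connect_for(query):
    """Route plain reads to the standby, everything else to the primary."""
    is_read = query.lstrip().lower().startswith("select")
    return psycopg2.connect(STANDBY_DSN if is_read else PRIMARY_DSN)

def run(query, params=None):
    with connect_for(query) as conn:
        with conn.cursor() as cur:
            cur.execute(query, params)
            return cur.fetchall() if cur.description else None
```

Even this naive split misroutes things like `SELECT ... FOR UPDATE` or functions with side effects, which is why the routing usually lives in dedicated middleware rather than a few lines of application code.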
> And am I reading this right, the one and only production DB server was
> restarted to change a configuration value in order to try to make
> pg_basebackup work?
We suspect so. Chef handles restarting processes, and we think it's currently
still set to always do a hard restart instead of reloading whenever possible.
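
For what it's worth, PostgreSQL itself reports which parameters can be picked up with a reload and which need a full restart. A small sketch (psycopg2 with a placeholder DSN, not our actual Chef code) of the check a config-management hook could make:

```python
# Sketch: decide between reload and restart by asking pg_settings.
# The DSN is a placeholder, not a real production connection string.
import psycopg2

def restart_required(param, dsn="dbname=postgres"):
    """True if changing `param` needs a postmaster restart,
    False if a reload (SIGHUP) is enough."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT context FROM pg_settings WHERE name = %s", (param,))
            row = cur.fetchone()
    if row is None:
        raise ValueError("unknown parameter: %s" % param)
    # 'postmaster' parameters only take effect after a restart.
    return row[0] == "postmaster"

print(restart_required("max_connections"))              # True: restart needed
print(restart_required("log_min_duration_statement"))   # False: reload is enough
```

Note that max_connections itself is a 'postmaster'-context parameter, so that particular change would have needed a restart either way; the reload path mainly helps for the many parameters that only need a SIGHUP.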
> What impact did that have on the people trying to use the site a) while the
> database was restarting
A few minutes of downtime while the DB is unavailable.
> and b) while the kernel settings were tweaked to accommodate the too-high
> max_connections value?
No impact there. We've now reduced max_connections to a lower value (1000), so
we still have enough connections but don't have to tweak any kernel settings.
> Is it normal for GitLab to cause intermittent, few-minute downtimes like
> that? Or did that occur while the site was already down?
We've had a few too many cases like this in the past. We're aiming to resolve
those, but unfortunately this is rather tricky and time consuming.
> 3. Spam reports can cause mass hard deletion of user data?
Yes.
> Has this happened to other users?
Not that I know of.
> What's the remedy for wrongly-targeted persons?
A better abuse system, e.g. one that makes it easier to see _who_ was reported.
We're also thinking of adding a quorum-style feature: something like requiring
more than 3 people to approve before a user is removed.
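
The quorum idea could be as simple as refusing any destructive action until enough distinct people have signed off. A hypothetical sketch (the names and the threshold are illustrative, not actual GitLab code):

```python
# Hypothetical quorum gate for abuse-triggered removals; purely illustrative.
REMOVAL_QUORUM = 3

def record_report(reports, reported_user, reporter):
    """Record a distinct report; return True only once the quorum is met."""
    reports.setdefault(reported_user, set()).add(reporter)
    return len(reports[reported_user]) >= REMOVAL_QUORUM

reports = {}
for reporter in ("alice", "bob", "carol"):
    removable = record_report(reports, "reported-account", reporter)
print(removable)  # True only after the third distinct reporter
```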
> And is the GitLab employee's data gone now too?
No. The removal procedure was throwing errors, causing it to roll back its
changes. This kept happening, which prevented the user from being removed. So
ironically an error saved the day here.
> How could something so insufficient have been released to the public
Code is written by developers, and developers are human; humans in turn make
mistakes. Most of the project-removal code also existed before we started
enforcing stricter performance guidelines.
> and how can you disclose this apparently-unresolved vulnerability? By so
> doing, you're challenging the public to come and try to empty your database
There's no point in hiding it. Spend a few minutes digging through the code and
you'll find it, and probably plenty of other similar problems. If somebody
tries to abuse it we'll deal with it on a case-by-case basis.
> because LVM snapshots now occur hourly, and that it only takes 16 hours to
> transfer LVM snapshots between environments :)
LVM snapshots are stored on the host itself, so if e.g. db1 loses data we can
restore the snapshot in a few minutes. They only have to be transferred if we
want to recover other hosts. Furthermore, in the Azure ARM environment the file
transfer would be much faster than in the classic environment.
> 4. the PgSQL master deleted its WALs within 4 hours of the replica
> "beginning to lag" (<interrobang here>). That really needs to be fixed.
Yes, which is also something we're looking into.
> Again, you probably need a serious upgrade to your PgSQL server because it
> apparently doesn't have enough space to hold more than a couple of hours of
> WALs (unless this was just a naive misconfiguration of the
> [min|max]_wal_size parameter, like the max_connections parameter?)
Probably just a naive configuration value since we have plenty of storage
available.
> There were a few other things (including someone else downthread who pointed
> out that your CEO re-revealed your DB's hostnames in this write-up, and that
> they're resolvable via public DNS and have running sshds on port 22), but
> these are the big standouts for me.
Revealing hostnames isn't really a big deal, and neither is SSH running on port
22. In the worst case some bots will try to log in using "admin" usernames and
the like, which won't work. All hosts use public key authentication, and
password authentication is disabled.
> Not sure how fast your disks were, but 300GB gone in "a few seconds" sounds
> like a stretch.
Nope, after about 2 seconds the data was gone. Context: I ran said command.
> Some data may've been recoverable with some disk forensics.
When using physical disks not used by anything else, maybe. However, we're
talking about disks in a cloud environment. Are they actually physical? Are
they part of larger disks shared with other servers? Who knows. The chance of
recovering data with special tools in a cloud environment is basically zero.
> Especially if your Postgres server was running at the time of the deletion,
> some data and file descriptors also likely could've been extracted from
> system memory
That only works for files still held on to by PostgreSQL. PostgreSQL doesn't
keep all files open at all times, so it wouldn't help.
> Hot standby is also supported just fine out of the box, you just need something third party for the actual load balancing.
I'm aware that hot standby is supported, though it's not the default configuration for the standby server (default and safest is a standby mode that you can't query at all; hot standby introduces possible conflicts between hot read queries and write transactions coming in from the WAL, so if failover is your primary intention, you should be cold standbying). I'm saying that mixing read queries in and dispersing them over hot standbys is not well-supported, which is why you need third-party tools to do it.
It can also be risky if your replication lag gets out of control, and you've indicated that it easily does. PgSQL replication is eventually consistent and you risk returning stale data on reads, which could cause all sorts of havoc if it's not accounted for by the application internally.
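At a minimum, that lag needs to be watched continuously before standbys serve reads. A sketch of the standby-side check (psycopg2 with a placeholder DSN; the function exists in 9.x-era PostgreSQL):

```python
# Sketch: apparent replication lag, measured on the standby itself.
# Returns None if no WAL has been replayed yet, and overstates lag when the
# primary is idle; the DSN is a placeholder.
import psycopg2

def replication_lag_seconds(dsn="host=db-standby.example dbname=app"):
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())"
            )
            (lag,) = cur.fetchone()
    return lag

print(replication_lag_seconds())
```

Any read-routing layer should refuse to send queries to a standby whose lag exceeds whatever staleness the application can actually tolerate.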
> We've had a few too many cases like this in the past. We're aiming to resolve those, but unfortunately this is rather tricky and time consuming.
This may take some upfront work, but it's pretty routine. A serious commercial-level offering should not need to take itself offline without announcement in order to restart the single database server and apply a configuration tweak.
> Code is written by developers, and developers are humans. Humans in turn make mistakes. Most project removal related code also existed before we started enforcing stricter performance guidelines.
The point is not that humans make mistakes, nor that bugs exist. The point is that such a feature was released without considering its easily-exploitable potential and the permanent consequences of its exploitation (permanent removal of data). That should trigger a process review.
> There's no point in hiding it. Spending a few minutes digging through the code and you'll find it, and probably plenty other similar problems. If somebody tries to abuse it we'll deal with it on a case by case basis.
There's a lot of risk in drawing attention to this type of vulnerability. I think GitLab should be taking this more seriously. All code has bugs, but this isn't a bug; it's an incomplete, dangerously-designed feature that can easily be used by a malicious actor to permanently destroy large quantities of user data. Your CEO has just highlighted it before the whole world while it's still active and exploitable on the public web site.
Reading the code isn't a dead giveaway, because it takes a lot of effort to find the specific code in question and realize what it means, and because the general assumption would be that GitLab.com is running a souped-up or specialized flavor of the code and that such dangerous design flaws must have already been resolved on a presumably high-traffic site. However, this post highlights that they haven't been, and that's bad. This is effectively irresponsible self-disclosure of a very high-grade DoS exploit.
> Probably just a naive configuration value since we have plenty of storage available.
Having the storage readily available means that the hard part is already done! Each WAL segment is 16MB. You have about 350 GB of unused disk. Set wal_keep_segments and min_wal_size to something reasonable and you won't need to do this obviously-risky resync operation every time you have a couple of hours of heavy DB load.
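The arithmetic is straightforward (a rough sketch using the numbers from this thread; the concrete wal_keep_segments value is just an example, not a recommendation):

```python
# Back-of-the-envelope WAL retention math, using the default 16 MB segment
# size and the ~350 GB of unused disk mentioned above.
SEGMENT_MB = 16
FREE_GB = 350

print(FREE_GB * 1024 // SEGMENT_MB)     # 22400 segments would fit

# Reserving even a fraction of that, e.g. wal_keep_segments = 4096,
# pins roughly 64 GB of WAL on the primary for a lagging replica.
print(4096 * SEGMENT_MB / 1024, "GB")   # 64.0 GB
```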
> Revealing hostnames isn't really a big deal, neither is SSH running on port 22. In the worst case some bots will try to log in using "admin" usernames and the likes, which won't work. All hosts use public key authentication, and password authentication is disabled.
See discussion at https://news.ycombinator.com/item?id=13621027. The worst case is not a bruteforced login, it's an exploited daemon that leads to an exploited box that leads to an exploited network that leads to an exploited company. The secondary concern would be a DoS attack; everyone now knows that you have only one functioning database server that everything depends on, and that that server's IP is x.x.y.y. That's enough to cause trouble even without exploits or zero days.
> When using psycial disks not used by anything else, maybe. However, we're talking about disks used in a cloud environment. Are they actually physical? Are they part of larger disks shared with other servers? Who knows. The chance of data recovery using special tools in a cloud environment is basically zero.
Yes, this complicates things significantly. Something like EBS might be usable much like a dd image, though afaik there's no way to "pull the plug" on an EC2 server (maybe it's exposed through the API). I've never used Azure so I don't know whether this would be practicable there.
> That only works for files still held on to by PostgreSQL. PostgreSQL doesn't keep all files open at all times, so it wouldn't help.
Indeed. While PgSQL doesn't keep all files open at all times, it does keep some files open, and they may or may not have contained useful data. I personally would've also been interested in trying to freeze the memory state (something you can do with a lot of raw VMs that you can't do with physical servers, but admittedly probably not something the cloud provider exposes).
Thanks for pointing that out. It doesn't appear that this was clarified until a couple of hours after I posted my comment, but it's definitely a relief and the wise course of action.