>> Why did replication stop? - A spike in database load caused the database replication process to stop. This was due to the primary removing WAL segments before the secondary could replicate them.
Is this a bug/defect in PostgreSQL then? Incorrect PostgreSQL configuration? Insufficient hardware? What was the root cause of Postgres primary removing the WAL segments?
PgSQL, Mongo, and MySQL all use a transaction stream like this for replication and they all have to put some kind of cap on it or risk running out of disk space, but the cap should be made sufficiently large to allow automatic resumption of disconnected slaves without manual redumping, except in extraordinary circumstances. Log retention should be long enough to last at least a long weekend so that someone can come in and poke the DB back into action on Tuesday morning, but preferably more like 1 week. Alarms should be configured to fire well before replication lag gets anywhere near the log expiration timeout.
In particular, PostgreSQL has a feature that allows automatic WAL archiving (i.e., it confirms that the WAL has been successfully shipped to a separate system before it removes it from the master) and a feature called "replication slots" that ensures that all WALs are kept if a regular subscriber is offline. If either of these features had been correctly configured, there would've been no need to do a full resync; the secondary database would've come back and immediately picked up where it left off.
Additionally, if one must resync the full database (and I've had to do this many times), tools like pg_basebackup and innobackupex are basically required to consistently perform the process of pulling the master dumps, and the old (unsynced) data directory should be allowed to linger until the full master snapshot has been fully confirmed and is ready to resync. It's very reckless to go around removing binary data directories until you're certain that the new stuff is running, even if you're "just on the replicant".
With pg_basebackup, you run it on the replicant server and it streams down the files, no need to log into the master server at all. With innobackupex, you need to have read access to the master's binary data directory, but should achieve this safely through something like a read-only NFS mount. mydumper is a possible alternative to innobackupex that tries to capture the binlog coords and doesn't require any direct access to the host beneath the database server.
+1 for replication slots. Just remember to remove the slot if you decommission a server; otherwise storage on the master will grow forever.
innobackupex works fine locally on the server, streaming out to netcat or ssh on the remote side. Nothing wild like read only NFS required. It also copies all binlogs. Mydumper is pretty old at this point and doesn't do most of the things innobackupex can. I wouldn't recommend it.
Wow, thanks. This is like the best answer I've ever seen. You absolutely nailed it.
Are you by any chance looking for any DevOps/Ops consulting? I just founded my third startup Elastic Byte (https://elasticbyte.net) and always looking for smart people. We're a consulting startup that helps companies manage their cloud infrastructure.
I'm flattered, but I'm sure there are better options out there, and I'm swamped as it is anyway. ;) I'll keep it in mind and reach out if something changes.
I do think that's a great startup, though, and this post-mortem and incident only proves how badly it's needed. A lot of people already think they're getting something like what you're offering when they sign up with a cloud provider.
There's always some limit. At $PREVIOUS_JOB I think it was at least 48 hours, probably over 72 hours, of replication log retention (usually measured in GiB though). So it's surprising that in GitLab's case it must have been less than 6 hours (IIRC from the original google doc the slave had more than 4 hours replication lag due to load, initially ...)
>> Why did replication stop? - A spike in database load caused the database replication process to stop. This was due to the primary removing WAL segments before the secondary could replicate them.
Is this a bug/defect in PostgreSQL then? Incorrect PostgreSQL configuration? Insufficient hardware? What was the root cause of Postgres primary removing the WAL segments?