Agreed. This reads like the $1 million system did what the company needed it to do (safely archive code to prevent more loss of old games), but it didn't do exactly what this developer wanted.
There are various good points scattered in the article for the author’s specific use case, but it’s written as if the entire company was mistaken to not make this decision revolve around this one developer.
Hi, author here. Thanks for reading my poorly written rant!
The issue was that nobody ever said what the system was designed for, and equally nobody was open to the idea of doing things differently until things went pop.
The main thesis of the article is supposed to be that it doesn't matter what something costs or how much money is invested if it doesn't solve your problem.
I wrote it a long time ago in a fit of aggravation over someone on Hacker News waxing poetic about how much Amazon invested in their security, without much regard for what it was actually spent on.
Sorry, but it just looks like you picked the wrong solution from the start and stuck with it.
Vanilla pgsql backup + WAL shipping would just. work. with those constraints.
Backup solutions that take a while to read data back and only "ramp up" once you start restoring a full backup are nothing new; they have been in the industry for decades, first in the form of tape libraries, now in the form of Amazon Glacier and the like. Hell, a backup solution that lets you mount the whole backup as a directory is on the fancier side, because generally priorities are elsewhere.
Very likely you're right. As mentioned in another comment (and in the article, I think), I didn't know the semantics of the backup system; I'd just tested with a couple of 400G HDDs. Then the NFS endpoint I was given to replace those drives behaved differently, and a deep investigation across wide timezones and uncommunicative teams began.
We have the benefit of hindsight now, so things can be clearer than they were at the time. Nonetheless:
WAL shipping without ever reading it back sounds hopeful; I don't personally believe in backups that are never verified.
It's particularly nice because it allows point-in-time recovery - you can tell PostgreSQL to replay WALs up to a given point, so if, say, corruption happened because of some code bug, you could play the database forward to the minute before it.
The backup process has two parts:
* archive WAL segments as they come in - PostgreSQL has a hook (archive_command) that runs a program of your choice for each finished WAL segment, so you just point it at whatever you want to back them up with
* take a base backup - tell PostgreSQL a backup is starting (so the copied files can be made consistent from the WALs later), copy the database directory files, then tell it the backup is done. No need for anything fancy like filesystem snapshots.
So it's really just copying files.
Restore is just restoring the above and feeding it WALs up to the chosen point - roughly like the sketch below.
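Concretely, something like this (a minimal sketch of the whole loop - the paths, date, and service name are made up, and the recovery.signal bit assumes PostgreSQL 12 or newer):

    # --- postgresql.conf on the primary: ship every finished WAL segment ---
    # wal_level = replica
    # archive_mode = on
    # archive_command = 'test ! -f /backup/wal/%f && cp %p /backup/wal/%f'

    # --- periodic base backup: really just copying the data files ---
    pg_basebackup -D /backup/base/$(date +%F) -Fp -X stream -P
    # (or the low-level route: SELECT pg_backup_start('nightly'); copy the data dir; SELECT pg_backup_stop();
    #  on versions before 15 those functions are pg_start_backup()/pg_stop_backup())

    # --- point-in-time restore onto a fresh server ---
    systemctl stop postgresql
    rm -rf /var/lib/postgresql/data/*
    cp -a /backup/base/2024-01-01/. /var/lib/postgresql/data/
    echo "restore_command = 'cp /backup/wal/%f %p'" >> /var/lib/postgresql/data/postgresql.conf
    echo "recovery_target_time = '2024-01-02 03:04:05'" >> /var/lib/postgresql/data/postgresql.conf
    touch /var/lib/postgresql/data/recovery.signal   # PostgreSQL 12+; older versions use recovery.conf
    systemctl start postgresql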
We also run a slave, so the master being shot would only kill running transactions. Fancier setups I've seen also run a "delayed slave" - a slave that replays non-current WAL and so basically presents a view of the database from, say, 1 hour or 1 day ago. That way, if something fucks up the DB, you already have a server that is running; you just need to replay WALs to the chosen point.
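A delayed slave is just a normal streaming standby with replay held back. A rough sketch, assuming PostgreSQL 12+ and made-up host/user names:

    # clone the primary; -R writes primary_conninfo and standby.signal for you
    pg_basebackup -h primary.example.com -U replicator -D /var/lib/postgresql/data -X stream -R -P
    # hold replay back by a day so the standby always shows yesterday's data
    echo "recovery_min_apply_delay = '1d'" >> /var/lib/postgresql/data/postgresql.conf
    systemctl start postgresql

If something corrupts the primary, you can pause replay on the delayed standby (pg_wal_replay_pause()) before the bad WAL gets applied, then replay up to the point you want and promote it.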
> I don't personally believe in backups that are never verified.
We ended up making a backup-job lottery: pick a job out of the system and send an email to the ticketing system - "hey, admin, restore this job for testing". So far it has worked.
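The lottery part can be as small as a cron script like this - the job list file and helpdesk address are stand-ins for whatever your backup system and ticketing actually use:

    #!/usr/bin/env bash
    # monthly restore-test lottery: pick one backup job at random and file a ticket
    set -euo pipefail
    job=$(shuf -n 1 /etc/backup/jobs.txt)
    printf 'Hey admin, please restore backup job "%s" onto the test host and confirm the data looks sane.\n' "$job" \
      | mail -s "Restore-test lottery: $job" helpdesk@example.com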
One system also has an indirectly tested restore, as the production database is routinely anonymized and fed to the dev server.
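That kind of indirect test can be as simple as restoring the latest backup dump into dev and scrubbing it - the dump path, hostnames, and anonymize.sql here are hypothetical:

    # refresh dev from the most recent production backup, then anonymize
    dropdb   -h dev-db.example.com -U app --if-exists appdb
    createdb -h dev-db.example.com -U app appdb
    pg_restore -h dev-db.example.com -U app -d appdb /backup/dumps/appdb-latest.dump
    psql -h dev-db.example.com -U app -d appdb -f anonymize.sql   # scrub real customer data before devs see it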
We've also baked it into the automation we use for deploying stuff, so for most things not backing up is harder than backing up. Still, accidents have happened...
Yes, and that's how I know you've never worked in the game industry. A lot of companies use Perforce; Perforce is a source control system that also holds assets and metadata, all of which you also have to back up and query.
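For what it's worth, the usual shape of a Perforce backup is a metadata checkpoint plus the versioned file archive - a very rough sketch (the paths are made up, and exact p4d flags depend on your version and layout):

    p4d -r /p4/root -jc                                            # checkpoint the metadata db and rotate the journal
    rsync -a /p4/root/checkpoint.* /p4/root/journal.* /backup/p4/meta/
    rsync -a /p4/root/depot/ /backup/p4/depot/                     # the versioned file content itself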
I'm a bit green when it comes to DB backups. Do you have anything I can read on this approach? It sounds really promising, but I don't think I understand it.
I got that from the article, and it didn't come across as too ranty to me - just standard talking about what a pain something was to do in an organisation.
Maybe it's a different-country thing: devs in the US might be expected to be a lot more positive, while in other countries we find complaining cathartic, and this comes across as standard chat about work.