Agreed. This reads like the $1 million system did what the company needed it to do (safely archive code to prevent more loss of old games), but it didn't do exactly what this developer wanted.
There are various good points scattered in the article for the author’s specific use case, but it’s written as if the entire company was mistaken to not make this decision revolve around this one developer.
Hi, author here. Thanks for reading my poorly written rant!
The issue was that nobody ever said what the system was designed for, and equally nobody was open to the idea of doing things differently until things went pop.
The main thesis of the article is supposed to be that it doesn't matter what something costs or how much money is invested if it doesn't solve your problem.
I wrote it a long time ago in a fit of aggravation over someone on Hacker News waxing poetic about how much Amazon invested in their security, without much regard for what it was actually spent on.
Sorry, but it just looks like you picked the wrong solution from the start and stuck with it.
Vanilla pgsql backup + WAL shipping would just. work. with those constraints.
Backup solutions that take a while to read data back and only "ramp up" once you start restoring a full backup are nothing new; they have been in the industry for decades, first in the form of tape libraries, now in the form of Amazon Glacier and the like. Hell, a backup solution that lets you mount the whole backup as a directory is on the fancier side, because generally priorities are elsewhere.
Very likely you're right. As mentioned in another comment (and in the article, I think), I didn't know the semantics of the backup system; I'd just tested with a couple of 400G HDDs. Then the NFS endpoint I was given to replace those drives behaved differently, and a deep investigation across wide timezones and uncommunicative teams began.
We have the benefit of hindsight now, so things can be clearer than they were at the time. Nonetheless:
WAL shipping without ever reading it back sounds hopeful; I don't personally believe in backups that are never verified.
It's particularly nice because it allows point-in-time recovery - you can tell PostgreSQL to replay WALs up to a given point, so if, say, corruption happened because of some code bug, you could play the database forward to the minute before it.
The backup process has two parts:
* archive WAL segments as they come in - PostgreSQL has a hook (archive_command) that runs a program of your choice for each finished WAL segment, so you just point it at whatever you want to back them up with
* take a base backup - tell PostgreSQL a backup is starting (so the copied files can be made consistent from the WALs later), copy the database directory files, then tell it the backup is done. No need for anything fancy like filesystem snapshots.
So it's really just copying files.
Restore is just restoring the above and feeding it WALs up to the chosen point - roughly like the sketch below.
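Concretely, something like this (a minimal sketch of the whole loop - the paths, date, and service name are made up, and the recovery.signal bit assumes PostgreSQL 12 or newer):

    # --- postgresql.conf on the primary: ship every finished WAL segment ---
    # wal_level = replica
    # archive_mode = on
    # archive_command = 'test ! -f /backup/wal/%f && cp %p /backup/wal/%f'

    # --- periodic base backup: really just copying the data files ---
    pg_basebackup -D /backup/base/$(date +%F) -Fp -X stream -P
    # (or the low-level route: SELECT pg_backup_start('nightly'); copy the data dir; SELECT pg_backup_stop();
    #  on versions before 15 those functions are pg_start_backup()/pg_stop_backup())

    # --- point-in-time restore onto a fresh server ---
    systemctl stop postgresql
    rm -rf /var/lib/postgresql/data/*
    cp -a /backup/base/2024-01-01/. /var/lib/postgresql/data/
    echo "restore_command = 'cp /backup/wal/%f %p'" >> /var/lib/postgresql/data/postgresql.conf
    echo "recovery_target_time = '2024-01-02 03:04:05'" >> /var/lib/postgresql/data/postgresql.conf
    touch /var/lib/postgresql/data/recovery.signal   # PostgreSQL 12+; older versions use recovery.conf
    systemctl start postgresql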
We also run a slave, so the master being shot would only kill running transactions. Fancier setups I've seen also run a "delayed slave" - a slave that replays non-current WAL and so basically presents a view of the database from, say, 1 hour or 1 day ago. That way, if something fucks up the DB, you already have a server that is running; you just need to replay WALs to the chosen point.
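A delayed slave is just a normal streaming standby with replay held back. A rough sketch, assuming PostgreSQL 12+ and made-up host/user names:

    # clone the primary; -R writes primary_conninfo and standby.signal for you
    pg_basebackup -h primary.example.com -U replicator -D /var/lib/postgresql/data -X stream -R -P
    # hold replay back by a day so the standby always shows yesterday's data
    echo "recovery_min_apply_delay = '1d'" >> /var/lib/postgresql/data/postgresql.conf
    systemctl start postgresql

If something corrupts the primary, you can pause replay on the delayed standby (pg_wal_replay_pause()) before the bad WAL gets applied, then replay up to the point you want and promote it.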
> I don't personally believe in backups that are never verified.
We ended up making a backup-job lottery: pick a job out of the system and send an email to the ticketing system - "hey, admin, restore this job for testing". So far it has worked.
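The lottery part can be as small as a cron script like this - the job list file and helpdesk address are stand-ins for whatever your backup system and ticketing actually use:

    #!/usr/bin/env bash
    # monthly restore-test lottery: pick one backup job at random and file a ticket
    set -euo pipefail
    job=$(shuf -n 1 /etc/backup/jobs.txt)
    printf 'Hey admin, please restore backup job "%s" onto the test host and confirm the data looks sane.\n' "$job" \
      | mail -s "Restore-test lottery: $job" helpdesk@example.com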
One system also has an indirectly tested restore, as the production database is routinely anonymized and fed to the dev server.
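That kind of indirect test can be as simple as restoring the latest backup dump into dev and scrubbing it - the dump path, hostnames, and anonymize.sql here are hypothetical:

    # refresh dev from the most recent production backup, then anonymize
    dropdb   -h dev-db.example.com -U app --if-exists appdb
    createdb -h dev-db.example.com -U app appdb
    pg_restore -h dev-db.example.com -U app -d appdb /backup/dumps/appdb-latest.dump
    psql -h dev-db.example.com -U app -d appdb -f anonymize.sql   # scrub real customer data before devs see it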
We've also baked it into the automation we use for deploying stuff, so for most things not backing up is harder than backing up. Still, accidents have happened...
Yes, and that's how I know you've never worked in the game industry. A lot of companies use Perforce; Perforce is a source control system that also holds assets and metadata, all of which you also have to back up and query.
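For what it's worth, the usual shape of a Perforce backup is a metadata checkpoint plus the versioned file archive - a very rough sketch (the paths are made up, and exact p4d flags depend on your version and layout):

    p4d -r /p4/root -jc                                            # checkpoint the metadata db and rotate the journal
    rsync -a /p4/root/checkpoint.* /p4/root/journal.* /backup/p4/meta/
    rsync -a /p4/root/depot/ /backup/p4/depot/                     # the versioned file content itself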
I'm a bit green when it comes to DB backups. Do you have anything I can read on this approach? It sounds really promising, but I don't think I understand it.
I got that from the article, and it didn't come across as too ranty to me - just standard talking about what a pain something was to do in an organisation.
Maybe it's a different-country thing: devs in the US might be expected to be a lot more positive, while in other countries we find complaining cathartic, and this comes across as standard chat about work.