> Its whole purpose in life is to automatically detect the version of PostgreSQL used in the existing PostgreSQL data directory, and automatically upgrade it (if needed) to the latest version of PostgreSQL.
In a small startup...
* If the data is mission-critical and constantly changing, PostgreSQL is a rare infra thing for which I'd use a managed service like AWS RDS, rather than just Debian Stable on EC2 or my own containers. The first time I used RDS, the company couldn't afford to lose an hour of data (it could destroy confidence in an enterprise customer's pilot project), and without RDS, I didn't have the time/resources to be nine-9s confident that we could do PITR if we ever needed to.
* If the data is less-critical or permits easy sufficient backups, and I don't mind a 0-2 year-old stable version of PG, I'd probably just use whatever PG version Debian Stable has locked in. And just hold my breath during Debian security updates to that version.
(Although I think I saw AWS has a cheaper entry level of pricing for RDS now, which I'll have to look into next time I have a concrete need. AWS pricing varies from no-brainers to lunacy, depending on specifics, and specifics can be tweaked with costs in mind.)
RDS is very useful for companies which are big enough to employ, say, 2 programmers, but still too small to employ a DBA.
The hard part of running a database, in my experience, isn't setting up or running it. The hard part isn't even configuring backups.
The hard part is noticing that your backups have been broken for years, before you actually need to restore from them. Yes, yes, you know how to do this correctly. But you delegated it to the sysadmin, the sysadmin subtly broke the backup scripts, the scripts have been silently doing nothing for 18 months, and then the sysadmin got a new job.
This is the main value proposition of RDS: Your data will be backed up, your backups will be restorable, and most of your normal admin tasks can be performed by pushing a button.
Exactly, on the backups. There was off-the-shelf open source PITR software that I looked at and could've configured, but I couldn't justify spending all the time to test that setup, given that (supposedly) RDS was rock-solid turn-key. There were other engineering and ops things that needed my time more.
In my startup (until our exit) I ran Postgres on my own, with streaming backups, daily backup tests (automatic restores and checks), offsite backups, and (slow) upgrades.
Data wasn't mission critical and customers could live with 5min downtimes for upgrades once a year.
No problem for years. When we had hosted Mongo, we had more problems.
If I have the money I'll use a managed database. But running Postgres yourself in a startup up to $10m ARR / ~1TB of data doesn't seem like a problem.
A script creates a database instance from a container, restores the backup into it, makes some checks (e.g. table sizes above some value, the audit table contains data up to the backup point, etc.), and sends out an email that everything looks OK.
Unzipping and decrypting the backup also confirms that encryption and compression worked, and that you didn't end up with a 0-byte backup file (because of a permissions issue, etc.).
The same goes, of course, for snapshots and backups in the cloud. I have several clients who had backup problems because of misconfiguration in AWS/GCP.
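As a concrete (hypothetical) example of that kind of nightly check, assuming plain SQL dumps and a throwaway container -- the table name, paths, addresses and the mail command here are all illustrative:

# Spin up a throwaway instance and restore last night's dump into it
docker run -d --name restore-test -e POSTGRES_PASSWORD=test postgres:15-alpine
sleep 15
gunzip -c /backups/latest.sql.gz | docker exec -i restore-test psql -U postgres
# Sanity check: the audit table must contain rows up to the backup point
ROWS=$(docker exec restore-test psql -U postgres -tAc \
    "SELECT count(*) FROM audit_log WHERE created_at > now() - interval '1 day'")
if [ "$ROWS" -gt 0 ]; then
    echo "Backup restore check OK" | mail -s "backup check" ops@example.com
else
    echo "Backup restore check FAILED" | mail -s "backup check" ops@example.com
fi
docker rm -f restore-test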
Super. I have been thinking of doing something similar for my production DB systems as well, but was struggling to come up with an efficient way to test restoration. This gives me a starting point.
You will be very happy when restoration testing fails. Better to find out in a test than - as some of my clients have - when restoring a backup in an emergency.
While true, you'd think something as important as a DB would be something worthy of creating some kind of nightly automated testing/verification of a restore procedure (or multiple).
In my experience bootstrapping a startup, RDS is really expensive. Definitely a nice thing, but if you're a one-person show trying to get to ramen profitable, RDS might not really make sense. If you are VC backed then go ahead!
Many things. Here's a list off the top of my head:
MySQL RDS acts like it's just regular MySQL minus SUPER privilege. So it will not accept various important inputs. For example, GTID state will be rejected. Even just pre-GTID log sequence numbers are barely supported. Various SQL SECURITY things are rejected. The suggestions for importing data using any of these features vary from scary nonsense (just ignore GTID numbers!) to absurd hacks (run sed on your mysqldump output to remove SQL SECURITY!).
The docs basically don't acknowledge that GTID matters in a RDS-to-or-from-non-RDS setup. The suggestions don't seem like they deserve to work. (Azure at least has some documentation for GTID, but it involves using fancy barely-documented APIs just to import your data.)
Replication will just break if you accidentally use a feature that RDS can't handle.
For something that could easily cost hundreds to thousands of dollars per month, I expected to be able to run an unmodified mysqldump and have RDS accept the output, process it correctly, and take my money. Nope, didn't happen.
I personally feel like upgrading a database should be an explicit admin process, and isn't something I want my db container entrypoint automagically handling.
This container came about for the Redash project (https://github.com/getredash/redash), which had been stuck on PostgreSQL 9.5 (!) for years.
Moving to a newer PostgreSQL version is easy enough for new installations, but deploying that kind of change to an existing userbase isn't so pretty.
For people familiar with the command line, PostgreSQL, and Docker, it's no big deal.
But a large number of Redash deployments seem to have been done by people not skilled in those things. "We deployed it from the Digital Ocean droplet / AWS image / (etc)".
For those situations, something that takes care of the database upgrade process automatically is the better approach. :)
I disagree. It is (or was, last time I tried a year or two ago) hard to upgrade Postgres because you need both the old and new binaries. The package manager tries its best to allow just one copy. You end up copying from one image to another, then running into tiny breaking differences.
I agree it should be explicitly invoked and not automated. For something almost everyone needs to do, it sure is a hard task.
You should create both dirs first, then check that they really exist, and only then move the files. Shit happens and it's better to be safe than sorry.
Also, I would replace ../old and ../new with $OLD and $NEW to have a bit less clutter in the next block with the explicit upgrade calls, i.e.:
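Something along these lines (just a sketch of the idea, not the actual script):

OLD="$PGDATA/old"
NEW="$PGDATA/new"
mkdir "$OLD" "$NEW"
# check both dirs really exist before touching any data
[ -d "$OLD" ] && [ -d "$NEW" ] || exit 1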
Nah. The concept of verifying that the mkdirs worked is sound and also easy to do. They just need doing individually in the script, as doing both in the same spot would muck up the wildcard a bit later on.
Using variable names better ($OLD, $NEW) is a good idea too. Should cut down any potential typo risk as well. :)
If you prefer to work with the dirs under pgdata (which makes sense...), you can just make a list of the files first (maybe even write it to a file, or even write a batch file which would move them) and use that list for moving the data from pgdata to old. This saves an unnecessary move.
But it's specifically to do wildcard expansion of the quoted string, and the shell interpreter is happy with it. I'm open to suggestions for improvements though. :)
---
This is confusing to me:
> ... you can just make the list of files (maybe even write it to the file, maybe even write a batch file which would move them) and use it for moving the data from pgdata to old. This saves an unnecessary move.
I'm not understanding what you're meaning here.
I understand the "make a list of files" bit, but I'm not grokking why doing that is an improvement, and I'm not seeing where there's an unnecessary move that could be eliminated?
The pg_upgrade process is pretty much:
1. Initialise a fresh data directory using the new PostgreSQL version
2. Run pg_upgrade, pointing at both the old and new data directories
3. Start the database using the new data directory
For the "automatic upgrade container" purposes, we need to do everything under the single mount point so the "--link" option to pg_upgrade is effective and uses hard links.
Thus the "move things into an 'old' directory" first, then the "move the converted 'new' data into place" bit afterwards.
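As a rough shell sketch of those steps (the binary paths for the old version here are assumptions, not the container's actual layout):

OLD="$PGDATA/old"
NEW="$PGDATA/new"
mkdir "$OLD" "$NEW"
# ...move the existing cluster files from $PGDATA into $OLD here,
# keeping everything under one mount point so --link can hard-link...
# 1. Initialise a fresh data directory with the new binaries
initdb -D "$NEW"
# 2. Run pg_upgrade, pointing at both data dirs, in place via hard links
pg_upgrade --link \
    --old-datadir "$OLD" --new-datadir "$NEW" \
    --old-bindir /usr/local/pg-old/bin --new-bindir /usr/local/bin
# 3. Move the upgraded cluster into place, then start PostgreSQL as usual
mv "$NEW"/* "$PGDATA"/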
If this were PowerShell, I would just get the list of files in $PGDATA, create the folders, and then move the files in the list, i.e.:
$files = Get-ChildItem $PGDATA
try {
    # -ItemType Directory is needed, otherwise New-Item creates plain files
    New-Item "$PGDATA/old" -ItemType Directory -ErrorAction Stop
    New-Item "$PGDATA/new" -ItemType Directory -ErrorAction Stop
}
catch {
    throw "Failed to create the necessary dirs"
}
try {
    $files | Move-Item -Destination "$PGDATA/old" -ErrorAction Stop
}
catch {
    # throw "Failed to move pg_data, your databases are now borked, good luck"
    # Roll back: move whatever landed in 'old' back into $PGDATA
    Get-ChildItem "$PGDATA/old" | Move-Item -Destination $PGDATA
}
It's way more streamlined, and if creating the dirs fails (especially `new`) then it fails before moving any data. And you don't need to `set +e` in this part.
I tried to replicate this in Linux and... it's a mess.
`find` includes the directory in the list, `ls -1` does the thing but bash stores its output as a single string, redirecting it to a file gets that file included in the list... I even tried xargs, but quickly abandoned the idea. Though if you can create the redirected output file in some other place than $PGDATA (`/tmp` perhaps?) then the `ls -1` trick would work.
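For what it's worth, bash can be coaxed into it (a sketch, assuming GNU find/mv, and that old/new get created under $PGDATA first):

# Capture the existing entries in an array *before* creating old/new,
# so the new dirs (and no temp file) ever end up in the list
files=("$PGDATA"/*)
mkdir "$PGDATA/old" "$PGDATA/new" || exit 1
mv -- "${files[@]}" "$PGDATA/old/"

# Alternative: enumerate only the top-level entries, excluding the fresh dirs
find "$PGDATA" -mindepth 1 -maxdepth 1 ! -name old ! -name new \
    -exec mv -t "$PGDATA/old/" -- {} +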
This is a very confusing title and project at first glance. It took me a minute to figure out what you’re doing…
But just to be clear — the point of this is to migrate an existing Postgres DB from the original version to a newer version? This is about the data volume itself, not automatically upgrading the Postgres binary when a new version is released?
So, you’d basically pass a flag that says --migrate-db when starting the container to kick-start changing the data on disk? So when you start a new Postgres 15 container, you could pass it a volume with Postgres 13 data and it would auto-update the on-disk DB data.
Postgres docs do use the word upgrade, for example:
> pg_upgrade (formerly called pg_migrator) allows data stored in PostgreSQL data files to be upgraded to a later PostgreSQL major version without the data dump/restore typically required for major version upgrades
Right, but you’re not actually automatically migrating a Postgres 13 program to Postgres 15 in a running container. That was what I first thought when I read the title.
You’re using the word “upgrade” differently than how I normally think about it when talking about version changes. You’re talking about data-on-disk, not just the program version.
(I realize that when talking about Postgres, the two are linked, but that’s not the case for most programs)
I've had issues in the past with PostgreSQL as a backend for NextCloud (all running in docker) and blindly pulling the latest PostgreSQL and then wondering why it didn't work when the major version jumped. (It's easy enough to fix once you figure out why it's not running - just do a manual export from the previous version and import the data into the newer version).
However, does this container automatically backup the data before upgrading in case you discover that the newer version isn't compatible with whatever is using it?
> However, does this container automatically backup the data before upgrading ...
Nope. This container is the official Docker Postgres 15.3-alpine3.18 image, plus older versions of PostgreSQL compiled into it, and some pg_upgrade scripting added to the Docker entrypoint script to run the upgrade before starting PostgreSQL.
It goes out of its way to use the "--link" option when running pg_upgrade, to upgrade in-place and therefore avoid making an additional copy of the data.
That being said, this is a pretty new project (about a week old on GitHub), and having some support for making an (optional) backup isn't a bad idea.
I'll have to think on a good way to make that work. Probably needs to check for some environment variable as a toggle or something, for the people who want it... (unsure yet).
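A rough sketch of what that might look like in the entrypoint script (the PGAUTO_BACKUP variable name is made up here, not an existing option of the container):

# Hypothetical opt-in backup before running pg_upgrade
if [ "${PGAUTO_BACKUP:-no}" = "yes" ]; then
    BACKUP_FILE="/var/lib/postgresql/pre-upgrade-$(date +%Y%m%d-%H%M%S).tar.gz"
    tar -czf "$BACKUP_FILE" -C "$PGDATA" .
    echo "Pre-upgrade backup written to $BACKUP_FILE"
fi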
> Recover: This command is used to recover a delayed replica to a specified point in time between now and the database's delay amount. The time can be given in the YYYY-mm-ddTHH:MM format. The latest keyword can also be used to recover the database up to the latest transaction available.
> The solution architecture chosen here is a 'Traditional backup with WAL streaming' architecture implementation (Backup via rsync/SSH + WAL streaming). This is chosen as it provides incremental backup/restore & a bunch of other features.
I think following MariaDB's lead is a good idea: they have an environment variable to enable automatic upgrades, and it creates a backup before every upgrade.
Works like a charm.
Just to be clear, this isn't for actually running postgres to serve your database, right? This is just more or less a tool to upgrade your data files to a new postgres version?
It does both. As part of the startup process it checks whether the data files are for an old version, and upgrades them first (if needed).
Then it starts PostgreSQL 15.3 as per normal.
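In shell terms, that check boils down to something like this (a sketch rather than the container's actual entrypoint code; PG_VERSION is the file PostgreSQL keeps in its data directory):

# PG_VERSION holds the major version of the cluster that owns $PGDATA
if [ -s "$PGDATA/PG_VERSION" ] && [ "$(cat "$PGDATA/PG_VERSION")" != "15" ]; then
    echo "Old data directory detected, running pg_upgrade first..."
    # ... pg_upgrade --link steps happen here ...
fi
exec docker-entrypoint.sh postgres   # then hand off to the normal startup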
It's intended as a drop-in replacement for existing (alpine based) PostgreSQL containers.
That being said, it's still pretty new so don't use it on production data until you've tested it first (etc).
--
Btw, it uses the "--link" option when it runs pg_upgrade so should be reasonably suitable even for databases of a fairly large size. That option means it processes the database files "in place" to avoid needing to make a 2nd copy.
At the moment it only processes a single data directory, so if you have multiple data directories then it's not yet suitable.
We have been maintaining a manual migration script for our Docker users for the DB part, while the app part (Django) already does migrations automatically, so making the DB part more built-in makes a lot of sense.
Our case doesn't really need live/fully auto, just an easy mode for admins during upgrade cycles, and our scripts were pretty generic, so a project like this makes a lot of sense. There are a few modes we support - local / same-server, cross-node, etc - so I'm curious.
useful, but it's surprising that docker has not standardized upgrade / migration
the closest thing to this is helm which like everything in the k8s ecosystem is either a useful tool or a single feature wrapped in a cursed configuration language, depending on your perspective
not hard to imagine a single command to snapshot the volume, boot the new version, run a migration command, revert if something fails
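something along these lines, say (purely illustrative: the names are made up, and the "snapshot" is just a copy, since plain docker has no native volume snapshots):

# "snapshot" the data volume by copying it into a scratch volume
docker run --rm -v pgdata:/from -v pgdata_snapshot:/to alpine sh -c 'cp -a /from/. /to/'
# boot the new version against the live volume and let it run its migration
docker run -d --name db-new -v pgdata:/var/lib/postgresql/data my-db-image:new
# if the migration fails, copy the snapshot back and start the old image again
# docker run --rm -v pgdata_snapshot:/from -v pgdata:/to alpine sh -c 'rm -rf /to/* && cp -a /from/. /to/'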
So we come full circle, it seems. Apt update/upgrade already does this and has been doing so for a few years now. So if I drop Docker and just run a container with Debian, I get this with a few years of testing behind it.
This one seems like a good proof of concept, but appears to need running as a separate task for upgrading rather than being an "automatic upgrade before starting the database server" thing?
But I was bitten in the rear by Wallabag the other day. Didn't work anymore. Ended up having to do a doctrine upgrade. Then it all worked. Same concept I guess. No automatic upgrade does allow you to pause and evaluate what you're doing, and take backups.
Now I say it allows you. Whether you (or I) do so... after all, that is what automatic backups are for...
I've just pushed an initial build with a tag for PostgreSQL 15:
pgautoupgrade/pgautoupgrade:15-alpine3.8-v1
The "-v1" text fragment on the end is a version number because I'm still improving the scripting for the upgrade part. So that'll likely increment over time (-v2, -v3, -v4, etc) until it stabilises.
The way you handle persistence is by using storage volumes mounted from the outside. Don't put the data in the container, only the software, which can then be replaced. Then managing it is similar to other environments. For updating, you don't replace rpm/deb packages and restart; you replace the container using the same volume, which should trigger the DBMS-specific update routine. Similarly, you do backups much as you would outside the container world (depending on the DBMS and your exact choice, from filesystem snapshots to a logical backup (dump) or something more or less smart in between).
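For example (volume and image names here are just placeholders):

# The data lives in the named volume "pgdata", not inside the container
docker run -d --name pg -e POSTGRES_PASSWORD=change-me \
    -v pgdata:/var/lib/postgresql/data postgres:13-alpine
# Upgrading means replacing the container while reusing the same volume
docker rm -f pg
docker run -d --name pg -e POSTGRES_PASSWORD=change-me \
    -v pgdata:/var/lib/postgresql/data pgautoupgrade/pgautoupgrade:15-alpine3.8-v1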
No, not the same volume, a volume. Each container can mount an arbitrary amount of volumes. It scales as well as your machine scales.
A container is nothing but a process with restrictions on which filesystem subtrees it can see, what resources (CPU, memory) it may use and which networks it can access, with some tooling to manage self-contained images of directory structures.
> I've not done stuff with kubernetes yet though, so I have no idea how it's done there.
Essentially the same, except that K8s gives you a wide variety of storage backend integrations (Storage classes + storage providers) which can attach "anything" (local volumes on the node, NFS, NAS, Cloud Volumes, ...) depending on your local environment and needs.
A lot of people running on-prem k8s clusters have block storage. When I worked on OpenShift it wasn't uncommon for people to run databases in the cluster, backed by their block storage.
If you're running in the cloud, say on AWS EKS, it makes sense to use in-cluster databases for development environments, and only use RDS DBs for production/integration to save on hosting costs.
There is a huge push for doing that. Whether it is, in the abstract, the right thing can be questioned, but many IT departments have decided to standardize on Kubernetes for all datacenter management and push that way, and in some environments (5G networking) it's part of the specified stack.
Saving it on one host's local filesystem doesn't feel particularly production-ready. There is a distributed store system for Kubernetes called "Longhorn" that I've heard good things about, but I haven't really looked into it much myself. I just run a pair of VMs with a manual primary/replica setup and have never needed to fail over to the replica yet, but I can imagine some sort of fully orchestrated container solution in the future.
I'm just pointing out how it's commonly done. Of course people add things like replication, distributed filesystems, (etc) to the mix to suit their needs. :)
Yep, it seems like the most common answer is “pay exorbitant prices to your cloud provider for a managed SQL database”, but we’ve managed to save a chunk of money running it on our own. I’ve always said that between three engineers (me being one of them), we can form one competent DBA, but our needs are also pretty modest.
This is the opposite. The container is not mutable, but the datastore is. This alleviates the need to run Postgres X and Postgres X+1 to export from X and import into X+1.
For the uninitiated: Fairly certain this can only handle minor version upgrades... which is generally as easy as moving your data folder beside the new binary and starting postgres (minor version upgrades are already trivial).
Major versions still require a pg_dump (or a scary migration using postgres' anemic logical replication) unless some advancement has happened on the postgres side I'm unaware of.
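For reference, the dump/restore route for a major version bump is roughly this (host names are placeholders):

# Classic major-version upgrade via dump and restore
pg_dumpall -h old-db-host -U postgres > dump.sql
psql -h new-db-host -U postgres -f dump.sql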