Zero downtime Postgres migration, done right

chtitux · on June 11, 2021

If you can afford a one off 1 second of latency for your SQL queries, then using logical replication with pgbouncer seems way easier :

- setup logical replication between the old and the new server (limitations exist on what is replicated, read the docs)

- PAUSE the pgbouncer (virtual) database. Your app will hang, but not disconnect from pgbouncer

- Copy the sequences from the old to new server. Sequences are not replicated with logical replication

- RESUME the pgbouncer virtual database.

You're done. If everything is automated, your app will see a temporary increase of the SQL latency. But they will keep their TCP connections, so virtually no outage.

fullstop · on June 11, 2021

This works wonderfully. If you have any long running queries, though, the PAUSE won't pause until they have finished.

I love pgbouncer, it is such a great tool.

fierro · on June 12, 2021

this is interesting. So you have a weakest link kind of problem here.

nijave · on June 13, 2021

You can temporarily reduce query timeout to a smaller setting as part of the automated failover. The long running transactions will fail but you can minimize the window where you can't talk to postgres

fullstop · on June 12, 2021

Not really, new connections will block as it's pausing. But you won't be able to shut down Postgres until those long queries complete. Perhaps I was not super clear, but what I'm trying to say is that PAUSE is not instantaneous.

fierro · on June 13, 2021

yeah what I'm saying is that you can only pause as fast as your slowest currently initiated query. So if you have a diverse set of query patterns, you could be waiting for a really small percentage of small queries to wrap up.

fullstop · on June 13, 2021

I understand now. Yes, this is true and should be taken into consideration during upgrades.

Most of our queries are fast, and the longer ones hit a read-only secondary instance. Cleanup jobs, though, can take hours to complete.

twunde · on June 11, 2021

To be fair about this page, this was used to migrate versions of postgres __prior__ to the introduction of logical replication. Logical replication makes this significantly easier (ie you no longer need the triggers)

stingraycharles · on June 11, 2021

Exactly this. The OP’s approach reminded me so much of the days of Slony, and I wondered why a simpler approach with logical replication would not just suffice.

Rather than pgbouncer, I did this in the actual application code once (write to both databases at the same time, once everything is in sync and you’re confident the new server works well, fail over to the new one only), but it depends upon how much control you can exercise over the application code.

Any approach that is based on triggers makes me shiver, however.

booleanbetrayal · on June 11, 2021

This is precisely the migration I'm planning on doing in the next few weeks with pglogical under the hood for replication. Seems like the atomic switch is much easier than any sort of problem that could stem from conflict or data duplication errors while in a bi-directional replication strategy.

asenchi · on June 11, 2021

Yep, you can also prepare the new database by using a snapshot of the primary's volume, and use pg-rewind to get them in sync. Altogether the tooling can make migrations super easy without minimal downtime.

nhumrich · on June 11, 2021

I use pgbouncer and had no idea it supported logical replication. I cant find anything about it in the docs. Do you have something you can link me to to read more?

aidos · on June 11, 2021

I’m assuming they mean to use Postgres logical replication.

nhumrich · on June 11, 2021

Which is only possible if you are using a version of postgres which is new enough, and isn't restricted, such as some versions of RDS. Which, explains the whole original post.

vira28 · on June 12, 2021

Does PAUSE works in session mode? I have always had challenges with PAUSE in session mode.

foobarbazetc · on June 11, 2021

That's how you do it.

luhn · on June 11, 2021

Braintree (IIRC) had a really clever migration strategy, although I can't seem to find the blog post now. They paused all traffic at the load balancer, cut over to the new DB, and then resumed traffic. No requests failed, just a slight bump in latency while the LBs were paused.

This app apparently had robust enough retry mechanisms that they were able just eat the errors and not have customer issues—Color me impressed! I'm not sure how many teams can make that claim; that's a hard thing to nail down.

scrollaway · on June 11, 2021

I remember reading something like that about Adyen. That might be why you're unable to find it.

rattray · on June 11, 2021

I wonder if that was it (Adyen and Braintree are competitors).

Here's the post: https://www.adyen.com/blog/updating-a-50-terabyte-postgresql...

Discussed here: https://news.ycombinator.com/item?id=26535357

They architected their application to be able to tolerate 15-30min of postgres downtime.

nezirus · on June 11, 2021

pgbouncer has PAUSE comnand, which can be used for seamless restarts/failovers or similar

simonw · on June 11, 2021

I think that's covered in this talk (I've not watched the video though): https://www.braintreepayments.com/blog/ruby-conf-australia-h...

pgr0ss · on June 11, 2021

It was also covered in: https://www.braintreepayments.com/blog/switching-datacenters...

luhn · on June 11, 2021

Aha, that's the post I was looking for!

xyst · on June 11, 2021

probably had a good circuit breaker design. would love to read the article though.

dkhenry · on June 11, 2021

This is one of the areas where Postgres is so far behind MySQL is embarrassing. Zero downtime migrations in MySQL have been a common method for over 10 years. This solution is far from ideal due to the use of Triggers which can greatly increase the load on the database and slow down transactions. If you don't have a lot of load on your DB thats fine, but if you are pushing your DB this will bring it down.

In MySQL they started with triggers with PT-OSC, but now there is GH-OST which does it with Replication. You can do something like this with Postgres by using Logical replication, but its still requires hand holding, and to my knowledge there is no way to "throttle" the migration like you can with GH-OST. Where I work now we are building all this out so we can have first class online migrations, but the chasm is still pretty big.

throwdbaaway · on June 12, 2021

The article is about migrating from RDS Postgres 9.5 to RDS Postgres 12, which is only feasible with Bucardo. Major version upgrade is less of a pain point these days with logical replication, and even Google Cloud SQL started to support logical replication since about a week ago, after a painfully long wait.

Meanwhile, you are talking about schema migration? In that case, gh-ost indeed is the best tool available. With Postgres, most types of schema migration can be done instantly with just a metadata change, or otherwise be done concurrently without locking the table. So a tool like gh-ost is not as vital, but still valuable for edge cases such as:

- altering the column type: requires table rewrite

- adding an index concurrently onto a busy server: a throttle would be nice

montroser · on June 11, 2021

Yep, we did MySQL dual primary-primary circular replication on a cluster with ten read replicas, thousands of qps per box, all the way back in 2005.

We failed over back and forth from one primary to the other on a schedule every few weeks, just to practice and affirm we could.

ransom1538 · on June 12, 2021

Mysql admin here. If i need to create an in sync copy of a mysql db machine i take a snapshot then let replication catch up. Done. I can do whatever I want to this new machine (test alters, drop columns, etc) and let it catch up. If i like the box i just attach a few replicas - promote it to master, done.

The Triggers trick is awesome (go percona!) -- but with cloud vms it takes a few seconds to fire up a complete insync replica. I find this safer than triggers running all over a prod database [esp. if you have thousands of vms!]. IMHO.

When I read about Postgres it's like being transported to 2005.

merb · on June 11, 2021

> but its still requires hand holding, and to my knowledge there is no way to "throttle" the migration like you can with GH-OST

eh? you only do a basebackup and than you can begin the logical replication. at some point you than you do a failover?

chtitux basically described the process which is extremly simple.

dkhenry · on June 11, 2021

Up until the table is so large that the copy takes longer then your WAL retention so you can't ever catch up. Like all things in Postgres, it works great up to a point, and then you are stuck. You also have to Logically replicate the entire schema because postgres won't logically replicate from table to table

merb · on June 11, 2021

if that is the case you should have enough manpower to make use of basebackups+wal-e/wal-g and than once your up, flipping it over to logical.

of course it is not as easy as having vitess, but vitess was not built in a day.

powerbook5300CS · on June 11, 2021

Its insane that it has to be this complex and require third party software to accomplish…

Most modern rdbms/nosql database vendors allow a rolling upgrade where you roll in new servers and roll out the old ones seamlessly.

Also the fact that AWS rds doesnt do this with zero downtime by default through automating it this way is also crazy. Why pay for hosted when the upgrade story is incomplete? Take downtime to upgrade a DB in 2021? Everyone must be joking.

Pxtl · on June 11, 2021

Well, Postgres is F/OSS, so I expect the solution to problems to be "lots of small tools"... but I see the same kind of herculean battle-plans for MS SQL Server work, and I get shocked. That is a paid product, what on Earth are we paying for?

marcosdumay · on June 11, 2021

> what on Earth are we paying for?

The same is the same for any relational DBMS: lock-in.

worewood · on June 11, 2021

You're paying for an "if something goes wrong and it's the software's fault you can sue us" license

senko · on June 11, 2021

You're not. Approximately 25% of SQL Server EULA text deals with ways in which the warranty is limited. The best the license gives you is your money back if you prove that it was the software's fault.

Of course, you can always sue. A $2T company. Good luck with that.

doctor_eval · on June 11, 2021

Went to post the same thing but HN had a little downtime burp.

Would only add that the main reason people buy these licenses, apart from familiarity, is so they can shift the blame if something goes wrong. Its all about ass covering.

salmo · on June 11, 2021

Well, this and

1) the illusion of "support" 2) OSS-fearing companies' love of indemnification

fridif · on June 12, 2021

it likely wasnt downtime and rather a "you are posting too many messages and we will pretend that the whole site is down rather than telling you"

undergrowth54 · on June 12, 2021

You’re presumably paying for a relationship with someone who has expertise.

defaultname · on June 11, 2021

This is a complex task, fraught on any platform. Most systems base such clustering on a shared filesystem, which is obviously orthogonal to the shared nothing designs most try to pursue.

rigaspapas · on June 15, 2021

AWS can upgrade your database automatically, but with some downtime. AWS also provides DMS for migrations, which didn't work well in our case. So it was rather a simple problem at first, which turned to be a very complex one in the end.

0xbadcafebee · on June 11, 2021

> Most modern rdbms/nosql database vendors

"Modern" = "not a traditional RDBMS". Expecting your 1983 Toyota Corolla to fly, regardless of the fact that it's now 2021, is unrealistic. But personal helicopters have been around for a while.

ivan888 · on June 11, 2021

I didn't read this article, but I really hate the tag line "done right". It expresses such a poor sense of humility, which is one of, or perhaps the most, important traits in the world of software

doctor_eval · on June 11, 2021

Yeah I got the same feeling but couldn’t quite put my finger on it.

“Here’s how we did this. Feedback welcome” would seem more appropriate.

Perhaps it doesn’t fit the hyper-aggressive mould of the would be titan of industry.

That said their engineering blog seems to be down so ¯\_(ツ)_/¯

Cullinet · on June 12, 2021

adding "(for us)" would have taken the edge out of their title, I am thinking..

rigaspapas · on June 15, 2021

We didn't mean to be arrogant with the "done right" statement. In the background story we explain how we performed the same migration once again in the past, and we end up with data loss. So this was the time that actually "did it right". Many tutorials on the Internet that describe a similar process have flows that also lead to data loss.

jlmorton · on June 11, 2021

If you are in AWS, or have connectivity available, AWS Database Migration Service makes this relatively trivial.

DMS for Postgres is based on Postgres Logical Replication, which is built-in to Postgres, and the same thing Bucardo is using behind the scenes. But AWS DMS is very nearly point-and-click to do this sort of migration.

chousuke · on June 11, 2021

The key here is Postgres 9.5. AWS DMS does not support it because they require logical replication support.

A few years back I migrated a PostgreSQL 9.2 database to AWS and wasn't able to use RDS because logical replication was not available.

I did try to use Bucardo but ultimately didn't trust myself to configure it such that it wouldn't lose data (first attempt left nearly all BLOBs unreplicated because the data isn't actually in the tables you set triggers on)

Physical replication to a self-built instance was easy, I was 100% confident it wouldn't be missing data, and the downtime from cutover was about 15 minutes (It involved restarting piles of slow Java applications)

booleanbetrayal · on June 11, 2021

I think DMS generally lags RDS releases. Last I checked, there still wasn't a way to replicate from Postgres RDS 12 -> 13 with DMS.

rigaspapas · on June 15, 2021

You are right. AWS DMS was our very first choice to try out. It is very easy to use, deployed within your VPC and most of the problems we mention in the article are already solved. Unfortunately, we experienced errors during our tests and the logging mechanisms were not quite helpful, so we failed to find out the problem and make it work.

jeffbee · on June 11, 2021

Bucardo has no performance impact, it just adds a trigger to every mutation. Negligible!

I really think articles of this kind are not useless, but need to explicitly narrow their audience to set expectation at the start. This particular topic is PG replication for users who aren't very sensitive to write latency.

rigaspapas · on June 15, 2021

That's a good point. We mention latency as "replication lag" and we have devoted a paragraph to the drift that comes as the result of this latency. In our case, we measured the latency to be <1s, and it was totally acceptable. As for the performance, triggers can become a problem with enough write traffic. In a different use case, Bucardo would be eliminated from a potential solution due to this.

booleanbetrayal · on June 11, 2021

Has anyone here leveraged pglogical for this before? Looking at a similar migration but it has native extension support in AWS RDS. Would love to hear any success / horror stories!

Nextgrid · on June 11, 2021

Internet Archive link in case the original is overloaded: https://web.archive.org/web/20210611143214/https://engineeri...

endisneigh · on June 11, 2021

I wonder how much easier software engineering would be if there were a period where things are simply not available.

What problems are currently very difficult would be made trivial if 6 hours of downtime every Sunday were acceptable? 10PM-4AM EST

maccard · on June 11, 2021

Your team is now working 10pm to 4am est every Sunday, and anyone who uses your services in any time zone around the world is without your service. 10pm est is lunch time in Australia. 4am est is working hours in many parts of western Europe.

You don't have to be huge to have customers all around the world.

yifanl · on June 11, 2021

I think the idea is that this world is one where the customer accepts that there may be an outage every Sunday because feature work can be done far more cheaply.

endisneigh · on June 11, 2021

You’re correct, but if there happened to be an outage during that time now you’d have to have your team working then anyway.

The existence of a planned downtime doesn’t necessarily mean deployments, or work, have to happen during that time.

yupper32 · on June 11, 2021

> but if there happened to be an outage during that time now you’d have to have your team working then anyway.

So you're having them work late 52 times a year AND when there's an outage vs only when there's an outage. They're working late WAY more in your plan.

endisneigh · on June 11, 2021

The existence of planned downtime doesn’t mean they have to work during said time. I’m not following your reasoning

kabdib · on June 11, 2021

Er, you bring the service down so you can do work on it. Somebody is doing that work.

I guarantee that you'll have trouble hiring qualified people for a role that includes core hours of "2AM-5AM Sunday mornings" (or whatever other oddball time you've decided). Updating systems is precisely when you may need to have all-hands-on-deck responses. Been there, multiple times.

sigstoat · on June 11, 2021

they might not have to work during every planned downtime, but they're going to have to work during some of them, or what was the point of the downtime?

endisneigh · on June 11, 2021

The point is to create a predictable period in which downtime, if any, would occur during

xmprt · on June 11, 2021

This predictability only matters for frequent, repeat customers. But those are the exact customers that you could just email saying "the website is going to be down on Sunday for 1 hour for planned maintenance" a few times a year when it's actually required.

If you run a business (eg. HFT where you run certain cleanup scripts after trading hours) where regular planned downtime is fine then go for it but most of the time, you don't need a downtime schedule every single week because you're not doing planned maintenance work every single week.

ohyeshedid · on June 12, 2021

You will be doing that every week when the system is designed to only be worked on during those planned windows, because that's what happens in these types of situations.

These windows are also when people discover how fragile their infrastructure is; when it doesn't come back up and now everyone is panicking and well past the planned downtime window.

endisneigh · on June 11, 2021

I think people are getting too caught up in my particular example. It could be a month, biweekly, etc.

yupper32 · on June 11, 2021

I find it very hard to come up with a use case where a weekly 6 hours of downtime at night EST would be acceptable.

US government websites already often do this, and I find that completely unacceptable since these services need to be available to anyone, regardless of work and life schedules.

Any website that's international would also suffer greatly for an EST centralized scheduled downtime.

Maybe a very localized website that doesn't have much impact on real life?

endisneigh · on June 11, 2021

Why wouldn’t it be acceptable? People aren’t available all the time either, nor are stores. In fact software is unique in its availability.

yupper32 · on June 11, 2021

Frankly, because it can be. Weekly scheduled downtime is arbitrary, manufactured, and lazy.

Of course, having to schedule the occasional downtime for a database migration is fine. There's probably a few times a year that you'd need to do it if you don't have the bandwidth to do fancy zero-downtime solutions. It's the weekly arbitrary downtime that I'm firmly against.

nijave · on June 13, 2021

It's expensive to run things like that. Multiple "9"s are significantly more expensive to engineer than, say, 95% uptime.

In general, the correct amount of allowable downtime is when losses due to downtime are greater than the cost of increasing availability (which depends on the system). Should we raise taxes to increase government website availability? Not sure there's a clear cut case and it would definitely depend on each specific website and what utility it provides

endisneigh · on June 11, 2021

Why is it lazy? When things are up people have to work. Do you believe people should be working all of the time?

I think regular downtime is only natural. If you had to choose between 95% availability or 100% availability other than the before mentioned downtime which would you choose?

yupper32 · on June 11, 2021

Sorry I'm a bit confused by your argument.

> When things are up people have to work.

That's not true. Monitoring software and on-call rotations are well established things, and wouldn't even go away in your world. Believe it or not, people who run websites already sleep at night.

> Do you believe people should be working all of the time?

I'm not sure where you got that. Besides, you're already having people in the office in the middle of the night every Sunday already. Or are you saying that the downtime is when your team gets to sleep?

> If you had to choose between 95% availability or 100% availability other than the before mentioned downtime which would you choose?

Again, not sure what you're getting at. There's only a few reasons why you'd need to schedule a downtime. Database migrations are a good one. All of the reasons should only happen a few times a year, likely. Way better than 52 times a year.

endisneigh · on June 11, 2021

I’m not making a particular argument. I’m just wondering if having regular downtime is better or worse in terms of reducing engineering complexity.

Spooky23 · on June 12, 2021

Perhaps. Oftentimes it’s driven by legacy batch processes on mainframes.

There’s a wide variety of reasons why it may make sense to carry on in that way.

kabdib · on June 11, 2021

What part of the world?

One service I worked on (with customers around the globe) had about 10 minutes of scheduled downtime a week. Peak user load was, interestingly enough, during our core hours, when there are a lot more hands on deck to handle any issues. So that's the time we chose.

Naturally people still complained bitterly about that ten minutes.

Getting to zero didn't seem worth the significant engineering effort required. "You're not Amazon, and you're not 911" [emergency number in the US]

doctor_eval · on June 11, 2021

Setting aside the other comments, I think it depends on your definition of “easier”.

Yes - it is easier in terms of up-front investment to have a scheduled (daily, weekly) downtime. But that often means the downtime is not automated - it’s someone doing the work, often out of hours when they’re not fresh, things can do wrong, etc. And it also means that the duration of the work is limited to the scheduled downtime. Some large upgrades just won’t fit in the allotted time, and you need to schedule an outage.

On the other hand, creating a system designed for zero downtime (ie, in-place upgrades, removal of big batch jobs etc) is a lot more of an investment up front, and makes upgrade planning more complex, but ultimately it saves time and effort because you can do all the changes during working hours, and “upgrades” become just how you deploy code - part of the core skill set of your team - and you do them 10 times a day.

The main difference that I’ve seen is that the scheduled downtime approach tends to create a much more conservative upgrade cycle, much bigger upgrades, and more risk of an upgrade failing - which means that a pile of features don’t get released. Also, certain customers will push back and demand a ton of documentation and detailed release notes and advanced warning of the upgrades etc - this creates loads of unproductive work for a small company. A continuous deployment model tends to reduce these problems because you’re not externalising your upgrade schedule and you don’t have a single big upgrade that can go bang, so maybe one feature doesn’t get released but everything else does.

Having taken both approaches (and even moved a company from scheduled to continuous) I would never go back to a scheduled downtime approach.

legohead · on June 11, 2021

Easier sure, but Zero downtime migration is a fun engineering problem (albeit stressful). Just need an employer who is willing to pay for it.

i_have_an_idea · on June 11, 2021

Interestingly, this is unacceptable in the US and probably most of Western Europe. However, I know of more than one bank in Eastern Europe where online banking simply doesn't work after midnight, until about 6am.

tinus_hn · on June 12, 2021

My Dutch bank often sends me mails about their internet banking being available for a few hours during the night for maintenance.

Availability is great but realistically, who’s going to use their bank in the middle of the night? It probably isn’t worth the effort and the risks to avoid the downtime.

nijave · on June 13, 2021

I think it's still somewhat (fairly?) common for internal facing apps where you can just tell your users "It's going to be down in this window, plan accordingly". For a publicly available app, you'll lose your customers to a competitor with a more advanced maintenance strategy (or even one that happened to pick a NoSQL database where rolling upgrades are trivial)

tikkabhuna · on June 11, 2021

This is very common in the equity trading space. Applications must be up during the trading day (non-fri 8-5) but after that they can be upgraded and restarted easily. Typically there’s strict change management processes that prevent Continuous Delivery and slows the rollout of changes.

Often I wish I worked in an industry where I could release during the day.

mmanulis · on June 11, 2021

Can this be done using existing PostgreSQL functionality around replicas? I think there's a plugin for PostgreSQL that supports master-master replication as well.

omot · on June 12, 2021

I always tell my non technical friends: “Migrations are like surgeries, and we’re letting new grads lead them.”

silviogutierrez · on June 11, 2021

Very interesting article. But I have to ask: would taking down the system for a couple of hours be that bad?

I looked at the company, and while they seem rather large, they're not Netflix or AWS.

I imagine they need to be up for people to be able to check in, etc. But they could just block out the planned maintenance as check in times far in advance. I'm sure there's a million other edge cases but those can be thought out and weighed against the engineering effort.

Don't get me wrong, this is very cool.

But I wonder what the engineering cost was. I'd think easily in the hundreds of thousands of dollars.

maccard · on June 11, 2021

I think it's always worth questioning both sides. Why is downtime acceptable? People on this site routinely complain about windows needing a restart for system updates while boasting about their Linux servers uptime.

People talk about how kubernetes is overkill for many people, but it gives you rolling deployments for your applications out of the box.

There's also the "slippery slope" argument. A 0 downtime migration means 0 downtime. A 5 minute migration creeps up to a 30 minutes migration occasionally, and then regression to the mean happens, causing standard migrations to be 30 minutes.

munk-a · on June 11, 2021

> Why is downtime acceptable?

At least where I'm working downtime is acceptable because our customer base is composed of professionals in north america that work 9-5 thus taking the DB offline for five minutes at midnight PST has essentially no cost to our client base. To contrast that, spending three months developing a rollout plan costs our company dearly in competitive advantage.

I agree that it's always worth examining from both sides but I also think that 0 downtime migration is both a shiny interesting problem and a point of pride for developers that can cause a company to vastly over invest in a complex technical solution to a problem they never had.

Daishiman · on June 11, 2021

A lot of 0-downtime migrations end up turning into a months-long ordeal and can still fail.

Allowing maintenance windows means you can do things in a much simple manner. Need to take a snapshot of a DB without running into issues with a production system adding data? Sure, go ahead, you just saved two months.

hinkley · on June 11, 2021

It's the old can/should argument.

The principle of something like zero downtime or switching a vendor out to save 20% may not stack up against the reality in opportunity costs, employee burnout, and customer turnover.

My company is reorganizing and so everyone is trying to settle old philosophical debates during The Churn. But everyone is overextended and a number of us have started pushing back. Just the meetings to discuss changing vendors might push the break-even point out 6 months, before you even get into implementation costs. The more time we spend working on infrastructure instead of customer-facing features and bugs, the more likely those customers are to wander off.

xmprt · on June 11, 2021

I think the slippery slope argument is almost always a fallacy. In practice you can say "we accept 99.99% availability" (which is about 1 hour every year) and budget for that. Your service might go down for unplanned reasons such as an outage but it could also go down for planned maintenance and as long as you're within that budget it seems ok.

fullstop · on June 11, 2021

I work for a small company and it would be devastating if our database was down for a few hours.

silviogutierrez · on June 11, 2021

Unplanned? Probably. But maybe it's not as a dire as you think? Azure Active Directory (auth as a service) went down for a while, globally, and life went on. Same with GCP and occasionally AWS[1]

I'm not saying there's no downside, I'm asking against the downside of engineering cost. And that itself carries risk of failure when you go live. It's not guaranteed.

[1] AWS is so ubiquitous though, that half the internet would be affected so it makes it less individually noticeable.

dividedbyzero · on June 11, 2021

So don't host on AWS, wait out their next big outage and then take down your app for that big migration, so you can hide among all the other dead apps. Half joking, of course.

silviogutierrez · on June 11, 2021

That's brilliant. Have the PR ready to go and merge/deploy only when that happens. Just host on us-east1 and it'll happen eventually.

willcipriano · on June 11, 2021

In my view that's just saying at some point in the future the company will be devastated. Nothing has 100% uptime.

hobs · on June 11, 2021

OTOH I worked for several small companies that had "business hours" software - if you had to take downtime on the weekend or after hours you'd be looking at impacting <1% of users.

jbverschoor · on June 11, 2021

I doubt that.. aws is down sometimes too. Card payments are sometimes down, even 112(911) is sometimes down.

I hope you have everything redundant, including the CEO

ocimbote · on June 11, 2021

> I hope you have everything redundant, including the CEO

Made me laugh, I'll sure reuse it!

fullstop · on June 11, 2021

The database has a hot standby and two warm standby instances, redundant hardware, network infrastructure, all of that.

chefkoch · on June 11, 2021

Then hopefully everything is HA in your company.

fullstop · on June 11, 2021

It absolutely is, BGP with geographically diverse paths, databases, app servers, etc, are all redundant. It's hosted in-house, so there is a cold standby database in AWS which would only be used if, say, an aircraft crashed into our server rooms.

ikiris · on June 11, 2021

It always amazes me when someone says they can't ever be down, and then says they only serve from one physical location actively.

One of these things is not like the other.

fullstop · on June 11, 2021

We have everything in place to run from AWS if needed but do not operate from there because of cost.

This goes both ways -- there have been many AWS outages which have not affected us. I hear what you're saying, but we've had only one instance of extended (hours) downtime in the last 20 years.

ocimbote · on June 11, 2021

The more experienced I become, the more such down to earth solutions seem OMG SO MUCH MORE reasonable than all the bells and whistles of "modern" engineering practices.

ohyeshedid · on June 12, 2021

Out of curiosity, what do you use in AWS to keep a running live backup of the data you'd need to migrate in the worst case scenario?

fullstop · on June 12, 2021

For postgres we send WAL files to a server in AWS which processes them. To bootstrap the database initially we sent zfs snapshots, and those WAL files are applied on an ongoing basis. If our data center were to die a horrendous fiery death, we could lose, at most, about 3 minutes of data although monitoring shows that it's closer to 30s under normal operating conditions.

For the app servers, saltstack is used and we synchronize that repository with what is needed to reproduce a production environment in AWS.

Obviously we'd have to provision servers, etc, but it's all possible in a worst-case scenario.

jnpatel · on June 11, 2021

Could you share what domain your company is in? Just curious what has such HA requirements

sharadov · on June 11, 2021

pg_upgrade with --hard-link option will upgrade the db in mins, I've done a 2TB db in less than a minute. Yes, there is no rollback, but if you do enough testing prior it can be really smooth. Am in the middle of doing this across an environment where we have tens of hosts from a few GB to couple TB.

zozbot234 · on June 11, 2021

You could take a filesystem-level snapshot for rollback.

jeffbee · on June 11, 2021

The problem happens when you estimate it will only take a few hours and at the end of the first hour it's 1% done, or it nears completion and crashes, in a loop. Now what?

rigaspapas · on June 15, 2021

A few minutes of downtime wouldn't be that bad, but it would harm the company's prestige. Unlike AWS and Netflix, we are not based on massive consumption, but on a few customers. Losing one good customer can make a difference. We believe it worth the investment of time.

rigaspapas · on June 15, 2021

Update: The cost is a good point. It wasn't that high. We actually updated the article to reflect this. It was two engineers, working full-time on it, for three weeks.

andrewmcwatters · on June 11, 2021

What is the equivalent for MySQL?

redis_mlc · on June 12, 2021

MySQL has had a myriad of easy-to-use solutions for over a decade that offer point-in-time cloning:

- Percona xtrabackup

- Linux LVM snapshots

- master-slave replication

- then built-in replication for point-in-time catchup.

I do some work with pg, but it seems clunky compared to MySQL for typical HA and failover scenarios, but it can be done. All the MySQL methods I mentioned above can and have been scripted. With MySQL GTIDs, it's usually trivial.

MySQL 8 and Percona Cluster are easy-to-use multi-master topologies. (I only use them with less than 100 GB of data in case of state transfers.)

Source: MySQL DBA.

rhacker · on June 11, 2021

"the zero downtime" + this site is definitely down is making me laugh.

zild3d · on June 11, 2021

next post "zero downtime static site"

ckboii89 · on June 11, 2021

does anyone know if this works when the target database is a replica/standby? The downside using pg_dump is that it acquires a lock on the table its dumping, and doing this on production may cause some slowness

holoduke · on June 11, 2021

Looking for something similar for a mariadb setup. Anyone knows some resources?

nick__m · on June 11, 2021

You can do multi master replication with conflicts detection with symmetricDS. The migration would be similar to the procedure described in the article.

data_ders · on June 11, 2021

way cool!

fyi the hyperlink "found here" to setup_new_database.template is broken.

coolspot · on June 11, 2021

I just don’t get all the “just shutdown the db for an hour, you’re not Netflix” comments.

If you can do things properly, as an engineer, you absolutely should, even if your company serves “just” hundreds of thousands instead of hundreds of millions users.

It is not like they wrote their own database for that, they just used an open source tool.

mswtk · on June 11, 2021

My understanding of doing things "properly" as an engineer, is picking the solution with the right tradeoffs for my use case. If the cost to the business of having some amount of scheduled downtime occasionally is significantly less than the engineering cost of maintaining several 9s worth of availability over major migrations, then I consider the former to be "done right".

perrygeo · on June 11, 2021

If you do things properly as an engineer, you've already negotiated and committed to service-level agreements with specific downtime objectives, right? Right?

If you've got say 5 minutes of downtime budget per month, do you really think it's a good investment to spend a million dollars of engineering effort and opportunity costs to get database downtime to zero seconds? Or you could use off-the-shelf techniques for orders of magnitude less investment and suffer a few minutes of downtime at worst, well within your SLO. It's an economic decision, not a technical one.

If zero downtime at any cost is non-negotiable, well you'd better hire, plan and budget accordingly. And hope those few extra minutes per year are worth the cost.

coolspot · on June 11, 2021

I agree that it is an economic decision. In my understanding it didn’t cost them millions of dollars to do it zero-downtime. Maybe a $10k-$50k in development/admin/test hours to make it so.

Also engineers get better (and happier) when they do challenging tasks!

koreth1 · on June 11, 2021

> Also engineers get better (and happier) when they do challenging tasks!

Better maybe, but happier, I'm not so sure.

At my last job I was the main developer in charge of hitting our zero-downtime target for code deploys and system upgrades, and it was a pain in the ass to always have to implement multi-stage migration processes with bidirectional compatibility between adjacent stages for things that would have been very little work if we'd been able to schedule an hour of downtime.

The cost was worth it from a business point of view, but it wasn't much fun to actually do the work.

Though I guess I agree with the "happier" part too if you're just talking about doing the initial infrastructure and software-architecture work that allows those annoying multi-stage migrations to run seamlessly. I did enjoy writing the migration framework code and figuring out what all the stages would need to be.

Cullinet · on June 12, 2021

> Also engineers get better (and happier) when they do challenging tasks!

I would rather encourage them to engineer to the limit of ability but deliver to Ed: within the limit of burning out my engineering team, plus some comfortable margin gained from the encouragement given, which should be rewarded.

I've talked myself into engineering insufficient time for unforseen downtime tasks and it took me a while to realise that I was wearing my management hat as a founder and not the engineering hat I should have been wearing.

jbverschoor · on June 11, 2021

> Blueground is a real estate tech company offering flexible and move-in ready furnished apartments across three continents and 12 of the world’s top cities. We search high and low for the best properties in the best cities, then our in-house design team transforms these spaces into turnkey spaces for 30 days or longer.

Seriously, how big can that db be, and how bad would a 1hr reduced availability / downtime be?

Seems like a lot of wasted engineering effort. “You are not google”

cryptoz · on June 11, 2021

You cut out the part in About where they hint at it being important here even. "But we go a step further, merging incredible on-the-ground support with app-based services for a seamless experience for all our guests"

It sounds like across multiple timezones, with a tech stack that backs the business in a specific way, that downtime could be a problem and that it would reduce their offering, "seamless" and reliability for moving.

If twitter goes down and you can't tweet, only really Twitter loses money and you can try again later. But if you're moving into an apartment you don't want to be standing outside trying to get the keys but the system is down. Edit: And you also don't want the service to tell you that you can't move in from 11am-1pm because of 'maintenance'

majormajor · on June 11, 2021

You can often get away with some downtime, but that's not the same as not spending any engineering effort.

What kills you is when you take an hour scheduled downtime, communicate this out, get everyone on board... and then are down for a day. If you don't have a good, well-rehearsed, plan, you might be unexpectedly screwed even then. As a rule of thumb... something usually doesn't quite go how you expect it to!

ohyeshedid · on June 12, 2021

To your point:

You can have the best plan in the world, well rehearsed and battle tested; something in that last patch bugged out and took the system with it. Now you're getting cursed out because the email, ivr, etc all said 1 hour downtime and it's been 4.

Weekly scheduled downtime has value, but it's not a good idea for most.

Hallucinaut · on June 11, 2021

I was trying to get on Zoopla (and, I think, Rightmove) two nights ago for over an hour after midnight and the site continued to tell me it was down for maintenance, and to wait again until the non-night owls were up in the morning. Pretty sure their market cap is ten, if not eleven, figures.

svaha1728 · on June 11, 2021

Agree that it's a lot of wasted engineering effort for companies that can have planned downtimes.

My guess the conversation was 'To be agile, we need to be 100% CICD'. Next thing you know everything needs to get pushed straight to prod continuously with no downtime.