Bidirectional Replication is coming to PostgreSQL 9.6 (2ndquadrant.com)
284 points by iamd3vil on Sept 25, 2016 | 69 comments



Holy crap, I am scared!

Please, please, please read the fine print and ensure you understand the design tradeoffs as well as your application's requirements before blindly using this.

The moment I heard multi-master I thought Paxos, Raft or maybe virtual synchrony. Hmm, nothing in the documentation. Maybe a new consensus protocol was written from scratch then? That should be interesting!

No, none of that either - this implementation completely disregards consistency and makes write conflicts the developer's problem.

From http://bdr-project.org/docs/stable/weak-coupled-multimaster....

* Applications using BDR are free to write to any node so long as they are careful to prevent or cope with conflicts

* There is no complex election of a new master if a node goes down or network problems arise. There is no wait for failover. Each node is always a master and always directly writeable.

* Applications can be partition-tolerant: the application can keep working even if it loses communication with some or all other nodes, then re-sync automatically when connectivity is restored. Loss of a critical VPN tunnel or WAN won't bring the entire store or satellite office to a halt.

Basically:

* Transactions are a lie

* Consistent reads are a lie

* Datasets will diverge during network partitioning

* Convergence is not guaranteed without a mechanism for resolving write conflicts

I am sure there are use-cases where the risk of this design is acceptable (or necessary), but ensure you have a plan for dealing with data inconsistencies!


"Everything after "Basically" is misleading and inaccurate.

The conclusions here are not logical: The existence of some restrictions does not imply "Transactions are a lie", since that makes us think ALL transaction semantics are suspended, which is very far from the case.

There are good reasons for the design and restrictions in BDR, which offers roughly 100x the performance possible with eager consensus. It's real-world pragmatism, with production tools to ensure no inconsistencies exist in the database. The BDR tools allow you to trap and handle run-time issues, so it is not a theoretical debate or a mysterious issue, just a practical task for application authors to check that their apps work.

"Consistent reads are a lie". Reads from multiple nodes at the same time are not guaranteed to be consistent - but this is multi-master - why would you read two nodes when all the data is on one node? The whole point is to put the data you need near the users who need it, so this is designed to avoid multiple node reads.

I could go on, and will do in a longer post elsewhere, but the main purpose of my retort is to show that the conclusions drawn here are not valid. Let's see how the hacker news method of consensus decides what is correct in this case.


Well, you can re-define the properties of a "transaction" within the BDR architecture, but you can't re-define your customers' use cases.

But you have a point. Not all semantics are suspended, only atomicity, isolation and consistency.

Durability has become probabilistic, because merge conflicts with a concurrent mutation from another host can discard our change.

>offering performance about x100 what is possible with eager consensus

And what's the performance gain over regular asynchronous master-slave replication? All nodes have to process all mutations at some point, no?

>but this is multi-master - why would you read two nodes when all the data is on one node?

The "why" is because my use-case requires consistent reads and the business risk of inconsistent reads is unacceptable.

I am guessing you are working on the project? It scares me that you are asking this question.

It remains to be seen how many people are happy with such a trade-off. I claim total ignorance here - it could be millions :)


Postgres-BDR works very well for its intended business use case, which is a geographically distributed multi-master database. Master-slave replication is not comparable cos there is only one master. You should use as few masters as your business case requires, so it is possible you don't need BDR at all. The underlying technology is roughly the same between master-slave and master-master, so that is not a big distinction and not the key point.

Postgres-BDR is designed to solve a common problem: access to a database from users all around the world. In cases where you have a single master and multiple standbys then all write transactions must be routed globally to the same location, adding 100s of milliseconds latency and in many cases making applications unusable.

Postgres-BDR offers the opportunity to make those writes to multiple copies of the database in different locations, significantly reducing the latency for users.

What BDR doesn't solve is what you do if all your users want to update the same data at the same time. But that type of application is just not scalable. If you run such an application on any database that gives you perfect consistency and lots of other shiny buzzwords, all you get is a really, really slow application, just like you had in the first place. But it will tick all the keywords.

All distributed systems require you to understand the business case and the trade-offs required to make it work. This is not fancy technology to use because you think its cool, its for use where it is useful and important to do so. BDR gives you the tools to know whether your application will work in a distributed manner and to keep it working in production without fear.


I couldn't agree more. Be _very_ careful with this illusion. A multi-master replication implementation like this makes sense for very few use cases, and even then the development team had better know what the hell they are doing, or a lot of unwanted/unaccounted-for things will show up in production.


I kind of agree with the assessment.

I've been prototyping with VoltDB because it has a pretty interesting model that should be able to achieve near-linear scaling of write operations for tables that are partitioned. After reading the docs on VoltDB [1], it became clear to me that they put the design constraints up front, and if you can work through those constraints [2] you can achieve some wicked scale. But it's a bit more complex than your typical single-host database.

The work that VoltDB makes you deal with up front is a lot like the work that would have to be done for a multi-master setup in PostgreSQL to function correctly. I like how VoltDB puts those problems up front, but I'm having trouble seeing VoltDB as a general-purpose solution. I can't see the old PG database I work on right now fitting into VoltDB, but maybe parts of it would fit OK.

I look forward to the tooling in PG getting better and better. It's a great community and I do like the work that 2ndQuadrant is doing, and how they approach the community with it. I do think the BDR work is important to understand, so that when you're in a situation that calls for it you can take advantage of it (same with VoltDB).

[1] https://docs.voltdb.com/UsingVoltDB/IntroHowVoltDBWorks.php#... [2] https://docs.voltdb.com/UsingVoltDB/DesignPartition.php#Sche...


If you have any question about whether VoltDB could fit or not, let us know. It's surely not as general purpose as PG, but yes, it's faster for many use cases and its clustering is fully consistent.


Thanks! I certainly will. It looks pretty cool. I'm certainly learning it and finding the technical details in the documentation to be quite interesting.


> I am sure there are use-cases where the risk of this design is acceptable (or necessary), but ensure you have a plan for dealing with data inconsistencies!

I'd argue most non-financial applications would find these risks acceptable. This form of multi-master is what most people writing web-based applications are actually looking for. It simplifies failover, at the costs you mentioned, but those aren't major issues, especially if they're known up front.

> * Datasets will diverge during network partitioning

> * Convergence is not guaranteed without a mechanism for resolving write conflicts

While this isn't ideal in a perfect world, it's workable for, again, web-based applications where consistency isn't usually required. Also, the rules are known: http://bdr-project.org/docs/stable/conflicts-types.html

So yes, there are definitely workloads where this type of replication isn't appropriate, however, acting like there aren't any is blatantly ignoring many types of workloads.


>acting like there aren't any is blatantly ignoring many types of workloads.

I agree in principle, and if you have the knowledge and means to understand/mitigate the risk, this warning isn't aimed at you ;)

It is aimed at the ignorant, not the negligent, e.g. the engineer I was 15 years ago - who didn't know what he didn't know and would've chosen BDR simply for its master-master promise, without understanding what he was really getting himself into.

I nearly ruined a business by choosing MySQL + NDB a few years ago.


If people are using Postgres in a worldwide distributed way, then it's safe to say that they likely have a deep understanding of the issues behind it.

Replication is a problem that n00bs aren't going to touch on their startup site. They're only going to use it when they have hundreds of thousands/millions of users already.


It is amazing to see the number of projects based on Postgres mentioning that they are trying to achieve multi-master with their own restrictions:

* Postgres-R is not multi-master, got discontinued in 2010, and lost most of its relevance after WAL shipping was added in-core with hot standby. http://www.postgres-r.org/

* Postgres-XC, which claims to do multi-master, was the first PostgreSQL-based project I heard of doing sharding (a term not used at the beginning of the project, and one I first heard in 2010). The project began in 2009 and was discontinued in 2013. There are limitations like missing support for triggers and savepoints. XC was designed to be good for OLTP workloads, but is poor for long transactions and data-warehouse-type workloads. The design of the project was done jointly by NTT and EnterpriseDB. https://sourceforge.net/projects/postgres-xc/

* Postgres-XL, which forked from Postgres-XC and enhanced the data-warehouse case with improvements for data analytics by introducing a communication protocol between datanodes. The project is still being maintained by some folks at 2ndQuadrant. http://www.postgres-xl.org/

* Postgres-X2, which has been an attempt to unify Postgres-XC and Postgres-XL efforts under the same banner. I don't know where this is moving to but things look rather stalled. https://github.com/postgres-x2/postgres-x2

* Postgres-XZ (look for PGXZ!), which is something I heard of recently, based on Postgres-XC and developed by some folks in China. Apparently it has its own limitations, and it is used in production where the constraints induced by scaling out are considered acceptable.

So it's cool to see many efforts, BDR being one, all trying to address the scaling-out problem in their own way. Each application should study carefully what to use and whether the limitations and constraints involved are acceptable.


Postgres-R is dead. No need to look.

Postgres-XC has never implemented multi-master, so don't look there. Postgres-XZ, if it exists, is a closed-source fork of XC by a Chinese company.

Postgres-XL implements cluster scaling for queries and transactions. It is alive and well. http://blog.2ndquadrant.com/when-to-use-postgres-bdr-or-post...

X2 is the developers of XL and XC working together to see if we can help each other; there is no separate code from that initiative.


Also CitusDB, which just got multi-master in their CitusMX cloud version and will release it later as open source.


> The moment I heard multi-master I thought Paxos, Raft or maybe virtual synchrony. Hmm, nothing in the documentation. Maybe a new consensus protocol was written from scratch then?

What's your use case here? Do you really want multi-master/bi-directional replication, or do you want distributed transactions? Why would an app want to connect to both masters? This isn't sharding.


If it isn't sharding then what is it? If every node has to process every write anyway (so there's no performance advantage over single-master) then why would one ever use this rather than traditional master-slave?


Active-active DR scenario, zero failover. Done this loads of times with Oracle. Or mostly-read regional DBs separated by a slow WAN, ditto.


> Active-active DR scenario, zero failover.

I guess, but I don't see the advantage over master -> slave failover? You lose unreplicated writes either way.

> Or mostly-read regional DBs separated by a slow WAN, ditto.

If it's just reads then regional slaves work well. If your writes don't need to be transactional then a RDBMS seems unlikely to be a good fit in general.


mostly reads != just reads

Think multiple fulfillment centers changing order state as orders are shipped. Read mostly, write a little.


Sure. Either global transactionality for writes is important (in which case this style is useless: you need a master to accept the writes, or else some kind of distributed consensus protocol, which this doesn't have, and a single global master + distributed read slaves is a good model), or transactionality is only important per-center (in which case I'd rather have explicit sharding than invisible differences in behaviour between queries that look the same), or you don't need transactional behaviour at all (in which case an RDBMS is probably a poor fit).


I can imagine a reasonable use case. If every node only accepts writes to a particular set of tables (or partitions), and serves as a read slave wrt the rest, the database may remain consistent and more available than a single-master setup. Imagine one node per a geographical region, with each node having a (read-only) picture of the whole.

This is a relatively limited use case.
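
A purely hypothetical sketch of that pattern (all names invented): each region's node is the only writer of its own table, and every node reads a unioned picture of the whole.

    -- identical schema on every node; each region only ever writes its own table
    CREATE TABLE orders_us (id bigint PRIMARY KEY, placed_at timestamptz, total numeric);
    CREATE TABLE orders_eu (id bigint PRIMARY KEY, placed_at timestamptz, total numeric);
    -- any node can read the (asynchronously replicated) whole:
    CREATE VIEW all_orders AS
        SELECT 'us' AS region, o.* FROM orders_us o
        UNION ALL
        SELECT 'eu' AS region, o.* FROM orders_eu o;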


Do you know of any relational products which offer high-throughput, low-latency, high-availability transaction processing using perfectly synchronous multi-master replication?


No. But I am aware of products that don't silently drop written data. That should be a pretty low bar.


Doesn't Oracle offer such an option?


You can't guarantee low latency and consistency without setting geographical boundaries on the distribution of the servers, unless you have very generous definitions of "low latency", as the speed of light (and more realistically: the speed you can transmit a signal via the internet, which is substantially lower) between the servers will place lower bounds on latency.

This is the case as you can't guarantee consistency without coordination between the servers (or you won't know what order to make operations visible in the case of multi-master, or whether a transaction has even been successfully applied by a slave when replicating master-slave), which at a bare minimum involves one round-trip (sending a transaction, and waiting for confirmation that it has been applied; assuming no conflicts requiring additional effort to reconcile).

You can pipeline some stuff to partially avoid actual delays, but you can't avoid the signalling, and for many applications it has a disastrous effect on throughput when certain tables/rows are highly contended.
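
To put rough numbers on it: New York to London is about 5,600 km, and a signal in fiber travels at roughly 200,000 km/s, so one direction costs about 28 ms and the confirmation round trip at least ~56 ms before any queuing, routing or disk time. No protocol choice gets you under that floor for a strongly consistent cross-Atlantic commit.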


Yes, I meant a strictly one-cluster solution for machines within the same datacenter and preferably the same rack.

Synchronous replication across such a cluster can give much more read performance with write consistency and durability guarantees even when hardware failures occur. I don't know if any potential write performance increase would be worth the increased complexity, compared to a standard single-master setup.


I am well known as a database guy, but for a scenario like this I think best practice is probably CICS.


Thanks for posting this! How does the logical replication reason about preconditions? Are row versions consistent across replicas? Are the preconditions somehow encoded as value checks?


We need a Jepsen test.


Some info from 2nd Quadrant on what BDR is: https://2ndquadrant.com/en/resources/bdr/

Bi-Directional Replication for PostgreSQL (Postgres-BDR, or BDR) is the first open source multi-master replication system for PostgreSQL to reach full production status, developed by 2ndQuadrant and assisted by a keen user community. BDR is specifically designed for use in geographically distributed clusters, using highly efficient asynchronous logical replication, supporting anything from 2 to more than 48 nodes in a distributed database.


> anything from 2 to more than 48 nodes in a distributed database

Why specify a range if you're going to leave it open-ended?


Yes, it means it was tested with up to 48 nodes.

There's no hard limit on the number of nodes, but at the moment BDR uses full mesh topology (each node has connections to all other nodes), which becomes an issue as the number of nodes increases.


Possibly, it means that they have tested it with 48, and any more is left as an experiment for the daring. Just guessing.


It's good to know what range people have tested with, even if it's not a hard cap.


I guess it means they reached production stability on 48 nodes, but there is nothing keeping you from adding more nodes if need be.


While indeed very exciting, it's important to note that this makes the BDR extension from 2ndQuadrant compatible with stock Postgres. It does not mean BDR ships with core Postgres.

This continued improvement of the core code and extension APIs will make more and more extensions feasible, which means more can plug in and add value without having to be committed to core. Though in time this is one that has a good chance of actually ending up in core, much like pg_logical.


Not exactly. It means that enough infrastructure was moved into PostgreSQL 9.6, making it possible to run BDR on unmodified PostgreSQL. Before 9.6 it was necessary to use patched PostgreSQL packages.


The title is misleading; the replication is not coming to stock PostgreSQL 9.6.

A replication extension got the patches it needs merged into PostgreSQL 9.6, so you can use the extension without patching Postgres.


The purpose of extensions is that they allow you to run stuff without including it in the core database. What wording should be used in this case to avoid confusion in the future?


"Bidirectional Replication extension no longer needs patches in PostgresSQL 9.6" ?


As the developer who also manages the servers we deploy on, and not a full time PgDBA, things like multi-master replication scare the hell out of me. They really make me worry about what happens after downtime. And latency.

Could anyone here recommend good reading material for scaling out your first database on to multiple servers? How do I know which scheme is the best for me?


The answer really depends on what you mean by "multi-master" - particularly whether you're looking for a synchronous or asynchronous solution, what consistency model you need (a strongly consistent cluster, or nodes consistent independently), and what your goals are (write scalability, read scalability, disaster recovery, ...).

BDR is meant to be asynchronous multi-master, i.e. a collection of nodes that are strongly consistent on their own, but the changes between the nodes are replicated asynchronously. Great for geographically distributed databases (users access their local node), for example.


We're dealing with a booking system so sync is important. But so are redundancy and throughput.


Some more information for anyone trying to understand this better: http://bdr-project.org/docs/stable/overview.html

Source code: https://github.com/2ndQuadrant/bdr


Would love to see this land in Amazon RDS's list of supported extensions!


Looking forward to playing around with this. Native master-master replication is the only thing keeping me on MySQL.


Just curious, what Postgres features are you missing on MySQL?

I had only used MySQL until a year or two ago, and wondered what I was missing since Postgres seems to get more love/hype from the developer community for whatever reason.

Now using Postgres in production, there are few if any features that I notice our team using which don't exist in MySQL (maybe JSON landing in Postgres first is one big one?). One thing I have noticed is that I find the user/permissions model for Pg less intuitive. It's as if it's designed for use in a computer lab or something, where there's one human who is the owner/DBA and some things can only be done by them, which doesn't map well to a web app trying to follow the principle of least privilege.

This combined with the fact that we're on RDS where MySQL/Aurora is the clear first class citizen makes me wish we were using MySQL.


MySQL has a bunch of irritating gotchas, most of which can be configured around or worked around, but they're just unpleasant by default. E.g. MySQL "latin1" refers to a non-standard character set that has only half its characters in common with the character set commonly known as latin-1 (this is documented, but to find that documentation you have to first consider the possibility that "latin1" doesn't simply mean "latin1"). Likewise "utf8" refers to a non-standard character set that has only 1/16th of its characters in common with the character set commonly known as UTF-8. A lot of its defaults are table-engine-specific, but the default table engine is set per-server, and even if you specify the engine you may silently get a table with a different engine (if the server doesn't have the specified engine enabled, it will use the default engine rather than giving an error). It uses non-standard quoting characters and non-standard table name equivalence by default. I can't remember what exactly went wrong with its timezone support, but I remember it was bad enough that we didn't use it. A lot of its functions/columns will allow invalid values by default rather than failing to insert.


transactional DDL => if your application often has schema updates and you use a tool like Doctrine for PHP or Alembic for Python: when an upgrade or downgrade fails midway on MySQL (because the index it tried to create was already taken by someone who "hot-fixed" the database), you're left in an inconsistent state and have to clean things up by hand. You will regret not being on PostgreSQL, where the whole migration would simply have been rolled back, leaving you in a consistent state; you fix the migration, run the command again, and you can go back home.
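
A minimal sketch of what transactional DDL buys you (hypothetical table name):

    BEGIN;
    ALTER TABLE users ADD COLUMN email text;
    CREATE INDEX users_email_idx ON users (email);
    -- if anything above had failed, the whole migration would roll back;
    -- nothing is left half-applied
    COMMIT;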

hstore/jsonb/array/composite types https://www.postgresql.org/docs/8.1/static/rowtypes.html : arrays are often a good option for implementing a tag system
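
For example (hypothetical schema), a text[] column with a GIN index gives fast tag containment lookups:

    CREATE TABLE posts (id serial PRIMARY KEY, title text, tags text[]);
    CREATE INDEX posts_tags_idx ON posts USING gin (tags);
    -- all posts tagged 'postgres':
    SELECT title FROM posts WHERE tags @> ARRAY['postgres'];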

partial index: imagine you have a lot of "soft deleted" rows (i.e. with a deleted flag set to true); you can create an index that ignores them
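
Something like this (table and column names are made up):

    CREATE INDEX tickets_open_idx ON tickets (customer_id) WHERE NOT deleted;
    -- the index only covers (and is only maintained for) non-deleted rows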

index on expression: you often run queries like "where date = today", but you store timestamps precise to the second. You don't want to pay for date_trunc('day', your_column) over a full scan (a function that doesn't exist in MySQL anyway), nor do you want to create a dedicated column only for the sake of performance; an index on the expression lets you have both.
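
A minimal sketch, assuming a hypothetical events table with a created_at column:

    -- assuming created_at is timestamp without time zone (date_trunc is immutable there)
    CREATE INDEX events_day_idx ON events (date_trunc('day', created_at));
    -- the planner can now use the index for:
    SELECT count(*) FROM events WHERE date_trunc('day', created_at) = '2016-09-25';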

integrated full text search: you have a small team and you don't feel like maintaining one more service for indexing and keeping your search engine in sync - here you are (of course it's not perfect, but better than the option provided by MySQL)
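
A rough sketch (hypothetical posts table; 'english' is one of the built-in text search configurations):

    CREATE INDEX posts_fts_idx ON posts USING gin (to_tsvector('english', body));
    SELECT title FROM posts
     WHERE to_tsvector('english', body) @@ to_tsquery('english', 'replication & postgres');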

table inheritance for partitioning: you create one table "orders" and can easily partition it into "orders_2016", "orders_2015", etc., while still simply selecting things out of "orders"
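
Roughly like this on a 9.x release (hypothetical schema; routing inserts to the right child is typically done with a trigger or by inserting into the child directly):

    CREATE TABLE orders (id bigint, placed_at timestamp, total numeric);
    CREATE TABLE orders_2015 (CHECK (placed_at >= '2015-01-01' AND placed_at < '2016-01-01')) INHERITS (orders);
    CREATE TABLE orders_2016 (CHECK (placed_at >= '2016-01-01' AND placed_at < '2017-01-01')) INHERITS (orders);
    -- with constraint_exclusion enabled, this only scans orders_2016:
    SELECT sum(total) FROM orders WHERE placed_at >= '2016-06-01' AND placed_at < '2016-07-01';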

constraints: your column "event_start" must be before "event_end"? You can enforce that at the table level in PostgreSQL.
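
A minimal sketch:

    CREATE TABLE events (
        event_start timestamp NOT NULL,
        event_end   timestamp NOT NULL,
        CHECK (event_start < event_end)
    );
    -- any INSERT or UPDATE violating the check is rejected by the database itself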

text columns: in PostgreSQL, don't worry about varchar(XXX) with XXX being the subject of flamewars (256? 500? 1000?); the "text" type in PostgreSQL holds up to 1 GB and is as efficient as varchar()

uuid support: PostgreSQL supports uuid natively as a primary key type (without needing to resort to a varchar, of course...)

And I've only been talking about the advantages of PostgreSQL, not the strange defects of MySQL: for example, you can only have one column with a default timestamp, your auto-increments will silently overflow and wrap back to previous ids, and a lot of things are only "warnings" (value not in an enum? fine, I'll insert null) that you will never see in your application code unless you really look hard for them.

Edit: I've used MySQL extensively and have only been using PostgreSQL for two years now. Though I have more knowledge of MySQL optimization and internals, and I don't consider it "bad" - it's just so-so: you will definitely be able to do whatever you want with it and it won't betray you too badly - PostgreSQL is simply in another league and will actively help you.


A few things from this list are out of date. For example, MySQL 5.6 (GA release 3.5 years ago) added support for multiple columns with default timestamp. MySQL 5.7 (GA release 1 year ago) added generated virtual columns, which can be indexed, basically supporting index on expression. And for an extreme example, table partitioning was added in MySQL 5.1, 8 years ago.

The default SQL mode also changed to be strict in 5.7, preventing silent overflows and other oft-complained-about write behaviors. Using a strict SQL mode has been recommended best practice for quite a long time anyway.

MySQL certainly has its flaws, and there are a number of useful features that are present in pg but missing in MySQL. But the gap is smaller than many people realize.


Great list, thanks for the thorough response


A bunch of other relevant features of PostgreSQL:

- table elimination/join removal (MariaDB supports it but MySQL doesn't)

- asynchronous protocol (MySQL's new X Protocol supports it, but its standard protocol doesn't)

- table functions (for example the generate_series function, which is very useful in PostgreSQL; see the sketch after this list)

- LISTEN/NOTIFY (a built-in pub-sub mechanism)

- WITH (Common Table Expressions)

- Window aggregations (very useful for reporting and stats)

- Row Level Security (introduced in PostgreSQL 9.5; very useful for multi-tenant applications)
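
A small sketch combining two of these: generate_series as a table function, plus a window aggregate over its output.

    SELECT d::date AS day,
           count(*) OVER (ORDER BY d) AS days_so_far
      FROM generate_series('2016-09-01'::timestamp,
                           '2016-09-07'::timestamp,
                           interval '1 day') AS t(d);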


utf8 collations. MySQL's utf8mb4_unicode_ci is fantastic for real-world multi-language support. PostgreSQL's utf8 charsets require you to attach a language-specific collation. It makes real-world applications impossible to support on Postgres.


JSONFields for me. I want to use Aurora, but the lack of JSONField makes it a non-starter.


Here is a rationale for multi-master from James Hamilton's "On Designing and Deploying Internet-Scale Services".

Designing for automation, however, involves significant service-model constraints. For example, some of the large services today depend upon database systems with asynchronous replication to a secondary, back-up server. Failing over to the secondary after the primary isn't able to service requests loses some customer data due to replicating asynchronously. However, not failing over to the secondary leads to service downtime for those users whose data is stored on the failed database server. Automating the decision to fail over is hard in this case since it's dependent upon human judgment and accurately estimating the amount of data loss compared to the likely length of the down time. A system designed for automation pays the latency and throughput cost of synchronous replication. And, having done that, failover becomes a simple decision: if the primary is down, route requests to the secondary. This approach is much more amenable to automation and is considerably less error prone.
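
For contrast, opting into that synchronous cost on stock PostgreSQL streaming replication looks roughly like this ('standby1' is a hypothetical standby application_name):

    -- on the primary; ALTER SYSTEM writes postgresql.auto.conf (9.4+)
    ALTER SYSTEM SET synchronous_standby_names = 'standby1';
    SELECT pg_reload_conf();
    -- commits now wait for standby1 to confirm the WAL flush,
    -- so failing over to it cannot lose acknowledged transactions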


I look forward to when this lands on PostgreSQL 9.7 without the need for an extension. But more so when I can also include the Citus DB extension.

Running CitusDB with just 1 master made me nervous. They did talk about having multi-master replication as a belt and braces solution, but I don't know how far they got.

Thinking about it: might using both together give you a fully fault-tolerant solution?



Looking at the defects potential, I'm not sure I would rely on this as more than a "hotter spare" configuration... ex: only write to primary, then secondary if initial fails. Read from wherever.

That said, it seems like this doesn't have the reliability constraints I'd really want to see... however, as a "hotter" spare option, that might be nice. It seems like it wouldn't take much to turn this into a fast auto-failover master-slave model.


Please make sure you understand log replication and go through fire drills for the list of things that can go wrong with bi-directional replication. The last thing you'll want to do is deploy this into production and wing operations as you go.


BDR is a nice building block toward multi-master PostgreSQL. I'm glad parallel aggregates made it into core in 9.6. Using the agg[0] extension for something as core as using more than one core per (aggregate function) query felt strange. (I wonder if the time has come to decouple connections from processes/threads in postgres, as well...)

[0]: http://www.cybertec.at/en/products/agg-parallel-aggregations...


Now that is something very interesting. I would love to use this ASAP! :)


Can someone explain to me the point of BDR? Since the writes must happen on all servers anyway, why not just have a master-slave?


It means you can write to any master, so your application need not be aware of "master" or anything. That's master/master anything, really. It just makes replication strategy transparent to applications and is far simpler to reason about. It's also far harder to implement on the server side, which is why most software you see that handles master/master (especially cross-DC master/master) comes with severe caveats, probably this included. Distributed systems are extremely difficult and come with lots of corner cases.

There's no reason you can't hang slaves off such a setup for various purposes either, I would assume, though I haven't used BDR and I'm not sure if that's supported in this software. The best replication strategy I've played with in general is a master/read slave setup in each facility with master/master between each facility. A lot of stuff is built that way, but most people never worry about datacenter failover so it's not the sort of thing you find on StackOverflow.

If I give you the knowledge that your Gmail inbox "lives" in one datacenter, imagine how you'd architect a backup for when that facility fails. That's where stuff like master/master starts to come in handy, because then you start thinking about things like "why would we build a backup facility and never use it? The user's latency to the backup is lower today."


If you have users in various locations, all of whom would like local access to a copy of the database.


Do all nodes need to be up 100% of the time? If not, how long can a node be down without replicating (perhaps because a server is under maintenance).

Does BDR have rules for primary key insertion conflicts? I have a (perhaps odd) situation where identical data is already being written to multiple servers. Currently handling with a custom replication mechanism.


The nodes don't need to be up 100% of the time. Thanks to replication slots, the WAL on the other nodes will be kept until the node connects again and catches up. So it really depends on how much disk space you have on the other nodes.
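
A rough way to watch how much WAL a disconnected node is pinning (9.6-era function names; they were renamed to the pg_wal_* forms in later releases):

    SELECT slot_name, active,
           pg_size_pretty(pg_xlog_location_diff(pg_current_xlog_location(),
                                                restart_lsn)) AS retained_wal
      FROM pg_replication_slots;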

Regarding the PK conflicts - I'm not sure I understand the question, but it's possible to use global sequences (which is one of the parts that did not make it into core yet). Otherwise it'll generate conflicts, and you'll have to resolve them somehow (e.g. it's possible to implement a custom conflict handler).

See http://bdr-project.org/docs/stable/conflicts.html


Is there some sort of consensus mechanism at work here, or is it closer to circular replication a la MySQL?


I have not used 2ndQuadrant's BDR extension. Can anyone comment on how easy it is to set up?


Does anyone know of good replication options for Postgres in dynamic environments like Kubernetes?



