How to Scale PostgreSQL on AWS: Learnings from Citus Cloud (citusdata.com)
177 points by twakefield on March 10, 2017 | 42 comments



Citus are doing a fantastic job on content marketing. Every single piece they publish on https://www.citusdata.com/blog/ is a case-study in how to write content (and headlines) that appeal to the kinds of developers their product targets.

"How to Scale PostgreSQL on AWS: Learnings from Citus Cloud" - seriously, how am I as a PostgreSQL-liking developer who cares about scalability NOT going to click through to that article?


Simon, thanks so much for the kind words. It's nice to know that the content is useful. In general we're focused on sharing both content about how to use Citus and what it's good at, but also good and interesting engineering content as well as we want to add value to our users and broader community. It's great to hear that we're doing alright from your perspective on that front.


Really appreciate the quality of the content, there is some real value to it.


Citus Cloud is perhaps most exciting to me because it has tremendous momentum: as the combined product of deep technical expertise, top-flight open source software, and tons of end-user experience, it's quickly outpacing platforms that are locked-in anachronisms. Take Redshift: Postgres 8.4? After you've used some of the features in 9.6, it's hard to go back. It'd be interesting to see some numbers around Citus Cloud's battle-tested deployments.

As a side note, these blog posts on high-level techniques and open source tools (e.g. PgBouncer, wal-e) are useful for anyone considering deploying an on-prem version of Citus as part of a product – thanks, Ozgun!

Usual disclaimers apply: not an employee, but big fan of the team and technology and it's great to see them gaining well-deserved mindshare.


I saw the section on EBS, but it didn't offer much advice. Getting good performance on networked storage is the biggest challenge to me. The last time I asked about that here [1], I got this answer:

    nasalgoat 161 days ago [-]

    The secret to EBS is to use General SSD,
    not Provisioned, but use a RAID stripe.

    The reason this works is because IOPS are provisioned
    per EBS drive and by the size of the drive. So a RAID0 
    stripe of, say, ten General SSD drives will outperform 
    the more expensive PIOPS single drive.
That sounds like a great approach, although I haven't had time to try it out yet. I'm curious if anyone else has done anything like that.

[1] https://news.ycombinator.com/item?id=12609172
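Spelling out the arithmetic behind that comment - a rough sketch, with circa-2017 us-east-1 prices assumed (gp2 around $0.10/GB-month; io1 around $0.125/GB-month plus $0.065 per provisioned IOPS) and per-volume io1 limits ignored:

```python
# Rough arithmetic for a RAID0 stripe of gp2 volumes vs. a single io1
# (PIOPS) volume provisioned at the same IOPS. Prices are assumptions
# (us-east-1, circa 2017).

GP2_PRICE_PER_GB = 0.10        # $/GB-month (assumed)
IO1_PRICE_PER_GB = 0.125       # $/GB-month (assumed)
IO1_PRICE_PER_IOPS = 0.065     # $/provisioned-IOPS-month (assumed)
GP2_BASELINE_IOPS_PER_GIB = 3  # gp2 baseline rate
GP2_MAX_IOPS = 10_000          # per-volume gp2 cap at the time

def gp2_stripe(n_volumes, gib_each):
    """Aggregate baseline IOPS and monthly cost of a RAID0 gp2 stripe."""
    iops = n_volumes * min(gib_each * GP2_BASELINE_IOPS_PER_GIB, GP2_MAX_IOPS)
    cost = n_volumes * gib_each * GP2_PRICE_PER_GB
    return iops, cost

def io1_cost(gib, iops):
    """Monthly cost of a single io1 volume provisioned at `iops`."""
    return gib * IO1_PRICE_PER_GB + iops * IO1_PRICE_PER_IOPS

stripe_iops, stripe_cost = gp2_stripe(10, 500)  # ten 500GiB gp2 volumes
single_cost = io1_cost(5000, stripe_iops)       # one 5TiB io1, same IOPS

print(f"gp2 stripe: {stripe_iops} baseline IOPS, ${stripe_cost:.0f}/month")
print(f"io1 volume: {stripe_iops} provisioned IOPS, ${single_cost:.0f}/month")
```

With those assumed prices, the stripe delivers the same baseline IOPS for roughly a third of the io1 price, which is the effect the quoted comment describes.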


Citus Cloud uses that approach. I designed it with the arithmetic from the page about gp2, http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolume....

A useful inflection point is the "minimum size for max throughput" at 214GiB. That same disk also offers 38 minutes at the maximum burst rate of 3,000 IOPS. After that, performance drops to about a fifth: that's the baseline for a 214GiB disk. Another important thing to consider is that large contiguous reads and writes on EBS can be up to 256KiB per I/O operation: most bulk transfers consume far fewer IOPS than people typically calculate.

For most workloads, burst is good: for a fixed amount of money, you'll be able to obtain meaningfully better responsiveness from a program. Few workloads push the maximum IOPS all the time. If you need more baseline IOPS, you can make gp2 disks bigger or add more disks, depending on whether you've saturated bandwidth or not.
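The burst figures above fall out of the credit-bucket model in the AWS docs: each gp2 volume starts with 5.4 million I/O credits, refills at its baseline rate of 3 IOPS per GiB, and can burst to 3,000 IOPS until the bucket drains. A small sketch of that arithmetic:

```python
# gp2 burst-bucket arithmetic from the AWS EBS docs: 5.4M initial I/O
# credits per volume, refilled at 3 IOPS per GiB, spent while bursting
# at up to 3,000 IOPS.

BURST_BUCKET_CREDITS = 5_400_000
BURST_IOPS = 3_000
BASELINE_IOPS_PER_GIB = 3

def burst_minutes(size_gib):
    """Minutes a gp2 volume can sustain 3,000 IOPS before falling back
    to its baseline rate."""
    baseline = size_gib * BASELINE_IOPS_PER_GIB
    if baseline >= BURST_IOPS:
        return float("inf")  # the bucket refills as fast as it drains
    net_drain_per_second = BURST_IOPS - baseline
    return BURST_BUCKET_CREDITS / net_drain_per_second / 60

print(f"{burst_minutes(214):.0f} minutes")  # a 214GiB volume: ~38 minutes
```

For a 214GiB disk (baseline 642 IOPS) this comes out to roughly 38 minutes of full-rate burst, matching the figure quoted above; volumes of 1TB and up never drain the bucket at all.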


Are Citus Cloud customers generally IOPS constrained rather than throughput constrained? You're also constrained by the EBS-optimized interconnect, right? http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimi...


Also, have you considered using i3 instances for their superior throughput and IOPS per dollar? Obviously you can't rely on EBS-snapshot-based backup/restore workflows on those instances, but maybe the performance gain is compelling enough to make the exercise worthwhile?


I have rejected this on the basis that stopping and starting an instance is too useful for solving problems. Even with a few striped EBS disks, instance issues solvable by re-attaching the disks via stop/start are probably ten to one hundred times more likely to occur than a disk failure.

Still, I know some people do things this way, relying on archives and HA for redundancy, and it can be pretty good.


As a random sample of operating time, probably latency constrained, where something like a local disk would be nice. We do have occasional spikes for bulk writes and reads, and the good throughput of large EBS I/Os is instrumental there.


Use multiple big* SSD volumes in RAID 0 on big* instances.

* IOPS and bandwidth are proportional to the size of volumes

* instance types have a hard cap on network and IO, proportional to the size of the instance.

Otherwise, you can try local SSD instances but there are only a few and they are incredibly expensive.


I'm working on an approach using ZFS on K8s. The idea is to write a custom "flexvolume" driver that takes a set of EBS volumes and options, attaches them, creates a zpool, sets options, etc. In tandem, I'm adding to each node host a systemd service that creates dm-crypt+LVM devices atop each NVMe drive (as in the i3 instance types). The flexvolume driver will support adding SLOG and/or L2ARC devices by provisioning LVM volumes from those.

Something I'm still unclear on is whether EBS does any data checksumming, in which case I could disable ZFS's checksums to spare CPU cycles (but keep compression, of course).
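For illustration, a hypothetical sketch of the mount path of such a driver, assuming the EBS devices are already attached to the node. The argv/JSON call convention approximates the Kubernetes flexvolume contract, and the "devices"/"pool" option names are invented for this sketch:

```python
#!/usr/bin/env python
"""Hypothetical flexvolume driver sketch: assemble a set of pre-attached
EBS devices into a zpool mounted at the requested path. Option names
("devices", "pool") are invented; the argv/JSON protocol approximates
the flexvolume driver contract."""
import json
import subprocess
import sys


def mount(mount_dir, opts):
    """Create a zpool across the given devices, mounted at mount_dir."""
    devices = opts["devices"].split(",")   # e.g. "/dev/xvdf,/dev/xvdg"
    pool = opts.get("pool", "data")
    subprocess.check_call(["zpool", "create", "-m", mount_dir, pool] + devices)
    # Keep compression on, per the comment above; checksums could be
    # disabled here if EBS turns out to checksum data itself.
    subprocess.check_call(["zfs", "set", "compression=lz4", pool])


if __name__ == "__main__" and len(sys.argv) > 1:
    if sys.argv[1] == "init":
        print(json.dumps({"status": "Success"}))
    elif sys.argv[1] == "mount":
        mount(sys.argv[2], json.loads(sys.argv[3]))
        print(json.dumps({"status": "Success"}))
```

A real driver would also need unmount/detach handling and error reporting in the JSON status, plus the SLOG/L2ARC attachment the parent comment describes.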


> I'm adding to each node host a systemd service to create dm-crypt+LVM devices atop of each NVMe drive

That sounds cool; can you describe how you're using systemd for this in a bit more detail? I do this with an Ansible playbook when the machines are provisioned.


Sounds great. Is this approach somehow related to ClusterHQ's?


We use MemSQL, and it has the best replication setup process of any relational database, with one line:

  REPLICATE DATABASE db_name FROM master_user[:master_password]@master_host[:master_port][/master_db_name]
Why is it that in 2017 we still don't have any other database that can come close to this? Basic replication is well understood and used everywhere, but it seems like database creators just don't understand what should be prioritized.


The stuff in postgres 10 isn't that much different (for logical replication). On the master you do

  CREATE PUBLICATION all_tables FOR ALL TABLES;

or

  CREATE PUBLICATION important_tables FOR TABLE transactions, users;
and then on the node you want to replicate to

  CREATE SUBSCRIPTION backup_important CONNECTION 'dbname=foo host=bar user=repuser' PUBLICATION important_tables;


What about editing the postgresql.conf, pg_hba.conf, recovery.conf files, setting up archiving and doing the first base backup transfer? That's the actual hard part.

Logical replication doesn't replicate DDL commands either, from what I know so far, so that's another major problem.


> What about editing the postgresql.conf, pg_hba.conf, recovery.conf files, setting up archiving and doing the first base backup transfer? That's the actual hard part.

I was talking about logical replication here... But the defaults for those are going to be a lot saner in 10, too. We're not going to expose Postgres to the world by default, so you'll have to change listen_addresses - but relaxing anything beyond that would be a really bad deal security-wise.

> setting up archiving

You don't really need that anymore in the more basic cases. Unless you want/need point-in-time recovery to older states, replication slots make that a lot easier.

> doing the first base backup transfer?

    psql -h master -c "SELECT pg_create_physical_replication_slot('my_standby');"
    pg_basebackup -h master --write-recovery-conf --slot my_standby -D targetdirectory

I'm not saying things are perfect, but they're clearly improving. And some of the complexity (e.g. not allowing remote connections by default) is a pretty basic security-vs-simplicity tradeoff.


Default remote access off is perfectly fine; we don't need the Mongo-style security problems.

However, the rest just seems to be an overbuilt mess. Why not start simple, with barely any options, and work up from there, instead of trying to build in so many different configurations and features?

90% of use cases would be satisfied with a simple async replica on a different server started with any user that has read access to that database and a single command, along with the ability to become a standalone master anytime.

Redis (yes, it's a different database) has another nice and easy setup with a simple "slaveof" command.


When you issue that command in MemSQL, what does it actually do?

Physical replication in PostgreSQL is a bit more work to set up, but its behaviour is rather straightforward. As far as I understand it, all it really does is the same as a standalone database, except that instead of generating WAL segments in response to user queries that are then applied to the database, it gets them by streaming them from the master.

It's easy to see how the system behaves when nodes fail.

Similarly, a PITR is achieved by telling PostgreSQL to (optionally) retrieve WAL at startup from an archive and apply it until the desired point. The mechanism is identical to streaming replication, except non-continuous.
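For instance, a minimal pre-10 recovery.conf for that kind of PITR might look like this (the archive path and target time are placeholders):

```
restore_command = 'cp /mnt/server/archivedir/%f "%p"'
recovery_target_time = '2017-03-10 12:00:00'
recovery_target_action = 'promote'
```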

Referring to your comment below: postgresql.conf, pg_hba.conf and recovery.conf are data that can't really live in the database, because they're needed before the database can even start up.

With logical replication in PostgreSQL 10, you could feasibly implement a command identical to MemSQL's, but the problem is that its behaviour is no longer quite so straightforward to understand compared to physical replication.


It does everything. There's nothing else to configure. MemSQL is a distributed SQL database so a "cluster" is multiple machines supporting a single logical interface to multiple databases.

Another cluster in another datacenter, made up of multiple machines, can then issue this command, which will automatically contact the master cluster, set up connections in parallel between several machines, create the replica database, ask the master to create a new snapshot, stream the snapshot over, build the database, and then subscribe to the change-log stream, including data + schema changes, all async and very fast.

Yes, this is most commonly referred to as "physical replication" but it covers about 90% of use cases where a database needs a replica somewhere else.

Postgres is great, but too often its features get blown out into ridiculous complexity instead of making the basics work well without tweaking anything, then building from there with optional configuration for greater functionality.


I was actually asking what it does mechanically, not what it does conceptually. My main issue with these advanced "easy" solutions is that while they're easy to set up, that ease often comes at the cost of a missing understanding of behaviour under load/failure scenarios.

I'm not saying PostgreSQL couldn't improve (it definitely can), but when it comes to data, I advocate simplicity over ease.

Very quick reading gives me the impression that MemSQL's replication is actually per-partition logical replication (it might be physical, if partitions are physically separate databases, but they don't seem to be) and uses an algorithm (I didn't find which) to decide which node to fail over to when something occurs. It also seems to differentiate between metadata nodes and storage/worker nodes.


I was looking for the "I want my database to be performant under high random load" question. PIOPS can hurt.

Anyone have any experience running PostgreSQL on the new I3 instances?


https://news.ycombinator.com/item?id=13842222

Not yet, but you're not the only one with that idea!


I have mentioned this on some previously posted articles, but we are really happy users of both Citus and PipelineDB.

Check out PipelineDB if you are a Postgres fan (obviously it is for a different use case than Citus).

The only thing I don't like about PipelineDB is that it is currently a fork and not an extension, but that is supposed to change.

Consequently we syndicate to Citus and PipelineDB through RabbitMQ and Kafka.

We use Google Cloud as well. I'm contemplating writing a post on what we have learned (and not :)) but I don't think I could ever match the quality of this article.

And yes, invariably someone will mention that MemSQL does both, but it is proprietary and not Postgres. I probably should have spent more time investigating it, though (and eventually will).


Does Citus (Cloud?) have features that offer better high availability and failover functionality than what RDS provides? Managed Patroni and packaged workflows for zero-downtime failover would be quite interesting, but I don't see anything like that mentioned on https://www.citusdata.com/product/cloud.


We don't use Patroni or any of the other off the shelf items. We rolled our own primarily from our years of experience on Heroku Postgres. We're actually working on a detailed post on how HA and disaster recovery works for Citus Cloud, though the core mechanism powering it all under the covers is our state machine. You can read a bit about how it works in this post: https://www.citusdata.com/blog/2016/08/12/state-machines-to-...


Thanks! I look forward to reading about implementation details.

It does seem like documentation of Citus Cloud HA and disaster recovery behavior is a bit light, compared to e.g.

- http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concep...

- https://devcenter.heroku.com/articles/heroku-postgres-ha

"In addition to continuous protection which is explained above, high availability is available if your application requires less exposure to downtime. We provision stand-bys if you select high availability at provisioning time. This can be for your primary node, or for your distributed nodes." https://docs.citusdata.com/en/v6.1/cloud/features.html

Please let me know if I've missed any resources on this topic!


Yep, our docs are definitely behind the product here; we'll be working to improve. FWIW, it would be closest to Heroku Postgres HA, as it's the same person that wrote both systems.


Why is it seemingly impossible to read a technical blog post on a company blog without some seven-year-old-humor-type meme mixed in?


I wonder how Postgres Aurora will fare against Citus... that's what we're considering migrating to in the next year or so.


Aurora PostgreSQL and Citus are pretty different. The big draw of the former is that it provides PostgreSQL as a managed service on top of a cloud-optimized flash storage layer that facilitates high availability, durability, and low-latency read replicas for PostgreSQL workloads that fit on a single machine. The latter provides a product (or managed service in the form of Citus Cloud) that performs transparent sharding that allows for scaling relational workloads beyond the limits of a single machine. Most workloads will favor one or the other of those scenarios pretty clearly, either by not stressing the capabilities of a single machine or by being easily shard-able.


Yeah, Aurora is great for scaling already-fast reads, but if you want to scale writes or fit large indices in memory, it's not going to help you.


Any plans to take Citus in more of a data-warehousey, complex-queries direction over time? We are starting to hit the limits of Postgres 9.6 in our testing and would like to move to a columnar store, but Redshift is hosted-only, Teradata looks expensive, and Greenplum looks old.


Try ClickHouse if you need a columnar DBMS: https://github.com/yandex/ClickHouse

We started using it recently and it's been amazing.


That looks like a great suggestion, thanks! It looks like we'd need to break out our JSON data into columns (our events table is basically just an event name, time, user id, and then a jsonb column) and rewrite our queries, but it seems more suitable than anything else I've seen so far.

We've been getting good results with some tweaks to our Postgres indexing (partial indexes on event names makes a huge difference) and that'll be even more practical when we can move to logical replication, so will be sticking with that for now.


I feel DB hosting is such an underrated field right now. In terms of scaling, everything is pretty easy to scale except databases. I would love to see more services like this.


Aren't most of these use cases already offered/handled by Amazon RDS? Maybe not transparent sharding - but otherwise everything else?


So basically everything except horizontal scaling?

Seems like a pretty big difference...


Why does the community link on your pricing page lead to a 404?


Thanks for reporting, looks like a place we missed updating for our new tutorial. Should be fixed now.


1. I’d like my PostgreSQL database to be Highly Available

Highlight: "The first is the complexity associated with it: it takes twelve steps to setup streaming replication ... open source solutions such as Governor and Patroni aim to do just that. That said, this integration again comes with a complexity cost."

I cannot believe it is 2017 and streaming replication is still considered complex. I have spent the last half decade+ of my life trying to make this simple; here is a demo: https://youtu.be/-i-11T5ZI9o

2. I’d like my application to not worry about failovers

Highlight: "most PostgreSQL clients don’t have a mechanism to automatically retry different endpoints in case of a failure."

Master-slave systems are not conducive to failover (determining a new master involves its own locking/election mechanisms). If you have streaming master-master replication by default, you can have easy automatic failover - https://youtu.be/-FN_J3etdvY .

4. I’d like my database to scale horizontally

Highlight: "Deploying a distributed RDBMS into production requires a good understanding of both relational databases and distributed systems."

We can do a lot of work to improve understanding out there; Kyle Kingsbury (Aphyr of the Jepsen tests) has done a lot to spread awareness. A couple of years ago I gave a tech talk that explains the ideas with stick figures, so that even laypersons can understand what is going on: http://gun.js.org/distributed/matters.html

5. I’d like automatic backups for disaster recovery

Highlight: "Distributed database backups are even harder."

See the demo in (1); this doesn't have to be hard - it can be easy enough for frontend web developers IF the system is a streaming master-master database to begin with. On top of that, check out our "backup to S3" prototype, where we scaled to 100M+ messages for $10/day (all costs: CPU, disk, S3): https://www.youtube.com/watch?v=x_WqBuEA7s8

My goal and argument here is that database vendors keep propagating the message of "this is hard, so trust us and pay" for systems that Aphyr has repeatedly proven to be broken (although, actually, Postgres did really well; Kyle was recommending it as the best general-purpose database) - as Craig notes himself: "In fact, I’ve been on calls where we quoted $300K for the services work, and never heard from that user again."

We need to break these cycles, and I do believe Craig is trying to do that with these blog posts, which is great. But we have a long way to go (all of us).



