Experiences with running PostgreSQL on Kubernetes

vasco · on Jan 22, 2018

So they say it's hard to run it in Kubernetes because they weren't running it in StatefulSets. Then they say you can actually run it properly in a StatefulSet but hand-wave it away with "but people run it on top of Ceph which has issues with latency". That sounds like a shitty excuse to just dismiss the whole thing as too hard. If one particular underlying storage system wasn't good, use a better one. Kubernetes allows you to abstract the actual storage implementation to whatever you want so it's not like there's lack of options.

Also I'd be curious to see if "most people" actually use Ceph vs network storage like EBS volumes where AWS guarantees me that I won't have data corruption issues in exchange for money.

twakefield · on Jan 22, 2018

"That sounds like a shitty excuse to just dismiss the whole thing as too hard"

Disclosure: I work at the company that published this post.

I read it differently (albeit, I have much more context). I read it as a cautionary tail that Kubernetes makes it easier to get in trouble if you don't know what you are doing - so you better have a deep knowledge of you stateful workloads and technology. Perhaps obvious to some, but still a good reminder when dealing with a well-hyped technology like Kubernetes.

toomuchtodo · on Jan 22, 2018

I appreciate pragmatic replies like this, providing opportunities to gracefully get off the hype train for those who bought in.

tyingq · on Jan 23, 2018

I agree. Kubernetes seems clear that "pets" isn't their strong suit. I'm surprised at the surprise here.

They will likely get better at it over time. Until then, either run it outside of K8S or deal with the warts.

aberoham · on Jan 22, 2018

Sasha's experience running Postgres under Kubernetes started before StatefulSets were a thing, back in the days of PetSets. At the time of his tech diligence the best option appeared to be pg/stolon https://github.com/sorintlab/stolon. The gist of the interview is that there are many more edge cases with HA Postgres + leader election than you can reasonably imagine at first glance.

Our experience building and running 24x7x365 HA Kubernetes clusters for clients is almost exclusively in air-gapped or on-premise environments where EBS (or even an IOPS QoS guarantee) does not exist. Most SaaS and managed service provider teams we encounter would much prefer to leverage a cloud provider's primitives than DIY.

rattray · on Jan 23, 2018

Would you say that EBS is a good solution for Postgres-on-K8s storage, for users who can tolerate Cloud storage?

oppositelock · on Jan 23, 2018

I run a bunch of Kubernetes microservices which use Postgres, and I treat Postgres in the same way as etcd - it runs on its own box, or RDS or CloudSQL, and the containers talk to it there. When you do it like this, you can use all your favorite postgres management tools, and not have to worry about Kubernetes doing something unexpected.

If you want any semblance of DB performance from Pg as a container, you've got to give it its own data volumes, mount that into its container, etc. Not sure what the benefit of containerizing it would be. Shared tenancy on kube workers makes your performance unpredictable as well, and if you use affinity to run it on a dedicated worker, what's the point?

willejs · on Jan 23, 2018

I think he makes some really great points here, and obviously has some pretty good experience at running data stores in the real world, at scale.

I think his comment about stateful sets is more focused about running data stores on distributed filesystems, or network attached storage. This isn't a good idea generally, and I am not going to go into why, but Cassandra and Etcd advise against this for a reason, its well documented. It wouldn't just be using a 'better' storage system, it would be designing and implementing a very complex operator as he mentions.

The real take away which I think we should all realise is that, this is shiny tech to run mission critical data stores on, unless you know what your doing and want to invest a lot of time and effort in to it. Right now it's probably better to use RDS/Cloud SQL, or just run a traditional postgres setup with a SAN and decent failover that is tried and tested.

carterehsmith · on Jan 23, 2018

>> obviously has some pretty good experience at running data stores in the real world, at scale.

I missed that part. Did they mention it in the article? Perhaps they should mention that kind of stuff more prominently. Numbers... and such.

SteveNuts · on Jan 22, 2018

> network storage like EBS volumes where AWS guarantees me that I won't have data corruption issues in exchange for money.

Source? I've never heard that claim, and in fact we've lost data on EBS (though it has been a couple years)

jdc0589 · on Jan 23, 2018

seconded. As far as I know, EBS makes no claim to replicate a copy outside of your availability zone or anything to garuntee data integrity in the event of a major region wide issue

dsmithatx · on Jan 23, 2018

EBS is replicated inside of the AZ (Availability Zone) that it is in which is what makes it durable. An AZ is a data center.

You could always sync to another volume in another region if you were concerned about a longer term region outage. Note that RDS already mostly uses EBS.

dijit · on Jan 23, 2018

I am highly skeptical of container systems for persistence. Docker does not have disk I/O as a first class citizen for a reason, they're focusing (rightfully) on isolated compute and deployment/dependencies, and kubernetes, to my mind builds on that quite nicely.

I generally avoid abstractions for persistence layers, but I believe I'm the minority and I believe I'm going against what the industry desires me to do.

I'm not convinced this is a good idea /yet/. I did, however, see some interesting docker/kubernetes integration with ScaleIO (clustered filesystem) which cut out huge chunks of the disk I/O pipeline (for performance) and was highly resilient.

The demo I saw was using postgresql, the dude yanked the cord out of the host running the postgresql pod.

Quite impressive in my opinion.

https://github.com/thecodeteam/rexray

https://github.com/kubernetes/examples/blob/master/staging/v...

llama052 · on Jan 22, 2018

Reminds of the Jurassic park quote:

"Your scientists were so preoccupied with whether or not they could, they didn’t stop to think if they should."

old-gregg · on Jan 22, 2018

Yes they should. It's nice to have a (ultimately) data-center-wide scheduling platform with unified role-based access control, monitoring, utilization optimization and reporting, cost control and even exotic features like hardware validation. Just because earlier Kubernetes versions weren't good at everything at once doesn't mean it's doomed forever to be unsuitable for something.

Moreover, the use case he's discussing is quite unusual: Gravitational takes a snapshot of an existing Kubernetes cluster (including all applications inside of it) and gives you a single-file installer you can sell into on-premise private environments, basically it's InstallShield + live updating for cloud software. So, running everything on Kubernetes opens entirely new markets for a SaaS company to sell to.

This level of zero-effort application introspection hasn't been possible prior to Kubernetes, so that's another reason to use it for everything: it promises true infrastructure independence (i.e. developers do not have to even touch AWS APIs) that actually works.

ris · on Jan 22, 2018

> basically it's InstallShield + live updating for cloud software

Sounds like a nightmare to me.

tomsthumb · on Jan 23, 2018

Honestly, kube is a no-brainer thank-you-sweet-baby-jesus improvement in ops posture over an easy handful of shops I’ve come across.

bonesss · on Jan 23, 2018

Just to clarify, as someone who is fully on the bandwagon: would you recommend k8s for PostgresSQL loads? Perhaps without high availability?

Postgres on Kubernetes seems close to the DevOps ideal for making lots of little "edge databases" and limited backends for uninteresting webservices...

old-gregg · on Jan 23, 2018

Short answer is "yes", but I would keep in mind that Kubernetes isn't just one monolithic tool, it's a toolbox and you don't have to use all of its features for everything.

It's absolutely fine to pin your Postgres to a handful of pre-selected hosts with locally attached storage fully exposed to a Postgres container. Yes, this won't be semi-magically "moving databases around" (not needed in most cases) but you'll still be getting other k8s benefits I listed above.

But even if you feel adventurous and want to have a fully dynamic storage under your RDBMS, there are tools for this now in open source / commercially supported form [1].

[1] https://portworx.com

bonesss · on Jan 23, 2018

Thanks, and much appreciated! :D

antoncohen · on Jan 23, 2018

I think the core point of this article is that to run replicated DBs on Kubernetes requires deep knowledge of the DBs in question. You can't just use a StatefulSet and expect it to work well.

You need to override Kubernetes' built-in controllers to customize them to the details of the database, for example when is it safe to failover, or when is it safe to scale down. Outside of Kubernetes these decisions are made by cloud services like RDS, or manually by people with knowledge like DBAs.

To put that knowledge into Kubernetes you need Custom Resource Definitions (CRDs).

I know know if it is any good, but I just found a project called KubeDB that has CRDs for various DBs: https://kubedb.com/ and https://github.com/kubedb

joshberkus · on Jan 23, 2018

I wouldn't say that you need deep knowledge at this point. However, you do need at least journeyman level knowledge. Lots of folks (including me, on Patroni) are working to make more things automatic and lower the knowledge barrier, but we're not there yet.

A big part of the obstacle is that preserving state in a distributed environment is just hard, no matter what your technology, and the failure cases are generally catastrophic (lose all data, everywhere). This is true both for the new distributed databases, and for the retrofits of the older databases. So building DBMSes which can be flawlessly deployed by junior admins on random Kubernetes clusters requires a lot of plumbing and hundreds of test cases, which are hard to construct if you don't have a $big budget for cloud time in order to test things like multi-zone netsplits and other Chaos Monkey operations.

Making distributed databases simple and reliable is a lot like writing software for airplanes, but clearly that's possible, it's just hard and will take a while.

joshberkus · on Jan 23, 2018

Also, the article does show us the kind of knowledge that admins will always need to have, such as the tradeoffs between asynchronous and synchronous replication.

UK-Al05 · on Jan 22, 2018

I don't get it, all the failure modes talked are not kubernetes specific. They happen if your running any ha database cluster?

If you have async replication with a large amount of lag, well of course your gonna lose data if master goes down if your not careful. Regardless if your using kubernetes or not...

Can anyone explain why these failure modes are kubernetes specific? They just sound like things you have to think about regardless if your running a HA cluster...

_lqaf · on Jan 22, 2018

They're not kubernetes-specific, but apparently there are people who think deploying PG in it will magically Just Work(tm).

What I took away from TFA is two-fold: clustered-DB administration is hard to automate (e.g., there's a reason DBAs exist), and a lot of the tooling DBAs use now has to be rebuilt to function in a k8 environment, or replaced.

The first struck me as blindingly obvious; the second I find highly relevant, personally.

zokier · on Jan 22, 2018

> clustered-DB administration is hard to automate (e.g., there's a reason DBAs exist)

This is why I'm excited about CockroachDB; it hopefully will make operating DB cluster easier.

bonesss · on Jan 23, 2018

Without pretending I've ever done it in production: I'd have to assume that distributed databases will make the hop into a Kubernetes environment easier than databases that started monolithic and added replication over time.

Most of the issues presented in the article relate to data loss during network segregation and leader election. Those are important, but distributed systems are generally a bit more explicit in their CAP compromises.

emmelaich · on Jan 23, 2018

It's not Kubernetes specific and that is actually addressed in the article!

It's not postgres specific either. It will happen whenever you have a cluster or HA solution which is not intimately cognizant of the app it is supporting.

Shit, it is hard enough to get any database extremely reliable even with their own solutions (Oracle RAC, Postgres SLONY, mysql replication).

Adding Kubernetes / Docker to that mix makes for an interesting life. Caution is advised.

drdaeman · on Jan 22, 2018

I think it's not K8s-specific but specific to anything when there is an external automated orchestration that can mess with the software at random (schedule and reschedule deployments etc etc).

If you have N servers that are deployed by hand or, better say, with some automation ran by hand - you can be sure things are not moving at random. If you schedule a payload in K8s (or Docker Swarm or whatever) - the scheduler will only ask you for preferences, but will otherwise make decisions on its own. Like deciding to drain a node because disk is getting full or whatever. Or just schedule an updated service to another node because it felt like doing so. And most of software - Postgres included - doesn't have any idea about all of this.

Some of the points raised (like leader election) are not specific to orchestration, but some (like persistent storage for containers) are sort of unique to containerized world. Unless there is a crazy sysadmin who can walk into server room and randomly shuffle drives. :)

majewsky · on Jan 22, 2018

Note that all those difficulties in the article only apply when you a high-availability setup. While usually appropriate for services with a defined service level, you can get away with single-replica DBs for a lot of things.

At work, we run OpenStack on Kubernetes with Postgres for persistence, and it's entirely okay if Postgres fails for, say, an hour, because we don't have a defined SLA on the OpenStack API. The important thing is that the customer payloads (VMs, SDN assets, LBs) keep working when OpenStack is down, which they do.

zzzcpan · on Jan 22, 2018

You will probably still encounter data loss.

There are two main things to understand. First is that your application is running in a distributed environment and will encounter data loss if it is not designed for it. Second, even if it is properly designed to run in a distributed environment it's also has to be aware that it's running on kubernetes, configured specifically for it.

majewsky · on Jan 23, 2018

> First is that your application is running in a distributed environment and will encounter data loss if it is not designed for it.

The data itself is on a self-replicating storage (and Postgres obtains the proper exclusive locks when using it), so I don't care if it runs on Kubernetes or a Raspberry Pi.

jmkirby · on Jan 22, 2018

I think I'm just echoing rb808's comment, below, who is more informative than I can be, but I originally thought to comment only that my impression without any kubernetes experience is that the likelihood is, based upon my impressions of the article alone, is that in all probability this article is recording a attempt to coerce the inappropriate solution to deliver a much more difficult to achieve result than at least the scope of the article indicates is appreciated.

I could have just said that even knowing nothing much beyond cursory reading about kubernetes, the article comes across as a excercise in attaining disappointment thru hurried assumptions about what constitutes both a silver bullet and the daemon to be dispatched from unruliness.

The part that is disconcerting is the introduction to the article as a interview with the CTO, but it only takes a turn for the worse almost immediately by admitting to I'm production deployment of the solution, to which the subsequent admission to encountered difficulties is not compounding the sin do much as burying this entire excercise beneath condemnation, if I simply put down the impression conveyed. This has to be at the very least terrible PR. I'm increasingly concerned too, about the abundance of misapprehension of not only the capabilities of file systems but just fundamental design constraints, at s level of understanding that I would have expected to be fired for from a operations position in any of my customers. Have I missed the redeeming features in my haste to comment? It just feels so imbalanced and insecure to be so forthright about the level of accomplishment that's claimed.

qaq · on Jan 22, 2018

https://github.com/kubernetes/charts/tree/master/incubator/p... There is Patroni Helm chart for K8s

aberoham · on Jan 22, 2018

Josh Berkus gave a really good overview of Patroni at KubeCon Austin: https://youtu.be/Zn1vd7sQ_bc

jcastro · on Jan 23, 2018

Semi-related, Josh just gave some great advice during the last k8s office hours regarding running postgres databases: https://youtu.be/Aj0yozuQ0ME?t=50m39s

kureikain · on Jan 23, 2018

I have run kafka, postgres, cassandra on K8S but eventually I move off Postgres and Cassandra to normal server.

Ideally StatefulSet does help a bit. Such as with StatefulSet you have DNS and hostname like service0.namespace etc. And you can change StatefulManifest without updating the pod. Pod only get updates when we deleted it(OnDelete updating policy) or rolling update with staging parition which mimick real server behaviour where we can pause/restart process on a server.

However, what I realized is resource scheduling and EBS volume.

1. Soon I realized the node run db pod should only run db and I make a dedicated node pool for it.

When this occurs, It feels like eventually I'm provisioning a server to run this workload.

2. EBS volume cannot mount to other zone. So it really annoying when I kill a pod and it cannot start because the volume cannot attach to node in other zone.

And when we need to upgrade Kubernetes itself in an immutable way, mean kill old node and bring up new node. It's a pain to control that process carefully to avoid casscade node re-balancing/replication.

More over, the ability to easily goes in server and edit/tweak config is lost with K8S. We have to use ConfigMap, some trick of init container, entry point script to generate custom configuration file etc.

An example is broker id in case of kafka or slave-id in case of MySQL. In other words, I feel like running stateful service on K8S is no longer a joy.

Once I move these stateful service out of K8S, suddenly everything is so smooth. Running and upgrading K8S itself become a walk in the part.

Also, on AWS, when you have a large amount of server, the chance that you got AWS notification about node replacement/rebooting (old hardware, host migration) is very frequent. Dealing with these when all of node have stateful service running on is not easy.

solatic · on Jan 22, 2018

So it seems like their TL:DR is: 1) Either go with a prebuilt solution like Citus, or be prepared to build an external service locator that allows a human to define the leader so that a human admin can manually trigger the failover; 2) Don't forget that it's a DB and it needs fast storage; 3) Nothing is new under the sun and you need to beware leaky abstractions.

This doesn't seem to me like a reason not to run PG on K8s?

joshberkus · on Jan 23, 2018

Yes, and (1) no longer applies because service locators which locate the master are now easy to do.

brugidou · on Jan 22, 2018

Clearly running stateful services is hard on something like Kubernetes without delegating volume management to something like ceph.

Would something like Mesos resource reservation mechanism with persistent volumes do the job? When you run on premise you usually want to recover from a temporary failure or reboot and maybe run a special admin script if you feel like the node is not going to come back anytime soon.

manigandham · on Jan 23, 2018

At this point, my bet is on CockroachDB and other database systems built from the ground up to be natively distributed. It will be far easier for them to build functionality (especially the 80% that most people ever use) then it will be to bolt on and coerce the same distributed behavior for a single-node RDBMS.

mikekchar · on Jan 23, 2018

The thing about stuff that is hard is that it's usually hard no matter which way you look at it. I don't really know anything about CockroachDB, but I use Couch on a daily basis. I like the way it works, but it's a massive foot-gun if you don't realise that you have to do things completely differently. In our shop we a variety of DBs (Couch, Postgres, MySQL and even MSSQL). If my only concern was scaling, I would not choose Couch. While it makes scaling and replication easier, it does so by making a whole ton of things harder (and basically unscalable, ironically). You still need to choose the tool most appropriate for your problem. (In case you are wondering, Couch is awesome for problems where you need versioned, immutable data -- for example financial systems. But there are a lot of trade offs with respect to querying).

jontro · on Jan 22, 2018

Wouldn't you have to wait until all replicas are in sync before a commit can be completed? Even though there might be a low replica lag I would be really careful of depending on transactions in application code using this.

aberoham · on Jan 22, 2018

Pg streaming replication is asynchronous by default and there are some big caveats besides performance when it comes to synchronous use cases, esp around automatic fail-over / recovery. The pg/stolon FAQ hints at some of these issues:

1. https://github.com/sorintlab/stolon/blob/master/doc/faq.md#h...

maga_2020 · on Jan 23, 2018

in the video referenced by @aberoham , a the end there is a question to the presenter (Josh Berkus), about the 2 types of replication their deployment tool is supporting.

this was on about 33rd minute of the presentation

He suggested , that if you need to use , synchronous replication, to use PostgreSQL 10 with synchronous quorum replication, so that it will not block transactions if one of the replicas fail.

joshberkus · on Jan 23, 2018

Yes, that's correct. If you are going to use sync rep because you can't lose transactions, you really want to use the latest version of PostgreSQL, which supports quorum sync (i.e. "one of these three replicas must ack"), even in complex configurations ("one replica from each availability zone must ack"). Note, though, that the existing HA automation solutions (Patroni, Stolon) don't currently have support for complex topographies, so you'd need to do some hacking.

It is a tradeoff though. With synch rep, you are at a minimum adding the cost of two network trips to the latency of each write (distributed consistent databases like Cassandra pay this cost as well, which is why they tend to have relatively high write latency). It turns out that a lot of users are willing to lose a few transactions in a combination failure case instead of having each write take three times as long.

Postgres also has some "in between" modes because write transactions can be individually marked as synch or asynch, so less critical writes can be faster. I believe that Cassandra has something similar.

UK-Al05 · on Jan 22, 2018

The synchronous replication is a difficult from a performance point of view.

bechampion · on Jan 23, 2018

I'm still not a big fan to run stateful services in a platform mostly built for stateless . Also not a huge psql user , but replication and failover in psql has always been a half-done job, kube isn't gonna fix that.

vermaden · on Jan 23, 2018

Whole team just to 'manage' ONE virtualization 'technology'?

No thanks.

One admin is more then enough to cover and master FreeBSD Jails and many other technologies ... no just Kubernetes.

collyw · on Jan 23, 2018

I don't know much about Kubernetes, but isn't the conventional wisdom that the database should not be containerised? Has that changed?

parasubvert · on Jan 27, 2018

No, but Kubernetes is at peak hype right now. It’s a floor wax and a dessert topping.