Self-hosting a high-availability Postgres cluster on Kubernetes (ryan-schachte.com)
107 points by siamese_puff 9 months ago | 90 comments



Just like with cloud providers, it seems to me that with Kubernetes it is not a matter of if but when orchestration problems will arise. Especially in this case of hosting databases, the composition of provisioning complexities (DB operational complexity on top of k8s operational complexity) is really scary.

Is there any way to overcome hidden complexity biting your hand other than studying k8s extensively?


This kind of resilience is a form of art, and it's also kind of a full-time job.

I would not advise trying this for a "side project at work".

Generally we all agree that we move to the cloud and it's "fully managed". If it goes down - that's the price we pay.

If you have to ask the question "what if the RDS goes down", then you are really in a different universe.

That last guarantee of uptime requires a ton of work, testing, and money, because you are all the way up and to the right on the curve of diminishing returns.


My company also runs on a k8s cluster, but I agree with this fully and I keep my DB outside the cluster. For me, the order of preference is: RDS, Cloud Instance, Bare metal DB server, something else that is not kubernetes.

RDS will fail, but it will likely come back up without much action on my part. K8s will fail, and you will spend untold human hours figuring out the k8s failure modes before figuring out the database failure modes (which are likely quite straightforward). It's just a cost that is not worth it.


> If you have to ask the question "what if the RDS goes down", then you are really in a different universe.

It does go down though, don’t neglect the possibility because it likely will happen. With very average workloads, I’ve seen RDS databases restart unexpectedly, read replicas being completely out of service, and even databases being completely frozen (can’t even connect as root).

I’d still go with managed, but it certainly doesn’t give full reliability :) you still have to consider “what if it goes down” - it will!


Problem is that RDS comes at a price. It is purely about operation cost.

When you have 1500+ databases these costs add up. At that point, these kinds of techniques are required to self-host the databases. The price per DB with HPA and VPA is way lower than what you would pay for managed databases, and you can hire a full-time devops+dbadmin and still come out cheaper.


This is a perfect example of a cost savings opportunity that offers no benefit and is really just a foot gun when it comes to operational complexity. The cloud cost savings almost never justify the operational costs here. There's typically a very long list of other cost optimizations to be made before this would ever be on the table for me. If you're talking about 1500+ databases and think a single full-time devops+dbadmin is some kind of unicorn person who is going to be a better option, I feel bad for both of you.


I am not implying hiring a single full-time dbadmin; you can hire a team of them and still be cheaper than what AWS would cost at that scale.


Except that typical cloud prices are dialed in for the USA, and in my experience it's not that unusual for the cloud savings to be high enough to fund more than one highly skilled engineer who will have a large positive impact all around.


> That last guarantee of uptime requires a ton of work, testing, and money, because you are all the way up and to the right on the curve of diminishing returns.

Not when it's a core part of your business...?


I've never been a fan of deploying DBs, either traditional RDBMS or distributed, under containerization/orchestration other than quickly spinning up dev and test environments. Certainly not for production. Databases have been built for high availability for many decades. Sure, maybe RDBMS doesn't scale quite as easily as a node API application, but it's still not really rocket science and you usually have some leeway if you get a spike.

Adding Swarm or K8 just seems like redundant and unnecessary complexity.


Wouldn't most data-driven, disk-IO-intensive workloads be a bad fit for k8s? Especially when it's not purely read-only.

That seems the crux of it.


This seems to be the common advice given, but I don't fully agree. There have been many times in my career where a DB was on a VM with the storage attached via a cloud provider's block storage. When asked if we should move it to k8s, people are quick to mention that k8s doesn't do well with persistent storage. However, all of the big cloud providers offer the ability to easily create a persistent volume in k8s that then just creates a block device, attaches it to the k8s host, and makes it available to the pod.

So in both situations you have the same IO limits of block storage. The question is whether the k8s persistent volume API adds enough of an IO bottleneck to cause issues; IME that isn't the case.

Now if you want direct-attached NVMe drives for higher IO than network-attached block storage will give, then it might be easier with a VM vs k8s, but I can't speak to that much.
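To make that concrete: on the k8s side it's just a claim against the provider's block-storage class, and the scheduler attaches the disk to whatever node the pod lands on. A rough sketch (the claim name and the "gp3" class are placeholders):

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: pg-data                  # placeholder name
    spec:
      accessModes:
        - ReadWriteOnce              # single-node attach, same model as a block device on a VM
      storageClassName: gp3          # assumption: the provider's block-storage class
      resources:
        requests:
          storage: 100Gi

Reference that claim from the pod's volumes and you get the same attach/detach behaviour you'd otherwise script by hand for a VM.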


Given k8s's origin as a bare-metal-oriented system, attaching physical SAN volumes was there pretty early on, and now it's only more capable in that area (including passing through devices for running exotic stuff).


Absolutely. I haven't gotten into the weeds of K8 volumes, but certainly have done a lot with Docker volumes, including adding some PRs to fix some issues I've run into. There is a lot under the hood with Docker volumes. With how DBs like to optimize their IO operations, I'm frankly a little amazed how it somehow mostly works between the DB -> orchestrator volumes -> file system.


K8s mostly drops the Docker volumes code (and many other parts of Docker, TBH). You can pass a random mount point or even a raw device and it will have only the overhead of kernel IO code for namespaces, overhead you hit anyway.

The only "docker graph" stuff is in OCI container data itself.


It works totally fine on k8s with iscsi/fcoe/nvme-backed volumes, provided the remote targets themselves offer sufficient performance for your DB workload.


Not really. Those are two areas you need to really understand when things go south.

In Kubernetes you "solve" a lot of application complexity by abstracting it into things like Helm Charts or Operators. It's fun and easy to use operators to deploy databases, monitoring, mesh, minio... This does not make complexity disappear, it just abstracts it by creating another layer with an interface (k8s API) you may be more familiar with.

Kubernetes in practice is about adding more and more layers on top of it. All those layers/tools are usually amazing, but soon you need some real k8s skills AND some app skills to handle even unavoidable scenarios like ... upgrades.

That's why it's so confusing sometimes and attracts such diverse opinions. There aren't many environments that run a vanilla cluster...


In the case of running Postgres on K8s, the problem arises immediately when you try to resize a data volume and you can't because the API doesn't support it. K8s is not really for stateful systems yet, and with systems like Postgres that prefer to manage their own resources, you don't want another layer that doesn't cooperate getting in your way.


I don't agree here. There are operators like the one I'm a maintainer of (CloudNativePG) that work directly with Kubernetes, teaching it how to handle Postgres clusters as a coordinated set of instances. Enormous improvements have been made in the last couple of years, and we are particularly focused on working together with the storage groups in Kubernetes to handle database workloads, for example with declarative support for tablespaces and volume snapshots.
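For a flavour of the declarative approach, a minimal Cluster resource looks roughly like this (the name and sizes are just placeholders):

    apiVersion: postgresql.cnpg.io/v1
    kind: Cluster
    metadata:
      name: pg-example
    spec:
      instances: 3        # one primary plus replicas, managed as a coordinated set
      storage:
        size: 20Gi

The operator handles bootstrapping, failover and replica management from that single resource.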

I suggest you read the CloudNativePG documentation as well as this article: https://www.cncf.io/blog/2023/09/29/recommended-architecture...

Also watch the video of my talk at last Kubecon in Chicago about handling very large databases.

I hope this helps.


Thanks for working on CNPG!

I have been a very happy user of CNPG, even with occasional issues (database backup to GCS tripped me up a few times, but it works - mostly a bit of UX where I was never sure whether the failure was mine).

Now I only really need to add some automation for handling "recover the database and switch clients over to it" (I understand why CNPG doesn't do recovery into an existing database, but it is a bit annoying).


"K8s is not really for stateful systems"

As a relative novice in the space, I'm grateful to hear someone say this out loud. K8s seems perfect to me for quickly scaling transient stuff like pipeline workers and web servers, but I've always been pretty leery of giving up the trivial snapshotting and rollbacks and other creature comforts of old-school virtualization when it comes to deploying long-running applications, databases, and so on. And I've always felt kind of guilty for not being on board to just mindlessly k8s-all-the-things.


Our company is finally looking at containerization and orchestration, and one product had the gall to say, Docker isn't for us. The non-technical people gasped because all the other products are moving towards orchestration!

Why? It's an ancient Windows client/server app. Each "node" manages its own state and communicates with the others in this proprietary, janky-ass way. It takes 10 minutes to start up a node.

K8/Swarm isn't going to do squat for this team except maybe launch dev/test environments a little easier.


>but I've always been pretty leery of giving up the trivial snapshotting and rollbacks and other creature comforts of old-school virtualization when it comes to deploying long running applications,

You can do that too with k8s with APIs which support more than just one backend.


I am the author of the k8s volume resizing feature; it's been GA for a while and the feedback we have gotten so far has been good. If anything, running inside k8s makes it relatively easy to support resizing. You just need to specify a new size for the PVC and it will both perform the resize at the cloud provider and grow the file system (if needed).
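For example, expanding a volume is just bumping the requested size on the PVC - a sketch with made-up names (the storage class needs allowVolumeExpansion: true):

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: pg-data                # placeholder claim
    spec:
      accessModes:
        - ReadWriteOnce
      storageClassName: gp3        # assumption: an expandable class
      resources:
        requests:
          storage: 200Gi           # bumped from 100Gi; the CSI driver grows the disk and filesystem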

Modifying IOPS and other volume attributes is something less frequently needed, but we just released alpha support for that too, if you need it.

We have also added support for reporting volume usage in the CSI spec, which I know some operators use to automatically resize volumes when a certain threshold is reached (I do not, however, recommend using ephemeral metrics to automate something like this). But the point is: you can actually define CRDs that persist volume usage and have them used by a higher-level operator.

Another thing is - k8s makes it relatively easy to take snapshots, which can be automated too, and that should give you additional peace of mind if something goes haywire.
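A snapshot is likewise just another resource you create (or have a controller create on a schedule); a sketch, assuming a CSI driver with snapshot support and an existing VolumeSnapshotClass:

    apiVersion: snapshot.storage.k8s.io/v1
    kind: VolumeSnapshot
    metadata:
      name: pg-data-before-upgrade
    spec:
      volumeSnapshotClassName: csi-snapclass    # placeholder class
      source:
        persistentVolumeClaimName: pg-data      # placeholder PVC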

Obviously I am biased and I know there are some lingering issues that require manual intervention when using stateful workloads (such as when a node crashes), but k8s should be just as good for running stateful workloads IMO.

Another thing is - k8s volumes are nothing but bind mounts from the host namespace into the container's namespace, and hence there should be no performance penalty for using them.


Here is an announcement from last year stating that volume expansion is stable since 1.24: https://kubernetes.io/blog/2022/05/05/volume-expansion-ga/

The zalando postgres-operator also mentions as a feature:

> Live volume resize without pod restarts (AWS EBS, PVC)


What about changing iops or storage class?



> ...the problem arises immediately when you try to resize a data volume and you can't because the API doesn't support it

What API doesn't support it? k8s has support for resizing PVs and has for a while now. AFAIK all 3 cloud providers (and more) support increasing the PV using their storage class.

> K8s is not really for stateful systems, yet

Is this written somewhere or is it just your opinion?


Volume resizing support has been there since ~1.11, which is like 5 years old at this point.


> Is there any way to overcome hidden complexity biting your hand other than studying k8s extensively?

No. Technology is hard. Operating services is complex, and gets more complex as the scale gets bigger.

Any attempt at making stuff simpler is usually either moving complexity elsewhere or making it more expensive (and in some cases, both).


I am happy using CockroachDB. The performance is not as good, since all your database writes require a 2-out-of-3 quorum. But managing the database with the CockroachDB operator is pretty simple, since it can perform a rolling upgrade with no downtime.

Upgrades are handled with an operator, and happen by waiting for all queries to finish, draining all connections, and restarting the pod with the newer version. The application can connect to any pod without any difference.

I perform upgrades twice a year, have never really worried about it, and have never had any availability problems with the database, even when GCP decides to restart the nodes to update the underlying k8s version.


My advice is pay someone to run the cluster, and you can use all that time you saved on k8s complexity to operate the stuff inside the cluster and save some money.

Features like k8s Operators are essentially professionally-written code that stands in for a human agent that would take actions such as managing DB nodes, renewing certs, and performing backups. If you use mature operators, you can save a lot of money for a bit of effort.


Is the complexity better or worse than alternatives? What are the alternatives?

People use complexity as a boogieman to justify throwing together their own really wild chaotic & only so-so tested "simple" alternatives all the time.

To me, this feels like a modern wonder. We have layers of responsibility. Many people operate Kubernetes clusters already for all kinds of reasons. It provides a powerful broad base. Now on top of that, we can operate Postgres, with a very smart failover system that's super well tested & broadly used, that leverages this competent starting place.

What are the Fears Uncertainties and Doubts you have that make you scared about composition? What would help address specific concerns? To me, this division of responsibilities & use of consistent platform for a variety of needs feels like a huge win.

People love "simple" options but they're not. The run-naked-through-the-woods-like-savages option has appeal, but just getting started keeps adding up:

Sure, just add some bash scripts for some WAL backups that go off-site. Easy! Install pgbouncer like one does, just a quick install, point it at the right systems. Set up some replication. Install and configure more kind-of-hairy software to make it HA. Configure some TLS cert yourself to not send naked traffic over the wire. Add monitoring!

Then operationally, how quickly do you think you'll be able to fail over (with pgbouncer staying OK), fail back, do an upgrade, add more replicas, lose a replica? Can you rotate certs reliably in a timely fashion? Can your replacement? How well did you document everything? Will you have all the monitoring you need when incidents start coming in, or did you just spitball a couple metrics into place?

The alternatives, in my view, are obviously bad. You can do them. Either cheaply with risk, or industriously with effort & applied talent. But having cohesive autonomic systems at our back that try to help, that can faultlessly do many common tasks with perfect accuracy (across unimaginable numbers of systems, with perfect consistency, in record time): that feels like a massively better place in the universe, one that I don't get why so many people kick, scream, and drag their feet against. Rarely are their arguments well elaborated ("scary"), and their counter-suggestions feel like they massively underrate how multi-faceted & carefully connected production systems are, for good reason, and how hard it can be to remember to not forget to change X when you do Y.


> What are the alternatives?

Installing Postgres on a computer, or on a VM.

Really, just the fact that some people keep asking this question is enough to question everything else they say. The alternatives are obvious.


Exactly. The way people successfully did HA and scalability on databases for more than 20 years before Kubernetes existed.


With a bunch of hand crafted, hairy code to link the necessary features like fail over, backups, PITR, clustering.

I know, I did all of that.

Which is why I punt that effort to a k8s operator these days, because at the end of the day it does everything I did manually, plus it makes it easier for me to spin up a new database that has WAL shipping and backups to a different location.


If you're punting to an operator why k8s? There are lots of managed database providers and if they're managing it, then k8s (or not) is an implementation detail for them to worry about.


These are each specific DB-as-a-services. Instead of sinking my time into learning something universal & competent & capable at many things (Kubernetes) now I'm investing in learning & using a niche service, that I won't have fine control over & whose limits might only be clear down the road. That's a risk.

Sure, the DBaaS might have superb autoscaling and reliability. It may or may not have good OpenTofu/Terraform providers available, or other custom niche deployment tools one would need to pick up & adapt.

Same questions as roughing it and hacking together pg yourself, except now you have no power. Is observability going to be up to snuff? When performance problems come, how are you going to feel when you discover that analysis techniques a and b work fine but c and d can't be done because you are on a managed service that doesn't actually give you full access to the db?

Costs can also add up, and are hard to predict. With a managed service, you'll be using a lot more network, which often has some cost (and some latency too). This isn't for everyone, but Kube makes it easy to start with anything (e.g. an RPi) and scale out. It works with hosted hardware with some local SSD, or I can get a 16-core 7950X, a bunch of gen 5 NVMe drives, and a fraction of a terabyte of RAM for <$2000, and that'll scale me to godlike levels. Whereas with something hosted, I'll just keep paying.

I think it's so, so ideal to have a reliably capable, universal autonomic computing layer underfoot. Most of the backend has had special specific answers to each question, each problem. We have been disjointed. Kubernetes provides a platform applicable to a vast range of workloads, where yes, you need to tune and build different clusters for different workloads, but where the skills transfer much more readily. Everyone in Kubernetes knows how to ship resource definitions/manifests, and that base knowledge is all you need to start consuming services - any services - on Kubernetes. Skill transfer is immediate. Having universal patterns (the API server), backed by autonomic operators making & keeping these resources real: it's so much better than the thousand-different-paths road of technology we've been on until now.


Because I'm not interested in killing performance (connecting to the database over the open internet), my opex (hello cloud bills!), and customer trust (one of the reasons they pay me is because they do not want to trust American cloud vendors), just so I can have slightly less complexity.


Lots of various reasons. Performance, cost, compliance, and reliability are the ones I've personally seen.


My holiday project was doing another pass at my homelab Kubernetes cluster, part of which involved switching to a proper operator to manage Postgres. Coincidentally, I set up cloudnative-pg (https://github.com/cloudnative-pg/cloudnative-pg) yesterday.


I've been using CNPG on my home cluster since it came out, and it's been an absolute pleasure to use. I haven't done a full comparison, but I get the sense that it's learned from (and improved on) the other postgres operators like Zalando and Crunchy.


I'm the founder of OnGres [1] the company behind StackGres [2]. I'd love to hear your feedback if you'd be interested in also trying StackGres. It's one of the most feature-full operators available, has a complete Web Console and REST API and supports close to 200 extensions.

Hope it would be interesting for you.

[1]: https://ongres.com [2]: https://stackgres.io


Any reason you landed on that Operator compared to what OP is using (Zalando)?


I was setting up a fairly important database with the Zalando pg operator, and after good first impressions it went downhill. After about a month of use, the WAL files used for point-in-time recovery started failing to offload to the dedicated nodes and kept growing on the database pods, filling up all the space. I first assumed that maybe there wasn't enough space for some scheduled work (I don't really know the details of how this process works; I assumed the operator should handle all implementation details for me), but even after upscaling the database 2.5x it just kept failing with full storage and requiring manual recovery to bigger storage, most of which was WAL files.

HA didn't handle this case at all; the whole cluster went into a crash loop.

There was also an issue where huge pages caused crashes, with no easy way to disable them without some dirty injection of config files at runtime.

Some of this could be misconfiguration on my side, but I wasn't able to figure out anything better from the docs.


Honestly no, it's mostly due to inexperience with operators and not really understanding what the "best" way to find operators is. I did also look at the Crunchy Data one (I was having some issues setting that one up), but didn't even find Zalando during my search.

OperatorHub is currently the main resource I use, but GitHub stars aren't exposed in the search so I have been looking at the "Capability Level" chart and checking for Github popularity when I find one with the feature support I want.

I'm facing this exact same issue now when trying to find an operator for Redis. I am not sure if I am just missing out on the "right" option by limiting myself to Googling and Operator Hub and looking for the one with the most Github stars, so I am open to tips.


A subtle advantage of cnpg is that it doesn't use statefulsets; instead the operator handles things like mapping storage volumes and stable identities. Regular Kubernetes statefulsets have some tricky sharp edges for failure recovery.

I don't know if all of these alternatives use statefulsets but I remember several doing so.

I've personally found cnpg to be pretty robust, and it supports everything you will eventually need once you're locked into a solution (e.g. robust backups, CDC, replica clusters).

I've yet to find anything of a similar standard for MySQL.

EDIT: it should also be noted that CrunchyData is a proprietary solution and requires a license to use in production. This is not particularly obvious from their docs.


What sharp edges are you referring to with statefulsets?


Cnpg's docs articulate this better than I could: https://cloudnative-pg.io/documentation/1.16/controller/

Statefulsets have their place but are surprisingly inconvenient for database workloads.


I also recommend the Kubegres operator: https://www.kubegres.io/


once upon a time I set up an elastic search cluster in kubernetes

after a lot of tweaking I made it so that the pods would be as big as the underlying hardware nodes. one pod one node. once that was working I realized that I was using the wrong tool for the job.

the kubernetes tooling added nothing but complexity. needless to say I let it run like that, having wasted about a week getting it to work


On the other hand, at a certain scale (running hundreds of ES nodes across 80 or so ES clusters), Kubernetes actually does make a lot of sense.

At work, we moved from hosting Elasticsearch on bare VMs to Kubernetes. By leveraging scheduler policies we are able to pack / over-provision ES node pods of different clusters onto the same Kubernetes nodes, allowing for far greater resource efficiency, while being able to handle node failure and maintain availability across all clusters. Additionally, this simplified operations significantly, as we can now leverage the operator to do cluster-wide operations (e.g. rolling restarts, node OS upgrades, ES version upgrades, etc.) fairly easily.

We did, however, go 6 years (and several hundred million users and trillions of documents indexed) without needing to use Kubernetes!

We will blog about this at some point this year.


Did you keep adding and removing replicas into your cluster based on a scheduler policy? How often did you adjust the number of nodes (and how long did it take to make a node available)?

At the high level at which you are describing your setup, it doesn't make sense. You'd spend way more resources managing any cluster than what you would gain from a normal-looking policy. I seem to be missing some important detail.


You must be. It is very easy. We just let the Kubernetes scheduler take care of everything. We use anti-affinities to disallow scheduling nodes from the same ES cluster on the same k8s node. We use the ES operator, which applies the appropriate PDBs to ensure that operations done to the k8s cluster as a whole don't end up turning off too many things and threatening availability.
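Roughly, those two pieces look like this (the es-cluster label and names are made up):

    # pod template fragment: keep pods of the same ES cluster off the same k8s node
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                es-cluster: search-a
            topologyKey: kubernetes.io/hostname
    ---
    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: search-a-pdb
    spec:
      maxUnavailable: 1      # never voluntarily evict more than one node of this cluster at a time
      selector:
        matchLabels:
          es-cluster: search-a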

Things literally have just worked.

We are about to do a scale up operation of about 6 nodes each for 16 clusters (total of 96 ES data nodes being added.) This was editing a variable in a config file, and pushing that config to kubernetes. The operation will probably take a day to complete as we throttle the speed at which data transfers to new nodes to not adversely affect latency or ingest rate.

The amount of "human time" to kick off this operation and begin the scale up is measured in minutes, then it's just letting it do its thing.


A big part of the problem with Kubernetes is it doesn't make a ton of sense at small scale, and it just plain doesn't work at large scale.

Nomad is generally speaking a much more appropriate technology when you hit the point of needing such a system.


I would consider our scale pretty large here, and it works just fine.


"Zalando is a Postgres operator that facilitates the deployment of a highly available (HA) Postgres cluster."

Zalando is the company. "Postgres Operator" is the software.

Happy user here, not much complaints about the operator come to mind.


I wonder how much of big cloud one can replicate on a diy cluster.

The database works via this. S3 via MinIO. Redis. And maybe OpenFaaS?

That seems like a lot of the key building blocks already.


I'm biased, but I agree! This is a great article that further illustrates how we are a bit conditioned to reach for off-the-shelf vendor tooling for things we could run ourselves.

https://kiwiziti.com/~matt/wireguard/

Of course, there are tradeoffs you have to make (security, uptime, criticality of the system), but homelabs exist as perfect experimentation frameworks.

At $DAYJOB we run a global scale Ceph cluster (S3-like API/object storage), so even at a large scale it's not impossible to imagine.


Is Kubernetes still hard in 2024?


Granted I'm new to devops, only been on a platform team for about 9 months now, but I still feel incredibly dumb every time I try to work with it.

That being said, we're also layering a bunch of stuff on top - helm, nginx, GKE, terraform, as well as a mountain of other things, and then to top it off we have a bunch of shell scripts doing random things to help tie it all together.

Normally I can pick things up pretty quickly. I just built a parser with tree-sitter, despite knowing virtually nothing about language design. Didn't take very long.

But the modern devops stack is a learning curve like I've never seen before. It's taking me more energy to learn it than it did for me to learn programming itself. Then again, maybe I'm just getting old.


I appreciate your shared experience, as I have a rather similar one. I've been on a platform SRE team for about eight months myself, with seven years of SWE experience before that, and feel as though I'm just able to keep my head above water. It's one thing to learn about k8s, the cloud, Terraform, etc., and quite another to pile it all together, particularly since it all becomes heavily customized. It's a different job than code-writing software engineering, that's for sure. To me, it feels less like there's a 'stack' so much as there's a word cloud of DevOps buzzwords to start throwing at problems. Even when directed by architects & principals, it overwhelms.

I resonate with feeling incredibly dumb whenever I pick up a new ticket from our backlog. It feels like gaining deep knowledge of these systems will be a nearly insurmountable challenge. It's been eight months, and while I know far, far more than I did on day one, I feel that every day is a day one of sorts.


Depends (I know, I know...)

If you learn it from the top, immediately jumping into a bunch of deployments, Helm (ugh), and other stuff?

You'll get to deployment fast, but you won't know how to deal with things failing, and there will be a lot of stuff that remains "magic". An easy way to end up cargo culting despite best intentions.

Can be enough if you're just making simple apps to run on it but not running the cluster or the application in production. But you will have a hard time understanding why things work, and everything will be complex upon complex.

Go from the bottom up: learn the basics of kube API patterns, how scheduling works (doesn't have to be in depth), how the kubelet works, how networking works (CNI, why kube-proxy is a band-aid for applications that handle networking badly, how services work), and how storage works (how the kubelet mounts things into containers, etc.). Then how higher-level controllers (aka operators) work - from Pods, through ReplicaSets, to Deployments, StatefulSets, DaemonSets.

This way you'll learn the basic building blocks, which are quite simple despite the long list I just gave, because the architectural and API patterns repeat and build over each other. The core is simple which lets you build complex stuff on top while still understanding it.
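As a concrete anchor for that layering, a bare-bones Deployment is just a Pod template plus a replica count; the Deployment manages a ReplicaSet, which manages the Pods (the app name and image here are arbitrary):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: hello-web
    spec:
      replicas: 3                     # Deployment -> ReplicaSet -> 3 Pods
      selector:
        matchLabels:
          app: hello-web
      template:
        metadata:
          labels:
            app: hello-web
        spec:
          containers:
            - name: web
              image: nginx:1.27       # any stateless image will do
              ports:
                - containerPort: 80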


From recent experience I'd say it's the sort of tech that starts off simple enough with the right distribution, but then gets more complicated the deeper down you dive.

Probably the hardest thing I found was wrapping my head around the way storage works.


I can only speak from the perspective of somebody trying to manipulate k8s environments for use in test, but I'd say yes.

Four times across three companies I have run across frameworks which set everything up out of band and then ask the tests to test it. Maybe these are shell scripts written by the k8siest person on the team. Maybe they're something like tilt. Whatever the case they're always a black box to the majority of the people who are writing application code.

They get you 90% there, but eventually somebody wants assurances that some environment variable has the desired effect, and suddenly you need to penetrate that black box and change it so that there are multiple kinds of "up" and the right tests run against each state.

K8s tooling is commonly installed via curl, so once you unravel the black box and integrate it with your tests you end up with a lot of fragile interfaces to things like kustomize, kubectl, kind... Fragile because maybe the other dev has a different version installed. Nix dev shells solve this, but you can't usually get the whole team on board with Nix so version mismatches come up often and are often difficult to debug. You end up in a state where whoever wrote the initial setup scripts is authoritative about the dependencies, and you have to ask them what they have installed if you want things to work (it was easy for them, they just used whatever was lying around at the time).

These aren't directly deficiencies of k8s, once you see the light (which takes a long time) it's pretty easy to work with, but like so many other technologies, the devil is in the peripheral tooling and the culture. K8s doesn't (yet?) have a very nice boundary with other language ecosystems, it reminds me of Java in that way. The die hard k8s people often want to solve problems by bringing them more fully into the k8s way of seeing the world and I just don't think that is consistent enough with reality to be our everything.

I've cultivated a begrudging respect for it, but I still don't like it. If I break free and start my own company, I'll publish an operator so that my stuff can be installed into k8s, but I don't intend to make it primary in any way.


I've seen projects run on Cloud Run because they fit that model. But if you don't quite fit the model, then you fight the abstraction in odd ways. K8s gives you control over more things, so the easy stuff feels hard but the harder things aren't as hard. With k8s operators you can stand up pretty complex things very quickly and robustly.


IME yes, yes it is. I usually tell people that running a k8s cluster is similar to running your own "cloud". If you want to deploy an EC2 instance, you just tell AWS you want an EC2 instance and you are done (mostly). You don't have to worry about whether the hardware under the EC2 VM has enough resources; you do have to worry about that with k8s. If you want to lock down the EC2 VM to certain permissions, you use AWS IAM; with k8s you have to use cluster roles and cluster role bindings. You can apply these 2 examples to many other things in k8s vs a "cloud" provider: Ingress, persistent volumes, etc.
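For example, the rough RBAC equivalent of a narrowly scoped IAM policy is a Role plus a RoleBinding - a sketch with made-up names:

    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: pod-reader
      namespace: demo
    rules:
      - apiGroups: [""]
        resources: ["pods"]
        verbs: ["get", "list", "watch"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: read-pods
      namespace: demo
    subjects:
      - kind: ServiceAccount
        name: app-sa                  # placeholder service account
        namespace: demo
    roleRef:
      kind: Role
      name: pod-reader
      apiGroup: rbac.authorization.k8s.io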


I've heard people say it's not for small teams. On top of that, I'm not convinced it's a good tool for anything "storage" (in other words, use it for compute/network loads).


It really depends on the design of the workload. Kubernetes makes orchestration more simple, but it doesn't remove all of the complexity of orchestrating something.

If you don't need much orchestration (which is true for a lot of postgres users), the complexity from kubernetes is compounded on top of the complexity from postgres without generating much value.


I don't know about hard, but it's fairly straightforward to spin up clusters, and maintenance seems to have become less of a headache (at least in my experience). It depends on what you will be doing with the cluster and how you use it.


Yes. Orchestration is not easy.


I wonder why more don't take advantage of native k8s and just rely on it to move over the persistent volume and start the new pod. This may have some small amount of downtime but it's a lot less complicated.
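A sketch of that approach: a single replica with a Recreate strategy, so the old pod releases the volume before the replacement starts (names, image and the PVC are placeholders, and the password belongs in a Secret in practice):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: postgres
    spec:
      replicas: 1
      strategy:
        type: Recreate                # don't start the new pod until the old one is gone
      selector:
        matchLabels:
          app: postgres
      template:
        metadata:
          labels:
            app: postgres
        spec:
          containers:
            - name: postgres
              image: postgres:16
              env:
                - name: POSTGRES_PASSWORD
                  value: change-me    # placeholder; use a Secret
              volumeMounts:
                - name: data
                  mountPath: /var/lib/postgresql/data
          volumes:
            - name: data
              persistentVolumeClaim:
                claimName: pg-data    # placeholder PVC

If the node dies, the scheduler starts the pod elsewhere and reattaches the volume - that's the small amount of downtime mentioned above.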


That gets complicated when you're running an HA cluster and need to worry about write conflicts.


dumb question - where is the storage kept?


All the cloud providers offer a storage class for k8s. The storage class allows you to tell k8s that you want a persistent volume (PV), and it will make API calls to the cloud provider to get you a block storage device. You can tell k8s you want to use that PV in your pod, and k8s will automatically mount the block storage to the worker node your pod lives on and make it available to the pod.
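For example, on AWS that storage class is typically backed by the EBS CSI driver - a sketch, assuming the driver is installed (the class name is arbitrary):

    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: fast-block
    provisioner: ebs.csi.aws.com        # assumption: EBS CSI driver
    parameters:
      type: gp3
    allowVolumeExpansion: true
    reclaimPolicy: Delete
    volumeBindingMode: WaitForFirstConsumer

Any PVC that references this class triggers the driver to create and attach an EBS volume on demand.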

OP uses Longhorn which is a whole other thing that I've only read about.

At home you can use other storage backends like Ceph, NFS, etc.


OP here. Currently I am using Longhorn on this cluster which does data replication on SSDs attached directly to the nodes. My backlog item is to run an external NAS with RAID. In this post specifically, the replication is handled by Zalando and not Longhorn, but the storage itself is on each node (specified by the node selector).


The author mentions Longhorn, a storage solution from the people behind the k3s distribution, so I’m assuming the data is stored in pvc provisioned through longhorn.


OP here. In general, yes! That is correct. I am using Longhorn with K3s; it's very slick and easy to get started with. In this case, I'm using a multi-replica cluster with Zalando _only_.


Wherever you want, Kubernetes supports basically every storage backend you can imagine.


> Kubernetes supports basically every storage backend you can imagine.

Text files?


If you can figure out how to create a mount point out of it and provide your own CSI driver, then yes.


So that's a no then.

Creating != Supporting

Anything can be supported by creating your own driver.


As someone who has never used Kubernetes: what if a node crashes? Will I lose everything?


I think people overthink how hard this actually is. Data replication isn't a new concept. You can use RAID with a NAS or set up async replication with operators for things like Postgres/SQL.

Obviously it's worth doing simulated disaster recovery to ensure you would recover if there is a hardware failure. The larger the scale and throughput, with parallel writes against the same keys, etc., the more complicated the setup will be. I hope to write more on this topic, but setting up a persistent volume with a NAS is a great way to ensure high durability.


The answer is: it depends. It depends on whether you use persistent volumes, and how well your PV is isolated from the failed node. If done right, no data loss.


A big emphasis to the "if done right" part.

You should test your setup, because it's very often not done right, and it's easy to overlook a problem.


No, attached storage is not part of the node (not directly). It's something like attaching an external (host) directory to a Docker container. You can kill the node/pod and the storage is not affected; later you can attach a new pod to the same storage.


Why does the author not use a native HA database such as Cassandra or ScyllaDB?


No, just no. K8s shouldn't be used to host database systems. Its main function is microservices.

Cloud providers provide managed versions of Postgres that are highly available. Even if you self-host, Kubernetes isn't the answer.


It's perfectly fit for running databases. HA databases that have built-in replication (Cockroach, TiDB, FoundationDB, etc.) are easier; pg is just not built for proper clustering. Still doable though.



