Postgres scaling advice (cybertec-postgresql.com)
392 points by BrentOzar on Jan 26, 2021 | 200 comments



In the opinion of a last-semester CS student who has never written an application from scratch that needed more than a SQLite DB (so take me with half a grain of salt), it seems like premature optimization, despite being constantly warned against, is very common. I see people talking about using Kubernetes for internal applications and I just can't figure out why. If it's a hobby project and you want to learn Kubernetes, that's a different situation, but in the case of making a real application that people will use, it seems like a lot of us can get away with a single DB, a few workers of our app, and maybe a cache.

I'm speaking out of little experience though. I just think that a lot of us can get away with traditional vertical scaling and not think too much about it.


In big enough organizations, it is very easy to lose track of who owns what, especially when it is those little ad-hoc internal tools. Manually managing the infrastructure for them is a recipe for them to become permanently enshrined in the wasteland of "services we think we use, but do not maintain because we don't remember who needed it or put it up or how to configure it".

K8s isn't the only answer, but if you are already using it for your large applications, it isn't much work to reuse the existing tooling and infrastructure, and now you at least have the Dockerfile as a reference if nothing else.

OTOH, if you have an existing tooling setup / pipeline that is not K8s, there isn't a good reason to use it for a small application.


Docker and k8s adoption can force a company that has, over years, developed and perfected a standard practice for deploying applications to replace it with a half-ass'd solution that ends up solving the wrong problems and costing way more. The "shipping beats perfecting" mantra is very much at play here, because of the amount of time it would take to achieve parity. At the end of the day, the new solution ends up looking like a step back.

Such a practice, combined with the mentality that software engineers should do their own dev-ops, can easily lead to an environment of spaghetti applications where every developer working on the new platform does things slightly differently, because the replacement solution wasn't complete and had to be patched with countless band-aids by engineers across the band of talent.

Furthermore, for the features that did achieve parity, you're now forcing your entire organization to re-learn the development process. The docs are abysmal, and the software engineers who developed the original solution have moved on, since only the hard work of "productionalization" remains and they're not interested in that.


> Docker and k8s adoption can force a company that has, over years, developed and perfected a standard practice for deploying applications to replace it with a half-ass'd solution that ends up solving the wrong problems and costing way more.

Docker and k8s adoption can also force a company that has, over years, developed and perfected a half-ass'd practice for deploying applications to replace it with a single(-ish) source of truth that ends up solving the right organizational problems instead of the wrong technical ones (and still costing way more, at least in the short term).


Having a Dockerfile that copies a few binary blobs into an age-old distro image isn't an improvement, it's a huge liability. And most of that stuff that no one knows anything about anymore is like that. Same as with an old VM or PM.

I'd rather have that old crap as a physical machine. Why? Because the hardware lifetime "naturally" limits the lifetime of such applications. If the hardware dies, it forces a decision to spend some money to keep it running or throw it away, which, given that hardware is expensive, usually results in throwing it away.


Set up your Dockerfile to be part of your CI so that your binary blobs are built from source with regularity? That's typically the solution I've seen work well. Manually maintained stuff (especially stuff that isn't the thing everyone is primarily working on) generally doesn't scale well without automation (speaking as someone who's seen organizations grow). This is also true of "getting started" guides. Can't tell you how much maintenance and run time I've saved converting those wikis to Python scripts.


Yes, of course. That would be ideal. That's what we do for everything we can control.

But as someone in the IT dept., far too often you get some container that was built either by someone who left the company long ago or by an external consultant who got paid to never return. The source code is usually unavailable, and if it is available, it will only build on that one laptop the consultant used. The IT department gets left with the instruction "run it, it was expensive" and "no, you don't get any budget to fix it". That results in the aforementioned containers of doom...

Yes, I'm bitter and cynical. Yes, I'm leaving as soon as I can :)


I hate to raise a seemingly obvious point, but this doesn't seem like a problem with Docker.


In theory that works. In practice it rarely does. Docker et al gained popularity because they made it way more practical for projects to be managed as the world works rather than as it should be, for good or ill. Before Docker it was moving old applications to VMs and before that it was running them in chroot horror shows.


> the hardware lifetime "naturally" limits the lifetime of such applications

Oh, my sweet summer child. Of all the things that "naturally" limits the lifetime of such applications, that is not it. Consider the case of the mainframes running the US banking system, for example.


I can still remember the multinational that wanted to host the web application we developed for them at their data centre. They wanted to charge us (not the business unit, for some odd reason) nearly €20,000 per year to host a RoR web application.

We ended up hosting it ourselves, and in the last year they were paying €100k per year for it, as we would just sell the same setup for each deployment for their customers. They would probably have been better off hosting it themselves.


What you're describing is called resumé-driven development. It happens every few years when people want to cash in on trends/buzzwords that they believe will be disruptive to all industries but that are just tools to have in the toolbox for most. New tools pop up all the time that fit this mould. Over the past ten years I can think of Hadoop (big data), MongoDB (NoSQL), Kubernetes, "serverless" computing, and TensorFlow. While all these tools have legitimate use cases, they are often overused due to marketing or industry hype.

Adding artificial intelligence to your recipe application is unlikely to make any sense, but people do it because they want to have AI software engineer on their resumé.


For artificial intelligence, I think it's more often marketing driven development. It's easier to seem disruptive if you claim to have AI in your product. Easier to get funding and have people talk about your company. I feel like it comes more often from business executives than technical people.


You reminded me of the Philips toothbrush with AI. Marketing like this makes the term AI worthless.


I think it's easier to run a small k8s cluster than it is to attempt to recreate a lot of the functionality provided manually, especially if you're running in a cloud where the control plane is handled for you.

It provides unified secrets management, automatic service discovery and traffic routing, controllable deployments, resource quotas, incredibly easy monitoring (with something like a prometheus operator).

Being able to have your entire prod environment defined in declarative yml is just so much better than running something like a mishmash of ansible playbooks.

If your application runs on a single host and you don't really care about SLAs or zero downtime deploys, sure, use some adhoc deploy scripts or docker compose. Any more than that, and I think k8s pays for itself.


Agreed. What is it that people are doing (or not doing) where a simple managed k8s cluster is more work than the minimum way to do this?

Are teams not even setting up automated builds and just sshing into a box?


For me that's Heroku. I just push my app (RoR) and I'm done. I actually moved it to k8s once for a few months and decided to move it back after I understood how easy Heroku made the ops side of the business.

Note: it’s a side project, 50 business users, $5k annual revenue, 4h per month as target for the time spent on customer support, admin and maintenance. So it’s both easy to pay heroku and important for me to not spend too much time on Ops.


k8s is not simple to learn for legacy admin teams, even those who have some container experience. It is simple to kubespray a semi-working cluster on-prem, or to cloud-provider-k8s your way along, but if you need to actually learn how to deal with multiple ingress controllers, a service mesh, multiple storage providers, affinity, GPU-accelerated apps, securing k8s, and the other problems already solved in the legacy world (in myriad ways), k8s can be regarded as a disruptive interruption to your ability to operate. All this being said, I lead both an IoT edge (via k3s) and a science (cloud & air gapped 'vanilla' k8s) platform team and appreciate the chance to sell k8s and make easy money.


> I see people talking about using Kubernetes for internal applications and I just can't figure out why.

There is benefit in having established platforms for running your code, and this is especially true for large orgs where the people who run the systems are an entirely different group from those that developed or assembled it. And people (+ their skills) are what cost the most money in any business.

It's true that many/most systems don't require a full Kubernetes stack (for instance), but if a critical mass of the business IT is going that way, doing the same with your own systems makes sense from an economies-of-scale PoV.


> There is benefit in having established platforms for running your code

You do know k8s is very new, and there's a constant stream of changes and updates to it, etc., right? It's not established. It's known, but that's it.


1.0 was released in 2015. There are stable LTS vendors for it.

It's pretty established. And much saner than cobbling together Ansible/Puppet/Chef playbooks for everything.


Saying that 2021's Kubernetes is established because 1.0 was released in 2015 is like saying that 1991's Linux is stable because Unix had existed for 20 years at that point. Kubernetes 1.0 and 1.20 share the same name, design principles and a certain amount of API compatibility, but it's impossible to take a nontrivial application running on 1.20 and just `kubectl apply` it on 1.0. Too much has changed.

Kubernetes is just now entering the realm of "becoming stable". Maybe in five years or so it'll finally be boring (in the best sense of the word) like Postgres.


Of course 1.20 has a myriad of additional features. But the 1.0 concepts are in 1.20; the fundamentals are stable. Schedule and run containers, expose them to the external network via a load balancer (or node port).

The declarative aspect is stable. Yes, many people are writing insane go programs to emit templated ksonnet or whatever that itself has a lot of bash embedded, but that's the equivalent of putting too much bash into the aforementioned configuration/orchestration playbooks.


Playbooks are terrible. They are a replacement for expert knowledge of platform tooling. There is no replacement for expertise and knowledge of the platform.

Serious problems are always reduced to understanding the platform, not the playbook. Ansible and the python ecosystem are especially broken. I will _never_ use another playbook to replace mature ssh driven deployments.


Yep, agreed. I found that the active control loops (coupled with the forgiving "just crashloop until our dependencies are up" approach) that k8s provides/promotes are the only sane way to ensure complex deployments. (The concepts could be used to create a new config management platform, but it would be really hard, as most of the building blocks are not idempotent, and making them such usually requires wrapping them and combining them with very I/O and CPU heavy reset/undo operations, blowing caches and basically starting from scratch.)


Hmm. The expectations and engineering of a platform like Kubernetes, and of 'fail till right' itself, require a lot of alignment work to get to the point where 'fail till right' actually works.

There is no silver bullet. Automation and self-healing are a selling point, but when they hit engineering they usually turn out to be a dud in terms of incorporation into existing environments.

The real novelty would be to generate a declarative description from the customer and provide an in place deployment solution via k8s. That would be the ultimate replacement solution.


I think what they may mean is more that there is an established platform within the organization. One that you have expertise and experience with, monitoring/backup/security tools that work with it, etc. K8s might not have as long a pedigree as VMs, but if you already have a setup to run K8s, people who know how to use it, and documentation and tooling that allow devs to run their apps on it securely and efficiently, it's pretty reasonable to want to encourage devs to "just" use k8s if they want to stand up a new service, rather than spinning up whatever random collection of technologies that dev happens to know better.


Well, firstly, the core components of k8s are pretty well understood now, and the APIs to them are not just going to change overnight.

Secondly, it doesn't really matter how old the platform is. As I said, it comes down to how familiar the operations team are with it. If you are running an on-premise or hybrid cloud, ops will want something familiar over all environments. They're not going to be happy with k8s on one, ansible on another, etc.


> but in the case of making a real application that people will use, it seems like a lot of us can get away with a single DB, a few workers of our app, and maybe a cache.

A couple of points:

1. Kubernetes can run monoliths. It's certainly not exclusive to microservices or SOA. It's just a compute scheduler, quite similar to AWS's EC2 reservations and auto-scaling groups (ASG's).

2. I can't speak for every corporation, but if you already have patterns for one platform (note: "platform" in this context means compute scheduling, e.g. AWS, GCP, Kubernetes, serverless) then you will inevitably try to copy patterns you already implement internally. A lot of the time, for better or for worse, you end up with something other than what fits best, unless what fits best and what you already have available are highly conflicting.

3. A lot of times "scaling" is actually code for multi-tenancy. As an industry, we should probably be explicit when we're scaling for throughput, redundancy, and/or isolation. They are not the same thing and at times at odds with each other.

4. I don't really like your use of "real application" here as it implies some level of architectural hierarchy. My main takeaway after 10+ years of professional development is that architectures are often highly contextual to resource availability, platform access, and personal preferences. Sometimes there's a variable of languages too, because some languages make microservice architecture quite easy while others make it a royal PITA.


I know one of the biggest e-commerce shops in Asia that was using one big DB with multiple read-only slaves in a monolithic architecture for more than 5 years.

However, driven not only by DB performance but also by the need to organize hundreds of engineers, they adopted a microservice architecture. They are now slowly migrating to domain-specific DBs; it is just the classic microservice migration story.

While a single DB may take us a pretty long way, designing the system with disciplined, domain-level logical segregation will help when there is a need to move to microservices.

*Looks like HN readers are quite sensitive to microservice migration comments; this kind of comment usually gets downvoted easily.


Stack Overflow runs what is essentially a monolithic architecture. Though they do have a few services, it isn't what I would describe as a micro-service architecture.

https://stackexchange.com/performance


Monzo (UK bank) has 1600+ microservices, but mandates a common framework/library and uses Cassandra. (Which is basically a shared nothing, bring your own schema "database".)

So it makes sense to combine advantages of different approaches.


using non-ACID db for financial services probably requires lots of trickery.


it is web scale though


Upvoted, but I am not sure tokopedia is even in the top 10 in Asia.

Also, the fascination with GMV tends to make it look like high scalability is required. In another HN discussion, someone mentioned running the database for an e-commerce site that had $1 billion GMV a few years back. Assuming a conservative $5 per order, that translates to about 6 orders per second on average.


It's resumé-driven development, and it's also entertainment-driven development. Bringing in new technologies gives you a chance to play with a new toy. That's an effective way to make your job more interesting when the thing you're supposed to be working on is boring. Which, in business applications, is more often than not the case.


In today's job market resumé-driven development is a very rational choice. I work in medical devices so we are pretty conservative and generally way behind the cutting edge. This makes it really hard to find jobs at non-medical companies. I would recommend anybody who has the chance to use the latest and shiniest stuff to do so, because it's good for your career.


Very good point. Seems like yet another example of how carefully optimizing all the individual parts of a system can paradoxically de-optimize the overall system.


I haven’t worked at a FAANG or any other company even close to that level of scale, so you can take me with half a grain of salt too.

But what you said is absolutely true. It’s also something you will very much experience once you start working professionally.

I’m in no position to give you advice, and I think I might be giving advice to myself...just don’t let it get to you.


I have worked at FAANGs before.

I am in firm agreement. I think that far too many people are trying to solve the problems that they wish that they had, rather than making it easy to solve the ones that they do have. Going to something like Kubernetes when there is no particular reason for it is a good example of that trend.

When you really need distributed, there is no substitute. But far more think that they need it than do.


I'm not saying this applies to all companies, but as you grow you have lots of different teams with different needs. So you spin up a tools team to manage the engineering infrastructure, since you can't do it ad hoc anymore (CI, source control, etc). Then, to make that team more efficient, you let them enforce one-size-fits-all solutions. While this may feel constraining for a given problem domain, it actually makes engineers more portable between projects within the company, which is valuable. Thus having one DB or cloud thing that's supported for all teams and all applications is valuable, even if sometimes it isn't necessarily the absolute best fit (and the complexity is similarly reduced, as good companies will ensure there's tooling to make those complex things easy to configure in consistent ways). Your tools team and the project team will work together to smooth out any friction points. Why? Because for larger numbers of engineers collaborating, this is an efficient organization that takes advantage of specialization. A generalist may know a bit about everything (useful when starting), but a domain expert will be far more equipped to develop solutions (better results when you have the headcount).


> I see people talking about using Kubernetes for internal applications.

I think the important issue when first starting a project is to create a "12 Factor App" so that if and when you create a Docker image and/or run the application in Kubernetes, you don't have to rewrite the entire application. Most of the tools I write run on the CLI but I am in fact a fan of Kubernetes for services, message processing and certain batch jobs simply because I don't have to manage their life-cycles.


12-factor apps sacrifice performance and simplicity of your environment for scalability. Unless you are guaranteed to start with a worldwide audience, it's complete overkill. A better solution is to write your application with the rules in mind, with the goal of making it easy to transition to a 12-factor-style app when it's needed. Scaling up, then scaling out, will result in the best performance for your users.


The 12 factors are mostly common sense that apply in pretty much any situation - they help with fancy deployments but also with single servers on DO or even on-Prem servers.


The thinking on efficiency vs scalability is largely influenced by the US tech company meta where founders give up equity so they don't have to worry about profitability for a very long time.

In that case, it is preferred to burn piles of cash on AWS instead of potentially needing to sacrifice revenue because you can't scale quickly enough.

An architecture that is not scalable is considered a failure whereas one that is complex and inefficient is much more tolerated (as long as it can scale out) ... at least until the funding dries up or Wall Street activists get involved.


Very many successful applications can indeed run on a single DB server (modulo redundancy in case of failures). Vertical scaling isn't trendy, but it is effective, until it's not.

I have yet to encounter a real situation where it suddenly became impossible to run a production DB on a single, high-spec server, without knowing far enough in advance to plan a careful migration to a horizontally scaled system if and when it was necessary.


Speaking as someone with 20 years in the industry, what you say is correct. Most applications would be fine on a single server and a classic LAMP stack. But that ain't cool these days.


Kubernetes is for when you need to allocate CPU like you allocate RAM, and you don't want to be tied to a higher level API sold by a vendor.


Like sched_setaffinity didn't exist and cgroups can't be used outside some container env?


Service and cluster autoscaling. Automatically allocating and provisioning new nodes on compute demand, and releasing them when done.

It can be done without k8s, but without something similar (e.g. Mesos) you're coding to cloud vendor APIs. K8s is like a cloud operating system, it gives you portable APIs for allocating compute and other cloud resources.


qemu/kvm and libvirt as an example api to address your resource idiom + a load balancer/rev proxy API. You are missing my point afaict.


Kubernetes isn't only about scaling. The repeatability of deployment process is a great asset to have as well.


But there are much simpler ways than K8s to achieve automated/repeatable deployments, if that is your goal.


Can you please name a few?


As I'm not a K8s user, take this with a grain of salt.

It heavily depends on your exact situation. Keeping your source code in git and letting it build on a CI server was almost always enough for me. If your build server runs Windows this is usually just fine.

C and C++ compilers on Linux have this IMHO very unpleasant property that the operating system wants to manage your development environment, so that the build output is a function not only of the repo, but also of things not under your control. I have little experience with that (I'm mostly a desktop developer, and these desktop applications are for Windows), but so far simply not installing any -dev packages seems to have eliminated that problem. Put the library source code in the repo, use a git submodule, use a per-repo package manager like nuget or cargo, whatever; just make sure your build input is in the repo and only there. For other programming languages this is generally not an issue. Thankfully no Linux distribution I'm aware of has tried to sneak in their version of log4net, so far. The same seems true for every other language I'm aware of, so this is a non-issue for any language other than C and C++.

Automated tests can run in a chroot to make sure all runtime requirements are contained in your build artefact. On Windows, running them in a clean VM might be a good idea, but keeping the machine clean seems enough. Don't give people the password to that machine, so they are not tempted to "fix" problems by installing software there, fiddling with system settings, or creating "magic" files, but instead by changing the code in the repository.

I used repository singular, but this works for both monorepo and polyrepo.


Ones that allow for rollback/rolling, seamless or canary transitions, and that encapsulate the entirety of a mature release cycle into a declarative operational rubric? None that I am aware of outside of k8s competitors.

However I think this value can be oversold and it is a simplification of processes that were used before k8s to achieve the same end.

Say that I have v1 ready for release and want 20% of my users to hit those endpoints in prod. Simply done in k8s. Also simply done with rev proxying and automated configs in nginx + custom app provisioning _without_ k8s in announced maintenance. None of these ideas lives in isolation.


Micro-service architectures are an absurd overcomplexity for smaller internal apps. They only really make sense when a monolithic system becomes too large for a single team to manage, at which point the micro-service boundaries reflect team boundaries.


The author says this in the first paragraph.


Kubernetes is a container system, mostly orthogonal to 3- or 4-tier application design.


And typically a single DO droplet would suffice for a toy project or POC, for which Ansible is probably the more expedient option. But maybe they're not in a rush, and learning K8s is just another feather in their cap.


Ah, but is it? I'm not sure. One difference between these platforms is that they're very optimized for the "typical app deployment" process, whereas Ansible and such are more generic.

Just getting a simple deployment script in Ansible requires programming a bunch of steps - copying archive, verifying it, unpacking it, then atomically configuring the system to use the new version, finally cleaning it up - whereas with k8s you just let it know you want to Deploy something and it does.

Of course, that's only true if you already know k8s, but then again if you work with both toy and larger projects, chances are you'll have to, sooner or later.


I'm assuming that it's someone who is unfamiliar with k8s or Ansible. You can quickly cobble together something solid from geerlingguy. It might need a day or two to hash up some basic yml - not long. Or... you could just ssh on your server and configure it manually, and cross your fingers - even quicker.

Even if you start with k8s and you want to scale, at some point you'll want to configure your images in a predictable way with something like Ansible. Whether you do that for the images you deploy, or your configure it in situ for a single-node, it's still the same complexity cost.


The assertion that PostgreSQL can handle dozens of TB of data needs to be qualified, as this is definitely not the case in some surprising and unexpected cases that are rarely talked about.

PostgreSQL's statistics collection, which is used by the query planner, doesn't scale with storage size. For some ordinary data distributions at scale, the statistical model won't reflect any kind of reality and therefore can produce pathological query plans. It is quite difficult to get around this scaling behavior. Consequently, I've moved away from using PostgreSQL for data models with distributions such that these query planner limitations will occur. These pathologies occur well before the "dozens of TB" range.

The statistics collector is not a scalable design generally, but that is another matter. In its current state it does not degrade gracefully with scale.


This seems to be unfounded criticism. When statistics are gathered, PostgreSQL samples a certain percentage of the table, so that obviously scales. The number of "most common values" and histogram buckets scales up to 10000, which should be good even for large tables. While I'll readily admit that not all aspects of cross-column dependencies are dealt with, and cross-table distributions are not considered, that has nothing to do with size. I guess you hit a problem somewhere, couldn't solve it and jumped to unwarranted conclusions.


It is ironic that you accuse me of "unwarranted conclusions". I've been customizing and modifying PostgreSQL internals for almost two decades, I know how to read the source. You aren't as familiar with PostgreSQL as you think you are.

This wasn't my problem, I was asked by a well-known company with many large PG installations and enterprise support contracts to look at the issue because no one else could figure it out. The limitations of the statistics collector are not only evident in the source but they are documented there. There are also deep architectural reasons why you can't trivially modify the statistics collector -- I looked into this option -- to work for larger tables without introducing other serious issues.

If you have a uniform distribution of values, the statistics collector will work fine. In the above case, the values followed a power law distribution which created extreme biases in the statistical model due to necessary restrictions on sampling. Other distributions can have similar effects once you exceed the ability of the statistics collector to acquire a representative sample.


> I know how to read the source. You aren't as familiar with PostgreSQL as you think you are.

Oh, maybe you have read some of Laurenz Albe's many contributions to Postgres, then. https://git.postgresql.org/gitweb/?p=postgresql.git&a=search...


murdered by words


Nice deepfake. It's so surprising how far the field has gone. I remember seeing the fake Nixon moon speech and being blown away.


Naive question: wouldn't sampling a power law dataset be straightforward? The idea is there are only a few outlier values, and the rest are uncommon. This distribution seems extremely common, e.g. a column with mostly NULL values and the rest somewhat unique non-null strings.

I'm curious what data you saw and why the sampling didn't work?


I'm curious whether he will answer, given his vested interest in an "exabyte-scale analytical database" :-)

And laurenz_albe has a vested interest in PG as he's a contributor.

So I, too, would love more concrete details.


I think the problem is that PG does not know the underlying data distribution and uses one sampling strategy, which works for normally distributed data.


Vacuum is a way bigger problem: if you have a few million updates per day on a 5TB table you are going to have a hard time keeping index and page bloat down.

Sure, if your tables are insert only you might be fine but doing any kind of DDL or maintenance (analyze after version upgrades) is going to ruin your day.


Even when using as many stat samples as possible (3,000,000 rows, i.e. 300 times the maximum statistics target) for each of those tables: ALTER TABLE table_name ALTER column_name SET STATISTICS 10000; (see https://www.postgresql.org/docs/current/sql-altertable.html)

AFAIK PostgreSQL's approach is based upon http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.53....


You can change the statistics target, see https://www.postgresql.org/docs/current/runtime-config-query...

You can also create more advanced statistics over multiple columns: https://www.postgresql.org/docs/current/planner-stats.html

But if your statistics are bad, it will certainly mess up some of your query plans.
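
For reference, the two knobs look roughly like this (a generic sketch with made-up table/column names, not anyone's actual schema):

    -- raise the per-column statistics target (default 100, max 10000)
    ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 5000;

    -- extended statistics for columns whose values are correlated
    CREATE STATISTICS orders_cust_region_stats (dependencies, ndistinct)
        ON customer_id, region_id FROM orders;

    ANALYZE orders;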


Yes, everyone knows these things, it does not address the issue. Due to its architecture, there are difficult to change internal limits on how PostgreSQL samples the data.

Under some conditions, it is not possible for the sampling mechanics to build a representative model -- the actual statistics used for query planning will be quasi-random. One of the ways this manifests is that every time you rebuild the statistics you get a completely different statistical model even if the data has barely changed.


I have seen many PostgreSQL benchmarks having solid performance with TB data but my real world experience is the complete opposite.

Here are some of the main issues that I have encountered so far:

1. Queries on large tables (around 10 GB) are slow even when "index only scan" is used because of MVCC and the way postgreSQL manages concurrency.

2. Hot-standby instances can't be used for anything serious since all queries are dropped regularly (I believe it's not safe to use "hot_standby_feedback" config to overcome this issue).

3. It is not possible to have tables with heavy "update" workloads (because of simultaneous autovacuum execution).

I would be very happy if anyone could show me that I am wrong.


1. You'll have to define "slow" - I have a 3TB table where an index only scan takes under 1ms

2. hot_standby_feedback is absolutely safe. I've got 5 hot standbys in prod with that flag enabled

3. Again, it depends on how "heavy" your update throughput is. It is definitely tough to find the right balance to configure autovacuum between "so slow that it can't keep up" and "so fast that it eats up all your I/O"
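
For what it's worth, that balance can also be tuned per table rather than globally; a hedged sketch with a made-up table name (the thresholds depend entirely on your workload):

    ALTER TABLE events SET (
        autovacuum_vacuum_scale_factor = 0.01,  -- vacuum after ~1% dead tuples
        autovacuum_vacuum_cost_delay   = 1,     -- ms; lower = more aggressive
        autovacuum_vacuum_cost_limit   = 2000   -- higher = more I/O per round
    );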


> 2. Hot-standby instances can't be used for anything serious since all queries are dropped regularly (I believe it's not safe to use "hot_standby_feedback" config to overcome this issue).

Hot-standby instances are fine. We've been using these for GitLab.com for a number of years now, and never have they caused any issues. We do have some code in place that retries queries when they are cancelled by the hot-standby, but IIRC this doesn't happen often.


1 mostly applies to update heavy tables since index only scans use the visibility map, which would be frequently invalidated.

3 is definitely true, especially the larger the table. I've had success splitting frequently updated columns out into their own much smaller table, or any other trick to concentrate updates into a small table. Also MVCC bookkeeping requires updating pages, so an UPDATE that doesn't change any field and SELECT FOR UPDATE will cause the same problem.
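
Roughly what that split looks like (invented names, not a real schema): keep the frequently updated counters in a narrow table so the MVCC churn stays off the wide rows.

    CREATE TABLE item (
        id          bigint PRIMARY KEY,
        title       text,
        description text
        -- ...many rarely-changing columns
    );

    CREATE TABLE item_stats (
        item_id    bigint PRIMARY KEY REFERENCES item (id),
        view_count bigint NOT NULL DEFAULT 0
    );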


> PostgreSQL's statistics collection, which is used by the query planner, doesn't scale with storage size.

Out of curiosity, do statistics collectors for other offerings (e.g. MySQL, SQL Server, Oracle) scale with storage size?


Theoretically your argument would make sense, but practically your data distribution should be uniform enough for the statistics collector to work for terabytes and terabytes.

I've seen query plans get messed up with partial indexes, but if you have a regular index, you needn't worry.


This is intriguing; could you give an example of a distribution Postgres struggles with at scale, and an application that produces such a distribution?


We’ve recently been surprised by wrong estimates made by the planner resulting in inefficient query plans.

It seems that power law distributions over multiple tables aren’t handled very well: https://stackoverflow.com/questions/65861780/is-there-a-way-...


Neither of you have given a reproducible example one way or the other! I could share that I know of TB of core analytics data running postgresql and it does alright. But it would be nicest if there were a more tangible study to talk about.


Does statistics collection scale better with partitioned tables? TFA mentions that partitioning helps with vacuum, and AIUI vacuum workers also collect statistics.


Is it easy to disable it? Personally I would not want any unpredictable behavior from query planner anyway, especially at scale.


You would get predictably bad behavior and performance without the statistics collected.


Predictably bad won't result in a sudden spike in IO usage that brings down a production system. This is why Oracle has SQL profiles.


Then use something like pg_hint_plan: https://pghintplan.osdn.jp/pg_hint_plan.html (never used it myself)

I've been bitten by bad optimizer choices which took way too long to figure out how to debug / fix, so I know the pain. I do hope the PG optimizer continues improving. It's one of the weakest areas in comparison to commercial DB's IMO.
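
Going by its documentation (again, I haven't run it myself), the hints are written as a special comment at the head of the query, something like this with invented table/index names:

    /*+ IndexScan(o idx_orders_created_at) */
    SELECT count(*)
    FROM orders o
    WHERE created_at > now() - interval '1 day';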


This sounds really in-depth and useful. Are there more in-depth articles about how to avoid these types of queries?


The difficulty with this advice is that it assumes that you have many small transactions.

Yes, of course, you should try to build your application so queries and transactions are very short. That solves a great many problems.

But sometimes you can't. Sometimes you just have to do joins across large tables. There just isn't any other way. Your query is going to run for 5, 10, maybe 30 seconds. That's a huge burden on your server, and it will slow down the other tenants. In that case, the only answer is to distribute the queries across other boxes.

I agree with the advice in general -- delay moving to a distributed system as long as possible -- but sometimes you just can't.


Point #1, there is a world of difference between a reporting database and a transactional database. If you need a reporting database, ship logs and set up a reporting database separate from your transactional one. That solves most of the problem.

Point #2, the fact that you've hit performance problems does not mean that you need to distribute. Every real system that I've seen has had order of magnitude performance improvements left when it first looked like it was topping out. Add a caching layer, have better indexes, more targeted queries, appropriate use of window functions, etc. Give those a try before biting the bullet on distributed data.

Point #3, here is a good rule of thumb for distributed systems. Whatever scale you hit on a single machine, you can probably gain 1-2 orders of magnitude of performance by switching to a faster language and carefully optimizing. If you switch to distributed, you'll LOSE at least an order of magnitude performance due to the overhead of RPCs, but are able to scale indefinitely afterwards.

If you're distributing for reliability, great. But if you're distributing for performance and you have less than 20 machines in your system, either your problem is embarrassingly parallel or you likely aren't achieving a net win.

I've seen a lot of people prematurely distribute, run into performance challenges, solve them, then pat themselves on the back for being smart enough to have distributed their code. While failing to recognize that they were addressing a self-inflicted injury.


#2, so much #2. And also #3.

One of my earliest and still one of my favorite career achievements was optimizing a stored procedure in Oracle. Our application called the procedure and depending on the result dropped a character from the input and called the procedure again.

This was the highest load on our database. We were negotiating a new license to scale up the Oracle database. Estimates were in the six figure range.

I simply reproduced that modification/retry logic inside the procedure. The application always got the desired result on the first call over the network. No application side code change needed at all.

IIRC my optimization reduced load by ~an order of magnitude. We no longer needed the additional Oracle capacity. Saved a multiple of my salary in an afternoon, just looking for a hot spot. I think my boss bought me a sandwich. It was a good day.


What if you ingest 10s of millions of rows a day, and need to do conditional updates based on those 10s of millions of rows?

Either you're going to do 10s of millions of fetches and a whole lot of code, or you can push some of the work to the database and perform updates that involve joins. Those queries may take minutes to complete; but the SQL will be pretty short.
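
Something along these lines (a hedged sketch with made-up tables), where the day's ingested rows land in a staging table and are applied set-based:

    UPDATE accounts a
    SET    balance    = a.balance + s.delta,
           updated_at = now()
    FROM   staging_ledger s
    WHERE  s.account_id = a.id;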

Big queries that fetch millions of rows aren't solely the preserve of reporting. Some applications - business SaaS in particular - have a different user:row-count distribution to consumer businesses; instead of a handful of rows for every consumer, with each consumer doing something simple, a business SaaS may have tens of users who each individually manipulate millions of rows every day.


> What if you ingest 10s of millions of rows a day, and need to do conditional updates based on those 10s of millions of rows?

Know your problem, then set things up appropriately.

I was addressing someone whose problem looked like a reporting query about a consumer application that was affecting the responsiveness of their consumer application. For that case separate the transactional and reporting database and tune each appropriately. That means, for example, different numbers of connections, different memory per connection, different amounts of temporary table space, and so on.

The problem that you're describing is much like the one I'm currently facing, ingesting time series from a busy factory floor, then doing various kinds of analytics. For that case, you can simply use one database, optimize reasonably, tune appropriately for the workload, and set appropriate expectations on responsiveness.

In all these cases you don't need a distributed system. Not unless your requirements are a lot harsher than what has been described so far.


My relational experience is largely in the analytics world. Can you give me a (simplified/obfuscated) example of this type of workload? Why do the rows need to be updated together or so frequently? My imagination is failing me but this sounds like an interesting problem.


Most recently I was chief architect at a London-based fintech SaaS company that does reconciliations. Recs, as they're known, are in practice a big diff, or join (in the relational sense), between two sets of tabular data.

The clients are financial organizations, and the records are things like trades, positions, transactions of various kinds. Larger organizations perform millions of these a day, and for regulatory and risk reasons need to have controls in place such that they are accounted for in different systems and with different counterparties (trade repositories, brokers and the like).

So the system looks like this: it takes in lots of records for one side, lots of records for the other side, and compares the two. This produces 4 classes of output: fully matched data; partially matched data (some fields in the "join", like ISIN, are matches, but others are different - e.g. the price or quantity or fee is incorrect); and unmatched data on the left and right sides.

All this data needs to be written into a database and fetchable and queryable and sortable on demand by end users, by any field in the data, or by metadata associated with the records. This in itself is a challenge, since you can't index everything and you especially can't index across joins.

But there's more. Unmatched data may get matched by subsequently submitted data, and partially matched data may be updated by refreshed submitted data. Workflow metadata needs to be associated with outstanding items (unmatched, partially matched), and this workflow information has further details, like labels, comments, assigned users and groups, which in turn may be updated manually or automatically, with configurable rule-based automatic workflow classification and assignment.

Millions of records a day added, across lots of customers, adding up to billions of rows a day (multiple rows per record, naturally). Maybe 5% of inserted records will get meaningfully updated later, but 5% of a big number is still pretty big.

Ideally recs would have 100% match rate and not need any workflow metadata because there would be no problems to assign to anyone, nor label or comment on. But match quality (often a function of customer maturity and sophistication) is variable, and more importantly, input batch skew is very common. That is, maybe an org gets a batch every evening from its counterparty, but submits an up-to-the-minute snapshot of its own data. If the skew is bad, every rec might have 30% unmatched data which will mostly be fully matched after the next batch comes in the next day. This overhang (called 'carry over' internally) needs to be managed; if it gets too stale (timestamps, expiry), if it doesn't in fact get matched off, if it gets refreshed by new data, etc.

Oh, and you can't stick everything denormalized into a document database and build nice fat indexes, because the bigger customers are hyper-cautious and don't want data intermingling, and running separate redundant services for every customer is too expensive.

It is an interesting problem - or at least it was, but it's not my problem any more!


Thanks for taking the time to explain that. Food for thought.


MySQL was the chosen store, chosen primarily for its (async) replication story which was better than Postgres back in 2012. Though I'm not sure it wouldn't still be a better choice today with the update load, I've had problems in personal toy (~20 million row) projects with slow bulk update speed in Postgres.

The last thing I did before I left was build out a mechanism for mirroring a denormalized copy of recent data in ClickHouse, partitioned by time (for matched) and type (for partials / unmatched). CH, being analytics oriented, works much better for the ad-hoc end user queries, which filter and sort on a handful of arbitrary columns - ideal for a columnar store. Interactive updates can be handled by CH's versioned merge tree table engine, and batch updates by rewriting the relevant partition from the source of truth in MySQL.

I chose CH primarily because it scaled down well for a columnar store - much better than anything in the Hadoop space. The conservatism of the market meant you couldn't just throw random cloud provider data mechanisms at it, nor a whole lot of fancy big stuff.


My thoughts exactly on point #1. Nothing in a hot path should take multiple seconds.


The hot path may be a SET inside an UPDATE which uses a join which needs to touch millions of rows.

You can break that query apart and run it in application logic: do some fetches, do application-side joins, do lots of little updates. Or you can write a single piece of SQL. The former is a whole lot more code and will run slower but individually each item will be fast. The latter is a lot less code and runs faster overall, but the single individual SQL statement will be slow.

No simple rules. It depends on the application.

(Yes, there are middle ways. Break up the giant UPDATE using some kind of batching strategy. Long-running update-heavy transactions aren't healthy, particularly for Postgres.)
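
One common batching shape, sketched with invented names (run repeatedly until it touches zero rows):

    WITH batch AS (
        SELECT id
        FROM   accounts
        WHERE  needs_reprice
        ORDER  BY id
        LIMIT  10000
        FOR UPDATE SKIP LOCKED
    )
    UPDATE accounts a
    SET    price = price * 1.1,
           needs_reprice = false
    FROM   batch
    WHERE  a.id = batch.id;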


Every system has ways that it can fall down hard.

Here is a fun one for Postgres. Modify your query to be using a stored procedure that creates/drops temporary tables. Watch your database fall over from needing to VACUUM system tables.

(This was not a hypothetical disaster. It was the result of trying to use a third-party ETL tool that had been designed for Oracle and didn't understand how temporary tables differ on Postgres.)


VACUUM system tables? Is that a thing? At $dayjob I use Redshift quite a bit and as far as I know there's never a VACUUM operation on any system tables but maybe I'm not looking closely. Are system tables even "real" tables?


Yes, system tables are just tables. And every time you create/drop a temporary table, you've inserted/deleted a bunch of rows in https://www.postgresql.org/docs/12/catalog-pg-attribute.html. If you do so faster than it can vacuum, you're in for a world of hurt.
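
You can watch it happen in the statistics views; a rough check (assuming the default statistics collection is enabled):

    SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
    FROM   pg_stat_sys_tables
    WHERE  relname IN ('pg_attribute', 'pg_class', 'pg_depend')
    ORDER  BY n_dead_tup DESC;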


Huh. For a long time there was no autovacuum in Redshift and AFAIK this was never a problem for us. I wrote our VACUUM script and it would never pick up system tables. Maybe Redshift does something differently.


Yes, that is a thing in Postgres. They are real tables and use the same underlying storage management as the rest of the database.


AFAIK most of the Postgres bits of Redshift have been rewritten except for the SQL parser and client protocol.


I've seen queries running for 1-2 minutes raising user complaints. Then we looked at it, and with a few changes in indexes and query hints brought it down to sub-second execution.

Before thinking about distributed systems, there is an entire database optimization toolkit to make use of: primary key review, secondary index creation, profiling, view or stored procedure creation, temporary tables, memory tables and so on.
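
The usual starting point for that kind of review, sketched with made-up names:

    -- see where the time actually goes
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT *
    FROM   orders
    WHERE  customer_id = 42
    ORDER  BY created_at DESC
    LIMIT  20;

    -- if the plan shows a large sequential scan plus a sort, an index like
    -- this often turns it into a sub-millisecond lookup
    CREATE INDEX ON orders (customer_id, created_at DESC);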


Indexes are not free, they take up space and they make mutations more costly. Also, building the index may not even be possible while your application is running, because postgresql and other RDBMS have inadequate facilities for throttling index construction such that it doesn't harm the online workload. You might have to build indexes at midnight on Sundays, or even take your whole system offline. It can be a nightmare.

This isn't just a problem for SQL databases. Terrible-but-popular NoSQL systems like MongoDB also rely heavily on indexes while providing zero or few safety features that will prevent the index build from wrecking the online workload.

I personally prefer databases that simply do not have indexes, like bigtable, because they require more forethought from data and application architects, leading to fundamentally better systems.


I've seen https://www.postgresql.org/docs/12/sql-createindex.html#SQL-... work well in practice on busy transactional databases. I'd be interested in knowing about cases where it doesn't work well.
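
For anyone who hasn't tried it, it's just (made-up names):

    CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders (customer_id);

    -- if the build fails it leaves an INVALID index behind, which has to be
    -- dropped and retried:
    -- DROP INDEX CONCURRENTLY idx_orders_customer_id;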

My experience on systems without indexes differs strongly from yours. Yes, they can work well. But if you have multiple use cases for how your data is being queried, they push you into keeping multiple copies of your data. And then it is very, very easy to lose consistency. And yes, I know about "eventual consistency" and all that - I've found that in practice it is a nightmare of special cases that winds up nowhere good.


Just from personal experience, if they can build gmail on top of a database (bigtable) having neither indexes nor consistency, then probably it will also be suitable for the purposes of my far smaller, much less demanding products.

On the other hand I've seen, and am currently suffering through, products that have desperate performance problems with trivial amounts (tens of GB) of data in relational databases with indexes aplenty.


That's not exactly true, bigtable still has some indexes. You're just restricted to one per table and only on the row key. Instead you spend time finding the right row keys for each value that you're going to want to look up. Picking bad keys will still lead you to have terrible performance, since you'll be looking up the wrong thing to get what you actually want.


This. Using any database requires building up an expertise and understanding how to use its capabilities properly. If you hit a wall with a non-distributed database and your solution is to replace it with a distributed one - you will have a bad time. The surface area of what you need to know to use it properly still includes your basic data modeling, indices, troubleshooting that was required before, with a whole lot of networking, consensus protocols and consistency models to worry about (just to name a few).


> Then we looked at it, and with a few changes in indexes and query hints brought it down to sub-second execution.

This is exactly what the parent comment said: "you should try to build your application so queries and transactions are very short".

If you're claiming that the parent is incorrect about "sometimes, the only answer is to distribute the queries across other boxes", my guess is that you probably don't work at a scale where you've learned that query optimization can't solve every database performance problem.


Based on my past experiences, I'd be happy to take an even money bet that more than 95% of organizations that go distributed for performance reasons actually caused themselves more problems than they solved.

This does NOT mean that there are no use cases for distributed - I've seen Google's internals and it would be impossible to do that any other way. But it does mean that when someone is telling you with a straight face that they needed horizontal scalability for performance, you should assume that they were probably wrong. (Though probably saying that is not the wisest thing - particularly if the person you're talking to is the architect whose beautiful diagrams you'd be criticizing.)

So yes, there are problems that require a distributed architecture for performance problems. That doesn't contradict the point that every other option should be explored first.


> Based on my past experiences, I'd be happy to take an even money bet that more than 95% of organizations that go distributed for performance reasons actually caused themselves more problems than they solved.

Oh, I'm absolutely positive you're right! And as you know, even in distributed architectures there's a huge range of solutions from "let's do these n reasonably simple things" to "let's just rebuild our system to solve every conceivable future problem". I don't know of a scenario where the latter has ever worked.


I'm referring to parent's point "Sometimes you just have to do joins across large tables.". Indexes and query hints can make a world of difference once you profile how the query is being executed.

Another option is to create a stored procedure which first extracts a smaller amount of data from first table to a temporary table and then join it to the other large table.

Measuring and testing different optimization methods takes a lot of effort and time. We should only do that when faced with a real problem of slowness.

"The first rule of optimization is: Don't do it. The second rule of optimization (for experts only) is: Don't do it yet. Measure twice, optimize once."


> Sometimes you just have to do joins across large tables. There just isn't any other way. Your query is going to run for 5, 10, maybe 30 seconds.

While that is true, I would speculate (based on my own experience, not any actual research) that it is far more common that users have an unoptimized query or database schema (perhaps it just needs an index to be created) which is taking tens of seconds to run but doesn't NEED to, than it is that users have a need for a complex query that cannot run faster than that.

So for MOST users, the best advice is to learn how to use analyze query and other database tools to optimize execution. Only if you ALREADY know that should you be considering moving to a read replica or some sort of distributed database.


You could always start with PostgreSQL and use the transaction log to materialize views for particular use cases (i.e. differential dataflow). This post is clickbaity in that it doesn't give any administration advice, but it's right in the broad sense that Postgres/MySQL are still the best low-latency single sources of truth available.


Postgres has materialized views for the case of expensive queries. Instead of doing a view query in real-time it physically caches the results and has the same accessibility as a regular view.

They don't solve all situations with expensive queries but they help a lot. The fact they behave like dynamic views means you can migrate a dynamic view to materialized if you run into performance issues without making changes to the client application. With views in general you can shuffle around tables and relationships without the client application knowing or caring.
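
A minimal sketch with invented names; the unique index is what lets you refresh without blocking readers:

    CREATE MATERIALIZED VIEW daily_sales AS
    SELECT date_trunc('day', created_at) AS day, sum(amount) AS total
    FROM   orders
    GROUP  BY 1;

    CREATE UNIQUE INDEX ON daily_sales (day);

    -- re-run periodically; CONCURRENTLY keeps SELECTs on the view unblocked
    REFRESH MATERIALIZED VIEW CONCURRENTLY daily_sales;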


Pretty solid advice, nice article.

One thing:

> For example, on my (pretty average) workstation, I can do ca. 25k simple read transactions per 1 CPU core on an “in memory” pgbench dataset…with the default config for Postgres v13!

Forget about reaching those numbers on managed DB-as-a-service instances, specifically Azure managed postgres. In my experience these have comparatively poor peak performance, with high variability in latency to boot.

Bare metal all the way, if you can spare the allocation of a dedicated DBA.

Also... if rapid reads and steady performance is what you're after, provision a replica for the occasional analytical workloads.


Good info; I came here looking for experience with managed databases. Does anyone have experience with managed instances on other platforms?


Postponing sharding and complex replication for as long as possible is very smart with any database (mysql, postgres, mongodb). These setups can be very fragile or fail in unexpected or unusual ways, and most importantly they can take much longer to fix. E.g. 18 hours instead of 2 hours of downtime.


Once your data grows very large, a successfully-implemented sharding solution actually improves availability, rather than reducing it.

With a huge monolithic database, a failure causes downtime for your entire product/company. Replica cloning and backups are slow. Major version upgrades are stressful because it's all-or-nothing.

With a sharded environment, a single shard failure only impacts a portion of your userbase, and smaller databases are faster to perform operational actions on.

There are definitely major downsides to sharding, but they tend to be more on the application side in my experience.


Shards increase the number of failure modes and increase the complexity of those failure modes. For most businesses, the recommendation holds true... keep it simple, don't shard until you need.

I find it somewhat concerning that MongoDB has a better architecture for upgrades than Postgres. You add a replica to the cluster running the new major version and then switch that replica over as your primary once you've replaced enough instances in the cluster. Having worked with Oracle and MySQL for years and then switched to a company built on MongoDB, I had forgotten how stressful upgrades can be with such archaic limitations.


> Shards increase the number of failure modes and increase the complexity of those failure modes.

I would only agree with this during the initial implementation of sharding. Once deployed and stable, I have not found this to be the case, at all.

I say this as someone who has directly architected a sharded database layer that scaled to over a trillion rows, and later worked on core automation and operations for sharded databases that scaled to an incalculable number of rows (easily over 1 quadrillion).

In both cases, each company's non-sharded databases were FAR more operationally problematic than the sharded ones. The sharded database tiers behave in common ways with relatively uniform workloads, and the non-sharded databases were each special snowflakes using different obscure features of the database.

> keep it simple, don't shard until you need

I would have wholeheartedly agreed with this until a few years ago. Cloud storage now permits many companies to run monster 10+ TB monolithic relational databases. Technically these companies no longer "need" to shard, possibly ever. But at these data sizes, many operations become extremely painful, problematic, and slow.


> In both cases, each company's non-sharded databases were FAR more operationally problematic than the sharded ones. The sharded database tiers behave in common ways with relatively uniform workloads, and the non-sharded databases were each special snowflakes using different obscure features of the database.

That's because sharded tables restrict what features you can use (e.g., no JOINs). If you constrained the features on the non-sharded databases, you'd achieve the same net result.

Sharding _necessarily_ only solves one problem: queries operate against only a subset of data. While what you're saying is true (sharding avoids certain problems) it also restricts your ability to perform other operations (more complicated queries or reports are ~impossible). It is not without its tradeoffs.


> no JOINs

Sharded environments still have some JOINs. Typically all data for a single user/customer/whatever should be placed on a single shard, and it's still very useful to join across tables for a single user.
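
For example (hypothetical schema; the application's routing layer picks the shard from customer_id before the query is issued):

    -- All rows for this customer live on the same shard,
    -- so an ordinary local join still works
    SELECT o.id, o.created_at, sum(i.amount) AS total
    FROM orders o
    JOIN order_items i ON i.order_id = o.id
    WHERE o.customer_id = 42
    GROUP BY o.id, o.created_at;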

> If you constrained the features on the non-sharded databases, you'd achieve the same net result.

No, that's not the main reason for operational benefits. Rather, it's simply because the shards all have a uniform schema, uniform query workload, and relatively small data size (as compared to a large monolithic DB). You can perform operational maintenance at better granularity -- i.e. on a single shard rather than the entire logical data set. And if a complex operation is fine on one shard, it's very likely fine on the rest, due to the uniformity.

For example, performing a DBMS major version upgrade on a large monolithic DB is a nightmare. It's an all-or-nothing affair. If you encounter some unforeseen problem with the new DB version only in prod, you can expect some significant application-wide downtime. Meanwhile for a sharded environment it's both easier and safer from an operational perspective, assuming your team is comfortable automating the process once it has been proven safe. You can upgrade one shard and confirm no problems after a week in prod before proceeding to the rest of the fleet. If unforeseen problems do occur, worst-case it only impacts a tiny portion of users.

> it also restricts your ability to perform other operations (more complicated queries or reports are ~impossible). It is not without its tradeoffs.

Yes, this is why I said above "There are definitely major downsides to sharding, but they tend to be more on the application side in my experience." OP claimed the downsides were operational (e.g. more complex failures or larger downtime), which I do not agree with.


One of those companies is facebook right? I think this blog post provides a much more balanced view: http://yoshinorimatsunobu.blogspot.com/2017/11/towards-bigge...


Yes. Yoshinori is among the best of the best, and I could not hope to capture that level of nuance in a quick HN comment :)

But in terms of being a more balanced view, my read of his post aligns pretty closely with my main point in this subthread: the disadvantages of sharding fall more on the application side (limitations on joins, secondary indexes, transactions) than on the operational side (availability, performance and resource management, logical backup, db cloning, replication).


Wouldn’t sharding generally result in overlap between the key ranges so that a database shard going down doesn’t have to result in any downtime?

Then your issue is replication of writes I suppose. Probably depends on use case what configuration you choose.


With a typical sharded relational database setup, each sharding key value maps to exactly one shard.

There should be replicas of that shard, which can be promoted in case of a master failure. But in rare cases all replicas may also be degraded or inconsistent and therefore non-promotable. When this happens to a giant monolithic non-sharded database, the impact is far more catastrophic than when it happens to a single smaller shard.

In any case, replication is a separate concern from sharding. Each sharded database replica set (traditionally 1 master + N replicas) has a data set that is unique / independent of other sharded database replica sets.

That said, some of the "NewSQL" distributed databases may arrange their data in more complex ways. [Edit to clarify: I mean in respect to the combination of both sharding and replication. With a traditional sharded relational database, you have replica sets where each replica contains the exact same shard or shards; with a NewSQL distributed DB, data may be replicated in a more complex arrangement]


Most, if not all, commercial newsql distributed databases do range based splitting of the data, with each range managed by a raft group. Raft groups get migrated between nodes to avoid hot spots, among other scheduling criteria. TiDB does this for sure, I'd be surprised if CockroachDB (and yugabyte and dgraph) doesn't do it.


Totally depends on how you shard. In my case (b2b), i'd be sharding by tenant. Having a single tenant go down would not have the same impact as every single tenant going down.


Honestly, I've never seen the case where that happens; data always falls into one shard according to the key. Then comes the concept of a shard replica, where the shard lives on several nodes to form redundancy.

However, I'm not sure how this is usually set up on Postgres.


One point I found very interesting was the following paragraph:

> For example, on my (pretty average) workstation, I can do ca. 25k simple read transactions per 1 CPU core on an “in memory” pgbench dataset…with the default config for Postgres v13!

In my own very unscientific experiments I never got values as high as that, but rather in the area of around 4k transactions per second total across multiple cores. Of course I'm comparing very different things: I was looking at more than just the database, and there are lots of other aspects I probably did differently.

I find this interesting as it probably means that my bottleneck wasn't entirely where I thought it was. I have to play around with that some time again.

If I wanted to find out the baseline Postgres performance on my hardware for trivial read queries like looking up individual rows by primary key with a database that fits entirely in RAM, how would I do that? I know there is pgbench, but that performs more complex queries as far as I understand.


As this is a public reference: GitLab's Postgres cluster handles peaks of 300K tps, with the master node alone supporting around 60K-80K. And this is not in-memory (data size on the order of 8 TB, RAM 600 GB).

https://about.gitlab.com/blog/2020/09/11/gitlab-pg-upgrade/

And there's still room for vertical scaling.

Disclaimer: we provide Postgres support for GitLab.


HN is missing a user-follow and a user-tagging button :-)


I wish HN published the user comments/subscriptions pages as RSS.

For example, I "follow" a few users on StackOverflow by adding the feed that SO makes available linked from their profile. Super useful.


If that's for me, I really appreciate that :)

There's obviously that option on Twitter (same username as here), if you want to follow there ;)


Did it already :-) Good idea putting twitter link on your profile


You can give pgbench script files to have it run the queries you want (see the -f flag)

https://www.postgresql.org/docs/current/pgbench.html
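
For the primary-key-lookup case asked about above, a custom script can be as small as the one below (pk_lookup.sql is a made-up file name). Run it with something like `pgbench -f pk_lookup.sql -s 100 -c 8 -j 8 -T 60 mydb` after initializing with `pgbench -i -s 100 mydb`:

    -- pk_lookup.sql: single-row lookup by primary key against the
    -- standard pgbench_accounts table (created by "pgbench -i").
    -- With custom scripts, pass the init-time scale via -s,
    -- since :scale otherwise defaults to 1.
    \set aid random(1, 100000 * :scale)
    SELECT abalance FROM pgbench_accounts WHERE aid = :aid;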


It's not surprising to hear such advice from PostgreSQL consultancy shop: don't bother and relax until the day you desperately need PostgreSQL consultancy! /s

Seriously though, the main point stands. PostgreSQL isn't designed with "distributed" in mind, so try vertical scaling tricks first.


The skepticism is warranted, but there's truth to the advice.

PostgreSQL (and traditional RDBMSes) aren't built with horizontal scaling in mind. That's both a blessing and a curse. The curse is obvious. The blessing is that it means that they will let you get much further with vertical scaling. Especially if you take the time to learn how to tune them.

Support for horizontal scaling comes at a cost. When I've done performance analyses, I regularly find that parallel and distributed implementations spend most of their wall-clock time on synchronization and communication, occasionally more than 90%. That creates a big up-front cost you need to overcome before you see a net benefit. It also leads to a sort of horizontal-scaling version of the rocket equation, where returns can diminish rapidly.


When it comes to replication & automatic failover in PG, what is the 2021 gold-standard setup?


Use built-in physical replication, it works and it's good.

Patroni is widely used for automatic failover, at least where you don't want the possibility of a split brain.
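
For reference, a hedged sketch of what the built-in pieces look like once a streaming replica is attached (in practice Patroni drives the promotion rather than a human):

    -- On the primary: list attached replicas and their lag
    SELECT application_name, state, sync_state, replay_lag
    FROM pg_stat_replication;

    -- On a replica: confirm it is in recovery, and (PG 12+) promote it
    SELECT pg_is_in_recovery();
    SELECT pg_promote();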


pg_auto_failover is good https://www.citusdata.com/blog/2019/05/30/introducing-pg-aut...

Disclosure: I work for Citus


In 2016 I was expecting that sane defaults like MySQL's would have arrived by 2018 or 2019.

Looks like 2021 isn't that much different from 2016. There is work being done, but it doesn't seem to be anywhere close to MySQL's level.


wow, are you kidding? MySQL replication is possibly the worst I've ever seen.

There is almost no consideration for the target being up to date: I have personally experienced missed inserts, replication lag, and replicas being left in read/write mode.

PG's integrated replication is far superior; it even has support for bootstrapping a new replica node without rsync and a command to `promote` the replica to the write master.


Replication issues in modern MySQL are caused by user error, not inherent bugs. For example if you're configuring replication manually, you must ensure that replicas are set to read_only. Clustering options introduced over the past few years do this automatically though.

> PGs integrated replication is far superior, it even has support for bootstrapping a new replica node without rsync and a command to `promote` the replica to write master

MySQL has this functionality too, e.g. https://dev.mysql.com/doc/refman/8.0/en/clone-plugin.html and https://dev.mysql.com/doc/mysql-shell/8.0/en/working-with-re...

To be clear, I am not claiming MySQL is superior to PG. However, I have found that many PG users are unfamiliar with the huge improvements introduced in MySQL over the past ~8 years.


What about Percona? I haven't used it, but a common sentiment on HN has been that Postgres doesn't have something quite on par with it.

Citus and, more recently, Patroni, seem to be the dominant Postgres analogues; have they caught up? Where do they dominate?


For anyone interested in general advice regarding scaling traditional SQL databases, check out my article [0] where I collected various techniques when I investigated this topic in my last workplace.

[0] https://stribny.name/blog/2020/07/scaling-relational-sql-dat...


Over-engineering has become a disease, and has reached pandemic-level proportions, primarily driven by resume-padders and clueless management folk.


I have experience with other relational DB products, but the principle should be similar. I have a few databases over 1 TB, but scaling to tens of TB would require more RAM than a typical one- or two-socket server can support. CPUs are not that problematic with the huge number of cores in AMD Epyc, but RAM is a serious limitation: in my world I need a RAM-to-database-size ratio of between 15% and 60%, depending on whether the workload is transactional, reporting, or somewhere in between. Taking a 50 TB database as an example, it is too much for a 4 TB RAM, 2-socket Epyc system.


I'm building an app using Postgres for the first time. Naturally I was a bit worried about performance and scaling if the not-yet-launched app becomes a major success.

I began simulating a heavy use scenario. 100k users creating 10 records daily for three years straight.

100000 x 10 x 365 x 3 ~= 1 billion rows, or about 200 GB at a record length of 200 bytes. This is peanuts for modern databases and hardware.

Seems like a single node can support it for a long way before I have to worry about performance...


You classifying 11 writes per second as "heavy use" reminds me of how people on average completely underestimate how fast computers actually are (when they're not bogged down by crappy programs).


I don't believe the grandparent's simulation actually ran for three years; more likely, they tested operations against a data set of that size.

Still, your main point stands. Around 2001 I wrote a C-program to record every file and size on a large hard disk. We were all amazed that it finished (seemingly) before the enter key had come back up. Must be a bug somewhere, right? Nope.

Much earlier I wrote a Pascal program on a 486 in school that did some calculations over and over again, writing the output to the screen. It blew my mind then how fast the computer could do it.


>> A single PostgreSQL instance can easily do hundreds of thousands of transactions per second

For example, on my (pretty average) workstation, I can do ca. 25k simple read transactions per 1 CPU core on an “in memory” pgbench dataset…with the default config for Postgres v13! With some tuning (by the way, tuning reads is much harder in Postgres than tuning writes!) I was able to increase it to ~32k TPS per core, meaning: a top-notch, dedicated hardware server can do about 1 million short reads! With reads, you can also usually employ replicas – so multiply that by 10 if needed! You then need to somehow solve the query routing problem, but there are tools for that. In some cases, the new standard LibPQ connection string syntax (target_session_attrs) can be used – with some shuffling. By the way, Postgres doesn’t limit the number of replicas, though I personally have never witnessed more than 10 replicas. With some cascading, I’m sure you could run dozens without bigger issues.

This sort of hand-wavy "benchmark" is not really good for anything other than the author's satisfaction. Real-world scenarios are not like that.


>"try all the common vertical scale-up approaches and tricks. Try to avoid using derivative Postgres products, or employing distributed approaches, or home-brewed sharding at all costs – until you have, say, less than 1 year of breathing room available."

Very healthy approach. I've always followed the idea of vertical scalability when writing my modern C++ app servers with a local Postgres. Since I don't sell those to FAANG, I've never failed to find decent, very reasonably priced hardware, be it dedicated hosting or on-prem, that satisfies the client's needs for any foreseeable future. More than that, I've never even needed top-of-the-line hardware. I concentrate on features and robustness instead. Using C++ also gives a nice speedup.


I like how the article starts with the metrics of what a single node can already do. The trend appears to be to scale right at the start, even before you have your first real customer ... the implications and results are obvious.


I wonder when using a distributed database (like CockroachDB) will be the default for new applications. Right now it seems that they have fewer features and are harder to set up than traditional RDBMSes, but I can only assume that this gap will narrow and at some point in the future things will be "scalable by default". (Of course no DB is going to prevent all ways to shoot yourself in the foot.)


This question is similar to asking on a car forum when a 40-foot lorry will become the default starter car for everyone. The answer is "probably never", because while it does offer superior cargo-transport scalability, the tradeoffs are not worth it for the vast majority of users. The question is posed as if distributed databases have no disadvantages over non-distributed databases, but that is simply not the case. Clustering introduces all sorts of problems, from network splits to slow(er) joins across nodes.


To back up how far one server can go, Stack Overflow used a single database for a long time

https://nickcraver.com/blog/2016/02/17/stack-overflow-the-ar... (2013 post had much less redundancy, but even their 2016 architecture is pretty undistributed in terms of being able to recreate everything from the single source of truth database)


Yeah, it's been a while since I've been in a high traffic situation, but back in the late 00s we had a couple sites with 3-5 million daily unique users. Each site had a single master, a couple read replicas, and some queries were cached in memcached.

The problem in MySQL we eventually ran into was the read replicas fell behind master during peak load (by 30-60 minutes depending upon the day). So we still had to hit the master for certain queries. I left before we took the next step to fix this issue.


I don't think it is quite the same.

- Switching from a car to a van to a lorry is fairly low cost. You don't need to recreate your product (probably).

- You don't need to run distributed databases in a cluster to start.

But I think most importantly, the decrease in dev speed and performance is an investment in future scalability. And I can only imagine that this difference will shrink over time, to the point where, for example, a 1-node "distributed" DB isn't that different to work with than a 1-node "traditional" DB. And that small difference pays off because adding a second node doesn't have huge tradeoffs.

I agree that right now it doesn't make sense. If I was starting a new product I would still fire up a PostgreSQL instance. However I think that the day will come where the difference is small enough and software support is complete enough that we will start with something distributed, much like people don't often start with SQLite today, even though it is quicker to get going (also not a great comparison because it has less features).


Making investments in future scalability at the cost of dev speed and performance is exactly the wrong tradeoff given that 90% of startups fail. At the start, when you have very little income, you want to allow for as much speed and flexibility as possible so that you can get to product/market fit ASAP. By the time your company gets big enough that plain MySQL/Postgres can't handle the load any longer, you will have more than enough money to afford a few experts that can help you migrate.


> I agree that right now it doesn't make sense.

This CRDB engineer respectfully disagrees. This thread takes it as a given that a non-distributed DB is better if you don't need to scale up (i.e. if you run a single "node"). Let me offer a counterpoint: it's easier to embed CRDB into some software you're distributing than it is to embed Postgres. This is to say, we do try to compete at every scale (well, perhaps not at SQLite scale). CRDB doesn't have all the SQL features of PG, but it does have its pluses: CRDB does online schema changes, it is generally simpler to run, comes with a UI, comes with more observability, can back up to the cloud, and can be more easily embedded into tests.

Online schema changes are a big deal; the other thing that I hope will help us win small-scale hearts and minds is the ever-improving observability story. I hope CRDB will develop world-class capabilities here. Other open-source databases traditionally have not had many capabilities out of the box.


For RDBMS, Firebird SQL has managed to survive well in this sphere, for all of its warts.

You can run it either as an embedded DLL or as a standalone server, and changing between the two is often just a connection-string change; it seems to be fairly popular in the POS space for this reason.


> But I think most importantly the decrease in dev speed and performance is an investment in future scalability.

It makes sense only if you'll ever need that scalability. And you take a hit on other fronts too: infra cost and deployment complexity. Both deployment complexity and code complexity also increase QA cost and instability, and hurt product and company reputation.

>> much like people don't often start with SQLite today

Maybe they should start with SQLite by default


For personal projects, I start with SQLite by default. Never came close to its scalability limit too ;)


Why would it ever be default? Very few applications need a distributed database. These days you can get a single machine with hundreds of terabytes of storage and terabytes of RAM. Add a spare machine for failover and you have a reliable setup for any but the biggest tasks. And the tasks that that setup isn't sufficient for will certainly demand more thought than a cookie-cutter setup of some random distributed DB.


To me it looks like there are not many affordable options right now.

CockroachDB understandably wants you to use their Cloud or Enterprise products: the OSS version is quite limited. For example it doesn't support row-level partitioning (https://www.cockroachlabs.com/docs/stable/configure-replicat...). Which means, if I understand correctly, it is not of much help for scaling writes to a single big table.


Hi cuu508, CockroachDB engineer here. You are correct that row-level partitioning is not supported in the OSS version of CRDB. However, it sounds like there's a bit of confusion about where manual table partitioning is and is not needed. The primary use-case for row-level partitioning is to control the geographic location of various data in a multi-region cluster. Imagine a "users" table where EU users are stored on European servers and NA users are stored on North American servers.

If you are only looking to scale write throughput then manual partitioning is not needed. This is because CRDB transparently performs range partitioning under the hood on all tables, so all tables scale in response to data size and load automatically. If you are interested in learning more, https://www.cockroachlabs.com/docs/stable/architecture/distr... discusses these concepts in depth.


Thanks for explaining, and sorry -- looks like I jumped to conclusions too quickly!


No, you didn't understand correctly. The feature that isn't supported is row-level replication zones. Replication zones allow you to define the location of replicas.


Oh, I didn't realize that the free version was that limited. I guess I need another name to use as the default open source distributed database.


What we need is a database that can both scale up and scale out. Most distributed databases offer poor efficiency and performance on a per-node basis, which has a high operational cost. This is why people avoid using distributed databases unless they need them. In practice, a scale-up database can serve as much workload as a pretty large scale-out database. This discontinuity creates the market for scale-up systems.

There is literally nothing preventing distributed databases from having excellent scale-up performance too. Unfortunately, people who design distributed databases have a strong bias toward unnecessarily throwing hardware at performance and scalability problems. This is partly because very few people know how to design a modern scale-up database; making something (trivially) distributed is easier.


If nothing prevents scaling up a scale-out database does that not also mean nothing prevents scaling out a scale-up database? Aren't all scaling problems trivial at that point?


I think the "default" will evolve with whatever offers the best "serverless" experience in the public clouds. In particular, the cheapest and most granularly-billed option.


If you are going to allocate sharded databases per client with identical schemas, might as well give each of them its own SQLite database? Since you're massively cutting down the writes if it's one database per customer.


And then just one query comes along where you need to make an update that should be globally visible, and not just visible to the shard.

I can see why you would stick with full Postgres for as long as practicable.


If it's that global of an update, it's probably something that could be run as part of a schema migration and could just be handled during update rollout.

If it's some form of state that needs a global update on the regular, it's likely a sign that your data is not sharded properly.


I am totally pro hosted solutions. However, hosted Postgres has never been one of them for me: in my experience, on a same-sized machine, some queries took 30x longer hosted vs. self-hosted.


This was an interesting read for a database novice. It seems like a lot of the quoted stats are about in memory datasets - is that realistic?


Yes. Memory can reach 768GB on a single instance today, and I expect that to keep growing. From there you can scale by sharding.

Keeping everything in memory gives you real-time transaction latencies that you can't guarantee when using disk-based storage.
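
A crude but common way to check whether the hot set actually fits in memory is the buffer cache hit ratio (this only sees shared_buffers, not the OS page cache):

    -- A ratio persistently near 1.0 suggests reads are served from memory
    SELECT datname,
           blks_hit::float / nullif(blks_hit + blks_read, 0) AS cache_hit_ratio
    FROM pg_stat_database
    WHERE datname = current_database();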


You can get more than that on common servers, see e.g. the recent post by Let's Encrypt on their new database server:

https://letsencrypt.org/2021/01/21/next-gen-database-servers...

They have 2TB RAM in there, and I suspect that is not the largest possible amount if you're willing to spend more money (though probably the largest possible amount for that particular server).

I played around a bit on the Dell website, and the largest server I could find supported up to 6TB RAM, with a price at ~300k EUR (I assume nobody pays list price for these).


4TB seems to be the limit for normal-ish servers. 32 slots filled with 128GB DIMMS. You can find servers with more slots (48 is common, 6TB), but you're going past most people's definition of "commodity".


> In memory provides real-time transactions you can't guarantee when using disk-based storage.

This is changing as we discover better ways of coding IO operations.

https://www.globenewswire.com/news-release/2019/11/05/194114...

https://itnext.io/modern-storage-is-plenty-fast-it-is-the-ap...


EC2's u-24tb1.metal is 224/448 cores/hyperthreads and 24TB RAM ;)

https://aws.amazon.com/sap/instance-types/


And nearly 2 million for a 3 year reserved instance.


This reminded me of a story from 15 years ago.

I once worked for a company that was writing a proposal for a US Homeland Security IT system. This was 2006. I wasn't involved in it but my office-mate was. He randomly turned his chair around and said "hey, can you go on Dell.com and see if you can build the most expensive server imaginable" – so I did and I ended up at around $350k. I don't remember what it was, but at the time the stats just felt absolutely obscene. I do remember that, at that scale we were buying a closet, not just one box.

And I told him the price, and he said, "They have set aside $20 million for server hardware, and what you configured is way way overkill for what they think they need" – so we were both so damn perplexed because, at least from what he told me, this didn't need that much. And it wasn't one of those classified projects where they couldn't tell you what it was for. It was a pretty boring line of business type thing (like HR or building access control or something).

Maybe that's more a story of just how much cash was being poured into that agency in the years after 9/11.


They got to the end of their financial year and had $$$ left to spend... happens all the time.


But this was for an RFP. They would've had to continue to obligate that money, likely for years to come.


How long does it take to get the data off disk and into memory after coming online? A decade ago filling just 64GB of memory with our hot dataset was a painful process. I can't imagine it's any nicer with 768GB.


On a modern RAID of NVMe drives you can do several gigabytes per second; with 20 PCIe Gen3 lanes (roughly 1 GB/s each) on a 3400G, about 10 GB/s is realistic, against a theoretical max of about 20 GB/s.

Even assuming your system is badly configured and can only manage 3 GB/s, filling 768 GB takes around 256 seconds, i.e. a bit over four minutes, which is not very painful.


Yes. That’s why the OP talked about separating hot/cold data and scale up instead of scale out.


It depends a lot on what you’re doing. In my last role, keeping the Postgres database in memory was absolutely not an option, and we ran into major issues related to physical I/O.


> A single node instance is literally bullet-proof

That is exactly the wrong way to use that word.


TLDR: As always, don't fix a problem until just before you have it.


Anybody know if this applies to temporal tables? I've thought of using Datomic, but it seems like temporal tables do the job. I'm not sure about immutability and how that might impact performance, since you are now keeping a timestamped record of every transaction.



