
When does a database need to maintain a large number of open connections? You really don't need a lot of connections to serve high throughput. The database system typically becomes a bottleneck before the connections do. A good client-side pooling implementation will manage and limit connection usage.
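For instance, something like this (a minimal sketch, assuming SQLAlchemy + psycopg2; the host name and pool sizes are made up):

    from sqlalchemy import create_engine, text

    # one pool per application process; SQLAlchemy keeps at most
    # pool_size + max_overflow connections open to the database
    engine = create_engine(
        "postgresql+psycopg2://app@db-host/appdb",
        pool_size=5,      # steady-state connections held open
        max_overflow=5,   # extra connections allowed under bursts
        pool_timeout=30,  # seconds to wait for a free connection
    )

    with engine.connect() as conn:
        conn.execute(text("SELECT 1"))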



Say you have something really simple, web -> db. But at a minimum, you have 2 web servers for HA. So you're already at 2 client-side pools.

Now make it a tiny bit more complex and split this into 2 services, so you're now at 4 client-side pools. Should we make it 4 services instead? Or do you see where I'm going with this? (then add ad-hoc scripts / crons, ...). What if we do make it 4 services and some of them are being deployed dynamically, and at peak, you might have 8 instances of some of them.

Now, say you're at the point where you want to run some on-demand reporting queries on your DB, but you're only talking about tables with 10-100 million rows - not quite the point where you want to manage a reporting-specific database. So maybe you need a slightly larger pool because some of the queries can take over a minute.

Now, work_mem has a globally-set default and effectively gets multiplied by your connection count, and it's super important: you really want to set it to 16MB (but 32MB would be ideal for a few of your cases). Drat, now you're getting squeezed on the most expensive hardware component: memory.


Well, 32MB * 400 clients ~= 12.5GB. It's not an irrelevant amount, but if you have that much load on your DB you should have a beefy machine anyway, and that isn't extravagant for one of the two main memory areas (and the shared memory is much smaller). In fact, I wouldn't even try to run SQL Server on that amount of memory (let alone Oracle).
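The back-of-the-envelope math, spelled out (and a single query can use work_mem more than once, so this is really a floor):

    work_mem_mb = 32
    clients = 400
    print(work_mem_mb * clients / 1024, "GB")  # 12.5 GB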

But yeah, if you go full microservices automatically deployed over Kubernetes, all connecting to the same place, you will need something in between so your DB server doesn't go crazy. That setup would break any central DB anyway; it's just that different DBMSs would break for different reasons.


But no one says you need a separate connection pool for each client application. There are cases where that's the right thing, but you can just as well have a single connection pool per PostgreSQL instance that all the apps go through. So you absolutely don't need a pool for each of the "8 instances" at peak or whatever.
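Roughly: run one shared pooler (PgBouncer is the usual choice) next to the Postgres instance and point every app at it instead of at Postgres directly. A minimal sketch, assuming psycopg2 and PgBouncer on its conventional port 6432 (the host name is made up):

    import psycopg2

    # every service connects to the shared pooler, not to Postgres itself;
    # the pooler multiplexes these onto a small number of server connections
    conn = psycopg2.connect(
        host="pgbouncer.internal",  # the pooler, not the DB host
        port=6432,                  # PgBouncer's conventional port
        dbname="appdb",
        user="app",
    )
    with conn, conn.cursor() as cur:
        cur.execute("SELECT 1")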

The problem with Postgres vs. many connections is a simple combination of two simple facts.

Firstly, Postgres has a dedicated process managing each connection - which significantly simplifies some of the programming because most resources are "private" etc. This made perfect sense back when machines had very few cores, threading was much more expensive, etc. And development time is always precious, of course.

Secondly, people often don't distinguish between active and idle connections. But a system with X cores reaches top throughput at 2X-3X active connections, and at some point the throughput tanks because the active backends have to share resources (CPU, work_mem, etc.).

And the database and/or DBA has to manage that somehow - if you have 128GB of RAM available for query processing, it matters if you allow 100 or 10000 connections. With 100x more connections you can't allow the backends to use as much memory, which probably means less efficient sorts etc. You may assume most connections will be idle at any given time, but that may easily change (a bug in new app version or whatever), storming the DB, exhausting resources, etc.
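To make the 128GB example concrete (a quick illustration, not exact Postgres accounting):

    ram_for_queries_gb = 128
    for max_connections in (100, 10_000):
        per_backend_mb = ram_for_queries_gb * 1024 / max_connections
        print(max_connections, "connections ->", round(per_backend_mb), "MB per backend")
    # 100 connections   -> ~1311 MB per backend
    # 10000 connections -> ~13 MB per backend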

Moreover, some (fairly hot) parts of the transaction management code need to walk active connections to check visibility etc, and the more connections you have the more expensive this is. And there are probably more places with similar behavior. We've fixed / optimized a lot of them, because the number of cores keeps growing and it was hurting even "reasonably" configured instances, but that gets you maybe to 500 - 1000 connections (?), not 100k.

And I don't think that'll change anytime soon, so your best bet is still a connection pool.

Now, I'm not claiming it's ideal - but there simply are historical and practical reasons why things are designed the way they are, and it's not a matter of swooping in and optimizing one or two places.


At that point you really want to front all database access with a service; otherwise you're asking for other problems with schema management, security, resource quotas, etc.

You can even have separate pools for different use cases like reporting if you don't want a separate server. But each pool really doesn't need more than ~5 connections; even with multiple pools and multiple replicas of your service, it rarely adds up to hundreds.
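E.g. two small pools in the same service, one for OLTP traffic and one for slow reporting queries (a sketch, again assuming SQLAlchemy; the sizes and timeout are illustrative):

    from sqlalchemy import create_engine

    oltp = create_engine(
        "postgresql+psycopg2://app@db-host/appdb",
        pool_size=5, max_overflow=0,
    )
    reporting = create_engine(
        "postgresql+psycopg2://app@db-host/appdb",
        pool_size=5, max_overflow=0,
        # give the long reporting queries their own statement timeout (2 min)
        connect_args={"options": "-c statement_timeout=120000"},
    )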


Modern web frameworks generate a lot of connections.

Say a typical application runs on 4 instances, each with 16 cores. That means a total of 64 processes, since there's one process per core.

Each process opens 16 connections to the database, either through bad configuration or a default tied to the number of cores. We're now looking at 1024 SQL connections out of the box, for nothing.
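Spelled out (using the illustrative counts above):

    instances = 4             # app servers
    procs_per_instance = 16   # one worker process per core
    pool_per_proc = 16        # per-process pool defaulted to the core count
    print(instances * procs_per_instance * pool_per_proc)  # 1024 connections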

That's why databases are commonly subjected to insane numbers of connections.


The fact that some other part of the software stack does something silly does not mean the database has to cater for that. If it's a misconfiguration, fix the misconfiguration.


When you have junior devs insisting on using an ORM, connection setup costs become an issue.


Don't ORMs use connection pooling by default to avoid these issues?



