PostgreSQL 16 Beta 1 (postgresql.org)
154 points by jkatz05 on May 25, 2023 | 60 comments



For being "boring and stable" technology, Postgres 16 includes 200 features, which is on par with prior years' releases. If you're looking for the full set of features, it's available here: https://www.postgresql.org/docs/16/release-16.html

Personal favorites on the list include:

- load_balance_hosts, which is an improvement to libpq so you can load balance across multiple Postgres instances.

- Logical replication on standbys

- pg_stat_io which is a new view that shows IO details.
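For the curious, the load balancing is driven from the connection string. A sketch (hostnames hypothetical) per the 16 libpq docs:

    # libpq tries hosts left to right by default; load_balance_hosts=random
    # (new in 16) randomizes the order of hosts and resolved addresses
    postgresql://app@pg1.example.com:5432,pg2.example.com:5432/mydb?load_balance_hosts=random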


Could you elaborate on the load balancing?

Is this a replacement for PG bouncer and similar?


IMO the biggest reason folks use pgbouncer is not for load balancing (which it can do, -ish) but instead for connection pooling. Postgres connections are expensive for the db server (one process per connection, not one thread), so if you have, say, thousands of web application pods, you need to use pgbouncer or similar as a proxy to multiplex those thousands of connections down onto a more manageable number (~200). So no, not really.

(EDIT: if you don't know this already - the _establishment_ of connections is also super expensive. So another reason to pgbounce is to keep connections persistent if you have app servers that are constantly opening and closing conns, or burst-open conns, or suchlike. Even if the total number of conns to pg doesn't go super high, the cost of constantly churning them can really hurt your db)
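To make the multiplexing concrete, here's a minimal pgbouncer.ini sketch - hostnames, credentials, and pool sizes below are illustrative, not recommendations:

    [databases]
    mydb = host=10.0.0.5 port=5432 dbname=mydb

    [pgbouncer]
    listen_addr = 0.0.0.0
    listen_port = 6432
    auth_type = md5
    auth_file = /etc/pgbouncer/userlist.txt
    pool_mode = transaction   ; hand a server conn out per transaction
    max_client_conn = 5000    ; thousands of app connections in...
    default_pool_size = 200   ; ...~200 real server connections out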


Out of curiosity, if the problem of connections being expensive is solvable by PGBouncer-style connection multiplexing, why doesn't Postgres just do that by itself?


If it were easy to do well in-core, I think we'd do it immediately. Unfortunately the interesting pooling modes in pgbouncer also break a few things - which would likely not be acceptable in core postgres. Avoiding such breakage requires non-trivial architectural adjustments.

Historically connection state and "process state" have been tightly coupled, for good server-side pooling they have to be divorced. While good pooling is doable with the current process model (passing the client file descriptor between processes using SCM_RIGHTS), it's much harder with processes than with threads - this is one of the reasons I think we will eventually need to migrate to threads.

Eventually I want to get to a point where we have a limited number of "query execution workers" that handle query execution, utilized by a much larger number of client connections (which do not have dedicated threads each). Obviously it's a long way to go to that. Ah, the fun of working on a complicated application with a ~35 year history.

There also are use cases for pgbouncer that cannot be addressed on the server-side - one important one is to run pgbouncer on "application servers", to reduce the TCP+TLS connection establishment overhead and to share connections between application processes / threads. That can yield very substantial performance gains - completely independent of server side pooling support.


Because pgbouncer's transaction-based pooling, which is what the previous poster was referring to, breaks a few postgres features. This is fine for most applications, but not all. See the table on https://www.pgbouncer.org/features.html


I really wish there was a way to handle transaction pooling without losing session pooling features.

Essentially, a "please serialize everything (temp tables, SET GUC values, etc) from this session to disk and load it back when necessary".


Yeah, good callout. I'd be totally fine with a server mode, or connection option, that opts you into pooling but then disables the incompatible features. You choose pooled mode if you need it, and the tradeoffs are documented.


I believe there has been talk and perhaps progress towards building it into PG, but I can't find the dev thread right now, and I don't know what the status of that is.

Interestingly enough, this is what Oracle does AFAIK. They are also process-per-conn & have an optional sidecar proxy thingy that you can run on your Oracle host to do the pooling. I would rather it be built more tightly into the rdbms, but that's not a terrible solution.


There were changes made to idle sessions in postgres 14.0 that were supposed to reduce the resource usage of open but idle connections.

Crunchydata mentioned it on their blog a while back (https://www.crunchydata.com/blog/five-tips-for-a-healthier-p...) and the pg 14 release notes mention a few changes to idle sessions (https://www.postgresql.org/docs/release/14.0/)

I don't know if they were sufficient to make pgbouncer no longer necessary; I haven't had a need to try it.


Postgres forks an OS process for each connection, which is relatively heavyweight. Oracle has a similar architecture to Postgres, and they solved it with a "multi-threaded listener". MySQL, in contrast, uses threads, which makes connections lighter weight.

It's fundamentally a question of how the connection listener communicates with the rest of the database, e.g., using shared memory or some other IPC mechanism, work queues, etc. Having too many connections results in problems with concurrent access and lock contention independent of how heavyweight the actual listening process is.


That's pretty much been my take as well. I've been looking at Cockroach Labs (CockroachDB hosted cloud), and it's been kind of hard to get any detail on whether pgbouncer is beneficial for it, since I'm planning to connect via node_compat via Cloudflare Workers. I've kind of decided to punt the issue and connect directly for now in development; closer to prod release I'll do some tests with/without pgbouncer at each cloud location for the larger prod cluster instances.

It's not the same service as PG even if it speaks the same protocol, so while I know pgbouncer is really beneficial for connection queueing in my scenario with PG, I have no idea on the CDB side.


Postgres 14 already made huge improvements on this front.

https://techcommunity.microsoft.com/t5/azure-database-for-po...


Postgresql is amazing but I must say the last few updates have been a bit disappointing.

Still waiting for automatic incremental updates for materialized views - it's been worked on for several years but still not released!

https://wiki.postgresql.org/wiki/Incremental_View_Maintenanc...

https://github.com/sraoss/pg_ivm


To be fair, view update is a very hard problem in computer science. It's not as though it's just a lack of time to implement it.


I've done some work in that space, which was mainly to build an alternative view materialization scheme[0] in PlPgSQL that reifies the internals of CREATE MATERIALIZED VIEW and REFRESH MATERIALIZED VIEW but with the following properties:

  - the materialized view is a plain table, so
     - you can write to it from triggers
     - you can have triggers on it
  - refreshing a view records the deltas in a
    history table (which is useful as a poor
    person's logical replication scheme)
  - you can mark a view as needing a refresh
Then in the application I have hand-coded triggers to either update the view's materialization directly or to mark the view as needing a refresh. A background job can asynchronously refresh views as needed.
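A bare-bones sketch of that "mark as needing a refresh" pattern - all table, function, and view names here are made up for illustration:

    -- a trigger on a source table flags the materialization as stale;
    -- a background job later refreshes flagged views asynchronously
    CREATE TABLE mat_view_state (
        view_name text PRIMARY KEY,
        needs_refresh boolean NOT NULL DEFAULT false
    );

    CREATE FUNCTION mark_order_totals_stale() RETURNS trigger AS $$
    BEGIN
        UPDATE mat_view_state
           SET needs_refresh = true
         WHERE view_name = 'order_totals';
        RETURN NULL;  -- AFTER statement-level trigger; return value ignored
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER orders_touch_mat_view
    AFTER INSERT OR UPDATE OR DELETE ON orders
    FOR EACH STATEMENT EXECUTE FUNCTION mark_order_totals_stale();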

I've also spent some time thinking about the AST form of view queries that PG stores and how one might automatically generate triggers on source tables that update the materialization or mark it as needing a refresh.

As you note, many queries can be very difficult to transform into queries that compute incremental deltas. Moreover, even where it's possible to do that, the time it takes to execute the delta computation might be unacceptably long. For example, if you have a recursively nested grouping schema and you want to maintain a view of the expanded transitive closure of that data, then removing a large group from another might require thousands of row deletions from the materialized view, and that might make the transaction take much too long in a UI -- the obvious thing to do here is to say "sorry, that kind of update takes a while to propagate, but your transaction will complete quickly", so just mark the view as needing a refresh and refresh it asynchronously.

[0] https://github.com/twosigma/postgresql-contrib/blob/master/m...


https://materialize.com/ are doing interesting things with this - although not inside Postgres (they use a few bits of Postgres as components).


I think PlanetScale implement something similar.

https://planetscale.com/blog/how-planetscale-boost-serves-yo...


On the one hand, I totally understand.

On the other, it's pretty confidence-inspiring that they don't put stuff in until they're sure it's ready.


Yeah, that's one of my favourite characteristics of PostgreSQL: if a feature makes it into a release, you can be VERY confident that it works well even across weird corner-cases.


I have one use case where this functionality would be huge in terms of performance. Patiently waiting for it.


You can use https://github.com/twosigma/postgresql-contrib/blob/master/m... and write triggers to update your views or mark them as needing asynchronous refreshes. It's not as automatic as what you'd like, but it's a lot better than having to keep waiting.


My list of unexpected but very welcome changes:

- pg_hba.conf and pg_ident.conf can include other files

- Logical replication apply can use non-PK btree indexes

- Integer literals in non-decimal bases

- Underscores in numeric literals

- Subqueries in the FROM clause can omit aliases

- Addition and subtraction of timestamptz values

- pg_upgrade can override new cluster's locale and encoding
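A few of those are easy to show in a line of SQL each - a sketch based on the 16 release notes:

    -- non-decimal integer literals and underscore digit separators
    SELECT 0x1A AS hex, 0o17 AS oct, 0b1010 AS bin;  -- 26, 15, 10
    SELECT 5_000_000 AS population;

    -- subqueries in FROM may now omit the alias
    SELECT * FROM (SELECT 1 AS x);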


> - Subqueries in the FROM clause can omit aliases

This is great. It never made any sense to me that this was required. For people who are unaware: say you want to understand a table, a natural way of doing it might be

    select * 
    from the_table
    order by some_metric desc 
    limit 10
so you'd think you can do the same for queries like

    select * 
    from (
      select blah blah blah the rest of the query
    ) a
    order by some_metric desc 
    limit 10
you need to put the alias 'a' to placate existing postgres even though it's never actually used, which never made any sense to me.


It makes sense in that every table needs a name to reference, however if you only have the one table there isn't any ambiguity.


It never made sense.

And FYI this comes from ANSI SQL.


It does the second you want to join to the results of that subquery...this is an optimization that recognizes that the alias is never used and thus can be omitted. Nice QOL feature.


Not a bad list!

https://commitfest.postgresql.org/19/1741/ (index skip scan/loose index scans) would be very welcomed... I think. Not sure how many people run into it in the wild.

It says "target version: 16" but "returned with feedback" and hasn't been bumped in 14 months. :( First opened in 2018.

https://wiki.postgresql.org/wiki/Loose_indexscan


"parallel execution of the string_agg and array_agg aggregate functions" caught my eye - that feels like one of those minor improvements that might enable some pretty interesting new applications.

array_agg is particularly interesting, because it lets you implement patterns like this: https://til.simonwillison.net/sqlite/related-rows-single-que...
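A sketch of that pattern with hypothetical tables - one round trip pulls each author together with an array of their related article ids:

    SELECT a.id,
           a.name,
           array_agg(ar.id) AS article_ids
      FROM authors a
      JOIN articles ar ON ar.author_id = a.id
     GROUP BY a.id, a.name;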


COPY from files up to 300% faster

Load Balancing from client libs

Support for CPU acceleration using SIMD for both x86 and ARM architectures, including optimizations for processing ASCII and JSON strings

new pg_stat_io view that provides information on I/O statistics
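e.g. a quick look at which backend types are doing the reads - column names here are per the 16 docs:

    SELECT backend_type, object, context, reads, writes, extends
      FROM pg_stat_io
     ORDER BY reads DESC NULLS LAST;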


Have to say, I find the SQL/JSON support in Postgres (jsonb) is probably the single biggest killer feature IMO. I can't believe MS (still) doesn't have an implementation (their SQL to JSON functions suck exponentially by comparison).

Nice to see the continued advancement and progression all around.


FYI: MS is releasing support for a native JSON data type in the next release.

But yeah, obviously they're behind in that area!


Is MSSQL used often in production? And where are all the windows servers this would run on?


Yes... many, many internal deployments are definitely using MS-SQL services. StackOverflow.com uses MS-SQL, for a prominent, public example. There are client libraries via ODBC or direct in many languages and platforms, including Node.js (mssql/tedious) and Rust.

As to where: AWS offers Windows, as do many other cloud providers, including a significant portion of Azure VMs. Not to mention that MS-SQL and SQL-Edge both run on Linux. IIRC, Azure Cloud SQL is also MS-SQL running on a non-Windows version. There are also Linux/x86_64 Docker images.

Aside: if you're willing to write a big check, MS-SQL replication is far easier than pretty much anything else to set up and configure (UI-based flows or scripted). While I personally advocate for PostgreSQL, I've used and mostly like MS-SQL fine.


That's really interesting, a lot of that is new information to me! Thank you for the detailed response, I appreciate it.


Postgres getting good updates at slow but steady pace over many decades is really impressive.


This is an exciting release to be sure. Very happy to see that the CPU improvements include ARM, diversity is good!

Given the current release pace, I'd love to have another upgrade path than `pg_dump | psql`. That would remove a great deal of friction in prod.


> I'd love to have another upgrade path than `pg_dump | psql`

pg_upgrade
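A typical invocation looks something like this (paths illustrative; both versions' binaries must be installed):

    pg_upgrade \
      --old-datadir /var/lib/postgresql/15/data \
      --new-datadir /var/lib/postgresql/16/data \
      --old-bindir /usr/lib/postgresql/15/bin \
      --new-bindir /usr/lib/postgresql/16/bin \
      --link  # hard-link data files instead of copying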


pg_upgrade is not an option on many systems if you are installing with the package manager, as it requires both old and new binaries to be simultaneously present.


Well, that's a problem of the package manager, not a Postgres problem.

At least with Ubuntu and RedHat/CentOS/Rocky/... installing two versions in parallel is absolutely no problem.

I think Ubuntu even wraps that into their upgrade scripts, and you can pass an option to use pg_upgrade or pg_dump/pg_restore.


Arch Linux, for example, solved this by having a `postgresql-old-upgrade` package.


Or logical replication to the new box then migrate traffic.


There is pg_upgrade and logical replication. I have used both and they both work excellently. Logical replication can be a bit of a hassle to set up though, especially if the database is huge.
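For reference, the logical replication path is roughly as follows (names hypothetical; the schema must be copied to the new server first, e.g. with pg_dump --schema-only):

    -- on the old primary
    CREATE PUBLICATION upgrade_pub FOR ALL TABLES;

    -- on the new server
    CREATE SUBSCRIPTION upgrade_sub
      CONNECTION 'host=old-primary dbname=mydb user=replicator'
      PUBLICATION upgrade_pub;
    -- once caught up, switch application traffic over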


I look forward to seeing some benchmarks. Especially to see the effect of the SIMD optimizations.


Can't wait to have it available in RDS in 2030.


AWS has gotten much better in recent years. Pg15 is available on Aurora as of last month. That's about 6-7 months after it was released, a significant improvement over years past.


Yup, but the lack of access to beta versions also means that we will basically get access to PostgreSQL 16 in RDS around this time next year.


Hey, I helped to write the original article, and I also work at AWS.

I'm happy to let you know that PostgreSQL 16 Beta 1 is available in the Amazon RDS Database Preview Environment (https://aws.amazon.com/about-aws/whats-new/2023/05/postgresq...). I hope you get a chance to test; I'm very eager to hear any and all feedback about PostgreSQL 16.


Thanks for the info!

Would you also know by any chance when RDS will get Graviton3 instances in non-Dublin EU regions?


Bunch of interesting loosening of restrictions on logical replication. Allowing standbys to also have subscribers; great for fan-out! Bunch of perf improvements (honestly a bunch of it is removing rather onerous limitations I didn't know about!). At the end there even seems to be a developer flag for allowing something like bidirectional replication too.


Glad to see the move off of autoconf to meson.


I am personally very excited to see that work on direct IO has started. It is a huge undertaking, so I do not expect anything production-ready any time soon, but it's nice to see that some code landed.


Yeah- we've been working on that for quite a while, and it'll take a few more years to finish it. I hope we can get some important parts into 17 though.

FWIW, there have been smaller prerequisites merged into 15 already, and 16 has a number of improvements that are part of that work. E.g. the more scalable relation extension (making COPY scale much better), and the related buffer mapping changes, come from the AIO effort.

Personally I think the feature is using asynchronous IO and direct IO support is part of that :)


Meanwhile, PG15 is still in preview mode on Azure



PG 15 has an EOL of Nov 2027, so you should be ok to stick with it for a while. ;)


It can be. We're still running PG11 and 13 in production. We're not obligated to upgrade - especially when the software is rock solid.


PG 11 is end-of-life, right? So you won't be getting new security updates. What's worse, libraries will stop supporting it over time, so you'll find it difficult to stay up to date in other places in your stack.

If I was running PG 11 in production, I'd be looking to upgrade sooner rather than later. Being on older-but-stable versions is good risk management, being on end-of-life version is not.


> PG 11 is end-of-life, right? So you won't be getting new security updates.

Not quite yet https://www.postgresql.org/support/versioning/ - the final release will be November 9, 2023.


We roll onto new stuff. So soon-ish the 11 will be replaced with a 14 or 15. And then later the 13 will roll up to whatever is good, stable and tested in our environment. And our roll-up plan has never seen an old PG lose client library access. PG has never ever, in 20+ years been the thing that holds us back. I also think risk-management has many more (important) factors than EOL.

Edit: quick check shows we also have PG9 and PG10 in some Live roles too.


Whenever you spend a day full of updates, fixing small issues but also gaining new features and performance, just remember there is someone out there with some super old db that no one wants to touch again, where the migration will take weeks!



