PostgreSQL 11 and Just in Time Compilation of Queries (citusdata.com)
213 points by ScottWRobinson on Sept 11, 2018 | 54 comments



HPE did an excellent, no-marketing-BS technical writeup on PG11[0]. I highly recommend that everyone interested in the new features read it.

[0] https://h50146.www5.hpe.com/products/software/oe/linux/mains...


Cool! Is there a similar doc for pg10? I'm still using 9.5. Thanks!


There is:

PostgreSQL 10 New Features With Examples https://h50146.www5.hpe.com/products/software/oe/linux/mains...


I'm curious whether we will see similar improvements for full-text search. If PostgreSQL continues to improve at this rate, there will be little reason to use many of the alternatives...

Personally, I already view PostgreSQL as comparable (for many use cases) to paid options[1].

Interestingly, for queries such as:

   EXPLAIN ANALYZE SELECT * FROM (SELECT * FROM sort1 ORDER BY 1) AS a LIMIT 5;
You'll see an order of magnitude speed improvement (page 47 of [2]).
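In case anyone wants to try it, a minimal reproduction might look like this (the table name sort1 comes from the HPE doc; the column definition and row count are my assumptions):

    CREATE TABLE sort1 (id integer);
    INSERT INTO sort1
        SELECT (random() * 1000000)::integer FROM generate_series(1, 1000000);
    -- In PG11 the planner can reportedly apply the LIMIT to the subquery's
    -- sort (a bounded "top-N" sort), which is where the win comes from.
    EXPLAIN ANALYZE
        SELECT * FROM (SELECT * FROM sort1 ORDER BY 1) AS a LIMIT 5;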

[1] https://austingwalters.com/fast-full-text-search-in-postgres...

[2] https://h50146.www5.hpe.com/products/software/oe/linux/mains...


Seriously, if PG gets better text search to put it on par with Lucene (RUM indexes, BM25, and TF-IDF), I don't see much reason to use anything else either.
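For reference, BM25 (Lucene's default relevance function) scores a document D against query terms q_i roughly as:

    score(D, Q) = sum_i IDF(q_i) * ( f(q_i, D) * (k1 + 1) )
                  / ( f(q_i, D) + k1 * (1 - b + b * |D| / avgdl) )

with typical constants k1 ≈ 1.2 and b ≈ 0.75, where f(q_i, D) is the term frequency in D, |D| is the document length, and avgdl is the average document length. As far as I know, the corpus-wide IDF term is exactly what Postgres' built-in ts_rank lacks: it only looks at the document and the query, not at global term statistics.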


Off the top of my head, I can list at least four reasons to actually use something else:

- Not distributable

- Poor resiliency

- Hard to upgrade

- Schemas remains at postgres core

Every solution to address any of these points in PG is a hack (bucardo, londiste, slony, pgbouncer, things relying on triggers or proxies). Quite honestly, PG contributors are doing an impressive job. PG is arguably the best relational database on the market as of today and still improving very fast. However, it's also one of the most (if not the most) bloated. And despite all these features nobody needs, and after all these years, it's still lacking some very basic yet essential items anyone would ask for in any modern application. It feels like things haven't changed that much since the 90's. DBAs have to hack around things and root issues never get addressed. Implementing JIT optimizations looks like fun. It will undoubtedly improve some queries by 2x, 4x or even 10x, but it won't be a game changer anyway. It'll just increase the postgres bloat and the overall complexity of the system. Meanwhile, we'll still be lacking essential things that would make PG suitable for pretty much any use-case.


People work on the features that they either are paid to work on, or are otherwise interested in. While far from a perfect selection, there's some correlation between what companies are willing to pay people to work on in PG and what their users need.

It may turn out that your essentials and a lot of other users' essentials aren't exactly the same.

I worked on the JIT stuff because there was quite some concern about query speed (yes, from actual users). And after various other micro-optimizations, that was the right place to attack performance.

If you have strong feelings about what you want to be worked on, you'd be more than welcome to help. Alternatively, some of the dev companies around PG would certainly appreciate the business.


I totally upvote and agree with your comment. Thanks a lot for working on JIT; that's invaluable. We (as a community) will never thank the PG contributors enough for their work. I'd love to contribute to the postgres project, though I can't make it happen at work at the moment, unfortunately. And the cost of entry is quite high for what I can give in my free time. I'm very sorry about this. I know well that it's always easier to complain about things than to fix them. That doesn't mean there's absolutely no content in it. I hope my comment did not hurt you; that was not my intention.


- "Not distributable" This is something that's actively being worked on. It's also useful to look at the systems that are more easily distributable, and compare the other features they possess with Postgres. Aphyr has provided a wonderful service with Jepsen, showing just how difficult correctness in general (with and without the added complexity of distributability) actually is.

- "Poor resiliency" Do you mean something by this beyond as an aspect of distributability?

- "Hard to upgrade" The story here has been much improved, both with pg_upgrade and through features such as logical replication.

- "Schemas remains at postgres core" Would you elaborate what you mean by this?

I would also ask you to elaborate what you mean by "However it's also one of the most (if not the most) bloated." In terms of binary size? Feature set? Actual on-disk data representation? If so, what are you comparing it to?

Similarly, "It feels like things haven't changed that much since the 90's.": Having used Postgres over the years, I can certainly say that things have become much easier. What in particular are you thinking of?

I also struggle to square what appear to be complaints about capability with expressed frustration about "postgres bloat and overall complexity of the system." As with all things, there are tradeoffs and resource limitations. What in particular are you looking for? And given the nature of Postgres as an open source project, what are you doing to ensure that the features you desire are being worked on? There are plenty of developers and organizations that are willing to work on bespoke features given the appropriate support.

Jepsen: https://jepsen.io


Thanks for replying. Honestly, I expected a more closed and rude answer, because I know my opinion on this is quite unpopular. The context of my answer: I love PostgreSQL; I've been using it in production, day-to-day, at work and for personal projects. Still, I have also used more modern databases, and they fix some flaws postgres has (while having a lot of other flaws, but that's not the point). I was answering the question "what would make anybody choose another database given that Postgres can do pretty much anything?".

> This is something that's actively being worked on. It's also useful to look at the systems that are more easily distributable, and compare other features they posses with Postgres. Aphyr has provided a wonderful service with Jepsen showing just how difficult correctness in general (with and without the added complexity of distributability) actually is.

Thanks. I'll definitely have a look into this. I'm definitely not saying that everything would be easy and ideal to implement in such a robust and mature piece of software (I am even sure it would be a nightmare for the developers to build and stabilize, and that for this reason it's unlikely to ever happen). Other databases provide this out of the box; postgres just doesn't, and that's a good reason "why anyone would use anything else". Still, not a requirement for everybody, but more frequently needed than JIT, IMO.

> Do you mean something by this beyond as an aspect of distributability?

I mean that physical replication is great, but you always get into trouble when it comes to failover. Pretty much any modern DB system can be put behind a basic HTTP LB and replicate data on n nodes. It's indeed tightly coupled with distribution, but I meant distribution in terms of performance and sharding (horizontal scalability) and resiliency in terms of reliability. Relying on DNS for failover induces downtime; relying on a proxy is a hack. I'm wondering why it would be so hard to fix this in postgres, though.

> The story here has been much improved, both with pg_upgrade and though features such as logical replication.

This is definitely true. While I think logical replication could/should go even further to really fix the upgrade issue, PG 10 is a huge move in the right direction. Thanks for pointing that out.

> "Schemas remains at postgres core" Would you elaborate what you mean by this?

Maybe it's more a general feeling than something really concrete, but for instance, would you create a table with only one JSONB column? It would feel weird, like abusing/hacking around the schema. From the user perspective, it's impractical, not natural, and it doesn't look like a document store at all (as we see in NoSQL DBs). In this respect, I don't think I will ever use postgres as a document-only database because I feel like it's just not how postgres is built (unless I'm missing some great feature?). That's a subject where document-only databases may have a point.

> In terms of binary size? Feature set? Actual on-disk data representation? If so, what are you comparing it to?

I'm only speaking in terms of feature set. A lot of documentation comes with those features. A lot of side effects. A lot of wrong expectations from users. A lot of complexity. A lot of maintenance burden, which makes postgres harder to evolve on the big topics.

Also, this is a common problem across the most used relational DBMSs (MySQL, Oracle, SQL Server and Postgres). To give another perspective, Elasticsearch dropped a massive number of features in version 6 just so they could focus on the core things. I personally think it was a good move.

> Having used Postgres over the years, I can certainly say that things have become much easier. What in particular are you thinking of?

I have also used PG for years, and while I reckon you are completely right (a lot of improvements have been made in a tremendous number of places), I was pointing out that the core topics I listed are still such a pain to handle, while being so important since day 1.

> I also struggle to square what appears to be complaints of capability while also expressing frustration with "postgres bloat and overall complexity of the system." As with all things, there are tradeoffs and resource limitations. What would you in particular are you looking for?

Ok, let me fix this: I'm not complaining at all, really really not. The PG folks are doing a fantastic job; PG is a great piece of software, no doubt about it. I would even go further: in my opinion postgres is currently the best mature RDBMS on the market. I am just saying that postgres is not, and probably never will be, the answer to all the DB use-cases in the world. More than that: postgres is even failing on very basic stuff needed in almost every modern app (mostly because it was started at a time when HA was not really a concern). Put another way: some modern DBs are solving some modern problems better than postgres and might disqualify postgres for some of these use-cases. I did not mean anything else, believe me, and I apologize sincerely to all the developers who have worked on the project if my comment was taken that way.


You make a lot of mention of "modern databases" without naming any. I specifically referenced the Jepsen project, as it has repeatedly shown that many/most of the "modern databases"† (and those with more history) actually don't provide the features you describe without issue. Postgres is upfront about its current limitations rather than marketing itself as something it isn't. I'd ask you to be specific when calling out projects you prefer for your use cases so they can actually be compared. Handwaving serves no one well.

No system is the absolute best for every use case. No one is saying it is. But please be fair and lay all the cards on the table so we can all see and judge for ourselves.

† As an aside, this use of "modern database" sticks in my craw a bit. It's very imprecise. It seems you might be using it to mean anything non-relational or perhaps non-SQL. The so-called NoSQL database systems are often quite different even amongst themselves: lumping them together does a disservice to the individual implementations. Then again, perhaps I'm misinterpreting what you mean.


I did not mention any database behind the very imprecise term "modern database" because that was not the point. Those databases at least try to solve those real-world issues, in good or bad ways. Some people/companies have empirical proof that they work for their use-case in production, just like I do, and I don't think the conversation is specifically about how well a DB solves the issue when Postgres does not even try to solve it. And again, I'm not saying this is bad (well, some of these points have become so essential nowadays in most real-world applications that I think it's a very practical issue), but these are elements to take into account when opting for one DB or another. Sometimes it may disqualify postgres, other times it may not. In this respect, the original assertion (the one that made me react in the first place), “if PG gets better text search […] I don't see much reason to use anything else either,” just sounds very wrong to me. It depends on so many other (more important) things. At work, I have colleagues who are huge postgres fans to the point where they lose all critical sense about it. I think this situation is never good for engineers when it comes to making decisions seriously.

I also know that distributed systems don't come without their own issues and complexity (CAP mostly, but also distributed systems = more complex = more bugs, younger = less mature = more bugs, plus configuration issues and sharding issues). Some databases are very clear about those bugs and limitations (https://www.elastic.co/guide/en/elasticsearch/resiliency/cur...) and some aren't (I don't think MongoDB documents them). Behind "modern databases", I am thinking of Elasticsearch, CouchDB (they solved a lot of issues regarding scalability recently by merging BigCouch in v2), MongoDB (arguably one of the worst ways to address all the mentioned problems, but whatever), and DynamoDB, for the databases I've been using or am currently using in production. I've also played around with AWS Aurora (yes, AWS is forking MySQL and Postgres to solve those issues at the root, which is good proof that there is actually demand), and with more specific databases like InfluxDB or key-value stores like Consul's. They all have their own solutions and tradeoffs. But I'm not sure mentioning them is very relevant to the argument.


Honestly, the issue I have is that I don't care whether modern databases try to solve those problems; I only care whether they _actually_ solve them, instead of lying that they provide CAP guarantees and are distributed/infinitely scalable when, in fact, they aren't.


> for instance, would you create a table with only one JSONB column.

Sure. To make it more practical, I'd use a primary-key column too (like most document-only databases do).

We actually do use a (INTEGER PK, JSONB) table in Pg, for a variable-depth permission-model store. The permissions (per PK) are encoded into a JSON document and queried using a 'get json object at path' lookup. Turns out, the way json & jsonb operators and functions are designed in Pg, it feels exactly like a document store, just available in your SQL (which is a plus, in my book).
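A sketch of that kind of pattern (all names here are invented; the thread only claims an INTEGER PK plus a JSONB column):

    CREATE TABLE permissions (
        id  integer PRIMARY KEY,
        doc jsonb NOT NULL
    );

    -- the 'get json object at path' lookup mentioned above
    SELECT doc #> '{acl,reports,read}' FROM permissions WHERE id = 42;

    -- a GIN index makes containment (@>) queries over the documents fast
    CREATE INDEX permissions_doc_idx ON permissions USING gin (doc);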

I think the larger theme here is your feeling that this is abusing or hacking around "the schema", but really, what is "the schema"? I've seen people use MySQL as just a store, using plain, NULLABLE VARCHAR and INT columns, and enforce a "schema" at the application layer, serializing and de-serializing their data from the VARCHAR columns. I've seen projects reinvent MVCC using an INT 'version' column in their tables.

Databases can only allow defining and enforcing a schema at the database layer; they can't force it. Some databases allow a lot of control and many ways to define a schema, some allow little, but there's no requirement that one MUST use all the ways when available. In fact, even with rich databases like Postgres, sometimes even the db-level enforceable schema is not sufficient and one has to make do with enforcing some of it at application-level, and I'm talking about just the traditional data types here.

Using JSON columns and operators and functions is — at the end of the day — using just another data type. Given Postgres is an object-relational database (it IS, after all, post-INGRES), it's only natural to use Postgres with data types like JSON (and arrays, and ranges, and GIS).


> I don't think I will ever use postgres as a document-only database because I feel like it's just not how postgres is built (unless I'm missing some great feature?)

Partial jsonb indexing (with full-text), transactional fast joins on jsonb fields, and a higher Jepsen rating than most NoSQL DBs is pretty killer for me.
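A sketch of what the first two of those look like in practice (table and column names are made up):

    -- partial GIN index covering only one slice of the documents
    CREATE INDEX docs_active_idx ON docs USING gin (payload)
        WHERE payload @> '{"status": "active"}';

    -- full-text search over a jsonb field via an expression index
    CREATE INDEX docs_fts_idx ON docs
        USING gin (to_tsvector('english', payload ->> 'body'));

    SELECT id FROM docs
    WHERE to_tsvector('english', payload ->> 'body')
          @@ to_tsquery('english', 'jit & compilation');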


What are these amazing modern DBs you're speaking of that are kicking the pants off of Postgres that I seem to be unaware of?


I use PostgreSQL as a document db using Marten.


> And despite all these features nobody needs

Please do speak for yourself.


"Nobody needs" is to be taken in the unix-philosophy sense here. If you take it literally, you're part of the reason why PG has been missing these essential points for so long (IMHO). In comparison to the items I've listed, those features are nice to have but definitely not used by the majority of users. And I do use these advanced features (that nobody needs) in PostgreSQL. I use them extensively and really love them. It's just that I could work around them easily and properly. That's just not the case for the things I've listed.


Yes, the famous "People can work around our lack of features" unix philosophy.

What?


People are smart enough to install the software and the plugins they need. If only 0.1% of the users use a feature, it doesn't have to be maintained by the core contributors (and slow down the other 99.9% of users from getting important features and bugfixes). It's a tradeoff, and FWIW, while I think postgres is great, I also reckon the tradeoff could be improved.


There are plenty of cases where it's preferable for those things to be owned and maintained by core teams. One of the biggest benefits of the Linux kernel is that drivers are distributed and maintained as part of the core kernel. Likewise, one of the biggest disadvantages of not having this model is in software like Jenkins, where recently the creator of Jenkins explicitly called this out as one of their biggest problems. It's not so cut and dried to say which pieces of "added functionality" should be "core" or "not core".


So far nobody has been able to figure out what exactly it is that you listed.


Shame comments like this get downvoted. Because it's 100% true.

PostgreSQL still is lacking with its horizontal scalability story and there is no reason to choose it over something like Solr/ElasticSearch right now largely because of this.


You're the reason why I inherit projects that use only ES and then end up doing relational schemas in ES, because they didn't use a relational database at all.

If you had to choose between PostgreSQL or ElasticSearch then you have no idea what you're doing.


I recently interviewed at a startup that explicitly uses both, with PG being something of an authoritative backend store that feeds into ES as the primary frontend store (since the frontend is search-heavy).

It might be possible to get it to work without PG, but I imagine it would require a huge engineering effort (akin to an entire product on its own, which, IIRC, is what CrateDB is).

(Although I believe I understand your point, I fear that your tone and ad-hominem language detract from it.)


https://github.com/zombodb/zombodb works for me by turning Elasticsearch into an index type (PG10 only)


If it is 100% true then can you explain what he is referring to with "Poor resiliency" and "Schemas remains at postgres core"?

Also, PostgreSQL is not hard to upgrade; the only issue with PostgreSQL and upgrades is that it is hard to upgrade with zero downtime (pg_upgrade is very fast but still requires a small bit of downtime).
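For reference, the basic pg_upgrade invocation looks roughly like this (paths are illustrative; --link hard-links data files instead of copying them, which keeps the downtime window small):

    # stop the old cluster first, then:
    pg_upgrade \
        -b /usr/lib/postgresql/9.6/bin  -B /usr/lib/postgresql/10/bin \
        -d /var/lib/postgresql/9.6/main -D /var/lib/postgresql/10/main \
        --link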

I can agree with the complaint about distribution, but that is a common issue for most databases not built from the start around distribution and nothing I would call "fundamentals".


Here's a real-world example: I cannot use pg_upgrade because I have PostGIS installed, and you cannot use different versions of PostGIS when upgrading, but I cannot install the older version on pg10.

We had to resort to a slony hack (and temporarily put auto-increment columns on a lot of tables) because of that.

So he's totally right, the upgrade path is not easy.


Sorry if it was unclear. I think I explained precisely what I meant in my responses to grzm. I don't think there's any debate on the "poor resiliency" point for anybody who has already used a distributed database, or even just used postgres in production and experienced a slave promotion. For the "Schemas remains at postgres core" thing, it's probably not as straightforward, but once again, I really think anybody who has ever used a document-driven database can understand what I mean by this. You really don't use postgres the way you use a document-driven database, because it's impractical. (More generally, this point could be completed with a list of all the features that are easy to handle in pretty much any non-19xx database and that are a nightmare in postgres (users, permissions, indices creations/deletion, …).)

threeseed has summed my thoughts up. Solr/ES are solving issues that postgres just doesn't address at all at the moment, and it's definitely not only about text search (which is also a point). Saying that postgres should be/can be used for any use-case apart from TS is absolute nonsense, and that's what made me react in the first place. Moreover, the feeling that the community doesn't largely agree even on the fact that postgres doesn't meet these four points is a bit appalling. OK, saying that postgres should try to solve them is another debate, but let's just agree on the fact that if you absolutely need a very resilient OR distributed OR document-based OR easy-to-upgrade DBMS (although for this last point, I have to agree that things have changed recently, partly thanks to the Uber gate I think), postgresql is probably not a good choice. There's definitely a HUGE trade-off, and we should agree on this. Postgres doesn't come even close to Solr/ES in terms of horizontal scalability. I'd really like it to, but as of today it can't.


> a nightmare in postgres (users, permissions, indices creations/deletion, …).

Could you expand on what you mean here? By 'users, permissions', I assume you meant management of users and access control to records at the database layer. If so, I should tell you I use database-enforced roles to manage access to various areas of the schema I manage, and RLS to protect access to records. If not, I'd very much like to know what exactly you meant.
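To make that concrete, a minimal sketch of roles plus RLS (schema, table, and setting names are invented):

    -- database-enforced roles scoping access to an area of the schema
    CREATE ROLE reporting NOLOGIN;
    GRANT USAGE ON SCHEMA reports TO reporting;
    GRANT SELECT ON ALL TABLES IN SCHEMA reports TO reporting;

    -- row-level security: sessions only see rows for their own tenant
    ALTER TABLE reports.invoices ENABLE ROW LEVEL SECURITY;
    CREATE POLICY tenant_isolation ON reports.invoices
        USING (tenant_id = current_setting('app.tenant_id')::integer);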

Could you expand a bit on what exactly you mean by "indices creations/deletion"?

> things have changed recently, partly thanks to the Uber gate I think

Logical replication in core was in the works long before Uber's post, and had already been available as a plugin for at least two releases at the time. I remember, because we ourselves used it (for syncing data between environments, not for upgrades) on 9.6 (the then-latest release).


I have benchmarked a single Postgres node performing comparably to a six-node Cassandra cluster on time-series data (using TimescaleDB). The more I’ve had to deal with the limitations of other databases, the more I’ve come to think that people should just start with Postgres and only pick something else if Postgres doesn’t work out.


> I’ve come to think that people should just start with Postgres and only pick something else if Postgres doesn’t work out

Even that standard results in people giving up on PG too early. It's possible to get it to continue working out, but it can require effort beyond a naive implementation, and often more hardware knowledge than is common anymore.

The other options can make naive implementations much easier, at least initially, and on almost any hardware. Of course, that usually ends up costing far more, even at modest scale.


Sometimes the issue with a comment isn't its truth or lack thereof, but its tone or potential for inciting flamewars (a.k.a. trollishness). With a brand-new account, the intent is even harder to discern.

> PostgreSQL still is lacking with its horizontal scalability story

To the extent this is true, it may still be irrelevant for many environments. The ability to increase performance by just adding more servers (which I believe is usually what's meant by "horizontal scalability") is not, despite its popularity, a desirable end in and of itself.

It's only desirable as a means of scalability, if scaling a "single" server (a.k.a. vertical) would reach the limit of available hardware, or if even approaching that limit becomes prohibitively expensive.

My impression from talking to managers (including startup founders) is that they often erroneously believe they fall into the latter category. Specifically, they believe that the overhead inherent in distributed systems is essentially negligible, so that they're avoiding the 2-4x price premium of a high-end single system. The reality is not so negligible, especially considering inefficiencies unique to a particular distributed database, regardless of whether they're inescapable due to its nature.

The former can, of course, be a real concern, especially for VC-funded startups, for whom a lack of fast enough growth is its own failure. However, even in that situation, there's reason for skepticism, since most of the original "web scale" horror stories tended to involve MySQL (which had, and maybe still has, reduced capabilities for performance improvements through hardware) and database servers that hadn't even maxed out their I/O capabilities (not trivial to do naively pre-SSD).

> there is no reason to choose it over something like Solr/ElasticSearch right now largely because of this.

Even the OC recognizes [1] that these distributed systems aren't without their drawbacks. Those drawbacks are additional reasons to choose Postgres over them.

[1] https://news.ycombinator.com/item?id=17964732


Distribution is also very application-specific, in terms of how you distribute. If you're going multi-tenant, Citus (article source) works beautifully.


Honest and naive question:

What would be the difficulty in having these in PG, apart from "getting developers interested in implementing it"? (I.e., is there some current architectural decision in PG that prevents it, as is the case for "why vanilla PG is not suitable as a column store"?)


My understanding is that GIN/GiST indexes do not store enough information to use better scoring, and RUM indexes could change that:

https://github.com/postgrespro/rum
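Usage, going by the extension's README (table and column names here are mine):

    CREATE EXTENSION rum;
    CREATE INDEX docs_rum_idx ON docs USING rum (tsv rum_tsvector_ops);

    -- ranked retrieval can come straight off the index via the
    -- distance operator, instead of re-scoring every matching row
    SELECT id FROM docs
    WHERE tsv @@ to_tsquery('english', 'jit & compilation')
    ORDER BY tsv <=> to_tsquery('english', 'jit & compilation')
    LIMIT 10;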


Exactly this. I wish they would update the relevance algorithms to use TF-IDF and BM25. We've been praying for this for a long time... and I'm not really sure why they haven't been implemented.


My guess would be that no developer with the necessary skill set has either been interested or been paid to do it.

It is an open source project at the end of the day, which means the features that get added are those wanted by either the developers or the sponsors, not the general user base.


Actually, this might be a good idea for a Kickstarter, if anyone with relevant experience is around :) I bet it would not be a problem to raise 200-300K for this.


I would definitely contribute.

BTW - elastic.co is just going for an IPO. There's a lot of money in search!


Yeah, for whatever advantages Elasticsearch might have, I'd rather not read a 500-page book and then have to deal with dozens of potential security issues. I'd gladly take slightly less relevant search results in exchange for getting to use the same permissions system I use for all my other queries.


You can use Elasticsearch within Postgres by using zombodb.


I doubt it, unfortunately: the FTS operators are too heavyweight to be inlined (or to benefit from inlining), so JIT isn't likely to help a ton.

Parallelism can help, however, depending on what your bottleneck is (if the query produces a bitmap index scan, the heap-access part can be parallelized).


Very exciting stuff happening in the PG world. One of the items I'm watching is support for foreign keys in arrays - https://commitfest.postgresql.org/17/1252/

The author is currently stuck, and I hope someone with the right knowledge can help him out so he can continue the work.


Oh damn that'd be great. Similarly, foreign keys in json(b).


why? so that you can simulate nosql hacks in an rdbms...?


Sometimes joins (and left joins) can be much more expensive than if the foreign keys are in a single indexable column.


Reading through the release notes and some not-so-recent mailing list threads, it looks like CTE optimization fences will not be removed in PostgreSQL 11.

This makes me very sad.
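For anyone unfamiliar with the fence, a small illustration (table name invented):

    -- In PG11 and earlier, a CTE is always materialized: the outer WHERE
    -- is not pushed down into it, so big_table is scanned in full.
    WITH recent AS (SELECT * FROM big_table)
    SELECT * FROM recent WHERE id = 42;

    -- The equivalent plain subquery lets the planner inline it
    -- and use an index on id.
    SELECT * FROM (SELECT * FROM big_table) AS recent WHERE id = 42;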


There is a work-in-progress patch (https://commitfest.postgresql.org/19/1734/), which you can help out with by reviewing and testing it.


I wasn't aware this was being worked on. This is great news. Best-case scenario, though, this would come in 12?


Right.


Oh beautiful, thanks! I'm really looking forward to it.


PostgreSQL, for all its great points, has the worst hinting strategy in the world.

There's no hinting because the planner is so smart, and yet there is mandatory hinting on a legitimately useful language construct that in some cases is the only way to do something (e.g. recursive queries).

Just a very weird philosophy.



