I love checking this site to see the cool stuff that makes its way into postgres each year. Fantastic work to all the postgres committers, cheers all around.
> Fantastic work to all the postgres committers, cheers all around.
Not to forget the postgres contributors: only a portion of the people who write code for postgres are 'committers'. I do agree that the committers do amazing work and play an essential role in postgres' overall ecosystem, but we should not forget that a not insignificant part of postgres' code, code reviews and bugfixes is provided by non-committers.
Very valuable. Do other DBMSes have this feature? It’s quite incredible how much you can put the database to work. So much less application code to write.
They found that only Oracle supported the CYCLE keyword; DB2, SQL Server, and Postgres (version 9 at the time) didn't. SQL Server now has a graph extension, so that might have changed. MySQL introduced support for WITH RECURSIVE in 2017, but it does not support the CYCLE keyword.
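To make the discussion concrete, here is a sketch of the problem CYCLE solves, using SQLite from Python as a stand-in (the `edges` table and its data are hypothetical). Without cycle detection, a recursive query over a graph containing a loop would recurse forever; the SQL-standard CYCLE clause (in Postgres 14, roughly `CYCLE node SET is_cycle USING cyc_path`) automates the manual path-tracking shown here:

```python
import sqlite3

# Hypothetical directed graph containing a cycle: a -> b -> c -> a.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE edges (src TEXT, dst TEXT);
    INSERT INTO edges VALUES ('a','b'), ('b','c'), ('c','a');
""")

# Manual cycle check: keep the path walked so far in a delimited
# string, and refuse to revisit a node already on the path. This is
# what the CYCLE clause generates for you.
rows = conn.execute("""
    WITH RECURSIVE walk(node, path) AS (
        SELECT 'a', ',a,'
        UNION ALL
        SELECT e.dst, w.path || e.dst || ','
        FROM walk w JOIN edges e ON e.src = w.node
        WHERE w.path NOT LIKE '%,' || e.dst || ',%'
    )
    SELECT node FROM walk ORDER BY node
""").fetchall()
print([r[0] for r in rows])  # ['a', 'b', 'c'] -- terminates despite the cycle
```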
This is a really cool feature being added to Postgres 14. Each version brings one or more features that enable more Oracle-to-PostgreSQL migrations. Move to Open Source :)
The world has just started moving from Oracle to PostgreSQL, and more and more migrations are expected. However, some commercial vendors are trying to sidetrack the crowd by creating vendor lock-in, claiming they provide enterprise features on Postgres through their commercial forks.
But is PostgreSQL not already enterprise grade?
I hope the world understands the capabilities of community (open source) PostgreSQL and does not get carried away by some of the commercial vendors.
Cool feature. However, I'm always wary of anything that encourages developers to push more of the business logic from the code towards database queries.
Of course you have to balance that out with the cost/time of retrieving a possibly large amount of data for that computation, and computing locally, which is what these SEARCH and CYCLE features aim at preventing.
Still, I'd think twice before jumping into using them, to ensure that making the queries more complex does bring enough bang for the buck.
> the cost/time of retrieving a possibly large amount of data for that computation
In this case, doing it in application space is really problematic because you can't formulate the next query until you see the result of the first, so your breadth- or depth-first search ends up iterating over and over with sequential queries. While there are many cases where the choice is technically neutral and more a matter of design aesthetics or principle, this one solves something that simply cannot be done sensibly in application space (at least not without large and complex caching logic).
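A small sketch of the chattiness being described, using SQLite from Python with a hypothetical parent/child table: the client-side traversal needs one round trip per level (each query depends on the previous result set), while the recursive CTE resolves the whole traversal in a single query:

```python
import sqlite3

# Hypothetical adjacency table; the tree is four levels deep: 1 -> 2 -> 4 -> 5.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tree (id INTEGER, parent INTEGER);
    INSERT INTO tree VALUES (1, NULL), (2, 1), (3, 1), (4, 2), (5, 4);
""")

# Client-side BFS: one round trip per level.
frontier, seen, round_trips = [1], {1}, 0
while frontier:
    marks = ",".join("?" * len(frontier))
    rows = conn.execute(
        f"SELECT id FROM tree WHERE parent IN ({marks})", frontier
    ).fetchall()
    round_trips += 1
    frontier = [r[0] for r in rows if r[0] not in seen]
    seen.update(frontier)
print(round_trips)  # 4 queries for a 4-level-deep tree

# Versus a single recursive query evaluated inside the database:
descendants = conn.execute("""
    WITH RECURSIVE sub(id) AS (
        SELECT 1
        UNION ALL
        SELECT t.id FROM tree t JOIN sub s ON t.parent = s.id
    )
    SELECT id FROM sub
""").fetchall()
print(len(descendants))  # 5 nodes, one round trip
```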
Scaling. You can autoscale app servers by orders of magnitude, no sweat. Database replicas require much more logistical overhead and provide diminishing returns. The less demand your app places on a database overall, the more scalable it will be.
(Of course, running a whole lot of simple queries can be more expensive in the grand scheme of things than one big query, so there's no one-size-fits-all solution to that principle.)
One thing I saw at a company I worked for, although it's tangential to technical concerns and more of an anecdote, was a political one.
A small team of sysadmins was maintaining a set of very large queries that had grown organically over time, to the point that some of those queries were 20,000 lines.
As you know, database queries don't scale as gracefully in size as code does, because there are no modules, namespaces, classes, and so on. It's easier to distribute different parts of an application to different people/teams than it is to distribute parts of a database query.
Only those sysadmins knew the queries' ins and outs and could evolve them to meet ever-changing business needs. They had an internal monopoly, which gave them a lot of political power over the organization and became a problem over time.
Sounds interesting. Recently, I have been looking for a quick approach to make a clone of a production database for use in staging.
The idea is to avoid failing migrations by applying them against a thin clone of production, as migrations somehow fail in production but work fine in staging. Is anyone aware of a quick way to make a clone?
Currently, copying a 0.5 TB db takes 3 hours, while I'm dreaming of something like minutes.
There is also interesting work around using filesystem volume snapshots or clones for database cloning, but that requires infrastructural changes… I think there was a startup around that but I can’t find the name :)
Is it just me or is this a severe case of "if you only have a database, everything looks like a SQL query"? Don't get me wrong, I think it's incredible that Postgres is flexible enough to do this, but there are better data structures for graphs than tables.
If you need to do this at relatively small scale, just load the data into memory and don't bother with doing it in the database. If you need to do this at large scale, get a dedicated graph database. I guess I just don't see the niche where doing it in Postgres fits in. Anyone with a real world example? I'd love to be proven wrong here since it's definitely cool.
It's all well and good to sit there and say "get a dedicated graph database".
But there are two problems with that:
(a) In many cases, running a graph database will require getting involved with the monstrosity that is managing Java (since that is what most of them are written in). Managing Java is all well and good if you (or your team) have experience. For mere mortals it's a nightmare to manage.
(b) In the case of pretty much all graph databases, you only get basic features in the free / open-source version. If you want access to any genuinely useful features, you have to pay for "enterprise" with the associated $$$$$$$$$$.
Meanwhile Postgres gives everyone a solid database, with all the features included out of the box, no pay to play.
So what if you need to - shock horror - actually learn how a database works and write some proper SQL queries. The reason so many people have horror stories about SQL databases is because they let some inept developers loose on their RDBMS who treat a database like a black box dumping ground with zero thought about table design, index design, SQL queries or anything else.
Not to mention that dumping "big data" amounts of stuff into any system and running complex search and sort computations on it is still a hard problem in complexity-theoretic terms.
SQL or dedicated graph DB won't save anyone from coughing up the resources.
Sure, it's easy to look at SQL and see the big ugly hundred line queries, but they would be hundred line code for graph DBs too.
It's not a big surprise that FB for example for a few years just treated everything that stored data as a fast NAND store with some compaction algorithm on top plus a fancy API. SQL is that. (There was a great talk about this but I haven't managed to find it :/ )
I don't mind writing SQL queries, quite the opposite in fact. In all my jobs so far I've been the go-to guy for rewriting queries to be more efficient. Don't know what you are ranting at tbh.
It's just that the UNION ALL trick to fetch all related rows and then join them seems like it would need to rebuild (the relevant part of) the graph for each query and that simply "looks inefficient" compared to keeping all data in graph form all the time. If you are at a scale that you don't have to worry yet about CPU efficiency in your database layer, by all means just stick it in Postgres.
If you have a mix of "normal" relational data and some hierarchical data then it absolutely makes sense to stick with a relational database.
And why use a full-blown graph database if you just want to model (and query) a strictly hierarchical structure? E.g. categories that can contain categories, or similar things.
And it's part of the SQL standard, so it makes sense to add it to be more compliant with it.
It's a case of "I want to keep all my data in the same place", because nobody has the attention span to manage a relational database, a document database, a hierarchical database, and a key-value database (did I miss any?). All that while you lose atomicity, because your data is now distributed.
Also, data often changes in form, and you don't want to rewrite everything to get the data from a different place just because its type changed.
I've heard it said of Python that it's the second best language for doing anything. For one specific task, there's probably something more tailored to the purpose. For integrating lots of different tasks into a single project, there's nothing that's better at doing everything pretty well.
I see PostgreSQL the same way (although I don't think there's a better RDBMS). It's not the best document database, but it's pretty good. It's not the best key-value store, but it's pretty good. It's not the best graph database, but it's pretty good. And here "pretty good" means "enough that you can build large, complex projects on top of it and expect that it will always work and do the right thing".
Also, the converse is not true: you can't replace PostgreSQL with a K-V store, or a document database, or a graph database. It has a million convenient features that nothing outside of another ACID RDBMS could fully support without massive engineering efforts. Unless I know beyond the shadow of doubt that I'd need, say, a dedicated document database for a specific reason, I'd infinitely rather start on PostgreSQL and use it until we outgrow it, than start with something like MongoDB and find out that we actually need some ACID or relational features that don't exist in Mongo.
I like thinking about the best tools for the job, or "jobs to be done" as Clay Christensen would put it. But why wouldn't you use MongoDB for things that require ACID? MongoDB has had cross-collection ACID transactions since 4.0 (years ago) and in-collection transactions (less useful, of course) since a long time before that.
PostgreSQL is indeed a pretty awesome database. That said, the JSON support is just so very painful. It's nice that it supports indexes on JSON, though, once you get through the syntax.
Thanks for your post.
> Jepsen evaluated MongoDB version 4.2.6, and found that even at the strongest levels of read and write concern, it failed to preserve snapshot isolation. Instead, Jepsen observed read skew, cyclic information flow, duplicate writes, and internal consistency violations.
I would not use that where data integrity is mission critical.
2020-05-26: MongoDB identified a bug in the transaction retry mechanism which they believe was responsible for the anomalies observed in this report; a patch is scheduled for 4.2.8.
Lots of databases have transaction bugs for which they hurriedly rush out patches: MySQL, Oracle, SQL Server, even cloud databases like DynamoDB, CosmosDB, and Aurora. Looks like MongoDB fixed this one.
There are a few reasons why doing things like this in the RDBMS is a good thing for many use cases.
First, there is a ton of already built tooling for dealing with an RDBMS. This alone is why I think every project should default to using an RDBMS and only move if a good case can be made.
Second, like it or not, SQL is the de facto language for querying the data. After 20+ years of writing software, I've used many languages, but SQL (though of course it has evolved) has been a constant across RDBMSes. I've seen many service layers and front ends change over the years, all while sitting on the same RDBMS. Given the speed at which tech moves, it's pretty amazing when you think about it.
Finally, for things like graphs or JSON, it's often the case that I only need to handle them in addition to my existing relational datastore. In some cases it might make sense to stand up an entire system for the use case (like a Solr server for document search), but many times I just need to store a bit of JSON, or I have one or two graphs that need to be represented. In those cases, standing up an entirely separate system adds unneeded complexity.
> I've seen many service layers and front ends change over the years, all while sitting on the same RDBMS
This.
And the youngsters never believe me, when I tell them that the application that they are writing is not the only one and will not be the last one to access the database.
My standard line for decades has been that the database is more important than the program. You can reconstruct how the program works from the database; you can't reconstruct the database from how the program works.
A business might be interrupted if their programs accessing the database break. A business will _die_ if their database is lost.
I'm currently staring at a PostgreSQL database created by a Django app with a lot of inheritance between models.
My favorite /s is the table with a single column containing the id of the record itself.
It's basically impossible to start from the database and understand what's going on. It will be a huge pain to interface to this database from something else.
On the other side, I'm maintaining another Django app for the same customer, with saner tables (no inheritance) and that's on par with other ORMs.
Never use inheritance at all?
I was thinking you could reconstruct part of the app stack by observing and clustering the requests and transactions. But then some people think JOIN is the devil, and you'd be back to square one. In a perfect world, every app would have only views over what it needs, plus prepared statements. But then CI/CD and all kinds of tests can become very painful...
I don't make friends as easily by hating on technology, but I do make friends easily on kindness. Hi! I learned a lot, great and not great, working with DRF. I kind of took it and ran with it on TypeScript and Node, and I'm working on a much better HTTP interface for that platform.
Reminds me of the Fred Brooks quote, "Show me your flowchart and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won't usually need your flowchart; it'll be obvious."
Diving more into postgres recently, I'm kind of surprised there isn't the kind of tooling I know from (frontend) development. I would have expected a number of tools for: version control, CI/CD, (integration) testing, performance monitoring, safe migrations, developing stored procedures (in VS Code, with hot reload), data shuffling (snapshots, rollbacks...). Both open source and startups, SaaS services with online dashboards, etc.
There are some ok options in these categories but nothing like the explosion of open source tooling out of bigcos and devtools/SaaS startups you see elsewhere.
There's... Aurora, Prisma, some pg extensions to get performance info?
Absolutely! I evaluated a bunch of migration management tools recently and was very disappointed. I needed something to develop and version stored procedures and just didn't find much. Ended up using graphile-migrate which has been great but also tightly bound to other parts of my stack. Would have expected more options from a 30+ year old technology.
I use recursive queries like these at work for hierarchical accounts/groups/locations/equipment/etc., where each level can have permissions that cascade down or not. Loading the hundreds of rows into memory necessary to calculate what any particular user can see is very costly in terms of network time, particularly for deeply nested stuff. It's much easier to do the legwork in SQL and return the relatively tiny result set. There are certainly other tools that could handle these relationships more gracefully, but they also come with questions of code integration time, backups, redundancy, scaling, etc., which Postgres already has good answers for.
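A hedged sketch of the cascading-permissions pattern described above (SQLite from Python; the schema and data are hypothetical): each node either overrides or inherits its parent's permission, and the effective values are resolved inside the database, so only the small final result set crosses the wire:

```python
import sqlite3

# Hypothetical hierarchy; perm NULL means "inherit from parent".
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (id INTEGER, parent INTEGER, perm TEXT);
    INSERT INTO node VALUES
        (1, NULL, 'read'),   -- root grants read
        (2, 1,    NULL),     -- inherits read
        (3, 2,    'none'),   -- override: no access
        (4, 3,    NULL);     -- inherits none
""")

# Walk down from the root, carrying the inherited permission and
# letting any non-NULL perm override it.
effective = conn.execute("""
    WITH RECURSIVE eff(id, perm) AS (
        SELECT id, perm FROM node WHERE parent IS NULL
        UNION ALL
        SELECT n.id, COALESCE(n.perm, e.perm)
        FROM node n JOIN eff e ON n.parent = e.id
    )
    SELECT id, perm FROM eff ORDER BY id
""").fetchall()
print(effective)  # [(1, 'read'), (2, 'read'), (3, 'none'), (4, 'none')]
```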
Thing is, if the primary source of truth is in the database, often it's beneficial to do the computation as close to the data as possible, as the bottleneck is likely to be I/O, not computation.
"just load the data into memory" requires the database to provide to the client all the data that might be needed for that algorithm, which is more load on the DB server than scanning the (much smaller) subset of the data that actually gets traversed during the algorithm directly on the database itself and providing just the results, as that requires less I/O. That's the reason why it's often more efficient (even if we care only about the server-side load) to do filtering, aggregation and summarization for tabular data in the DB instead of loading all the data into a client and calculating it there, and the same applies for graph data.
At my company we used Neo4j; that sucked and it was too expensive. So we moved to OrientDB; that also sucked. So we moved the graph database to Postgres, like all our other databases, and it works great now.
As it turns out, Postgres is a very effective/efficient graph database, and is used as such quite often.
But to address a more core issue:
> if you only have a database, everything looks like a SQL query
Postgres, here, is just implementing SQL standard features. And SQL is supposed to be the be-all end-all Standard Query Language.
SQL makes no claims about being suited only for querying relational databases. (It uses a lot of relational terminology, but that’s because relational algebra is the strict superset of other querying models, in the same sense that a Turing machine is the strict superset of other automata. If you want a fully-general query language that can be used to query anything, it’s going to need to have relational algebra somewhere in it, at least for the cases where your query is about relations and can’t be expressed in any more compact/abstract way.)
Graph databases could—and probably should!—expose SQL-based query parsers for their querying engine, along with the bevy of other more graph-focused languages they support. (Yes, SQL isn’t as efficient a syntax for making graph queries as, e.g., Datalog, but SQL can still express said queries; and a query parser can turn those SQL-expressed graph queries into the very same query plans it’d build from a Datalog query.)
And, if you want to be able to query a graph database using SQL, you need features like this in SQL. Or rather, to reverse that: given that the SQL committee wants SQL to be “the” standard for querying, they’re going to add features like this to SQL to ensure that it is a viable query language for things like graph databases to expose.
That being said, you can think of this move by PostgreSQL less as “Postgres trying to be a graph database” and more as “Postgres ensuring that, when porting your standard SQL query—which you may have very well written originally to talk to an SQL-speaking graph database—that you don’t have to rewrite the query, but can continue to use standard SQL features.” Think of it like a “compatibility feature” allowing Postgres to “host” this foreign model.
(Which is not to say Postgres is not good at being a graph database. This “compatibility” is first-class! As I mentioned above.)
—————
P.S.: The deeper issue is that nobody other than database maintainers is really aware of what exactly is in the SQL standard; probably because it’s a proprietary document that you have to pay for access to.
On top of that, there’s also no secondary source that lays out what the SQL standard defines, vs. what various DBMSes actually implement. There’s no CanIUse.com for SQL features.
Really, there’s not even anybody out there making blog-posts about newly-standardized not-yet-universally-supported SQL features, the way people make blog-posts about newly-standardized not-yet-universally-supported JS features.
Instead, all most people know about SQL is the particular dialect their DBMS of choice implements; and they only become aware that certain features are in SQL at all, when their DBMS-of-choice implements those features.
It’s all faintly ridiculous, and reeks of the siloed enterprise distribution model of the 1980s. (Remember back before POSIX was a thing, when shell scripts were often written to assume the flavour of Unix the writer was familiar with—usually the one created+maintained by the vendor of their mainframes and workstations? Somehow, despite having the POSIX equivalent—the SQL standard—we’re still stuck in that culture in database land.)
I'm glad to be proven wrong about this! It's not a site I or any of my other heavily database-engineering-focused colleagues were aware of. It seems to be rather new (~2015), which might be why.
It seems to just be one fellow's blog, though, addressing only a haphazard collection of topics so far as he feels like writing about them. Whereas CanIUse.com (despite also mostly being the work of a single author) is more designed as a complete reference work, and is a FOSS project open to contribution (https://github.com/fyrd/caniuse). So, not quite the same.
> Is it just me or is this a severe case of "if you only have a database, everything looks like a SQL query"?
Honestly I’m starting to feel like if it looks like tool flexibility is a hammer, all you see is a tool belt. Databases should be able to do search. If you need to pipe tools (and likely duplicate storage), probably one of your tools is dropping the ball or you’re being reflexive about your tools.
Yeah i see value but tend to agree it’s catering to an interesting segment that i don’t understand either.
More generally, I get concerned when too much functionality ends up in the db as it’s often the hardest thing to scale, so I’d prefer to spread those cycles out elsewhere. I feel this is likely to be something someone relies on and causes a production db to get pretty warm at a bad time.
These iterative queries convert hierarchical (not graph) data into relational data. That means queries can perform aggregations and joins against other relations, which is obviously useful in a relational database.
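For illustration, a small sketch of that point (SQLite from Python; the schema and data are hypothetical): the recursive part flattens a category hierarchy into ordinary rows, which can then be joined against a separate products relation and aggregated like any other table:

```python
import sqlite3

# Hypothetical category tree plus an ordinary relational table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE category (id INTEGER, parent INTEGER, name TEXT);
    CREATE TABLE product  (id INTEGER, category INTEGER, price REAL);
    INSERT INTO category VALUES (1, NULL, 'all'), (2, 1, 'books'), (3, 2, 'sci-fi');
    INSERT INTO product  VALUES (10, 2, 12.0), (11, 3, 8.0), (12, 3, 5.0);
""")

# Flatten the subtree rooted at 'books' into rows, then join and
# aggregate against the product relation in the same statement.
total = conn.execute("""
    WITH RECURSIVE sub(id) AS (
        SELECT 2
        UNION ALL
        SELECT c.id FROM category c JOIN sub s ON c.parent = s.id
    )
    SELECT COUNT(*), SUM(p.price)
    FROM product p JOIN sub s ON p.category = s.id
""").fetchone()
print(total)  # (3, 25.0): three products, total price across the subtree
```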
I'm always so impressed by Postgres' features and performance. I am unfortunately not as impressed with Postgres' operational model. As a disclaimer, I'm not highly familiar with very many databases, but it seems like Postgres sticks out as poorly suited for robustly automated infrastructure.
Maybe I'm totally wrong on this, looking forward to reading other perspectives.
edit: If you’re going to downvote my post, can you at least reply and tell me why? I’m asking questions to learn things outside my own perspective.
I didn't down vote you, but I suppose that if you are looking for others to respond with why, you could start by explaining why you have that perspective yourself. Else, what is anyone to respond to?
This feature looks really cool