
Is it just me or is this a severe case of "if you only have a database, everything looks like a SQL query"? Don't get me wrong, I think it's incredible that Postgres is flexible enough to do this, but there are better data structures for graphs than tables.

If you need to do this at relatively small scale, just load the data into memory and don't bother with doing it in the database. If you need to do this at large scale, get a dedicated graph database. I guess I just don't see the niche where doing it in Postgres fits in. Anyone with a real world example? I'd love to be proven wrong here since it's definitely cool.




It's all well and good to sit there and say "get a dedicated graph database".

But there are two problems with that:

(a) In many cases, running a graph database will require getting involved with the monstrosity that is managing Java (since that is what most of them are written in). Managing Java is all well and good if you (or your team) have experience. For mere mortals it's a nightmare to manage.

(b) In the case of pretty much all graph databases, you only get basic features in the free / open-source version. If you want access to any genuinely useful features, you have to pay for "enterprise" with the associated $$$$$$$$$$.

Meanwhile Postgres gives everyone a solid database, with all the features included out of the box, no pay to play.

So what if you need to - shock horror - actually learn how a database works and write some proper SQL queries. The reason so many people have horror stories about SQL databases is that they let inept developers loose on their RDBMS, developers who treat the database like a black-box dumping ground with zero thought about table design, index design, SQL queries or anything else.

Rant over. ;-)


Not to mention that dumping "big data" amounts of stuff into any system and running complex search and sort computations on it is still a computationally hard problem.

SQL or dedicated graph DB won't save anyone from coughing up the resources.

Sure, it's easy to look at SQL and see the big ugly hundred-line queries, but they would be hundred-line queries for graph DBs too.

It's no big surprise that FB, for example, spent a few years treating everything that stored data as a fast NAND store with a compaction algorithm on top plus a fancy API. SQL is that. (There was a great talk about this but I haven't managed to find it :/ )


I don't mind writing SQL queries, quite the opposite in fact. In all my jobs so far I've been the go-to guy for rewriting queries to be more efficient. Don't know what you are ranting at tbh.

It's just that the UNION ALL trick to fetch all related rows and then join them seems like it would need to rebuild (the relevant part of) the graph for each query, and that simply "looks inefficient" compared to keeping all data in graph form all the time. If you are at a scale where you don't yet have to worry about CPU efficiency in your database layer, by all means just stick it in Postgres.
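
For concreteness, a minimal sketch of the pattern in question, assuming a hypothetical edges(src, dst) table; the result set can then be joined back against whatever tables hold the actual row data:

    WITH RECURSIVE reachable AS (
        SELECT dst FROM edges WHERE src = 1    -- direct neighbours of node 1
        UNION ALL
        SELECT e.dst
        FROM edges e
        JOIN reachable r ON e.src = r.dst      -- walk one step further per iteration
    )
    SELECT DISTINCT dst FROM reachable;

    -- caveat: UNION ALL only terminates on acyclic data; for graphs with
    -- cycles, use UNION (which discards duplicate rows) instead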


If you have a mix of "normal" relational data and some hierarchical data then it absolutely makes sense to stick with a relational database.

And why use a full-blown graph database if you just want to model (and query) a strictly hierarchical structure? e.g. categories that can contain categories or similar things.

And it's part of the SQL standard, so adding it makes Postgres more compliant with the standard.
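
A minimal sketch, assuming a hypothetical categories(id, parent_id, name) table:

    WITH RECURSIVE tree AS (
        SELECT id, name, name::text AS path    -- cast so both branches agree on the type
        FROM categories
        WHERE parent_id IS NULL                -- start from the root categories
        UNION ALL
        SELECT c.id, c.name, t.path || ' > ' || c.name
        FROM categories c
        JOIN tree t ON c.parent_id = t.id      -- descend one level per iteration
    )
    SELECT path FROM tree ORDER BY path;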


It's a case of "I want to keep all my data in the same place", because nobody has the attention span to manage a relational database, a document database, a hierarchical database, and a key-value database (did I miss any?). All that while you lose atomicity, because your data is now distributed.

Also, data often changes in form, and you don't want to rewrite everything to get the data from a different place just because its type changed.


I've heard it said of Python that it's the second best language for doing anything. For one specific task, there's probably something more tailored to the purpose. For integrating lots of different tasks into a single project, there's nothing that's better at doing everything pretty well.

I see PostgreSQL the same way (although I don't think there's a better RDBMS). It's not the best document database, but it's pretty good. It's not the best key-value store, but it's pretty good. It's not the best graph database, but it's pretty good. And here "pretty good" means "enough that you can build large, complex projects on top of it and expect that it will always work and do the right thing".

Also, the converse is not true: you can't replace PostgreSQL with a K-V store, or a document database, or a graph database. It has a million convenient features that nothing outside of another ACID RDBMS could fully support without massive engineering effort. Unless I know beyond a shadow of a doubt that I'd need, say, a dedicated document database for a specific reason, I'd infinitely rather start on PostgreSQL and use it until we outgrow it than start with something like MongoDB and find out that we actually need some ACID or relational features that don't exist in Mongo.


I like thinking about the best tools for the job, or "jobs to be done" as Clay Christensen would put it. But why wouldn't you use MongoDB for things that require ACID? MongoDB has had cross-collection ACID transactions since 4.0 (years ago) and in-collection transactions (less useful, of course) since a long time before that. PostgreSQL is indeed a pretty awesome database. That said, the JSON support is just so very painful. It's nice that it supports indexes on JSON, though, once you get through the syntax. Thanks for your post.
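
For the curious, that syntax looks roughly like this (a sketch against a hypothetical events table with a jsonb payload column):

    -- a GIN index accelerates containment lookups on a jsonb column
    CREATE INDEX idx_events_payload ON events USING GIN (payload);

    -- find events whose payload contains {"user": {"id": 7}}
    SELECT * FROM events WHERE payload @> '{"user": {"id": 7}}';

    -- pull a nested field out as text
    SELECT payload #>> '{user,name}' FROM events;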


According to Jepsen[0], as of 9 months ago:

> Jepsen evaluated MongoDB version 4.2.6, and found that even at the strongest levels of read and write concern, it failed to preserve snapshot isolation. Instead, Jepsen observed read skew, cyclic information flow, duplicate writes, and internal consistency violations.

I would not use that where data integrity is mission critical.

[0] https://jepsen.io/analyses/mongodb-4.2.6


FYI: Same report update:

2020-05-26: MongoDB identified a bug in the transaction retry mechanism which they believe was responsible for the anomalies observed in this report; a patch is scheduled for 4.2.8.


Lots of databases have transaction bugs for which they hurriedly rush out patches: MySQL, Oracle, SQL Server, even cloud databases like DynamoDB, CosmosDB, Aurora. Looks like MongoDB fixed this one.


Cool. Hopefully that means it passes the next round of Jepsen testing. :)


Indeed! That would be a great confirmation.


MongoDB's licensing situation is much more of a pain.


There are a few reasons why doing things like this in the RDBMS is a good thing for many use cases.

First, there is a ton of already built tooling for dealing with an RDBMS. This alone is why I think every project should default to using an RDBMS and only move if a good case can be made.

Second, like it or not, SQL is the de facto language for querying the data. After 20+ years of writing software, I've used many languages, but SQL (of course it has evolved) has been a constant across RDBMSes. I've seen many service layers and front ends change over the years, all while sitting on the same RDBMS. Given the speed at which tech moves, it's pretty amazing when you think about it.

Finally, for things like graphs or json, it's often the case that I only need to handle it in addition to my existing relational datastore. In some cases it might make sense to stand up an entire system for the use case (like a Solr server for document searching), but many times I just need to store a bit of json, or I have one or two graphs that need to be represented. In these cases, standing up an entire separate system adds unneeded complexity.
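
For the common case (a sketch with hypothetical names), a single jsonb column bolted onto an existing table covers "store a bit of json" without any new infrastructure:

    ALTER TABLE orders ADD COLUMN metadata jsonb NOT NULL DEFAULT '{}';

    -- query it right alongside the relational columns
    SELECT id, total, metadata ->> 'gift_message' AS gift_message
    FROM orders
    WHERE metadata ? 'gift_message';    -- ? tests whether the key exists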


> I've seen many service layers and front ends change over the years, all while sitting on the same RDBMS

This.

And the youngsters never believe me when I tell them that the application they are writing is not the only one, and will not be the last one, to access the database.


My standard line for decades has been that the database is more important than the program. You can reconstruct how the program works from the database; you can’t reconstruct the database from how the program works.

A business might be interrupted if their programs accessing the database break. A business will _die_ if their database is lost.


I'm currently staring at a PostgreSQL database created by a Django app with a lot of inheritance between models.

My favorite /s is the table with a single column containing the id of the record itself.

It's basically impossible to start from the database and understand what's going on. It will be a huge pain to interface to this database from something else.

On the other side, I'm maintaining another Django app for the same customer, with saner tables (no inheritance) and that's on par with other ORMs.

My take is never use inheritance with Django.
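
For anyone wondering how you end up with such a table: Django's multi-table inheritance gives each child model a one-to-one pointer to the parent's row, so a child model that adds no fields of its own becomes, roughly (hypothetical names):

    CREATE TABLE app_animal (
        id   serial PRIMARY KEY,
        name varchar(100) NOT NULL
    );

    -- a child model with no extra fields: just a pointer to its parent row
    CREATE TABLE app_dog (
        animal_ptr_id integer PRIMARY KEY REFERENCES app_animal (id)
    );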


Never use inheritance at all? I was thinking you could reconstruct part of the app stack by observing and clustering the requests and transactions. But then some people think JOIN is the devil and you'd be back to square one. In a perfect world every app would have only views over what it needs and prepared statements. But then CI/CD and all kinds of tests can become very painful...


Django is just plain bad at data modeling. There, I said it.


I don't know you, but I think we'd be great friends based solely on this statement.


I don’t make friends over hating on a technology as easily as I do over kindness. Hi! I learned a lot, great and not so great, working with DRF. Kind of took it and ran with it on TypeScript and Node. Working on a much better HTTP interface for that platform.


Reminds me of the Fred Brooks quote, "Show me your flowchart and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won't usually need your flowchart; it'll be obvious."

It always starts with the data.


> there is a ton of already built tooling

Diving more into Postgres recently, I'm kind of surprised there is not the kind of tooling I know from (frontend) development. I would have expected a number of tools for: version control, CI/CD, (integration) testing, performance monitoring, safe migration, developing stored procedures (in VS Code, with hot reload), data shuffling (snapshots, rollbacks...). Both open source and startups, SaaS services with online dashboards etc.

There are some ok options in these categories but nothing like the explosion of open source tooling out of bigcos and devtools/SaaS startups you see elsewhere.

There's... Aurora, Prisma, some pg extensions to get performance info?


Absolutely! I evaluated a bunch of migration management tools recently and was very disappointed. I needed something to develop and version stored procedures and just didn't find much. Ended up using graphile-migrate which has been great but also tightly bound to other parts of my stack. Would have expected more options from a 30+ year old technology.


If you haven't, Sqitch[1] is a great tool.

[1] https://sqitch.org


Just started using graphile-migrate after hand-rolling some shell scripts initially

And working pretty nicely so far


I use recursive queries like these at work, for hierarchical accounts/groups/locations/equipment/etc. where each level can have permissions that cascade down or not. Loading the hundreds of rows into memory necessary to calculate what any particular user can see is very costly in terms of network time, particularly for deeply nested stuff. Much easier to do the legwork in SQL and return the relatively tiny result set. There are certainly other tools that could handle these relationships more gracefully, but they also come with questions of code integration time/backups/redundancy/scaling/etc., which Postgres already has good answers for.
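
The shape of those queries, as a sketch with hypothetical nodes/permissions tables for user 7, where a missing permission row means "inherit from the parent":

    WITH RECURSIVE perms AS (
        SELECT n.id, COALESCE(p.can_view, false) AS can_view
        FROM nodes n
        LEFT JOIN permissions p ON p.node_id = n.id AND p.user_id = 7
        WHERE n.parent_id IS NULL
        UNION ALL
        SELECT n.id,
               COALESCE(p.can_view, perms.can_view)  -- explicit grant wins, else cascade down
        FROM nodes n
        JOIN perms ON n.parent_id = perms.id
        LEFT JOIN permissions p ON p.node_id = n.id AND p.user_id = 7
    )
    SELECT id FROM perms WHERE can_view;    -- the relatively tiny result set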


Thing is, if the primary source of truth is in the database, often it's beneficial to do the computation as close to the data as possible, as the bottleneck is likely to be I/O, not computation.

"just load the data into memory" requires the database to provide to the client all the data that might be needed for that algorithm, which is more load on the DB server than scanning the (much smaller) subset of the data that actually gets traversed during the algorithm directly on the database itself and providing just the results, as that requires less I/O. That's the reason why it's often more efficient (even if we care only about the server-side load) to do filtering, aggregation and summarization for tabular data in the DB instead of loading all the data into a client and calculating it there, and the same applies for graph data.


At my company we used Neo4j; that sucked and it was too expensive. So we moved to OrientDB; that also sucked. So we moved the graph database to Postgres like all our other databases, and it works great now.


Are you aware of https://pgrouting.org/ ?

As it turns out, Postgres is a very effective/efficient graph database, and is used as such quite often.

But to address a more core issue:

> if you only have a database, everything looks like a SQL query

Postgres, here, is just implementing SQL standard features. And SQL is supposed to be the be-all end-all Standard Query Language.

SQL makes no claims about being suited only for querying relational databases. (It uses a lot of relational terminology, but that’s because relational algebra is the strict superset of other querying models, in the same sense that a Turing machine is the strict superset of other automata. If you want a fully-general query language that can be used to query anything, it’s going to need to have relational algebra somewhere in it, at least for the cases where your query is about relations and can’t be expressed in any more compact/abstract way.)

Graph databases could—and probably should!—expose SQL-based query parsers for their querying engine, along with the bevy of other more graph-focused languages they support. (Yes, SQL isn't as efficient a syntax for making graph queries as e.g. Datalog, but SQL can still express said queries; and a query parser can turn those SQL-expressed graph queries into the very same query plans it'd build from a Datalog query.)

And, if you want to be able to query a graph database using SQL, you need features like this in SQL. Or rather, to reverse that: given that the SQL committee wants SQL to be “the” standard for querying, they’re going to add features like this to SQL to ensure that it is a viable query language for things like graph databases to expose.

That being said, you can think of this move by PostgreSQL less as “Postgres trying to be a graph database” and more as “Postgres ensuring that, when you port a standard SQL query—which you may very well have originally written to talk to an SQL-speaking graph database—you don't have to rewrite the query, but can continue to use standard SQL features.” Think of it like a “compatibility feature” allowing Postgres to “host” this foreign model.

(Which is not to say Postgres is not good at being a graph database. This “compatibility” is first-class! As I mentioned above.)

—————

P.S.: The deeper issue is that nobody other than database maintainers is really aware of what exactly is in the SQL standard; probably because it’s a proprietary document that you have to pay for access to.

On top of that, there’s also no secondary source that lays out what the SQL standard defines, vs. what various DBMSes actually implement. There’s no CanIUse.com for SQL features.

Really, there’s not even anybody out there making blog-posts about newly-standardized not-yet-universally-supported SQL features, the way people make blog-posts about newly-standardized not-yet-universally-supported JS features.

Instead, all most people know about SQL is the particular dialect their DBMS of choice implements; and they only become aware that certain features are in SQL at all, when their DBMS-of-choice implements those features.

It’s all faintly ridiculous, and reeks of the siloed enterprise distribution model of the 1980s. (Remember back before POSIX was a thing, when shell scripts were often written to assume the flavour of Unix the writer was familiar with—usually the one created+maintained by the vendor of their mainframes and workstations? Somehow, despite having the POSIX equivalent—the SQL standard—we’re still stuck in that culture in database land.)


> There’s no CanIUse.com for SQL features.

Yes, there is. See Modern SQL: the "concept" and "feature" sections each have a "compatibility" subsection. Examples:

https://modern-sql.com/concept/three-valued-logic#compatibil...

https://modern-sql.com/feature/case#compatibility


I'm glad to be proven wrong about this! It's not a site I or any of my other heavily database-engineering-focused colleagues were aware of. It seems to be rather new (~2015), which might be why.

It seems to just be one fellow's blog, though, addressing only a haphazard collection of topics so far as he feels like writing about them. Whereas CanIUse.com (despite also mostly being the work of a single author) is more designed as a complete reference work, and is a FOSS project open to contribution (https://github.com/fyrd/caniuse). So, not quite the same.


> And SQL is supposed to be the be-all end-all Standard Query Language.

It's Structured Query Language.


If I have the data in the DB, why would I bother loading it into memory when I can just query it to get what I want...?

It's a generic DB which has now started to provide DFS/BFS - you could argue that B-trees are way more advanced than this.
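
(For reference, the clause in question; a minimal sketch assuming a hypothetical tree(id, parent_id) table:)

    WITH RECURSIVE t AS (
        SELECT id, parent_id FROM tree WHERE id = 1
        UNION ALL
        SELECT c.id, c.parent_id
        FROM tree c JOIN t ON c.parent_id = t.id
    ) SEARCH BREADTH FIRST BY id SET ordercol    -- or DEPTH FIRST
    SELECT id FROM t ORDER BY ordercol;          -- rows come back in BFS order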


Replicating state to multiple places can add quite a lot of complexity to application development and operations.


> Is it just me or is this a severe case of "if you only have a database, everything looks like a SQL query"?

Honestly I’m starting to feel like if it looks like tool flexibility is a hammer, all you see is a tool belt. Databases should be able to do search. If you need to pipe tools (and likely duplicate storage), probably one of your tools is dropping the ball or you’re being reflexive about your tools.


Yeah, I see value, but I tend to agree it's catering to an interesting segment that I don't understand either.

More generally, I get concerned when too much functionality ends up in the db as it’s often the hardest thing to scale, so I’d prefer to spread those cycles out elsewhere. I feel this is likely to be something someone relies on and causes a production db to get pretty warm at a bad time.


It's not the hardest thing to scale anymore. Engineering teams are pretty hard to scale.


Ha, that's pretty funny. You're totally right.


I agree, but PostgreSQL is one of those projects where I'd rather have the kitchen sink in and believe it will work, almost blindly.


These iterative queries convert hierarchical (not graph) data into relational data. That means queries can perform aggregations and joins against other relations, which is obviously useful in a relational database.
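
For example (hypothetical tables), rolling up product stats across an entire category subtree in one statement:

    WITH RECURSIVE subcats AS (
        SELECT id FROM categories WHERE id = 42    -- hypothetical root category
        UNION ALL
        SELECT c.id
        FROM categories c JOIN subcats s ON c.parent_id = s.id
    )
    SELECT count(*) AS products, avg(p.price) AS avg_price
    FROM products p
    JOIN subcats s ON p.category_id = s.id;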



