Graph DBMSs need new join algorithms: Story of worst-case optimal joins

semihsalihoglu · on Feb 27, 2023

I wrote about a (relatively) new class of join algorithms for DBMSs that are called "worst-case optimal join algorithms". This is a topic that has kept me very busy for the last 5 6 years and the post aims to explain the simple ideas behind it and why it is the type of algorithms that graph DBMSs should integrate (and we have at KuzuDB: https://github.com/kuzudb/kuzu).

If you're curious about how come database researchers/developers still work on joins after 50 years, it's because what is possible and what is not is actually not that well understood. After many decades wcoj algorithms was probably the biggest algorithmic leap in the field. Many people are looking into information theory and computational geometry to improve our understanding. There is even more advanced but currently unpractical algorithms called "beyond worst-case optimal" join algorithms, which I briefly leave pointers to at the end of the post.

Enjoy reading!

metadat · on Feb 28, 2023

Damn, hadn't seen Kuzu before but the Cypher-esqe query language is beautiful, nearly on par with RedisGraph. This is a rare thing in the world of Graph databases, most of the query languages are horrid, even ones claiming cypher compliancy rarely live up to the hype. I'll have to check out Kuzu.

semihsalihoglu · on Feb 28, 2023

Please do :). We're a bit early but very actively developing the system!

namibj · on Feb 27, 2023

Tetris doesn't seem impractical for analytical joins, though?

Granted, it needs a good datastructure; I'm waiting for a friend to find time to discuss options there.

I'm firmly of the belief that beyond-wcoj is a question of implementation, not of theoretical suitability in practical scenarios.

semihsalihoglu · on Feb 27, 2023

If you have any good ideas about how to implement Tetris in a practical way, you can make a good impact here. This is the only work that I know that tried to implement these: https://arxiv.org/pdf/1503.04169.pdf (and excitingly published at GRADES workshop at SIGMOD, which I co-chaired twice and is very dear to my heart). But I talked to the authors and they all agree that despite the tone of the paper, they found them quite difficult to implement in a performant way.

So here's the problem. The core algorithmic step of "beyond wcojs" are "geometric resolutions". The core idea is to work with gaps in the space. So for example suppose we are joining two relations R(A), S(A) which is an intersection and suppose A is an integer domain. Suppose further than R's maximum A value is 100, and S's minimum value is 101. Then there is a gap of (100, \infty) in R's space and another graph (-\infty, 101) in S's space. If you "resolve/join" these gaps, you get a gap of (-\infty, \infty), which tells you in one operation that the join's output is empty.

On this simple query, this seems to work fine but if you have general relations (let alone non-integer data types) doing such geometric "resolutions" and finding efficient indices to index those "gaps" becomes quite challenging. But any good idea here will push the field!

mamcx · on Feb 27, 2023

I wonder if exist a byte-sized implementation of this algorithms.? As far I know all is part of big system...

semihsalihoglu · on Feb 27, 2023

Can you clarify what you mean by byte-sized implementations? There are several systems now that implement these algorithms, KuzuDB, Umbra, LogicBlox (the earliest was this) are examples I know. I'm sure more will come.

mamcx · on Feb 28, 2023

I mean one that could be seen in "one page"? Like a example using only arrays or similar, without the rest of the db?

sitkack · on Feb 28, 2023

What are your thoughts on "learned indexes"?

semihsalihoglu · on March 1, 2023

That's a topic quite separate from new join algorithms. I also don't know enough frankly but I don't yet see them being integrated in the systems that I'm aware of. I would be hesitant to put them inside KuzuDB since I don't see them resolving major performance bottlenecks. I think on the indices side, one interesting topic is to find more update-friendly disk-based CSR-based indices for graph DBMSs.

nerdfaktor42 · on March 1, 2023

Do you know of any research papers wrt "more update-friendly disk-based CSR-based indices"?

semihsalihoglu · on March 1, 2023

I think Dean De Leo's work in this space is good. It's certainly the right place to start. This work is on using packed memory arrays (pma) but is focused on in-memory versions of pma. I can recommend these two papers: Teseo: https://dl.acm.org/doi/abs/10.14778/3447689.3447708, and Packed Memory Arrays Rewired: https://ieeexplore.ieee.org/abstract/document/8731468. In Kuzu, I/we will be implementing a pma version of our disk-based CSR-baed join indices, which we also use to store relationship properties, so stay tuned for that!

nerdfaktor42 · on March 2, 2023

Thanks for the pointers, I'll look into them. I read your blog post series yesterday and found it very well written and interesting. Gonna read the CIDR paper, too. Fyi, I'm the author of https://github.com/s1ck/graph where we use a read-only CSR and my day job is https://github.com/neo4j/graph-data-science which is built on top of a read-only, compressed CSR. Maybe we can have a chat at some point :)

semihsalihoglu · on March 2, 2023

Of course :) Please feel free to write to me when you'd like to have a meeting. Happy to chat anytime! I'll also check out your graph library.

carterschonwald · on Feb 27, 2023

These are fun algorithms. I’ve found that some of my experiments in writing efficient linear algebra routines end up with exactly the same interfaces you’d want for implementing these algorithms

semihsalihoglu · on Feb 27, 2023

Interesting. I think there are groups that does quite interesting work on linear algebra that intersects with wcoj algorithms: This keynote abstract should give some pointers on this group's work: https://www.vldb.org/pvldb/vol13/p3502-olteanu.pdf.

lmeyerov · on Feb 28, 2023

Great writeup by the Kuzu team!

An especially interesting area to us is how to redo these to be accelerated with GPUs. Ex: See https://www.adms-conf.org/2014/adms14_wu.pdf

We've been taking a worse-is-better approach to use these kinds of ideas that seems worth writing up at some point. Super fun time for rethinking these :)

semihsalihoglu · on Feb 28, 2023

This is a good paper. It's likely to especially useful for people using GPUs in DBMSs. There is tons of research on this: https://www.nowpublishers.com/article/Details/DBS-076. It will be exciting to see when this could be the norm.

throwaway81523 · on Feb 28, 2023

Can't this be fixed by an SQL query planner that does those joins lazily? And don't some of them already do that?

semihsalihoglu · on Feb 28, 2023

Not sure what you mean by lazily but in general this sub-optimality can't be fixed by joining relations 2 at a time. You need a different algorithmic step that takes > 2 relations and joins them "column by column". That's why you need new "multiway join operators". So you cannot just use existing join operators in a traditional system and fix things.

For the latter part of your question: Yes, there is at least on RDBMS, Umbra, that implements wcoj's. I have a pointer to it in the blog post. Although, I'm convinced eventually many GDBMSs will integrate these algorithms, whether RDBMSs will broadly integrate these is less clear to me. The reason is not whether it's doable or not. Every DBMS, graph or relational, is relational at its core, in the sense that they compile high-level query languages to joins, groups by, filters, scans etc. In fact wcoj algorithms were invented assuming a relational system joining multiple relations. The reason is whether they care enough because wcojs are primarily for "cyclic joins" and they don't appear too often in traditional olap workloads, which are aggregation-heavy. Cyclic joins are more frequent on workloads on GDBMSs (e.g., finding a clique of users to develop a recommendation engine). So these algorithms are more critical for GDBMSs.

throwaway81523 · on Feb 28, 2023

What I mean is that the query planner can decide to do a multi-way join, just like a compiler can reorganize arithmetic expressions or vectorize loops. It is a high level transformation but databases are allowed to do that.

semihsalihoglu · on Feb 28, 2023

I see, that's for sure. That's what every system does in one way or another. Both of the papers I have listed there describe different ways to do it: the Graphflow paper (and Kuzu) currently does this inside their dynamic programming-based join order optimizer. Umbra does it after a good join order is picked.

semihsalihoglu · on March 1, 2023

Indeed pretty much everything is doable in every DBMS, graph or relational, simply because all DBMSs from enough distance are very similar in terms of their core features: a high-level query language that compiles to relational operators, support for transactions, a recovery mechanism etc. What differentiates them (aside from the data models and specific query language syntax they expose to users) are what type of applications they choose to provide optimizations for. Graph DBMSs immediately come into mind for wcoj algorithms because the applications they support frequently contain cyclic many-to-many join queries, which is the workloads wcojs can provide performance benefits. RDBMS, OLAP or OLTP don't assume their application workloads contain such queries. That is why every work done on wcojs so-far has used "graph patterns and workloads" in their evalution, not something like TPC-C or TPC-H.

throwaway81523 · on March 1, 2023

Right so I don't see where graph db's come into this. Regular SQL can handle it.

genman · on Feb 27, 2023

Interesting. Thank you for sharing your work.

semihsalihoglu · on Feb 27, 2023

You are welcome :)