I wrote about a (relatively) new class of join algorithms for DBMSs that are called "worst-case optimal join algorithms". This is a topic that has kept me very busy for the last 5 6 years and the post aims to explain the simple ideas behind it and why it is the type of algorithms that graph DBMSs should integrate (and we have at KuzuDB: https://github.com/kuzudb/kuzu).
If you're curious about how come database researchers/developers still work on joins after 50 years, it's because what is possible and what is not is actually not that well understood. After many decades wcoj algorithms was probably the biggest algorithmic leap in the field. Many people are looking into information theory and computational geometry to improve our understanding. There is even more advanced but currently unpractical algorithms called "beyond worst-case optimal" join algorithms, which I briefly leave pointers to at the end of the post.
Damn, hadn't seen Kuzu before but the Cypher-esqe query language is beautiful, nearly on par with RedisGraph. This is a rare thing in the world of Graph databases, most of the query languages are horrid, even ones claiming cypher compliancy rarely live up to the hype. I'll have to check out Kuzu.
If you have any good ideas about how to implement Tetris in a practical way, you can make a good impact here. This is the only work that I know that tried to implement these: https://arxiv.org/pdf/1503.04169.pdf (and excitingly published at GRADES workshop at SIGMOD, which I co-chaired twice and is very dear to my heart). But I talked to the authors and they all agree that despite the tone of the paper, they found them quite difficult to implement in a performant way.
So here's the problem. The core algorithmic step of "beyond wcojs" are "geometric resolutions". The core idea is to work with gaps in the space. So for example suppose we are joining two relations R(A), S(A) which is an intersection and suppose A is an integer domain. Suppose further than R's maximum A value is 100, and S's minimum value is 101. Then there is a gap of (100, \infty) in R's space and another graph (-\infty, 101) in S's space. If you "resolve/join" these gaps, you get a gap of (-\infty, \infty), which tells you in one operation that the join's output is empty.
On this simple query, this seems to work fine but if you have general relations (let alone non-integer data types) doing such geometric "resolutions" and finding efficient indices to index those "gaps" becomes quite challenging. But any good idea here will push the field!
Can you clarify what you mean by byte-sized implementations? There are several systems now that implement these algorithms, KuzuDB, Umbra, LogicBlox (the earliest was this) are examples I know. I'm sure more will come.
That's a topic quite separate from new join algorithms. I also don't know enough frankly but I don't yet see them being integrated in the systems that I'm aware of. I would be hesitant to put them inside KuzuDB since I don't see them resolving major performance bottlenecks. I think on the indices side, one interesting topic is to find more update-friendly disk-based CSR-based indices for graph DBMSs.
I think Dean De Leo's work in this space is good. It's certainly the right place to start. This work is on using packed memory arrays (pma) but is focused on in-memory versions of pma. I can recommend these two papers: Teseo: https://dl.acm.org/doi/abs/10.14778/3447689.3447708, and Packed Memory Arrays Rewired: https://ieeexplore.ieee.org/abstract/document/8731468. In Kuzu, I/we will be implementing a pma version of our disk-based CSR-baed join indices, which we also use to store relationship properties, so stay tuned for that!
Thanks for the pointers, I'll look into them. I read your blog post series yesterday and found it very well written and interesting. Gonna read the CIDR paper, too. Fyi, I'm the author of https://github.com/s1ck/graph where we use a read-only CSR and my day job is https://github.com/neo4j/graph-data-science which is built on top of a read-only, compressed CSR. Maybe we can have a chat at some point :)
These are fun algorithms. I’ve found that some of my experiments in writing efficient linear algebra routines end up with exactly the same interfaces you’d want for implementing these algorithms
Interesting. I think there are groups that does quite interesting work on linear algebra that intersects with wcoj algorithms: This keynote abstract should give some pointers on this group's work: https://www.vldb.org/pvldb/vol13/p3502-olteanu.pdf.
We've been taking a worse-is-better approach to use these kinds of ideas that seems worth writing up at some point. Super fun time for rethinking these :)
This is a good paper. It's likely to especially useful for people using GPUs in DBMSs. There is tons of research on this: https://www.nowpublishers.com/article/Details/DBS-076. It will be exciting to see when this could be the norm.
Not sure what you mean by lazily but in general this sub-optimality can't be fixed by joining relations 2 at a time. You need a different algorithmic step that takes > 2 relations and joins them "column by column". That's why you need new "multiway join operators". So you cannot just use existing join operators in a traditional system and fix things.
For the latter part of your question: Yes, there is at least on RDBMS, Umbra, that implements wcoj's. I have a pointer to it in the blog post. Although, I'm convinced eventually many GDBMSs will integrate these algorithms, whether RDBMSs will broadly integrate these is less clear to me. The reason is not whether it's doable or not. Every DBMS, graph or relational, is relational at its core, in the sense that they compile high-level query languages to joins, groups by, filters, scans etc. In fact wcoj algorithms were invented assuming a relational system joining multiple relations. The reason is whether they care enough because wcojs are primarily for "cyclic joins" and they don't appear too often in traditional olap workloads, which are aggregation-heavy. Cyclic joins are more frequent on workloads on GDBMSs (e.g., finding a clique of users to develop a recommendation engine). So these algorithms are more critical for GDBMSs.
What I mean is that the query planner can decide to do a multi-way join, just like a compiler can reorganize arithmetic expressions or vectorize loops. It is a high level transformation but databases are allowed to do that.
I see, that's for sure. That's what every system does in one way or another. Both of the papers I have listed there describe different ways to do it: the Graphflow paper (and Kuzu) currently does this inside their dynamic programming-based join order optimizer. Umbra does it after a good join order is picked.
Indeed pretty much everything is doable in every DBMS, graph or relational, simply because all DBMSs from enough distance are very similar in terms of their core features: a high-level query language that compiles to relational operators, support for transactions, a recovery mechanism etc. What differentiates them (aside from the data models and specific query language syntax they expose to users) are what type of applications they choose to provide optimizations for. Graph DBMSs immediately come into mind for wcoj algorithms because the applications they support frequently contain cyclic many-to-many join queries, which is the workloads wcojs can provide performance benefits. RDBMS, OLAP or OLTP don't assume their application workloads contain such queries. That is why every work done on wcojs so-far has used "graph patterns and workloads" in their evalution, not something like TPC-C or TPC-H.
If you're curious about how come database researchers/developers still work on joins after 50 years, it's because what is possible and what is not is actually not that well understood. After many decades wcoj algorithms was probably the biggest algorithmic leap in the field. Many people are looking into information theory and computational geometry to improve our understanding. There is even more advanced but currently unpractical algorithms called "beyond worst-case optimal" join algorithms, which I briefly leave pointers to at the end of the post.
Enjoy reading!