I wrote about a (relatively) new class of join algorithms for DBMSs called "worst-case optimal join algorithms". This topic has kept me very busy for the last 5-6 years, and the post aims to explain the simple ideas behind these algorithms and why they are the type of algorithms graph DBMSs should integrate (as we have at KuzuDB: https://github.com/kuzudb/kuzu).
If you're curious why database researchers/developers still work on joins after 50 years, it's because what is and isn't possible is actually not that well understood. After many decades, wcoj algorithms were probably the biggest algorithmic leap in the field. Many people are looking into information theory and computational geometry to improve our understanding. There are even more advanced but currently impractical algorithms called "beyond worst-case optimal" join algorithms, to which I briefly leave pointers at the end of the post.
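To give a taste of the core wcoj idea in code: below is a minimal sketch in Python (my own toy illustration with in-memory hash indices, not how any real system implements it) of a "generic join" on the canonical triangle query Q(a, b, c) = R(a, b) JOIN S(b, c) JOIN T(a, c). The trick is to extend the output one attribute at a time, intersecting the candidate values that every relevant relation allows:

    # Toy "generic join" on the triangle query
    # Q(a, b, c) = R(a, b) JOIN S(b, c) JOIN T(a, c).
    # Instead of joining two tables at a time, extend the output one
    # attribute at a time, intersecting the candidates each relation allows.
    from collections import defaultdict

    def index_by_first(rel):
        """Index a set of pairs by their first column."""
        idx = defaultdict(set)
        for x, y in rel:
            idx[x].add(y)
        return idx

    def triangles(R, S, T):
        R_ab = index_by_first(R)  # a -> {b}
        S_bc = index_by_first(S)  # b -> {c}
        T_ac = index_by_first(T)  # a -> {c}
        for a in R_ab.keys() & T_ac.keys():   # candidates for attribute a
            for b in R_ab[a] & S_bc.keys():   # ... then attribute b
                for c in S_bc[b] & T_ac[a]:   # ... then attribute c
                    yield (a, b, c)

    edges = {(1, 2), (2, 3), (1, 3)}
    print(list(triangles(edges, edges, edges)))  # [(1, 2, 3)]

The point of the attribute-at-a-time strategy is that it matches the O(N^{3/2}) worst-case output bound of the triangle query, whereas any plan made of binary joins can produce Θ(N^2) intermediate tuples.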
Damn, I hadn't seen Kuzu before, but the Cypher-esque query language is beautiful, nearly on par with RedisGraph. That's a rare thing in the world of graph databases; most of the query languages are horrid, and even ones claiming Cypher compliance rarely live up to the hype. I'll have to check out Kuzu.
If you have any good ideas about how to implement Tetris in a practical way, you could make a real impact here. This is the only work I know of that tried to implement these algorithms: https://arxiv.org/pdf/1503.04169.pdf (and, excitingly, it was published at the GRADES workshop at SIGMOD, which I co-chaired twice and which is very dear to my heart). But I talked to the authors, and they all agree that despite the tone of the paper, they found these algorithms quite difficult to implement in a performant way.
So here's the problem. The core algorithmic step of "beyond wcoj" algorithms is the "geometric resolution". The core idea is to work with gaps in the space. For example, suppose we are joining two relations R(A) and S(A), which is an intersection, and suppose A is an integer domain. Suppose further that R's maximum A value is 100 and S's minimum value is 101. Then there is a gap of (100, \infty) in R's space and another gap of (-\infty, 101) in S's space. If you "resolve"/join these gaps, you get a gap of (-\infty, \infty), which tells you in one operation that the join's output is empty.
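Here is a toy sketch of that resolution step in Python (my own illustration of the single-attribute example above, not the actual Tetris algorithm). A gap is an open interval of A-values that a relation provably does not contain, and resolving two overlapping gaps yields their union:

    # Toy geometric "resolution" on a single integer attribute A.
    # If resolved gaps grow to (-inf, inf), R(A) JOIN S(A) is empty.
    import math

    def gaps(sorted_vals):
        """Open intervals NOT covered by a non-empty sorted list of ints."""
        out = [(-math.inf, sorted_vals[0])]
        for lo, hi in zip(sorted_vals, sorted_vals[1:]):
            if hi > lo + 1:                 # integers missing between lo, hi
                out.append((lo, hi))
        out.append((sorted_vals[-1], math.inf))
        return out

    def resolve(g1, g2):
        """If two open gaps overlap, their union is one bigger gap."""
        if max(g1[0], g2[0]) < min(g1[1], g2[1]):
            return (min(g1[0], g2[0]), max(g1[1], g2[1]))
        return None                         # disjoint: no resolution

    R = [1, 50, 100]                        # max(R.A) = 100
    S = [101, 200]                          # min(S.A) = 101
    print(resolve(gaps(R)[-1], gaps(S)[0])) # (-inf, inf): output is empty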
On this simple query this works fine, but with general relations (let alone non-integer data types), doing such geometric "resolutions" and finding efficient ways to index those "gaps" becomes quite challenging. But any good idea here will push the field!
Can you clarify what you mean by byte-sized implementations? There are several systems now that implement these algorithms; KuzuDB, Umbra, and LogicBlox (the earliest) are the ones I know of. I'm sure more will come.
That's a topic quite separate from new join algorithms. Frankly, I also don't know enough about it, but I don't yet see them being integrated into the systems I'm aware of. I would be hesitant to put them inside KuzuDB, since I don't see them resolving major performance bottlenecks. On the indices side, I think one interesting topic is finding more update-friendly disk-based CSR-based indices for graph DBMSs.
I think Dean De Leo's work in this space is good; it's certainly the right place to start. This work uses packed memory arrays (PMAs) but is focused on in-memory versions of them. I can recommend these two papers: Teseo: https://dl.acm.org/doi/abs/10.14778/3447689.3447708, and Packed Memory Arrays Rewired: https://ieeexplore.ieee.org/abstract/document/8731468. In Kuzu, we will be implementing a PMA version of our disk-based CSR-based join indices, which we also use to store relationship properties, so stay tuned for that!
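For readers unfamiliar with PMAs: the idea is to keep a sorted array with deliberate empty slots, so an insert only shifts a small neighborhood instead of rewriting everything. Applied to CSR, that means leaving slack inside each node's adjacency segment. A rough in-memory sketch in Python (my own toy code to convey the idea, not Teseo's or Kuzu's actual layout, and without a real PMA's density-based rebalancing):

    # Sketch of the PMA idea behind an update-friendly CSR: keep each node's
    # adjacency segment over-allocated, as a sorted prefix followed by empty
    # slots (None), so an insert shifts only within one segment instead of
    # rewriting the whole CSR. This toy just doubles a full segment; a real
    # PMA rebalances slack across segments to bound the shifting cost.

    class GappedCSR:
        def __init__(self, adjacency, slack=2):
            self.seg = []
            for nbrs in adjacency:
                cap = max(len(nbrs) * slack, slack)
                self.seg.append(sorted(nbrs) + [None] * (cap - len(nbrs)))

        def neighbors(self, u):
            return [v for v in self.seg[u] if v is not None]

        def add_edge(self, u, v):
            seg = self.seg[u]
            if seg[-1] is not None:            # no slack left in the segment
                seg.extend([None] * len(seg))  # grow it (a real PMA rebalances)
            pos = 0
            while seg[pos] is not None and seg[pos] < v:
                pos += 1                       # find v's sorted position
            carry = v
            while carry is not None:           # shift the rest of the prefix
                seg[pos], carry = carry, seg[pos]
                pos += 1

    g = GappedCSR([[1, 2], [2], []])
    g.add_edge(0, 3)
    print(g.neighbors(0))  # [1, 2, 3]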
Thanks for the pointers, I'll look into them. I read your blog post series yesterday and found it very well written and interesting. Gonna read the CIDR paper, too. FYI, I'm the author of https://github.com/s1ck/graph, where we use a read-only CSR, and my day job is https://github.com/neo4j/graph-data-science, which is built on top of a read-only, compressed CSR. Maybe we can have a chat at some point :)
Enjoy reading!