Excellent learning material, thanks for sharing. I've noticed an interesting trend: JOINs are crucial for data analytics, yet many newer open-source analytics and OLAP products offer limited support for them. Examples include ClickHouse, Apache Druid, and Apache Pinot. Currently, only Trino and StarRocks seem to provide robust JOIN support; commercial products tend to do better in this area. I hope the various open-source projects will enhance their JOIN capabilities too.



If you want arbitrarily powerful ad hoc query support, you need to wait for data to land in an offline warehouse or lake environment where you have access to e.g. Presto/Trino and Spark. If you want a near-real-time view, then you're going to need to design the data layout around your query pattern: do a streaming join or other enrichment prior to OLAP ingestion.


Yep. There are always going to be constraints on how well a system like ClickHouse can support arbitrary joins. Queries in ClickHouse are fast because the data is laid out in such a way that it can minimize how much it needs to read.

Part of this is the columnar layout, which means it can avoid reading columns that are not involved in the query. However, it's also able to push query predicates into the table scan, using metadata (like bloom filters) that tells it which values are in each chunk of data.
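To make that concrete, here's a minimal sketch in ClickHouse SQL (table, column, and index names are invented for the example): a bloom_filter skip index records which values might appear in each granule, so the scan can skip chunks that can't match the predicate.

    -- Hypothetical events table with a bloom filter skip index.
    CREATE TABLE events
    (
        ts      DateTime,
        user_id UInt64,
        url     String,
        INDEX user_id_bf user_id TYPE bloom_filter GRANULARITY 4
    )
    ENGINE = MergeTree
    ORDER BY ts;

    -- The predicate is pushed into the scan: granules whose bloom
    -- filter rules out user_id = 42 are never read.
    SELECT count() FROM events WHERE user_id = 42;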

But for joins, you typically end up needing to read all of the data and materialize it in memory.
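As an illustration (a minimal ClickHouse sketch, reusing the hypothetical events table plus an invented users table): the default hash join materializes the whole right-hand side in memory before probing, and you have to opt into a slower algorithm when it doesn't fit.

    -- The entire right side (users) is built into an in-memory hash
    -- table, regardless of which users actually appear in events.
    SELECT e.url, u.country
    FROM events AS e
    JOIN users AS u ON e.user_id = u.user_id;

    -- If the right side is too big for memory, trade speed for space:
    SET join_algorithm = 'partial_merge';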

For real-time joins, the best option is to do them in a streaming fashion on ingestion, for example in a system like Flink or Arroyo [0], which I work on.

[0] https://github.com/ArroyoSystems/arroyo
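As a sketch of the shape this takes (streaming SQL in the style both Flink and Arroyo support; stream and column names are invented), an interval join also bounds how long each side's state has to be kept before the enriched rows are written to the OLAP store:

    -- Enrich clicks with the matching impression before OLAP ingestion;
    -- each side's state can be dropped once the 15-minute window passes.
    SELECT c.click_id, c.ts, i.campaign_id
    FROM clicks AS c
    JOIN impressions AS i
      ON c.impression_id = i.impression_id
     AND c.ts BETWEEN i.ts AND i.ts + INTERVAL '15' MINUTE;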


Something I have found pretty annoying is that Flink works great for joining a stream against another stream where messages to be joined are expected to arrive within a few minutes of each other, but there is actually ~no platform solution for joining a small, unchanging or slowly changing table against a stream. We end up needing a service to consume the messages, make RPC calls, and re-emit them with new fields.


Our (Estuary; I'm CTO) streaming transformation product handles this quite well, actually:

https://docs.estuary.dev/concepts/derivations/

Fully managed, UI and config driven. Write SQLite programs using lambdas driven by your source events to join data, do streaming transaction processing, and no doubt lots of other things we haven't thought of.


There are a few options here, but I agree this is a weakness with existing systems.

One option in Flink is to load the entire fact table into the pipeline (using the filesystem source or a custom operator) and join against that. This provides good performance, but at the cost of managing additional long-running state in the pipeline (and potentially long startup times). This works pretty well for very small fact tables (stuff like currency conversions, B2B customer data, etc.).

The other option is to store the fact table in a database and query it dynamically. Flink SQL has explicit support for this (called "lookup joins") but this requires careful tuning to not overwhelm your database with high-volume streaming traffic (particularly when doing bootstrapping or recovery).
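For reference, a minimal lookup-join sketch (table names and connector options are illustrative, and orders is assumed to carry a processing-time attribute proc_time): the FOR SYSTEM_TIME AS OF clause marks customers as a lookup table, and the JDBC connector's cache settings are the main knobs for not overwhelming the database.

    -- Hypothetical JDBC-backed dimension table; the lookup cache bounds
    -- how often Flink actually hits the database.
    CREATE TABLE customers (
        customer_id BIGINT,
        region      STRING
    ) WITH (
        'connector' = 'jdbc',
        'url' = 'jdbc:postgresql://db:5432/app',
        'table-name' = 'customers',
        'lookup.cache.max-rows' = '10000',
        'lookup.cache.ttl' = '10min'
    );

    SELECT o.order_id, o.amount, c.region
    FROM orders AS o
    JOIN customers FOR SYSTEM_TIME AS OF o.proc_time AS c
      ON o.customer_id = c.customer_id;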

Doing these sorts of joins is a huge use case, and definitely something we're trying to improve in Arroyo.


You can join against a static (or slowly changing) table in Flink, including AS OF joins [1], i.e. joining against the correct version of the records from the static table as of the stream's event time. You need to keep an eye on state size and potential growth, of course. It's a common use case we see, in particular among users of change data capture at Decodable (decodable.co).

[1] https://nightlies.apache.org/flink/flink-docs-master/docs/de...
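A sketch of the event-time variant (names invented; the rates table needs a primary key and a watermark so Flink treats it as a versioned table):

    -- Each order is joined against the exchange rate that was valid at
    -- the order's event time, not whatever rate is current now.
    SELECT o.order_id, o.price * r.rate AS converted_price
    FROM orders AS o
    JOIN rates FOR SYSTEM_TIME AS OF o.order_time AS r
      ON o.currency = r.currency;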


I think you're talking about denormalizing data before importing it into an OLAP system to avoid subsequent joins. However, this greatly limits the flexibility of data modeling, and denormalization itself can be a headache-inducing process. In fact, I have tested StarRocks (https://github.com/StarRocks/starrocks), and it is capable of performing joins during streaming data imports, and it's very fast. It's worth a try.


I've also been frustrated when testing tools that kinda keep you locked into one predetermined view, table, or set of tables at a time. I made a semantic data modeling library that puts together queries (and of course joins) for you using a drill-across querying technique, and it can also join data across different data sources in a secondary execution layer.

https://github.com/totalhack/zillion

Disclaimer: this project is currently a one man show, though I use it in production at my own company.


zillion looks fantastic. I've wanted for a long time to build something similar for warehousing the education and social science research data I work with, but have found it difficult to spend time building infrastructure as someone currently spending most of their time in academia. What does your company do, if you don't mind me asking? Any interest in expanding into research data management? I'd love to chat with you sometime if you're at all interested... (my contact is in my profile)


Thanks! I'm not sure if I'll have the bandwidth to help yet, but interested to hear about the problem you are facing. I'll reach out this week.


The reason those tools have more limited support for joins is mainly that they are making intentional trade-offs in favor of other features, e.g. performance in a particular domain.


On a related note: PostgreSQL is very good at joins, but MySQL, which at the time had a much larger share, was never very good at them (I last explored this in 2016 and earlier). A lot of web interfaces for data exploration back then were based on MySQL and its quirks, and that colours perspectives a lot.


I’m curious about the limitations you’ve encountered with ClickHouse’s JOINs; I’ve found it sufficiently robust for the typical operations.



