Excellent learning material, thanks for sharing. I've noticed an interesting trend: JOINs are crucial for data analytics, yet many newer open-source analytics and OLAP products offer limited support for them. Examples include ClickHouse, Apache Druid, and Apache Pinot. Currently, only Trino and StarRocks seem to provide robust JOIN support; commercial products tend to do better in this area. I hope the various open-source projects will enhance their JOIN capabilities too.



If you want arbitrarily powerful ad hoc query support, you need to wait for data to land in an offline warehouse or lake environment where you have access to e.g. Presto/Trino and Spark. If you want a near-real-time view, then you're going to need to design the data layout around your query pattern: do a streaming join or other enrichment prior to OLAP ingestion.


Yep. There are always going to be constraints on how well a system like ClickHouse can support arbitrary joins. Queries in ClickHouse are fast because the data is laid out in such a way that it can minimize how much it needs to read.

Part of this is the columnar layout, which means it can avoid reading columns that are not involved in the query. However, it's also able to push query predicates into the table scan, using metadata (like bloom filters) that tells it which values are in each chunk of data.
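To make that concrete, here's a minimal sketch in ClickHouse SQL (table, column, and index names are invented for the example): a bloom_filter skip index records which values might appear in each granule, so the scan can skip chunks that can't match the predicate.

    -- Hypothetical events table with a bloom filter skip index.
    CREATE TABLE events
    (
        ts      DateTime,
        user_id UInt64,
        url     String,
        INDEX user_id_bf user_id TYPE bloom_filter GRANULARITY 4
    )
    ENGINE = MergeTree
    ORDER BY ts;

    -- The predicate is pushed into the scan: granules whose bloom
    -- filter rules out user_id = 42 are never read.
    SELECT count() FROM events WHERE user_id = 42;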

But for joins, you typically end up needing to read all of the data and materialize it in memory.
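As an illustration (a minimal ClickHouse sketch, reusing the hypothetical events table plus an invented users table): the default hash join materializes the whole right-hand side in memory before probing, and you have to opt into a slower algorithm when it doesn't fit.

    -- The entire right side (users) is built into an in-memory hash
    -- table, regardless of which users actually appear in events.
    SELECT e.url, u.country
    FROM events AS e
    JOIN users AS u ON e.user_id = u.user_id;

    -- If the right side is too big for memory, trade speed for space:
    SET join_algorithm = 'partial_merge';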

For real-time joins, the best option is to do them in a streaming fashion on ingestion, for example in a system like Flink or Arroyo [0], which I work on.

[0] https://github.com/ArroyoSystems/arroyo
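As a sketch of the shape this takes (streaming SQL in the style both Flink and Arroyo support; stream and column names are invented), an interval join also bounds how long each side's state has to be kept before the enriched rows are written to the OLAP store:

    -- Enrich clicks with the matching impression before OLAP ingestion;
    -- each side's state can be dropped once the 15-minute window passes.
    SELECT c.click_id, c.ts, i.campaign_id
    FROM clicks AS c
    JOIN impressions AS i
      ON c.impression_id = i.impression_id
     AND c.ts BETWEEN i.ts AND i.ts + INTERVAL '15' MINUTE;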


Something I have found pretty annoying is that Flink works great for joining a stream against another stream where messages to be joined are expected to arrive within a few minutes of each other, but there is actually ~no platform solution for joining a small, unchanging or slowly changing table against a stream. We end up needing a service to consume the messages, make RPC calls, and re-emit them with new fields.


Our (Estuary; I'm CTO) streaming transformation product handles this quite well, actually:

https://docs.estuary.dev/concepts/derivations/

Fully managed, UI and config driven. Write SQLite programs using lambdas driven by your source events to join data, do streaming transaction processing, and no doubt lots of other things we haven't thought of.


There are a few options here, but I agree this is a weakness with existing systems.

One option in Flink is to load the entire fact table into the pipeline (using the filesystem source or a custom operator) and join against that. This provides good performance, but at the cost of managing additional long-running state in the pipeline (and potentially long startup times). This works pretty well for very small fact tables (stuff like currency conversions, B2B customer data, etc.).

The other option is to store the fact table in a database and query it dynamically. Flink SQL has explicit support for this (called "lookup joins") but this requires careful tuning to not overwhelm your database with high-volume streaming traffic (particularly when doing bootstrapping or recovery).
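For reference, a minimal lookup-join sketch (table names and connector options are illustrative, and orders is assumed to carry a processing-time attribute proc_time): the FOR SYSTEM_TIME AS OF clause marks customers as a lookup table, and the JDBC connector's cache settings are the main knobs for not overwhelming the database.

    -- Hypothetical JDBC-backed dimension table; the lookup cache bounds
    -- how often Flink actually hits the database.
    CREATE TABLE customers (
        customer_id BIGINT,
        region      STRING
    ) WITH (
        'connector' = 'jdbc',
        'url' = 'jdbc:postgresql://db:5432/app',
        'table-name' = 'customers',
        'lookup.cache.max-rows' = '10000',
        'lookup.cache.ttl' = '10min'
    );

    SELECT o.order_id, o.amount, c.region
    FROM orders AS o
    JOIN customers FOR SYSTEM_TIME AS OF o.proc_time AS c
      ON o.customer_id = c.customer_id;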

Doing these sorts of joins is a huge use case, and definitely something we're trying to improve in Arroyo.


You can join against a static (or slowly changing) table in Flink, including AS OF joins [1], i.e. joining against the correct version of the records from the static table as of the stream's event time. You need to keep an eye on state size and potential growth, of course. It's a common use case we see, in particular among users of change data capture at Decodable (decodable.co).

[1] https://nightlies.apache.org/flink/flink-docs-master/docs/de...
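A sketch of the event-time variant (names invented; the rates table needs a primary key and a watermark so Flink treats it as a versioned table):

    -- Each order is joined against the exchange rate that was valid at
    -- the order's event time, not whatever rate is current now.
    SELECT o.order_id, o.price * r.rate AS converted_price
    FROM orders AS o
    JOIN rates FOR SYSTEM_TIME AS OF o.order_time AS r
      ON o.currency = r.currency;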


I think you're talking about denormalizing data before importing it into an OLAP system to avoid subsequent joins. However, this greatly limits the flexibility of data modeling, and denormalization itself can be a headache-inducing process. In fact, I have tested StarRocks (https://github.com/StarRocks/starrocks), and it is capable of performing joins during streaming data imports, and it's very fast. It's worth a try.


I've also been frustrated when testing tools that kinda keep you locked into one predetermined view, table, or set of tables at a time. I made a semantic data modeling library that puts together queries (and of course joins) for you using a drill-across querying technique, and it can also join data across different data sources in a secondary execution layer.

https://github.com/totalhack/zillion

Disclaimer: this project is currently a one man show, though I use it in production at my own company.


zillion looks fantastic. I've wanted for a long time to build something similar for warehousing the education and social science research data I work with, but have found it difficult to spend time building infrastructure as someone currently spending most of their time in academia. What does your company do, if you don't mind me asking? Any interest in expanding into research data management? I'd love to chat with you sometime if you're at all interested... (my contact is in my profile)


Thanks! I'm not sure if I'll have the bandwidth to help yet, but interested to hear about the problem you are facing. I'll reach out this week.


The reason those tools have more limited support for joins is mainly that they are making intentional trade-offs in favor of other features, e.g. performance in a particular domain.


On a related note: PostgreSQL is very good at joins, but MySQL, which at the time had a much larger share, was never very good at them (I last explored this in 2016 and earlier). A lot of web interfaces for data exploration back then were based on MySQL and its quirks, and that colours perspectives a lot.


I’m curious about the limitations you’ve encountered with ClickHouse’s JOINs; I’ve found it sufficiently robust for the typical operations.



