Hacker News new | past | comments | ask | show | jobs | submit login

My naive interpretation - The canonical Apache Arrow implementation is written in C++ with multiple language bindings like PyArrow. The Rust bindings for Apache Arrow re-implemented the Arrow specification so it can be used as a pure Rust implementation. Andy Grove [1] built two projects on top of Rust-Arrow: 1. DataFusion, a query engine for Arrow that can optimize SQL-like JOIN and GROUP BY queries, and 2. Ballista, clustered DataFusion-like queries (vs. Dask and Spark). DataFusion was integrated into the Apache Arrow Rust project.

Ritchie Vink has introduced Polars that also builds upon Rust-Arrow. It offers an Eager API that is an alternative to PyArrow and a Lazy API that is a query engine and optimizer like DataFusion. The linked benchmark is focused on JOIN and GROUP BY queries on large datasets executed on a server/workstation-class machine (125 GB memory). This seems like a specialized use case that pushes the limits of a single developer machine and overlaps with the use case for a dedicated column store (like Redshift) or a distributed batch processing system like Spark/MapReduce.

Why Polars over DataFusion? Why Python bindings to Rust-Arrow rather than canonical PyArrow/C++? Is there something wrong with PyArrow?

[1] https://andygrove.io/projects/




Hi Author here,

Polars is not an alternative to PyArrow. Polars merely uses arrow as its in-memory representation of data. Similar to how pandas uses numpy.

Arrow provides the efficient data structures and some compute kernels, like a SUM, a FILTER, a MAX etc. Arrow is not a query engine. Polars is a DataFrame library on top of arrow that has implemented efficient algorithms for JOINS, GROUPBY, PIVOTs, MELTs, QUERY OPTIMIZATION, etc. (the things you expect from a DF lib).

Polars could be best described as an in-memory DataFrame library with a query optimizer.

Because it uses Rust Arrow, it can easily swap pointers around to pyarrow and get zero-copy data interop.

DataFusion is another query engine on top of arrow. They both use arrow as lower level memory layout, but both have a different implementation of their query engine and their API. I would say that DataFusion is more focused on a Query Engine and Polars is more focused an a DataFrame lib, but this is subjective.

Maybe its like comparing Rust Tokio vs Rust async-std. Just different implementations striving the same goal. (Only Polars and DataFusion can easily be mixed as they use the same memory structures).


Pandas supports JOIN and GROUP BY operators so you are saying that there is a gap between Apache Arrow and other mature dataframe libraries? If there is a gap, is there no plan to fix it in the standard Arrow API?

I understand the case for a SQL-like DSL and an optimizer for distributed queries (in-memory column stores, not so much). I'm trying to understand the value add of Polars. I don't mean to come across as critical; perhaps DataFusion is a poor implementation and you are being too polite to say so.

I also think that there is a C++/Arrow vs Rust/Arrow decision that has to be made. I associate PyArrow with the C++/Arrow library. Is Polars' Eager API a superset of the PyArrow API with the addition of JOIN/GROUPBY/other operators?


There is definitely a gap, and I don't think that Arrow tries to fill that. But I don't think that its wrong to have multiple implementations doing the same thing right? We have PostgresQL vs MySQL, both seem valid choices to me.

A SQL like query engine has its place. An in memory DataFrame also has its place. I think the wide-spread use of pandas proves that. I only think we can do that more efficient.

With regard to C++ vs Rust arrow. The memory underneath is the same, so having an implementation in both languages only helps more widespread adoption IMO.


Thank you for your work! I've decided to kick the tires after reading your Python book, I think you understimate the clarity of the API you have exposed which, honestly, looks a fair bit more sane than the tangled web that pandas is.


Thanks, I feel so too. There is still a hope work to do. I hope that I can also bridge the gap regarding utility and documentation.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: