My naive interpretation - The canonical Apache Arrow implementation is written i...

ritchie46 · on March 14, 2021

Hi, author here.

Polars is not an alternative to PyArrow. Polars merely uses Arrow as its in-memory representation of data, similar to how pandas uses NumPy.

Arrow provides the efficient data structures and some compute kernels, like a SUM, a FILTER, a MAX, etc. Arrow is not a query engine. Polars is a DataFrame library on top of Arrow that has implemented efficient algorithms for JOINs, GROUPBYs, PIVOTs, MELTs, QUERY OPTIMIZATION, etc. (the things you expect from a DF lib).
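
To make the kernel vs. DataFrame-library split concrete, here is a minimal sketch in plain Python. Lists stand in for Arrow columns, and every function name below is hypothetical; none of this is the actual Arrow or Polars API. The point is only that a group-by or a join is an algorithm built on top of the columns, and it can reuse simple kernels like a sum.

```python
from collections import defaultdict

# "Kernels": simple operations over a single column.
def kernel_sum(column):
    return sum(column)

def kernel_filter(column, mask):
    # Keep the values whose mask entry is truthy.
    return [value for value, keep in zip(column, mask) if keep]

# "DataFrame library" operations: algorithms layered on top of the columns.
def groupby_sum(keys, values):
    # Collect the row indices for each key, then reuse the sum kernel per group.
    groups = defaultdict(list)
    for row, key in enumerate(keys):
        groups[key].append(row)
    return {key: kernel_sum([values[row] for row in rows])
            for key, rows in groups.items()}

def inner_join(left_keys, right_keys):
    # Hash join: index the right side, then probe it with each left key.
    # Returns pairs of (left_row, right_row) that match.
    index = defaultdict(list)
    for row, key in enumerate(right_keys):
        index[key].append(row)
    return [(left_row, right_row)
            for left_row, key in enumerate(left_keys)
            for right_row in index.get(key, [])]

city = ["a", "b", "a", "c"]
sales = [10, 20, 30, 40]

total = kernel_sum(sales)
large = kernel_filter(sales, [value > 15 for value in sales])
per_city = groupby_sum(city, sales)
matches = inner_join(["a", "c", "x"], ["c", "a", "a"])
```

A real engine would do this over typed, contiguous buffers rather than Python lists, but the layering is the same idea.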

Polars could be best described as an in-memory DataFrame library with a query optimizer.

Because it uses Rust Arrow, it can easily swap pointers around to pyarrow and get zero-copy data interop.
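
A rough illustration of what "swapping pointers" means, using only the Python standard library as a stand-in (this is not how Polars or pyarrow are implemented; the variable names are made up for the example). Two consumers look at the same buffer through views, so no bytes are copied and a write through the owner is visible to both.

```python
from array import array

# One contiguous buffer of 64-bit integers, playing the role of an Arrow column.
buffer = array("q", [1, 2, 3, 4])

# Two "libraries" each hold a view onto the same memory; nothing is copied.
view_a = memoryview(buffer)
view_b = memoryview(buffer)[1:3]  # slicing a memoryview is also zero-copy

# A write through the owning buffer shows up in both views.
buffer[1] = 99

shared = view_a.obj is buffer
seen_by_a = view_a.tolist()
seen_by_b = view_b.tolist()
```

Because both sides agree on the memory layout, handing data over is a matter of passing the pointer and some metadata rather than converting the data.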

DataFusion is another query engine on top of Arrow. They both use Arrow as their lower-level memory layout, but each has a different implementation of its query engine and its API. I would say that DataFusion is more focused on being a query engine and Polars is more focused on being a DataFrame lib, but this is subjective.

Maybe it's like comparing Rust Tokio vs Rust async-std: just different implementations striving for the same goal. (Except that Polars and DataFusion can easily be mixed, as they use the same memory structures.)

sradman · on March 14, 2021

Pandas supports JOIN and GROUP BY operators, so are you saying that there is a gap between Apache Arrow and other mature dataframe libraries? If there is a gap, is there no plan to fix it in the standard Arrow API?

I understand the case for a SQL-like DSL and an optimizer for distributed queries (in-memory column stores, not so much). I'm trying to understand the value add of Polars. I don't mean to come across as critical; perhaps DataFusion is a poor implementation and you are being too polite to say so.

I also think that there is a C++/Arrow vs Rust/Arrow decision that has to be made. I associate PyArrow with the C++/Arrow library. Is Polars' Eager API a superset of the PyArrow API with the addition of JOIN/GROUPBY/other operators?

ritchie46 · on March 14, 2021

There is definitely a gap, and I don't think that Arrow tries to fill it. But I don't think that it's wrong to have multiple implementations doing the same thing, right? We have PostgreSQL vs MySQL, and both seem like valid choices to me.

A SQL-like query engine has its place. An in-memory DataFrame also has its place. I think the widespread use of pandas proves that. I just think we can do it more efficiently.

With regard to C++ vs Rust Arrow: the memory underneath is the same, so having an implementation in both languages only helps more widespread adoption IMO.

lordgroff · on March 14, 2021

Thank you for your work! I've decided to kick the tires after reading your Python book. I think you underestimate the clarity of the API you have exposed, which, honestly, looks a fair bit more sane than the tangled web that pandas is.

ritchie46 · on March 14, 2021

Thanks, I feel so too. There is still a lot of work to do. I hope that I can also bridge the gap regarding utility and documentation.