Scaling Pandas: Comparing Dask, Ray, Modin, Vaex, and Rapids (datarevenue.com)
92 points by FHMS on July 5, 2020 | 33 comments



We need a better decomposition of scalability. Do you mean scalability in data or scalability in compute or scalability of both?

Definitions:

Scalability in Data (SD): doing fast computation on a very large number of rows

Scalability in Compute (SC): doing slow computations on a large number of rows

For SD, I have found that a 16-32 core machine is more than enough for tens of billions of rows, as long as your disk access is relatively fast (SSD vs. HDD). If you vectorize your compute operations you can typically get to within 10x the hand-tuned assembly compute time, which lets you tap a 32-core machine for tens of effective gigaflops (these machines are rated at hundreds of gigaflops). For example, I had to compute a metric on a 100-million-row table (dataframe) which effectively required on the order of 10-20 TFLOPs of compute. Single-core Pandas was showing us 2 months of compute time; using vectorization and mp.Pool I was able to reduce that to a few hours. The big win here was vectorization, not mp.Pool.

For Compute scalability - e.g. running multiple machine learning models which cannot be effectively limited to a single machine, nothing beats Dask. Dask is extremely mature, has seen a large number of real world cases and people have used it for hundreds of hours of uptime.

Vectorization is an oft-overlooked source of speedup that can easily give you 10-100x improvements in Pandas. Understanding vectorization and what it can and cannot do is a highly productive exercise.
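
A minimal sketch of what this looks like in practice (my own illustration, not the commenter's actual code; the metric here is a toy placeholder): the same per-row computation written with .apply versus a vectorized expression, plus a multiprocessing.Pool fan-out over chunks for functions that genuinely can't be vectorized.

    import numpy as np
    import pandas as pd
    from multiprocessing import Pool

    df = pd.DataFrame({"a": np.random.rand(1_000_000),
                       "b": np.random.rand(1_000_000)})

    # Slow: one Python-level function call per row
    slow = df.apply(lambda row: np.sqrt(row["a"] ** 2 + row["b"] ** 2), axis=1)

    # Fast: one NumPy call over whole columns (often 10-100x quicker)
    fast = np.sqrt(df["a"] ** 2 + df["b"] ** 2)

    # If the function can't be vectorized, chunk the frame and fan the
    # chunks out to worker processes instead.
    def metric(chunk):
        return np.sqrt(chunk["a"] ** 2 + chunk["b"] ** 2)

    if __name__ == "__main__":
        chunks = np.array_split(df, 8)
        with Pool(8) as pool:
            combined = pd.concat(pool.map(metric, chunks))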


Oh, I wrote this :) I submitted it last week but it didn't get much attention then.

Happy to answer questions as far as possible. I have used Pandas extensively but I don't have deep experience with all of these libraries so I learnt a lot while summarising them.

If you know more than I do and I made any mistakes, let me know and I'll get them corrected.


What are your thoughts on AWS Glue/Spark? We're starting to have problems with data frames that no longer fit into memory on 32 GB clusters, and upgrading to the next option, a 64 GB cluster, is expensive. We plan to migrate to Glue as a long-term solution, but I think we need a short-term fix while the migration takes place.

Thanks for the article, before it I only knew of Dask as a real alternative.

P.S. I just remembered that I wanted to try Pandarallel as well. Do you have any insight into this library? Thanks!


Not the OP, but moving to sparse matrices is probably going to give you the most bang for your buck. I would strongly suspect that those huge dataframes could be encoded sparsely in a much more efficient format.

To be fair, that's one of the reasons the Spark ML stuff works quite well. Be warned though, estimating how long a Spark job will take and how many resources it will need is a dark, dark art.
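
For anyone curious what the sparse route looks like, here is a minimal sketch using pandas' built-in sparse dtypes (the DataFrame is synthetic; your own frame and fill value will differ):

    import numpy as np
    import pandas as pd

    # Synthetic frame that is ~99% zeros
    dense = pd.DataFrame(np.random.binomial(1, 0.01, size=(1_000_000, 20)) * 1.0)

    # Store only the non-fill values
    sparse = dense.astype(pd.SparseDtype("float64", fill_value=0.0))

    print(dense.memory_usage(deep=True).sum())   # full dense footprint
    print(sparse.memory_usage(deep=True).sum())  # a small fraction of it
    print(sparse.sparse.density)                 # share of values actually stored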


Be very, very sure that you understand how expensive Glue can get, especially with suboptimal code. I have seen bills 10x those of the same code running on EMR Spark clusters.


With Ray not having released a 1.0.0 version yet, does that give you any pause about adopting it for a professional project? In the article, you've given it an A for maturity, but the criteria didn't include versioning.

I've worked professionally with data scientists, and we've used both Dask and Ray with some success. Scaling pandas will be an issue for a long time to come, given how much data science code is written in Python with Pandas.


That's such a tricky one, because the version number doesn't necessarily indicate the maturity.

For example, as of the beginning of this year, the latest version of Pandas was 0.25. (It jumped to 1.0 in late January.) This despite it having been a core part of the standard professional toolset for years and years now.


Do you also have experience working with SQL databases? If so, how do they compare to Pandas in terms of performance? (with or without these extensions)


It Depends (tm). I think SQL is one of the most underrated and underused languages and can often significantly out-perform Python for basic operations such as filtering and pivoting data.

That said, it's hard to keep SQL readable when doing more complicated data analysis, and you'll probably want the flexibility of Python the moment you start to do anything more custom.


Therefore SQLAlchemy.


I work with sqlalchemy a _lot_. Love it. But it's not intuitive at all, especially for the majority of folks with a DS background.


What do you think about Spark and PySpark?


I'm going to give you my slightly biased and annoyed answer. People who use Python tend to look down on Spark as "too complicated" because it's written in Scala. I come from a Scala background, and now that I feel forced into using Python for my data work because of the momentum it has, I'm still amazed at how quickly simple requests like using a different image or attaching some jars make Python people go "whoa, that's complicated, how can anyone like Spark?" Personally I love Spark (for all its quirks), I think the Spark dataframe is in many ways more mature than pandas, and I appreciate the sanity that type-driven programming brings to the table. I'm kind of sad that I'll probably have to use Python for the rest of my career, because it causes so many fires and there's a real strong tendency to kick things down the road. The community just generally strikes me as very impatient.


Spark is really, really good. For a lot of data scientists, though, it's a massive leap from the Python/R model of "play around with a data.frame till I have a model, then wrap it up in a script", which causes problems.

Spark is ace: it has an SQL API available cross-language, which makes ETL much more effective, and it ships with ML models (though I've always been somewhat suspicious about their maturity).

tl;dr - demonstrate the speed of running regressions in Spark, and many (most) data scientists will invest the time in learning the tool.


Articles like these are interesting, but what surprises me is that they rarely set up a holistic use case, so most debates imagine how long it would take an expert user to use each tool. But time constraints (e.g. time spent coding) separate expert from novice performance in many domains.

FWIW I have 2020 set aside to implement siuba, a python port of the popular R library dplyr (siuba runs on top of pandas, but also can generate SQL). A huge source of inspiration has been screencasts by Dave Robinson using R to analyze data he's never seen before lightning fast.

Has anyone seen similar screencasts with pandas? I suspect it's not possible (given some constraints on its interface), but would love to be wrong here, because I'd like to keep all my work in python :o.

Expert R screencasts: https://youtu.be/NY0-IFet5AM

Siuba: https://github.com/machow/siuba


David Robinson is great and you won't easily find material of the quality he produces elsewhere.

Take a look at Jake Vanderplas and Joel Grus's stuff for a general 'people doing cool things with Python' theme, but not quite comparable.


The subtitle is "How can you process more data quicker?"

NumPy. It scores an A in Maturity and Popularity, and either an A or a B in Ease of Adoption depending on which Pandas features you use (e.g. GroupBy).

When you're using NumPy as the main show instead of an implementation detail inside Pandas, it is easier to adopt Numba or Cython, and there are huge gains to be made there. Most Pandas workloads on small clusters of say 10 machines or fewer could be implemented on a single machine.

Even simple operations on smallish data sets are often much faster in NumPy than Pandas.

You don't have to leave Pandas behind; just try using NumPy and Numba for the hot parts of your code. Numba even lets you write Python code that runs with the GIL released, which can give a speedup linear in the number of cores with much less work than multiprocessing, and without the overhead of copying data to multiple processes.
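
As an illustration of the GIL-released point, a minimal sketch (not from the article; the kernel is a toy rolling mean and the array is random data):

    import numpy as np
    from numba import njit, prange

    # Compiled to machine code; parallel=True spreads the prange loop
    # across cores, and the GIL is released while the kernel runs.
    @njit(parallel=True, nogil=True)
    def rolling_mean(x, window):
        out = np.empty(x.size - window + 1)
        for i in prange(out.size):
            out[i] = x[i:i + window].mean()
        return out

    values = np.random.rand(10_000_000)
    result = rolling_mean(values, 50)  # compiled on first call, fast thereafter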


This does not solve the issue of compute scalability: slow computations, which are fundamentally opaque, applied to large data frames. Given a series of data frames (or one large one that can be chunked), how do I apply a long-running function to each chunk? For that you need scalability across cores and machines, hence Dask.
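
A minimal sketch of that pattern with Dask (the file paths and the slow function are hypothetical; map_partitions applies a function to each underlying pandas chunk in parallel):

    import dask.dataframe as dd

    def expensive_metric(pdf):
        # stands in for a long-running, per-chunk computation
        return pdf.assign(score=pdf["value"] ** 2)

    ddf = dd.read_csv("data/*.csv")               # split into many partitions
    result = ddf.map_partitions(expensive_metric)
    result.to_parquet("out/")                     # executes across cores/machines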


Why do you consider computations to be opaque? Do you not have the source code?

There is a ton of low-hanging speedup in many computations that people treat as black boxes, often as the result of knowing something extra about the specific input data rather than relying on a generic implementation.

In some cases all you need is to write NumPy code instead of Pandas code for a 2-3x speedup. Then suddenly your small cluster program runs on one machine.
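
A small illustration of that kind of drop-down (my own example, not a benchmark): the same filter-and-sum written against the DataFrame and against the underlying NumPy arrays.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"a": np.random.rand(5_000_000),
                       "b": np.random.rand(5_000_000)})

    # Pandas version
    mask_pd = (df["a"] * df["b"]) > 0.5
    total_pd = df.loc[mask_pd, "a"].sum()

    # NumPy version on the same data: identical result, typically faster
    a, b = df["a"].to_numpy(), df["b"].to_numpy()
    mask = (a * b) > 0.5
    total_np = a[mask].sum()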


Besides the speedup from using native NumPy, there's also the potential for a 50-100x speedup if your code isn't vectorized to begin with, and anywhere from 1-1000x if there are a couple of joins in there that you can optimize.

But for the latter, see the discussion on shifting the pd compute to an RDBMS elsewhere in these comments.


SK Learn is one of the most popular ML libraries: well written, source code available, etc. But I am not opening it up to optimize it, and neither should anyone unless they are already an SK Learn contributor OR have a ton of time on their hands.


It would be interesting to see koalas compared as well


Koalas is the Pandas API on top of Apache Spark for anyone that's interested: https://github.com/databricks/koalas

It works similarly to PySpark and scales to massive datasets (hundreds of terabytes). Koalas is probably the best bet if you're working on a massive dataset and want the Pandas API. Or you can simply use PySpark, which has a cleaner interface.
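
A minimal sketch of what using it looks like (the path and column names are hypothetical, and a Spark session is assumed to be available):

    import databricks.koalas as ks

    kdf = ks.read_csv("s3://bucket/large-dataset/*.csv")  # hypothetical path
    summary = kdf.groupby("category")["value"].mean()     # familiar pandas syntax
    print(summary.head())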


I'll just throw it into the discussion: pandas could simply interface with an RDBMS and leave the heavy lifting to it.
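
A minimal sketch of that division of labour (connection string and table names are hypothetical): the join and aggregation run inside the database, and pandas only receives the small result set.

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql://user:pass@host/db")
    query = """
        SELECT c.region, SUM(o.amount) AS total
        FROM orders o
        JOIN customers c ON c.id = o.customer_id
        GROUP BY c.region
    """
    df = pd.read_sql(query, engine)  # the RDBMS does the heavy lifting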


In my opinion, pandas is fundamentally broken and unsuitable for any production workload.

The heavy lifting should be left to an RDBMS, like you say: something with a sensible, battle-hardened query planner. I've written and debugged too many lines of manual pd joins/merges; something declarative like SQL is much nicer because the query planner is almost always right.

Furthermore, as a user, I've always found the pandas API to be very confusing. I'm always having to interrupt my workflow to figure out boring details about the API (is it df.groupBy().rolling(center=True).median() or any other permutation?), whereas eg pyspark or sql are so much more ergonomic.

Finally, typing inside pd dataframes is a complete and utter nightmare. Int64 missing a null, or the idiocy around datetimes expressed as epoch nanoseconds...

Pandas is nice for noodling around in notebooks. But for me, it should never be used beyond that.


Pandas combines the intuitive nature of base-R with the excellent missing data model of Python.


Aye aye!


I use pandas with multiple sources and try to keep as much as possible in the db.

But many sources are outside the db, or run on multiple, disconnected dbs.

Loading data into a db is impractical and I only do it if necessary.

Also, different people run and manage dbs so it’s frequently easier to run in pandas.


The tie-breaker here really is Kubernetes. Most likely your company's infrastructure runs on k8s, and as a data scientist you do not get control over that.

Dask natively integrates with Kubernetes. That's why I see a lot of people moving away even from Apache Spark (which is generally run on YARN) and towards Dask.

The second reason is that the dask-ml project is building seamless compatibility for higher-order ML libraries (sklearn, etc.) on top of Dask, not just NumPy/Pandas.
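
For reference, a minimal sketch of the native Kubernetes integration via the dask-kubernetes package (the worker-spec.yml pod template is assumed to exist in your cluster setup):

    from dask_kubernetes import KubeCluster
    from dask.distributed import Client

    cluster = KubeCluster.from_yaml("worker-spec.yml")  # pod template for workers
    cluster.scale(10)                 # ask Kubernetes for 10 worker pods
    client = Client(cluster)          # subsequent Dask work runs on those pods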


I'm working on a project called CloudPy that's a one-line drop-in for pandas to process 100 GB+ dataframes.

It works by proxying the dataframe to a remote server, which can have a lot more memory than your local machine. The project is in beta right now, but please reach out if you're interested in trying it out! You can read more at http://www.cloudpy.io/ or email hello@cloudpy.io


https://github.com/weld-project/weld is in the same domain as Vaex, I suppose, though its scope is much larger.


Any sense of where Apache Arrow will fit in these libraries?


"Behind the scenes, RAPIDS is based on Apache Arrow..."

https://www.forbes.com/sites/janakirammsv/2018/11/05/nvidia-...



