
I switched from pandas to polars[0]. I'm happy with the performance gain. You still need to use Spark/dask on "big" data, but polars gives you more room before reaching for these solutions.

[0] -- https://www.pola.rs/




Glad it works for you, but this sounds like bad advice for most people. Why change to an outlier toolchain for some percentage increase in performance and raw data capacity, when you admit yourself that many Big Data sets won't fit there? No computer on my desk has less than 32GB RAM now -- pandas works well with that.


I think that neither "outlier toolchain" nor "some percentage increase" is fair. This benchmark [0] shows a significant speedup while lowering the memory needs. You still need to reach for dask/spark for really big data where you need a cluster of beefy computers for your tasks.

If you use an r5d.24xlarge-like[1] instance, you can skip spark/dask for most workflows, as 768 GB of RAM is plenty. On top of that, polars will efficiently use the 96 available cores when you are computing your join, groupby, etc.

Also, polars is getting more and more popular[2].

[0] -- https://h2oai.github.io/db-benchmark/
[1] -- https://aws.amazon.com/fr/ec2/instance-types/c6a/
[2] -- https://star-history.com/#pola-rs/polars&Date


It's not so much about larger data as it is about analyst velocity.

Polars runs orders of magnitude faster than pandas does, which means EDA can be completed quicker.


I think it depends on whether you use it for operations or for data analysis. Speed is only one concern and it may not always be the most relevant concern.

A statistician/data scientist wrangling data and making plots would not care whether loading a CSV file takes one second or one microsecond, because they may only do it a handful of times for a project.

A data engineer has different requirements and expectations. They may need to implement an operational component that processes CSV files billions of times a day.

If your use case is the latter, then pandas is probably not for you.


1 s vs 1 µs is not a great comparison.

Polars excels when pandas operations take 30 seconds or a minute to complete. Bringing that time down to the second or ms mark is really amazing.


I love pandas and work with quite small datasets for EDA (10^(3..6) most of the time) but even then I run into slowdowns. I don't really mind as I'm pursuing my own research rather than satisfying an employer/client, and often figuring out why something is slow turns into a useful learning experience (the canonical rite of passage for new pandas users is using lambda functions with df.apply instead of looping).
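
For anyone who hasn't hit that rite of passage yet, here is a minimal sketch of the progression on toy data. The column names and values are made up for illustration, and the third variant (plain column arithmetic) is the step that usually comes after df.apply; it is typically the fastest of the three, though I haven't timed this exact snippet.

```python
import pandas as pd

# Toy data; "price" and "qty" are hypothetical column names.
df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# 1. Explicit Python loop over rows.
totals_loop = []
for _, row in df.iterrows():
    totals_loop.append(row["price"] * row["qty"])

# 2. df.apply with a lambda, called once per row (axis=1).
totals_apply = df.apply(lambda row: row["price"] * row["qty"], axis=1)

# 3. Vectorized column arithmetic, no per-row Python call.
totals_vec = df["price"] * df["qty"]
```

All three produce the same numbers; the difference only shows up in run time as the row count grows.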

I've definitely procrastinated doing some analyses or turning prototypes into dashboards because of the potential for small slowdowns to turn into big slowdowns, so it's nice to have other options available. I'm very interested in Dask but have also been apprehensive about doing something stupid and incurring a huge bill by failing to think through my problem sufficiently.


Some percentage? I haven't used polars, but I ditched pandas after observing a 20x speedup by just rewriting a piece of code to use plain csv.reader. Pandas is unacceptably slow for medium data, period.
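
The rewritten code isn't shown here, so this is only a rough sketch of what a plain csv.reader version of a simple per-key sum might look like. The data, the column names, and the in-memory StringIO standing in for a real file are all assumptions for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical toy data standing in for a CSV file on disk.
raw = io.StringIO("city,sales\nOslo,10\nBergen,5\nOslo,7\n")

totals = defaultdict(float)
reader = csv.reader(raw)
header = next(reader)  # consume the header row
for city, sales in reader:
    # Every field arrives as a string, so convert explicitly.
    totals[city] += float(sales)
```

The trade-off is that type conversion, missing values, and malformed rows all become your problem instead of the library's.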


Having had the pleasure of profiling pandas code, the main bottleneck I observed was the handling of different types and edge cases.


Polars looks interesting. I’ve also come across modin[0] which lets you keep your existing pandas syntax, and just change the import statement, to speed up pandas using either Ray or Dask.

[0] https://modin.readthedocs.io/en/stable/

I don’t have much experience with this though.


Does polars have N-D labelled arrays, and if so can it perform computations on them quickly? I've been thinking of moving from pandas to xarray [0], but might consider polars too if it has some of that functionality.

[0] https://xarray.dev/


Polars only has 1-D columns; it's columnar, just like pandas.

IME xarray and pandas have to be used together. Neither can do what the other does. (Well, some people try to use pandas for stuff that should be numpy or xarray tasks.)


Addendum: Polars doesn't even have an index, so no MultiIndex either. I haven't gotten deep enough into polars to understand why and what it breaks, but it feels wrong to replace pandas with something without an index.


Indexes don't really do anything useful that's not covered by .group_by() and .over(). There is no loss of functionality.


Right, it's still possible. But it's less close to hand and never automatic. Series also become less useful when they are just a sequence instead of almost a mapping (with an index) or a sparse sequence (with a subset of a larger index).


No experience with polars, but I've had quite a positive experience with xarray.

I still use pandas for tabular data, but anytime I have to deal with ND data, xarray is a lifesaver. No more dealing with raw, unlabeled 5-D numpy arrays, trying to remember which dimension is what.



