
pandas is way more powerful than how most people use it.

when you have to deal with thousands of text files - a mishmash of CSV and TSV, some rows overlapping between files, files spread across multiple locations (shared drive, S3 bucket, URL, SQL DB, etc.), with column names that look similar but aren't exactly the same - this is the perfect use case for pandas.

read csv file? just pd.read_csv()

read and concat N csv files? just pd.concat([pd.read_csv(f) for f in glob("*.csv")])

read parquet or read_sql()? not a problem at all.

need custom rules for data cleansing, regex or fuzzy matching on column names, or converting data between csv/parquet/sql? it will be a pandas one-liner
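not literally one line, but a minimal sketch of that pattern, assuming the files sit in a local data/ directory; the normalize() helper and the separator guess are made up for illustration:

    import re
    from glob import glob

    import pandas as pd

    def normalize(col: str) -> str:
        # crude normalization: lowercase, strip, collapse non-alphanumerics to "_"
        return re.sub(r"[^a-z0-9]+", "_", col.strip().lower()).strip("_")

    frames = []
    for path in glob("data/*.*sv"):          # picks up both .csv and .tsv
        sep = "\t" if path.endswith(".tsv") else ","
        df = pd.read_csv(path, sep=sep)
        df.columns = [normalize(c) for c in df.columns]
        frames.append(df)

    combined = (
        pd.concat(frames, ignore_index=True)
          .drop_duplicates()                 # handles rows that overlap between files
    )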

a lot of painful data processing, cleaning, and correcting is just a one-liner in pandas, and I don't know of a better tool that beats it - probably tidyr, but that's essentially the same as pandas, just for R




> essentially the same as pandas, just for R

You are aware that pandas was designed to replicate the behavior of base R's data frames?

I've been a heavy user of both and R's data frames are still superior to pandas even without the tidyverse.

Pandas is really nice for the use case it was designed for: working with financial data. This is a big part of why pandas's indices feel so weird for everything else, but if your index is a timestamp in a financial time series then all of a sudden pandas makes sense and works great.
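a toy illustration of that point (the prices are random, the timestamps made up): once the index is a DatetimeIndex, label-based slicing and resampling fall out naturally.

    import numpy as np
    import pandas as pd

    # fake one-minute price series with a DatetimeIndex, just for illustration
    idx = pd.date_range("2024-01-01 09:30", periods=390, freq="min")
    prices = pd.Series(100 + np.random.randn(390).cumsum(), index=idx, name="price")

    # the index does the heavy lifting: label-based slicing and time resampling
    morning = prices["2024-01-01 09:30":"2024-01-01 12:00"]
    five_min_bars = prices.resample("5min").ohlc()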

When not working with financial data I try to limit how much my code touches pandas, and increasingly find that numpy + regular Python works better and is easier to build larger software with. It also makes it much easier to port your code into another language for use in production (i.e. it's quick and easy to map standard Python to language X, but not so much a large amount of non-trivial pandas).


with pandas 2.0 and the Arrow backend instead of numpy, pandas became "cloud data lake native" - you can read Arrow files from S3 very efficiently and at essentially any scale, and store/process arbitrarily large amounts of files on cheap serverless infra. The Arrow format is also supported by other languages.
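for example, something roughly like this (assuming pyarrow and s3fs are installed; the bucket and key are made up):

    import pandas as pd

    # pandas >= 2.0 can keep data in Arrow memory instead of numpy;
    # reading s3:// paths requires s3fs (or pyarrow's S3 filesystem)
    df = pd.read_parquet(
        "s3://my-bucket/events/2024/01/part-0001.parquet",
        dtype_backend="pyarrow",
    )

    # Arrow-backed dtypes show up as e.g. "int64[pyarrow]", "string[pyarrow]"
    print(df.dtypes)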

with s3+sqs+lambda+pandas you can build cheap serverless data processing pipelines and iterate extremely quickly


Do you have any benchmarks for how much data a given Lambda can search/process after loading Arrow data? Not trying to argue - I'm curious because I'd never thought of this architecture myself. I would have assumed that the time it takes to ingest the Arrow data and then search through it would be too long for a Lambda, but I may be totally off base here. I've not played around with Lambdas in detail, so I don't have a particularly robust mental model of their limitations.


reading/writing Arrow is a zero-serde-overhead operation between memory and disk.

I think of a lambda as a thread: you can put a trigger on an S3 bucket so each incoming file gets processed. This lets you get around the GIL and invoke your lambda once per mini-batch.

assuming you have a high volume and frequency of data, you will need to "cool down" your high-frequency data and switch from a row basis (like millions of rows per second) to a mini-batch basis (like one batch file per 100 MB).

This can be achieved by having Kafka with a high partition count on the ingestion side, with a sink to S3.

for each new file in S3 your lambda will be invoked and the mini-batch processed by your python code. You can right-size your lambda's RAM; I usually reserve 2-3x the size of a batch file for the lambda.

the killer feature is zero ops. Just by tuning your mini-batch size you can regulate how many times your lambda will be invoked.
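a rough sketch of what such a handler might look like (bucket names, the cleanup rule, and the output path are all placeholders, and it assumes pandas/pyarrow/s3fs are packaged with the lambda):

    import urllib.parse

    import pandas as pd

    OUTPUT_BUCKET = "my-processed-bucket"   # placeholder

    def handler(event, context):
        # invoked by an S3 put trigger; one mini-batch file in, one processed file out
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

            df = pd.read_parquet(f"s3://{bucket}/{key}", dtype_backend="pyarrow")
            df = df.dropna(subset=["user_id"])          # placeholder cleanup rule

            df.to_parquet(f"s3://{OUTPUT_BUCKET}/{key}", index=False)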


Very cool. Do you then further aggregate and load into a DB or vector store or something?


R also has data.table, which extends data.frame and is pretty powerful and very fast.


R + data.table is a lot faster than Base R.

See a benchmark of Base R vs R + data.table (plus various other data wrangling solutions, including our own Easy Data Transform) at:

https://www.easydatatransform.com/data_wrangling_etl_tools.h...


> Pandas is way more powerful

Only if you 1) don't know SQL and 2) are working with tiny datasets that are around 5% of your total RAM.


I guess it depends on who you ask, but personally I can write pandas much faster than I can load data into a DB and then process it. The reason is that pandas' defaults on the from_ and to_ methods are very sane, and you don't need to think about things like escaping strings. It's also easy to deal with nulls quickly in pandas and rapidly get some EDA graphs, like in R.
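e.g. the kind of quick pass I mean (the file name and columns are hypothetical; the histograms need matplotlib):

    import pandas as pd

    df = pd.read_csv("some_export.csv")   # hypothetical file

    # quick look at missing data and basic distributions
    print(df.isna().sum())                # nulls per column
    print(df.describe(include="all"))     # summary stats
    df.hist(figsize=(10, 8))              # EDA histograms (requires matplotlib)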

The other benefit of pandas is that it's in Python, so you can use your other data analysis libraries, whereas with SQL you need to marshal data back and forth between Python and SQL.

My usual workflow is: explore the data in pandas/datasette (if it's big data I explore just a sample, using bash tools to pull the sample out) -> write my notebook in pandas -> scale it up in Spark/Dask/Polars depending on the use case.

This works pretty well because ChatGPT understands pandas, PySpark, and SQL really well, so you can easily ask it to translate scripts or give you code for different things.

On scalability: if you need scale, there are many options today to process large datasets with a dataframe API, e.g. Koalas, Polars, Dask, Modin, etc.


> Only if you 1) don't know SQL and 2) are working with tiny datasets that are around 5% of your total RAM.

this is only true for newbie python devs who learned about pandas from blogs on medium.com. I have pipelines that process terabytes per day in a serverless data lake, and it requires zero of the DBA work that usually comes with anything *Sql


I've processed TBs of CSV files with pandas. You can always read files in chunks, and in the end SQL also needs to read the data from disk somewhere.
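for example, a chunked aggregation might look roughly like this (the file name, chunk size, and columns are illustrative):

    import pandas as pd

    # stream a large CSV without loading it all into RAM
    totals = None
    for chunk in pd.read_csv("huge_file.csv", chunksize=1_000_000):
        part = chunk.groupby("country")["amount"].sum()
        totals = part if totals is None else totals.add(part, fill_value=0)

    print(totals.sort_values(ascending=False).head())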



interesting, but I would still prefer pandas for data cleansing/manipulation, just because I won't be limited by SQL syntax - I can always use df.apply() and/or any python package for custom processing.

pandas with the Apache Arrow backend is also high performance and compatible with cloud-native data lakes

plus compatibility with the sklearn package is a killer feature: with just a few lines you can bolt an ML model onto your data
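something along these lines (the parquet file, feature columns, and label are placeholders):

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    df = pd.read_parquet("cleaned.parquet")          # hypothetical cleaned dataset
    X = df[["feature_a", "feature_b", "feature_c"]]  # placeholder feature columns
    y = df["label"]                                  # placeholder target column

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0
    )

    # sklearn accepts pandas DataFrames directly
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("test accuracy:", model.score(X_test, y_test))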


It definitely has its place. I like to use it to grab the data, clean it up, and get it out into Python / Postgres. I don't like to have it spreading through the codebase.


nobody is denying that pandas is powerful. But its syntax and API use very inconsistent and hard-to-reconcile patterns. It's painful because it's hard to memorize and almost everything has to be looked up



