
pandas is way more powerful than how most people use it.

when you have to deal with thousands of text files - a mishmash of CSV and TSV, some rows overlapping between files, files spread across multiple locations (shared drive, S3 bucket, URL, SQL DB, etc.), with column names that look similar but aren't exactly the same - this is the perfect use case for pandas.

read csv file? just pd.read_csv()

read and concat N csv files? just pd.concat([pd.read_csv(f) for f in glob("*.csv")])

read parquet or read_sql()? not a problem at all.

need custom rules for data cleansing, regex or fuzzy matching on column names, or converting data between csv/parquet/sql? it will be a pandas one-liner
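not literally one line, but a minimal sketch of that pattern, assuming the files sit in a local data/ directory; the normalize() helper and the separator guess are made up for illustration:

    import re
    from glob import glob

    import pandas as pd

    def normalize(col: str) -> str:
        # crude normalization: lowercase, strip, collapse non-alphanumerics to "_"
        return re.sub(r"[^a-z0-9]+", "_", col.strip().lower()).strip("_")

    frames = []
    for path in glob("data/*.*sv"):          # picks up both .csv and .tsv
        sep = "\t" if path.endswith(".tsv") else ","
        df = pd.read_csv(path, sep=sep)
        df.columns = [normalize(c) for c in df.columns]
        frames.append(df)

    combined = (
        pd.concat(frames, ignore_index=True)
          .drop_duplicates()                 # handles rows that overlap between files
    )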

a lot of painful data processing, cleaning, and correcting is just a one-liner in pandas, and I don't know of a better tool that beats it - probably tidyr, but that's essentially the same as pandas, just for R




> essentially the same as pandas, just for R

You are aware that pandas was designed to replicate the behavior of base R's data frames?

I've been a heavy user of both and R's data frames are still superior to pandas even without the tidyverse.

Pandas is really nice for the use case it was designed for: working with financial data. This is a big part of why pandas's indices feel so weird for everything else, but if your index is a timestamp in a financial time series then all of a sudden pandas makes sense and works great.
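a toy illustration of that point (the prices are random, the timestamps made up): once the index is a DatetimeIndex, label-based slicing and resampling fall out naturally.

    import numpy as np
    import pandas as pd

    # fake one-minute price series with a DatetimeIndex, just for illustration
    idx = pd.date_range("2024-01-01 09:30", periods=390, freq="min")
    prices = pd.Series(100 + np.random.randn(390).cumsum(), index=idx, name="price")

    # the index does the heavy lifting: label-based slicing and time resampling
    morning = prices["2024-01-01 09:30":"2024-01-01 12:00"]
    five_min_bars = prices.resample("5min").ohlc()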

When not working with financial data I try to limit how much my code touches pandas, and increasingly find that numpy + regular Python works better and is easier to build larger software with. It also makes it much easier to port your code into another language for use in production (i.e. it's quick and easy to map standard Python to language X, but not so much a large amount of non-trivial pandas).


with pandas 2.0 and the Arrow backend instead of numpy, pandas became "cloud data lake native" - you can read Arrow files from S3 very efficiently and at essentially any scale, and store/process arbitrarily large amounts of files on cheap serverless infra. The Arrow format is also supported by other languages.
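for example, something roughly like this (assuming pyarrow and s3fs are installed; the bucket and key are made up):

    import pandas as pd

    # pandas >= 2.0 can keep data in Arrow memory instead of numpy;
    # reading s3:// paths requires s3fs (or pyarrow's S3 filesystem)
    df = pd.read_parquet(
        "s3://my-bucket/events/2024/01/part-0001.parquet",
        dtype_backend="pyarrow",
    )

    # Arrow-backed dtypes show up as e.g. "int64[pyarrow]", "string[pyarrow]"
    print(df.dtypes)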

with s3+sqs+lambda+pandas you can build cheap serverless data processing pipelines and iterate extremely quickly


Do you have any benchmarks for how much data a given Lambda can search/process after loading Arrow data? Not trying to argue - I'm curious because I'd never thought of this architecture myself. I would have assumed that the time it takes to ingest the Arrow data and then search through it would be too long for a Lambda, but I may be totally off base here. I've not played around with Lambdas in detail, so I don't have a particularly robust mental model of their limitations.


reading/writing Arrow is a zero-serde-overhead operation between memory and disk.

I think of a lambda as a thread: you can put a trigger on an S3 bucket so each incoming file gets processed. This lets you get around the GIL and invoke your lambda once per mini-batch.

assuming you have a high volume and frequency of data, you will need to "cool down" your high-frequency data and switch from a row basis (like millions of rows per second) to a mini-batch basis (like one batch file per 100 MB).

This can be achieved by having Kafka with a high partition count on the ingestion side, with a sink to S3.

for each new file in S3 your lambda will be invoked and the mini-batch processed by your python code. You can right-size your lambda's RAM; I usually reserve 2-3x the size of a batch file for the lambda.

the killer feature is zero ops. Just by tuning your mini-batch size you can regulate how many times your lambda will be invoked.
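a rough sketch of what such a handler might look like (bucket names, the cleanup rule, and the output path are all placeholders, and it assumes pandas/pyarrow/s3fs are packaged with the lambda):

    import urllib.parse

    import pandas as pd

    OUTPUT_BUCKET = "my-processed-bucket"   # placeholder

    def handler(event, context):
        # invoked by an S3 put trigger; one mini-batch file in, one processed file out
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

            df = pd.read_parquet(f"s3://{bucket}/{key}", dtype_backend="pyarrow")
            df = df.dropna(subset=["user_id"])          # placeholder cleanup rule

            df.to_parquet(f"s3://{OUTPUT_BUCKET}/{key}", index=False)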


Very cool. Do you then further aggregate and load into a DB or vector store or something?


R also has data.table, which extends data.frame and is pretty powerful and very fast.


R + data.table is a lot faster than Base R.

See a benchmark of Base R vs R + data.table (plus various other data wrangling solutions, including our own Easy Data Transform) at:

https://www.easydatatransform.com/data_wrangling_etl_tools.h...


> Pandas is way more powerful

Only if you 1) don't know SQL and 2) are working with tiny datasets that are around 5% of your total RAM.


I guess it depends on who you ask, but personally I can write pandas much faster than I can load data into a DB and then process it. The reason is that pandas' defaults on the from_ and to_ methods are very sane, and you don't need to think about things like escaping strings. It's also easy to deal with nulls quickly in pandas and rapidly get some EDA graphs, like in R.
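e.g. the kind of quick pass I mean (the file name and columns are hypothetical; the histograms need matplotlib):

    import pandas as pd

    df = pd.read_csv("some_export.csv")   # hypothetical file

    # quick look at missing data and basic distributions
    print(df.isna().sum())                # nulls per column
    print(df.describe(include="all"))     # summary stats
    df.hist(figsize=(10, 8))              # EDA histograms (requires matplotlib)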

The other benefit of pandas is that it's in Python, so you can use your other data analysis libraries, whereas with SQL you need to marshal data back and forth between Python and SQL.

My usual workflow is: explore the data in pandas/datasette (if it's big data I explore just a sample, using bash tools to pull the sample out) -> write my notebook in pandas -> scale it up in Spark/Dask/Polars depending on the use case.

This works pretty well because ChatGPT understands pandas, PySpark, and SQL really well, so you can easily ask it to translate scripts or give you code for different things.

On scalability: if you need scale, there are many options today to process large datasets with a dataframe API, e.g. Koalas, Polars, Dask, Modin, etc.


> Only if you 1) don't know SQL and 2) are working with tiny datasets that are around 5% of your total RAM.

this is only true for newbie python devs who learned about pandas from blogs on medium.com. I have pipelines that process terabytes per day in a serverless data lake, and it requires zero of the DBA work that usually comes with anything *Sql


I've processed TBs of CSV files with pandas. You can always read files in chunks, and in the end SQL also needs to read the data from disk somewhere.
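for example, a chunked aggregation might look roughly like this (the file name, chunk size, and columns are illustrative):

    import pandas as pd

    # stream a large CSV without loading it all into RAM
    totals = None
    for chunk in pd.read_csv("huge_file.csv", chunksize=1_000_000):
        part = chunk.groupby("country")["amount"].sum()
        totals = part if totals is None else totals.add(part, fill_value=0)

    print(totals.sort_values(ascending=False).head())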



interesting, but I would still prefer pandas for data cleansing/manipulation, just because I won't be limited by SQL syntax - I can always use df.apply() and/or any python package for custom processing.

pandas with the Apache Arrow backend is also high performance and compatible with cloud-native data lakes

plus compatibility with the sklearn package is a killer feature: with just a few lines you can bolt an ML model onto your data
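something along these lines (the parquet file, feature columns, and label are placeholders):

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    df = pd.read_parquet("cleaned.parquet")          # hypothetical cleaned dataset
    X = df[["feature_a", "feature_b", "feature_c"]]  # placeholder feature columns
    y = df["label"]                                  # placeholder target column

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0
    )

    # sklearn accepts pandas DataFrames directly
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("test accuracy:", model.score(X_test, y_test))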


It definitely has its place. I like to use it to grab the data, clean it up, and get it out into Python / Postgres. I don't like to have it spreading through the codebase.


nobody is denying that pandas is powerful. But its syntax and API use very inconsistent and hard-to-reconcile patterns. It's painful because it's hard to memorize and almost everything has to be looked up



