
I've found it generally best to push as much of that data prep work down to the database layer as you possibly can. For small/medium datasets that usually means doing it in SQL; for larger data it may mean using Hadoop/Spark tools to scale horizontally.

I really try to take advantage of the database to avoid ever having to munge very large CSVs in pandas. So like 80-90% of my work is done in query languages in a database, and the remaining 10-20% is in Python (or sometimes R) once my data is cooked down to a small enough size to easily fit in local RAM. If the data is still too big, I will just sample it.
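
To make that concrete, here is a rough sketch of the workflow using Python's built-in sqlite3 module as a stand-in for a real database. The `events` table, its columns, and the helper names are all made up for illustration; the point is only that the GROUP BY and the sampling run inside the database, and Python only ever sees the small result.

```python
import sqlite3

def build_demo_db():
    # Hypothetical table, stand-in for a large table living in a real database.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
    rows = [(i % 5, float(i)) for i in range(1000)]
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
    conn.commit()
    return conn

def per_user_totals(conn):
    # The aggregation happens in SQL; only one row per user comes back.
    query = """
        SELECT user_id, COUNT(*) AS n, SUM(amount) AS total
        FROM events
        GROUP BY user_id
        ORDER BY user_id
    """
    return conn.execute(query).fetchall()

def sample_rows(conn, k):
    # If the reduced data is still too big, pull a random sample of k rows.
    # ORDER BY RANDOM() is simple but scans the whole table, so treat it
    # as a sketch rather than something tuned for very large tables.
    query = "SELECT user_id, amount FROM events ORDER BY RANDOM() LIMIT ?"
    return conn.execute(query, (k,)).fetchall()

conn = build_demo_db()
totals = per_user_totals(conn)
sample = sample_rows(conn, 10)
```

From there, `totals` or `sample` is small enough to hand to pandas or anything else that wants the data in RAM.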




Is this tangential advice, or an argument that the current tools are good enough?


It's an argument that Python being slow / single-threaded isn't the biggest problem with Python in data engineering. The biggest problem is the need to process data that doesn't fit in RAM on any single machine. So you need on-disk data structures and algorithms that can process them efficiently. If your strategy for data engineering is to load whole CSV files into RAM, replacing Python with a faster language will raise your vertical scaling limit a bit, but beyond a certain scale it won't help anymore, and you'll have to switch to a distributed processing model anyway.
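
As a small illustration of the difference between loading a whole CSV and processing it from disk, here is a sketch that streams rows one at a time with the standard csv module. The function name, column names, and sample data are assumptions for the example. Memory use grows with the number of distinct keys, not the number of rows, but it is still one process on one machine, so it only moves the limit rather than removing it.

```python
import csv
import io
from collections import defaultdict

def streaming_totals(lines, key_col, value_col):
    # Reads one row at a time from any iterable of text lines (an open
    # file works the same way), so the full dataset is never held in RAM.
    totals = defaultdict(float)
    for row in csv.DictReader(lines):
        totals[row[key_col]] += float(row[value_col])
    return dict(totals)

# In-memory stand-in for a large file on disk (hypothetical data).
demo = io.StringIO("region,sales\neast,10\nwest,5\neast,2.5\n")
result = streaming_totals(demo, "region", "sales")
```

With a real file you would pass `open(path, newline="")` instead of the `StringIO` object.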



