I've found it generally best to push as much of that data prep work down to the ...

blt · on April 9, 2020

Is this tangential advice, or an argument that the current tools are good enough?

kyllo · on April 9, 2020

It's an argument that Python being slow / single-threaded isn't the biggest problem with Python in data engineering. The biggest problem is the need to process data that doesn't fit in RAM on any single machine. So you need on-disk data structures and algorithms that can process them efficiently. If your strategy for data engineering is to load whole CSV files into RAM, replacing Python with a faster language will raise your vertical scaling limit a bit, but beyond a certain scale it won't help anymore and you'll have to switch to a distributed processing model anyway.