
Pandas does a lot, and often most of it isn’t needed. Basic operations like Map, Reduce, GroupBy, InnerJoin, LeftJoin, CrossJoin, row or column generators, and transformations between columnar and row-based data structures are needed all the time, but getting them means pulling in a heavyweight library that isn’t performant when it counts.
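To make that concrete, here is a rough sketch of what a GroupBy sum and a columnar-to-row conversion look like in plain NumPy. This is not the library's API: `group_sum`, `columns`, and the sample data are names made up for illustration, and it assumes 1-D key and value arrays.

```python
from typing import Tuple

import numpy as np


def group_sum(keys: np.ndarray, values: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
    """Sum `values` within each distinct value of `keys` (hypothetical helper)."""
    # np.unique returns the sorted distinct keys; `inverse` maps every row
    # to the position of its key in that sorted array.
    unique_keys, inverse = np.unique(keys, return_inverse=True)
    # np.bincount with weights adds up the values that share a group index.
    sums = np.bincount(inverse, weights=values, minlength=unique_keys.size)
    return unique_keys, sums


# Columnar data: a dict of column name -> NumPy array (made-up sample data).
columns = {
    "x": np.array([1, 2, 1, 2, 3]),
    "y": np.array([10.0, 20.0, 30.0, 40.0, 50.0]),
}

group_keys, group_totals = group_sum(columns["x"], columns["y"])

# Columnar -> row-based: one tuple per record.
rows = list(zip(columns["x"].tolist(), columns["y"].tolist()))
```

It's only a few lines, but rewriting something like this (plus type handling and validation) for every project is exactly the repetition I wanted to avoid.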

Because I needed these operations, wanted to work with NumPy directly, and didn’t want to write custom implementations each time, I created a library to do it. It has constructor methods for Python dicts, any kind of Iterable, CSV, SQL queries, pandas DataFrames and Series, and more, as well as destructor methods to generate whatever you need when done. It tries its best to maintain the types you specify and makes casting as easy as possible. All functions return a single type to allow static type checking. And for performance, there is a “trust me, I know what I’m doing” mode for extremely fast access to the data, which achieves about a 10x speedup by skipping all data validation steps.

Everything it does outperforms pandas, except for the joins. It does allow inequality joins and multiple join conditions, but the general solution used isn’t very fast. Anyone reading this who is interested in improving this component is welcome to contribute!
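For anyone wondering why a general join is hard to make fast, here is a minimal sketch of an inequality join done by comparing every left row against every right row. This is an illustration of one general approach, not necessarily how the library implements it; `inequality_join_indices` and the sample arrays are hypothetical.

```python
from typing import Tuple

import numpy as np


def inequality_join_indices(left: np.ndarray, right: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
    """Return index pairs (i, j) where left[i] < right[j] (hypothetical helper)."""
    # Broadcasting builds a len(left) x len(right) boolean matrix, so time
    # and memory both grow with the product of the two input sizes.
    mask = left[:, None] < right[None, :]
    left_idx, right_idx = np.nonzero(mask)
    return left_idx, right_idx


# Made-up join keys.
left = np.array([1, 5, 9])
right = np.array([4, 6])

li, ri = inequality_join_indices(left, right)

# Matched rows side by side: column 0 from the left, column 1 from the right.
joined = np.column_stack((left[li], right[ri]))
```

An equality join can avoid the full comparison matrix with sorting or hashing, but an arbitrary condition (or several combined) generally can't, which is the trade-off behind supporting them.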

https://tafra.readthedocs.io/en/latest/



