Looks like a cool project.

It would be better to separate the benchmark results for big data technologies from those for single-machine DataFrame libraries.

Spark & Dask can perform computations on terabytes of data (thousands of Parquet files in parallel). Most of the other technologies in this article can only handle small datasets.
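For instance, here's a rough sketch of the kind of workload Dask is built for: reading a large directory of Parquet files lazily and aggregating across partitions in parallel (the path and column names are hypothetical):

    import dask.dataframe as dd

    # Hypothetical path: a directory containing thousands of Parquet files.
    # The read is lazy; partitions are processed in parallel across workers.
    df = dd.read_parquet("s3://bucket/events/*.parquet")

    # Nothing is materialized until .compute() is called.
    daily_counts = df.groupby("event_date").size().compute()

Most single-machine libraries would have to fit all of that data in memory at once.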

This is especially important for join benchmarking. There are different types of cluster computing joins (broadcast vs shuffle) and they should be benchmarked separately.
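As a sketch of the difference, in PySpark you can force one strategy or the other (table names and the join key are hypothetical):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("join-benchmark").getOrCreate()

    facts = spark.read.parquet("s3://bucket/facts/")  # large fact table
    dims = spark.read.parquet("s3://bucket/dims/")    # small dimension table

    # Broadcast join: the small table is shipped to every executor,
    # so the large table is never shuffled across the network.
    broadcast_joined = facts.join(broadcast(dims), on="dim_id")

    # Shuffle (sort-merge) join: disable auto-broadcasting so both sides
    # are repartitioned by the join key before merging.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
    shuffle_joined = facts.join(dims, on="dim_id")

The two strategies have completely different cost profiles, so lumping them into one "join benchmark" number hides what's actually being measured.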




Yeah, nobody uses Spark because they want to; they use it because beyond a certain data size, there's nothing else.



