What’s up with all the Dask benchmarks saying “internal error”? I expected at least some explanation in the post.



If you click through to the detailed benchmarks page (https://h2oai.github.io/db-benchmark/), you can see that most of those failures are the benchmark running out of memory, and a few are features that haven't been implemented yet.

Inefficient use of memory is a problem I've seen with several projects that focus on scaling out. All else being equal, they tend to use a lot more memory. This happens for various reasons, but much of it comes down to the simple fact that the mechanisms you need to support distributed computing, and to make it reliable, add significant overhead.


I suppose there isn’t as much focus on squeezing everything possible out of a single machine if the major focus is on distribution.


There's that, but there are also costs that are baked into the fact of distribution itself.

For example, take Spark. Since it's built to be resilient, every executor is its own process. Because of that, executors can't share immutable data the way threads can in a system designed for maximum single-machine performance; they have to transfer the data using IPC. Transferring data between two executors can result in up to four copies of the data being resident in memory at once: the original data, a copy that's been serialized for transfer over a socket, the destination's copy of the serialized data, and the final deserialized copy.
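For illustration, here's a minimal Python sketch of that pattern, assuming a naive pickle-over-a-socket transfer (the function names are made up for the example, not Spark's actual internals):

    import pickle
    import socket

    def send_partition(sock: socket.socket, partition: list) -> None:
        # Copy 1: `partition` itself, the original data in the sender's memory.
        payload = pickle.dumps(partition)  # Copy 2: serialized bytes at the sender
        sock.sendall(len(payload).to_bytes(8, "big"))  # length-prefixed framing
        sock.sendall(payload)

    def _recv_exact(sock: socket.socket, n: int) -> bytes:
        # Read exactly n bytes, looping because recv() may return partial chunks.
        buf = bytearray()
        while len(buf) < n:
            chunk = sock.recv(n - len(buf))
            if not chunk:
                raise ConnectionError("socket closed before full message arrived")
            buf.extend(chunk)
        return bytes(buf)

    def recv_partition(sock: socket.socket) -> list:
        size = int.from_bytes(_recv_exact(sock, 8), "big")
        payload = _recv_exact(sock, size)  # Copy 3: serialized bytes at the receiver
        return pickle.loads(payload)       # Copy 4: deserialized copy at the receiver

Until the sender frees its serialized buffer and the receiver drops the raw bytes, all four copies can be alive at once. Threads in a single process would just share a pointer to copy 1.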


These are single-machine benchmarks.



