What’s up with all the Dask benchmarks saying “internal error”? I expected at least some explanation in the post.



If you click through to the detailed benchmarks page (https://h2oai.github.io/db-benchmark/), you can see that most of those failures are the benchmark running out of memory, and a few are features that haven't been implemented yet.

Inefficient use of memory is a problem I've seen with several projects that focus on scaling out. All else being equal, they tend to use a lot more memory. This happens for various reasons, but much of it comes down to the simple fact that the mechanisms you need to support distributed computing, and to make it reliable, add significant overhead.


I suppose there isn’t as much focus on squeezing everything possible out of a single machine if the major focus is on distribution.


There's that, but there are also costs that are baked into the fact of distribution itself.

For example, take Spark. Since it's built to be resilient, every executor is its own process. Because of that, executors can't share immutable data the way threads can in a system designed for maximum single-machine performance; they have to transfer the data using IPC. Transferring data between two executors can result in up to four copies of the data being resident in memory at once: the original data, a copy that's been serialized for transfer over a socket, the destination's copy of the serialized data, and the final deserialized copy.
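For illustration, here's a minimal Python sketch of that pattern, assuming a naive pickle-over-a-socket transfer (the function names are made up for the example, not Spark's actual internals):

    import pickle
    import socket

    def send_partition(sock: socket.socket, partition: list) -> None:
        # Copy 1: `partition` itself, the original data in the sender's memory.
        payload = pickle.dumps(partition)  # Copy 2: serialized bytes at the sender
        sock.sendall(len(payload).to_bytes(8, "big"))  # length-prefixed framing
        sock.sendall(payload)

    def _recv_exact(sock: socket.socket, n: int) -> bytes:
        # Read exactly n bytes, looping because recv() may return partial chunks.
        buf = bytearray()
        while len(buf) < n:
            chunk = sock.recv(n - len(buf))
            if not chunk:
                raise ConnectionError("socket closed before full message arrived")
            buf.extend(chunk)
        return bytes(buf)

    def recv_partition(sock: socket.socket) -> list:
        size = int.from_bytes(_recv_exact(sock, 8), "big")
        payload = _recv_exact(sock, size)  # Copy 3: serialized bytes at the receiver
        return pickle.loads(payload)       # Copy 4: deserialized copy at the receiver

Until the sender frees its serialized buffer and the receiver drops the raw bytes, all four copies can be alive at once. Threads in a single process would just share a pointer to copy 1.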


These are single-machine benchmarks.



