
If you click through to the detailed benchmarks page (https://h2oai.github.io/db-benchmark/), you'll see why: a lot of the failures are cases of running out of memory, and a few are features that haven't been implemented yet.

Inefficient use of memory is a problem I've seen in several projects that focus on scaling out. All else being equal, they tend to use a lot more memory. This happens for various reasons, but much of it comes down to a simple fact: the mechanisms you need to support distributed computing, and to make it reliable, add a lot of overhead.




I suppose there isn't as much incentive to squeeze everything possible out of a single machine if the major focus is on distribution.


There's that, but there are also costs baked into the fact of distribution itself.

For example, take Spark. Since it's built to be resilient, every executor is its own process. Because of that, executors can't share immutable data the way threads can in a system designed for maximum single-machine performance; they have to transfer it over IPC. Moving data between two executors can result in up to four copies being resident in memory at once: the original data, a copy serialized for transfer over a socket, the destination's copy of the serialized bytes, and the final deserialized copy.
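
This isn't Spark's actual code path (the JVM executors use their own serializers and network transport), but here's a minimal Python sketch of that lifecycle, with pickle and a socketpair standing in for the serializer and the channel; the payload is purely illustrative:

    import pickle
    import socket

    # Stand-in for a partition held by the sending executor.
    data = list(range(1000))    # copy 1: the original in-memory object

    # A connected socket pair stands in for the channel between executors.
    sender, receiver = socket.socketpair()

    # Copy 2: the sender serializes the object into a byte buffer,
    # then writes a length header followed by the payload.
    payload = pickle.dumps(data)
    sender.sendall(len(payload).to_bytes(8, "big") + payload)
    sender.close()

    # Copy 3: the receiver accumulates its own buffer of the serialized bytes.
    size = int.from_bytes(receiver.recv(8), "big")
    chunks = []
    while size > 0:
        chunk = receiver.recv(min(65536, size))
        if not chunk:
            break
        chunks.append(chunk)
        size -= len(chunk)
    buf = b"".join(chunks)
    receiver.close()

    # Copy 4: deserializing materializes a second full in-memory object.
    result = pickle.loads(buf)
    assert result == data

Real systems stream and spill to keep those buffers bounded, but the underlying point holds: crossing a process boundary forces serialized and deserialized representations to coexist, a cost that threads sharing a single heap never pay.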



