I did not mean to imply that you are giving every process a copy of all the data. The main trick is decomposition of the application, data model, and operations such that every process may have a thousand discrete and disjoint shards of "stuff" that it shares with no other process. The large number of shards per process means that load, averaged over many shards, stays relatively balanced across processes. The "one shard per server/core" model is popular but poor architecture precisely because it is expensive to keep balanced.
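As a rough illustration (the shard and worker counts, names, and hash choice here are my own assumptions, not anything from the system being described): with far more shards than workers, each worker owns many small shards, hot and cold shards average out, and rebalancing means reassigning whole shards rather than splitting data.

    // Illustrative Rust sketch only: thousands of shards, a handful of workers.
    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};

    const NUM_SHARDS: usize = 4096; // many more shards than workers
    const NUM_WORKERS: usize = 8;   // e.g. one worker thread per core

    fn shard_of(key: &str) -> usize {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        (h.finish() as usize) % NUM_SHARDS
    }

    fn worker_of(shard: usize) -> usize {
        // Any shard->worker map works; a lookup table would let you move whole
        // shards between workers to rebalance without rehashing any keys.
        shard % NUM_WORKERS
    }

    fn main() {
        let key = "order:12345";
        let shard = shard_of(key);
        println!("{} -> shard {} -> worker {}", key, shard, worker_of(shard));
    }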
However, in these models you rarely move data between cores because it is expensive, both due to NUMA and cache effects. Instead, you move the operations to the data, just like you would in a big distributed system. This is the part most software engineers are not used to -- you move the operations to the threads that own the data, rather than moving the data to the threads that own the operations (traditional multithreading). Moving operations is almost always much cheaper than moving data, even within a single server, and operations are not updatable shared state.
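Here is a minimal sketch of what "moving the operation to the data" can look like (again, the names and the channel-based mechanism are illustrative assumptions on my part, not the original design): each worker thread owns its map outright, and other threads ship it closures to run instead of reaching into its memory.

    // Illustrative sketch: operations travel to the owning thread over a channel.
    use std::collections::hash_map::DefaultHasher;
    use std::collections::HashMap;
    use std::hash::{Hash, Hasher};
    use std::sync::mpsc;
    use std::thread;

    // An "operation" is just a closure executed against the owner's private data.
    type Op = Box<dyn FnOnce(&mut HashMap<String, u64>) + Send>;

    fn main() {
        const NUM_WORKERS: usize = 4;
        let mut senders = Vec::new();
        let mut handles = Vec::new();

        for _ in 0..NUM_WORKERS {
            let (tx, rx) = mpsc::channel::<Op>();
            senders.push(tx);
            handles.push(thread::spawn(move || {
                // Private to this worker: no locks, no shared mutable state.
                let mut data: HashMap<String, u64> = HashMap::new();
                for op in rx {
                    op(&mut data);
                }
            }));
        }

        // Route an increment to whichever worker owns the key's shard.
        let key = "user:42".to_string();
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        let worker = (h.finish() as usize) % NUM_WORKERS;
        let op: Op = Box::new(move |data| {
            *data.entry(key).or_insert(0) += 1;
        });
        senders[worker].send(op).unwrap();

        drop(senders); // closing the channels lets the workers drain and exit
        for handle in handles {
            handle.join().unwrap();
        }
    }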
This turns out to be a very effective architecture for highly concurrent, write-heavy software like database engines. It is much faster than, for example, the currently trendy lock-free architectures. Most of the performance benefit comes from much better locality and fewer stalls and context switches, but it has the added benefit of implementation simplicity, since your main execution path is not sharing anything.
Don't you lose all of your performance gains in RPC overhead? How do you avoid latency in the data thread (do you have one thread per lockable object? Won't that be more than one thread per core?)? These are the reasons lock-free is so popular.
> Don't you lose all of your performance gains in RPC overhead?
If one did, then why would anyone who knew what they were talking about (or even just knew how to write and use a decent performance test) advocate this method? :)
In a database engine, you ultimately need to move some data around, too. After all, you can't move a network connection to five threads at once, and not all aggregations can be decomposed into pieces that return only small amounts of information. Sharing memory often brings quite a substantial speedup.