If you distribute a balanced tree data structure with Galaxy (say, a B-tree), I think the amortized number of invalidation messages per update comes down to O((M/N) log N), where M is the number of machines in the cluster and N is the size of the data set. I'll do the math more carefully and write a blog post explaining it. In any case, an update operation does not need to wait for invalidation acks in order to proceed, so this cost affects throughput more than it affects latency.
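The back-of-the-envelope version (take it with a grain of salt until I write it up properly) assumes a branching factor b and roughly uniform access: a tree node at height i covers about b^i keys, so it ends up cached on roughly M*b^i/N of the machines, while an update modifies a height-i node only when a split propagates that far up, i.e. about once every b^i updates. Each level therefore contributes about

    (M*b^i/N) * (1/b^i) = M/N

invalidations per update, and summing over the log_b N levels gives O((M/N) log N).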
I was wondering about the root updates themselves, given a large number of nodes.
I'll admit that I'm thinking of a wide-area application. Maybe another version of the question is: what are you targeting as a max number of nodes (30? 30,000? 30M?) What is your goal for invalidation latency? And, not to quibble, but how much invalidation latency can you have without being considered eventually consistent? I assume you do guarantee the order of the invalidations?
Perfectly cool if the answer to these questions is "maybe later" or "maybe not".
Can't wait to see your deeper explanation. I really love this project because it's a whole different take on distributed memory. I'm embarrassed that I had settled into thinking that a DHT was -the- way to distribute shared memory, and hope this project provokes lots of us to think more broadly. I don't know for sure that it will have a large number of applications, but I am really glad that we'll all be able to think more broadly about distributed systems as a result of the open work you are doing.
Each node will usually be hosted on a single machine (the only reason to host more than one node per machine is for testing - it isn't required for load balancing like with DHT solutions), so 30M is probably not feasible.
Most communication among nodes is peer-to-peer and uses UDP, so there are a few things that can limit you. First is your data-structure itself and the amount of contention for items. Second, the first time a node requests an item it has to find it (there's no well-known hashing scheme), so it can either multicast the cluster or ask a central server. Multicasting with 30,000 nodes is probably not the right way to go, so you'll need a server, and that puts a practical limit. Third, Galaxy uses either ZooKeeper or JGroups for cluster management (peer detection, fault detection, leader election etc.), so either of these solutions might also have some limitations. Fourth, if your cluster is huge you might hit network bandwidth issues, though that seems unlikely.
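To make the second point a bit more concrete, here's a very rough sketch - hypothetical names, not Galaxy's actual classes - of what owner resolution looks like from a node's point of view. The lookup is only paid on the first access; after that the owner is cached locally until an invalidation evicts it:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical illustration only - not Galaxy's real code. Without a
    // well-known hashing scheme, a cold read must first discover which node
    // owns the item, either by multicasting the cluster ("who owns item X?")
    // or by asking a directory server; the answer is then cached locally.
    final class DirectoryLocator {
        // the central server's view: itemId -> id of the owning node
        private final Map<Long, Short> directory = new ConcurrentHashMap<>();
        // this node's local cache of previously resolved owners
        private final Map<Long, Short> localCache = new ConcurrentHashMap<>();

        /** Called on the server side when an item's ownership changes. */
        void record(long itemId, short ownerNode) {
            directory.put(itemId, ownerNode);
        }

        /** Resolves an item's owner; only the first call pays the lookup. */
        short ownerOf(long itemId) {
            Short cached = localCache.get(itemId);
            if (cached != null)
                return cached;                    // later reads stay local
            Short owner = directory.get(itemId);  // in reality, a request to the server
            if (owner == null)
                throw new IllegalArgumentException("unknown item: " + itemId);
            localCache.put(itemId, owner);
            return owner;
        }
    }

With ~30 nodes you could multicast that question instead of keeping a directory; at ~30,000 the per-miss broadcast is what hurts, which is why the server becomes the practical limit.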
EDIT:
Forgot to address the latency question. Galaxy is consistent (not eventually consistent), so to get a new version of an item from a node you'll need to wait for the invalidation acks. However, the amortized cost is O((M/N) log N) no matter what - the larger and taller the tree, the more nodes you'll need and the more nodes will be sharing the root, but updates to the root will be less frequent (because the tree is taller).
What I find most interesting about Galaxy is the distributed data-structures people will come up with. Some will be better suited to Galaxy and some will not. But because Galaxy's architecture mimics CPU architecture, I think that modern data-structures designed around caches and NUMA would be excellent candidates for distribution with Galaxy.
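For example (just a rough sketch to illustrate the idea, not Galaxy's API), a cache-conscious B-tree packs each node into one contiguous block sized to the transfer unit - a cache line on a CPU, an item here - so a single fetch brings over everything needed to decide which child to descend into:

    import java.nio.ByteBuffer;

    // Hypothetical sketch: a B-tree node laid out to fit a single item,
    // the same way a cache-conscious B-tree sizes its nodes to cache lines.
    final class PackedBTreeNode {
        static final int MAX_KEYS = 32;  // chosen so the serialized node fits one item

        private final long[] keys = new long[MAX_KEYS];
        private final long[] children = new long[MAX_KEYS + 1];  // item ids of child nodes
        private int size;

        /** Index of the child to follow when searching for key k. */
        int childIndexFor(long k) {
            int i = 0;
            while (i < size && k >= keys[i])
                i++;
            return i;
        }

        long childId(int index) {
            return children[index];
        }

        /** Serializes the whole node into one byte[] - i.e. one item, one fetch. */
        byte[] toItem() {
            ByteBuffer buf = ByteBuffer.allocate(4 + 8 * MAX_KEYS + 8 * (MAX_KEYS + 1));
            buf.putInt(size);
            for (long k : keys) buf.putLong(k);
            for (long c : children) buf.putLong(c);
            return buf.array();
        }

        static PackedBTreeNode fromItem(byte[] data) {
            ByteBuffer buf = ByteBuffer.wrap(data);
            PackedBTreeNode n = new PackedBTreeNode();
            n.size = buf.getInt();
            for (int i = 0; i < MAX_KEYS; i++) n.keys[i] = buf.getLong();
            for (int i = 0; i <= MAX_KEYS; i++) n.children[i] = buf.getLong();
            return n;
        }
    }

Descending the tree is then just: fetch an item, deserialize, pick a child, fetch the next item - mostly reads, which become cheap once the upper levels are cached locally.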
Great project. Congratulations on your decision to do real open development from the beginning. Looking forward to your next post.