Parallel Universe open sources a novel in-memory data grid

zimbatm · on July 10, 2012

Here is my understanding on how it works:

SpaceBase is a distributed in-memory chunk store. Each chunk is referenced by an "id" and stores unstructured data (as far as the store knows). Each chunk is "owned" by a single node but other nodes might keep a read-only copy. If another node wants to write to that chunk, it first requires a transfer ownership and the new owner is broadcast. This design is inspired on how L1 caches in CPUs work. The theory is that with MMO, the data is geo-localized and each computer would take care of a part of the whole world. By keeping data local to the machine you avoid network I/O.

There are two things I didn't see in the article. The first is how read-caches are invalidated. Maybe data is eventually consistant or cache-invalidation is broadcast on write. The second is how data from a stale node is recovered. That's the difference I though when I saw the CPU reference.

pron · on July 10, 2012

Yep, except that this project is called Galaxy (we use it internally as part of our commercial offering - SpaceBase).

I will write another post in the coming weeks, explaining the cache-coherence protocol in detail, but let me just say now that Galaxy is always consistent (i.e. not eventually-consistency). Some more details: read-caches are invalidated when a node wishes to write (just like L1 caches), but the writer doesn't need to wait for acknowledgements in order to proceed - this helps with latency. Also, the invalidation doesn't need to be broadcast, because the current owner of a cache-line maintains a list of all readers.

zimbatm · on July 11, 2012

Ah right. Makes sense. You can re-use the readers list that you also use in case of ownership transfer.

Any hints you want to give on how to handle node failures ? For me that's the potential weak point.

pron · on July 11, 2012

Exactly.

Fault tolerance is really the hardest part of the implementation. The documentation contains a detailed explanation of the features and how to use them. A future blog post will explain all algorithms in detail (it's simply too long to post in a comment). In the mean time, you can take a look at the code on GitHub if you feel like it.

paulsutter · on July 11, 2012

Is the invalidation O(N) or O(log N)?

Great project. Congratulations on your decision to do real open development from the beginning. Looking forward to your next post.

pron · on July 11, 2012

If you distribute a balanced tree data structure with Galaxy (say, a B-tree), I think that the amortized number of invalidation messages per update comes down to O(M/N log N), with M being the number of machines in the cluster, and N the size of the data set. I'll do the math more carefully and write a blog post explaining it. In any case, an update operation does not need to wait for invalidation acks in order to proceed, so this cost affects throughput more than it affects latency.

paulsutter · on July 11, 2012

I was wondering about the root updated themselves, given a large number of nodes.

Ill admit that im thinking of a wide area applicstion. Maybe another version of the question is what are you targeting as a max number of nodes (30? 30,000? 30M?) What is your goal for invalidation latency? And, not to quibble, but how much invalidation latency can you have without being considered eventually consistent? I assume you do guarantee the order of the invslidations?

Perfectly cool if the answer to these question is "maybe later" or "maybe not".

Can't wait to see your deeper explanation. I really love this project because it's a whole different take on distributed memory. I'm embarassed that I has settled into thinking that a DHT was -the- way to distribute shared memory, and hope this project provokes lots of us to think more broadly. I don't know for sure that it will have a large number of applications, but I am really glad that we'll all be able to think more broadly about distributed systems as a result of the open work you are doing.

pron · on July 11, 2012

Each node will usually be hosted on a single machine (the only reason to host more than one node per machine is for testing - it isn't require for load balancing like with DHT solutions), so 30M is probably not feasible.

Most communication among nodes is peer-to-peer and uses UDP, so there are a few things that can limit you. First, is your data-structure itself and the amount of contention for items. Second, the first time a node requests an item, it has to find it (no well-known hashing scheme), so it can either multicast the cluster, or ask a central server. Multicasting with 30,000 nodes is probably not the right way to go, so you'll need a server, and that puts a practical limit. Third, Galaxy uses either ZooKeeper or JGroups for cluster management (peer detection, fault detection, leader election etc.), so either of these solutions might also have some limitations. Fourth, if your cluster is huge you might hit network bandwidth issues, though that seems unlikely.

EDIT: Forgot to address the latency question. Galaxy is consistent (and not eventually consistent), so to get a new version of an item from a node you'll need to wait for the invalidation acks. However, the amortized cost is O(M/N log N) no matter what - the larger and taller the tree, the more nodes you'll need and the more nodes sharing the root, but updates to the root will be less frequent (because the tree is taller).

What I find the most interesting thing about Galaxy is the distributed data-structures people will come up with. Some will be better suited to Galaxy and some will not. But because Galaxy's architecture mimics CPU architecture, I think that modern data-structures designed around caches and NUMA would be excellent candidates for distribution with Galaxy.

javajosh · on July 10, 2012

Perhaps this demonstrates a lack of imagination, but I can't see even the sketch of how this piece fits into building a real game. Presumably one has thousands, or even millions, of clients simultaneously interacting with the world and with each other. This system would imply that your (low latency) world-state would be kept in memory on a small cluster of Galaxy servers.

The novel quality of this system is that node data-locality is determined by how you access the data.

Let's say you have 10 server nodes and 10^6 clients. That means that, in the best case, each node deals with 10^5 clients. Presumably clients (or players) can move through the simulated world, hopping from node to node.

Here things get hazy in my mind. For example, how does the client know which server to connect to? Does it just make a guess and get redirected if it guessed wrong? When a player crosses a node threshhold, how does that work?

I'm thinking there must be a central character store - basically a traditional database, that handles initial node affinity. Position in the world is part of character state, and when you login, you'll be handed off to the correct node based on that state.

But if this is how things work, when would you ever need your "cache lines" moved from node to node? The world-state is spatial so why move it from node to node? I guess that's the crux - I can't think of any other data that would need to move between nodes other than player data, and in that case I don't see what data it would need to take with it.

pron · on July 10, 2012

I would really like people to treat this as an experiment in distributed systems design rather than a product for games, but because our main product is intended to be used by MMOs (as well as other industries), let me address your scenario.

> how does the client know which server to connect to?

Galaxy doesn't handle any client connections, only connections between cluster nodes, but if you were to build something on top of that connects to clients, then, yeah, starting with a guess and redirecting is ok. An initial node that simply directs connections might work, too. And if players move from one place to another, having your communication layer telling them to connect to a different node is pretty much what we had in mind.

> when would you ever need your "cache lines" moved from node to node?

Yes, player data, NPC data, vehicle data - anything that moves. BUT, another big reason for data migration is load-balancing. Continuing with your game example, if a lot of players congregate in one area, handled by one machine, you may decide to split it to two machines, and migrate half of the information there.

If you were to use Galaxy for a graph database (forgetting the MMO use-case for now), then while the graph vertices don't "move", changes in the edges might make you decide on a better distribution of the vertices over the cluster.

javajosh · on July 11, 2012

This response fills me with unease. You present MMOs as a core use-case, and then equivocate your support for MMOs, preferring to hand-wave about failover and graph databases.

The best way to experiment with distributed systems design is to build a real distributed system - or at least be able to sketch one out.

Frankly, this seemed like a solution looking for a problem, and your vague responses are supporting this. The interesting dynamic, of being able to move data between nodes as a side-effect of access patterns is interesting, but it's not clear how an MMO could really use this dynamic to good effect. Indeed, even the fault tolerant case, it's not clear how this dynamic would help failover - I mean, would you need to duplicate access patterns prior to node failure to ensure dual-local data?

Frankly, I think you should focus on one use-case (MMO, graph database, something) and hand-wave a complete solution that really leverages the novelty of your approach. Get specific and talk about what actually happens when "things move".

pron · on July 11, 2012

Alright, sorry for the confusion. Our commercial offering (SpaceBase) is very much targeted at MMOs and real-time LBSs. However, like many start-ups we really do like building cool stuff. And while it indeed the case that Galaxy will soon be a offered as a component of SpaceBase, it is has a very different design from other memory-grid projects/products, so we decided to open-source it to the community and let it explore other possible uses.

My post was meant to be an introduction to a series of very technical blog posts discussing theoretical and practical aspects of distributed systems. The post was not meant to serve a clear commercial purpose, so I was trying to steer the discussion away from commercial uses and more to its CS aspects. You know, we really find this stuff interesting. Some of my future posts will discuss the more theoretical sides of Galaxy and will drill very deeply into its design and algorithms, while others will discuss how SpaceBase will make use of Galaxy to help MMOs build huge, rich worlds, and LBSs track lots of moving objects in real-time. To be more concrete and give just a taste, I'll say this: when SpaceBase runs on top of Galaxy, objects are transferred from one node to another to create a dynamic area-of-responsibility for each node. This means that each node will be responsible for processing all objects in some region of the game world (or real world for LBSs). But the regions are dynamic - namely, they shrink and grow to accommodate non-uniform load, so that small busy areas will be split over several nodes, while large, relatively vacant ones will be handled by just one.

Daniel_Newby · on July 11, 2012

> ... when would you ever need your "cache lines" moved from node to node?

The node could be experiencing a soft failure—excessive ECC errors, power supply failure, kernel security update needed, etc.

azakai · on July 10, 2012

Sounds like Project Darkstar,

http://en.wikipedia.org/wiki/Project_Darkstar

epaik · on July 10, 2012

Looks pretty cool, although I don't know how useful this would be for games.

Isn't the only real bottleneck for large-scale real-time MMOs the network bandwidth needed between the server and client? While this tech would improve the efficiency of handling data on the server, it wouldn't be able to solve the inherent problem of network limitations.

Major props for making it open source though, I'll enjoy looking at the code.

pron · on July 10, 2012

Fact of the matter is that, other than EVE online, there aren't any "large-scale" MMOs out there. Most of them are limited to one server per world/shard/instance (and even EVE is limited to one server per solar-system AFAIK), and the number of concurrent players that entails. This is a first step in helping them build bigger worlds with more players, that seamlessly scale over a cluster.

jff · on July 10, 2012

IIRC they can run several solar systems on a server, but they have devoted one whole server to the unofficial trading hub, Jita, because it is so busy.

EVE has the nice property where the universe is already split up into these nice partitions connected by a network of jump gates. It maps very nicely onto a server farm.

epaik · on July 10, 2012

You're right that there aren't many truly large-scale seamless MMOs out there, and I agree that this project would help servers scale their worlds.

Sorry if my previous post came off as dismissive, as a game development enthusiast I was focused only on the application to games rather than the technology itself. The point I was trying to make is that even with this technology, we are unfortunately not closer to being able to create real-time action mmos. For an advancement in this area, we need a leap in network capabilities, rather than server efficiency.

Arelius · on July 10, 2012

Sharding like done in Eve is well established technically in the game development community. And I agree that this basically solves the problem we already know how to solve. The problem is in content, Most MMO's are designed to be content heavy, with zones of high-traffic, this causes a population density problem that creates the exact bandwidth problem you mention.

This doesn't solve any new problems, what it does do is provide a licensable technology that many studios have to re-invent themselves, there is value in that however.

pron · on July 10, 2012

So it's a good thing, then, that, a) it's open-source! and b) it's not even directed specifically at games!

Arelius · on July 11, 2012

I don't mean that it won't be useful in game development, just that I don't think it'll make games particularly more "massive".

epaik · on July 10, 2012

Definitely a good thing! :)

Arelius · on July 10, 2012

I'd argue that the lack of "large-scale" MMO's is more due content production problems, and activity density (I don't care how good this grid database is, there is a density that will bottleneck it, at the very least, it will bottleneck the client's bandwidth pipes if nothing else.) Rather than huge technical limitations. MMO developers are pretty familiar with sharding these days.

rdtsc · on July 11, 2012

I thought of that too. So if you have a large MMO presumable players wouldn't be uniformly distributed in the world. As in, here, there is a fairly constant 100 players per sq km. Rather I see hubs forming (maybe with some power law distribution) -- maybe a large city, market, planetary system, ring of hell, a battlefield that would disproportion ally hold a large number of entities vs other ares.

There then instead of node ownership based on the space grid, you'd want to somehow have node ownership based on clustering/density.

So maybe constantly iterate a K-means clustering algorithm, where servers are cluster centers and every player/client in the cluster belongs to that server.

That would be my back of the envelope approach. It probably has lots of terrible flaws that I haven't thought of yet.

Arelius · on July 11, 2012

I mean, I imagine that the system can handle the density by progressive balancing the tree. The problem is that gameplay doesn't do the same thing, A client wants to see all characters within 10(or w/e) meters, not the closest 10 characters.

pron · on July 10, 2012

This has nothing to do with sharding (except that if games wanted to use this it might help them do away with shards), and not related particularly to games at all (in fact, we think it could also be great for real-time graph databases).

Arelius · on July 11, 2012

I'm under the impression that the dependence on spacial locality in it's parallelization that it would actually perform pretty poorly as a general real-time graph database.

sp332 · on July 10, 2012

Sharding is a difficult design problem. It would be easier to make games (and nicer to play them) if you didn't have to shard the world up.

daeken · on July 10, 2012

That's actually not true. A large part of the reason MMOs typically split the userbase into many shards is that the world design itself won't cope with the full userbase in a single instance of the world. You need a lot more space, a lot more places where interactions happen, etc. Imagine how much bigger the world of WoW would need to be to make the entire userbase work in a single world.

One of my favorite articles is on this very subject: http://www.gamasutra.com/view/news/114192/Opinion_Designing_...

sp332 · on July 10, 2012

I know. Deciding where to make the boundaries, and actually handling the handoffs between shards is difficult. It would be easier to write a game if you didn't have to worry about sharding.

daeken · on July 10, 2012

I think you're misunderstanding what shards are. Shards in this context aren't pieces that interconnect, but rather completely independent worlds. A 'shard' is the entire world where players exist; players from shard A will never see players from shard B.

sp332 · on July 10, 2012

Oh, I've been referring to those as instances. Did I have the wrong term?

daeken · on July 10, 2012

In the case of MMO architecture here are the relevant terms:

    - Shard: Encompasses all parts of a single game world
    - Zone: Area of the game which can't be further broken down; the players in this area are all connected to the same node and performing actions there.
    - Instance: Area of the game which is unique for a player or group of players.  This is done so that players can be together and not be interrupted by others, e.g. someone killing a boss before the group gets to it and having to wait for it to respawn.  These are created on demand by players and go away after the players leave.
    - Node: Single machine that runs parts of a shard.  This could be one zone with a high average player count, or a bunch of zones with fairly low player counts (e.g. large open areas) and maybe some instances thrown in.

Some details here can change from game to game, but most of it's pretty general.

sp332 · on July 10, 2012

Cool, thanks for the breakdown :)

lallysingh · on July 10, 2012

I think you mean latency between client and server. Second, there's a small exponent (let's say 1.3) in your simulation work (collision detection and newtonian motion), that can saturate a CPU quickly. quickly

Your end-to-end latency from a user clicking something and seeing the results is 2x latency + simulation time.

While I like software implementations of hardware optimizations as much as the next guy, what kind of replication distribution do you expect? You're splitting up your data-set in RAM across many machines, yet you've got them replicating data from each other. How much of your data-set do you expect to be accessing frequently?

pron · on July 10, 2012

Each item can be replicated to many machines, but written by one machine at a time. Galaxy works best when items that are replicated to many nodes are updated infrequently. This works very well when distributing data structures like B-trees, when the root of the tree is read by all but updated rarely, while the leaves remain pretty much confined to only one node and are updated regularly.

This is all for latency reasons. Fault-tolerance is a different story.

lallysingh · on July 10, 2012

Ok. Makes perfect sense for B-Trees (as long as you have enough remaining memory/machine for a decent size cache). I think for a game's partitioned scene graph, the change rate would be rather painful.

wmf · on July 10, 2012

It's useful for games that aren't possible today.

ukd1 · on July 10, 2012

Awesome. I'm going to have a play with this when I get a sec.

fredsters_s · on July 10, 2012

love the design

nirvana · on July 10, 2012

I guess if you live long enough you see everything.

I was part of a team that invented "spacebase" back in the mid-1990s! We were able to support millions of simultaneous MMO players in the age when most people were using dialup modems and had vastly less bandwidth and vastly higher latency than they do now. This technology ended up being acquired by Sony, and used as part of the playstation network.

Like this company our work grew out of simulation programming done originally for the military (in this case the DoD) and like this company we provided an API and solution to rapidly partition the space so that the game client would only need to know about objects located near it according to in-game geometry. Like this product ours was fully distributed, etc.

Alas, we were ahead of the age of MMOs, though while World of Warcraft didn't yet exist, Ultima Online did, and there were a lot of other attempts at MMOs.

Nowadays people if there was less temporal difference people would say "They ripped us off!" but I can totally believe this company had the same idea... and they saw a green field because there were no competitors.

The problem is, there were no competitors because (at least back then) game developers were not interested in solutions they didn't invent themselves. Maybe that has changed.

Arelius · on July 11, 2012

> The problem is, there were no competitors because (at least back then) game developers were not interested in solutions they didn't invent themselves. Maybe that has changed.

I'd say that if you disregard indie development, that is still pretty much the case. There is certainly a bigger market for middleware, but I think it's generally met with skepticism.