A DHT is a fairly easy-to-understand technology that can give you a lot of insight into distributed systems. In school we had to build a crawl/index cluster of 8 machines, and we used Chord to split the load. I had no experience with multi-machine computing, but it immediately made sense to split the crawling load by creating a DHT over the URL keyspace. Of course this doesn't balance actual load, but it's a nice approximation to get you off the ground.
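The splitting itself is tiny. A rough sketch of the idea (not our actual code; the node count and the even spacing of node IDs around the ring are assumptions):

```python
# Chord-style split of the crawl frontier: hash each URL into a fixed
# keyspace and hand it to the node that owns that slice of the ring.
import hashlib

NUM_NODES = 8  # the 8-machine cluster mentioned above

def node_for(url: str) -> int:
    # SHA-1 gives a 160-bit key, the same keyspace Chord uses.
    key = int(hashlib.sha1(url.encode()).hexdigest(), 16)
    # With node IDs spaced evenly around the ring, ownership reduces to a modulo.
    return key % NUM_NODES

for u in ("https://example.com/a", "https://example.org/b"):
    print(u, "-> node", node_for(u))
```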
For someone who has never tried to do any Distributed Computing, a DHT can solve many simple scaling problems in a way that is fairly easy to reason about on paper.
I believe the poster was trying to say that load balancing is probabilistic rather than deterministic: load is levelled across the DHT only if the hash function maps keys evenly over the nodes (and assuming each key has an equal amount of work associated with it).
I think the main concern is that in many systems each key doesn't have an equal amount of work associated with it (this sort of thing is usually referred to as a "hotspot").
An example: suppose you have some distributed system storing article metadata and all of a sudden one of your articles becomes very widely shared. The machine that the popular key hashes to gets slammed. Perhaps we'd want to dedicate that particular machine to just that one article, or find some other way to distribute that one article across multiple machines. But we're just using a hash function, so without doing something fancier, we can run into problems when the load suddenly becomes wildly uneven.
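A toy sketch of that failure mode (the article names and the traffic split are made up): the keys spread evenly over the nodes, but the requests don't, so one node ends up with almost all the work.

```python
# Keys hash evenly, but one suddenly-popular key drags ~90% of the
# traffic onto whichever node it happens to live on.
import hashlib
from collections import Counter

NUM_NODES = 8

def node_for(key: str) -> int:
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % NUM_NODES

articles = [f"article-{i}" for i in range(1000)]
requests = ["article-42"] * 9000 + articles  # one viral article + background reads

load = Counter(node_for(a) for a in requests)
print(sorted(load.items()))  # the node owning article-42 carries ~90% of the reads
```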
Solving for read hot-spots is not difficult if you're willing to accept a small read penalty:
Your typical Kademlia DHT keeps k replicas of each piece of data, so you read from nodes near the target node (the node closest to the target key) rather than directly from it. This way, nodes at different points in the network read from many different replicas.
Of course, this depends on your consistency requirements.
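A hedged sketch of that idea (the node IDs, k, and the random replica choice are stand-ins for illustration): with the value stored on the k XOR-closest nodes to the key, a reader can be served by whichever replica it reaches first instead of always hammering the single closest node.

```python
import hashlib
import random

K = 20  # Kademlia's usual replication parameter

def key_of(name: str) -> int:
    return int(hashlib.sha1(name.encode()).hexdigest(), 16)

node_ids = [key_of(f"node-{i}") for i in range(200)]

def replica_nodes(key: int) -> list[int]:
    # The k nodes whose IDs are XOR-closest to the key hold copies of the value.
    return sorted(node_ids, key=lambda n: n ^ key)[:K]

def read(name: str) -> int:
    # A reader can stop at any of the k replicas, not just the closest one.
    return random.choice(replica_nodes(key_of(name)))

served = {read("hot-article") for _ in range(10_000)}
print(len(served), "distinct nodes served the hot key")  # spreads across up to K nodes
```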
Yep, this is what I meant. In my example the URL hostname was the key space, so all of the Wikipedia URLs would go to one node. That probably means that node had more work to do than some other node that got less "interesting" domains.
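Something like this (the URLs are just for illustration): hashing by hostname means every page on the same domain lands on the same node.

```python
import hashlib
from urllib.parse import urlparse

NUM_NODES = 8

def node_for_host(url: str) -> int:
    host = urlparse(url).hostname  # key space is the hostname, not the full URL
    return int(hashlib.sha1(host.encode()).hexdigest(), 16) % NUM_NODES

wiki = [
    "https://en.wikipedia.org/wiki/Chord_(peer-to-peer)",
    "https://en.wikipedia.org/wiki/Kademlia",
    "https://en.wikipedia.org/wiki/Distributed_hash_table",
]
print({node_for_host(u) for u in wiki})  # a single node ID: all the Wikipedia work piles up there
```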