When Simple Wins: Power of 2 Load Balancing (fly.io)
140 points by mattdennewitz on June 26, 2017 | 45 comments



The method is called "Power of Two Random Choices" (http://www.eecs.harvard.edu/~michaelm/postscripts/handbook20...). And the two-choices paradigm is widely applicable beyond load balancing. In particular, it applies to hash table design (e.g. cuckoo hashing) and cache eviction schemes (https://danluu.com/2choices-eviction/).


You're right, I updated the title. Got a little too clever with the whole "power" thing.


"while each additional choice beyond two decreases the maximum load by only a constant factor"

Mathemagical!


It also works for solving SAT. Try two literals and recurse on the one that can be made to satisfy more clauses.


I'm not an expert in this field, but an engineer at Vimeo went into detail about why this approach did not work for them. [1]

Problem with consistent hashing:

  However, consistent hashing comes with its own problem: uneven distribution of requests.
  Because of its mathematical properties, consistent hashing only balances loads about as
  well as choosing a random server for each request, when the distribution of requests is
  equal. But if some content is much more popular than others (as usual for the internet),
  it can be worse than that.
Problem with Power of 2 Load Balancing:

  Why wasn’t there a way to say “use consistent hashing, but please don’t overload any
  servers”? As early as August 2015, I had tried to come up with an algorithm based on
  the power of two random choices that would do just that, but a bit of  simulation said
  that it didn’t work. Too many requests were sent to non-ideal servers to be worthwhile.
Instead, he used something called Consistent Hashing with Bounded Loads.

[1] https://medium.com/vimeo-engineering-blog/improving-load-bal...


Bounded load consistent hashing is interesting. It makes total sense that two random choices wouldn't work for Vimeo; it doesn't work for us between visitors and our edge either, because we care a lot about cache data. We only use it between our load balancers and our customers' origin app instances.

For what we're doing, we actually need to consider more than just load. Since our LBs are distributed globally, we also want to make sure we're sending requests to backends that are geographically near them.

We can do this by tracking latency between the load balancer and origin servers, then using it to restrict the candidate pool we're going to choose two from at random.
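A sketch of that idea in Python (names and the latency threshold are illustrative assumptions, not fly.io's actual implementation): filter the candidate pool by measured latency, then apply two random choices on load.

```python
import random

def pick_backend(backends, latency_ms, load, max_latency_ms=50):
    # Hypothetical sketch: restrict the candidate pool to "nearby"
    # backends (measured latency under a threshold), then pick the
    # less-loaded of two random candidates from that pool.
    near = [b for b in backends if latency_ms[b] <= max_latency_ms]
    if len(near) < 2:
        near = list(backends)  # too few nearby: fall back to the full pool
    a, b = random.sample(near, 2)
    return a if load[a] <= load[b] else b
```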


They are different algorithms for different purposes.

Consistent hashing is used to always attach a request to the same host. It's the opposite of load balancing.

Load balancing algorithms (least connections, busyness, etc.) are used to distribute requests across servers as evenly as possible to maximize performance.


Usually both properties are desirable, up to a point.

You want to minimize load on all servers, but you also want to pack things up efficiently (to minimize operational costs), and of course you want the benefits of caching, so you want requests from a session to land on the same node/server/box.

Basically a multi-dimensional optimization problem. Completely solvable with constraints. Let the business people decide what's more important, latency or throughput or low cost of operations.


It looks as though the approach proposed in the article is random, rather than attempting to use consistent hashing (as Vimeo investigated). That may be why the results the Vimeo engineers found are worse than those the article suggests.


The simplest load balancing I've done is modulo the user ID by the number of servers then point at that server.

This solves caching too since you are only ever receiving and caching user data on a single server. No cache communication required. You can enforce it on the server side for security as well.

Doesn't require a load balancer server - just an extra line of code.

Keep it simple.
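A minimal sketch of that scheme (server names are hypothetical):

```python
# Hypothetical server names; the "load balancer" is one line of code.
SERVERS = ["app1.internal", "app2.internal", "app3.internal"]

def server_for(user_id: int) -> str:
    # Every request for a given user lands on the same server, so that
    # server's cache holds all of that user's data, with no cache
    # communication between servers.
    return SERVERS[user_id % len(SERVERS)]
```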


What happens when the number of servers changes? The cache hit rate would likely drop to zero until it warms up again, which is a good way to accidentally overload your systems.

Load balancing based on consistent hashing is the better way to implement this.


When the number of server changes you slowly ramp up from mod n to mod (n+1). Flip a biased coin for each user to decide whether to use n or n+1, slowly crank up the bias to the n+1 side.
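One way to sketch that biased coin: seed the RNG on the user ID so the coin is stable per user, which means each user flips from mod n to mod (n+1) at most once as the bias is cranked up.

```python
import random

def server_index(user_id: int, n: int, ramp: float) -> int:
    # Hypothetical sketch of the ramp: with probability `ramp` use the
    # new server count (n + 1), otherwise the old one. Seeding on the
    # user ID makes the coin deterministic per user, so users migrate
    # monotonically as `ramp` goes from 0.0 to 1.0.
    coin = random.Random(user_id).random()
    return user_id % (n + 1) if coin < ramp else user_id % n
```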


Consistent hashing is a bit cleaner way to do it, but pretty much the same result as modulo-ing the user id against number of servers. At least as I understand it, you consistently hash something (a user id, a request URL, etc) into N buckets, where N is the number of servers, so changing N re-shuffles all of the buckets anyway.

Short of something like Cassandra's ring topology, how would you use consistent hashing to add new servers and assign them requests?


You are missing a crucial piece here to have consistent hashing: you also need to hash the names of the servers. With consistent hashing you hash both the names of the requests and of the servers, then you assign the request to the server with the closest hash (under the modulus). With this scheme, you only need to remap 1/n of the keys (where n is the number of servers).
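A minimal ring sketch of this in Python (one hash point per server; production rings typically give each server many virtual points to smooth out the arcs):

```python
import bisect
import hashlib

def _h(s: str) -> int:
    # Stable hash: Python's built-in hash() is salted per process.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, servers):
        # Hash the *server names* onto the circle, sorted by position.
        self._points = sorted((_h(s), s) for s in servers)
        self._hashes = [h for h, _ in self._points]

    def lookup(self, key: str) -> str:
        # A key goes to the first server clockwise from its hash (wrapping).
        i = bisect.bisect(self._hashes, _h(key)) % len(self._points)
        return self._points[i][1]
```

Adding a server only captures the arc just before its hash point, so every key that moves, moves to the new server.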


You're kind of right. You can also use something like jump consistent hash [0] which only requires you to have a consistent ordering of the hosts where you're sending the information. We (Facebook) use something similar for our caches. It requires a linear array of hosts but you've already got that if you're load balancing.

[0] https://arxiv.org/abs/1406.2294
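The algorithm from the paper is short enough to transcribe; this is a direct Python port of the C code in [0]:

```python
def jump_hash(key: int, num_buckets: int) -> int:
    # Jump consistent hash (Lamping & Veglia, arXiv:1406.2294).
    # Maps a 64-bit key to a bucket in [0, num_buckets); when the bucket
    # count grows from N to N+1, only ~1/(N+1) of the keys move, and
    # they all move to the new bucket.
    b, j = -1, 0
    while j < num_buckets:
        b = j
        key = (key * 2862933555777941757 + 1) % (1 << 64)  # 64-bit LCG step
        # Jump forward a pseudorandom distance.
        j = int((b + 1) * ((1 << 31) / ((key >> 33) + 1)))
    return b
```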


That makes a lot of sense, thanks.

Better consistent hashing means that existing servers don't have their caches invalidated, but the new servers that were just added start with empty caches anyway so are fielding all uncached requests. Hopefully the bottleneck is actually with some shared layer behind it (a database or something) otherwise I guess you'd need to come up with a more complex way to slowly distribute more traffic to the new nodes.


Proper consistent hashing will only move (on average) K/N assignments when going from N to N+1 servers, for K original assignments.



though a clever way to ignore the real problem, eventually time will force you to revisit it from first principles

user_id%numb_server may work early on when user activity and uptake are consistent,

but what happens when user activity becomes more complex: increase in users, some users abandoning the platform, others using it more; and that complexity lacks homogeneous distribution through this only concerned property: 'user id';

what if over time you gain more users but the majority of people who drop the platform have a user_id%numb_servers==2|11|13|17

in this case you would have some servers working hard while others sit dormant

what is the real distribution of the relation between activity and user_id over time? asymptotic(o)? similar to the prime numbers(i)? a gaussian distribution(ii)? a benford distribution(iii)?

whichever future data will show to be the best fit, most distributions show a strong trend toward eventually favoring some values

which i think implies, to ensure an even distribution of work across servers, the problem requires something with greater dimensionality than modulo on an immutable value that is defined serially

(o) https://en.wikipedia.org/wiki/Asymptotic_analysis

(i) https://en.wikipedia.org/wiki/Prime_number_theorem

(ii) https://en.wikipedia.org/wiki/Probability_density_function

(iii) https://en.wikipedia.org/wiki/Benford's_law


I can think of a few ways to rebalance things on the fly, but I would probably just hash some immutable values for the user, like ID and name, together with a nonce. If a server gets overloaded, slowly move people off it by changing their nonce.
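A sketch of that nonce idea (field names are hypothetical): the output is deterministic but unpredictable, and bumping a user's nonce reassigns them.

```python
import hashlib

def server_index(user_id: int, username: str, nonce: int, n: int) -> int:
    # Hypothetical sketch: hash immutable user fields plus a per-user
    # nonce. Bumping the nonce moves the user to a new (unpredictable)
    # server without an adversary being able to steer where they land.
    digest = hashlib.sha256(f"{user_id}:{username}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") % n
```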


if you are introducing a nonce why use a hash? with a performance reflective mutating nonce on the user id modulo works as is

if you are introducing monitoring on a per user precision why use modulo? with a per user scheduled monitoring moving users based on user ids works as is

maybe i was unclear in the above but i like the gp's simple solution.. especially because i personally have an affection for the modulo operator, but also because.. it only requires an operator that performs in a scale dependent finitely specific number of cycles and works as designed without any monitoring

the above was intended to bring attention to shortcomings and probable failures in an otherwise elegant attempt

the method is flawed but the direction is superb


You could use the nonce to determine which server to use. But I didn't want to choose directly, just to have the ability to change the output of the hash for whatever reason.

I did not want to use just any non-random rebalancing mechanism, to avoid adversaries attacking that implementation. With a hash the output is deterministic, but unpredictable.


What are your adversarial concerns?


Arguably, that's sharding, not load balancing. If you want to get picky with terminology at least.

Anyway, I do have a point beyond being pedantic: this offers two advantages that a fixed sharding scheme doesn't. #1: it doesn't need to identify a piece of data on the request to shard off of. #2: it actively (though imperfectly) attempts to achieve similar utilization on every server.


Facebook chat used to use this scheme, but eventually had to move away from it because it made it difficult to add and remove servers to the fleet in response to load.


User ID is not exactly random material. Might be better to do mod(md5(user_id), server_count) to scatter the bits using MD5.
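A sketch of that (MD5 is fine here since this is distribution, not security):

```python
import hashlib

def server_index(user_id: int, server_count: int) -> int:
    # Hash first so that sequential or clustered user IDs scatter
    # across servers instead of striping predictably.
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % server_count
```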


That's very simple consistent hashing. Consistent hashing is great when you want to trade even load for localized data.

In fact, we use consistent hashing when we accept requests, and two random choices when we deliver them to the apps. This works much better for _most_ of the apps we see. We're typically worried about cache data for a particular app. The app instances themselves, though, tend to be mostly stateless and disposable.


One problem with this is the "long tail" issue: in many applications, you have a highly active small minority of users, and any server assigned to enough of those users will be overloaded while other servers are underutilized. Since these are also (typically) your most excited / engaged users, this effectively penalizes user behaviors you'd rather encourage.

The other main problem is that it's not a consistent hash: if you grow the server pool, you typically need to reshard a lot of content.

(It's still useful in a pinch, but it helps to be aware of the tradeoffs.)


But where is the modulo being calculated?


[removed, brain failure]


In the original comment, the user mentions that the modulo logic would not require a load balancer server. So I would assume what they meant is that you do not require a high-throughput load balancer. But you still need some entity to do the modulo work, as well as to health-check servers so the modulo is calculated over active servers only.


This is how many horizontally scalable OLTP databases operate too (e.g. DynamoDB, Citus): picking a partition key, then deterministically routing work associated with that partition key to the proper, well, partition.


Regarding the math section, could someone please describe it like you were talking to a 5 year old?

1) Θ( log n = log / log n )

2) Θ(log log n)


There is a proof shown in this handout: https://people.eecs.berkeley.edu/~sinclair/cs271/n15.pdf

It's hard to understand why this technique works so well without digging deep into the math. Roughly speaking, if you throw n balls into n bins at random, the maximum number of balls in any bin grows surprisingly quickly (because of the birthday paradox). However, if we allow ourselves to choose between two random bins instead of one, and put the ball in the one with the fewest balls in it, the maximum number of balls in any bin grows much more slowly (i.e., O(ln ln n)). Hence, that one extra random choice gets us surprisingly close to the optimal approach of comparing all bins (which would give us O(1)), without doing all that work.


Thanks for the explanation! Much clearer and I get the concept. In the case of load balancing, we'd need a ton of servers (1000s?) for this to pay off vs just comparing all, right? Cache updating aside, most of the overhead would be in reading the load numbers in. Comparing a thousand numbers has to be quick in comparison, no?


The problem with load balancing is herd behavior. Stats for load are usually at least a little stale, because it's a distributed system where you can't afford to wait for consistency. When there are traffic spikes, a whole herd of new connections will go to the least-loaded server during the window when the cached "load" number is out of date. Picking two at random helps keep a bunch of connections from racing to one server, even when you're only running 3-4 of them.


That's a really intuitive explanation. Appreciate that.


Thank you sir!


1) Throw n balls into n bins, the bin for each ball chosen randomly

2) Throw n balls into n bins, two bins for each ball chosen randomly, always putting the ball in the bin with fewer balls in it

In both cases you will have n balls distributed over n bins in the end. But the number of balls in the largest bin will be different for the two processes above. In the first case the largest bin has more balls: O(log n / log log n) == O(log n). And the second case has just O(log log n) balls. So just adding an extra choice of bins made the expected largest bin exponentially smaller.

More rough intuition: if x of your bins are occupied, in the first case your next ball has x/n probability of queueing instead of finding an empty bin but in the second it's only (x/n)^2 chance to need to queue.
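The two processes above are easy to simulate (a sketch; exact maxima vary with the seed, but the gap between one and two choices shows up at any reasonably large n):

```python
import random

def max_load(n: int, choices: int, seed: int = 0) -> int:
    # Throw n balls into n bins; for each ball, sample `choices` bins
    # uniformly and place the ball in the least-loaded of them.
    rng = random.Random(seed)
    bins = [0] * n
    for _ in range(n):
        picks = [rng.randrange(n) for _ in range(choices)]
        bins[min(picks, key=lambda i: bins[i])] += 1
    return max(bins)
```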


So that means the expectation value of the maximum scales as O(log n / log log n)?


Generally, yes, but I think `O(log n / log log n) == O(log n)` is wrong.

log(n) / log(log(n)) = log_x(n), where x = log(n) (wasn't sure how to describe the logarithm base in a better way). So you get O(log_x(n)). In general the base of the logarithm doesn't matter for Big-O when it's a constant, but I'm not sure you can apply the same thing to a base of log(n).


small correction, it's Θ(log n / log log n). I noticed though, when I copied the formula from the original paper, this is what I got, too ;)


"Power of 2 Random Choices" ... has nothing to do with the "Power of 2" directly.

I like 2Choice because it is not dependent on hash function design & is temporal, but I have a positive aversion to the 2^n hash distributions when it comes to data, specifically for distributed systems which need to flex up/down [1].

[1] - http://notmysock.org/blog/hacks/1440


I've seen a paper doing the same thing directly at the network layer using IPv6 extension headers: http://www.thomasclausen.net/wp-content/uploads/2017/06/2017...


Can someone expand on the maths that the OP elided? What is the thing that comes out to O(log n / log log n)?



