If a picture is worth a thousand words... a well-done animation has gotta be at least 10K words.
Thanks for making an old topic fun to read about again!
100% agree. Although I have experience and knew the material, I read this through because it was a pleasure to read and the visualizations were engaging. Well done!
Great visual - very elegant! Although the denied requests look slightly wrong: perhaps quickly “bouncing” (less ease-in) and off the outside of the receivers would make it more understandable? But still a fantastic demo without any changes, bravo.
I feel like something missing from people's perspective when thinking of load balancing is to consider pull models. All push load balancing algorithms try to somehow predict how busy downstreams are. Some even go as far as having downstreams send back some utilization metrics.
But if you use a pull-based approach, this is all sort of moot. Downstreams will pull work when they're ready for it.
Leastconn is basically a pull model. The server sends the TCP close to the load balancer, which is its way of saying "I'm done with that request and I'm ready for the next one". It's essentially pulling the next connection from the LB.
I'm still looking out for a better explanation of why 'choose 2' is so much more empirically effective than intuition says it should be.
Other than the fact that randomization avoids pathological cases, nothing I know about distributed computing or queuing theory really gets to the bottom of it. It feels like there should be more to it than that.
> Best of 2 is good because it combines the best of both worlds: it uses real information about load to pick a host (unlike random), but rejects herd behavior much more strongly than the other two approaches.
https://brooker.co.za/blog/2012/01/17/two-random.html
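For intuition, the whole technique is only a few lines. A toy sketch in Go (assuming the balancer already tracks an in-flight count per server, and glossing over locking):

package main

import (
    "fmt"
    "math/rand"
)

// server tracks how many requests the balancer currently has in flight to it.
type server struct {
    name     string
    inFlight int
}

// pickBestOfTwo samples two distinct servers at random and returns the
// less loaded of the pair.
func pickBestOfTwo(servers []*server) *server {
    i := rand.Intn(len(servers))
    j := rand.Intn(len(servers) - 1)
    if j >= i {
        j++ // ensure the second sample is a different server
    }
    if servers[j].inFlight < servers[i].inFlight {
        return servers[j]
    }
    return servers[i]
}

func main() {
    servers := []*server{
        {name: "s1", inFlight: 3},
        {name: "s2", inFlight: 0},
        {name: "s3", inFlight: 7},
    }
    chosen := pickBestOfTwo(servers)
    chosen.inFlight++
    fmt.Println("sending request to", chosen.name)
}

One random sample ignores load entirely, while comparing every server makes all balancers chase the same "least loaded" target; comparing just two uses a little load information without the herding.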
I keep having to turn off keepalive between the service and the reverse proxies and I bet this is part of why.
I think Node 12 introduced LIFO queuing for connection pools because they found the cost of silent disconnects from the server to be too high. They got much better histograms by using the most recently freed connection instead of round robin of available sockets.
I don't remember the URL (sorry) but I saw something recently talking about Facebook's queueing approach for requests being FIFO under normal load and LIFO when overloaded, so that under an overload state you serve the number of requests you -can- serve reasonably quickly rather than -everything- timing out.
For node's situation that sounds probably workable but I think a lot of the time I'd prefer to send periodic keepalives to the idle pool connections. However, you can't exactly bake -that- into the runtime so I can see their argument and perhaps the best thing would be "both, plus periodically do a clean shutdown on excess pool members until you need them again."
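Going back to the FIFO-normally/LIFO-when-overloaded idea, the core of it is tiny. A toy sketch (the queue-depth threshold here is just a stand-in for whatever overload signal you actually use):

package main

import "fmt"

// adaptiveQueue serves oldest-first under normal load and newest-first
// once the backlog passes a threshold, so at least some requests finish
// before their clients give up.
type adaptiveQueue struct {
    items     []string
    threshold int
}

func (q *adaptiveQueue) push(item string) {
    q.items = append(q.items, item)
}

func (q *adaptiveQueue) pop() (string, bool) {
    if len(q.items) == 0 {
        return "", false
    }
    if len(q.items) > q.threshold {
        // Overloaded: LIFO, take the newest request.
        item := q.items[len(q.items)-1]
        q.items = q.items[:len(q.items)-1]
        return item, true
    }
    // Normal load: FIFO, take the oldest request.
    item := q.items[0]
    q.items = q.items[1:]
    return item, true
}

func main() {
    q := &adaptiveQueue{threshold: 2}
    for _, r := range []string{"r1", "r2", "r3", "r4"} {
        q.push(r)
    }
    for {
        item, ok := q.pop()
        if !ok {
            break
        }
        fmt.Println("serving", item) // prints r4, r3, then r1, r2
    }
}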
> All push load balancing algorithms try to somehow predict how busy downstreams are.
They KNOW how busy they are. They are the ones tracking and forwarding connections to them. That's why leastconn works in the first place.
But HAProxy, for example, has an option to back-feed weights directly via health checks from the app, so there is a way for the app to signal back-pressure in RR balancing.
> They KNOW how busy they are. They are the ones tracking and forwarding connections to them.
Well, they do when they're the only ones sending work to the workers.
The article uses a literal black box for the load balancer, but there are workloads that are too heavy for a single machine, so in those cases (and others, like HA) you have to have a pool of load balancers. You can try to make those load balancers know everything about what's happening in the whole system but it can be hard and expensive.
Or, you can have them operate on less-than-perfect knowledge. This is what the round-robin strategy does, and just like round-robin, it has its tradeoffs (much simpler, but worse 95th-percentile latency).
All of this is assuming the load balancers and workers servicing connections are the only things running on those machines. In real world usage there can often be other loads on the same hardware, belonging to tenants your team doesn't even have a relationship with, which can complicate things quite a bit.
That’s interesting, I haven’t heard of this in HAProxy. I’ve seen that it has a static weight in the config file for how many requests to send to a backend; what’s this functionality called for dynamic weights?
And how do you expose that?
The pull model works beautifully for background work queues. For load balancing it seems it would take a bit of re-plumbing on the expectations of request/response. How would you envision this working with something like HTTP? It seems it would reverse the direction, which would be interesting.
Very similar to background queues but the load balancer holds the request and dispatches it to a waiting worker, or waits until someone comes to serve it. While the worker handles the request, the load balancer proxies.
The request serving is the same, the way the work is dispatched is just inverted.
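A minimal Go sketch of that dispatch idea (real proxying, timeouts and worker re-registration are all omitted): the load balancer keeps a channel of idle workers, and each incoming request waits until a worker asks for one.

package main

import "fmt"

// request and worker are stand-ins for a proxied HTTP request and a
// backend that has told the balancer "I'm ready".
type request struct{ id int }
type worker struct{ name string }

func main() {
    idleWorkers := make(chan worker, 10)
    requests := make(chan request, 10)

    // Workers announce readiness by putting themselves on the channel.
    idleWorkers <- worker{name: "w1"}
    idleWorkers <- worker{name: "w2"}

    // Incoming requests queue up until a worker is available.
    requests <- request{id: 1}
    requests <- request{id: 2}

    // The dispatch loop: pair each request with the next ready worker.
    for i := 0; i < 2; i++ {
        req := <-requests
        w := <-idleWorkers
        fmt.Printf("proxying request %d to %s\n", req.id, w.name)
        // In a real system the worker would re-register on idleWorkers
        // once it finishes handling the request.
    }
}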
Seems like this wouldn't work as well in situations where you care a lot about latency or resource utilization. If the workers are polling the LB every n ms, then that's an average of n / 2 ms added to _every_ request. Plus additional CPU cycles and network traffic on both ends due to the polling mechanism.
High-volume, low-latency message queueing is a problem people have already had and solved; for their use cases the solutions cost little enough in latency that the advantages were well worth it.
Also there's no reason to poll the LB/queue/etc., you just tell it "I'm ready" and it sends you something to handle when it's got something.
Pull/push here is about who decides when a server is ready to receive another request.
So ... I'm not an expert on actually implementing it, but I've seen systems in practice that -were- in such situations and it worked out extremely well.
If you care about that, having requests queue on a worker that’s already busy (while another worker is idle) is currently the most widely used alternative.
The pull model is similar to having some metric for choosing which backend to push to.
I believe push/pull are two sides of the same coin. You might physically initiate a connection one way or the other. You might in the abstract push or pull information.
Ultimately you are trying to build an oracle that predicts the future.
It's not difficult. The load balancer knows how many requests are in flight to each server. The OP is plainly wrong.
That's why leastconn works in the first place. And leastconn is almost always the one you want. It's almost magical. GC stall on one server? That means it isn't processing, which means every new request will go to the other servers.
One server processing 2x as fast as the others? Well, it keeps its connection count low, so it gets more of them.
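The selection itself is trivial; the only state is the per-server counter the balancer already keeps. A sketch (ignoring locking and tie-breaking):

package main

import "fmt"

type backend struct {
    addr     string
    inFlight int // connections currently proxied to this backend
}

// leastConn returns the backend with the fewest in-flight connections.
// A GC-stalled or slow backend keeps its count high and naturally stops
// receiving new work; a fast backend drains quickly and gets more.
func leastConn(backends []*backend) *backend {
    best := backends[0]
    for _, b := range backends[1:] {
        if b.inFlight < best.inFlight {
            best = b
        }
    }
    return best
}

func main() {
    backends := []*backend{
        {addr: "10.0.0.1:8080", inFlight: 4},
        {addr: "10.0.0.2:8080", inFlight: 1}, // faster server, drains quickly
        {addr: "10.0.0.3:8080", inFlight: 9}, // stalled server, stays loaded
    }
    b := leastConn(backends)
    b.inFlight++ // increment on dispatch, decrement when the response completes
    fmt.Println("next request goes to", b.addr)
}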
What if a server has most of its request processing IO-blocked? It could process more requests because it has spare CPU and RAM, but the load balancer will think the server is fully booked.
I like the idea. I guess you could do something with pubsub to simulate this model. A regular http stack (load balancer and reverse proxies in front of some http framework) would probably need major modifications to work.
It would be great if such a mechanism was part of the http standard, so you could easily connect compliant tools.
You can get a lot of traction with long polling. I use that in a couple of places where we talk to Consul to avoid a lot of polling nonsense. If I could figure out one last spot (where it's buried behind a service call due to separation of concerns), I could remove a lot of the remaining jitter from a rendezvous algorithm (tell me when all servers have made a change)
> It would be great if such a mechanism was part of the http standard, so you could easily connect compliant tools.
Not sure HTTP is the best approach to the suggested method. Since you control both ends, there are surely better protocols to use, like QUIC or something similar (or just straight up UDP).
This seems like it would be more resource demanding on the load balancer to manage that queue? If that's a single queue it's global state... if you partition it, well, that's just push-based load balancing moved inside your "outer" load balancer?
The latency of round-tripping "finished a request, give me another" is probably also best-case ever so slightly worse than "finished one of my requests, loading the next from the buffer" (the HTTP load balanced apps I've worked on have had request queues on the workers too (framework-level, usually)).
Moreso if you had multiple load balancers involved...
Indeed, and not only with HTTP. Stateless UDP protocols are the poster child for bad load balancing as a result of technology “trickling down” from HTTP. Push makes even more sense for those protocols.
Should be something that could quite easily be added to a uWSGI subscription router - subscribing node periodically says “I have N slots available” and the router keeps track of how many requests are in flight.
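Something like this, I'd imagine (a rough sketch; the slot numbers would come from the nodes' periodic subscription messages):

package main

import "fmt"

// node mirrors what a subscribing backend reports: how many request
// slots it has, versus how many the router currently has in flight.
type node struct {
    name     string
    slots    int // advertised by the node ("I have N slots available")
    inFlight int // tracked by the router
}

// pick returns the node with the most spare capacity, or nil if everyone
// is full (in which case the router would queue or shed the request).
func pick(nodes []*node) *node {
    var best *node
    for _, n := range nodes {
        free := n.slots - n.inFlight
        if free > 0 && (best == nil || free > best.slots-best.inFlight) {
            best = n
        }
    }
    return best
}

func main() {
    nodes := []*node{
        {name: "a", slots: 4, inFlight: 4},
        {name: "b", slots: 8, inFlight: 3},
    }
    if n := pick(nodes); n != nil {
        n.inFlight++
        fmt.Println("dispatching to", n.name)
    } else {
        fmt.Println("all nodes full, queueing request")
    }
}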
Least connections is intuitively a very sensible algorithm, but it has one pitfall: if a backend server starts returning errors for whatever reason (e.g. load shedding) quicker than typical responses, it can lead to a situation where a disproportionate number of requests are directed to that one server.
Also typically you'd have more than one load balancer instance, and at least naive least connections requires shared state (backend connection counts) between the instances, while with round-robin you can avoid that?
There is a talk about Fitbit load balancer on YouTube, where their algorithm is described. They used modified least connections algorithm, which also tracks failure rate of each node, decreasing the traffic when error rate is increased. The video is called "A Frontend Server, Front to Back by Zach Tellman", timestamp 33:34.
Least connections doesn't need shared state between the backend instances, but it would need shared state between the load balancers if you had more than one, for sure.
You're totally right, I glossed right over errors. With PEWMA and weighted round robin you can make instances incur a "penalty" for serving errors, which can help you isolate bad servers. It would have been fun to visualise this, definitely.
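The penalty can be as simple as scaling a server's effective weight down on each error and letting it creep back up on successes. A sketch (the 0.5 and 1.05 factors are purely illustrative, not what any real implementation uses):

package main

import "fmt"

// weightedServer carries a dynamic weight the balancer uses when
// distributing traffic. Errors shrink it sharply; successes let it
// creep back toward full weight.
type weightedServer struct {
    name   string
    weight float64 // 1.0 = full share of traffic
}

func (s *weightedServer) observe(statusCode int) {
    if statusCode >= 500 {
        s.weight *= 0.5 // penalty: halve the traffic share on an error
    } else if s.weight < 1.0 {
        s.weight *= 1.05 // slow recovery on success
        if s.weight > 1.0 {
            s.weight = 1.0
        }
    }
}

func main() {
    s := &weightedServer{name: "api-3", weight: 1.0}
    for _, code := range []int{200, 500, 500, 200, 200} {
        s.observe(code)
        fmt.Printf("after %d: weight=%.3f\n", code, s.weight)
    }
}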
It also strikes me that it might be interesting to model least connections with two or three LBs with and without such shared state.
My instinct is that you'd still get most of the advantages even with each LB keeping its own independent connection counts, but given the number of times I've fired up a profiler and gone "wait, seriously, -that-s the slow part?!" I'd suggest nobody believes me and measures it instead.
Inverting this is actually better because it avoids the stale data / herd behavior, i.e. randomly choose X candidate servers and then pick the one with the least load. Also, 2 may be better than 3.
Excellent post, very informative and with nice animations (check the bonus interactive animation at the end of the page). My only comment is to use different colours/colors (or dashed lines) for the graphs; it’s a bit confusing having the same color for different percentiles.
Also take the time to check out other posts/pages by Sam! Well done mate ;-)
Indeed, a really great post! In case you decide to change the colors or need another reason, I also want to add that in general red/green is difficult for color-blind people. About 8% of the male population is affected by red/green color blindness.
Or are you talking about the animations? I reached out to some colourblind folks I knew before publishing and they didn’t flag those, just the graphs. Happy to change the animations as well if they’re problematic. :)
we are so stuck with this push request load balancing it's crazy; if we just switch to pull instead of push, things get much smoother and resources get better utilized
you can't reliably guess if the instance where you will push your request actually has capacity to handle it; even using ML to guess, it will still have thrashing properties
but if you just let instances pull work, things work out for themselves
at the same time, there is so much tooling for http, and it's so natural to use it, that it is actually hard to switch to another transport layer, so now the best we have is some naive bayesian classifiers in the LB and some exponential backoff
here is the difference i had in timings using a synchronous io queue vs http for 2 endpoints, one fast and one slow (endpoints had the same code, just the transport was different)
you can see how the fast endpoint is always fast with the queue transport, but with classic push load balancing it spills latency a lot, because the instance is sometimes busy servicing the slow request
PS: this post is absolutely amazing! and the animations are brilliant! thanks a lot for making it
Honestly, I disagree. It's not like queue systems are rare. They're fairly common and they're often seen as an over-complication! Once-and-only-once queuing is hard. Async programming is hard.
Streaming requests or responses is out the window! Now you're storing the full request and response blobs in your infrastructure instead of chunks at a time in the network layer. Is a whole Netflix movie queued? If we're talking about queued chunks then it's just UDP you're describing.
Connection-oriented designs tend to have more transparency end to end. The synchronous nature means a failed call can be bubbled back through the remote call chain. Failed async calls can be dropped, which leaves ambiguity. The callers need to resort to timeouts instead of closed-connection signals.
Not to mention the issue that this doesn't work for client calls. The response handling from the client is more complex. The async callback would need to be demuxed such that a response can be associated with a call. There's no open connection so you'd need to punch the firewall somehow... honestly it's very messy.
'once and only once' anything is hard, including connection-oriented designs; also, sync calls can fail in quite similar ways to async calls, as per FLP/CAP
the truth is that sync calls (in a normal AWS + k8s example) have like 50 queues between the user's kernel and your program actually doing the work, just considering the listen(2) queues and the network card queues, and the reality is that every one of them can just drop packets on the floor whenever it wants
so in reality, network programming is hard, and even sync things are just a collection of many async pieces from the network card to the userspace
I honestly had never thought hard about reversing the relationship and having workers pull. This point about no longer requiring health checks is a real "woah" moment. Thanks for expanding my mind!
You still need health checks, though? Otherwise, how do you tell the difference between "no traffic + server alive" vs "some traffic + server is dead"? Yeah, you can monitor throughput on a load balancer, but if I ever again wake up from an alert about no traffic being served - I will throw hands.
Yeah, they’d have to exist but it would be for quite a different purpose. I wonder if that would mean we could implement them in different ways, e.g have a health check service that everything pings and if a ping isn’t received for N minutes, assume it’s dead and trigger some replacement routine or alert.
This begins to look a lot more like a software watchdog at that point, and you can even have each service provide a count of outstanding/processed requests per tick and if a server never gets outstanding or processed requests you could have the watchdog kill it off.
server health is necessarily a function of the actual production traffic it receives, determined at the application layer, as observed by a specific observer
it can't be known by the server itself, as (among many other reasons) the server can't know about network issues between itself and any upstream caller
it can't be determined by out-of-band health check queries, because those queries don't represent actual traffic, the simplifying assumption that they _do_ introduces many common failure modes that any seasoned engineer can speak at length about
health checks can be a nice additional signal on top of monitoring actual prod traffic, but they can't be used by themselves, they just don't capture enough relevant information
Agreed on the “whoa moment”. Not needing health checks seems pretty compelling. I’m curious if there are any off-the-shelf pull-based load balancers that don’t require the full HTTP request to be copied onto the queue before handing off to a worker?
a server can return 200 OK to every health check query and 5xx to every production request, is that server up or down? hopefully clear it is down
similarly, if it returns 5xx to every health check query but 200 to every prod request, is it up or down? hopefully clear it is up
if you want to monitor the status of a server (or application) for internal purposes, that's fine, but that's (at best) supplementary signal for the decisions made by a load balancer, not something you can make load balancing decisions on in isolation
Doesn't it just flip the registration/check need instead of eliminating it? You'd need to register the brokers/loadbalancers with the workers so they know where to pull work from?
Oh yeah, the number of outages we had at my last company because of weighted round-robin registration/healthcheck/watermark/latch issues was way more than anyone would guess.
In Caddy we've implemented "dynamic upstream modules" which allow an arbitrary (compiled-in) plugin to give the reverse proxy a list of upstreams that can take the request. A pull-based mechanism could relatively easily be constructed with this feature: https://caddyserver.com/docs/json/apps/http/servers/routes/h...
I could imagine a dynamic upstream module that receives UDP or TCP packets from backends with a number estimating how many connections it can handle at that time. The module then tells the reverse proxy which upstream has the highest number and the reverse proxy selects it.
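The bookkeeping side of that would be pretty small. A sketch (not Caddy's actual module API, just the table such a module would maintain from "address capacity" datagrams):

package main

import (
    "fmt"
    "net"
    "strconv"
    "strings"
    "sync"
)

// capacityTable remembers the latest self-reported capacity per backend.
type capacityTable struct {
    mu   sync.Mutex
    caps map[string]int // backend address -> estimated spare connections
}

// best returns the backend that most recently reported the highest spare
// capacity (empty string if nothing has reported yet).
func (t *capacityTable) best() string {
    t.mu.Lock()
    defer t.mu.Unlock()
    bestAddr, bestCap := "", -1
    for addr, c := range t.caps {
        if c > bestCap {
            bestAddr, bestCap = addr, c
        }
    }
    return bestAddr
}

// listen reads datagrams of the form "10.0.0.5:8080 42" and records them.
func (t *capacityTable) listen(conn net.PacketConn) {
    buf := make([]byte, 256)
    for {
        n, _, err := conn.ReadFrom(buf)
        if err != nil {
            return
        }
        fields := strings.Fields(string(buf[:n]))
        if len(fields) != 2 {
            continue
        }
        c, err := strconv.Atoi(fields[1])
        if err != nil {
            continue
        }
        t.mu.Lock()
        t.caps[fields[0]] = c
        t.mu.Unlock()
    }
}

func main() {
    conn, err := net.ListenPacket("udp", ":9999")
    if err != nil {
        panic(err)
    }
    defer conn.Close()
    table := &capacityTable{caps: map[string]int{}}
    go table.listen(conn)
    // A dynamic upstream module would call table.best() for each request.
    fmt.Println("current best upstream:", table.best())
}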
Moving to an async model adds a new set of operational challenges as well as some interesting failure scenarios. (Edit) Also, in practice you would need at least one more system to enqueue requests into the broker, as the latter would typically not be exposed to the outside world.
Request/response on the other hand is much simpler to configure and operate.
the lb puts a request on the queue with some reply_to (ip:port) where it waits (blockingly) for a response from whoever picked up the request; it just does not know who that is until a reply comes
As an example of a failure scenario, how does your system distinguish between a request timeout, a response that didn’t get sent back because of network failure and the consumer crashing and losing the message?
// wait (blocking) for the reply that matches this request's uuid,
// or give up when the timeout channel fires (e.g. one made with time.After)
for {
    select {
    case reply := <-replyChannel:
        // ignore replies meant for other requests, keep waiting for ours
        if reply.Uuid == r.message.Uuid {
            return reply
        }
    case <-timeout:
        return makeError(r.message, MessageType_ERROR_CONSUMER_TIMEOUT, "consumer timed out")
    }
}
not much different than what you do with normal http timeouts: you send a request, sometimes a response comes and sometimes it doesn't; it's up to the load balancer to decide if it wants to retry or error out
also, a queue does not mean an async model, it means a queue; there are many queues in http requests/responses (e.g. the listen(2) backlog queue itself) and that does not make them async :)
“Message queues implement an asynchronous communication pattern between two or more processes/threads whereby the sending and receiving party do not need to interact with the message queue at the same time.”
If we're not talking about an async model then the suggestion is much less drastic than it sounded at first. In that case the crux of your desire is simply allowing the hosts to signal readiness more directly.
You would almost never actually wait for host machines to dial in. You would have a list of hosts that are ready or not ready as they would almost always be ready for more. You want to assume readiness (as this lowers latency) and feed the fire hose.
But in this interpretation, in a world where an LB would be using an existing connection to host machines with HTTP/3, we're basically already there. I suppose it's trivial and standard to signal unreadiness to the LB from the host with a 429 Too Many Requests response code.
Off the top of my head I'm trying to think how a host could actively signal to an LB that it's ready for more requests... I suppose it's trivial and common to use a health check. Is it even a change to say that these need to be updated to achieve your goal of host-to-LB pulling?
...that's just the push model. "Signalling" via "well, the loadbalancer has 10 sessions max per server" is enough.
The pull model just adds unnecessary RTT.
> Off the top of my head I'm trying to think how a host could actively signal to an LB that it's ready for more requests... I suppose it's trivial and common to use a health check. Is it even a change to say that these need to be updated to achieve your goal of host-to-LB pulling?
Like this.
There is rarely a case where you decide not to serve the next request after serving the previous one, so push is optimal for short ones. And if a server doesn't want more, it can just signal that via the healthcheck.
Pull makes more sense for latency-insensitive jobs like "take a task from the queue, do it, and put the results back", e.g. if you make a video encoding service that dynamically scales itself in the background, where "just do one encode and exit" is commonplace.
it is not possible for a remote destination host to signal to a sending host that it is ready, or not ready, for more requests, in a reliable way
readiness is not knowable by a receiver, it is a function of many variables, some of which are only knowable to a sender, one obvious example is a network fault between sender and receiver, there are many more
even the concept of "load" reported by a receiving application isn't particularly relevant, what matters is the latency (and other) properties of requests sent to that application as observed by the sender
health is fundamentally a property that is relative to each sender, not something that is objective for a given receiver
That's the push model when per-server maxconn is full, though.
The biggest benefit of the pull model is not having to update the backend server list every time you add/remove one, but outside of that it isn't really all that beneficial.
You also get added latency, unless each backend server is actively listening and connected, but if it is, you're just wasting an extra RTT to say "hey, there is a request in the queue, do you want it?"
Some of these protocols are a good improvement when looking for the ultimate in speed and I used them where applicable, but as a first order approximation a simple load balancer approach is quite good enough. Especially when you do not have dedicated people vested in that domain for maintenance.
That sort of pull architecture is an interesting thought. I do see it working best when a worker node's ability to do work is very clear-cut, i.e. the response time does not significantly vary with the number of concurrent requests it is handling.
Here is an example where it would not work well: let's imagine workers are doing a CPU-bound task and each worker has one CPU (for simplicity), but we have also identified that we can handle up to 10 requests concurrently on one worker, they will just be slower (because they will get smaller slices of CPU). Ideally we'd wish the requests to be divided equally between the workers, but with a naive pull model one worker could greedily grab requests up until it's at capacity while the other workers stay idle, making the response times worse than they need to be.
With a pull architecture you wouldn't identify a request queue depth up front. Rather, each incoming connection gets delayed until a worker attaches to, e.g., the TCP stream. If you're serving a webpage the client wouldn't see anything other than a little load lag, especially if your load balancer took care of the TLS connection before pausing to wait for a worker.
So if you can handle every incoming request with one worker that worker gets them all. Otherwise each worker pops off the stack as it becomes free. And if you can service more concurrent requests you just add more workers.
> So if you can handle every incoming request with one worker that worker gets them all. Otherwise each worker pops off the stack as it becomes free
As I alluded to, that works great if "can handle" and "being free" are clear-cut binary properties. But for complex applications, when you are driving for high utilization while keeping latency down, those questions become complicated; a worker might have some free capacity to handle requests but that doesn't mean it would produce a response as quickly as some other (more idle) worker.
In other words, the problem is not just assigning requests to workers that can handle them but assigning requests to workers that can handle them with lowest latency.
Not an expert in load balancing, but in similar problems (work sharing / work stealing, MPI get/put) it makes sense to pull only if you can pull fast enough to avoid incurring prohibitive latency at every request/message.
Multithreading-based work stealing à la Cilk relies on extremely cheap thread mechanisms and implementation to minimize communication.
In another similar situation, HPC switches are credit based so that until you hit congestion, you can “instantaneously” know if a remote is ready to receive.
This isn't a formal explanation, of course.
Edit: after some thought, that's not really the distinction that is made for load balancing. There's already knowledge of the remote state required for pushing to the least loaded queue. So the difference between pull and push is about having one queue vs several. In that sense it is like supermarkets that implement the more efficient single queue feeding every cashier vs the more traditional one queue per cashier. In supermarkets there is a choice to make because there are other constraints, but just optimising for load balancing it's strictly better to have a single queue, if you only have one input.
Having enough workers to do the work shouldn't be any more constraining than it was before.
The article does nicely mention that simple round robin actually has lower latency, because some traffic gets lucky & goes to under-utilized machines. Unfairness helps some traffic go faster. The queue is probably going to eliminate this, but the unfairness advantage comes at the cost of a lot of other traffic getting put into long queues on workers, so it wasn't really a good thing anyways. The p90+ is usually awful.
A pull approach seems difficult to manage when you have many layers in your load balancing.
In small setups, you may just have one layer with a single load balancer (well, hopefully at least a hot-warm pair), but larger setups often have multiple levels. There may be a network level traffic split to multiple frontend load balancers via something like ECMP; those frontends may connect directly to the origin hosts, or maybe there are frontends in many locations and they connect to backend load balancers near the origins.
In this bigger case, managing pull requests becomes difficult, because balancing may be unequal at earlier layers --- if your origin can handle N concurrent requests and so sends N pulls, how many should it send to which of the upstreams? And if some upstreams get many requests and some get zero, those many requests will have unnecessary delay.
There's also unnecessary delay when at capacity between when one request finishes and the round trip of sending a pull and getting the next request.
But, it's always tradeoffs. It depends on the volume of requests, the typical time to process a request, behavior at or near capacity, etc.
I also think a pull based system is more work for the load balancer, and load balancers are harder to scale --- I prefer to move the work to the origins as much as possible, because it's typically easy to add more of those --- that's what the load balancer enables. But, that doesn't seem to be a commonly held opinion, direct server return is rarely available, load balancers commonly do TLS termination, and often intense traffic inspection and manipulation; again, there's tradeoffs.
Nice to see someone mention direct server return, or as BigIP called it, nPath routing. This was an effective scaling method for handling small requests that returned large payloads (audio and video files). I don't know how well known this configuration is or whether it is still viable in an all-TLS world.
It doesn't seem particularly well known. It works fine with TLS, but the origin servers need to do the TLS termination (IMHO, this is better for security than having your load balancers do it, but it does mean you have to work harder on key distribution). On a non-DSR load balancer, doing TLS termination on the origins means the load balancer has less application data to work with (request path, response status code, etc) in making load balancing decisions, but for DSR, the load balancer never had any of that, so adding TLS doesn't disadvantage the load balancer any more.
TLS session establishment is expensive, so why would I want my load balancers to do it anyway? :P
Should you wish to give DSR load balancing a go without having to invest in hardware/licenses you could try https://github.com/davidcoles/vc5
Put that in front of some HAProxy servers to do TLS termination and farm out requests to another layer of NGINX/uWSGI boxes and Robert is your cousin's father.
It kinda fell out of favour when even a midrange server can do tens of gigabits of TLS traffic. So you need to be very big traffic-wise to make it worth it.
Making a high reliability queue that can also soak load spikes is non-trivial.
Now that we have hundreds of Gbps of ethernet and TBs of memory the idea has more merit; it can scale pretty absurdly high with mundane systems. Or maybe you have sharding, which means now you have a load balancing problem again, of picking which work queue to take work from.
The HA bit is still hard. You have to figure out if there's a netsplit (some folks can't connect to one server) or if one server really is gone. Probably just multicast to each queue all the incoming work & all the incoming pulls. Ideally each queue could also hear all the outgoing traffic. If ethernet capacity were unidirectional this would be great, box #2 could autonomously detect faults & take over. But ethernet is bidirectional, and now it needs all of box #1's incoming traffic and its outgoing traffic too. So instead maybe have the clients fail over. We can iterate on the design but HA is non-trivial.
if your load balancer holds on to each received request until a downstream server happens to pull that request and process it, then it's not a load balancer, it's a queue
different concept, different semantics, different results
Interactive diagrams and visuals are worth so much with these topics. I just created documentation at ngrok (full disclosure, I work there!) for how to do load balancing with their edges product the other day! Now I have something to refer people to when describing these.
I'm going to guess the style of this excellent effort was inspired in part by this popular HN poster? Creating a post in this manner with animations does take a lot of work but it's very effective and instructional:
Yes! Bartosz Ciechanowski is a huge inspiration to me. This post was my first attempt to imbue some of his style into my writing. I'm not up to his standard by a long way, but I'm very happy with my results. :)
That page was indeed a masterpiece in visual communication.
Math and Music curriculum teachers and educators should spend time on making all topics approachable like this. It would be a great way to supplement learning by making it interactive. Instead of using iPads as glorified PDF readers.
Sadly the quality of, e.g., our math books in Germany is so low in many cases that I do not see this happening ever.
Several years ago, I implemented an application specific load balancer where the LB kept two queues. One was free_workers and the other was open_requests.
Initially, upon startup, each worker registers itself with the LB and gets added to the free_workers queue.
When requests arrive at the LB, the LB checks if there are free workers available. If yes, it dequeues a free worker and dispatches the request to that worker.
If no free workers are available, the LB adds the request to the open_requests queue.
When a worker finishes its work, it lets the LB know and the LB adds the free worker to the end of the free_workers queue and initiates another round of dispatching.
The parameter to watch for is the queue size for the open_requests queue.
(There were a few more nitty gritty details, but that was the concept at a high level)
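For anyone curious, the core dispatch logic of a scheme like that fits in a few lines. A reconstruction/sketch of the idea, not the original code:

package main

import "fmt"

type dispatcher struct {
    freeWorkers  []string // workers that have registered and are idle
    openRequests []string // requests waiting for a worker
}

// dispatch pairs waiting requests with free workers, in FIFO order on
// both queues, until one of the queues empties.
func (d *dispatcher) dispatch() {
    for len(d.freeWorkers) > 0 && len(d.openRequests) > 0 {
        w := d.freeWorkers[0]
        r := d.openRequests[0]
        d.freeWorkers = d.freeWorkers[1:]
        d.openRequests = d.openRequests[1:]
        fmt.Printf("sending %s to %s\n", r, w)
    }
}

func (d *dispatcher) onRequest(r string) {
    d.openRequests = append(d.openRequests, r)
    d.dispatch()
}

func (d *dispatcher) onWorkerFree(w string) {
    d.freeWorkers = append(d.freeWorkers, w)
    d.dispatch()
}

func main() {
    d := &dispatcher{}
    d.onWorkerFree("worker-1") // worker registers at startup
    d.onRequest("req-1")       // dispatched immediately
    d.onRequest("req-2")       // queued: no free worker
    d.onWorkerFree("worker-1") // worker finishes, req-2 dispatched
    // len(d.openRequests) is the number to watch and alert on.
    fmt.Println("queued requests:", len(d.openRequests))
}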
But then your LB is a single point of failure! Even if you were to spin up duplicate copies of the LB that could be failed over to in real time, there is still the drawback that you are bounded by the number of requests you can store on a single machine.
I think what you actually built is a message queue engine that forwards the data to the consumers, not a load balancer.
A small note on "least connections" load balancing. The article says:
Because the load balancer sits between the server and the user, it can accurately keep track of how many outstanding requests each server has.
and this is the common case. But there are some cases, such as when serving audio/video streams (looking at you, porn :) where the server will send its response straight to the user, not back through the load balancer. So, the load balancer doesn't know how busy the server is, unless that information is communicated separately, through some other channel. One variation is "least sessions", where the number of active sessions rather than connections is used, and as mentioned, that info is stored separately from the "main" connection, and available on the load balancer for its decision-making.
I would in fact argue that in large systems it is infeasible to have a load-balancer sit in between. Especially if you adopt a microservice-based architecture: it doesn't make sense for every microservice to have its own load balancer to proxy all traffic. Instead the load balancer simply assigns traffic to specific services without being in the middle. It is the client's responsibility to obey the load balancer's decision.
> But there are some cases, such as when serving audio/video streams (looking at you, porn :) where the server will send its response straight to the user, not back through the load balancer. So, the load balancer doesn't know how busy the server is, unless that information is communicated separately, through some other channel.
The load balancer can still see the incoming connection as long as the user is sending something (even if returning traffic goes directly), so at the very least it can keep the number of connections the same.
Of course one user might stream 240p while another streams 4K, but in HAProxy, for example, there is an option to use an external agent for health checks, which also has the option to modify traffic weights and cut traffic to servers that are being swamped.
I didn't really want to go into direct server return. I think most people will be more familiar with using something like nginx as a reverse proxy and load balancer at the same time. It is very cool tech, though. :)
Yeah this was the most confusing part of the whole article to me, and seemed like the most important. I was like "how is it keeping track of how many connections are still being processed?"
In most situations (not all) the load balancer is actually a proxy. So when you type in example.com, the dns points to a load balancer. You connect to a load balancer, the load balancer opens a connection with the desired server (probably sitting in a private network somewhere), and proxies the traffic to the server.
You never actually connect to the server in this case. You stay connected to the load balancer the whole time, the load balancer makes a request to the server, which processes your request, and sends it back to the load balancer, which then proxies the data back to you in the response.
In this case the LB sits between you (the user) and the actual server running the application. The LB knows how many connections each server has in the system because it is actually the one connected to the servers. If a server drops a connection, the LB knows because it's the one that is actually losing the connection. So it is fully aware of what all the servers are doing in this scenario.
The only thing the load balancer might still need to do is ping servers for health checks to make sure they are online when the servers are not connected, this is also how it knows latency and other metrics.
Great post! I would have loved to see P2C (Power of 2 Choices) in there as well, which is typically a better alternative to Round Robin and Least Connections.
P2C is really cool, but it would have meant having to talk about load balancers with incomplete information. This felt like slightly too much to add to an already-quite-long post. It also would have added an extra layer of complexity to my already-quite-complex simulation code :sweat_smile:
One thing I found interesting: if you go with PEWMA and create a scenario where the cluster is stressed, and then add 1 server, it pummels the shit out of the new server and you have a brief surge in failed requests.
Not sure if that is a real world issue, or just with the simulation...
This is very likely a bug in the simulation. My simplified implementation of PEWMA prioritises servers that have had no traffic, in order to send at least 1 request to all servers. There will be a window, until this new server serves its first request, where it is considered the highest priority server.
I doubt very much that this would be part of any real world implementation
I'm not familiar with PEWMA, but real load balancers sometimes have this problem. Either because of dynamic weighting that slams the new server which shows zero load, or because the new server needs to do some sort of cache warming, whether that's disk or code or jit or connection establishment or ???, a lot of times early requests are handled slowly.
Most load balancers should have a way to do some sort of slow start for newly added or newly healthy servers. That could be an age factor to weighting, or an age factor on max connections or ???. Some older load balancers are just not great at this, so you develop experienced based rules like 'always use round robin, leastconn will kill your servers with lumpy loads'. All that said, and a repeated theme across my comments in this thread, the more sophisticated your load balancing is, the harder your load balancer needs to work, and the sooner you need to figure out how to load balance your load balancers.
It should happen in the real world as well, at least that's what I've been told when I started my first job as a system admin.
The reason people cited to me back then was that the balancer usually isn't particularly smart when balancing, so it only sees a free node, thus every new request is routed to it. The errors (mostly timeouts) will happen once the requests start to actually get processed.
Normally, the node gets a steady amount of requests over time, thus the load is constant (generally speaking, a request will require the most resources at the same relative time in its lifecycle). As all the requests are fresh, they'll all hit the same load bottleneck at the same time, causing all the timeouts.
The answer is to both aggressively scale horizontally and then quickly decommission until you're back to baseline.
Or just accept the failed requests
It's been over 10 years though; it might've been improved since.
I don't know anything about this subject, but my first thought (which may be wrong) would be to just set the weight of the new server to be the same as one of the other servers that are receiving messages (perhaps one of the lower ranks). In that way, it would not be overloaded so easily and would adjust its ranking after a while.
I guess my explanation was lacking then, as that wouldn't help. reducing the weight below the old nodes might work, but it would also extend the duration you're overloaded, which would also cause requests to fail.
Leastconn also has a very nice characteristic: if your app, say, gets into a long GC, the balancer will near-immediately stop sending connections there.
Also the "overrun" of dropped connections shouldn't really happen if:
* incoming connections are < total capacity
* the loadbalancer has server limits set properly and doesn't send so many connections that the app starts dropping them
* the loadbalancer itself is the one queuing (as it "knows" the next server that should get the connection)
So that part of the description is subtly wrong; you WANT to queue on the loadbalancer, not on the app servers.
The app server should have a queue just long enough to feed all the threads but not enough to start dropping anything.
Beautiful animations! I love how they make understanding this more intuitive and can be adjusted to play with different settings.
One thing that I generally felt is missing in these discussions is how TLS messes up load balancing strategies. I did my own research on what to do when you want to have TLS and not terminate at the load balancer. I covered that in a blog post available here: https://er4hn.info/blog/2023.02.18-tls-load-balancer/
I have a simple alternative method in mind for SAAS (software as a service) apps.
Manual/statistical load balancing --- assign users to a specific server based on their login credentials. A statistical model of server utilization can be maintained and users assigned or re-assigned as needed. Latency can be reduced to zero by simply forwarding the connection to the proper server once the login is complete.
The obvious downside is a custom load balancer implementation is required.
Does anyone have any experience using NodeJS as a load balancer for something like this?
That's very common with stateful applications. Lots of people use HAProxy or other application-aware LB's to keep the sessions "sticky" to a single app server.
As others have mentioned, this is a kind of sharding strategy. It doesn't get rid of the need for load balancing, and it's really going the opposite direction of what we know to be reliable.
> Manual/statistical load balancing --- assign users to a specific server based on their login credentials.
What happens when that specific server goes down? Needs an upgrade/deployment? You'll have to failover to a different server, which brings you back to an automatic load balancing strategy.
> Latency can be reduced to zero by simply forwarding the connection to the proper server once the login is complete.
Modern load balancers add a meaningless amount of latency per request. If you're truly forwarding it in the networking sense, then a load balancer/reverse proxy is still involved.
If you mean something like redirecting them to an endpoint that points directly at an individual server, you get back to the first problem. What happens when that server goes down?
> A statistical model of server utilization can be maintained and users assigned or re-assigned as needed.
This is one of those things that sounds _very simple_, but in practice is incredibly complicated.
> You'll have to failover to a different server, which brings you back to an automatic load balancing strategy.
Or to a manual load balancing strategy. What I have in mind is being able to easily re-direct users from one server to another using a simple CLI utility. This won't entirely eliminate downtime issues but it will (hopefully) mitigate effects to a manageable level.
In the era of cloud computing, downtime has become less of an issue.
> This is one of those things that sounds _very simple_, but in practice is incredibly complicated.
I like attempting to simplify supposedly complicated issues. What I have in mind is simply counting the requests each server handles and using this as a simple measure to compare utilization. It's true that all requests are not equal but statistically, over time, with all servers being similar, the differences will tend to balance out.
> Or to a manual load balancing strategy. What I have in mind is being able to easily re-direct users from one server to another using a simple CLI utility. This won't entirely eliminate downtime issues but it will (hopefully) mitigate effects to a manageable level.
In practice, this means that each time a server goes down someone has to be on-call to run a command to redirect them. It's also breaking your utilization-based sharding scheme.
> In the era of cloud computing, downtime has become less of an issue.
Well, yes and no. Downtime is less frequent because of robust, automated load balancing. Individual servers, whether VMs or containers or whatever you prefer, are far less reliable. That's intentional. It's cheap commodity hardware, designed to die, and 'cloud native' applications are supposed to handle that properly via things like automated load balancing.
> I like attempting to simplify supposedly complicated issues ... It's true that all requests are not equal but statistically, over time, with all servers being similar, the differences will tend to balance out.
This is an example of one of those simplifications that seems intuitive but just doesn't work. It is completely normal for there to be multiple orders of magnitude differences in request cost, between customers, and at different times. Even if you assume that your application is static (which it hopefully isn't), customer workloads are not. Their behavior will change, which means your sharding needs to change. This is already solved by existing load balancing algorithms described in the linked article.
What problem do you see with existing solutions that you're trying to solve?
If I'm understanding the ask, you can certainly do this with Caddy. You could use the `forward_auth` directive to proxy to one upstream to authenticate the connection (by looking at the request headers contents) and possibly ask it to give you an upstream address as a response header, then you can use `reverse_proxy` to proxy the original request to that address from the auth request's header. You could also implement your own dynamic upstreams module (in Go) to do custom upstream selection logic. And Caddy has plenty of load balancing policy options to choose from (and a few more improvements in PRs I opened last week).
There are lots of good replies already, but I'm curious why you would even want to do this. Web applications have been moving away from stateful servers for ages. Ideally your server instances are completely disposable. If you need to persist and share state between them, there are great mechanisms: the database, a queue, cookies, etc.
What I have in mind isn't really a "proxy" but more of a login/redirection server.
A "proxy" is middleware which directs all communication through a single server which adds to latency.
What I have in mind will run logins through a single server. But once the login is complete, any further communication is redirected to the proper work server to continue without any proxy middleware involved.
This won't entirely eliminate downtime issues but it does limit the effects to a reasonable level while offering increased efficiency and decreased latency.
We used that to make an SSO login site that works independently of what is on the backend. The logic was basically:
* if there is no/invalid SSO cookie, SPOA sets a flag which makes HAProxy redirect to the SSO app
* if there is a valid cookie, decode it and send the data (usually just the logged-in user name) to the app in a header
Once the cookie is correct it doesn't need the SSO server, so it is pretty fast for users that have already logged in.
It can also be used for blocking requests based on an external engine; it's pretty flexible overall.
Did this 20 years ago by having the name of the server as part of the user's profile.
Users 1 through 50 (light users) log in and their profile says they go to app-1.myapp.com.
Users 51 through 60 (heavy users) log in and their profile says they go to app-2.myapp.com.
A specific user may pay extra to have a non-shared environment, and this supports that as well.
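The whole "algorithm" is basically a lookup at login time. A sketch (the hostnames are made up):

package main

import "fmt"

// profile stores the app host assigned to a user at provisioning time.
type profile struct {
    userID  int
    appHost string
}

func main() {
    profiles := map[int]profile{
        7:  {userID: 7, appHost: "app-1.myapp.com"},    // light user, shared box
        55: {userID: 55, appHost: "app-2.myapp.com"},   // heavy user
        99: {userID: 99, appHost: "app-vip.myapp.com"}, // pays for a non-shared environment
    }
    // After login, redirect the user to their assigned host.
    fmt.Println("redirect to https://" + profiles[55].appHost)
}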
I did something similar way back when; only the server identifier was part of the session cookie, IIRC. Whichever server behind the LB started the session got all the requests for that session. It was more like a user load balancer vs a request load balancer.
This is called sharding, but the question is more about the assignment algorithm(s) supported by the sharding algorithms, which are not always flexible enough to support the gp's suggestion.
I really appreciate you saying this, you have no idea how much it means to me.
I spent many evenings and weekends tweaking this asking myself "is this intuitive to someone whose only experience with this topic is everything prior in this post?"
It's important to me that every section is grounded only in all of the previous sections. One of my fundamental beliefs is that anyone can learn anything, provided they're presented the material in the right order.
This is good teaching. You can get a concept across in a few minutes, or you can make someone struggle for hours to understand. I think classic academic teaching leans towards the latter (imagine a LaTeX PDF that introduces an alphabet of Greek letters), vs. this, which teaches you the concept pretty quickly. You can then do the Greek letters version later.
AWS Application Load Balancers seem to default to round-robin, would be interested to hear how many people change this to their "least connections" equivalent called "LOR" [1]? And why they don't support any other options?
I'd guess AWS default to round robin because it's the least complex, and doesn't have issues like if a worker responds extremely quickly due to an error (eg. returns a 500 immediately) then using LOR it seems it would consume all the requests until it's taken out of service by any health checks. But maybe there are other potential downsides?
You point out a good gotcha that most people don't notice - many host-specific errors are often faster than the standard response and more traffic will route to those hosts with a performance-based heuristic. A well configured LB would have hosts with responses that take them out of rotation for known fault conditions - but at scale that's hard for companies to validate.
At scale, with also having scaled systems engineers, it's not, like, impossibly hard. A sidecar like Envoy can be configured to emit health stats which can then be read by the load balancer to consider a given server unhealthy. Again, at scale, each team is already responsible for a dashboard with health metrics for their service, so the load balancer team doesn't have to try and determine everybody's health metrics, only their own.
Conspiratorially speaking, the way to handle AWS' load balancer algorithm shortcomings is to run more EC2 instances, which, naturally AWS profits from.
Depending on the app, it might not have better performance. The reason is that a host with more connections might not be using more resources than one with fewer connections. The busy connections might be waiting for a reply from an external service.
Good post. Although, this just scratches the surface. There are also caching mechanisms involved in handling requests that might make similar requests faster when served by the same host. Now you're tempted to add sticky sessions to the mix and deal with the perks and problems that come with that.
Absolutely! I had to pick a stopping point in order to restore balance to my personal life :D
I'd love to see people taking this as inspiration and covering the more advanced topics. I've already started on my next post and it's about a completely different area to load balancing. Very excited.
You could go with sticky sessions, or consistent hashing to maximize cache hit rate. Then you may run into hotspots and look at bounded-load consistent hashing, etc.
This is so timely for me, I've been wondering how I can visualize load balancing algorithms for the last few weeks.
When trying to explain why round robin is bad, I often try to describe a situation where a backend cluster with a wide variance in performance can cause a cluster of clients using round-robin load balancing to start developing a sort of harmonic resonance where all the clients slowly synchronize on which servers they are hitting. However, it's hard to explain in words and a visual simulation would help me explain.
Do you think it would be easy for me to hack the source and add multiple clients to the simulation?
Shouldn't be too hard to add the clients since, in terms of the animation, they're just flipped versions of the server. The question is if this resonance is going to require some intentionality to it to demonstrate/replicate.
I love this simulation; it's the best visualization I've seen of load balancing concepts.
One thing though - in my experience the biggest challenge I typically see is that different requests take different amounts of time. For example, you're running multiple instances of your monolith and there is an endpoint that returns a static response and another one that generates a huge report.
This is actually something that can be handled with a load balancer that can introspect layer 7, but that's a whole other thing.
I do talk about variable request cost a fair bit, but I really don't think I did a good job visualising it. The rate at which each request shrinks while it's being processed by a server is meant to represent cost, but it's pretty subtle and easy to miss unless you're really watching closely.
If I were to do this again, making request cost more obvious would be something I'd like to do. I had initially had requests be variable in size on screen, and moved away from that because load balancers don't typically know the cost of a request up front.
I’ve always been more curious about how load balancers are supposed to be highly available? Presumably you run many instances, but then it seems like you need something to balance the load across your load balancer instances? Also, how do many load balancer instances listen on the same IP address (or do they use DNS to map a domain name to multiple IP addresses?)?
A classical load balancer runs in an HA hot-warm pair, with IP takeover --- when the secondary senses the primary has failed, it takes over the IP and begins serving. Depending on the type of load balancing and the software involved, this could be nearly seamless, or it could end all sessions in progress.
If you want to run hot-hot load balancing on a single IP, it's generally done with routing protocols. Equal cost multi-path (ECMP) will split traffic by hashing on some portion of the (source IP, dest IP, protocol, source port, dest port) 5-tuple; you'd configure your router to enable ECMP, and then your load balancers would advertise the IP via BGP or RIP or whatever is cool these days. Communication between load balancers to handle sessions that move during failover and bring-up is optional (if you don't do it, sessions will end abruptly); this setup is similar to anycast, although with anycast you may also see sessions move when external routing changes, and you really should manage that. You also should have a method to handle ICMP packets, most specifically needs-frag packets, as they will be sent from a different IP than the connection peer and will likely hash differently and may likely route differently for anycast, too.
You can use DNS to direct traffic to multiple load balancers, but DNS is not a precision instrument. It's useful for geographic balancing (in addition to anycast), but resolvers have a tendency to cache results for longer than published TTLs and it takes significant effort to understand how much request traffic a given resolver will generate from one lookup. For balancing between two load balancers where you want roughly equal traffic, you also need to consider pathological behavior like RFC 3484 and RFC 6724. These two RFCs suggest preferentially using IPs with a larger common prefix when multiple options are available. This only makes sense when the common prefix is meaningful. If your ISP was assigned 10.1.2.0/24 and I have service IPs of 10.1.7.3 and 10.2.4.5 and return both of those as A records, your resolver shouldn't really prefer one or the other, because beyond your ISP prefix, there's no actual network closeness implied by a similar IP. 'Smart' resolvers that follow this RFC can cause large scale traffic imbalances, so fun times there.
You've got it, DNS is what's often used to solve this problem. Do a few resolutions of reddit.com and you can see it returning 4 different IP addresses in a randomised order. :)
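You can try this yourself with something like the snippet below; it goes through your system resolver, so caching may hide the rotation and the exact addresses will vary:

    import socket

    for _ in range(3):
        infos = socket.getaddrinfo("reddit.com", 443, proto=socket.IPPROTO_TCP)
        print([info[4][0] for info in infos])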
CDNs such as Cloudflare also do anycast routing which means you can hit the same IP in different places in the world and get a response from the nearest point-of-presence.
Floating/Virtual IPs are one networking solution for HA. With Corosync/Pacemaker a cluster of hosts can decide how the IPs are then assigned to physical hosts. CARP maybe/probably can do the same. I just googled and apparently there is also some IPVS/LVS project for Linux for HA load balancing?
VRRP, philosophically,
must ipso facto standard be
But standard it
needs to be free
vis-à-vis
the IETF
you see?
But can VRRP
be said to be
or not to be
a standard, see,
when VRRP can not be free,
due to some Cisco patentry..
I always thought that for hundreds of backends and above 10K RPS those algorithms scale poorly. First of all, scaling beyond a single load balancer becomes a problem. The cost of updating counters becomes significant, and if responses are quick, the extra bookkeeping doesn't pay off.
If you use weighted random, all these problems go away.
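For reference, weighted random needs no shared state between load balancers at all; each one can pick independently from static weights. A minimal sketch, with made-up backend names and weights:

    import random

    BACKENDS = ["small-1", "small-2", "big-1"]
    WEIGHTS  = [1, 1, 4]  # e.g. big-1 has roughly 4x the capacity

    def pick_backend():
        # No counters, no coordination: just a weighted coin flip per request.
        return random.choices(BACKENDS, weights=WEIGHTS, k=1)[0]

    print(pick_backend())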
> First of all, scaling beyond a single load balancer becomes a problem.
If you are in a DC or have L3 access to the underlying network, ECMP is a VERY easy way to scale to 4-16 load balancers. Over that, just... use multiple IPs. We have used an ECMPed setup of 4 nodes since forever and it works very well.
There are theoretical improvements that could be made by doing L4 load balancing into L7 load balancing, but you'd need a hell of a lot of traffic to ever need that.
> The cost of updating counters becomes significant, and if responses are quick, the extra bookkeeping doesn't pay off.
Essentially, yes, but if you have some architecture that talks very little (say just a few packets per connection; let's assume some IoT garbage), you probably just want a different architecture altogether, like "the first request gets you an assigned server, then you talk to it directly without the LB". L4 balancing is also an option.
In essence, if the performance of the leastconn algorithm is somehow your bottleneck, you're both big enough and specialized enough to go with another approach than "just dump all traffic at a few load balancers".
Weighted random is one of the algorithms available in the playground at the end. I did originally have a section talking about it but cut it for length reasons.
Updating a counter (in memory on a single host) is typically a single- or double-digit-nanoseconds operation, basically never the bottleneck in anything.
Why is there not a queue system where a server can pull requests when it's ready to handle one?
Why must the work to determine where to send a request go to the load balancer, when you could just pull from a queue and save yourself the trouble of finding the exact right load balancer strategy for your application?
The short answer is that it's more complicated to engineer and has its own set of trade-offs. With the push-based models shown in the article, everything works over regular HTTP(S), no fancy routing is needed, and the LB needs to keep very little state to do its job.
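For contrast, here's a toy pull model: servers take work off a shared queue when they're ready, so nobody has to guess who's least busy. The queue technology, timeouts, and getting the response back to the client are all hand-waved, and the names are invented:

    import queue
    import threading
    import time

    work = queue.Queue()

    def worker(name, seconds_per_request):
        while True:
            request = work.get()              # blocks until work is available
            time.sleep(seconds_per_request)   # pretend to handle the request
            print(f"{name} handled {request}")
            work.task_done()

    threading.Thread(target=worker, args=("fast-server", 0.1), daemon=True).start()
    threading.Thread(target=worker, args=("slow-server", 0.5), daemon=True).start()

    for i in range(10):
        work.put(f"request-{i}")
    work.join()  # the fast server naturally ends up doing more of the work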
seems like a request could be pulled from a queue with HTTPS easily enough. And if I am misunderstanding what you're saying, I don't understand why a purely request-driven path is so vital; it's just a way to move data around.
seems like the queue could accept the request, keep the https connection open, wait for a response from whatever server accepts the request, and pass the response to the client upon receipt just fine.
I dunno. web stuff is poorly done imo. poorly designed.
This and slow-start were on my original to-do list for this post, but I wanted to keep it to a certain length and complexity so they ended up getting cut. There's a tonne of cool stuff to visualise in this space, I was spoilt for choice. Slow start, server flapping, session stickiness, dozens more algorithms, multiple load balancers with incomplete state, multiple levels of load balancers, the list probably goes on. :D
I think it might be worth adding to this some discussion of how the load balancer itself handles the load, and the related fact that _within_ some backends, the responsibility for load-balancing falls on rich clients that maintain knowledge of the server sets to which they can route requests.
What I don't understand is: since it's the same service, can't we just run a performance test, determine the ballpark of the needed resources, and just have them capped like in k8s? Why would some servers be more powerful than others in the first place?
One perfectly homogeneous server gets a 20MB POST with an image that then needs to be re-encoded/cropped.
The other gets a tiny GET for CSS.
Boom, now your equal servers are unequally loaded. The beauty of leastconn is that if the 1st server is loaded, it will just naturally get fewer connections than the lucky ones that only got easy jobs.
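The whole idea fits in a few lines. A minimal sketch (invented server names, no locking or error handling):

    # The LB tracks in-flight connections per server and sends new work to
    # whoever has the fewest. A server stuck re-encoding a 20MB image keeps
    # its counter high and stops receiving new connections until it finishes.
    active = {"server-a": 0, "server-b": 0, "server-c": 0}

    def assign_connection():
        server = min(active, key=active.get)
        active[server] += 1
        return server

    def finish_connection(server):
        active[server] -= 1

    s = assign_connection()   # goes to whichever server is least busy
    finish_connection(s)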
Yeah, this for sure. Traffic mix changes over the life of an application. In k8s performance isolation certainly isn't a solved problem, and noisy neighbours can affect the "power" of a server.
Any change to the application at all could upset the delicate balance of its performance. You would need to be doing this sort of profiling on every change, using up-to-date and accurate production traffic.
It ends up being much easier to put some smarts in your load balancer.
The simulations are really cool. I'm trying to think of a real-world situation where you'd have differently-sized backends though. I guess it might help if your server runs hourly cron jobs that take away system resources from requests.
I don't think many people are intentionally deploying their apps with replicas of different sizes, you're right.
But this is it: noisy neighbours, inherent physical differences in even identical hardware, using different node sizes in your clusters. I think incidental differences in servers are very common, even within the same AWS instance category.
I thought of another variant: with AWS autoscaling groups using spot instances you can list a number of different instance sizes and say "give me whatever is cheapest" and you'll often get a mix.
If the requests are similar (i.e. don't have unique user data in them), it's best to put a high performance reverse proxy in front that has some caching. This way you don't have to execute lots of code for every request.
iirc there was a famous case where an outage occurred at a company using some least-connection or least-resource algorithm in the load balancers. The load balancer queued up requests, and when a new server was brought online it immediately DoS'd it out of existence. Then another server was brought online and the same thing occurred, and another and another. They eventually switched to round-robin to give the new servers coming online a fighting chance to survive. Eventually, enough were online to process the queue.
The takeaway was that complex load balancing can make problems worse when bad things happen unexpectedly.
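One common mitigation for exactly that failure mode is the slow start mentioned elsewhere in the thread: ramp a new server's share of traffic up over time instead of handing it a full slice of a backed-up queue the moment it appears. A rough sketch, with an invented warm-up window:

    import time

    WARMUP_SECONDS = 60

    def effective_weight(base_weight, started_at, now=None):
        # Scale the configured weight by how long the server has been up,
        # starting at 5% and reaching 100% after the warm-up window.
        now = time.time() if now is None else now
        age = now - started_at
        ramp = min(1.0, max(age / WARMUP_SECONDS, 0.05))
        return base_weight * ramp

    print(effective_weight(100, started_at=time.time()))        # just booted: ~5
    print(effective_weight(100, started_at=time.time() - 120))  # warmed up: 100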
Just throwing another positive comment into the fray: Amazing work on this article. Informative, and easy to digest despite being a complex subject. Well done! I learned a lot from this article.
I used https://pixijs.com/. This was my first project with it and it was really nice to use. I fell in to a few traps here and there, but generally it wasn't too bad to get the results I wanted with it. :)
Yeah, it does! There are a few options, I’ll mention 2:
1. You can use DNS to return the IP addresses of multiple load balancers.
2. You can have N load balancers that are themselves behind another, beefier load balancer.
Plenty of other magical things that have been done to solve this problem, but I think those 2 are quite common and flow naturally from the ideas the post covers.
Queuing theory is nice, but then again it's completely abstract and mathematical. Real-life issues get in the way.
In particular: for many parallelization problems, it's simply not worth the effort to ever load balance. Tasks are too small and the "time spent calculating where to go" is more expensive than just accomplishing the task to begin with. (Ex: matrix multiplication can be seen as a parallelization of multiplies, followed by a set of additions. The multiplies are too cheap to perform any form of load balancing, executing in just one clock tick.)
So in reality, we have "fine-grained parallelism" and "coarse-grained parallelism". Coarse-grained is where load balancing is feasible (the task is "big enough" that it's worth spending a bit of CPU time figuring out load balancing).
"Fine-grained" is very difficult, and is in the realm of CPU design (CPUs will execute your instructions out of order to discover parallelism)... though software also plays a role.
----------
That being said, I probably should study more queuing theory myself. The math seems less about "how to design a good parallel system" and more focused on "how to measure a parallel system" (ex: throughput * latency == state... which means you can measure latency == state/throughput, and other such mathematical tricks).
The math is simple, albeit abstract. It's not there for "deeper understanding", it's there for "even basic understanding" (but much, much faster to calculate if you already know queuing theory).
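As a tiny worked example of the throughput * latency == state trick (Little's law), with numbers picked purely for illustration:

    throughput = 1000   # requests per second
    latency = 0.050     # 50 ms average time in the system
    in_flight = throughput * latency
    print(in_flight)    # ~50 requests sitting in the system at any moment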