One thing I found interesting: if you go with PEWMA and create a scenario where the cluster is stressed, and then add 1 server, it pummels the shit out of the new server and you get a brief surge in failed requests.
Not sure if that is a real world issue, or just with the simulation...
This is very likely a bug in the simulation. My simplified implementation of PEWMA prioritises servers that have had no traffic, in order to send at least 1 request to all servers. There will be a window, until this new server serves its first request, where it is considered the highest priority server.
I doubt very much that this would be part of any real-world implementation.
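For illustration, here's a minimal sketch of how that kind of prioritisation creates the window (hypothetical names, not the article's actual code):

```python
class Server:
    def __init__(self, name):
        self.name = name
        self.ewma_latency = None  # no traffic seen yet

def pick_server(servers):
    # Treat "no data yet" as the best possible latency so that every
    # server gets at least one request. The bug: under heavy load,
    # every pick in the window before the new node's first response
    # comes back lands on that node, because its EWMA is still None.
    return min(servers, key=lambda s: 0.0 if s.ewma_latency is None
               else s.ewma_latency)
```

Every request dispatched during that window piles onto the new node at once, which lines up with the brief surge in failures.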
I'm not familiar with PEWMA, but real load balancers sometimes have this problem. Either because dynamic weighting slams the new server, which shows zero load, or because the new server needs to do some sort of cache warming, whether that's disk or code or JIT or connection establishment or ???, early requests are often handled slowly.
Most load balancers should have a way to do some sort of slow start for newly added or newly healthy servers. That could be an age factor on weighting, or an age factor on max connections, or ???. Some older load balancers are just not great at this, so you develop experience-based rules like 'always use round robin, leastconn will kill your servers with lumpy loads'. All that said, and a repeated theme across my comments in this thread: the more sophisticated your load balancing is, the harder your load balancer needs to work, and the sooner you need to figure out how to load balance your load balancers.
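To sketch the "age factor to weighting" idea (hypothetical numbers and names; HAProxy's `slowstart` server option is a real example of the same concept), you scale each server's effective weight by how long it has been in rotation:

```python
import time

SLOW_START_SECONDS = 30.0  # assumed ramp-up window

def effective_weight(base_weight: float, added_at: float) -> float:
    """Ramp a server's weight from a small floor up to base_weight
    over SLOW_START_SECONDS after it joins the pool, so a new or
    newly healthy node isn't handed a full share immediately."""
    age = time.monotonic() - added_at
    ramp = min(max(age / SLOW_START_SECONDS, 0.05), 1.0)
    return base_weight * ramp
```

Implementations vary the ramp shape, and some cap things like max connections during the window instead of (or as well as) weight.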
It does happen in the real world as well; at least, that's what I was told when I started my first job as a system admin.
The reason people cited to me back then was that the balancer usually isn't particularly smart when balancing: it only sees a free node, so every new request is routed to it. The errors (mostly timeouts) happen once those requests actually start to get processed.
Normally, a node gets a steady stream of requests over time, so the load is constant (generally speaking, a request requires the most resources at the same relative point in its lifecycle). When all the requests are fresh, they all hit that same bottleneck at the same time, causing the timeouts.
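A toy model of that synchronisation effect, with entirely made-up numbers, just to show the shape of the problem:

```python
# Toy model: each request needs its peak resources PEAK_OFFSET
# seconds into its lifecycle. Staggered arrivals spread those peaks
# out; a burst of fresh requests on a new node stacks them all up.
PEAK_OFFSET = 2.0
PEAK_WIDTH = 0.5

def peak_load(start_times):
    times = [t / 10 for t in range(120)]
    return max(sum(1 for s in start_times
                   if abs((t - s) - PEAK_OFFSET) < PEAK_WIDTH)
               for t in times)

staggered = [i * 0.5 for i in range(10)]  # warmed-up node
burst = [0.0] * 10                        # freshly added node

print(peak_load(staggered))  # 2: only a couple peak together
print(peak_load(burst))      # 10: every request peaks at once
```

The warmed-up node sees a modest rolling peak; the fresh node sees every request's peak simultaneously, which is when the timeouts land.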
The answer is to aggressively scale horizontally, then quickly decommission until you're back to baseline.
Or just accept the failed requests
It's been over 10 years though; it might've been improved since.
I don't know anything about this subject, but my first thought (which may be wrong) would be to just set the weight of the new server to match one of the servers that are already receiving traffic (perhaps one of the lower-ranked ones). That way, it would not be overloaded so easily, and it would adjust its ranking after a while.
I guess my explanation was lacking then, as that wouldn't help. Reducing the weight below that of the old nodes might work, but it would also extend the duration you're overloaded, which would also cause requests to fail.
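For concreteness, the parent's suggestion would amount to something like the following hypothetical sketch, reusing the `Server` shape from above. Per the objection, it removes the instant slam but does nothing about the underlying overload:

```python
from statistics import median

def add_server(pool, new_server):
    # Seed the newcomer's EWMA with the pool median instead of
    # "unknown", so a latency-based picker doesn't treat it as the
    # best server in the cluster. It still gets picked readily
    # (median latency, zero history), just not exclusively.
    known = [s.ewma_latency for s in pool if s.ewma_latency is not None]
    new_server.ewma_latency = median(known) if known else None
    pool.append(new_server)
```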