Not an expert in load balancing, but in similar problems (work sharing / work stealing, MPI get/put) it makes sense to pull only if you can pull fast enough to avoid incurring prohibitive latency at every request/message.
Multithreading-based work stealing à la Cilk relies on extremely cheap thread creation and scheduling to minimize communication.
In another similar situation, HPC switches are credit-based, so that until you hit congestion you can “instantaneously” know whether a remote is ready to receive.
This isn't a formal explanation, of course.
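To make the credit-based idea above concrete, here's a minimal sketch (the names and numbers are hypothetical, not any real switch or NIC API): the sender starts with as many credits as the receiver has buffer slots, spends one credit per message, and gets credits back as the receiver drains its buffer. Until the credits run out, "is the remote ready?" is answered locally, with no extra round trip.

```python
# Toy credit-based flow control. Hypothetical names; not a real switch/NIC API.
from collections import deque

class Receiver:
    def __init__(self, slots: int):
        self.slots = slots
        self.buffer = deque()

    def accept(self, msg) -> None:
        # The credit scheme guarantees the sender never overruns the buffer.
        assert len(self.buffer) < self.slots, "sender violated its credit limit"
        self.buffer.append(msg)

    def drain_one(self) -> bool:
        """Process one buffered message; each drained slot frees one credit."""
        if self.buffer:
            self.buffer.popleft()
            return True
        return False

class Sender:
    def __init__(self, receiver: Receiver):
        self.receiver = receiver
        self.credits = receiver.slots   # initial credits = remote buffer size

    def try_send(self, msg) -> bool:
        if self.credits == 0:
            return False                # congested: no credit, don't transmit
        self.receiver.accept(msg)
        self.credits -= 1
        return True

    def credit_returned(self) -> None:
        self.credits += 1               # receiver freed a slot

# Send until credits run out, then recover a credit and send again.
rx = Receiver(slots=2)
tx = Sender(rx)
print([tx.try_send(i) for i in range(3)])   # [True, True, False]
if rx.drain_one():
    tx.credit_returned()
print(tx.try_send(99))                      # True
```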
Edit: after some thought, that's not really the distinction that matters for load balancing. Pushing to the least loaded queue already requires knowledge of the remote state. So the real difference between pull and push is having one queue vs. several. In that sense it's like supermarkets: a single queue feeding every cashier is more efficient than the traditional one queue per cashier. Supermarkets have a choice to make because of other constraints, but if you're optimising purely for load balancing, a single queue is strictly better, as long as there's only one input (a quick simulation sketch below illustrates the gap).
Having enough workers to do the work shouldn't be any more constraining than it was before.
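To put a rough number on the single-queue argument, here's a quick toy simulation (all parameters invented; Poisson arrivals, exponential service, FIFO everywhere). It compares one shared queue feeding four workers against round-robin assignment into four per-worker queues at the same total load.

```python
# Toy comparison: one shared FIFO queue vs. per-worker queues (made-up numbers).
import heapq
import random
import statistics

def make_jobs(n, arrival_rate, service_rate, seed=1):
    rng = random.Random(seed)
    t, jobs = 0.0, []
    for _ in range(n):
        t += rng.expovariate(arrival_rate)
        jobs.append((t, rng.expovariate(service_rate)))
    return jobs                          # list of (arrival_time, service_time)

def shared_queue_waits(jobs, workers):
    free = [0.0] * workers               # heap of times each worker becomes free
    heapq.heapify(free)
    waits = []
    for arrival, service in jobs:
        start = max(arrival, heapq.heappop(free))   # next job goes to the first free worker
        waits.append(start - arrival)
        heapq.heappush(free, start + service)
    return waits

def per_worker_queue_waits(jobs, workers):
    free = [0.0] * workers               # each worker drains its own FIFO queue
    waits = []
    for i, (arrival, service) in enumerate(jobs):
        w = i % workers                  # round-robin assignment at arrival time
        start = max(arrival, free[w])
        waits.append(start - arrival)
        free[w] = start + service
    return waits

jobs = make_jobs(n=50_000, arrival_rate=3.6, service_rate=1.0)   # 4 workers => ~90% load
for name, waits in [("shared queue", shared_queue_waits(jobs, 4)),
                    ("per-worker queues", per_worker_queue_waits(jobs, 4))]:
    print(f"{name:18s} mean wait: {statistics.mean(waits):.2f} (in mean service times)")
```

The shared queue should come out well ahead, simply because no worker ever sits idle while a job waits in another worker's queue.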
The article does nicely mention that simple round robin actually has lower latency, because some traffic gets lucky and lands on under-utilized machines. Unfairness helps some traffic go faster. The queue will probably eliminate this, but the unfairness advantage comes at the cost of a lot of other traffic ending up in long queues on workers, so it wasn't really a good thing anyway. The p90+ is usually awful.
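For a back-of-the-envelope version of that tail argument (all rates made up): if you model random per-request assignment, as a crude stand-in for round robin, each worker behaves like an independent M/M/1 queue, while a single shared queue in front of the same workers is M/M/c. At the same utilization, the per-worker tails come out far worse.

```python
# Rough analytic tail comparison; hypothetical rates, deliberately crude model.
import math

def erlang_c(c, a):
    """P(an arrival has to wait) in an M/M/c queue; a = lambda/mu is the offered load."""
    b = (a ** c / math.factorial(c)) / sum(a ** k / math.factorial(k) for k in range(c + 1))
    rho = a / c
    return b / (1 - rho * (1 - b))

def wait_percentile(p_wait, decay, q):
    """q-th percentile of queueing delay when P(W > t) = p_wait * exp(-decay * t)."""
    return max(0.0, math.log(p_wait / (1 - q)) / decay)   # 0 if most arrivals don't wait

lam, mu, c = 3.6, 1.0, 4           # hypothetical: 4 workers at 90% utilization
rho = lam / (c * mu)

for q in (0.90, 0.99):
    per_worker = wait_percentile(rho, mu * (1 - rho), q)              # per-worker M/M/1
    shared = wait_percentile(erlang_c(c, lam / mu), c * mu - lam, q)  # shared M/M/c
    print(f"p{int(q * 100)} wait: per-worker queues ~{per_worker:.1f}, shared queue ~{shared:.1f}")
```

With these made-up numbers the per-worker p90 is roughly 4x the shared-queue p90 (about 22 vs. 5 mean service times), which is exactly the "long queues on workers" cost.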