> So we went from a latency of 1 second to 0.091 seconds which is an 11 times improvement.
There's your problem -- you should never allow unbounded queue growth at high utilization. Going from 80% to 90% utilization doubles your average wait times. We could similarly make this number arbitrarily large by pushing that utilization closer to 1, e.g. "We halved service time at 99.99% utilization and saw a 10000x improvement". But that's not interesting, as your users will complain that your service is unusable under heavy load far before you get to that point.
The typical "fix" is to add load shedding (e.g. based on queue length) combined with some intelligent backoff logic at the client (to reduce congestion), and call it a day. This has its own downsides, e.g. increased latency for everyone in cases of overload. Or, if your configured queue length is too large, you get bufferbloat.
(I have seen an argument for using LIFO instead of FIFO, which achieves much more predictable median performance at the expense of causing unbounded badness in the worst case. For this, your client needs to set deadlines, which it should probably be doing anyways.)
Why is this counterintuitive? You're describing scheduling -- your high priority service has the ability to preempt the low priority service and use more of your CPU (or whatever your bottleneck is) if there are pending requests.
If you have a high-availability, latency-sensitive service that's processing incoming requests in realtime, you want it to be overprovisioned (for limiting queue lengths, the ability to soak up DoS before other systems kick in, etc). But that's wasteful if you're idle half the time. (Or generally, off-peak -- you might explore spinning up / down nodes dynamically, but you probably still want some baseline amount of capacity in each of your availability zones.)
Another dimension you could explore (if your problem space allows it) is varying the amount of work you perform depending on your current load -- for instance, you could imagine a video server that serves 4k at light load, then degrades to 1080p at moderate load, and then degrades even further to 720p at heavy load.
That said, it depends on your cost model. If you're dealing with bare metal (or you are the cloud) where your principal operating expenses are space and power, then you might be fine with servers idling if power is expensive. If depreciation of your servers drives your costs (or if you get charged the same amount regardless of your actual utilization), then you might want to run your servers hotter.
If you have two workloads with very different operational profiles (e.g. low latency low volume vs high latency high volume) you could also process each workload in a different copy of the system, sorta-kinda the hot-path/cold-path pattern.
It might sound like duplicating your system would make it more complex, but having two systems each doing one very specific thing tends to be simpler than one single system doing very different things.
Another of those somewhat counter-intuitive results is the answer to the question "how much do we need to scale up to avoid response time regression when the request rate is doubled?"
It is very easy to blurt out "well obviously we need twice the processing power!" but if we scale to twice the processing power, then start accepting twice the request rate – we will actually be serving each request in half the time we originally did.
To many people that sounds weird; it sounds like we got something for nothing. If I invite twice as many people to a party and buy twice as many cookies, it's not like each guest will get twice as many cookies – that just leaves the originally planned number of cookies for each guest.
But for response time it comes back to the first equation in TFA:
T = 1/μ · 1/(1 - ρ)
Doubling both arrival rate and maximum service rate leaves ρ – and the second factor with it – unchanged, but still halves the 1/μ factor, resulting in half the response time.
The appropriate amount to scale by is the k that solves the equation we get when we set the old response time T at request rate λ equal to the one at request rate 2λ and kμ processing power. This is something like
T = 1/μ · 1/(1 - λ/μ) = 1/kμ · 1/(1 - 2λ/kμ)
but rearranges to the much simpler
k = ρ + 1
which a colleague of mine told me interprets intuitively as "the processing power that needs to be added is exactly that which will handle an additional unit of the current utilisation on top of the current utilisation, i.e. twice as much."
This is mostly good news for people doing capacity planning in advance of events etc. If you run your systems at reasonable utilisation levels normally, you don't actually need that much additional capacity to handle peak loads.
> but if we scale to twice the processing power, then start accepting twice the request rate – we will actually be serving each request in half the time we originally did.
Not necessarily. If processing power is increased by doubling the clock of a processor or using a hard disk that spins and seeks twice as fast, this may be the case.
But we all know that when you have a single threaded work item, adding a second core does not cause the single threaded work to complete in half the time. If the arrival rate is substantially lower than 1/S, the second core will be of negligible value, and maybe of negative value due to synchronization overhead. This overhead is unlikely to be seen when doubling from 1 to 2, but is more likely at high levels of scaling.
If the processor is saturated, service time includes queue time, and service time is dominated by queue time, increasing processing power by doubling the number of processors may make it so that service time can almost be cut in half. How close to half depends on the ratio of queue time to processing time.
This is a useful addition. Yes, the above reasoning was assuming that we
(a) had located a bottleneck,
(b) were planning to vertically scale the capacity of that bottleneck, and
(c) in doing so won't run into a different bottleneck!
It is still useful for many cases of horizontal scaling because from sufficiently far away, c identical servers looks a lot like a single server at c times the capacity of a single one. Many applications I encounter in practise does not require one to be very far away to make that simplifying assumption.
> How close to half depends on the ratio of queue time to processing time.
I don't think this is generally true, but I think I see what you're going for: you're using queue length as a proxy for how far away we have to be to pretend horizontal scaling is a useful proxy for vertical scaling, right?
> It is very easy to blurt out "well obviously we need twice the processing power!" but if we scale to twice the processing power, then start accepting twice the request rate – we will actually be serving each request in half the time we originally did.
I don't understand what you are saying. Are you talking about the time the request is buffered in some queue on average assuming they are arriving randomly? Or something like that?
Here is what I'm thinking. We are operating a hospital which does only one type of surgery which last an hour exactly. (Presumably it is a veterinary practice for spherical cows.) A fully staffed operating room can operate on 24 spherical cows a day. If due to a spherical cow jamboree we expect more patients and we set up a second operating theatre we will be able to serve 48 of them a day. But we are still serving them for an hour each. (because that is how long the operation takes.)
Even if we are talking about latency when 24 cows show up at the stroke of midnight to the one operating room hospital they each will be served on average in 12.5h. Same average if 48 cows show up to the two operating room hospital.
So what am I thinking wrong here? Is "scale to twice the processing power" not the same as getting a second operating room? I'm not seeing where "we will actually be serving each request in half the time" comes from.
in queue theory, you don't expect "operating rooms" to operate 24 hours per day - spherical patients may have a gap, which causes the room to not work for some time, but then jamboree happens and it averages out
doubling the cows input doesn't mean each "burst" becomes twice as big - some of the new cows can simply fall into periods that previously weren't used
thus the second portion of patients don't need a whole second copy of all operating rooms - part of them get gobbled into inactive timeslots of already existing resources
(so it seems putting two stochastic processes on top of each other is not like putting two "solid" things on top of each other, right? intuitively they mesh, I guess their stacking height is their "expected value"?
and their worst case will be the sum of their worst cases, where there's no averaging, right? so again intuitively a the larger a flow is the more dangerous is, even if it's smooth, because if it backs up it fills up queues/buffers faster, so to plan for extreme cases we still need "linear" thinking)
Kind of yes. Stacking two stochastic processes simply adds up the expectation value but not the noise/dispersion/volatility. That variation adds as a sum of squares.
If you push utilisation towards 1, what you essentially do is push the next "free" slot farther and farther into the future. This, essentially, means that you always buy higher utilisation with longer latency (at least in the upper bound). But the good thing is: If you have enough numbers, then the maximum latency grows slower with utilisation.
> So what am I thinking wrong here? Is "scale to twice the processing power" not the same as getting a second operating room? I'm not seeing where "we will actually be serving each request in half the time" comes from.
Single Core vs Multi Core (ish).
With a Single thread you must work twice as fast to handle the increased load which also means the work is done at half the speed.
With Multi threading you can shuffle two units of work out at the same time so it's twice the load but the same speed.
To go back to the cow analogy, rather than adding 24 rooms (more threads) you give each surgeon a powersaw and they work twice as fast.
So if you scale the processing power per thread up then the time goes down, if you scale the processing power by adding threads (cores) the time stays the same (ish).
> To go back to the cow analogy, rather than adding 24 rooms (more threads) you give each surgeon a powersaw and they work twice as fast.
I was thinking that maybe that is what we are talking about. But convinced myself otherwise. Surely we don't need that much math to show if we cut the processing time in half the processing time will be cut in half. But if that is all we are saying I guess I can accept that as quite trivial.
Cut the processing time in half while doubling the load!
The average cow, pre-jamboree, spends one hour in the hospital, including time waiting for a slot in the OR. Then you give the surgeon a power saw that allows him to complete the job in half the time, but he also gets twice as many cows to work on.
Most people's intuition would tell them the cows would still spend an hour in the hospital (the doubling in work rate canceled by the doubling in work amount), but actually now it takes half an hour -- regardless of how swamped the surgeon was initially.
> but if we scale to twice the processing power, then start accepting twice the request rate – we will actually be serving each request in half the time we originally did.
People usually add processing power by adding more parallelism - more machines, VMs, pods, whatever. In this case, the "blurted out" answer is correct.
If I take one second to serve a request on a machine then I add another machine and start serving twice the requests, the first machine doesn't get faster.
Maybe what you're saying is true if you make your CPUs twice as fast, but that's not usually possible on a whim.
A simple way to view the first statement is that by both doubling the request rate and halving the processing time, you're effectively speeding up time by a factor of 2. If you were to record a video of the entire system, someone switching the parameters as you describe and someone putting the recorded video on a 2x fast forward would be fundamentally indistinguishable. So, naturally, the response time gets cut down on half as well.
If you google for "TFA meaning", which I'm sure you did, you should have seen the first result, which is a link to hackernews where someone is complaining about TFA not being defined.
> This is mostly good news for people doing capacity planning in advance of events etc. If you run your systems at reasonable utilisation levels normally, you don't actually need that much additional capacity to handle peak loads.
Stop using math that will put the hyperscalers/serverless/microservices bros out of business! /s
In mathematical queueing theory, Little's law (also result, theorem, lemma, or formula[1][2]) is a theorem by John Little which states that the long-term average number L of customers in a stationary system is equal to the long-term average effective arrival rate λ multiplied by the average time W that a customer spends in the system. Expressed algebraically the law is
L=λW
The relationship is not influenced by the arrival process distribution, the service distribution, the service order, or practically anything else. In most queuing systems, service time is the bottleneck that creates the queue.
Sure! In a professional capacity - our customers were dissatisfied with the length of time and flakiness of a particular workload job[0], and I was party to the discussions on how to solve this problem. Knowing Little's Law allowed me to dissent against the prevailing opinion that we should customise our job processing queue to prioritise these jobs[1], arguing instead to provision more resources (i.e. beefier servers).
The decision was still made to alter the priority. The change went into production and was found to unacceptably degrade the performance of other jobs. Thankfully, one of the engineers who I had convinced that "processing time is the only factor that matters" had spent all their time optimizing the heavy task, to the point where it was no longer a heavy task, thus saving the day.
0. The job was some kind of "export CSV" form, and it somehow involved both 'traversing dozens of highly normalised tables' and 'digging into json blobs stored as text'.
1. I specifically remember one of the arguments was that if you have 3 heavy tasks A B and C, best case is "in parallel" which takes max(A, B, C) time whereas worst case is "sequential" which takes (A) + (B + A) + (C + B + A) time, our current priority approximated the "sequential" scenario, and the priority change would instead approximate the "parallel" scenario. I use scare quotes because I felt it was a resource issue (the "sequential" pattern was a byproduct of the most common way a heavy task got enough resources... which was an earlier heavy task finished and freed up a lot of resources).
what a predictable result. changing to a priority queue in a heavily utilized system will ALWAYS increase wait time for other tasks if independent tasks.
The ONLY time thats not true is if higher priority tasks would eliminate the need for other tasks.
Many electronic queueing systems in hospitals etc. show you how many customers are ahead of you. If you observe the time it takes for a few people to receive service you can estimate how long you'll be stuck waiting. This is a number they rarely show you because it's often not encouraging...
It's the most primitive application but immediately useful for anyone.
Every time I'm in a queue at a store I compute the arrival rate while I wait in line. (Little's Law is one of my very favorite pages in all of Wikipedia!).
I am a bit worried of the overuse of Little's formula, especially in these catchy educational presentations. In reality queue sizes will dictate your consumer observable latency profile, which is in turn is dictated by the actual distribution of the service time - it is not a constant.
If you think about it, if you have an ideal system that serves users like a clockwork, every X ms with no jitter, while your arrival is also completely regular, every Y ms (Y < X), then basically a queue length of 1 is sufficient. In reality, just like we all observe in real-life queues, service time is far from constant, and outliers result in queue buildup. This is why often cutting the tail of service-time latency results in better overall latency than simply reducing average service-time.
Little's formula of course holds also in the above scenario, but it handles long-time averages and does not give you any indication what extreme behavior is lurking under the mask of these averages.
> ...the actual distribution of the service time - it is not a constant.
I'm concerned by the number of misunderstandings expressed in short time here.
1. Nobody claims service time is constant.
2. Little's law is one of the few parts of queueing theory that remarkably does not depend on service time distribution.
3. Many results for typical simplified M/M/c systems apply well also to any other service time distribution provided (a) arrivals are Poisson, and (b) the server uses time slicing multiprocessing. These are not very severe requirements, fortunately!
Long-term average sounds restrictive but it really just means a period long enough to exhibit some statistical stability. Most systems I see sit mainly in those regimes.
> I'm concerned by the number of misunderstandings expressed in short time here.
I have a feeling you misread my comment completely, and the misunderstandings are on your part?
> Nobody claims service time is constant.
Neither did I. Neither did I claimed that Little's Formula requires a constant service time.
> Little's law is one of the few parts of queueing theory that remarkably does not depend on service time distribution.
I did not say otherwise either. My point is that it is way less useful and enlightening than these edutainment posts make it. Two systems with the exact same parameters and results by Little's Formula might behave completely differently, and in many cases, counterintuitively.
> Many results for typical simplified M/M/c systems apply well also to any other service time distribution provided (a) arrivals are Poisson, and (b) the server uses time slicing multiprocessing. These are not very severe requirements, fortunately!
This was not my point. Or do you claim that queue size distributions DO NOT depend on service time distribution? Because that WAS my point. Averages do not tell the story you are most interested in. The whole point of queues is that service and arrival times have distributions with deviation. I personally think queues and buffers are very-very important and I am a huge proponent of predictable user-observable latencies as they improve general system health and resiliency under load.
> Long-term average sounds restrictive but it really just means a period long enough to exhibit some statistical stability. Most systems I see sit mainly in those regimes.
Long-term averages do not talk about pathological transient behavior, do not help you with queue sizing - or setting ideal timeouts. Also, statistical stability is misleading, the convergence time to the given statistic might be arbitrarily slow. Also, if we talk about real-world systems (which you do), they exhibit feedback do to clients retrying, throwing off the formula and potential ergodicity.
With these clarifications I realise we are in violent agreement and indeed the misunderstandings were on my part. I apologise, and an grateful you took the time to expand!
I don't think he's claiming that it's causally a function of service time (although this can be the case if demand is elastic). Rather, utilisation is calculated from service time and arrival rate, and they're just stating that this equation can be rearranged to calculate arrival rate if you know the other two.
Utilization is Service Time times Arrival Rate, ie if requests arrive twice as fast utilization is doubled, and also if a service takes twice as long to process.
When the theoretical service is optimized to take half a long, the assumption is that the request rate doesn't actually change. So they use the above relationship to calculate the new utilization, given half the service time and the same arrival rate.
Industry standard utilization is 50-60%, not 90%. If it is 90% utilization, you are destined for failure as 10% extra requests even for few seconds would kill your service. And it should be 60% even if service time increases or decreases, so decreasing the service time by half should ideally halve the resources.
> So we went from a latency of 1 second to 0.091 seconds which is an 11 times improvement.
There's your problem -- you should never allow unbounded queue growth at high utilization. Going from 80% to 90% utilization doubles your average wait times. We could similarly make this number arbitrarily large by pushing that utilization closer to 1, e.g. "We halved service time at 99.99% utilization and saw a 10000x improvement". But that's not interesting, as your users will complain that your service is unusable under heavy load far before you get to that point.
The typical "fix" is to add load shedding (e.g. based on queue length) combined with some intelligent backoff logic at the client (to reduce congestion), and call it a day. This has its own downsides, e.g. increased latency for everyone in cases of overload. Or, if your configured queue length is too large, you get bufferbloat.
(I have seen an argument for using LIFO instead of FIFO, which achieves much more predictable median performance at the expense of causing unbounded badness in the worst case. For this, your client needs to set deadlines, which it should probably be doing anyways.)