Why disaster happens at the edges: An introduction to queue theory (thenewstack.io)
245 points by ayewo on Nov 14, 2021 | 56 comments



Queue theory is almost a distraction because it overcomplicates the situation. The underlying insight is that rates matter. If a system can handle 100 events a second, then at 99 events per second everything is fine and at 101 events per second the system is down. There is a threshold where everything falls apart.

Queue theory encourages people to think in terms of flow rates and provides the technically correct de-rating of nominal capacity to account for variance. But normally that is overkill except in highly engineered systems.

The battle in an organisation is convincing people to look at flows at all. In my experience people love metrics that track stock (we have X widgets or can handle Y orders) and not flows (we built A widgets and sold B widgets). Anyone who has training in queue theory knows when it is appropriate to look at flows rather than stocks and that is where the value is. The formulas and variances tend to just scare people away from monitoring flows because they don't understand what is happening.


Why is 101 events per second == system down?

Of course there are systems where not being able to process everything right when it comes in means that you're "down". It's not a given though and queuing systems are actually perfect for use cases where not being able to process at the speed of incoming requests is completely fine. Eventual consistency is a thing.

I have the same experience though that it seems to be hard for folks to accept that yes, 100k messages in _that_ queue is totally fine. Some process dumped 100k messages to then be processed over the next hour and that's totally fine. That way we don't have to actually spend a small fortune to process them at the speed at which they can be generated.

Of course it depends on the problem at hand. If that particular flow in the system is one that is highly user interactive, then yes, this means 'system down' for your users. Never mind that the system will eventually process it all. I always smile when I look at a system other than ours and can spot the distributed processing with queues in between just from how the system behaves in certain circumstances.


If you can only handle 100 events/second and you are getting 101 events/second continuously, the queue depth is going to grow by 1 event/second without bound, causing your latency to go to infinity.


The moral of that story is don't let your queue grow to infinity. Drop 1 in 101 events and serve the other ~99% with typical latency. Instead of latency going to infinity for everyone, your service gracefully-ish degrades by serving as many requests as it is able to and quickly reporting an error for those it is not. Once load drops below capacity, you are immediately back to a fully functioning state.

Granted, depending on how well behaved your clients are, the number of events each second may grow as the failed requests are retried.
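
In code, shed-when-full is just a bounded queue plus a non-blocking put. A rough Python sketch (single worker; names and numbers are made up, not anything from the article):

    # Load shedding with a bounded queue: excess requests get an immediate
    # error instead of queueing forever.
    import queue, threading, time

    requests = queue.Queue(maxsize=1000)    # hard cap on queue depth

    def handle(req):
        time.sleep(0.01)                    # pretend each request is ~10 ms of work

    def worker():
        while True:
            handle(requests.get())
            requests.task_done()

    threading.Thread(target=worker, daemon=True).start()

    def submit(req):
        try:
            requests.put_nowait(req)        # accept only if there is room
            return "accepted"
        except queue.Full:
            return "rejected"               # fast, explicit failure

The key property is that the error path is cheap: a rejected request costs microseconds instead of adding to everyone else's latency.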


So we should solve the problem at the port of Long Beach by randomly dropping cargo into the ocean until we no longer get container ships piling up off the coast?


Or build sufficient back pressure mechanisms so that the cargo doesn't get loaded in the first place, or better still the goods don't get manufactured, and so on.

Send the feedback as far upstream toward the source as possible.


The problem there is signal lag. When the backlog at the port is cleared and you signal the factory to start production again, it might take 3 months for the products to start arriving at the port again. If the port gets congested again and you stop production it might take a month for the goods already made and on their way to stop arriving.

But you can also redirect cargo to different ports, or increase the price you charge for servicing cargo. Even if you dump containers, you don't have to do so randomly. Customers could even mark containers as rejectable or non-critical, and therefore first to go to a backlog, in exchange for a discount.

I found the article really interesting. For most IT use cases jobs are generic but thanks for raising the port example.


The signal lag is a huge problem. When you go a few stages through a supply chain, a gradual variation in consumer demand turns into huge whiplashes in demand. This is called "the bullwhip effect" and is a huge problem in supply chain management.

See https://www.cips.org/knowledge/procurement-topics-and-skills... for more.


Right, and now you're using the argument of the top level comment.

Look at the port. Is ingress consistently larger than egress? Then limit ingress.

Now look at the ocean. Is ingress consistently larger than egress? Then limit ingress.

Now look at the exporting ports. Is ingress consistently larger than egress? Limit ingress.

Now look at trucks going to the port. Is ingress consistently larger than egress? Limit ingress.

Now look at the factories that load trucks. Is ingress consistently larger than egress? Limit ingress.

----

Back pressure builds on looking at one queue and asking if arrivals come faster than departures go -- exactly what the top-level comment suggested.


AKA "there are no truly unbounded queues, only queues you do not know the bounds yet". Back Pressure is good (but also hard).


I know you mean it sarcastically, but I am guessing this could solve the problem… the throughput of the ports would increase and the demand shock would be ameliorated.


I am not talking about infinity. I am not talking about a scenario where you have a constant event stream, which will never vary. You are correct that if you have such a system, then at 101 events/second "you're dead".

I am talking about a bazillion other use cases in which there is variance in the number of events produced at any given time and where the immediate processing of those events is not of utmost importance. "Slap a queue in between" and suddenly you are no longer bound by what your systems can handle, you are only bound by what the queuing system can handle and that is usually much easier to scale (or can already handle this kind of load).

E.g. think about a system needing to synchronize with another external system. Back in the old days (and well I'm sure those systems are still out there today) synchronizing two systems may well have been an overnight job. A nightly sync job. Nowadays many such systems probably try near realtime synchronization but you don't want to make it part of the regular flow either and block the actual operation.

But sometimes there are too many synchronizations that need to happen at once, so a queue builds up over some period and goes down again a bit later. It's still a better user experience than a nightly job. E.g. your regular load might be 10 e/s. During a particularly busy time, or when specific bulk operations are executed in parallel by multiple people, you might see an influx of 200 e/s for a minute or so.
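
Back-of-envelope for that kind of burst (the capacity figure below is my own assumption, not the parent's):

    # Rough drain time for a burst; the capacity figure is assumed.
    steady_in  = 10      # events/s, normal load
    burst_in   = 200     # events/s during the burst
    burst_secs = 60
    capacity   = 50      # events/s the consumers can actually handle (assumed)

    backlog = (burst_in - capacity) * burst_secs     # 9,000 events piled up
    drain_s = backlog / (capacity - steady_in)       # ~225 s to work it back down
    print(f"peak backlog ~{backlog} events, drains in ~{drain_s:.0f} s after the burst")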


Assuming you have queues and the like that can hold the requests in the interim. Some systems will just start dropping the data, or crash and lose data.

If you have a queue or something in the middle then the failure is potentially recoverable or never happens in the first place.



I don’t think anyone was suggesting that they did. (Seems like a bit of a strawman to me)

However they do allow you to smooth or average spikes that could otherwise be problematic.

In your article's analogy, I am able to drain a pot of pasta down the sink even though I might not be able to pour the same pot down the drain.


It lowers the load if the consumer is able to process batches of requests efficiently (e.g. for vector operations).


The term here is literally queues, so yes, there's a queue in there. E.g. HornetQ, RabbitMQ, etc. I'm talking about distributed systems that are built with queuing in mind and that analyze the specific use case to understand when it is OK to have a queue build up and when it isn't.


That is absolutely not what queuing theory says. In equally simplified terms, queuing theory says that for a system receiving 99 events per second on average, if you size it for 100 it will probably still fail, because 99 is just the mean: when arrivals overshoot the mean, you don't have the spare capacity to work off the backlog during the undershoot periods, so events pile up.

For a numerical example of queuing theory 101, and why variance is very important rather than something that "scares people away", see [0]. Flow tracking is not enough.

Also this discussion is relevant [0]: https://www.johndcook.com/blog/2008/10/21/what-happens-when-...
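
One way to see the pile-up without any formulas is a toy single-server simulation (a sketch, not from either link): arrivals average 99/s against 100/s of capacity, yet the mean wait is far from zero.

    # Toy single-server queue via the Lindley recursion. Arrivals average 99/s,
    # service averages 100/s, i.e. utilisation 0.99 -- yet the mean wait is large.
    import random

    random.seed(1)
    lam, mu = 99.0, 100.0                 # mean arrival and service rates (per second)
    wait, total_wait, n = 0.0, 0.0, 1_000_000

    for _ in range(n):
        total_wait += wait
        service = random.expovariate(mu)          # this customer's service time
        gap = random.expovariate(lam)             # time until the next arrival
        wait = max(0.0, wait + service - gap)     # Lindley recursion

    print("mean wait (s):", total_wait / n)       # theory: lam/(mu*(mu-lam)) = 0.99 s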


It is my example, I can set the variance to whatever I like. It is a D/D/1 queue in case that isn't abundantly clear.

Which, in practice, is a model of queuing systems that often gets most of the value of queue theory and is easy to explain without jargon. Then apply an 80% fudge factor. Unless the situation needs care.


I don’t know anything about queue theory, but I’m not sure how your point differs from GP? You both said that a system designed for 100 and being run at 99 falls apart at 101?

I didn’t understand GPs comment about flows - perhaps that’s where the dispute is.

Genuine question, looking forward to learning more :)


I am saying that the system should be expected to fail its SLA even at 99.

In the link I shared there is a simple worked example of 10.3 capacity/hr for 10 events/hr that piles up 28 events. It also shows the problem can be mitigated by analysing it in queuing/probabilistic terms. In real life it is even more complicated obviously!
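
For what it's worth, under Poisson arrivals and exponential service (M/M/1, which may not be the exact model in the linked example, so the figure can differ from 28), the standard formulas make the pile-up easy to compute:

    # M/M/1 steady state for 10 events/hr of arrivals against 10.3/hr of capacity.
    lam, mu = 10.0, 10.3
    rho = lam / mu                 # utilisation, ~0.971
    L   = rho / (1 - rho)          # average number in the system, ~33
    Lq  = rho**2 / (1 - rho)       # average number waiting, ~32
    Wq  = Lq / lam                 # average wait in hours (Little's law), ~3.2
    print(rho, L, Lq, Wq)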


> The battle in an organization is convincing people to look at flow at all.

I guess that's the point... an organization needs enough people to know about queue theory "to look at flow" and understand that queues are ubiquitous and anticipate that these queues experience a latency phase change at some point depending on utilization (and job duration variance). If an organization doesn't have anyone who understands queue theory, then it's probably more likely to have queue-related failures.


Thanks! I took away two points.

In order to look at flow, one "should learn about queue theory"!

And "queue theory" itself is useful for designing queuing-related systems.


Another random software engineer claiming a fundamental theoretical problem is irrelevant...

Queuing theory deals with the stochastic behavior of discrete events at a single queue, and possibly multiple queues connected in some structured manner.

Its importance is that it can determine beforehand what the stochastic boundaries of the system are.

Your whole idea of 99 events and 101 events is such a strawman that I don't know where to start poking holes in it...

For one thing, the rate of something has a precise definition; there is jitter and burstiness. To say something simply has a rate is plainly an impractical concept, as there is nothing in the network that has a steady rate...

Edit: I stand corrected; the parent does not really miss the meaning of queuing theory. It was just that I had only read the first paragraph.


But things do have an average rate. And in my experience, this is very practical information. Almost daily I plug mean rates into Little's law and get useful numbers out. (Yes, I do verify my predictions.)
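
For anyone unfamiliar, Little's law is just L = λW (average items in the system = arrival rate × average time in the system), and any one of the three falls out of the other two. E.g. with made-up numbers:

    # Little's law: L = lambda * W.
    arrival_rate = 250            # requests/s entering the service
    avg_in_flight = 40            # average number of requests in the system

    avg_time_in_system = avg_in_flight / arrival_rate     # W = L / lambda = 0.16 s
    print(f"average time in system ~{avg_time_in_system * 1000:.0f} ms")

    # Or backwards: a backlog of 12,000 items drained at 50/s implies a 240 s wait.
    print("drain time (s):", 12_000 / 50)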


> Little's law

I guess that's the OP's point: you can't escape from talking about queues, which you sort of validated by bringing up Little's law.

Queues are literally everywhere around us in the physical world. They need to be understood at a basic level -- Little's law is good enough and only requires primary school math. If execs aren't able to understand such a ubiquitous and simple concept then I don't know what to say.

Average flow rate is indeed useful, but please explain it along with a queue and show whoever needs to know how queues can grow unboundedly if the output rate is consistently less than the input rate. And if needed, explain further the upstream consequences of unbounded queue growth.


> The underlying insight is that rates matter. If a system can handle 100 events a second, then 99 events per second everything is fine and 101 events per second the system is down.

Not really. You need actual theory to inform you whether 99 events per second is okay. See https://www.johndcook.com/blog/2008/10/21/what-happens-when-...


If you want to maintain 99.99989% uptime or are about to spend $10 million, yes. If you want the cheap option, then measure (arrival rate / service rate) and try to keep that below 80%. Then if that doesn't work, adjust accordingly. That isn't really queue theory because it doesn't consider variance and requires effectively no knowledge of the literature, what rho means, or why people use M/M/1 queues as a default. You don't even have to spell Poisson. But it will usually get you to where you want to be.
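
In code that cheap option is only a few lines: measure both rates over the same window and alarm when the ratio creeps past ~0.8 (metric names and figures are mine):

    # Cheap capacity check: rho = arrival rate / max service rate, alert above 0.8.
    def utilisation(arrivals_per_s, max_completions_per_s):
        return arrivals_per_s / max_completions_per_s

    rho = utilisation(arrivals_per_s=105.0, max_completions_per_s=120.0)
    if rho > 0.8:
        print(f"WARN: utilisation {rho:.0%}; shed load or add capacity")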

Most of the time feeling it out in production is a lot cheaper than keeping an in house specialist. If it isn't then sure, maintain a specialist to monitor the situation. It is a relatively rare role though.


There's another advantage to feeling it out in production - you're measuring the actual system, not your theoretical model of the system, so all the extra things you've forgotten are automatically included (assuming you measure the right things, of course!)

And keeping your safety margins nice and wide isn't just inefficiency. It's also building in resilience for when something unexpected does happen. Yes, if you need your six-nines performance on a 1% margin then maybe you should try to optimize this to the nth degree, but realistically almost none of us have to sail that close to the wind.


> The battle in an organisation is convincing people to look at flows at all. In my experience people love metrics that track stock (we have X widgets or can handle Y orders) and not flows (we built A widgets and sold B widgets).

This reminds me of an entrepreneurship class I took in college where the professor emphasized the importance of cash flow, something I hadn't considered much and found somewhat counterintuitive.

I'd love to learn more about this perspective on when to focus on the flows over stocks, are there any resources you'd recommend to help me better understand it?


Queue theory is a lot of fun. There aren't a lot of variables needed to characterise a queue (rate in, rate out, and the variance of both rates, encoded into a probability distribution). But there are a lot of observable variables (queue length, wait times, utilisation of the servers, probability of there being a queue when a new customer enters the system, probability of the queue overflowing, etc). That means the system is highly over-determined: you need only a very small number of observed variables to be able to calculate everything.

Basically all you are doing when you "study queue theory" is training your brain to go very quickly from looking at some random metric to thinking "oh, this is a queue with X distribution, I need to work out the arrival rate and service rate then I know everything".

My personal reference is "Fundamentals of Queueing Theory" by Gross, Shortle, Thompson & Harris. But to learn, just pick some interesting queues (i.e. queues with variance in the rates) and start calculating (if cars arrive at a light with X mean rate, what variance means that there will sometimes be a queue extending around the street? Would adding another server at this shop make a big or small difference to how long it takes to prepare my order? Based on the number of checkouts at this supermarket, what rate are their planners expecting - is this a busy period?).

No individual idea in queue theory is that powerful, the advantage is really that when you see a queue you don't waste hours trying to link variables in O(n^2) ways and instead work with arrival/service rates. And you realise that the system works by phase changes rather than anything else, which is in a sense obvious but not actually very intuitive if you haven't identified the queue.
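
As a concrete exercise of the "would another server help?" kind, the M/M/c mean wait via the Erlang C formula fits in a few lines (the numbers roughly follow the single-teller example in the John D. Cook post linked elsewhere in the thread, if I recall it correctly):

    # Mean queueing delay for an M/M/c system via the Erlang C formula.
    from math import factorial

    def mmc_wait(lam, mu, c):
        a = lam / mu                                   # offered load
        rho = a / c                                    # per-server utilisation, must be < 1
        erlang_c = (a**c / (factorial(c) * (1 - rho))) / (
            sum(a**k / factorial(k) for k in range(c))
            + a**c / (factorial(c) * (1 - rho)))       # probability an arrival must wait
        return erlang_c / (c * mu - lam)               # mean time spent waiting

    lam, mu = 5.8, 6.0                                  # ~5.8 customers/hr, 10-minute service
    print("1 server :", mmc_wait(lam, mu, 1), "hours")  # ~4.8 hours of waiting
    print("2 servers:", mmc_wait(lam, mu, 2), "hours")  # ~0.05 hours (~3 minutes)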


I'll check out that resource and re-read your post a few times. Thank you very much for replying.


The main question posed to business is the following:

Do you want to provision your system for the worst-case scenario volume, and what is the worst-case scenario you are willing to pay to insure your system against?

It's an easy question, but one I don't see how queueing theory can provide answers to. It is at its root a subjective business decision.


There are secondary non-obvious effects of caching and queueing.

Cold start is an important scenario that lots of people overlook - "we won't restart it during busy times", etc. However, if you have a bug or an outage, that might be precisely when you restart. During a cold start you will often find your system behaves as it does at the tails of the performance curve, either because you are getting 100% cache misses, or because all your queues are full due to a client pile-on or queued upstream work, or some combination.

The article described back pressure; I can’t emphasize that enough as part of any resilient queuing system. Chains of queues are only as performant as their weakest link.


Queueing theory-- my notes for programmers:

https://github.com/joelparkerhenderson/queueing-theory

Queueing theory is the mathematical study of waiting lines, or queues. We use queueing theory in our software development, for purposes such as project management kanban boards, inter-process communication message queues, and devops continuous deployment pipelines.


The notes don’t explain what problems queue theory solves, i.e. why would I want to use it, as a programmer. The Wikipedia article’s second sentence is a bit more helpful in that respect: "A queueing model is constructed so that queue lengths and waiting time can be predicted."


We typically track many things about the activities in the queue, and we want to summarize the results by choosing a shortlist of the most relevant ones for our projects.

    Dμ = Delivery service rate. Devops teams may say "deployment frequency" or "we ship X times per day".

    Rτ = Restore lead time. Site reliability engineers may say "time to restore service" or "mean time to restore (MTTR)".


All kinds of queue analysis are also done extensively in Six Sigma practice.


TheNewStack needs more articles like this one and fewer articles that talk about some new feature where, by adding an extra five lines of YAML, you get something or other in your Kubernetes.


I worked at IBM in earlier days, and a couple of the Research people in the data storage group at Almaden and Poughkeepsie gave some great lectures and made excellent tools for modeling cache disk storage array performance that were based heavily on queue theory. This brings back great memories, thanks!


I'm being pedantic but,

>Every distribution can be described as a curve. The width of this curve is known as its variance.

Not every distribution has a variance. Some notable examples include the Cauchy distribution (or Lorentz lineshape in physics),

https://en.m.wikipedia.org/wiki/Cauchy_distribution

the power law (Pareto) distribution for exponents less than or equal to 2,

https://en.m.wikipedia.org/wiki/Pareto_distribution

and the Levy distribution,

https://en.m.wikipedia.org/wiki/Lévy_distribution

These are not mathematical curiosities but actually describe real physical systems.

Additionally, the variance works as a "width" only for unimodal distributions. Any variance based metrics or analysis should only be used after checking for multimodality and, for multimodal data, with extreme caution thereafter.


A frustrating headline. Where else would it happen? The middle? Crystals fracture on their faces. Things tend to break along their boundaries.


I feel like you are understanding the title as I first understood it, meaning like, "the external facing portions of infrastructure". However, reading the article, it seems clear that he's referring to the edges of a distribution curve (i.e. Infrequent events that impact experience nonetheless).

From the article: "It’s tempting to focus on the peak of the curve. That’s where most of the results are. But the edges are where the action is. Events out on the tails may happen less frequently, but they still happen. In digital systems, where billions of events take place in a matter of seconds, one-in-a-million occurrences happen all the time. And they have an outsize impact on user experience."

EDIT: IMO, the title is still a little annoying in this respect. I think everyone would agree if a request to your site fails 5% of the time, that is unacceptable, even though it "usually works." The discussion of the distribution curve simply to make the point that spikes in usage cause backed up queues which impact performance isn't necessarily helpful as far as I can tell, and it seems done largely in service to the title. In my mind while reading this, I'm thinking, "Okay, cool, but how does the fact that this interesting issue exists at the edge of the curve help me identify it?" Answer: It doesn't. If you see errors occurring, you will investigate them once they are noticed. Being at the edge of the curve may mean it takes longer to notice, but like, what kind of alerting system are you using that discriminates against rare issues in favor of common ones?

Discussing queues, over provisioning, back pressure, etc. are all super interesting and helpful.


Bitten by queues in every team I was on at AWS:

> As a rule of thumb, target utilization below 75%

This is one good reason to rely on serverless. Simply outsource the problem to a system that knows how to handle this better.

> Steer slower workloads to paths with lower utilization

This is fraught with all sorts of perils [0], so take caution going down this valid but tricky route.

> Limit variance as much as possible when utilization is high

This is key, and one of the most elegant solutions to this problem I know of comes from a Facebook talk on CoDel + Adaptive LIFO. [1]

> Implement backpressure in systems where it is not built-in

Making downstream dependencies behave is never an option. Selectively isolating noisy neighbours, if possible, from other well behaved clients, tends to work well for multi-tenant systems [2], in addition to monitoring long work (by dropping work that takes forever) [3], or better yet, doing constant amount of work [4] (aka eliminating modes) [5].

> Use throttling and load shedding to reduce pressure on downstream queues.

One way is to impl admission control (ala Token Bucket) to minimise the impact of thundering herds. Though, clients do not take kindly to being throttled. [6]
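
For reference, a token-bucket admission check is only a handful of lines (a generic sketch, not any particular production implementation):

    # Minimal token-bucket admission control: refill at `rate` tokens/s up to
    # `burst`; admit a request only if a whole token is available.
    import time

    class TokenBucket:
        def __init__(self, rate, burst):
            self.rate, self.burst = rate, burst
            self.tokens = burst
            self.last = time.monotonic()

        def allow(self):
            now = time.monotonic()
            self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False          # caller should throttle or shed this request

    bucket = TokenBucket(rate=100.0, burst=200.0)   # 100 req/s sustained, bursts of 200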

A great article; many have been written at this point. Very many still have been burned by queues. Though, in my experience, without exception, some component somewhere was always building up that backlog! [7][8][9]

[0] Interns with Toasters: How I taught people about Load Balancers, https://news.ycombinator.com/item?id=16894946

[1] Fail at scale: Controlling queue delays, https://blog.acolyer.org/2015/11/19/fail-at-scale-controllin...

[2] Workload isolation, https://aws.amazon.com/builders-library/workload-isolation-u...

[3] Avoiding insurmountable queue backlogs, https://aws.amazon.com/builders-library/avoiding-insurmounta...

[4] Constant work, https://aws.amazon.com/builders-library/reliability-and-cons...

[5] Cache, modes, and unstable systems, https://news.ycombinator.com/item?id=28344561

[6] Fairness in multi-tenant systems, https://aws.amazon.com/builders-library/fairness-in-multi-te...

[7] Using load shedding to avoid overload, https://aws.amazon.com/builders-library/using-load-shedding-...

[8] Treadmill: Precise load testing, https://research.fb.com/publications/treadmill-attributing-t...

[9] Cascading failures, https://sre.google/sre-book/addressing-cascading-failures/


I would love to read more on queue theory + web applications - is there anything out there worth reading for a more in-depth understanding (applied to applications), but with key takeaways/rules of thumb presented like in this article?


> The TCP protocol, for instance, generates backpressure with code 503

Surely they mean 'HTTP protocol'? TCP doesn't do back pressure but IP does at the router level, IIRC.


Throughout my career, the one thing I've noticed engineers/developers consistently have trouble modeling in their heads is queuing.


They absolutely love adding them to previously functional systems even so!


Don’t canceled/rejected requests have infinite latency by definition?


For the one request that is cancelled/rejected, but not for every other request behind them in a queue.


Seems like this guy is about to discover six sigma...


> The TCP protocol, for instance, generates backpressure with code 503

Doesn't TCP typically use sliding window flow control? I'm not sure what "code" is referring to in this context.


I think this might be an error in the text; HTTP status code 503 can be used for backpressure, which I think is what is being referred to.

But yes, TCP uses windows for back-pressure. That isn't really useful for application-level backpressure though, as the OS controls the queue sizes, so most systems have their own backpressure on top.


It's not that useless. A lot of applications these days are really using HTTP. And in HTTP/1, if you read a response body from the socket (or write one), you are really already using the backpressure from TCP instead of having anything on top of it. With HTTP/2 and /3 that's different due to multiplexing of multiple streams on the same socket - in that case there are additional flow control windows on top of TCP. These actually make those protocols rather hard to implement correctly.


It's useful still! If you have an application that talks to another application on the other side of the world, you get to control how fast it sends you stuff with your reading speed. If you stop reading from the socket, the sending application will stop sending before long. It seems like spooky action at a distance but it's just backpressure propagating through many layers.
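
You can watch that happen on a loopback connection: if the receiver never calls recv(), the sender's send() stalls once the kernel buffers fill (a rough sketch; the exact byte count depends on OS buffer sizes):

    # TCP back pressure on loopback: the receiver accepts but never reads, so
    # once the kernel send/receive buffers fill, the sender's send() stalls.
    import socket, threading

    server = socket.socket()
    server.bind(("127.0.0.1", 0))
    server.listen(1)

    def receiver():
        conn, _ = server.accept()
        threading.Event().wait()            # hold the connection open, never recv()

    threading.Thread(target=receiver, daemon=True).start()

    client = socket.create_connection(server.getsockname())
    client.settimeout(2.0)                  # turn the eventual stall into a visible timeout
    sent = 0
    try:
        while True:
            sent += client.send(b"x" * 65536)
    except socket.timeout:
        print(f"send() stalled after ~{sent} bytes: the kernel is pushing back")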


That seems like a typo or confusion and they actually mean HTTP.



