Using load shedding to avoid overload (amazon.com)
243 points by kristianpaul on Oct 10, 2021 | 45 comments



The article does mention prioritization, but doesn't mention my favorite pattern with this. A priority queue that favors end users "farther down the process" is nice for load shedding.

Like, for an ecommerce site, being able to prioritize users in the checkout path first, not-in-checkout but non-empty cart second, etc.

Or prioritizing features in a similar way. Turning off, for example, "people that bought this, bought that", under load.
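
As a minimal sketch (the stage names, weights, and admission interface here are all hypothetical), the queue could shed the lowest-priority work first when it fills up:

    import heapq
    import itertools

    # Hypothetical priority order: lower number = served first under load.
    STAGE_PRIORITY = {"checkout": 0, "cart": 1, "browse": 2}

    class PriorityAdmission:
        """Bounded admission queue that sheds the lowest-priority request when full."""

        def __init__(self, capacity):
            self.capacity = capacity
            self.heap = []                  # entries are (priority, seq, request)
            self.seq = itertools.count()    # tie-breaker so equal priorities stay FIFO

        def offer(self, request, stage):
            prio = STAGE_PRIORITY.get(stage, max(STAGE_PRIORITY.values()))
            heapq.heappush(self.heap, (prio, next(self.seq), request))
            if len(self.heap) > self.capacity:
                # Evict the worst (highest-numbered) priority, not the newest arrival.
                worst = max(range(len(self.heap)), key=lambda i: self.heap[i][0])
                shed = self.heap.pop(worst)
                heapq.heapify(self.heap)
                return shed[2]              # caller rejects this one (e.g. HTTP 503)
            return None

        def next_request(self):
            return heapq.heappop(self.heap)[2] if self.heap else None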


> doesn't mention my favorite pattern with this. A priority queue that favors end users "farther down the process" is nice for load shedding.

The (fantastic) article suggests precisely this:

> let's say a service has two APIs: start() and end(). In order to finish their work, clients need to be able to call both APIs. In this case, the service should prioritize end() requests over start() requests


This is a great way to describe it! I gave a similar example of pagination and how the later pages might be better to prioritize over initial pagination requests, but your example is a nicer illustration. Thanks for that!

Someone I was talking to after writing the article also said they can fall back to statically rendered versions of certain pages on Amazon.com during overload. The trick is to have a fallback page that is still useful!

And for the “turning off features” idea - this happens today on Amazon.com. If a feature on the site fails to render successfully or on time, it’s left off of the page. Critical functionality can be left off, so it’s a judgement call on what’s allowed to fail the page render.


Ah, yes, you're right...I missed the pagination example fitting that pattern.

"If a feature on the site fails to render successfully or on time, it’s left off of the page. Critical functionality can be left off, so it’s a judgement call on what’s allowed to fail the page render."

Oh, that's useful too, but I meant a step farther, where the page doesn't even ask for those widgets if (load > X). That avoids making the call at all.
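
A rough sketch of that idea (the load metric and threshold are made up, and render_widget / render_recommendations stand in for hypothetical page components):

    import os

    LOAD_SHED_THRESHOLD = 0.8   # hypothetical fraction of capacity in use

    def current_load():
        # Placeholder signal: could be CPU load, in-flight requests, or queue depth.
        return os.getloadavg()[0] / os.cpu_count()

    def render_page(render_widget, render_recommendations):
        parts = [render_widget("search"), render_widget("product_details")]
        # Under load, the optional widget is never requested in the first place,
        # so the downstream service sees no call at all.
        if current_load() < LOAD_SHED_THRESHOLD:
            parts.append(render_recommendations())
        return "\n".join(parts)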


Good point around avoiding the call in the first place. This is a very tricky topic, I’ve found. Things that try to guess the nuanced health of a dependency can lead to outages when they guess wrong. These circuit breakers are helpful if they’re right, but harmful if they’re wrong.

For example, say a service is backed by a partitioned cache cluster, where the data is hashed to a particular cache node. Now let’s say one node has a problem, causing requests to data that lives on that node to fail, but others to succeed. If a client is making requests for data that happens to live on all nodes (the client doesn’t know about these nodes by the way, it’s just an implementation detail of the service) and sees an increased error rate, should it start failing some requests? It could take a single partition outage and increase the scope of impact into a full outage.
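
A toy illustration of that amplification, assuming 8 partitions, one unhealthy node, and a hypothetical client-side breaker that trips on the aggregate error rate:

    import random

    PARTITIONS = 8
    TRIP_THRESHOLD = 0.10   # made-up: breaker opens above 10% errors

    def simulate(requests=100_000):
        errors = 0
        for _ in range(requests):
            partition = random.randrange(PARTITIONS)   # keys hashed across cache nodes
            if partition == 0:                         # the one unhealthy node
                errors += 1
        error_rate = errors / requests
        # ~12.5% of requests fail, the breaker opens, and the client now rejects
        # *all* requests: a single-partition problem became a full outage.
        print(f"error rate {error_rate:.1%}, breaker open: {error_rate > TRIP_THRESHOLD}")

    simulate()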

Anyway I’ve been meaning to write an Amazon Builders’ Library article on this topic, or to convince someone else to do it (looking at you, Marc Brooker!)


Graceful degradation, i.e. turning off optional features and making the critical path (the order completion funnel) as lightweight as possible by reducing costs (number of results fetched/scored, turning off reco, etc.), was the best way to scale systems that saw 3x higher peaks year over year, and 10x higher peaks during flash sales compared to normal periods, especially without breaking the bank trying to provision that much capacity.

p.s. - btw, no cloud has that much elastic capacity.


I saw this comment and thought this exactly described what my $previous_company did. Then, I saw the username :-)


> turning off reco

What is reco?


Recommendations.


> A priority queue that favors end users "farther down the process" is nice for load shedding.

Another pattern from the networking world, one which Facebook wrote about, is CoDel aka Controlled Delay (switches tasks to lower timeouts when the work-queue begins to build-up) combined with adaptive LIFO (processes last-in tasks first when under load, but FIFO otherwise): https://queue.acm.org/detail.cfm?id=2839461
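
A rough sketch of how those two ideas compose (the interval and timeout values below are made up, not Facebook's; requests are assumed to be objects we can tag with a deadline):

    import time
    from collections import deque

    INTERVAL = 0.100         # how recently the queue must have been empty
    NORMAL_TIMEOUT = 1.0     # generous queue timeout while keeping up
    SHORT_TIMEOUT = 0.020    # aggressive timeout once the queue builds up

    class AdaptiveQueue:
        def __init__(self):
            self.items = deque()
            self.last_empty = time.monotonic()

        def overloaded(self):
            # CoDel-style signal: the queue has not drained within the interval.
            return time.monotonic() - self.last_empty > INTERVAL

        def put(self, request):
            budget = SHORT_TIMEOUT if self.overloaded() else NORMAL_TIMEOUT
            request.deadline = time.monotonic() + budget
            self.items.append(request)

        def get(self):
            while self.items:
                # Adaptive LIFO: newest-first under load, oldest-first otherwise.
                request = self.items.pop() if self.overloaded() else self.items.popleft()
                if time.monotonic() <= request.deadline:
                    return request           # still worth serving
                # otherwise it expired while queued; dropping it is the shedding
            self.last_empty = time.monotonic()
            return None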


This makes me think about APIs that face a wide variety of queries (probably not e-commerce, where patterns are more predictable, but think internal warehouse). Some queries may be more "expensive" than others (e.g. they touch a big percentage of the underlying data, using criteria where a database index is not feasible). What if, instead of letting all requests hit the API directly, we pass them through a grader: "good" queries (say, ones using indexed columns) get a higher grade, and requests go into a priority queue sorted by grade and then descending time (so if grades are the same, newer queries get processed first)? That becomes the work queue. Interested to hear what people think about it, or whether someone is using something similar and what their experience has been.
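
For concreteness, that scheme might look something like this (the grading function and its criteria are entirely hypothetical):

    import heapq
    import itertools

    def grade(query):
        # Hypothetical grader: index-friendly queries score higher.
        return 1 if query.uses_indexed_columns else 0

    class GradedQueue:
        def __init__(self):
            self.heap = []
            self.seq = itertools.count()

        def submit(self, query):
            # heapq is a min-heap, so negate both keys: higher grades first,
            # and within a grade, more recently submitted queries first.
            heapq.heappush(self.heap, (-grade(query), -next(self.seq), query))

        def next_query(self):
            return heapq.heappop(self.heap)[2] if self.heap else None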


TBH, the specific approach sounds like a bad one. You can easily swamp a server with small requests, and the big ones, which are probably the important ones, will never run. It would also incentivize engineers to write smaller queries, while your older (unmaintained) ones would never run.

Same with deprioritizing queries that are older in the queue. You’d end up deprioritizing people who’ve already been waiting, so they’d likely try again instead of waiting.

Thus you end up with the feedback loop this article talks about, where you’re amplifying the problem instead of addressing it.


The generic goal I think you are describing is to maximize goodput.

Another simple approach is to serve queues (of like requests) in LIFO order.


Yes! LIFO is a fantastic improvement. This suggestion is buried in the article a bit. Maybe I should have elevated it, or maybe broken it into more pieces. There’s so much to talk about on this topic. But yeah, LIFO is totally “This one weird trick that will make your service bulletproof to overload! Chaos Monkeys hate it!”


I'm really confused about how and why this works. Why is it a good idea to keep really old requests unhandled? I feel like I must be missing something obvious.


A queue distributes the latency increase to all requests whereas a stack only increases the latency for some requests when you're overloaded.

This means that even once you catch back up to the incoming rate of new requests, a queue keeps every request slow, because the time spent waiting behind the backlog is added to each one.

A stack in that same steady state, on the other hand, gives the same few buried items worse and worse latency, while all the new items go back to normal.

Chances are the long-latency requests will be retried, and there's a roughly fixed number of items to be retried, so you don't have to worry too much about the ones stuck in the stack. For a queue, the retries lengthen the queue, and that waiting time is added to every item, making them more likely to retry too, lengthening the queue further, and so on.

Having the stack leak items at the bottom gets the same benefit - you don't have to fulfil them - but if you can get the items back out of the stack quickly, it's still worth working on and completing them before the client needs to send a retry. The more of them you manage to complete in time, the more of your 9s look like your p50 rather than your p100.
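
A tiny simulation of that effect (the arrival rate, service rate, and timeout are made up): offer more work than the server can finish, let clients give up after one second, and count how many responses came back in time.

    from collections import deque

    def simulate(lifo, seconds=60, arrival_rate=120, service_rate=100, timeout=1.0):
        # Deliberately overloaded: 120 req/s offered against 100 req/s of capacity.
        queue, good, now, step = deque(), 0, 0.0, 1.0 / service_rate
        backlog = 0.0
        for _ in range(int(seconds * service_rate)):
            now += step
            backlog += arrival_rate * step
            while backlog >= 1.0:                      # new requests join the queue
                queue.append(now)
                backlog -= 1.0
            if queue:
                enqueued_at = queue.pop() if lifo else queue.popleft()
                if now - enqueued_at <= timeout:       # the client was still waiting
                    good += 1
        return good

    print("FIFO goodput:", simulate(lifo=False))
    print("LIFO goodput:", simulate(lifo=True))

With FIFO, goodput collapses once the backlog delay passes the timeout; with LIFO, almost every request actually served is fresh, and only the buried ones are sacrificed.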


Consider the two steady states:

You are keeping up and the queue is mostly empty; the order does not matter.

You are not keeping up and the queue is growing. If nothing changes, the age of items removed from the front will grow and grow, and eventually they will all have timed out or been abandoned.


Oh, I think I see. With a queue the failure state is that everything times out. With a stack you risk sacrificing only the oldest requests and keep the newest alive.


And you end up not servicing some random sample that get buried in the stack before you can get to them.


It guarantees that in an overload situation the requests that do get handled are handled quickly. In the same situation a FIFO would grow until all requests are really slow or time out, without increasing throughput. The reason to keep old requests in the LIFO instead of dropping them right away is that they can still be served when load drops, just in case there's someone still waiting for the page to load.


I suppose the older requests are less likely to have actual people doing a page reload/retry that stacks up yet more demand.


> Goodput is the subset of the throughput that is handled without errors and with low enough latency for the client to make use of the response.

"Back in my day..." we used HPS (hits per second) and CPU load as the "goodput". A 'hit' was only recorded via an access log at the end of a successful request<->response. Of course this doesn't measure whether the latency of the responses is unacceptable, just that there was a completed response... so that's where CPU Load came in. If you're doing a lot of HPS and load gets too high, you know latency is just going to get worse to the point of unavailability, so you start load shedding.

In our less-advanced old-school practice, load shedding was merely lowering the maximum requests per second option of the HTTP servers, waiting, and lowering again, until the load recovered. Our tools could reconfigure the server's settings without restarting, but required issuing an admin command to the server and "waiting in line" while the server processed all the other CPU requests under load, which meant the problem might continue for longer than we'd like.
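
In today's terms that control loop might look roughly like this (the thresholds are made up, and get_max_rps / set_max_rps stand in for whatever admin command reconfigures the HTTP server):

    import os
    import time

    HIGH_LOAD = 0.85    # hypothetical per-core load where shedding starts
    STEP = 0.9          # lower the request ceiling by 10% at a time

    def shed_until_recovered(get_max_rps, set_max_rps, interval=30):
        while True:
            load = os.getloadavg()[0] / os.cpu_count()
            if load > HIGH_LOAD:
                # Lower the server's max requests/second, then wait to see
                # whether load recovers before lowering again.
                set_max_rps(int(get_max_rps() * STEP))
            time.sleep(interval)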

To get around "waiting in line", you would use a transparent network filter (either a proxy, or IPTables) to load-shed. This can be called a "middlebox" solution. But sometimes termination of connections isn't feasible without "dirty" terminations that might cause bugs in clients or servers... and that's [one of the reasons] why people hate middleboxes. But they're great when they work!

You would also probably use Apdex today instead of CPU Load. If your API has contractual minimum latency guarantees, you'd skip Apdex and just use latency itself.


These remain great techniques! Even iptables like you mention - it’s extremely good at cheaply shedding new handshakes, vs later on in processing the request. You lose a little visibility, but it’s a powerful outer “layer of the onion”.

And good callout on middle boxes. Even high level abstraction ones like Amazon API Gateway. In fact this is my favorite feature of it. API Gateway can reject a very high rate of excess traffic for a small overloaded service behind it.


I think the blog post about load shedding by Netflix has some good examples.

"Keeping Netflix Reliable Using Prioritized Load Shedding" https://netflixtechblog.com/keeping-netflix-reliable-using-p...


I also like "Performance Under Load" by Eran Landau, William Thurston and Tim Bozarth (https://netflixtechblog.medium.com/performance-under-load-3e...) very much.


It is also a concept on electricity grid management: https://en.wikipedia.org/wiki/Rolling_blackout


Indeed. This is the first I've heard the term used outside electric grid management.


Load shedding as a technique makes sense, but to me it never seems to be the correct tool to reach for. Of course you need some basic load shedding to handle ddos type scenarios, but anything beyond that seems like overkill.

At (work) our services achieve high availability (99.99%) simply by being composed of reliable parts. Most effort is focused here and it seems to have a good payoff ratio.

When we do try to add anything like graceful degradation it seems to strongly hurt code readability/ease of understanding because it requires additional code branches and either a fair bit of boilerplate or extra abstractions.

This could be a result of the scale we are at or the maturity of our tools, but I'm interested in whether anyone else has hands-on experience.


Related concept in a recent HN thread about a paper describing how to use randomization to increase system resilience: https://news.ycombinator.com/item?id=28702821


Most real-world services have dependencies, and load tests don't accurately capture the fact that those dependencies have other users, or that a change in a dependency's performance will dramatically change the performance of the service being tested.

That pretty much means you can't have hard coded '100 requests per second per instance'.

Instead, I suspect the future of load shedding is automatic maximum "goodput" tracking. For example, the load balancer can alternate between 90 and 100 parallel requests, and if more requests are completed with fewer parallel requests, start alternating between 80 and 90 to figure out which of those is better...
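
A sketch of that probing idea (run_with_limit is a stand-in for "serve traffic at this concurrency for a measurement window and report completed requests"; real adaptive concurrency limiters are more careful about noise):

    def tune_concurrency(run_with_limit, start=100, step=10, floor=10):
        """Crude goodput hill-climb: keep walking toward whichever of two
        adjacent concurrency limits completes more requests."""
        high = start
        while high - step >= floor:
            low = high - step
            if run_with_limit(low) >= run_with_limit(high):
                high = low          # fewer requests in flight did better: go lower
            else:
                break               # the higher limit wins: stay here
        return high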


This is a great article but leaves out a key idea: load shedding is really a key topic in cost and utilization. You need load shedding to be able to serve closer to your capacity red line. You can always buy your way out of overload. Load shedding is a feature that, if you have it, allows you to more comfortably dial back your resources and serve closer to the limit.


Agreed - utilization is an important consideration here. The capacity red line will still be there, but when load shedding is effective, the impact of crossing that red line is less. It’d be an error rate linearly proportional to the excess, rather than the service falling off a cliff. But for the services this is talking about, neither case is okay, so we put a ton of emphasis on auto scaling models to make sure we don’t get into that situation.

A key sort of “continuation” to this article is the one on fairness: https://aws.amazon.com/builders-library/fairness-in-multi-te... . This gets into the topic of utilization a bit more.

But you’re right - good load shedding gives a business a tool to make an easier trade off when it comes to capacity management. A slight error rate until autoscaling kicks in is an easier pill to swallow than a worse outage.


> You can always buy your way out of overload.

Not really. My service may depend on other services that I have no control over. Perhaps I have extra money to scale up my own service, but those other services may be owned by different teams or organizations entirely.


With money, you can in-house those services and scale them up if those "other" organizations won't scale up.


With infinite money, sure. Realistically you won’t have the budget. And even if it’s “in house”, it’s going to be some service run by another team or organization within your company that has their own priorities, roadmap, and deliverables. They’re not just going to scale up their service because you asked them. They basically don’t have to do anything for you at all. And even if they agree, everything has a cost: there’s the overhead of just sending an email or a Slack message, setting up a meeting, getting people ramped up on your use case, why you need them to scale, for how long, for what use cases, and so on. Everything has a cost, and money and budget are always constrained.


That's why I said "with money", not "with little money".



This is a master class. So many good tidbits.


Since we are using the power abstraction: why not, instead of load shedding, start up secondary (or even tertiary) machines for serving, which are slower but could handle a bunch of simultaneous requests for some clients?

That could at least postpone the load shedding, or handle localised surges in service demand due to user behaviour more effectively.


Sometimes there's limits to how high you can scale without overloading dependencies. For example, the database that the service accesses might only support n connections and the service is already using approximately n connections.


The article is in German for me, so I only read part of it before I got tired of trying to read German, but I think they mention as one of the first things that what you describe, preventing an overload in the first place, is their primary strategy.

The request dropping (less jargon-y and more descriptive than "load shedding", which refers to consuming more rather than less) only kicks in when that fails or isn't fast enough.



You can change language in the footer, waaay down.


Denying requests to avoid fulfilling requests


It’s a tricky topic because load shedding is a last resort that kicks in when there’s already a problem. So until auto scaling catches up or the issue is mitigated some other way, we try to make as many customers happy as possible, rather than making everyone equally unhappy.



